[ 
https://issues.apache.org/jira/browse/PARQUET-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749788#comment-15749788
 ] 

William Forson commented on PARQUET-799:
----------------------------------------

Hi Deepak,

Without any explicit synchronization, the segfault happens quite reliably, say 
~99% of the time. With synchronization, the error rate is _much_ lower, 
roughly 5%.

As to location: that varies quite a bit. In cases where the error has 
manifested in parquet-cpp code, the most common site seems to be 
{{apache::thrift::transport::TBufferBase::readAll}} (via 
{{parquet::format::PageHeader::read}}). But interestingly, the bug has 
manifested in a variety of unrelated-looking stack traces, some of which 
contain no parquet-cpp frames at all.

So, as much as I'd love the community's help with what has become a marathon 
debugging venture, I don't think it would be appropriate for me to get much 
more detailed about my segfault(s).

At this point, I am simply trying to confirm that a couple of dependencies (or 
rather, the way I am using them) at least _should_ be thread-safe, which has 
proved a bit tougher than I expected :)

(Finally: AFAICT, this is independent of any specific parquet type. I have 
been running my tests against a static data set and, as I said, when execution 
is serialized, the error goes away.)
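
The comparison described above (one file per thread, with and without serialization) can be sketched roughly as follows. This is a hypothetical harness, not the reporter's actual test code and not a parquet-cpp API: {{run_reads}} and the {{read_one}} callback are made-up names, and {{read_one}} is assumed to stand in for whatever opens and scans a single parquet file.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical harness: calls `read_one` once per path, one thread per file.
// `read_one` is a placeholder for the code that opens and scans one parquet
// file; it is not part of parquet-cpp.
// When `serialize` is true, a single mutex wraps every call, so no two reads
// overlap even though each still runs on its own thread.
// Returns the number of calls that completed.
inline std::size_t run_reads(
    const std::vector<std::string>& paths,
    const std::function<void(const std::string&)>& read_one,
    bool serialize) {
  std::mutex global_lock;
  std::atomic<std::size_t> completed{0};
  std::vector<std::thread> workers;
  workers.reserve(paths.size());

  for (const std::string& path : paths) {
    // Copy `path` into the thread; everything else outlives the join below.
    workers.emplace_back([&, path] {
      std::unique_lock<std::mutex> guard(global_lock, std::defer_lock);
      if (serialize) guard.lock();  // released when `guard` goes out of scope
      read_one(path);
      ++completed;
    });
  }
  for (std::thread& t : workers) t.join();
  return completed.load();
}
```

Running the same file list once with {{serialize = false}} and once with {{serialize = true}} separates "crashes only when reads overlap" from "crashes regardless". If the crash disappears only in the serialized run, that points at shared state somewhere beneath the per-file readers rather than at the data itself.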

> concurrent usage of the file reader API
> ---------------------------------------
>
>                 Key: PARQUET-799
>                 URL: https://issues.apache.org/jira/browse/PARQUET-799
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: William Forson
>
> I've recently been debugging a segfault that occurs when concurrently reading 
> (distinct) parquet files from multiple threads.
> I initially assumed this was a reasonable thing to do, since the project 
> README doesn't say anything about concurrency one way or the other. But then 
> I encountered [this TODO 
> comment|https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/page.h#L35]:
> {quote}
> // TODO: Parallel processing is not yet safe because of memory-ownership
> // semantics (the PageReader may or may not own the memory referenced by a
> // page)
> {quote}
> And it has got me wondering: is parquet-cpp fundamentally NOT thread-safe, 
> even for the use case of reading a single file per thread at any given time? 
> Or is it basically thread-safe with a couple gotchas?
> Also, jfyi, I'm currently running against a build which incorporates [this 
> change|https://github.com/apache/parquet-cpp/commit/002466539f6aba7bf1f885b66f61f302ed88fa6b].
> (aside: my motivation for recently posting an issue re. {{THRIFT_HOME}} was 
> to rule out any ABI weirdness that might result from building parquet-cpp 
> against a different version of thrift than the applications that ultimately 
> consume parquet-cpp)
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
