[
https://issues.apache.org/jira/browse/ARROW-12547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333519#comment-17333519
]
Jay Baywatch commented on ARROW-12547:
--------------------------------------
I can consistently reproduce this, but it seems like it's more of a NetApp-over-NFSv3
issue than an Arrow issue.
The background is that we have batch ETL jobs, running every 10 minutes, that
replace parquet files on a read-only NetApp volume. We write the parquet to a
temporary path and then rename it to the production path. NetApp should keep
the inode handles valid for processes that are reading when that happens.
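The write side looks roughly like the sketch below (stdlib only; in the real job the temp file is produced by pq.write_table, and the helper name is ours, not part of the pipeline):

```python
import os
import tempfile

def atomic_replace(path, data):
    """Write data to a temp file on the same filesystem, then rename it
    over the destination, so readers never observe a partially written
    file. Sketch of the write-then-rename pattern described above."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make sure the bytes hit disk first
        os.replace(tmp, path)      # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise
```

The rename itself is atomic on the server, which is why we expected readers holding the old inode to be safe.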
It seems that if a client is reading while a batch of updates comes in,
something goes wrong during the page-in and we get a bus error. The files are
all about 200 MB, and this only happens when memory mapping is enabled in
pq.read_table and there are multiple updates to files in the same directory,
even to files we are not reading.
It breaks here:
{code:python}
data = pq.read_table(file_name,
                     columns=columns,
                     use_pandas_metadata=False,
                     memory_map=True,
                     use_legacy_dataset=False)
{code}
{noformat}
Fatal Python error: Bus error

Thread 0x00007fc9d75cd700 (most recent call first):
  File "/home/baywatch/gitlab/xref_collider/pyenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1582 in read
  File "/home/baywatch/gitlab/xref_collider/pyenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1704 in read_table
  File "cache_writer.py", line 69 in read
  File "cache_writer.py", line 101 in write_test
  File "cache_writer.py", line 113 in <module>

Bus error (core dumped)
{noformat}
I am not sure if you want to close this issue or not; it does feel more like a
NetApp issue than an Arrow issue. Maybe mmap over NFS is just a bad idea in
general.
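One mitigation we are considering on our side (a hypothetical guard, not anything pyarrow does): before trusting a long-lived descriptor or mapping, compare the identity of the open file against whatever the path currently names, to detect that a rename swapped the inode out underneath us:

```python
import os

def was_replaced(fd, path):
    """Return True if the file at `path` is no longer the file that `fd`
    was opened on, i.e. an os.rename swapped a new inode in. Hypothetical
    helper for detecting stale mappings; not part of pyarrow."""
    opened = os.fstat(fd)    # identity of the file we opened/mapped
    current = os.stat(path)  # identity of whatever the path names now
    return (opened.st_ino, opened.st_dev) != (current.st_ino, current.st_dev)
```

This only detects the swap; it cannot make an already-faulted mapping safe, so on NFS it would have to be paired with re-opening the file rather than continuing to touch the old pages.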
> Sigbus when using mmap in multiprocessing env over netapp
> ---------------------------------------------------------
>
> Key: ARROW-12547
> URL: https://issues.apache.org/jira/browse/ARROW-12547
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Reporter: Jay Baywatch
> Priority: Minor
>
> We have noticed a condition where using Arrow to read parquet files that
> reside on our NetApp from Slurm (via Python) raises an occasional signal 7
> (SIGBUS). We have not yet tried disabling memory mapping, although we do
> expect that turning memory mapping off in read_table will resolve the issue.
> This seems to occur when we read a file that has just been written, even
> though we write parquet files to a transient location and then swap the
> file in using os.rename.
>
> All that said, we were not sure whether this is a known issue or whether the
> pyarrow team is interested in the stack trace.
>
>
> Thread 1 (Thread 0x7fafa7dff700 (LWP 44408)):
> #0 __memcpy_avx_unaligned () at
> ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
> #1 0x00007fafb9c40aba in snappy::RawUncompress(snappy::Source*, char*) ()
> from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libarrow.so.300
> #2 0x00007fafb9c41131 in snappy::RawUncompress(char const*, unsigned long,
> char*) () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libarrow.so.300
> #3 0x00007fafb942abbe in arrow::util::internal::(anonymous
> namespace)::SnappyCodec::Decompress(long, unsigned char const*, long,
> unsigned char*) () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libarrow.so.300
> #4 0x00007fafb4d0965e in parquet::(anonymous
> namespace)::SerializedPageReader::DecompressIfNeeded(std::shared_ptr<arrow::Buffer>,
> int, int, int) () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libparquet.so.300
> #5 0x00007fafb4d2bc2d in parquet::(anonymous
> namespace)::SerializedPageReader::NextPage() () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libparquet.so.300
> #6 0x00007fafb4d330c3 in parquet::(anonymous
> namespace)::ColumnReaderImplBase<parquet::PhysicalType<(parquet::Type::type)5>
> >::HasNextInternal() [clone .part.0] () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libparquet.so.300
> #7 0x00007fafb4d33eb8 in parquet::internal::(anonymous
> namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)5>
> >::ReadRecords(long) () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libparquet.so.300
> #8 0x00007fafb4d21bb8 in parquet::arrow::(anonymous
> namespace)::LeafReader::LoadBatch(long) () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libparquet.so.300
> #9 0x00007fafb4d489c8 in parquet::arrow::ColumnReaderImpl::NextBatch(long,
> std::shared_ptr<arrow::ChunkedArray>*) () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libparquet.so.300
> #10 0x00007fafb4d32db9 in arrow::internal::FnOnce<void
> ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture
> (arrow::Future<arrow::detail::Empty>, parquet::arrow::(anonymous
> namespace)::FileReaderImpl::GetRecordBatchReader(std::vector<int,
> std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&,
> std::unique_ptr<arrow::RecordBatchReader,
> std::default_delete<arrow::RecordBatchReader>
> >*)::{lambda()#1}::operator()()::{lambda(int)#1}, int)> >::invoke() () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libparquet.so.300
> #11 0x00007fafb9444ddd in
> std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}>
> > >::_M_run() () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libarrow.so.300
> #12 0x00007fafb9dd3580 in execute_native_thread_routine () from
> /home/svc_backtest/portfolio_analytics/prod/pyenv/lib/python3.7/site-packages/pyarrow/libarrow.so.300
> #13 0x00007fafefcdc6ba in start_thread (arg=0x7fafa7dff700) at
> pthread_create.c:333
> #14 0x00007fafefa1241d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
--
This message was sent by Atlassian Jira
(v8.3.4#803005)