[ https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney reassigned ARROW-1873:
-----------------------------------

    Assignee: Wes McKinney

> [Python] Segmentation fault when loading total 2GB of parquet files
> -------------------------------------------------------------------
>
>                 Key: ARROW-1873
>                 URL: https://issues.apache.org/jira/browse/ARROW-1873
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: DB Tsai
>            Assignee: Wes McKinney
>             Fix For: 0.8.0
>
>
> We are trying to load 100 parquet files, each of which is around 20MB.
> Until we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to
> list all the files and then load them into a pandas DataFrame through pyarrow.
> The schema of the parquet files is as follows:
> {code:java}
> root
>  |-- dateint: integer (nullable = true)
>  |-- profileid: long (nullable = true)
>  |-- time: long (nullable = true)
>  |-- label: double (nullable = true)
>  |-- weight: double (nullable = true)
>  |-- features: array (nullable = true)
>  |    |-- element: double (containsNull = true)
> {code}
> If we load only a couple of them, it works without any issue. However, when
> loading all 100 of them, we get the segmentation fault shown below. FYI, if we
> flatten {{features: array[double]}} to the top level, the file sizes are about
> the same, and loading works fine too.
> Is there anything we can try to eliminate this issue? Thanks.
> {code}
> >>> import glob
> >>> import pyarrow.parquet as pq
> >>> files = glob.glob("/home/dbt/data/*")
> >>> data = pq.ParquetDataset(files).read().to_pandas()
> [New Thread 0x7fffe8f84700 (LWP 23769)]
> [New Thread 0x7fffe3b93700 (LWP 23770)]
> [New Thread 0x7fffe3392700 (LWP 23771)]
> [New Thread 0x7fffe2b91700 (LWP 23772)]
> [Thread 0x7fffe2b91700 (LWP 23772) exited]
> [Thread 0x7fffe3b93700 (LWP 23770) exited]
> Thread 4 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3392700 (LWP 23771)]
> 0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> (gdb) backtrace
> #0  0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #1  0x00007ffff2700b5a in arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object*, _object**) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #2  0x00007ffff2714985 in arrow::Status arrow::py::ConvertListsLike<arrow::DoubleType>(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object**) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #3  0x00007ffff2716b92 in arrow::py::ObjectBlock::Write(std::shared_ptr<arrow::Column> const&, long, long) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #4  0x00007ffff270a489 in arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int) const ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #5  0x00007ffff270a67c in std::thread::_Impl<std::_Bind_simple<arrow::Status arrow::ParallelFor<arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&>(int, int, arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1} ()> >::_M_run() ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #6  0x00007ffff1e30c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
>    at /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
> #7  0x00007ffff7bc16ba in start_thread (arg=0x7fffe3392700) at pthread_create.c:333
> #8  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
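
Since the report says a couple of files load fine while all 100 together crash, one way to narrow the problem down might be to load the files in smaller groups. The following is only a sketch, not a fix and not part of pyarrow: `batch_by_size` is a hypothetical helper name, it groups files by on-disk size (which is not the same as in-memory size), and the 512 MiB limit in the commented usage is an arbitrary assumption.

```python
import os
from typing import Iterable, List


def batch_by_size(paths: Iterable[str], max_bytes: int) -> List[List[str]]:
    """Group paths into consecutive batches of at most max_bytes on disk.

    A single file larger than max_bytes still gets a batch of its own.
    (Hypothetical helper for illustration; not a pyarrow API.)
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for path in paths:
        size = os.path.getsize(path)
        # Close the current batch if adding this file would exceed the limit.
        if current and current_bytes + size > max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Possible usage with the directory from the report (requires pyarrow and
# pandas; the 512 MiB limit is an arbitrary assumption):
#   import glob
#   import pandas as pd
#   import pyarrow.parquet as pq
#   files = sorted(glob.glob("/home/dbt/data/*"))
#   frames = [pq.ParquetDataset(batch).read().to_pandas()
#             for batch in batch_by_size(files, 512 * 1024 ** 2)]
#   data = pd.concat(frames, ignore_index=True)
```

If every batch loads on its own but the full list does not, that would suggest the crash depends on the total amount of data converted at once rather than on any single file.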