[ https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710926#comment-16710926 ]

Tanya Schlusser commented on ARROW-3792:
----------------------------------------

Sweet! I'll stop on this then :)

> [Python] Segmentation fault when writing empty RecordBatches to Parquet
> -----------------------------------------------------------------------
>
>                  Key: ARROW-3792
>                  URL: https://issues.apache.org/jira/browse/ARROW-3792
>              Project: Apache Arrow
>           Issue Type: Bug
>           Components: C++, Python
>     Affects Versions: 0.11.1
>          Environment: Fedora 28, pyarrow installed with pip
>                       Fedora 29, pyarrow installed from conda-forge
>             Reporter: Suvayu Ali
>             Priority: Major
>               Labels: parquet
>              Fix For: 0.12.0
>
>         Attachments: minimal_bug_arrow3792.py, pq-bug.py
>
>
> h2. Background
> I am trying to convert a very sparse dataset to Parquet (only ~3% of the rows in a range are populated). The file I am working with spans up to ~63M rows, so I decided to iterate in batches of 500k rows, 127 batches in total. Each row batch is a {{RecordBatch}}. I create 4 batches at a time and write them to a Parquet file incrementally. Something like this:
> {code:python}
> batches = [..]  # 4 batches
> tbl = pa.Table.from_batches(batches)
> pqwriter.write_table(tbl, row_group_size=15000)
> # same issue with pq.write_table(..)
> {code}
> I was getting a segmentation fault at the final step, and I narrowed it down to a specific iteration. That iteration had empty batches; specifically, the batch sizes were [0, 0, 2876, 14423]. The number of rows in each {{RecordBatch}} for the whole dataset is below:
> {code:python}
> [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
>  15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
>  14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
>  14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
>  13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
>  15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
>  15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
>  15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
>  15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
>  16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
>  15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
>  15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
>  18807, 18789, 14258, 0, 0]
> {code}
> On excluding the empty {{RecordBatch}}-es the segfault goes away (see the sketch after the steps below), but unfortunately I couldn't create a proper minimal example with synthetic data.
> h2. Not quite minimal example
> The data I am using is from the 1000 Genomes Project, which has been public for many years, so we can be reasonably sure the data is good. The following steps should help you replicate the issue.
> # Download the data file (and its index), about 330MB:
> {code:bash}
> $ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
> {code}
> # Install the Cython library {{pysam}}, a thin wrapper around the reference implementation of the VCF file spec. You will need the {{zlib}} headers, but that's probably not a problem :)
> {code:bash}
> $ pip3 install --user pysam
> {code}
> # Now you can use the attached script to replicate the crash.
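> For reference, here is a rough, untested sketch of the write pattern together with the workaround of dropping zero-row batches before assembling the table. The single string column and file name are made up for illustration (the real schema comes from the VCF data), and since synthetic data did not give me a proper minimal example, this shows the shape of the code rather than a guaranteed reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Hypothetical one-column schema standing in for the real VCF columns.
> schema = pa.schema([pa.field('ref', pa.string())])
>
> def make_batch(n):
>     # n == 0 yields a zero-row RecordBatch, the kind that coincided with the segfault.
>     return pa.RecordBatch.from_arrays([pa.array(['A'] * n, type=pa.string())], ['ref'])
>
> # The batch sizes from the failing iteration: two empty, two populated.
> batches = [make_batch(0), make_batch(0), make_batch(2876), make_batch(14423)]
>
> # Workaround: skip zero-row batches before building the Table.
> batches = [b for b in batches if b.num_rows > 0]
>
> pqwriter = pq.ParquetWriter('chunk.parquet', schema)
> tbl = pa.Table.from_batches(batches)
> pqwriter.write_table(tbl, row_group_size=15000)
> pqwriter.close()
> {code}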
> h2. Extra information
> I have tried attaching gdb; the backtrace when the segfault occurs is shown below (maybe it helps; this is how I realised empty batches could be the reason).
> {code}
> (gdb) bt
> #0  0x00007f3e7676d670 in parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) ()
>     from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #1  0x00007f3e76733d1e in arrow::Status parquet::arrow::(anonymous namespace)::ArrowColumnWriter::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>, arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
>     from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #2  0x00007f3e7673a3d4 in parquet::arrow::(anonymous namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
>     from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #3  0x00007f3e7673df09 in parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray> const&, long, long) ()
>     from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #4  0x00007f3e7673c74d in parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray> const&, long, long) ()
>     from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #5  0x00007f3e7673c8d2 in parquet::arrow::FileWriter::WriteTable(arrow::Table const&, long) ()
>     from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #6  0x00007f3e731e3a51 in __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, _object*) ()
>     from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)