[ https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714219#comment-16714219 ]

Wes McKinney commented on ARROW-3792:
-------------------------------------

The Parquet writer code behaves incorrectly when writing a length-0 array. 
There is another bug report about writing length-0 record batches, so possibly 
the same fix is involved. 
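
A minimal sketch of the failure mode described here (the file name and schema 
are illustrative, not taken from the attached scripts):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# A table built from a single length-0 RecordBatch
empty = pa.RecordBatch.from_arrays([pa.array([], type=pa.string())], ['x'])
tbl = pa.Table.from_batches([empty])

# On affected versions (0.11.1) this write hits the length-0 path and crashes
pq.write_table(tbl, 'empty.parquet')
{code}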

> [Python] Segmentation fault when writing empty RecordBatches to Parquet
> -----------------------------------------------------------------------
>
>                 Key: ARROW-3792
>                 URL: https://issues.apache.org/jira/browse/ARROW-3792
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.11.1
>         Environment: Fedora 28, pyarrow installed with pip
> Fedora 29, pyarrow installed from conda-forge
>            Reporter: Suvayu Ali
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.12.0
>
>         Attachments: minimal_bug_arrow3792.py, pq-bug.py
>
>
> h2. Background
> I am trying to convert a very sparse dataset to Parquet (~3% of rows in a 
> range are populated). The file I am working with spans up to ~63M rows. I 
> decided to iterate in batches of 500k rows, 127 batches in total. Each batch 
> is a {{RecordBatch}}. I create 4 batches at a time and write them to a 
> Parquet file incrementally. Something like this:
> {code:python}
> batches = [..]  # 4 batches
> tbl = pa.Table.from_batches(batches)
> pqwriter.write_table(tbl, row_group_size=15000)
> # same issue with pq.write_table(..)
> {code}
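> A fuller sketch of that loop, assuming a lazily opened {{ParquetWriter}} 
> ({{make_batches()}} is a hypothetical stand-in for the VCF-reading code in 
> the attached script):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> pqwriter = None
> for batches in make_batches():  # hypothetical: yields lists of 4 RecordBatches
>     tbl = pa.Table.from_batches(batches)
>     if pqwriter is None:
>         # open the writer once, as soon as the schema is known
>         pqwriter = pq.ParquetWriter('sparse.parquet', tbl.schema)
>     pqwriter.write_table(tbl, row_group_size=15000)
> if pqwriter is not None:
>     pqwriter.close()
> {code}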
> I was getting a segmentation fault at the final step, and I narrowed it down 
> to a specific iteration. I noticed that iteration had empty batches; 
> specifically, [0, 0, 2876, 14423]. The number of rows in each 
> {{RecordBatch}} for the whole dataset is shown below:
> {code:python}
> [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
> 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
> 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
> 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
> 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
> 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
> 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
> 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
> 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
> 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
> 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
> 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
> 18807, 18789, 14258, 0, 0]
> {code}
> When the empty {{RecordBatch}}es are excluded, the segfault goes away, but 
> unfortunately I couldn't create a proper minimal example with synthetic data.
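> A minimal sketch of that workaround, filtering on {{num_rows}} before 
> building the table (names as in the snippet above):
> {code:python}
> # Drop length-0 batches before assembling the table
> batches = [b for b in batches if b.num_rows > 0]
> if batches:
>     tbl = pa.Table.from_batches(batches)
>     pqwriter.write_table(tbl, row_group_size=15000)
> {code}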
> h2. Not quite minimal example
> The data I am using is from the 1000 Genomes Project, which has been public 
> for many years, so we can be reasonably sure the data is good. The following 
> steps should help you replicate the issue.
> # Download the data file (and index), about 330MB:
> {code:bash}
$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
> {code}
> # Install the Cython library {{pysam}}, a thin wrapper around the reference 
> implementation of the VCF file spec. You will need {{zlib}} headers, but 
> that's probably not a problem :)
> {code:bash}
> $ pip3 install --user pysam
> {code}
> # Now you can use the attached script to replicate the crash.
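> For reference, a hypothetical sketch of how such a script might read one 
> 500k-row window with {{pysam}} (the real code is in the attached 
> minimal_bug_arrow3792.py and pq-bug.py):
> {code:python}
> import pysam
> import pyarrow as pa
>
> vcf = pysam.VariantFile('ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz')
> # A sparsely populated window yields few (or zero) records, and a
> # length-0 array is what triggers the crash
> positions = [rec.pos for rec in vcf.fetch('20', 0, 500000)]
> batch = pa.RecordBatch.from_arrays([pa.array(positions, type=pa.int64())], ['pos'])
> {code}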
> h2. Extra information
> I tried attaching gdb; the backtrace when the segfault occurs is shown 
> below (maybe it helps; this is how I realised the empty batches could be 
> the reason).
> {code}
> (gdb) bt
> #0  0x00007f3e7676d670 in 
> parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray 
> const*) ()
>    from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #1  0x00007f3e76733d1e in arrow::Status parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>,
>  arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
>    from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #2  0x00007f3e7673a3d4 in parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
>    from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #3  0x00007f3e7673df09 in 
> parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray>
>  const&, long, long) ()
>    from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #4  0x00007f3e7673c74d in 
> parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray>
>  const&, long, long) ()
>    from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #5  0x00007f3e7673c8d2 in parquet::arrow::FileWriter::WriteTable(arrow::Table 
> const&, long) ()
>    from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #6  0x00007f3e731e3a51 in 
> __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, 
> _object*) ()
>    from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
> {code}


