[ https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney resolved PARQUET-1245.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 447
[https://github.com/apache/parquet-cpp/pull/447]

> [C++] Segfault when writing Arrow table with duplicate columns
> --------------------------------------------------------------
>
>                 Key: PARQUET-1245
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1245
>             Project: Parquet
>          Issue Type: Bug
>         Environment: Linux Mint 18.2
>                      Anaconda Python distribution + pyarrow installed from the conda-forge channel
>            Reporter: Alexey Strokach
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting the resulting tables to Pandas DataFrames or when saving the tables to Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> import pyarrow.parquet as pq
>
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet')  # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...), everything works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> import pyarrow.parquet as pq
>
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table = table.remove_column(34)  # remove_column returns a new Table
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors
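
As a workaround until the fix ships in cpp-1.5.0, duplicate columns can be dropped by name before converting or writing. The following is a minimal sketch, not from the ticket: it assumes pyarrow's Table API (schema.names and remove_column, which returns a new Table) and keeps the first occurrence of each duplicated name; the file paths are placeholders.

{code:python}
import pyarrow.parquet as pq

table = pq.read_table('/path/to/duplicate_column_file.parquet')  # placeholder path

# Collect the index of every occurrence of a column name after the first.
seen = set()
duplicates = []
for i, name in enumerate(table.schema.names):
    if name in seen:
        duplicates.append(i)
    seen.add(name)

# Remove duplicates from the back so earlier indices stay valid;
# Table.remove_column returns a new Table rather than mutating in place.
for i in reversed(duplicates):
    table = table.remove_column(i)

table.to_pandas()                              # no longer segfaults
pq.write_table(table, '/some/output.parquet')  # placeholder output path
{code}

Removing from the highest index downward matters because each removal shifts every later column one position to the left.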