[ https://issues.apache.org/jira/browse/ARROW-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099866#comment-17099866 ]
Eric Kisslinger commented on ARROW-8694:
----------------------------------------

I can't really disagree with the founders of this very useful and innovative technology. However, several reputable big-data blogs state that Parquet is well suited for wide tables with lots of columns (e.g. [https://docs.cloudera.com/runtime/7.1.0/impala-reference/topics/impala-parquet.html]). One common use case I have is to quickly read ~1,000 columns from a ~100,000-column file. This used to be very fast, but performance has slowed over time with newer releases. It might be helpful to have a short section in the docs describing what Parquet is well suited for and what it is not.

BTW, I'm finding the performance and storage footprint of Feather files with the newly supported LZ4 compression to be very impressive.

> [Python][Parquet] parquet.read_schema() fails when loading wide table created from Pandas DataFrame
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8694
>                 URL: https://issues.apache.org/jira/browse/ARROW-8694
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.0
>         Environment: Linux OS with RHEL 7.7 distribution
>            Reporter: Eric Kisslinger
>            Assignee: Wes McKinney
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.17.1
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> parquet.read_schema() fails when loading a wide table schema created from a Pandas DataFrame with 50,000 columns. This works ok using pyarrow 0.16.0.
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> print(pa.__version__)
> df = pd.DataFrame({'c' + str(i): np.random.randn(10) for i in range(50000)})
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "test_wide.parquet")
> schema = pq.read_schema('test_wide.parquet')
> {code}
>
> Output:
> 0.17.0
> Traceback (most recent call last):
>   File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "<ipython-input-29-d5ef2df77263>", line 9, in <module>
>     table = pq.read_schema('test_wide.parquet')
>   File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py", line 1793, in read_schema
>     return ParquetFile(where, memory_map=memory_map).schema.to_arrow_schema()
>   File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__
>     read_dictionary=read_dictionary, metadata=metadata)
>   File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit

--
This message was sent by Atlassian Jira
(v8.3.4#803005)