[ https://issues.apache.org/jira/browse/ARROW-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099866#comment-17099866 ]

Eric Kisslinger commented on ARROW-8694:
----------------------------------------

I can't really disagree with the founders of this very useful and innovative 
technology. However, several reputable big-data blogs state that Parquet is 
well suited for wide tables with many columns (e.g. 
[https://docs.cloudera.com/runtime/7.1.0/impala-reference/topics/impala-parquet.html]).
One common use case I have is quickly reading ~1,000 columns from a 
~100,000-column file. This used to be very fast, but performance has slowed 
with newer releases. It might be helpful to have a short section in the docs 
describing what Parquet is and is not well suited for. BTW, I'm finding the 
performance and storage footprint of Feather files with the newly supported 
LZ4 compression to be very impressive.
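
For reference, here is a minimal sketch of both access patterns I mean; the 
file names, column counts, and the 1,000-column subset are just illustrative:

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Illustrative wide table (10,000 columns here instead of ~100,000 to keep it quick).
df = pd.DataFrame({'c' + str(i): np.random.randn(10) for i in range(10000)})
table = pa.Table.from_pandas(df)

# Parquet: write the wide table, then read back only a ~1,000-column subset.
pq.write_table(table, 'test_wide.parquet')
subset = pq.read_table('test_wide.parquet',
                       columns=['c' + str(i) for i in range(1000)])

# Feather (v2): write with the newly supported LZ4 compression and read it back.
feather.write_feather(table, 'test_wide.feather', compression='lz4')
roundtrip = feather.read_table('test_wide.feather')
{code}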

> [Python][Parquet] parquet.read_schema() fails when loading wide table created 
> from Pandas DataFrame
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8694
>                 URL: https://issues.apache.org/jira/browse/ARROW-8694
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.0
>         Environment: Linux OS with RHEL 7.7 distribution
>            Reporter: Eric Kisslinger
>            Assignee: Wes McKinney
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.17.1
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> parquet.read_schema() fails when loading wide table schema created from 
> Pandas DataFrame with 50,000 columns. This works ok using pyarrow 0.16.0.
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(pa.__version__)
> df = pd.DataFrame(({'c' + str(i): np.random.randn(10) for i in range(50000)}))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "test_wide.parquet")
> schema = pq.read_schema('test_wide.parquet'){code}
> Output:
> 0.17.0
> Traceback (most recent call last):
>  File 
> "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/IPython/core/interactiveshell.py",
>  line 3319, in run_code
>  exec(code_obj, self.user_global_ns, self.user_ns)
>  File "<ipython-input-29-d5ef2df77263>", line 9, in <module>
>  table = pq.read_schema('test_wide.parquet')
>  File 
> "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 1793, in read_schema
>  return ParquetFile(where, memory_map=memory_map).schema.to_arrow_schema()
>  File 
> "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 210, in __init__
>  read_dictionary=read_dictionary, metadata=metadata)
>  File "pyarrow/_parquet.pyx", line 1023, in 
> pyarrow._parquet.ParquetReader.open
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
