[ https://issues.apache.org/jira/browse/ARROW-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100282#comment-17100282 ]

Wes McKinney commented on ARROW-8694:
-------------------------------------

[~ekisslinger] I don't mean to be argumentative, but where in 
https://docs.cloudera.com/runtime/7.1.0/impala-reference/topics/impala-parquet.html
does it say that Parquet is "well suited for wide tables with lots of columns"? In 
my experience working directly with the developers of Impala as colleagues, it was 
not advisable to have files with more than 1000 columns. So it would be good to let 
Cloudera know which part of their documentation is misleading, because I don't 
think that is their intention. 
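
As a rough illustration of that rule of thumb, here is a minimal sketch that splits a 
wide DataFrame column-wise into several Parquet files of at most 1000 columns each. 
The write_column_chunks helper and the "partN" file naming are hypothetical choices 
for illustration, not an Arrow or Impala recommendation:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

def write_column_chunks(df, prefix, max_cols=1000):
    # Hypothetical helper: split the columns into consecutive groups of at
    # most max_cols and write each group to its own Parquet file.
    cols = list(df.columns)
    for k, start in enumerate(range(0, len(cols), max_cols)):
        chunk = df[cols[start:start + max_cols]]
        pq.write_table(pa.Table.from_pandas(chunk), f"{prefix}.part{k}.parquet")

# With the 50,000-column frame from the report below this yields 50 files:
# write_column_chunks(df, "test_wide")
{code}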

> [Python][Parquet] parquet.read_schema() fails when loading wide table created 
> from Pandas DataFrame
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8694
>                 URL: https://issues.apache.org/jira/browse/ARROW-8694
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.0
>         Environment: Linux OS with RHEL 7.7 distribution
>            Reporter: Eric Kisslinger
>            Assignee: Wes McKinney
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.17.1
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> parquet.read_schema() fails when loading the schema of a wide table created 
> from a Pandas DataFrame with 50,000 columns. The same code works with pyarrow 
> 0.16.0.
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> print(pa.__version__)
>
> # Build a wide frame: 50,000 columns of 10 random doubles each.
> df = pd.DataFrame({'c' + str(i): np.random.randn(10) for i in range(50000)})
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "test_wide.parquet")
>
> # Fails on 0.17.0 while deserializing the file metadata.
> schema = pq.read_schema('test_wide.parquet')
> {code}
> Output:
> 0.17.0
> Traceback (most recent call last):
>   File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "<ipython-input-29-d5ef2df77263>", line 9, in <module>
>     table = pq.read_schema('test_wide.parquet')
>   File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py", line 1793, in read_schema
>     return ParquetFile(where, memory_map=memory_map).schema.to_arrow_schema()
>   File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__
>     read_dictionary=read_dictionary, metadata=metadata)
>   File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
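
For context on the error: "Exceeded size limit" comes from the Thrift deserializer 
that parses the Parquet file metadata, and with 50,000 columns the footer grows past 
the cap applied in 0.17.0 (the 0.17.1 fix listed above restores more generous 
limits). Much later pyarrow releases also expose the limits directly. A minimal 
sketch, assuming a version recent enough to have the thrift_string_size_limit / 
thrift_container_size_limit keyword arguments on ParquetFile (they do not exist in 
0.17.x):

{code:python}
import pyarrow.parquet as pq

# Raise the metadata deserialization caps before reading the schema.
# These keyword arguments were added well after 0.17; check your
# version's ParquetFile documentation before relying on them.
pf = pq.ParquetFile(
    "test_wide.parquet",
    thrift_string_size_limit=1_000_000_000,   # bytes of serialized metadata
    thrift_container_size_limit=10_000_000,   # elements per Thrift container
)
schema = pf.schema.to_arrow_schema()
{code}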



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
