[ 
https://issues.apache.org/jira/browse/ARROW-16546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16546:
-----------------------------------
    Priority: Critical  (was: Major)

> [Python] Pyarrow fails to load parquet file with long column names
> ------------------------------------------------------------------
>
>                 Key: ARROW-16546
>                 URL: https://issues.apache.org/jira/browse/ARROW-16546
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>         Environment: Ubuntu 20.04, pandas 1.4.2
>            Reporter: Boris Urman
>            Priority: Critical
>         Attachments: Screenshot from 2022-05-12 16-59-10.png
>
>
> When loading a parquet file, "OSError: Couldn't deserialize thrift: 
> TProtocolException: Exceeded size limit" is raised. This seems to be related 
> to the memory usage of the table header; the issue may be coming from the C++ 
> code. pyarrow 0.16, however, is able to read the same parquet file.
> The code snippet below reproduces the issue. A screenshot of a 
> jupyter-notebook session with more details is attached.
> The snippet creates 2 pandas dataframes that differ only in their column 
> names. The one with short column names is stored and read without problems, 
> while the other dataframe, with long column names, is stored successfully but 
> raises an exception during reading.
> {code:python}
> import pandas as pd
> import numpy as np
> 
> data = np.random.randn(10, 250000)
> index = range(10)
> short_column_names = [f"col_{i}" for i in range(250000)]
> long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(250000)]
> 
> # Identical dataframes, only the column names differ
> df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
> df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)
> 
> # Storing the dataframe with long column names works, but reading it back fails
> df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
> df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
