[
https://issues.apache.org/jira/browse/ARROW-16546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoine Pitrou updated ARROW-16546:
-----------------------------------
Priority: Critical (was: Major)
> [Python] Pyarrow fails to load parquet file with long column names
> ------------------------------------------------------------------
>
> Key: ARROW-16546
> URL: https://issues.apache.org/jira/browse/ARROW-16546
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 8.0.0
> Environment: Ubuntu 20.04, pandas 1.4.2
> Reporter: Boris Urman
> Priority: Critical
> Attachments: Screenshot from 2022-05-12 16-59-10.png
>
>
> Loading a Parquet file raises "OSError: Couldn't deserialize thrift:
> TProtocolException: Exceeded size limit". This seems to be related to the
> memory usage of the table header. The issue may originate in the native
> (C++) code. Note that pyarrow 0.16 is able to read the same Parquet file.
> The code snippet below reproduces the issue. A screenshot of a Jupyter
> notebook with more details is attached.
> The snippet creates two pandas dataframes that differ only in their column
> names. The one with short column names is stored and read without problems,
> while the one with long column names is stored successfully but raises an
> exception when it is read back.
> {code:python}
> import numpy as np
> import pandas as pd
>
> data = np.random.randn(10, 250000)
> index = range(10)
> short_column_names = [f"col_{i}" for i in range(250000)]
> long_column_names = [
>     f"some_really_long_column_name_ending_with_integer_number_{i}"
>     for i in range(250000)
> ]
>
> # Identical dataframes; only the column names differ
> df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
> df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)
>
> # Storing the dataframe with long column names works, but reading it back fails
> df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
> df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here
> {code}
>
>
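The report suspects the size of the table header. As a rough illustration only, the sketch below estimates how many bytes the column names alone contribute in each case. It assumes each name is stored at least once, UTF-8 encoded, so the result is a lower bound and not a measurement of the actual file metadata. The helper `total_name_bytes` and the constants are hypothetical names introduced for this example; they are not part of pyarrow or pandas.

```python
# Rough lower bound on the bytes that column names contribute to the file
# metadata. Assumption: every name is stored at least once, UTF-8 encoded.
N_COLUMNS = 250_000

SHORT_PREFIX = "col_"
LONG_PREFIX = "some_really_long_column_name_ending_with_integer_number_"


def total_name_bytes(prefix: str, n_columns: int = N_COLUMNS) -> int:
    """Sum the UTF-8 byte lengths of names of the form f"{prefix}{i}"."""
    return sum(len(f"{prefix}{i}".encode("utf-8")) for i in range(n_columns))


short_bytes = total_name_bytes(SHORT_PREFIX)
long_bytes = total_name_bytes(LONG_PREFIX)

print(f"short names: {short_bytes / 1e6:.1f} MB")
print(f"long names:  {long_bytes / 1e6:.1f} MB")
print(f"ratio:       {long_bytes / short_bytes:.1f}x")
```

The two dataframes hold identical data, so any difference in metadata size comes from the names. That is consistent with the long-name file being the only one that hits the limit, although it does not by itself confirm the cause.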