[ https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou reassigned ARROW-13655:
--------------------------------------

    Assignee: Antoine Pitrou

> [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14
> --------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13655
>                 URL: https://issues.apache.org/jira/browse/ARROW-13655
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Joris Van den Bossche
>            Assignee: Antoine Pitrou
>            Priority: Major
>             Fix For: 6.0.0
>
>
> From https://github.com/dask/dask/issues/8027
> Apache Thrift introduced a `MaxMessageSize` configuration option
> (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize)
> in version 0.14 (THRIFT-5237). I think this is the cause of the issue reported
> originally at https://github.com/dask/dask/issues/8027, where one can get an
> _"OSError: Couldn't deserialize thrift: MaxMessageSize reached"_ error while
> reading a large Parquet (metadata-only) file.
> In the original report, the file was written using the Python fastparquet
> library (which uses the Python thrift bindings, which still use Thrift 0.13),
> but I was able to construct a reproducible code example with pyarrow.
> Create a large-metadata Parquet file with pyarrow in an environment where
> Arrow is built against Thrift 0.13 (e.g. a local install from source, or
> pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13):
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> table = pa.table({str(i): np.random.randn(10) for i in range(1_000)})
> pq.write_table(table, "__temp_file_for_metadata.parquet")
> metadata = pq.read_metadata("__temp_file_for_metadata.parquet")
> metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet")
> for _ in range(4000):
>     metadata.append_row_groups(metadata2)
> metadata.write_metadata_file("test_parquet_metadata_large_file.parquet")
> {code}
> Reading this file back in the same environment works fine, but reading it in
> an environment with the newer Thrift 0.14 (e.g. installing the latest pyarrow
> from conda-forge) gives the following error:
> {code:python}
> In [1]: import pyarrow.parquet as pq
> In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet")
> ...
> OSError: Couldn't deserialize thrift: MaxMessageSize reached
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
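As background, the TConfiguration spec linked in the description gives `MaxMessageSize` a default of 100 MB: the transport charges every read against a per-message byte budget and aborts once the budget is exhausted, which is why a single oversized metadata footer trips it. A minimal sketch of that guard follows; the class and names are illustrative only, not the real Thrift API:

{code:python}
# Illustrative sketch of a MaxMessageSize-style guard (names are made up;
# the real check lives inside Thrift's transport layer).
class MaxMessageSizeError(Exception):
    pass

class CountingTransport:
    """Serves bytes from a buffer, failing once a message exceeds the cap."""

    def __init__(self, data, max_message_size=100 * 1024 * 1024):
        self.data = data
        self.pos = 0
        self.remaining = max_message_size

    def read(self, n):
        # Charge the budget before handing out bytes: a streaming
        # deserializer cannot know the total message size up front.
        self.remaining -= n
        if self.remaining < 0:
            raise MaxMessageSizeError("MaxMessageSize reached")
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

# A message one byte over the cap fails mid-read:
t = CountingTransport(b"x" * 2048, max_message_size=1024)
t.read(1024)   # exactly at the limit: fine
try:
    t.read(1)  # one byte over: raises
except MaxMessageSizeError as e:
    print(e)   # MaxMessageSize reached
{code}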