[ https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney resolved PARQUET-1405.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 4230
[https://github.com/apache/arrow/pull/4230]

> [C++] 'Couldn't deserialize thrift' error when reading large binary column
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-1405
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1405
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>         Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3
>            Reporter: Jeremy Heffner
>            Assignee: Deepak Majeti
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: cpp-1.6.0
>
>         Attachments: parquet-issue-example.py
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We've run into issues reading Parquet files that contain long binary columns
> (UTF-8 strings). In particular, we were generating WKT representations of
> polygons that contained roughly 34 million characters when we ran into the
> issue. The attached example generates a dataframe with one record and one
> column containing a random string of 10^7 characters.
> Pandas (using the default pyarrow engine) successfully writes the file, but
> fails upon reading it (a minimal repro sketch follows at the end of this
> message):
> {code:java}
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> <ipython-input-25-25d21204cbad> in <module>()
> ----> 1 df_read_in = pd.read_parquet('test.parquet')
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
>     286
>     287     impl = get_engine(engine)
> --> 288     return impl.read(path, columns=columns, **kwargs)
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
>     129         kwargs['use_pandas_metadata'] = True
>     130         result = self.api.parquet.read_table(path, columns=columns,
> --> 131                                              **kwargs).to_pandas()
>     132         if should_close:
>     133             try:
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
>    1044     fs = _get_fs_from_path(source)
>    1045     return fs.read_parquet(source, columns=columns, metadata=metadata,
> -> 1046                            use_pandas_metadata=use_pandas_metadata)
>    1047
>    1048     pf = ParquetFile(source, metadata=metadata)
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, nthreads, use_pandas_metadata)
>     175                                 filesystem=self)
>     176         return dataset.read(columns=columns, nthreads=nthreads,
> --> 177                             use_pandas_metadata=use_pandas_metadata)
>     178
>     179     def open(self, path, mode='rb'):
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
>     896                                   partitions=self.partitions,
>     897                                   open_file_func=open_file,
> --> 898                                   use_pandas_metadata=use_pandas_metadata)
>     899             tables.append(table)
>     900
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, partitions, open_file_func, file, use_pandas_metadata)
>     459             table = reader.read_row_group(self.row_group, **options)
>     460         else:
> --> 461             table = reader.read(**options)
>     462
>     463         if len(self.partition_keys) > 0:
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
>     150             columns, use_pandas_metadata=use_pandas_metadata)
>     151         return self.reader.read_all(column_indices=column_indices,
> --> 152                                     nthreads=nthreads)
>     153
>     154     def scan_contents(self, columns=None, batch_size=65536):
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()
>
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
>
> ArrowIOError: Couldn't deserialize thrift: No more data to read. Deserializing page header failed.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
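
For reference, a minimal sketch of the repro described above. The attached
parquet-issue-example.py is not reproduced in this message, so the column name
('wkt') and the use of random ASCII letters are illustrative assumptions; only
the shape (one record, one column holding a 10^7-character random string) comes
from the report.

{code:python}
# Minimal repro sketch; 'wkt' and the random-letter payload are assumptions,
# standing in for the attached parquet-issue-example.py.
import random
import string

import pandas as pd

# One record, one column, holding a random string of 10**7 characters.
big_string = ''.join(random.choices(string.ascii_letters, k=10**7))
df = pd.DataFrame({'wkt': [big_string]})

# Writing succeeds with the default pyarrow engine...
df.to_parquet('test.parquet')

# ...but before the fix (apache/arrow pull request 4230), reading back raised:
#   ArrowIOError: Couldn't deserialize thrift: No more data to read.
df_read_in = pd.read_parquet('test.parquet')
{code}

With the fix from pull request 4230 (Fix For: cpp-1.6.0), the read should
round-trip the string intact.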