[ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-3238:
--------------------------------

    Component/s: Python

> [Python] Can't read pyarrow string columns in fastparquet
> ---------------------------------------------------------
>
>                 Key: ARROW-3238
>                 URL: https://issues.apache.org/jira/browse/ARROW-3238
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Theo Walker
>            Priority: Major
>              Labels: parquet
>
> Writing very long string values from pyarrow produces a file that raises an exception when read with fastparquet:
> {code:java}
> Traceback (most recent call last):
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
>     read_fastparquet()
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
>     dff = pf.to_pandas(['A'])
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
>     index=index, assign=parts)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
>     scheme=self.file_scheme)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
>     cats, selfmade, assign=assign)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
>     catdef=out.get(name+'-catdef', None))
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
>     skip_nulls, selfmade=selfmade)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
>     raw_bytes = _read_page(f, header, metadata)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
>     page_header.uncompressed_page_size)
> AssertionError: found 175532 raw bytes (expected 200026)
> {code}
> If the file is written with compression, the read fails with decompression errors instead:
> {code:java}
> SNAPPY: snappy.UncompressError: Error while decompressing: invalid input
> GZIP: zlib.error: Error -3 while decompressing data: incorrect header check
> {code}
> Minimal code to reproduce:
> {code:java}
> import os
> import pandas as pd
> import pyarrow
> import pyarrow.parquet as arrow_pq
> from fastparquet import ParquetFile
>
> # data to generate
> ROW_LENGTH = 40000  # decreasing below ~32750 eliminates the exception
> N_ROWS = 10
>
> # file write params
> ROW_GROUP_SIZE = 5  # lower values eliminate the exception, but corrupt data (e.g. Nones) is read back
> FILENAME = 'test.parquet'
>
> def write_arrow():
>     df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
>     if os.path.isfile(FILENAME):
>         os.remove(FILENAME)
>     arrow_table = pyarrow.Table.from_pandas(df)
>     arrow_pq.write_table(arrow_table,
>                          FILENAME,
>                          use_dictionary=False,
>                          compression='NONE',
>                          row_group_size=ROW_GROUP_SIZE)
>
> def read_arrow():
>     print "arrow:"
>     table2 = arrow_pq.read_table(FILENAME)
>     print table2.to_pandas().head()
>
> def read_fastparquet():
>     print "fastparquet:"
>     pf = ParquetFile(FILENAME)
>     dff = pf.to_pandas(['A'])
>     print dff.head()
>
> write_arrow()
> read_arrow()
> read_fastparquet()
> {code}
> Versions:
> {code:java}
> fastparquet==0.1.6
> pyarrow==0.10.0
> pandas==0.22.0
> sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'
> {code}
> Also opened issue here: https://github.com/dask/fastparquet/issues/375
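
For context on where the assertion comes from: per the traceback, fastparquet's _read_page (core.py, line 31) reads a page's bytes from the file and asserts that the count matches the uncompressed_page_size recorded in the page's Thrift header. A minimal sketch of that check, reconstructed from the traceback alone (illustrative only, not fastparquet's actual implementation; page_header stands in for the standard parquet-format PageHeader):

{code:python}
def read_page_sketch(f, page_header):
    # Illustrative reconstruction of the size check in fastparquet's
    # core.py:_read_page -- NOT the library's actual code.
    # compressed_page_size and uncompressed_page_size are standard fields
    # of the parquet-format Thrift PageHeader.
    raw_bytes = f.read(page_header.compressed_page_size)
    # For an uncompressed file the two sizes must agree exactly; in the
    # report above fastparquet finds 175532 bytes where the header
    # promised 200026.
    assert len(raw_bytes) == page_header.uncompressed_page_size, \
        'found %d raw bytes (expected %d)' % (
            len(raw_bytes), page_header.uncompressed_page_size)
    return raw_bytes
{code}

A mismatch here means the reader's idea of the page boundary disagrees with the sizes the writer recorded, which would also explain the compressed variants: misaligned bytes handed to snappy/zlib fail to decompress instead of failing a size assertion.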
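The compressed failure modes need only a one-argument change to the writer. A hedged variant of write_arrow() from the repro (write_arrow_snappy is a hypothetical name; the defaults mirror the script's constants):

{code:python}
import pandas as pd
import pyarrow
import pyarrow.parquet as arrow_pq

def write_arrow_snappy(filename='test.parquet', row_length=40000,
                       n_rows=10, row_group_size=5):
    # Same writer as write_arrow() in the repro, but snappy-compressed;
    # per the report, reading the result with fastparquet then raises
    # snappy.UncompressError instead of the AssertionError.
    df = pd.DataFrame({'A': ['A' * row_length for _ in range(n_rows)]})
    arrow_table = pyarrow.Table.from_pandas(df)
    arrow_pq.write_table(arrow_table,
                         filename,
                         use_dictionary=False,
                         compression='SNAPPY',
                         row_group_size=row_group_size)
{code}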