[ https://issues.apache.org/jira/browse/ARROW-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15770120#comment-15770120 ]
Wes McKinney commented on ARROW-434:
------------------------------------

artifacts are updated in conda-forge

> Segfaults and encoding issues in Python Parquet reads
> -----------------------------------------------------
>
>                 Key: ARROW-434
>                 URL: https://issues.apache.org/jira/browse/ARROW-434
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: Ubuntu, Python 3.5, installed pyarrow from conda-forge
>            Reporter: Matthew Rocklin
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: parquet, python
>
> I've conda installed pyarrow and am trying to read data from the
> parquet-compatibility project. I haven't explicitly built parquet-cpp or
> anything and may or may not have old versions lying around, so please take
> this issue with some salt:
> {code:python}
> In [1]: import pyarrow.parquet
> In [2]: t = pyarrow.parquet.read_table('nation.plain.parquet')
> ---------------------------------------------------------------------------
> ArrowException                            Traceback (most recent call last)
> <ipython-input-2-5d966681a384> in <module>()
> ----> 1 t = pyarrow.parquet.read_table('nation.plain.parquet')
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx
> in pyarrow.parquet.read_table
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2783)()
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx
> in pyarrow.parquet.ParquetReader.read_all
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2200)()
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/error.pyx
> in pyarrow.error.check_status
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/error.cxx:1185)()
> ArrowException: NotImplemented: list<: uint8>
> {code}
> Additionally I tried to read data from a Python file-like object pointing to
> data on S3. Let me know if you'd prefer a separate issue.
> {code:python}
> In [1]: import s3fs
> In [2]: fs = s3fs.S3FileSystem()
> In [3]: f = fs.open('dask-data/nyc-taxi/2015/parquet/part.0.parquet')
> In [4]: f.read(100)
> Out[4]:
> b'PAR1\x15\x00\x15\x90\xc4\xa2\x12\x15\x90\xc4\xa2\x12,\x15\xc2\xa8\xa4\x02\x15\x00\x15\x06\x15\x08\x00\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00@\xc2\xce\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\x00\x89\xfc\xe7\x8b\x0b\x05\x00@\xcb\x0b\xe8\x8b\x0b\x05\x00\x80\r\x1b\xe8\x8b\x0b'
> In [5]: import pyarrow.parquet
> In [6]: t = pyarrow.parquet.read_table(f)
> Segmentation fault (core dumped)
> {code}
> Here is a more reproducible version:
> {code:python}
> In [1]: with open('nation.plain.parquet', 'rb') as f:
>    ...:     data = f.read()
>    ...:
> In [2]: from io import BytesIO
> In [3]: f = BytesIO(data)
> In [4]: f.seek(0)
> Out[4]: 0
> In [5]: import pyarrow.parquet
> In [6]: t = pyarrow.parquet.read_table(f)
> Segmentation fault (core dumped)
> {code}
> I was however pleased with round-trip functionality within this project,
> which was very pleasant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
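
When reproducing the file-like-object case from the report, it can help to first rule out a truncated or non-Parquet input, since the bytes shown in the report begin with the Parquet magic number PAR1. The sketch below is not part of the original issue and does not address the segfault itself; looks_like_parquet is a hypothetical helper name, and it relies only on the Parquet convention that a file starts and ends with the 4-byte marker PAR1.

```python
import io

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def looks_like_parquet(fobj):
    """Hypothetical sanity check (not a pyarrow API).

    Returns True if the seekable binary file-like object begins and ends
    with the Parquet magic bytes. The stream position is restored afterwards.
    """
    pos = fobj.tell()
    try:
        fobj.seek(0)
        head = fobj.read(4)
        fobj.seek(0, io.SEEK_END)
        size = fobj.tell()
        # Smallest plausible layout: magic + 4-byte footer length + magic.
        if size < 12:
            return False
        fobj.seek(-4, io.SEEK_END)
        tail = fobj.read(4)
        return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
    finally:
        fobj.seek(pos)
```

A check like this could be run on the BytesIO or s3fs file object before it is passed to the reader, so that a crash can be attributed to the reader and not to a damaged input.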