[ https://issues.apache.org/jira/browse/PARQUET-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165652#comment-16165652 ]
Wes McKinney commented on PARQUET-1100:
---------------------------------------

[~jseppanen] I have this fixed in https://github.com/apache/parquet-cpp/pull/398 and am working on finalizing the patch so it can go in. Will make sure that conda/pip wheels for Arrow 0.7.0 include this bug fix

> [C++] Reading repeated types should decode number of records rather than number of values
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1100
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1100
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.2.0
>            Reporter: Jarno Seppanen
>            Assignee: Wes McKinney
>             Fix For: cpp-1.3.0
>
>         Attachments: part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet
>
>
> Reading the attached parquet file into pandas dataframe and then using the dataframe segfaults.
> {noformat}
> Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 11:58:13)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> >>> import pyarrow
> >>> import pyarrow.parquet as pq
> >>> pyarrow.__version__
> '0.6.0'
> >>> import pandas as pd
> >>> pd.__version__
> '0.19.0'
> >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
> ...     .to_pandas()
> >>> len(df)
> 69
> >>> df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 69 entries, 0 to 68
> Data columns (total 6 columns):
> label               69 non-null int32
> account_meta        69 non-null object
> features_type       69 non-null int32
> features_size       69 non-null int32
> features_indices    1 non-null object
> features_values     1 non-null object
> dtypes: int32(3), object(3)
> memory usage: 2.5+ KB
> >>>
> >>> pd.concat([df, df])
> Segmentation fault (core dumped)
> {noformat}
> Actually just print(df) is enough to trigger the segfault

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)