[jira] [Commented] (PARQUET-1100) [C++] Reading repeated types should decode number of records rather than number of values

Wes McKinney (JIRA) Wed, 13 Sep 2017 20:18:40 -0700

    [ https://issues.apache.org/jira/browse/PARQUET-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165652#comment-16165652 ]


Wes McKinney commented on PARQUET-1100:
---------------------------------------

[~jseppanen] I have this fixed in 
https://github.com/apache/parquet-cpp/pull/398 and am finalizing the patch so 
it can be merged. I will make sure that the conda and pip wheels for Arrow 
0.7.0 include this bug fix.
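
For readers unfamiliar with the distinction in the issue title, here is a minimal sketch of "records" versus "values" in a repeated column. It is an illustration only, not the parquet-cpp code from the pull request. The helper names count_records and read_records are hypothetical, and the sketch assumes every list is non-empty and non-null so that definition levels can be ignored. It relies only on the Parquet convention that a repetition level of 0 marks the start of a new record.

```python
def count_records(rep_levels):
    """Count records: each repetition level of 0 starts a new record."""
    return sum(1 for level in rep_levels if level == 0)


def read_records(values, rep_levels, num_records):
    """Group the values of a repeated column into at most num_records lists.

    Hypothetical helper. Assumes one value per repetition level (no nulls,
    no empty lists) and that the first repetition level is 0.
    """
    records = []
    for value, level in zip(values, rep_levels):
        if level == 0:
            # Stop once the requested number of records has been read,
            # regardless of how many values that took.
            if len(records) == num_records:
                break
            records.append([])
        records[-1].append(value)
    return records


# Six values, but only three records: [1, 2, 3], [4], [5, 6].
values = [1, 2, 3, 4, 5, 6]
rep_levels = [0, 1, 1, 0, 0, 1]
first_two = read_records(values, rep_levels, 2)
```

A reader that stops after a fixed number of values instead of a fixed number of records can end up with a different row count for a repeated column than for the flat columns beside it, which is the kind of mismatch the issue title describes.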

> [C++] Reading repeated types should decode number of records rather than number of values
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1100
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1100
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.2.0
>            Reporter: Jarno Seppanen
>            Assignee: Wes McKinney
>             Fix For: cpp-1.3.0
>
>         Attachments: 
> part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet
>
>
> Reading the attached Parquet file into a pandas DataFrame and then using the 
> DataFrame causes a segfault.
> {noformat}
> Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 11:58:13) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 
> >>> import pyarrow
> >>> import pyarrow.parquet as pq
> >>> pyarrow.__version__
> '0.6.0'
> >>> import pandas as pd
> >>> pd.__version__
> '0.19.0'
> >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
> ...        .to_pandas()
> >>> len(df)
> 69
> >>> df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 69 entries, 0 to 68
> Data columns (total 6 columns):
> label               69 non-null int32
> account_meta        69 non-null object
> features_type       69 non-null int32
> features_size       69 non-null int32
> features_indices    1 non-null object
> features_values     1 non-null object
> dtypes: int32(3), object(3)
> memory usage: 2.5+ KB
> >>> 
> >>> pd.concat([df, df])
> Segmentation fault (core dumped)
> {noformat}
> Actually, just print(df) is enough to trigger the segfault.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
