[ https://issues.apache.org/jira/browse/ARROW-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927504#comment-16927504 ]

Joris Van den Bossche commented on ARROW-6520:
----------------------------------------------

So this is due to the table being created invalidly in 0.14.1. Inspecting the
table as created above:

{code}
In [4]: table.column(0)
Out[4]: 
<Column name='col' type=FixedSizeBinaryType(fixed_size_binary[4])>
[
  [
    31323334,
    31323334,
    31323334,
    31323334,
    31323334,
    31323334,
    31323334,
    31323334,
    31323334,
    31323334
  ]
]

In [5]: table.column(0).type 
Out[5]: FixedSizeBinaryType(fixed_size_binary[4])

In [7]: table.column(0).data.chunk(0).type
Out[7]: DataType(binary)
{code}

So the Column (now gone) and ChunkedArray are of FixedSizeBinary type, but the
actual Array they are holding is of variable-width binary type. This is probably
related to the already-fixed issue where the passed schema was not properly used
(there have been several PRs related to that, I think, e.g.
https://github.com/apache/arrow/pull/5189). We now also validate the created
table in Cython.
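
For reference, a minimal sketch of a construction that avoids the mismatch, by
giving the array the fixed-size type up front instead of relying on the schema
to coerce it. This assumes pyarrow >= 0.15, where Column is gone and the
validation mentioned above is in place; the explicit {{type=}} argument to
{{pa.array}} and the {{Table.validate()}} call are the relevant bits:

{code}
import pyarrow as pa

# Build the array with the fixed-size binary type directly, rather than
# creating a variable-width binary array and hoping the schema coerces it.
arr = pa.array([b"1234" for _ in range(10)], type=pa.binary(4))

table = pa.table({"col": arr}, schema=pa.schema([("col", pa.binary(4))]))

# The chunk type now agrees with the declared column type.
assert table.column(0).type == table.column(0).chunk(0).type

# Raises if the table's internal structure is inconsistent.
table.validate()
{code}

Alternatively, {{table.cast(schema)}} can be used to coerce the columns of an
existing table to the schema's types.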





> [Python] Segmentation fault on writing tables with fixed size binary fields 
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-6520
>                 URL: https://issues.apache.org/jira/browse/ARROW-6520
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>         Environment: python(3.7.3), pyarrow(0.14.1), arrow-cpp(0.14.1), 
> parquet-cpp(1.5.1), Arch Linux x86_64
>            Reporter: Furkan Tektas
>            Priority: Critical
>              Labels: newbie
>             Fix For: 0.15.0
>
>
> I'm not sure if this should be reported to Parquet or here.
> When I try to serialize a pyarrow table with a fixed-size binary field 
> (holding 16-byte UUID4 values) to a Parquet file, a segmentation fault 
> occurs.
> Here is the minimal example to reproduce:
> {code}
> import pyarrow as pa
> from pyarrow import parquet as pq
> data = {"col": pa.array([b"1234" for _ in range(10)])}
> fields = [("col", pa.binary(4))]
> schema = pa.schema(fields)
> table = pa.table(data, schema)
> pq.write_table(table, "test.parquet")
> {code}
> {{segmentation fault (core dumped) ipython}}
>  
> Yet, it works if I don't specify the size of the binary field.
> {code}
> import pyarrow as pa
> from pyarrow import parquet as pq
> data = {"col": pa.array([b"1234" for _ in range(10)])}
> fields = [("col", pa.binary())]
> schema = pa.schema(fields)
> table = pa.table(data, schema)
> pq.write_table(table, "test.parquet")
> {code}
> Thanks,


