[ 
https://issues.apache.org/jira/browse/ARROW-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10493:
------------------------------------------
    Fix Version/s: 2.0.1

> [C++][Parquet] Writing nullable nested strings results in wrong data in file
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-10493
>                 URL: https://issues.apache.org/jira/browse/ARROW-10493
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: Python 3.6
>            Reporter: Christian Lundgren
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.1
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When I write a column of type `struct(string)` that has more elements 
> than the write_batch_size, the output contains only the first batch, 
> repeated. The data in batches after the first are not written to the 
> output.
> I only see this behaviour with Arrow 2.0.0; with 1.0.1 the output 
> contains all the data as expected.
>  
> The following Python test case reproduces the problem; the last value in the 
> output is "key-0" instead of the expected "key-1024":
>  
> {code:python}
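> import io
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def test_struct_array():
>     # One more row than the default write batch size, so the data spans two batches.
>     default_writer_batch_size = 1024
>     n_samples = default_writer_batch_size + 1
>     keys = [f"key-{i}" for i in range(n_samples)]
>     expected = list(keys)
>     # Wrap the string column in a struct with a single field named "string".
>     struct_array = pa.StructArray.from_arrays(
>         [pa.array(keys, type=pa.string())],
>         names=["string"],
>     )
>     table = pa.table({"struct": struct_array})
>     # Round-trip the table through an in-memory Parquet file.
>     buf = io.BytesIO()
>     pq.write_table(table, buf)
>     actual = pq.read_table(buf).flatten()[0].to_pylist()
>     # The first batch round-trips correctly ...
>     assert actual[:default_writer_batch_size] == expected[:default_writer_batch_size]
>     # ... but the value from the second batch comes back as "key-0" on 2.0.0.
>     assert actual[-1] == expected[-1], (actual[-1], expected[-1])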
> {code}
>  
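
The reported symptom ("only the first batch, repeated") can be checked against the observed "key-0" with a small pure-Python model. This is only a sketch of the symptom as described in the report, not of the Arrow writer itself; the helper names `simulate_repeated_first_batch` and `first_mismatch` are hypothetical and exist only for this illustration.

```python
from itertools import cycle, islice


def simulate_repeated_first_batch(values, batch_size):
    """Model the described symptom: the first batch is repeated for every later batch."""
    first_batch = values[:batch_size]
    return list(islice(cycle(first_batch), len(values)))


def first_mismatch(actual, expected):
    """Return the index of the first differing element, or None if the lists are equal."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


batch_size = 1024  # default writer batch size, as stated in the report
keys = [f"key-{i}" for i in range(batch_size + 1)]
corrupted = simulate_repeated_first_batch(keys, batch_size)

print(corrupted[-1])                    # last value under the modelled symptom
print(first_mismatch(corrupted, keys))  # index where the data starts to diverge
```

Under this model the last of the 1025 values is the first element of the repeated batch, which is consistent with the "key-0" seen in the failing assertion. A helper like `first_mismatch` could also be applied to `actual` and `expected` in the test case above to locate the first row that diverges.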



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
