[ https://issues.apache.org/jira/browse/ARROW-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-10493:
------------------------------------------
    Fix Version/s: 2.0.1

> [C++][Parquet] Writing nullable nested strings results in wrong data in file
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-10493
>                 URL: https://issues.apache.org/jira/browse/ARROW-10493
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: Python 3.6
>            Reporter: Christian Lundgren
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.1
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When I try writing a column of type `struct(string)` that has more elements
> than the write_batch_size, the output will only contain the first batch,
> repeated. The data in batches after the first batch is not written to the
> output.
> I am only seeing this behaviour with arrow 2.0.0; in 1.0.1 the output
> contains all the data as expected.
>
> This Python test case reproduces the problem: the last value in the output is
> "key-0" instead of the expected "key-1024".
>
> {code:python}
> import io
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def test_struct_array():
>     default_writer_batch_size = 1024
>     n_samples = default_writer_batch_size + 1
>     keys = [f"key-{i}" for i in range(n_samples)]
>     expected = list(keys)
>     struct_array = pa.StructArray.from_arrays(
>         [pa.array(keys, type=pa.string())],
>         names=["string"],
>     )
>     table = pa.table({"struct": struct_array})
>     buf = io.BytesIO()
>     pq.write_table(table, buf)
>     actual = pq.read_table(buf).flatten()[0].to_pylist()
>     assert actual[:1024] == expected[:1024]
>     assert actual[-1] == expected[-1], (actual[-1], expected[-1])
> {code}
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)