[ 
https://issues.apache.org/jira/browse/ARROW-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Lundgren updated ARROW-10493:
---------------------------------------
    Description: 
When I try writing a column of type `struct(string)` that has more elements 
than the write_batch_size, the output will only contain the first batch, 
repeated. The data in batches after the first batch are not written to the 
output.

I am only seeing this behaviour with arrow 2.0.0; in 1.0.1 the output contains 
all the data as expected.
 
This Python test case reproduces the problem; the last value in the output is 
"key-0" instead of the expected "key-1024":
 
{code:python}
import io
import pyarrow as pa
import pyarrow.parquet as pq

def test_struct_array():
    default_writer_batch_size = 1024
    n_samples = default_writer_batch_size + 1
    keys = [f"key-{i}" for i in range(n_samples)]
    expected = list(keys)

    struct_array = pa.StructArray.from_arrays(
        [pa.array(keys, type=pa.string())],
        names=["string"],
    )
    table = pa.table({"struct": struct_array})

    buf = io.BytesIO()
    pq.write_table(table, buf)

    actual = pq.read_table(buf).flatten()[0].to_pylist()

    assert actual[:1024] == expected[:1024]
    assert actual[-1] == expected[-1], (actual[-1], expected[-1])
{code}
 

  was:
When I try writing a column of type `struct(string)` that has more elements 
than the write_batch_size, the output will only contain the first batch, 
repeated. The data in batches after the first batch are not written to the 
output.

I am only seeing this behaviour with arrow 2.0.0; in 1.0.1 the output contains 
all the data as expected.
 
This Python test case reproduces the problem:
 
{code:python}
import io
import pyarrow as pa
import pyarrow.parquet as pq

def test_struct_array():
    default_writer_batch_size = 1024
    n_samples = default_writer_batch_size + 1
    keys = [f"key-{i}" for i in range(n_samples)]
    expected = list(keys)

    struct_array = pa.StructArray.from_arrays(
        [pa.array(keys, type=pa.string())],
        names=["string"],
    )
    table = pa.table({"struct": struct_array})

    buf = io.BytesIO()
    pq.write_table(table, buf)

    actual = pq.read_table(buf).flatten()[0].to_pylist()

    assert actual[:1024] == expected[:1024]
    assert actual[-1] == expected[-1], (actual[-1], expected[-1])
{code}
 


> [C++][Parquet] Writing nullable nested strings results in wrong data in file
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-10493
>                 URL: https://issues.apache.org/jira/browse/ARROW-10493
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: Python 3.6
>            Reporter: Christian Lundgren
>            Priority: Major
>
> When I try writing a column of type `struct(string)` that has more elements 
> than the write_batch_size, the output will only contain the first batch, 
> repeated. The data in batches after the first batch are not written to the 
> output.
> I am only seeing this behaviour with arrow 2.0.0; in 1.0.1 the output 
> contains all the data as expected.
>  
> This Python test case reproduces the problem; the last value in the output 
> is "key-0" instead of the expected "key-1024":
>  
> {code:python}
> import io
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def test_struct_array():
>     default_writer_batch_size = 1024
>     n_samples = default_writer_batch_size + 1
>     keys = [f"key-{i}" for i in range(n_samples)]
>     expected = list(keys)
>     struct_array = pa.StructArray.from_arrays(
>         [pa.array(keys, type=pa.string())],
>         names=["string"],
>     )
>     table = pa.table({"struct": struct_array})
>     buf = io.BytesIO()
>     pq.write_table(table, buf)
>     actual = pq.read_table(buf).flatten()[0].to_pylist()
>     assert actual[:1024] == expected[:1024]
>     assert actual[-1] == expected[-1], (actual[-1], expected[-1])
> {code}
>  
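The test in the description only checks the first 1024 values and the last one. To see exactly where the written data starts to diverge, a small comparison helper might be useful. The sketch below is illustrative only: `first_mismatch` is a hypothetical helper, not part of pyarrow, and the `corrupted` list merely simulates the reported symptom (the first batch repeated) rather than reproducing real pyarrow output.

```python
def first_mismatch(actual, expected):
    """Return the index of the first differing element, or None if equal."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    # All shared positions match; a length difference is still a mismatch.
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


# Simulated symptom (assumption): values after the first batch are replaced
# by values from the start of the first batch.
batch_size = 1024
expected = [f"key-{i}" for i in range(batch_size + 1)]
corrupted = expected[:batch_size] + expected[:1]

index = first_mismatch(corrupted, expected)
print(index, corrupted[index], expected[index])
```

In the reproduction, `first_mismatch(actual, expected)` could be called in place of the two slice assertions; a result equal to the batch size would be consistent with the corruption starting at the first batch boundary.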


