[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965731#comment-15965731 ]
Itai Incze commented on ARROW-809:
----------------------------------

The reason for the bug seems to be that sliced arrays are detected with the condition {{(array.offset() != 0)}} throughout {{arrow/ipc/writer.cc}}, which does not account for a {{\[0:x\]}} slice. Detecting a zero-offset slice requires knowing the number of elements in the original array, which is easy to compute for fixed-width array types but can be harder for others. Another approach would be to add a class member that marks an array as a slice: a simple boolean, the original dimensions, or a reference to the original array.

> C++: Writing sliced record batch to IPC writes the entire array
> ---------------------------------------------------------------
>
>                 Key: ARROW-809
>                 URL: https://issues.apache.org/jira/browse/ARROW-809
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Itai Incze
>            Priority: Minor
>
> The bug can be triggered through python:
> {code}
> import pyarrow.parquet
> array = pyarrow.array.from_pylist([1] * 1000000)
> rb = pyarrow.RecordBatch.from_arrays([array], ['a'])
> rb2 = rb.slice(0, 2)
> with open('/tmp/t.arrow', 'wb') as f:
>     w = pyarrow.ipc.FileWriter(f, rb.schema)
>     w.write_batch(rb2)
>     w.close()
> {code}
> which results in a large file:
> {code}
> $ ll /tmp/t.arrow
> -rw-rw-r-- 1 itai itai 800618 Apr 12 13:22 /tmp/t.arrow
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
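The detection gap described in the comment can be sketched in plain Python. This is an illustrative model only: {{is_sliced}} and its parameters are hypothetical names, not Arrow's actual API. For a fixed-width type, a zero-offset slice is only visible by comparing the logical length against the element capacity implied by the buffer size, as the comment suggests.

```python
# Hypothetical sketch of the fixed-width case from the comment above;
# is_sliced and its parameters are illustrative, not Arrow's real code.
def is_sliced(offset, length, buffer_bytes, byte_width):
    """Return True if (offset, length) covers less than the whole buffer."""
    capacity = buffer_bytes // byte_width  # elements the buffer can hold
    return offset != 0 or length < capacity

# A [0:2] slice of a 1,000,000-element int64 array (8 bytes/element):
# the offset-only check `offset != 0` evaluates to False and misses it,
# while the capacity comparison catches it.
print(is_sliced(0, 2, 8_000_000, 8))          # True  -> must truncate buffers
print(is_sliced(0, 1_000_000, 8_000_000, 8))  # False -> genuinely the full array
```

For variable-width types (strings, lists) the capacity cannot be derived from the buffer size alone, which is why the comment also floats marking slices explicitly on the array object.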