[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array
[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970657#comment-15970657 ]

Wes McKinney commented on ARROW-809:
---

PR (WIP): https://github.com/apache/arrow/pull/555

> C++: Writing sliced record batch to IPC writes the entire array
> ---
>
>         Key: ARROW-809
>         URL: https://issues.apache.org/jira/browse/ARROW-809
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>    Reporter: Itai Incze
>    Assignee: Wes McKinney
>    Priority: Minor
>     Fix For: 0.3.0
>
> The bug can be triggered through python:
> {code}
> import pyarrow.parquet
> array = pyarrow.array.from_pylist([1] * 100)
> rb = pyarrow.RecordBatch.from_arrays([array], ['a'])
> rb2 = rb.slice(0,2)
> with open('/tmp/t.arrow', 'wb') as f:
>     w = pyarrow.ipc.FileWriter(f, rb.schema)
>     w.write_batch(rb2)
>     w.close()
> {code}
> which will result in a big file:
> {code}
> $ ll /tmp/t.arrow
> -rw-rw-r-- 1 itai itai 800618 Apr 12 13:22 /tmp/t.arrow
> {code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array
[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968317#comment-15968317 ]

Itai Incze commented on ARROW-809:
---

I wrote my comment before seeing your latest one, so it was not meant to cast doubt on the solution. I've seen that code... though I'm certain you're much better acquainted with it than I am :)
[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array
[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968310#comment-15968310 ]

Wes McKinney commented on ARROW-809:
---

There is some buffer slicing happening on the IPC write path already: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L207. It needs to be made consistent (and well tested), though.
[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array
[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968295#comment-15968295 ]

Itai Incze commented on ARROW-809:
---

I've fiddled with it a bit. Without altering the array class, I found it is a problem to determine the exact number of items in a boolean array (where it doesn't matter) and in a union array. There may be other instances as well that I'm not aware of.
It seems to me that adding a private boolean {{IsSliced}} to the array is the cleanest way.
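One plausible reading of the boolean-array difficulty mentioned above (the comment does not spell it out) is that boolean values are bit-packed, so the byte size of a data buffer does not pin down an exact element count. The sketch below illustrates only that arithmetic; the helper names `bitmap_nbytes` and `possible_lengths` are hypothetical, and buffer padding is ignored.

```python
def bitmap_nbytes(num_values):
    """Bytes needed to bit-pack `num_values` booleans (8 values per byte)."""
    return (num_values + 7) // 8


def possible_lengths(nbytes):
    """All element counts that would occupy exactly `nbytes` bytes when bit-packed.

    Hypothetical helper: it ignores any padding added to the buffer.
    """
    if nbytes == 0:
        return range(0, 1)
    return range(8 * (nbytes - 1) + 1, 8 * nbytes + 1)


# A 2-byte bitmap could hold anywhere from 9 to 16 booleans, so the
# original element count cannot be recovered from the buffer size alone.
candidates = list(possible_lengths(2))
```

With padding applied on top of this, the range of candidate lengths only grows, which is consistent with the suggestion to record the sliced state explicitly rather than infer it.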
[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array
[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968289#comment-15968289 ]

Itai Incze commented on ARROW-809:
---

Agreed, it's a small and easy bug. All that is needed is to agree on the approach.
I've fiddled with it a bit. Without altering the array class, I found it is a problem to determine the exact number of items in a boolean array (where it doesn't matter) and in a union array. There may be other instances as well that I'm not aware of.
It seems to me that adding a private boolean {{IsSliced}} to the array is the cleanest way.
[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array
[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968285#comment-15968285 ]

Wes McKinney commented on ARROW-809:
---

I'm going to truncate the data buffers to a 64-byte padding offset; the patch is probably coming tomorrow.
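As a rough illustration of what truncating to a 64-byte padded size could mean for a fixed-width column, the sketch below rounds the bytes actually needed by a slice up to the next multiple of 64. This is only one interpretation of the comment above, not the actual patch; `padded_length` and `truncated_nbytes` are hypothetical helpers.

```python
ALIGNMENT = 64  # padding boundary named in the comment


def padded_length(nbytes, alignment=ALIGNMENT):
    """Round `nbytes` up to the next multiple of `alignment`."""
    return (nbytes + alignment - 1) // alignment * alignment


def truncated_nbytes(num_values, byte_width):
    """Bytes written for a slice of `num_values` fixed-width values.

    Assumption: only the values inside the slice are kept, then padded.
    """
    return padded_length(num_values * byte_width)


# A 2-element slice of 8-byte integers needs 16 bytes, padded to 64,
# regardless of how large the parent buffer is.
slice_bytes = truncated_nbytes(2, 8)
```

Under this reading, the size written for a slice depends on the slice length rather than on the length of the array it was taken from.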
[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array
[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968010#comment-15968010 ]

Wes McKinney commented on ARROW-809:
---

Marked for 0.3. I don't think this should be hard to fix.
[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array
[ https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965731#comment-15965731 ]

Itai Incze commented on ARROW-809:
---

The reason for the bug seems to be that sliced arrays are detected with the condition {{(array.offset() != 0)}} throughout {{arrow/ipc/writer.cc}}, which doesn't account for a {{[0:x]}} slice.
To detect 0-based slices, the number of elements in the original array must be known. That is easily computed for fixed-width array types but could be harder for others.
Another approach could be adding a class member to mark slices. This could be a simple boolean, the original dimensions, or a reference to the original array.
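The gap in the offset-only check can be shown with a small pure-Python model. This is a sketch, not Arrow code: `ArrayView`, `looks_sliced_by_offset`, and `is_sliced` are hypothetical names, and the model assumes the parent's element count is available, which is exactly what the comment above says can be hard to obtain for some array types.

```python
from dataclasses import dataclass


@dataclass
class ArrayView:
    """Hypothetical stand-in for an array: a window over a parent buffer."""
    parent_length: int  # number of elements in the underlying buffer
    offset: int         # index of the first visible element
    length: int         # number of visible elements

    def slice(self, offset, length):
        # A slice shares the parent buffer and only adjusts offset and length.
        return ArrayView(self.parent_length, self.offset + offset, length)


def looks_sliced_by_offset(arr):
    # The condition described in the comment: only a non-zero offset counts.
    return arr.offset != 0


def is_sliced(arr):
    # Also compares the visible length against the parent's element count.
    return arr.offset != 0 or arr.length != arr.parent_length


full = ArrayView(parent_length=100, offset=0, length=100)
head = full.slice(0, 2)   # the [0:x] case from the report
tail = full.slice(98, 2)
```

Here `head` has a zero offset, so the offset-only check treats it like the full array, while `tail` is caught by both checks.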