[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working
[ https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-1692:
----------------------------------
    Labels: columnar-format-1.0 pull-request-available  (was: columnar-format-1.0)

> [Python, Java] UnionArray round trip not working
> ------------------------------------------------
>
>                 Key: ARROW-1692
>                 URL: https://issues.apache.org/jira/browse/ARROW-1692
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Integration, Java, Python
>            Reporter: Philipp Moritz
>            Assignee: Ryan Murray
>            Priority: Blocker
>              Labels: columnar-format-1.0, pull-request-available
>             Fix For: 1.0.0
>
>         Attachments: union_array.arrow
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm currently working on making pyarrow.serialization data available from the
> Java side. One problem I ran into is that the Java implementation apparently
> cannot read UnionArrays generated from C++. To make this easily reproducible,
> I created a clean Python implementation for creating UnionArrays:
> https://github.com/apache/arrow/pull/1216
>
> The data is generated with the following script:
> {code}
> import pyarrow as pa
>
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
> int64 = pa.array([1, 2, 3], type='int64')
> types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
>
> result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)
> batch = pa.RecordBatch.from_arrays([result], ["test"])
>
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> sink.close()
> b = sink.get_result()
>
> with open("union_array.arrow", "wb") as f:
>     f.write(b)
>
> # Sanity check: read the batch in again
> with open("union_array.arrow", "rb") as f:
>     b = f.read()
> reader = pa.RecordBatchStreamReader(pa.BufferReader(b))
> batch = reader.read_next_batch()
> print("union array is", batch.column(0))
> {code}
> I attached the file generated by that script.
> Then, when I run the following code in Java:
> {code}
> RootAllocator allocator = new RootAllocator(10);
> ByteArrayInputStream in = new ByteArrayInputStream(
>     Files.readAllBytes(Paths.get("union_array.arrow")));
> ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
> reader.loadNextBatch()
> {code}
> I get the following error:
> {code}
> |  java.lang.IllegalArgumentException thrown: Could not load buffers for
>    field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error
>    message: can not truncate buffer to a larger size 7: 0
> |        at VectorLoader.loadBuffers (VectorLoader.java:83)
> |        at VectorLoader.load (VectorLoader.java:62)
> |        at ArrowReader$1.visit (ArrowReader.java:125)
> |        at ArrowReader$1.visit (ArrowReader.java:111)
> |        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
> |        at ArrowReader.loadNextBatch (ArrowReader.java:137)
> |        at (#7:1)
> {code}
> It seems like Java is not picking up that the UnionArray is Dense instead of
> Sparse. After changing the default in
> java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense,
> I get this:
> {code}
> jshell> reader.getVectorSchemaRoot().getSchema()
> $9 ==> Schema [0])<: Int(64, true)>
> {code}
> but then reading doesn't work:
> {code}
> jshell> reader.loadNextBatch()
> |  java.lang.IllegalArgumentException thrown: Could not load buffers for
>    field list: Union(Dense, [1])<: Struct Int(64, true). error message:
>    can not truncate buffer to a larger size 1: 0
> |        at VectorLoader.loadBuffers (VectorLoader.java:83)
> |        at VectorLoader.load (VectorLoader.java:62)
> |        at ArrowReader$1.visit (ArrowReader.java:125)
> |        at ArrowReader$1.visit (ArrowReader.java:111)
> |        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
> |        at ArrowReader.loadNextBatch (ArrowReader.java:137)
> |        at (#8:1)
> {code}
> Any help with this is appreciated!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
Neal Richardson updated ARROW-1692:
    Component/s: Integration
Wes McKinney updated ARROW-1692:
    Fix Version/s: (was: 0.16.0)
                   1.0.0
Neal Richardson updated ARROW-1692:
    Priority: Blocker  (was: Major)
Wes McKinney updated ARROW-1692:
    Fix Version/s: (was: 0.14.0)
                   1.0.0
Antoine Pitrou updated ARROW-1692:
    Component/s: Python
                 Java
Wes McKinney updated ARROW-1692:
    Fix Version/s: (was: 0.13.0)
                   0.14.0
Wes McKinney updated ARROW-1692:
    Fix Version/s: (was: 0.12.0)
                   0.13.0
Uwe L. Korn updated ARROW-1692:
    Fix Version/s: (was: 0.11.0)
                   0.12.0
[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working
[ https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1692: Fix Version/s: (was: 0.10.0) 0.11.0
[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working
[ https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1692: Labels: columnar-format-1.0 (was: )
[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working
[ https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1692: Fix Version/s: (was: 0.9.0) 0.10.0
[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working
[ https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1692: Fix Version/s: 0.9.0
[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working
[ https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philipp Moritz updated ARROW-1692: -- Description: (updated)