David Li created ARROW-7254:
-------------------------------

Summary: BaseVariableWidthVector#setSafe appears to make value offsets inconsistent
Key: ARROW-7254
URL: https://issues.apache.org/jira/browse/ARROW-7254
Project: Apache Arrow
Issue Type: Bug
Components: Java
Affects Versions: 0.15.1
Reporter: David Li

The following program writes a file that PyArrow cannot read: 0.14.1 segfaults, and 0.15.1 rejects the file with the error {{pyarrow.lib.ArrowInvalid: Column 0: Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4}}.

Calling {{setRowCount}} again, or calling {{setSafe}} with a higher index, fixes it. While it seems from the new documentation that we should (must?) call {{VectorSchemaRoot#setRowCount}} at the end, I wouldn't have expected to get an invalid file just from using {{setSafe}}, either.

Full traceback:

{noformat}
> python3 -c 'import pyarrow as pa; print(pa.ipc.open_stream(open("./test.bin", "rb")).read_pandas())'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/lidavidm/Flight/arrow-5137-auth/java/venv/lib/python3.7/site-packages/pyarrow/ipc.py", line 46, in read_pandas
    table = self.read_all()
  File "pyarrow/ipc.pxi", line 330, in pyarrow.lib._CRecordBatchReader.read_all
  File "pyarrow/public-api.pxi", line 321, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 0: Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4
{noformat}

Full program:

{code:java}
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class AsdfTest {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema(Collections.singletonList(Field.nullable("a", new ArrowType.Utf8())));
        try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            // Declare two rows, but only write a value into row 0.
            root.setRowCount(2);
            VarCharVector v = (VarCharVector) root.getVector("a");
            v.setSafe(0, "asdf".getBytes(StandardCharsets.UTF_8));

            try (OutputStream output = Files.newOutputStream(Paths.get("./test.bin"))) {
                ArrowStreamWriter writer = new ArrowStreamWriter(root, null, output);
                writer.writeBatch();
                writer.close();
            }
        }
    }
}
{code}

Calling {{v.setNull(1)}} after {{v.setSafe(0, "asdf")}} does not fix it. Using {{set}} instead of {{setSafe}} fails in Java.
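
For reference, the check that PyArrow appears to be failing can be sketched in plain Java with no Arrow dependency. This is only an illustration under assumptions: the offsets {{[0, 4, 0]}} are inferred from the error text ("at: 2 ... 0!=4"), not dumped from the actual file, and the class and method names ({{OffsetInvariantSketch}}, {{firstViolation}}) are hypothetical and not part of any Arrow API.

```java
public class OffsetInvariantSketch {
    /**
     * Returns the index into the offsets array of the first entry that breaks
     * the layout rules checked here, or -1 if the offsets look consistent.
     * offsets has one more entry than there are slots.
     */
    public static int firstViolation(int[] offsets, boolean[] valid) {
        for (int i = 0; i < valid.length; i++) {
            int start = offsets[i];
            int end = offsets[i + 1];
            if (end < start) {
                return i + 1; // offsets must never decrease
            }
            if (!valid[i] && end != start) {
                return i + 1; // assumed rule: a null slot has zero length
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[] valid = {true, false}; // row 0 = "asdf", row 1 = null

        // Assumed contents of the written file: the offset after row 1 was never filled in.
        int[] broken = {0, 4, 0};
        // What a reader would expect instead: the null slot repeats the previous offset.
        int[] expected = {0, 4, 4};

        System.out.println(firstViolation(broken, valid));   // prints 2
        System.out.println(firstViolation(expected, valid)); // prints -1
    }
}
```

If the file really does contain {{[0, 4, 0]}}, that would match the observation that calling {{setRowCount}} again or calling {{setSafe}} with a higher index makes the problem go away, since either could fill in the trailing offsets.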