Brian Hulette created ARROW-3667:
------------------------------------

             Summary: [JS] Incorrectly reads record batches with an all null column
                 Key: ARROW-3667
                 URL: https://issues.apache.org/jira/browse/ARROW-3667
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: JS-0.3.1
            Reporter: Brian Hulette
             Fix For: JS-0.4.0

The JS library seems to incorrectly read any columns that come after an all-null column in IPC buffers produced by pyarrow.

Here's a Python script that generates two Arrow buffers, one with an all-null column followed by a utf-8 column, and a second with those two columns reversed:

{code:python}
import pyarrow as pa
import pandas as pd

# Write a DataFrame to fd as a single record batch in the Arrow file format.
# Note: compress is accepted but not used.
def serialize_to_arrow(df, fd, compress=True):
    batch = pa.RecordBatch.from_pandas(df)
    writer = pa.RecordBatchFileWriter(fd, batch.schema)
    writer.write_batch(batch)
    writer.close()

if __name__ == "__main__":
    # All-null column first, utf-8 column second: JS misreads this one.
    df = pd.DataFrame(data={'nulls': [None, None, None], 'not nulls': ['abc', 'def', 'ghi']}, columns=['nulls', 'not nulls'])
    with open('bad.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
    # Same data with the column order reversed: JS reads this one correctly.
    df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
    with open('good.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
{code}

JS incorrectly interprets the [null, not null] case:

{code:javascript}
> var arrow = require('apache-arrow')
undefined
> var fs = require('fs')
undefined
> arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
'abc'
> arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
'\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
{code}

Presumably this is because pyarrow is omitting some (or all) of the buffers associated with the all-null column, but the JS IPC reader is still looking for them, causing the buffer count to get out of sync.
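
One quick check that is consistent with the out-of-sync guess: the garbled string returned for bad.arrow is 16 characters long, and read as four little-endian int32 values it comes out as 0, 3, 6, 9. Those are the offsets one would expect for 'abc', 'def', 'ghi' in a utf-8 column, so the reader appears to be returning the offsets buffer where the data buffer should be. The sketch below uses only the Python standard library; the helper name decode_as_int32_offsets is made up for illustration, and it relies on every character in the JS output being ASCII so that characters map one-to-one to bytes.

```python
import struct

# String returned by the JS reader for bad.arrow, copied from the session above.
garbled = ('\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000'
           '\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000')


def decode_as_int32_offsets(text):
    """Reinterpret an ASCII-only string as little-endian int32 values.

    Hypothetical helper for this check only; not part of any Arrow API.
    """
    raw = text.encode('ascii')   # all code points here are < 128
    count = len(raw) // 4        # one int32 per 4 bytes
    return struct.unpack('<%di' % count, raw)


# Offsets a utf-8 column holding these values would carry:
# a leading 0, then the running total of the UTF-8 byte lengths.
values = ['abc', 'def', 'ghi']
expected = [0]
for value in values:
    expected.append(expected[-1] + len(value.encode('utf-8')))

offsets = decode_as_int32_offsets(garbled)
print(offsets)
print(expected)
```

This does not show which side miscounts, only that by the time the reader reaches the utf-8 column it is at least one buffer away from where the data actually is.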