Brian Hulette created ARROW-3667:
------------------------------------
Summary: [JS] Incorrectly reads record batches with an all null column
Key: ARROW-3667
URL: https://issues.apache.org/jira/browse/ARROW-3667
Project: Apache Arrow
Issue Type: Bug
Affects Versions: JS-0.3.1
Reporter: Brian Hulette
Fix For: JS-0.4.0
The JS library seems to incorrectly read any columns that come after an
all-null column in IPC buffers produced by pyarrow.
Here's a Python script that generates two Arrow buffers: one with an all-null
column followed by a utf-8 column, and a second with those two columns reversed:
{code:python}
import pyarrow as pa
import pandas as pd


def serialize_to_arrow(df, fd, compress=True):
    # Note: `compress` is accepted but never used.
    batch = pa.RecordBatch.from_pandas(df)
    writer = pa.RecordBatchFileWriter(fd, batch.schema)
    writer.write_batch(batch)
    writer.close()


if __name__ == "__main__":
    # All-null column first, then the utf-8 column: misread by the JS reader.
    df = pd.DataFrame(
        data={'nulls': [None, None, None], 'not nulls': ['abc', 'def', 'ghi']},
        columns=['nulls', 'not nulls'])
    with open('bad.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)

    # Same data with the column order reversed: read correctly.
    df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
    with open('good.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
{code}
JS incorrectly interprets the [null, not null] case:
{code:javascript}
> var arrow = require('apache-arrow')
undefined
> var fs = require('fs')
undefined
> arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
'abc'
> arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
'\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
{code}
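The garbage string returned for bad.arrow does not look random. As a quick check (a sketch, not part of the original report), the snippet below reinterprets its 16 bytes as little-endian int32 values and compares them with the value offsets one would expect for the strings 'abc', 'def', 'ghi'. The variable names are illustrative only.

```python
import struct

# The string the JS reader returned for bad.arrow, copied from the session above.
garbled = ("\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000"
           "\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000")

# Every character is below 0x100, so latin-1 maps each one back to a single byte.
raw = garbled.encode("latin-1")

# Reinterpret the 16 bytes as four little-endian 32-bit integers.
decoded = struct.unpack("<4i", raw)

# Offsets a utf-8 column holding these three strings would need:
# a running total of the encoded byte lengths, starting at 0.
expected = [0]
for value in ["abc", "def", "ghi"]:
    expected.append(expected[-1] + len(value.encode("utf-8")))

print(decoded)
print(expected)
```

The decoded integers match the expected offsets, which suggests the reader handed the column's offsets buffer to the string decoder in place of its value bytes.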
Presumably this is because pyarrow is omitting some (or all) of the buffers
associated with the all-null column, but the JS IPC reader is still looking for
them, causing the buffer count to get out of sync.
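To illustrate the kind of desynchronization described above, here is a toy model. It is not the Arrow IPC format and not the JS reader's actual code: the `read_columns` helper, the buffer labels, and the per-column buffer counts are all hypothetical, chosen only to show how a flat buffer cursor drifts when reader and writer disagree about how many buffers a column owns.

```python
def read_columns(buffers, buffers_per_column):
    """Hand out buffers to columns in order, using a single flat cursor."""
    cursor = 0
    columns = []
    for count in buffers_per_column:
        columns.append(buffers[cursor:cursor + count])
        cursor += count
    return columns


# Hypothetical writer output: nothing for the all-null column,
# then three buffers for the utf-8 column.
written = ["str.validity", "str.offsets", "str.data"]

# A reader that agrees with the writer (0 buffers, then 3) lines up correctly.
in_sync = read_columns(written, [0, 3])

# A reader that still expects one buffer for the all-null column consumes
# the first string buffer too early, so every later column is shifted.
out_of_sync = read_columns(written, [1, 3])

print(in_sync[1])
print(out_of_sync[1])
```

In this model the second reader ends up with the wrong buffers for the utf-8 column, which is the general shape of the failure suspected here. The actual buffer counts on either side would need to be confirmed against the pyarrow writer and the JS reader.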