[jira] [Updated] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column
     [ https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-3667:
----------------------------------
    Component/s: JavaScript

> [JS] Incorrectly reads record batches with an all null column
> -------------------------------------------------------------
>
>                 Key: ARROW-3667
>                 URL: https://issues.apache.org/jira/browse/ARROW-3667
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: JavaScript
>    Affects Versions: JS-0.3.1
>            Reporter: Brian Hulette
>            Assignee: Paul Taylor
>            Priority: Major
>             Fix For: JS-0.4.1
>
>
> The JS library seems to incorrectly read any columns that come after an
> all-null column in IPC buffers produced by pyarrow.
> Here's a python script that generates two arrow buffers, one with an all-null
> column followed by a utf-8 column, and a second with those two reversed
> {code:python}
> import pyarrow as pa
> import pandas as pd
>
> def serialize_to_arrow(df, fd, compress=True):
>     batch = pa.RecordBatch.from_pandas(df)
>     writer = pa.RecordBatchFileWriter(fd, batch.schema)
>     writer.write_batch(batch)
>     writer.close()
>
> if __name__ == "__main__":
>     df = pd.DataFrame(data={'nulls': [None, None, None],
>                             'not nulls': ['abc', 'def', 'ghi']},
>                       columns=['nulls', 'not nulls'])
>     with open('bad.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
>     df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
>     with open('good.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
> {code}
> JS incorrectly interprets the [null, not null] case:
> {code:javascript}
> > var arrow = require('apache-arrow')
> undefined
> > var fs = require('fs')
> undefined
> > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
> 'abc'
> > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
> '\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
> {code}
> Presumably this is because pyarrow is omitting some (or all) of the buffers
> associated with the all-null column, but the JS IPC reader is still looking
> for them, causing the buffer count to get out of sync.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
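
The hypothesis at the end of the description (a writer that omits the buffers of an all-null column while the reader still counts them) can be illustrated with a small toy model. This is only a sketch: `write_buffers`, `read_buffers`, `UTF8_SLOTS` and the `buffers_per_null_column` parameter are hypothetical names invented for illustration, not pyarrow or apache-arrow APIs, and the labelled strings stand in for real IPC buffers.

```python
# Toy model of sequential buffer consumption. Every name here is hypothetical;
# this is not the pyarrow writer or the apache-arrow JS reader.
UTF8_SLOTS = ("validity", "offsets", "data")


def is_all_null(values):
    return all(v is None for v in values)


def write_buffers(columns):
    """Emit one labelled buffer per slot, skipping all-null columns entirely."""
    buffers = []
    for name, values in columns:
        if is_all_null(values):
            continue  # assumption under test: the writer omits these buffers
        buffers.extend(f"{name}/{slot}" for slot in UTF8_SLOTS)
    return buffers


def read_buffers(columns, buffers, buffers_per_null_column):
    """Assign buffers to columns in order, as a sequential reader would."""
    cursor = 0
    assigned = {}
    for name, values in columns:
        if is_all_null(values):
            # The reader's belief about how many buffers a null column owns.
            cursor += buffers_per_null_column
            assigned[name] = {}
            continue
        taken = buffers[cursor:cursor + len(UTF8_SLOTS)]
        taken += [None] * (len(UTF8_SLOTS) - len(taken))  # pad past the end
        assigned[name] = dict(zip(UTF8_SLOTS, taken))
        cursor += len(UTF8_SLOTS)
    return assigned


bad = [("nulls", [None, None, None]), ("not nulls", ["abc", "def", "ghi"])]
good = list(reversed(bad))

# Reader that expects one buffer for the null column: every later slot shifts.
bad_naive = read_buffers(bad, write_buffers(bad), buffers_per_null_column=1)
# Same reader, null column last: the skipped buffer no longer matters.
good_naive = read_buffers(good, write_buffers(good), buffers_per_null_column=1)
# Reader that agrees with the writer and expects no buffers at all.
bad_fixed = read_buffers(bad, write_buffers(bad), buffers_per_null_column=0)

print(bad_naive["not nulls"])
print(good_naive["not nulls"])
print(bad_fixed["not nulls"])
```

In this model the [null, not null] ordering hands the utf-8 column the wrong buffers, while the reversed ordering reads correctly, which is consistent with the `bad.arrow` / `good.arrow` behaviour reported above. Whether the real reader miscounts in exactly this way is not established by the report.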
[jira] [Updated] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column
     [ https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou updated ARROW-3667:
--------------------------------
    Fix Version/s:     (was: JS-0.5.0)

> [JS] Incorrectly reads record batches with an all null column
> -------------------------------------------------------------
>
>                 Key: ARROW-3667
>                 URL: https://issues.apache.org/jira/browse/ARROW-3667
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: JS-0.3.1
>            Reporter: Brian Hulette
>            Assignee: Paul Taylor
>            Priority: Major
>             Fix For: JS-0.4.1
[jira] [Updated] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column
     [ https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Hulette updated ARROW-3667:
---------------------------------
    Fix Version/s:     (was: JS-0.4.0)
                   JS-0.5.0

> [JS] Incorrectly reads record batches with an all null column
> -------------------------------------------------------------
>
>                 Key: ARROW-3667
>                 URL: https://issues.apache.org/jira/browse/ARROW-3667
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: JS-0.3.1
>            Reporter: Brian Hulette
>            Priority: Major
>             Fix For: JS-0.5.0