[jira] [Commented] (ARROW-16543) [JS] Timestamp types are all the same

2022-06-01 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545241#comment-17545241
 ] 

Paul Taylor commented on ARROW-16543:
-

[~terusus] How did you construct the Timestamp vectors? The semantic meaning of 
the type is "units since the epoch," so it's valid for the different 
timestamp-unit dtypes to have the same underlying representation.
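For reference, a hedged sketch (assuming the v8 {{vectorFromArray}} API) of 
constructing vectors with different timestamp units; the unit lives in the type 
metadata, while the data buffer holds integers counted in that unit since the 
epoch:
{code:javascript}
const { vectorFromArray, TimestampSecond, TimestampMillisecond } = require('apache-arrow');

// the same instant, expressed in each type's own unit since the epoch
const seconds = vectorFromArray([1652118180], new TimestampSecond());
const millis = vectorFromArray([1652118180000], new TimestampMillisecond());

// the unit is carried in the type metadata, not in the stored integers
console.log(seconds.type.unit, millis.type.unit); // differing TimeUnit enum values
{code}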

> [JS] Timestamp types are all the same
> -
>
> Key: ARROW-16543
> URL: https://issues.apache.org/jira/browse/ARROW-16543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Teodor Kostov
>Priority: Major
>
> Currently the timestamp types are all the same: they have the same 
> representation and the same precision.
> For example, {{TimestampSecond}} and {{TimestampMillisecond}} both return the 
> values as {{1652118180000}}. Instead, I would expect {{TimestampSecond}} 
> to drop the 3 zeros when returning a value, e.g. {{1652118180}}. Also, the 
> representation underneath is still an {{int32}} array. Even though for 
> {{TimestampSecond}} every second value is {{0}}, the array still has double 
> the amount of integers.
> I also got an error when trying to read a {{Date}} as {{TimestampNanosecond}} 
> - {{TypeError: can't convert 165211818 to BigInt}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-16705) [JavaScript] TypeError: RecordBatchReader.from(...).toNodeStream is not a function

2022-06-01 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545007#comment-17545007
 ] 

Paul Taylor edited comment on ARROW-16705 at 6/1/22 6:07 PM:
-

[~vic-bonilla] The RecordBatchReader is used to transform the IPC format off 
the wire into RecordBatches. You don't need to use the RecordBatchReader, 
because the Builder already produces RecordBatches (or the Vectors that can go 
inside a RecordBatch).

Instead you can transform the StructVector produced by the Builder into a 
RecordBatch, and go straight to the IPC format with the writer like this:
{code:javascript}
const {Readable, pipeline} = require('node:stream');
const {
  RecordBatch, RecordBatchStreamWriter, Schema, builderThroughAsyncIterable,
} = require('apache-arrow');

// `builderOptions`, `messagesAsyncIterable`, and `fsWriter` are the ones
// defined in your reproduction below.
const messagesToBatches = async function*(source) {
  let schema = undefined;
  const transform = builderThroughAsyncIterable(builderOptions);
  for await (const vector of transform(source)) {
    // derive the Schema once, from the StructVector's child fields
    schema ??= new Schema(vector.type.children);
    // wrap each chunk of the Builder's output in a RecordBatch
    for (const chunk of vector.data) {
      yield new RecordBatch(schema, chunk);
    }
  }
}

pipeline(
  Readable.from(messagesToBatches(messagesAsyncIterable)),
  RecordBatchStreamWriter.throughNode(),
  fsWriter,
  (err) => { if (err) console.error(err); } // pipeline requires a callback
) {code}
 

 


was (Author: paul.e.taylor):
[~vic-bonilla] The RecordBatchReader is used to transform the IPC format off 
the wire into RecordBatches. You don't need to use the RecordBatchReader, 
because the Builder already produces RecordBatches (or the Vectors that can go 
inside a RecordBatch).

Instead you can transform the StructVector produced by the Builder into a 
RecordBatch, and go straight to the IPC format with the writer like this:
{code:javascript}
const {Readable, pipeline} = require('node:stream');
const {RecordBatch, Schema} = require('apache-arrow')

const messagesToBatches = async function*(source) {
  const transform = builderThroughAsyncIterable(builderOptions);
  let schema = undefined;
  for await (const vector of transform(source)) {
    schema ??= new Schema(vector.type.children);
    for (const chunk of vector.data) {
      yield new RecordBatch(schema, chunk);
    }
  }
}

pipeline(
  Readable.from(messagesToBatches(messagesAsyncIterable)),
  RecordBatchStreamWriter.throughNode(),
  fsWriter
) {code}
 

 

> [JavaScript] TypeError: RecordBatchReader.from(...).toNodeStream is not a 
> function
> --
>
> Key: ARROW-16705
> URL: https://issues.apache.org/jira/browse/ARROW-16705
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 8.0.0
> Environment: Nodejs v16.13.0
>Reporter: Victor Bonilla
>Priority: Major
>  Labels: async, ipc, javascript, stream
>
> Trying to code a real-time stream from an async iterable of objects to an IPC 
> Streaming format file, I'm getting a TypeError.
> The idea is to stream every message to the arrow file as soon as it arrives 
> without waiting to build the complete table to stream it. To take advantage 
> of the stream event handling, I'm using the functional approach of 
> [node:stream|https://nodejs.org/docs/latest-v16.x/api/stream.html] module 
> (Nodejs v16.13.0).
> The async iterable contains messages that are JS objects containing different 
> data types, for example:
> {code:javascript}
> {
>     id: '6345',
>     product: 'foo',
>     price: 62.78,
>     created: '2022-05-01T16:01:00.105Z',
> }{code}
> Code to replicate the error:
> {code:javascript}
> const {
>     Struct, Field, Utf8, Float32, TimestampMillisecond,
>     RecordBatchReader, RecordBatchStreamWriter,
>     builderThroughAsyncIterable,
> } = require('apache-arrow')
> const fs = require("fs");
> const path = require("path");
> const {pipeline} = require('node:stream');
> const asyncIterable = {
>     [Symbol.asyncIterator]: async function* () {
>         while (true) {
>             const obj = {
>                 id: Math.floor(Math.random() * 10).toString(),
>                 product: 'foo',
>                 price: Math.random() + Math.floor(Math.random() * 10),
>                 created: new Date(),
>             }
>             yield obj;
>             // insert some asynchrony
>             await new Promise((r) => setTimeout(r, 1000));
>         }
>     }
> }
> async function streamToArrow(messagesAsyncIterable) {
>     const message_type = new Struct([
>         new Field('id', new Utf8, false),
>         new Field('product', new Utf8, false),
>         new Field('price', new Float32, false),
>         new Field('created', new TimestampMillisecond, false),
>     ]);
>     const builderOptions = {
>         type: message_type,
>         nullValues: [null, 'n/a', undefined],
>         highWaterMark: 30,
>         queueingStrategy: 'count',
>     };
>     const transform = 

[jira] [Comment Edited] (ARROW-16705) [JavaScript] TypeError: RecordBatchReader.from(...).toNodeStream is not a function

2022-06-01 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545007#comment-17545007
 ] 

Paul Taylor edited comment on ARROW-16705 at 6/1/22 6:06 PM:
-

[~vic-bonilla] The RecordBatchReader is used to transform the IPC format off 
the wire into RecordBatches. You don't need to use the RecordBatchReader, 
because the Builder already produces RecordBatches (or the Vectors that can go 
inside a RecordBatch).

Instead you can transform the StructVector produced by the Builder into a 
RecordBatch, and go straight to the IPC format with the writer like this:
{code:javascript}
const {Readable, pipeline} = require('node:stream');
const {RecordBatch, Schema} = require('apache-arrow')

const messagesToBatches = async function*(source) {
  const transform = builderThroughAsyncIterable(builderOptions);
  let schema = undefined;
  for await (const vector of transform(source)) {
    schema ??= new Schema(vector.type.children);
    for (const chunk of vector.data) {
      yield new RecordBatch(schema, chunk);
    }
  }
}

pipeline(
  Readable.from(messagesToBatches(messagesAsyncIterable)),
  RecordBatchStreamWriter.throughNode(),
  fsWriter
) {code}
 

 


was (Author: paul.e.taylor):
 

[~vic-bonilla] The RecordBatchReader is used to transform the IPC format off 
the wire into RecordBatches. You don't need to use the RecordBatchReader, 
because the Builder already produces RecordBatches (or the Vectors that can go 
inside a RecordBatch).

Instead you can transform the StructVector produced by the Builder into a 
RecordBatch, and go straight to the IPC format with the writer like this:
{code:javascript}
const {Readable, pipeline} = require('node:stream');
const {RecordBatch, Schema} = require('apache-arrow')

const messagesToBatches = async function*(source) {
  const transform = builderThroughAsyncIterable(builderOptions);
  for await (const vector of transform(source)) {
    const schema = new Schema(vector.type.children);
    for (const chunk of vector.data) {
      yield new RecordBatch(schema, chunk);
    }
  }
}

pipeline(
  Readable.from(messagesToBatches(messagesAsyncIterable)),
  RecordBatchStreamWriter.throughNode(),
  fsWriter
) {code}
 

 

> [JavaScript] TypeError: RecordBatchReader.from(...).toNodeStream is not a 
> function
> --
>
> Key: ARROW-16705
> URL: https://issues.apache.org/jira/browse/ARROW-16705
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 8.0.0
> Environment: Nodejs v16.13.0
>Reporter: Victor Bonilla
>Priority: Major
>  Labels: async, ipc, javascript, stream
>
> Trying to code a real-time stream from an async iterable of objects to an IPC 
> Streaming format file, I'm getting a TypeError.
> The idea is to stream every message to the arrow file as soon as it arrives 
> without waiting to build the complete table to stream it. To take advantage 
> of the stream event handling, I'm using the functional approach of 
> [node:stream|https://nodejs.org/docs/latest-v16.x/api/stream.html] module 
> (Nodejs v16.13.0).
> The async iterable contains messages that are JS objects containing different 
> data types, for example:
> {code:javascript}
> {
>     id: '6345',
>     product: 'foo',
>     price: 62.78,
>     created: '2022-05-01T16:01:00.105Z',
> }{code}
> Code to replicate the error:
> {code:javascript}
> const {
>     Struct, Field, Utf8, Float32, TimestampMillisecond,
>     RecordBatchReader, RecordBatchStreamWriter,
>     builderThroughAsyncIterable,
> } = require('apache-arrow')
> const fs = require("fs");
> const path = require("path");
> const {pipeline} = require('node:stream');
> const asyncIterable = {
>     [Symbol.asyncIterator]: async function* () {
>         while (true) {
>             const obj = {
>                 id: Math.floor(Math.random() * 10).toString(),
>                 product: 'foo',
>                 price: Math.random() + Math.floor(Math.random() * 10),
>                 created: new Date(),
>             }
>             yield obj;
>             // insert some asynchrony
>             await new Promise((r) => setTimeout(r, 1000));
>         }
>     }
> }
> async function streamToArrow(messagesAsyncIterable) {
>     const message_type = new Struct([
>         new Field('id', new Utf8, false),
>         new Field('product', new Utf8, false),
>         new Field('price', new Float32, false),
>         new Field('created', new TimestampMillisecond, false),
>     ]);
>     const builderOptions = {
>         type: message_type,
>         nullValues: [null, 'n/a', undefined],
>         highWaterMark: 30,
>         queueingStrategy: 'count',
>     };
>     const transform = builderThroughAsyncIterable(builderOptions);  
>     let 

[jira] [Assigned] (ARROW-16704) tableFromIPC should handle AsyncRecordBatchReader inputs

2022-06-01 Thread Paul Taylor (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-16704:
---

Assignee: Paul Taylor

> tableFromIPC should handle AsyncRecordBatchReader inputs
> 
>
> Key: ARROW-16704
> URL: https://issues.apache.org/jira/browse/ARROW-16704
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: 8.0.0
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> To match the prior `Table.from()` method, `tableFromIPC()` should handle the 
> case where the input is an async RecordBatchReader.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16705) [JavaScript] TypeError: RecordBatchReader.from(...).toNodeStream is not a function

2022-06-01 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545007#comment-17545007
 ] 

Paul Taylor commented on ARROW-16705:
-

 

[~vic-bonilla] The RecordBatchReader is used to transform the IPC format off 
the wire into RecordBatches. You don't need to use the RecordBatchReader, 
because the Builder already produces RecordBatches (or the Vectors that can go 
inside a RecordBatch).

Instead you can transform the StructVector produced by the Builder into a 
RecordBatch, and go straight to the IPC format with the writer like this:
{code:javascript}
const {Readable, pipeline} = require('node:stream');
const {RecordBatch, Schema} = require('apache-arrow')

const messagesToBatches = async function*(source) {
  const transform = builderThroughAsyncIterable(builderOptions);
  for await (const vector of transform(source)) {
    const schema = new Schema(vector.type.children);
    for (const chunk of vector.data) {
      yield new RecordBatch(schema, chunk);
    }
  }
}

pipeline(
  Readable.from(messagesToBatches(messagesAsyncIterable)),
  RecordBatchStreamWriter.throughNode(),
  fsWriter
) {code}
 

 

> [JavaScript] TypeError: RecordBatchReader.from(...).toNodeStream is not a 
> function
> --
>
> Key: ARROW-16705
> URL: https://issues.apache.org/jira/browse/ARROW-16705
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 8.0.0
> Environment: Nodejs v16.13.0
>Reporter: Victor Bonilla
>Priority: Major
>  Labels: async, ipc, javascript, stream
>
> Trying to code a real-time stream from an async iterable of objects to an IPC 
> Streaming format file, I'm getting a TypeError.
> The idea is to stream every message to the arrow file as soon as it arrives 
> without waiting to build the complete table to stream it. To take advantage 
> of the stream event handling, I'm using the functional approach of 
> [node:stream|https://nodejs.org/docs/latest-v16.x/api/stream.html] module 
> (Nodejs v16.13.0).
> The async iterable contains messages that are JS objects containing different 
> data types, for example:
> {code:javascript}
> {
>     id: '6345',
>     product: 'foo',
>     price: 62.78,
>     created: '2022-05-01T16:01:00.105Z',
> }{code}
> Code to replicate the error:
> {code:javascript}
> const {
>     Struct, Field, Utf8, Float32, TimestampMillisecond,
>     RecordBatchReader, RecordBatchStreamWriter,
>     builderThroughAsyncIterable,
> } = require('apache-arrow')
> const fs = require("fs");
> const path = require("path");
> const {pipeline} = require('node:stream');
> const asyncIterable = {
>     [Symbol.asyncIterator]: async function* () {
>         while (true) {
>             const obj = {
>                 id: Math.floor(Math.random() * 10).toString(),
>                 product: 'foo',
>                 price: Math.random() + Math.floor(Math.random() * 10),
>                 created: new Date(),
>             }
>             yield obj;
>             // insert some asynchrony
>             await new Promise((r) => setTimeout(r, 1000));
>         }
>     }
> }
> async function streamToArrow(messagesAsyncIterable) {
>     const message_type = new Struct([
>         new Field('id', new Utf8, false),
>         new Field('product', new Utf8, false),
>         new Field('price', new Float32, false),
>         new Field('created', new TimestampMillisecond, false),
>     ]);
>     const builderOptions = {
>         type: message_type,
>         nullValues: [null, 'n/a', undefined],
>         highWaterMark: 30,
>         queueingStrategy: 'count',
>     };
>     const transform = builderThroughAsyncIterable(builderOptions);
>     let file_path = './ipc_stream.arrow';
>     const fsWriter = fs.createWriteStream(file_path);
>     pipeline(
>         RecordBatchReader
>             .from(transform(messagesAsyncIterable))
>             .toNodeStream(),  // Throws TypeError: RecordBatchReader.from(...).toNodeStream is not a function
>         RecordBatchStreamWriter.throughNode(),
>         fsWriter,
>         (err, value) => {
>             if (err) {
>                 console.error(err);
>             } else {
>                 console.log(value, 'value returned');
>             }
>         }
>     ).on('close', () => {
>         console.log('Closed pipeline')
>     });
> }
> streamToArrow(asyncIterable){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16704) tableFromIPC should handle AsyncRecordBatchReader inputs

2022-05-31 Thread Paul Taylor (Jira)
Paul Taylor created ARROW-16704:
---

 Summary: tableFromIPC should handle AsyncRecordBatchReader inputs
 Key: ARROW-16704
 URL: https://issues.apache.org/jira/browse/ARROW-16704
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: 8.0.0
Reporter: Paul Taylor


To match the prior `Table.from()` method, `tableFromIPC()` should handle the 
case where the input is an async RecordBatchReader.
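A hedged sketch of the call this should enable (hypothetical until the change 
lands; `byteSource` stands for any async source of IPC bytes):
{code:javascript}
const { tableFromIPC, RecordBatchReader } = require('apache-arrow');

// inside an async context; `byteSource` is a hypothetical async iterable of IPC bytes
const reader = await RecordBatchReader.from(byteSource); // async RecordBatchReader
const table = await tableFromIPC(reader); // should resolve to a Table, as Table.from() did
{code}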



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16371) [JS] Empty table should provide an empty iterator

2022-05-31 Thread Paul Taylor (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-16371:
---

Assignee: Paul Taylor

> [JS] Empty table should provide an empty iterator
> -
>
> Key: ARROW-16371
> URL: https://issues.apache.org/jira/browse/ARROW-16371
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Teodor Kostov
>Assignee: Paul Taylor
>Priority: Minor
>
> When a table is created without any data and an iterator is requested, I would 
> expect an empty iterator that just returns that it's done.
> Expected result:
> {code:json}
> {"value": null, "done": true}
> {code}
> However, the code fails in {{strideForType()}} with {{Uncaught TypeError: 
> type2 is undefined}}.
> {code:javascript}
> const schema = new arrow.Schema(dataType.children)
> const data = new arrow.Table(schema)
> const iter = data[Symbol.iterator]()
> {code}
> It seems that the [table just creates a new vector with its 
> data|https://github.com/apache/arrow/blob/e9481532e93e4f29a1c2c322e00f268d6cd9f534/js/src/table.ts#L227]
>  and then the [{{strideForType}} method 
> fails|https://github.com/apache/arrow/blob/e9481532e93e4f29a1c2c322e00f268d6cd9f534/js/src/type.ts#L652].



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-15642) [Python] [JavaScript] Arrow IPC file output by apache-arrow tableToIPC method cannot be read by pyarrow

2022-04-06 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518297#comment-17518297
 ] 

Paul Taylor commented on ARROW-15642:
-

[~domoritz] the IPC stream format is the more common use case, at least in 
real-time ETL processing. The File format is useful for reading more efficiently 
from disk, but is not suited for inter-process communication.

If a consumer process wanted the advantage of constant-time random batch access 
(like the File format provides), it could buffer the stream until it's finished 
and write the footer itself. However, it is not possible to process an incoming 
Arrow table (in the IPC File format) in batches as they arrive, as the IPC File 
reader blocks until it sees the footer at the end.
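For reference, a hedged sketch of selecting the IPC variant on the JS side 
(`rainfall` is the table from the reproduction below; the second argument to 
`tableToIPC` selects the variant in recent apache-arrow releases):
{code:javascript}
const { tableToIPC } = require('apache-arrow');

// IPC stream format (the default): read in Python with pa.ipc.open_stream()
const streamBytes = tableToIPC(rainfall, 'stream');

// IPC file format (appends the footer): read in Python with pa.ipc.open_file()
const fileBytes = tableToIPC(rainfall, 'file');
{code}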

> [Python] [JavaScript] Arrow IPC file output by apache-arrow tableToIPC method 
> cannot be read by pyarrow
> ---
>
> Key: ARROW-15642
> URL: https://issues.apache.org/jira/browse/ARROW-15642
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript, Python
>Affects Versions: 7.0.0
>Reporter: Dan Coates
>Assignee: Weston Pace
>Priority: Major
>
> IPC files created by the node library `apache-arrow` don't seem to be readable 
> by pyarrow. There is an example of this issue here: 
> [https://github.com/dancoates/pyarrow-jsarrow-test]
>  
> writing the arrow file from js
> {code:javascript}
> import {tableToIPC, tableFromArrays} from 'apache-arrow';
> import fs from 'fs';
> const LENGTH = 2000;
> const rainAmounts = Float32Array.from(
>     { length: LENGTH },
>     () => Number((Math.random() * 20).toFixed(1)));
> const rainDates = Array.from(
>     { length: LENGTH },
>     (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i));
> const rainfall = tableFromArrays({
>     precipitation: rainAmounts,
>     date: rainDates
> });
> const outputTable = tableToIPC(rainfall);
> fs.writeFileSync('jsarrow.arrow', outputTable); {code}
>  
> reading in python
> {code:python}
> import pyarrow as pa
> with open('jsarrow.arrow', 'rb') as f:
>     with pa.ipc.open_file(f) as reader:
>         df = reader.read_pandas()
>         print(df.head())
> {code}
>  
> produces the error:
> {code:java}
> pyarrow.lib.ArrowInvalid: Not an Arrow file {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-15852) [JS] Table getByteLength and indexOf don't work

2022-03-31 Thread Paul Taylor (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-15852:
---

Assignee: Paul Taylor

> [JS] Table getByteLength and indexOf don't work
> ---
>
> Key: ARROW-15852
> URL: https://issues.apache.org/jira/browse/ARROW-15852
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 7.0.0
>Reporter: Timothy Higinbottom
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The functions table.getByteLength() and table.indexOf() don't return the 
> correct values.
> They are bound dynamically to the Table class, in a way I don't fully 
> understand, with the following code:
> [https://github.com/apache/arrow/blob/1b796ec3f9caeb5e86e3348ba940bef8d95915c5/js/src/table.ts#L378-L390]
> The other functions like that, get(), set(), and isValid() all seem to work.  
> However, getByteLength() and indexOf() return the placeholder/sentinel values 
> of 0 and -1 respectively that are defined in the no-op code here: 
> [https://github.com/apache/arrow/blob/1b796ec3f9caeb5e86e3348ba940bef8d95915c5/js/src/table.ts#L207-L221]
> which I assume is to generate the right type definitions, and thus 
> documentation.
> It's fairly simple for a user to implement the right logic themselves (at 
> least for getByteLength) and it's a quick patch to define the functions 
> normally instead of on the prototype, e.g.:
>  
> {code:java}
>     /**
>      * Get the size in bytes of an element by index.
>      * @param index The index at which to get the byteLength.
>      */
>     // @ts-ignore
>     public getByteLength(index: number): number { return 
> this.data[index].byteLength; }
>     /**
>      * Get the size in bytes of a table.
>      */
>     //@ts-ignore
>     public getByteLength(): number { 
>         return this.data.map((batch) => batch.byteLength).reduce((sum, 
> newLength) => sum + newLength);
>     } {code}
> I'd be happy to send this as a PR if that's an OK alternative to the way it's 
> currently implemented. 
> Here's a Github repo of a minimal reproduction of the issue in NodeJS:
> [https://github.com/alexkreidler/apache-arrow-js-small-bug]
>  
> And an observable notebook for in the browser (although I couldn't get ESM 
> working): [https://observablehq.com/@08027ecfa2b2f7bb/arrow-7-canary]
>  
> Thanks to all for your work on Arrow!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15852) [JS] Table getByteLength and indexOf don't work

2022-03-31 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515537#comment-17515537
 ] 

Paul Taylor commented on ARROW-15852:
-

[~timhigins] Thanks for the report. In running your code, I discovered an 
oversight we made in the v7.0 refactor.

That said, I think your {{indexOf()}} call is incorrect – {{indexOf()}} is the 
inverse of {{get()}} such that this should assert true: 
{{table.indexOf(table.get(0)) === 0}}

In your case (looking up the index of a row), you want to pass the entire row 
contents to the {{table.indexOf()}} call like this:
{code:javascript}
const { tableFromArrays } = require('apache-arrow');

const t = tableFromArrays({
  a: [0, 1, 2],
  b: ["foo", "bar", "baz"]
});

console.log(t.getByteLength(0)); // byte length of row 0
console.log(t.getByteLength(1)); // byte length of row 1

console.log(t.indexOf({a: 0, b: "foo"})); // 0
console.log(t.indexOf({a: 1, b: "bar"})); // 1
console.log(t.indexOf({a: 2, b: "baz"})); // 2
{code}

> [JS] Table getByteLength and indexOf don't work
> ---
>
> Key: ARROW-15852
> URL: https://issues.apache.org/jira/browse/ARROW-15852
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 7.0.0
>Reporter: Timothy Higinbottom
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The functions table.getByteLength() and table.indexOf() don't return the 
> correct values.
> They are bound dynamically to the Table class, in a way I don't fully 
> understand, with the following code:
> [https://github.com/apache/arrow/blob/1b796ec3f9caeb5e86e3348ba940bef8d95915c5/js/src/table.ts#L378-L390]
> The other functions like that, get(), set(), and isValid() all seem to work.  
> However, getByteLength() and indexOf() return the placeholder/sentinel values 
> of 0 and -1 respectively that are defined in the no-op code here: 
> [https://github.com/apache/arrow/blob/1b796ec3f9caeb5e86e3348ba940bef8d95915c5/js/src/table.ts#L207-L221]
> which I assume is to generate the right type definitions, and thus 
> documentation.
> It's fairly simple for a user to implement the right logic themselves (at 
> least for getByteLength) and it's a quick patch to define the functions 
> normally instead of on the prototype, e.g.:
>  
> {code:java}
>     /**
>      * Get the size in bytes of an element by index.
>      * @param index The index at which to get the byteLength.
>      */
>     // @ts-ignore
>     public getByteLength(index: number): number { return 
> this.data[index].byteLength; }
>     /**
>      * Get the size in bytes of a table.
>      */
>     //@ts-ignore
>     public getByteLength(): number { 
>         return this.data.map((batch) => batch.byteLength).reduce((sum, 
> newLength) => sum + newLength);
>     } {code}
> I'd be happy to send this as a PR if that's an OK alternative to the way it's 
> currently implemented. 
> Here's a Github repo of a minimal reproduction of the issue in NodeJS:
> [https://github.com/alexkreidler/apache-arrow-js-small-bug]
>  
> And an observable notebook for in the browser (although I couldn't get ESM 
> working): [https://observablehq.com/@08027ecfa2b2f7bb/arrow-7-canary]
>  
> Thanks to all for your work on Arrow!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Issue Comment Deleted] (ARROW-13046) [Release] JS package failing test prior to publish

2021-06-15 Thread Paul Taylor (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor updated ARROW-13046:

Comment: was deleted

(was: [~jorgecarleitao] Looks like the 4.0.1 branch also needs this commit: 
https://github.com/apache/arrow/commit/3a6f6053c74eb698208395091009ac50be9dc29e)

> [Release] JS package failing test prior to publish
> --
>
> Key: ARROW-13046
> URL: https://issues.apache.org/jira/browse/ARROW-13046
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Jorge Leitão
>Priority: Major
>
> While trying to publish the JS, I am getting an error when running the tests 
> (on mac).
> To reproduce, run `dev/release/post-05-js.sh 4.0.1` on branch 
> `release-arrow-4.0.1`:
> {code:java}
> ~/projects/arrow/apache-arrow-4.0.1/js ~/projects/arrow
> yarn install v1.22.1
> [1/5]   Validating package.json...
> [2/5]   Resolving packages...
> [3/5]   Fetching packages...
> info google-closure-compiler-linux@20210406.0.0: The platform "darwin" is 
> incompatible with this module.
> info "google-closure-compiler-linux@20210406.0.0" is an optional dependency 
> and failed compatibility check. Excluding it from installation.
> info google-closure-compiler-windows@20210406.0.0: The platform "darwin" is 
> incompatible with this module.
> info "google-closure-compiler-windows@20210406.0.0" is an optional dependency 
> and failed compatibility check. Excluding it from installation.
> [4/5]   Linking dependencies...
> warning "lerna > @lerna/version > @lerna/github-client > @octokit/rest > 
> @octokit/plugin-request-log@1.0.3" has unmet peer dependency 
> "@octokit/core@>=3".
> [5/5]   Building fresh packages...
> warning Your current version of Yarn is out of date. The latest version is 
> "1.22.5", while you're on "1.22.1".
> info To upgrade, run the following command:
> $ brew upgrade yarn
> ✨  Done in 121.72s.
> yarn run v1.22.1
> $ 
> /Users/jorgecarleitao/projects/arrow/apache-arrow-4.0.1/js/node_modules/.bin/gulp
> [05:39:21] Using gulpfile ~/projects/arrow/apache-arrow-4.0.1/js/gulpfile.js
> [05:39:21] Starting 'default'...
> [05:39:21] Starting 'clean'...
> [05:39:21] Starting 'clean:ts'...
> [05:39:21] Starting 'clean:apache-arrow'...
> [05:39:21] Starting 'clean:es5:cjs'...
> [05:39:21] Starting 'clean:es2015:cjs'...
> [05:39:21] Starting 'clean:esnext:cjs'...
> [05:39:21] Starting 'clean:es5:esm'...
> [05:39:21] Starting 'clean:es2015:esm'...
> [05:39:21] Starting 'clean:esnext:esm'...
> [05:39:21] Starting 'clean:es5:cls'...
> [05:39:21] Starting 'clean:es2015:cls'...
> [05:39:21] Starting 'clean:esnext:cls'...
> [05:39:21] Starting 'clean:es5:umd'...
> [05:39:21] Starting 'clean:es2015:umd'...
> [05:39:21] Starting 'clean:esnext:umd'...
> [05:39:21] Finished 'clean:ts' after 211 ms
> [05:39:21] Finished 'clean:apache-arrow' after 199 ms
> [05:39:21] Finished 'clean:es5:cjs' after 195 ms
> [05:39:21] Finished 'clean:es2015:cjs' after 196 ms
> [05:39:21] Finished 'clean:esnext:cjs' after 190 ms
> [05:39:21] Finished 'clean:es5:esm' after 180 ms
> [05:39:21] Finished 'clean:es2015:esm' after 172 ms
> [05:39:21] Finished 'clean:esnext:esm' after 169 ms
> [05:39:21] Finished 'clean:es5:cls' after 151 ms
> [05:39:21] Finished 'clean:es2015:cls' after 146 ms
> [05:39:22] Finished 'clean:esnext:cls' after 163 ms
> [05:39:22] Finished 'clean:es5:umd' after 149 ms
> [05:39:22] Finished 'clean:es2015:umd' after 146 ms
> [05:39:22] Finished 'clean:esnext:umd' after 142 ms
> [05:39:22] Finished 'clean' after 293 ms
> [05:39:22] Starting 'build'...
> [05:39:22] Starting 'build:ts'...
> [05:39:22] Starting 'build:apache-arrow'...
> [05:39:22] Starting 'build:es5:cjs'...
> [05:39:22] Starting 'clean:ts'...
> [05:39:22] Starting 'clean:es5:cjs'...
> [05:39:22] Finished 'clean:ts' after 728 μs
> [05:39:22] Starting 'compile:ts'...
> [05:39:22] Starting 'build:es2015:umd'...
> [05:39:22] Starting 'build:esnext:cjs'...
> [05:39:22] Starting 'build:esnext:esm'...
> [05:39:22] Starting 'build:esnext:umd'...
> [05:39:22] Finished 'clean:es5:cjs' after 11 ms
> [05:39:22] Starting 'compile:es5:cjs'...
> [05:39:22] Starting 'build:es2015:cls'...
> [05:39:22] Starting 'clean:esnext:cjs'...
> [05:39:22] Starting 'clean:esnext:esm'...
> [05:39:22] Starting 'build:esnext:cls'...
> [05:39:22] Starting 'clean:es2015:cls'...
> [05:39:22] Finished 'clean:esnext:cjs' after 30 ms
> [05:39:22] Starting 'compile:esnext:cjs'...
> [05:39:22] Finished 'clean:esnext:esm' after 28 ms
> [05:39:22] Starting 'compile:esnext:esm'...
> [05:39:22] Starting 'clean:esnext:cls'...
> [05:39:22] Finished 'clean:es2015:cls' after 53 ms
> [05:39:22] Starting 'compile:es2015:cls'...
> [05:39:22] Finished 'clean:esnext:cls' after 43 ms
> [05:39:22] Starting 'compile:esnext:cls'...
> [05:39:23] Finished 

[jira] [Commented] (ARROW-13046) [Release] JS package failing test prior to publish

2021-06-15 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363772#comment-17363772
 ] 

Paul Taylor commented on ARROW-13046:
-

[~jorgecarleitao] Looks like the 4.0.1 branch also needs this commit: 
https://github.com/apache/arrow/commit/3a6f6053c74eb698208395091009ac50be9dc29e

> [Release] JS package failing test prior to publish
> --
>
> Key: ARROW-13046
> URL: https://issues.apache.org/jira/browse/ARROW-13046
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Jorge Leitão
>Priority: Major
>
> While trying to publish the JS, I am getting an error when running the tests 
> (on mac).
> To reproduce, run `dev/release/post-05-js.sh 4.0.1` on branch 
> `release-arrow-4.0.1`:
> {code:java}
> ~/projects/arrow/apache-arrow-4.0.1/js ~/projects/arrow
> yarn install v1.22.1
> [1/5]   Validating package.json...
> [2/5]   Resolving packages...
> [3/5]   Fetching packages...
> info google-closure-compiler-linux@20210406.0.0: The platform "darwin" is 
> incompatible with this module.
> info "google-closure-compiler-linux@20210406.0.0" is an optional dependency 
> and failed compatibility check. Excluding it from installation.
> info google-closure-compiler-windows@20210406.0.0: The platform "darwin" is 
> incompatible with this module.
> info "google-closure-compiler-windows@20210406.0.0" is an optional dependency 
> and failed compatibility check. Excluding it from installation.
> [4/5]   Linking dependencies...
> warning "lerna > @lerna/version > @lerna/github-client > @octokit/rest > 
> @octokit/plugin-request-log@1.0.3" has unmet peer dependency 
> "@octokit/core@>=3".
> [5/5]   Building fresh packages...
> warning Your current version of Yarn is out of date. The latest version is 
> "1.22.5", while you're on "1.22.1".
> info To upgrade, run the following command:
> $ brew upgrade yarn
> ✨  Done in 121.72s.
> yarn run v1.22.1
> $ 
> /Users/jorgecarleitao/projects/arrow/apache-arrow-4.0.1/js/node_modules/.bin/gulp
> [05:39:21] Using gulpfile ~/projects/arrow/apache-arrow-4.0.1/js/gulpfile.js
> [05:39:21] Starting 'default'...
> [05:39:21] Starting 'clean'...
> [05:39:21] Starting 'clean:ts'...
> [05:39:21] Starting 'clean:apache-arrow'...
> [05:39:21] Starting 'clean:es5:cjs'...
> [05:39:21] Starting 'clean:es2015:cjs'...
> [05:39:21] Starting 'clean:esnext:cjs'...
> [05:39:21] Starting 'clean:es5:esm'...
> [05:39:21] Starting 'clean:es2015:esm'...
> [05:39:21] Starting 'clean:esnext:esm'...
> [05:39:21] Starting 'clean:es5:cls'...
> [05:39:21] Starting 'clean:es2015:cls'...
> [05:39:21] Starting 'clean:esnext:cls'...
> [05:39:21] Starting 'clean:es5:umd'...
> [05:39:21] Starting 'clean:es2015:umd'...
> [05:39:21] Starting 'clean:esnext:umd'...
> [05:39:21] Finished 'clean:ts' after 211 ms
> [05:39:21] Finished 'clean:apache-arrow' after 199 ms
> [05:39:21] Finished 'clean:es5:cjs' after 195 ms
> [05:39:21] Finished 'clean:es2015:cjs' after 196 ms
> [05:39:21] Finished 'clean:esnext:cjs' after 190 ms
> [05:39:21] Finished 'clean:es5:esm' after 180 ms
> [05:39:21] Finished 'clean:es2015:esm' after 172 ms
> [05:39:21] Finished 'clean:esnext:esm' after 169 ms
> [05:39:21] Finished 'clean:es5:cls' after 151 ms
> [05:39:21] Finished 'clean:es2015:cls' after 146 ms
> [05:39:22] Finished 'clean:esnext:cls' after 163 ms
> [05:39:22] Finished 'clean:es5:umd' after 149 ms
> [05:39:22] Finished 'clean:es2015:umd' after 146 ms
> [05:39:22] Finished 'clean:esnext:umd' after 142 ms
> [05:39:22] Finished 'clean' after 293 ms
> [05:39:22] Starting 'build'...
> [05:39:22] Starting 'build:ts'...
> [05:39:22] Starting 'build:apache-arrow'...
> [05:39:22] Starting 'build:es5:cjs'...
> [05:39:22] Starting 'clean:ts'...
> [05:39:22] Starting 'clean:es5:cjs'...
> [05:39:22] Finished 'clean:ts' after 728 μs
> [05:39:22] Starting 'compile:ts'...
> [05:39:22] Starting 'build:es2015:umd'...
> [05:39:22] Starting 'build:esnext:cjs'...
> [05:39:22] Starting 'build:esnext:esm'...
> [05:39:22] Starting 'build:esnext:umd'...
> [05:39:22] Finished 'clean:es5:cjs' after 11 ms
> [05:39:22] Starting 'compile:es5:cjs'...
> [05:39:22] Starting 'build:es2015:cls'...
> [05:39:22] Starting 'clean:esnext:cjs'...
> [05:39:22] Starting 'clean:esnext:esm'...
> [05:39:22] Starting 'build:esnext:cls'...
> [05:39:22] Starting 'clean:es2015:cls'...
> [05:39:22] Finished 'clean:esnext:cjs' after 30 ms
> [05:39:22] Starting 'compile:esnext:cjs'...
> [05:39:22] Finished 'clean:esnext:esm' after 28 ms
> [05:39:22] Starting 'compile:esnext:esm'...
> [05:39:22] Starting 'clean:esnext:cls'...
> [05:39:22] Finished 'clean:es2015:cls' after 53 ms
> [05:39:22] Starting 'compile:es2015:cls'...
> [05:39:22] Finished 'clean:esnext:cls' after 43 ms
> [05:39:22] Starting 'compile:esnext:cls'...
> [05:39:23] 

[jira] [Updated] (ARROW-12570) [JS] Fix issues that blocked the v4.0.0 release

2021-04-27 Thread Paul Taylor (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor updated ARROW-12570:

Issue Type: Bug  (was: Improvement)

> [JS] Fix issues that blocked the v4.0.0 release
> ---
>
> Key: ARROW-12570
> URL: https://issues.apache.org/jira/browse/ARROW-12570
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>
> A few issues had to be fixed manually for the v4.0.0 release:
> * ts-jest throwing a type error running the tests on the TS source
> * lerna.json really does need those version numbers
> * npm has introduced rate limits since v3.0.0
> * support npm 2FA one-time-passwords for publish



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12570) [JS] Fix issues that blocked the v4.0.0 release

2021-04-27 Thread Paul Taylor (Jira)
Paul Taylor created ARROW-12570:
---

 Summary: [JS] Fix issues that blocked the v4.0.0 release
 Key: ARROW-12570
 URL: https://issues.apache.org/jira/browse/ARROW-12570
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Paul Taylor
Assignee: Paul Taylor


A few issues had to be fixed manually for the v4.0.0 release:

* ts-jest throwing a type error running the tests on the TS source
* lerna.json really does need those version numbers
* npm has introduced rate limits since v3.0.0
* support npm 2FA one-time-passwords for publish



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12305) [JS] Benchmark test data generate.py assumes python 2

2021-04-08 Thread Paul Taylor (Jira)
Paul Taylor created ARROW-12305:
---

 Summary: [JS] Benchmark test data generate.py assumes python 2
 Key: ARROW-12305
 URL: https://issues.apache.org/jira/browse/ARROW-12305
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Paul Taylor
Assignee: Paul Taylor






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12270) [JS] remove rxjs and ix dependency or make them lighter

2021-04-08 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317417#comment-17317417
 ] 

Paul Taylor commented on ARROW-12270:
-

Rx is used for the build scripts, and Ix is used in the IPC tests.

> [JS] remove rxjs and ix dependency or make them lighter
> ---
>
> Key: ARROW-12270
> URL: https://issues.apache.org/jira/browse/ARROW-12270
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Paul Taylor
>Priority: Minor
>
> We don't use these dependencies extensively so they could be good candidates 
> for being cleaned up to make the dev setup easier to understand for 
> newcomers. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12271) [JS] Run Lerna directly instead of via npx

2021-04-08 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317416#comment-17317416
 ] 

Paul Taylor commented on ARROW-12271:
-

`npx lerna` uses the version of lerna installed in the project's `node_modules` 
bin dir.

> [JS] Run Lerna directly instead of via npx
> --
>
> Key: ARROW-12271
> URL: https://issues.apache.org/jira/browse/ARROW-12271
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Dominik Moritz
>Priority: Minor
>
> Npx can install lerna but a user may not use the right version. Instead, we 
> should call `yarn lerna`. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10794) Typescript Arrowjs Class 'RecordBatch' incorrectly extends base class 'StructVector

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279379#comment-17279379
 ] 

Paul Taylor commented on ARROW-10794:
-

Thanks for reporting, I'll look into submitting a PR to fix these. In the 
meantime, you should be able to set `"skipLibCheck": true` in your tsconfig to 
work around this.
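A minimal sketch of that workaround in tsconfig.json:
{code:json}
{
  "compilerOptions": {
    "skipLibCheck": true
  }
}
{code}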

> Typescript Arrowjs Class 'RecordBatch' incorrectly extends base class 
> 'StructVector
> -
>
> Key: ARROW-10794
> URL: https://issues.apache.org/jira/browse/ARROW-10794
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 2.0.0
>Reporter: vikash
>Priority: Blocker
> Attachments: Screenshot_1.png
>
>
> I am trying to use apache-arrow JS in an Angular project with TypeScript 
> version 4.0.2, and TypeScript fails to compile.
> Steps to reproduce
> -
> 1) install the Angular CLI: npm install -g @angular/cli
> 2) create a new project: ng new my-app
> 3) install apache-arrow: npm install apache-arrow
> 4) add the code below to app.component.ts
> ```
> import { Component } from '@angular/core';
> import { Table } from 'apache-arrow';
> import { readFileSync } from 'fs';
>
> @Component({
>   selector: 'app-root',
>   templateUrl: './app.component.html',
>   styleUrls: ['./app.component.css']
> })
> export class AppComponent {
>   title = 'arrow-typescript';
>   arrow = readFileSync('simple.arrow');
>   table = Table.from([this.arrow]);
> }
> ```
>  
> but when I run npm run build, it fails with the error below:
> Error: node_modules/apache-arrow/recordbatch.d.ts:17:18 - error TS2430: 
> Interface 'RecordBatch' incorrectly extends interface 'StructVector'.
>  The types of 'slice(...).clone' are incompatible between these types.
>  Type '(data: Data>, children?: AbstractVector[] | undefined) 
> => RecordBatch' is not assignable to type ' 
> = Struct>(data: Data, children?: AbstractVector[] | undefined) => 
> VectorType'.
>  Types of parameters 'data' and 'data' are incompatible.
>  Type 'Data' is not assignable to type 'Data>'.
>  Type 'R' is not assignable to type 'Struct'.
>  Property 'dataTypes' is missing in type 'DataType' but required 
> in type 'Struct'.
> 17 export interface RecordBatch  ~~~
> node_modules/apache-arrow/type.d.ts:458:5
>  458 dataTypes: T;
>  ~
>  'dataTypes' is declared here.
> node_modules/apache-arrow/recordbatch.d.ts:24:22 - error TS2415: Class 
> 'RecordBatch' incorrectly extends base class 'StructVector'.
> 24 export declare class RecordBatch  ~~~
> node_modules/apache-arrow/ipc/reader.d.ts:236:5 - error TS2717: Subsequent 
> property declarations must have the same type. Property 'schema' must be of 
> type 'Schema', but here has type 'Schema'.
> 236 schema: Schema;
>  ~~
> node_modules/apache-arrow/ipc/reader.d.ts:189:5
>  189 schema: Schema;
>  ~~
>  'schema' was also declared here.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10450) [Javascript] Table.fromStruct() silently truncates vectors to the first chunk

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279371#comment-17279371
 ] 

Paul Taylor edited comment on ARROW-10450 at 2/5/21, 6:08 AM:
--

Yeah, this is unfortunately a tricky spot with the current Chunked vectors. The 
`.data` getter on Chunked only returns the data field of the first chunk. 
Table.fromStruct() doesn't expect to get a ChunkedVector as input, it expects a 
single-chunk StructVector.

Your `Vector.from({type: myStruct, values: data})` call runs those JS objects 
through the Arrow Struct Builder and serializes them into Arrow vectors of 
binary data.

The `highWaterMark` defaults to 1000 to avoid the case where someone tries to 
serialize lots of data, and the builder has to grow allocations past the 2GB 
limit. Builder internal buffers grow geometrically, so this is relatively easy 
to do with strings.

As you noted, you don't run into this issue when you do `Table.new()` because 
that method expects its input is likely split up across multiple chunks. The 
only downside is now you have a Table of struct of fields, rather than a Table 
of fields.
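A hedged sketch of that workaround, reusing `myStruct` and `data` from the 
report below; with `highWaterMark: Infinity` the Builder flushes a single chunk, 
so Table.fromStruct() sees all 1500 rows:
{code:javascript}
const victor = Vector.from({
  type: myStruct,
  highWaterMark: Infinity, // don't flush a new chunk every 1000 values
  values: data
});
const table = Table.fromStruct(victor);
console.log(table.length); // 1500
{code}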


was (Author: paul.e.taylor):
Yeah, this is unfortunately a tricky spot with the current Chunked vectors. The 
`.data` getter on Chunked only returns the data field of the first chunk. 
Table.fromStruct() doesn't expect to get a ChunkedVector as input, it expects a 
single-chunk StructVector.

Your `Vector.from({type: myStruct, values: data})` call runs those JS objects 
through the Arrow Struct Builder and serialized into binary Arrow vectors.

The `highWaterMark` defaults to 1000 to avoid the case where someone tries to 
serialize lots of data, and the builder has to grow allocations past the 2GB 
limit. Builder internal buffers grow geometrically, so this is relatively easy 
to do with strings.

As you noted, you don't run into this issue when you do `Table.new()` because 
that method expects its input is likely split up across multiple chunks. The 
only downside is now you have a Table of struct of fields, rather than a Table 
of fields.

> [Javascript] Table.fromStruct() silently truncates vectors to the first chunk
> -
>
> Key: ARROW-10450
> URL: https://issues.apache.org/jira/browse/ARROW-10450
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 2.0.0
>Reporter: David Saslawsky
>Priority: Minor
>
> Table.fromStruct() only uses the first chunk from the input vector.
> {code:javascript}
> import { Bool, Field, Int32, Struct, Table, Vector } from "apache-arrow";
> const myStruct = new Struct([
>   Field.new({ name: "over", type: new Int32() }),
>   Field.new({ name: "out", type: new Bool() })
> ]);
> const data = [];
> for (let i = 0; i < 1500; i++) {
>   data.push({ over: i, out: i % 2 === 0 });
> }
> // create a vector with two chunks
> const victor = Vector.from({
>   type: myStruct,
>   /*highWaterMark: Infinity,*/
>   values: data
> });
> console.log(victor.length);  // 1500 
> const table = Table.fromStruct(victor);
> console.log(table.length);   // 1000
> {code}
>  The workaround is to set highWaterMark to Infinity
>  
> Table.new() works as expected
> {code:javascript}
> const int32Array = new Int32Array(1500);
> for (let i = 0; i < 1500; i++) int32Array[i] = i;
> const intVector = Vector.from({ type: new Int32(), values: int32Array });
> console.log(intVector.length);  // 1500
> const intTable = Table.new({ intColumn: intVector });
> console.log(intTable.length);   // 1500
> {code}
>  
> The origin seems to be in Chunked.data() but I don't understand the code 
> enough to propose a fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10450) [Javascript] Table.fromStruct() silently truncates vectors to the first chunk

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279371#comment-17279371
 ] 

Paul Taylor commented on ARROW-10450:
-

Yeah, this is unfortunately a tricky spot with the current Chunked vectors. The 
`.data` getter on Chunked only returns the data field of the first chunk. 
Table.fromStruct() doesn't expect to get a ChunkedVector as input, it expects a 
single-chunk StructVector.

Your `Vector.from({type: myStruct, values: data})` call runs those JS objects 
through the Arrow Struct Builder and serialized into binary Arrow vectors.

The `highWaterMark` defaults to 1000 to avoid the case where someone tries to 
serialize lots of data, and the builder has to grow allocations past the 2GB 
limit. Builder internal buffers grow geometrically, so this is relatively easy 
to do with strings.

As you noted, you don't run into this issue when you do `Table.new()` because 
that method expects its input is likely split up across multiple chunks. The 
only downside is now you have a Table of struct of fields, rather than a Table 
of fields.

> [Javascript] Table.fromStruct() silently truncates vectors to the first chunk
> -
>
> Key: ARROW-10450
> URL: https://issues.apache.org/jira/browse/ARROW-10450
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 2.0.0
>Reporter: David Saslawsky
>Priority: Minor
>
> Table.fromStruct() only uses the first chunk from the input vector.
> {code:javascript}
> import { Bool, Field, Int32, Struct, Table, Vector } from "apache-arrow";
> const myStruct = new Struct([
>   Field.new({ name: "over", type: new Int32() }),
>   Field.new({ name: "out", type: new Bool() })
> ]);
> const data = [];
> for (let i = 0; i < 1500; i++) {
>   data.push({ over: i, out: i % 2 === 0 });
> }
> // create a vector with two chunks
> const victor = Vector.from({
>   type: myStruct,
>   /*highWaterMark: Infinity,*/
>   values: data
> });
> console.log(victor.length);  // 1500 
> const table = Table.fromStruct(victor);
> console.log(table.length);   // 1000
> {code}
>  The workaround is to set highWaterMark to Infinity
>  
> Table.new() works as expected
> {code:javascript}
> const int32Array = new Int32Array(1500);
> for (let i = 0; i < 1500; i++) int32Array[i] = i;
> const intVector = Vector.from({ type: new Int32(), values: int32Array });
> console.log(intVector.length);  // 1500
> const intTable = Table.new({ intColumn: intVector });
> console.log(intTable.length);   // 1500
> {code}
>  
> The origin seems to be in Chunked.data() but I don't understand the code 
> enough to propose a fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10901) [JS] toArray delivers double length arrays in some cases

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279363#comment-17279363
 ] 

Paul Taylor edited comment on ARROW-10901 at 2/5/21, 5:50 AM:
--

toArray() on the Numeric vector types returns a zero-copy TypedArray view over 
the underlying `data` buffer for vectors/single-chunk columns. For multi-chunk 
columns it copies the data from each chunk into a single contiguous buffer.

toArray() is a method to deserialize values from their binary Arrow 
representation to JS values, potentially at the cost of additional 
copy/deserialization. For example, Utf8Vector will return an Array of strings, 
DateVector will return an Array of Dates, etc.

If you want the numeric values of an IntVector, you can use the `.values` 
getter directly. This returns the underlying Vector's binary data as a JS typed 
array of the appropriate byte-width, excepting the 64-bit cases.

Not every environment implements the `BigInt64Array` and `BigUint64Array`. 
Since we want to support those environments, we've opted to return the 32-bit 
variants of the 64-bit Vector types.

If you're targeting only environments with BigInts, the `Int64Vector` and 
`Uint64Vector` have additional `.values64` getters that return `BigInt64Array` 
or `BigUint64Array` respectively. These getters will throw an error if called 
in an environment without BigInts.
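Summarized as a hedged sketch (`i64Vec` stands for any hypothetical Int64Vector):
{code:javascript}
i64Vec.toArray();  // deserializes to JS values (copies across chunks if needed)
i64Vec.values;     // zero-copy Int32Array view: two 32-bit halves per element
i64Vec.values64;   // BigInt64Array view; throws where BigInt64Array is unavailable
{code}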


was (Author: paul.e.taylor):
toArray() on the Numeric vector types returns a zero-copy TypedArray view over 
the underlying `data` buffer for vectors/single-chunk columns. For multi-chunk 
columns it copies the data from each chunk into a single contiguous buffer.

toArray() is a method to deserialize values from their binary Arrow 
representation to JS values. For example, Utf8Vector will return an Array of 
strings, DateVector will return an Array of Dates, etc.

If you want the numeric values of an IntVector, you can use the `.values` 
getter directly. This returns the underlying Vector's binary data as a JS typed 
array of the appropriate byte-width, excepting the 64-bit cases.

Not every environment implements the `BigInt64Array` and `BigUint64Array`. 
Since we want to support those environments, we've opted to return the 32-bit 
variants of the 64-bit Vector types.

If you're targeting only environments with BigInts, the `Int64Vector` and 
`Uint64Vector` have additional `.values64` getters that return `BigInt64Array` 
or `BigUint64Array` respectively. These getters will throw an error if called 
in an environment without BigInts.

> [JS] toArray delivers double length arrays in some cases
> 
>
> Key: ARROW-10901
> URL: https://issues.apache.org/jira/browse/ARROW-10901
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 2.0.0
>Reporter: roland
>Priority: Major
> Attachments: Screen Shot 2020-12-14 at 3.34.24 PM.png, Screen Shot 
> 2020-12-14 at 3.38.54 PM.png
>
>
> When calling `toArray` on a column, one would expect that a column of length 
> 10 would give back an array of length 10. Instead, it sometimes gives back 
> an array of length 20.
> I think this is the case for elements where the type is something like Int64, 
> where it's not guaranteed JS will actually fit the number into a Float 
> (which iirc is not exactly 64 bits of integer precision). 
> At the same time, if I call `toArray`, I would expect the numbers to stay the 
> same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10901) [JS] toArray delivers double length arrays in some cases

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279363#comment-17279363
 ] 

Paul Taylor commented on ARROW-10901:
-

toArray() on the Numeric vector types returns a zero-copy TypedArray view over 
the underlying `data` buffer for vectors/single-chunk columns. For multi-chunk 
columns it copies the data from each chunk into a single contiguous buffer.

toArray() is a method to deserialize values from their binary Arrow 
representation to JS values. For example, Utf8Vector will return an Array of 
strings, DateVector will return an Array of Dates, etc.

If you want the numeric values of an IntVector, you can use the `.values` 
getter directly. This returns the underlying Vector's binary data as a JS typed 
array of the appropriate byte-width, excepting the 64-bit cases.

Not every environment implements the `BigInt64Array` and `BigUint64Array`. 
Since we want to support those environments, we've opted to return the 32-bit 
variants of the 64-bit Vector types.

If you're targeting only environments with BigInts, the `Int64Vector` and 
`Uint64Vector` have additional `.values64` getters that return `BigInt64Array` 
or `BigUint64Array` respectively. These getters will throw an error if called 
in an environment without BigInts.

> [JS] toArray delivers double length arrays in some cases
> 
>
> Key: ARROW-10901
> URL: https://issues.apache.org/jira/browse/ARROW-10901
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 2.0.0
>Reporter: roland
>Priority: Major
> Attachments: Screen Shot 2020-12-14 at 3.34.24 PM.png, Screen Shot 
> 2020-12-14 at 3.38.54 PM.png
>
>
> When calling `toArray` on a column, one would expect that a column of length 
> 10 would give back an array of length 10. Instead, it sometimes gives back 
> an array of length 20.
> I think this is the case for elements where the type is something like Int64, 
> where it's not guaranteed JS will actually fit the number into a Float 
> (which iirc is not exactly 64 bits of integer precision). 
> At the same time, if I call `toArray`, I would expect the numbers to stay the 
> same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11326) utf8 vector buffers don't work if allocated within Web Assembly memory of Node.js

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279350#comment-17279350
 ] 

Paul Taylor commented on ARROW-11326:
-

This sounds like a bug in node's Buffer class?

We use the Buffer to utf8 encode and decode in node, because it was (at the 
time of authorship) dramatically faster than TextEncoder/TextDecoder.
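A minimal sketch of the two decoding paths mentioned above; both turn UTF-8 
bytes into a JS string, Buffer being the faster one in node at the time:
{code:javascript}
const bytes = Uint8Array.of(0x41); // 'A'
const viaBuffer = Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength).toString('utf8');
const viaDecoder = new TextDecoder('utf-8').decode(bytes);
console.log(viaBuffer === viaDecoder); // true
{code}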

> utf8 vector buffers don't work if allocated within Web Assembly memory of 
> Node.js
> -
>
> Key: ARROW-11326
> URL: https://issues.apache.org/jira/browse/ARROW-11326
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
> Environment: node.js in Mac book pro
>Reporter: Dmitri Bronnikov
>Priority: Major
>
> After making an int32array of offsets = [0, 1] and a uint8array of values = 
> [ascii_code('A')], create a vector of strings:
> const vec = arrow.Vector.new(arrow.Data.new(new Utf8(), 0, 1, 0, [offsets, 
> values, null, null]))
> then access the first and only element:
> console.log(vec.get(0))
> Works within browsers. Works in node.js with fixed size types, e.g. float or 
> integer.
> Fails in Node.js (v14.11.0.) with this callstack 
> at ../../node_modules/@apache-arrow/es2015-umd/buffer/index.js:311:1
>     at __proto__ 
> (../../node_modules/@apache-arrow/es2015-umd/buffer/index.js:167:1)
>     at Function._Buffer [as from] 
> (../../node_modules/@apache-arrow/es2015-umd/buffer/index.js:154:1)
>     at prototype 
> (../../node_modules/@apache-arrow/es2015-umd/util/utf8.ts:43:31)
>     at partial2 
> (../../node_modules/@apache-arrow/es2015-umd/visitor/get.ts:293:12)
>     at go.isArray [as get] 
> (../../node_modules/@apache-arrow/es2015-umd/vector/index.ts:175:43)
>     at Sr.get (../../node_modules/@apache-arrow/es2015-umd/util/args.ts:27:7)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279348#comment-17279348
 ] 

Paul Taylor edited comment on ARROW-11347 at 2/5/21, 5:34 AM:
--

[~domoritz] see my comment here: 
https://issues.apache.org/jira/browse/ARROW-11351?focusedCommentId=17279344&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17279344

tl;dr: the Row API doesn't use JS's Map; the abstract Row base class just 
implements the Map interface. The actual lookup is delegated to its concrete 
subclass implementations, StructRow and MapRow. StructRow still uses the 
flyweight pattern, and MapRow attempts a different optimization via Proxies if 
available.


was (Author: paul.e.taylor):
[~domoritz] see my comment here: 
https://issues.apache.org/jira/browse/ARROW-11351?focusedCommentId=17279344&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17279344

tl;dr: the Row API doesn't use JS's Map; the abstract Row base class just 
implements the Map interface. The actual lookup is delegated to its concrete 
subclass implementations, StructRow and MapRow. StructRow still uses the 
flyweight, and MapRow attempts a similar optimization via Proxies if available.

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279348#comment-17279348
 ] 

Paul Taylor edited comment on ARROW-11347 at 2/5/21, 5:33 AM:
--

[~domoritz] see my comment here: 
https://issues.apache.org/jira/browse/ARROW-11351?focusedCommentId=17279344&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17279344

tl;dr: the Row API doesn't use JS's Map; the abstract Row base class just 
implements the Map interface. The actual lookup is delegated to its concrete 
subclass implementations, StructRow and MapRow. StructRow still uses the 
flyweight, and MapRow attempts a similar optimization via Proxies if available.


was (Author: paul.e.taylor):
[~domoritz] see my comment here: 
https://issues.apache.org/jira/browse/ARROW-11351?focusedCommentId=17279344&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17279344

tl;dr: the Row API doesn't use JS's Map; the abstract Row base class just 
implements the Map interface. The actual lookup is delegated to its concrete 
subclass implementations, StructRow and MapRow. StructRow still uses the 
flyweight, and MapRow attempts a similar optimization via Proxies if available.

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279348#comment-17279348
 ] 

Paul Taylor commented on ARROW-11347:
-

[~domoritz] see my comment here: 
https://issues.apache.org/jira/browse/ARROW-11351?focusedCommentId=17279344&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17279344

tl;dr: the Row API doesn't use JS's Map; the abstract Row base class just 
implements the Map interface. The actual lookup is delegated to its concrete 
subclass implementations, StructRow and MapRow. StructRow still uses the 
flyweight, and MapRow attempts a similar optimization via Proxies if available.

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11351) Reconsider proxy objects instead of defineProperty

2021-02-04 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279344#comment-17279344
 ] 

Paul Taylor commented on ARROW-11351:
-

[~domoritz] The Map/Proxy approach was introduced in 
https://github.com/apache/arrow/pull/5371.

The issue is that the Arrow MapVector is a list of objects whose fields can 
vary per row. Because of this, we can't take advantage of the 
Object.defineProperties() flyweight-pattern optimization the way we can with 
StructVector.

But all is not lost! The Proxy approach allows us to at least defer the cost of 
deserializing each MapRow's key to lookup time, so creation is relatively fast 
(compared to deserializing each MapRow into an Object), and you only pay 
deserialization cost for fields you actually access.

This shouldn't affect the StructVector, as it still uses the 
Object.defineProperties() flyweight pattern.
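
A minimal illustrative sketch of the two strategies; the helper names here 
are hypothetical, not Arrow APIs:
{code:javascript}
// Flyweight: with a fixed schema, define one getter per field on a single
// shared prototype up front; each row object then only carries its row index.
function makeStructRowProto(children, fields) {
  const proto = {};
  fields.forEach((field, i) => {
    Object.defineProperty(proto, field.name, {
      get() { return children[i].get(this.rowIndex); }
    });
  });
  return proto;
}

// Proxy: keys can vary per row, so key lookup (and value deserialization)
// is deferred until a property is actually accessed.
function makeMapRow(keys, values) {
  return new Proxy({}, {
    get(_, key) {
      for (let i = 0; i < keys.length; ++i) {
        if (String(keys.get(i)) === key) { return values.get(i); }
      }
      return undefined;
    }
  });
}
{code}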

> Reconsider proxy objects instead of defineProperty
> --
>
> Key: ARROW-11351
> URL: https://issues.apache.org/jira/browse/ARROW-11351
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>
> I was wondering why Arrow uses Proxy objects instead of defineProperty, which 
> was a bit faster in the experiments at 
> https://observablehq.com/@jheer/from-apache-arrow-to-javascript-objects. I 
> don't know whether a change makes sense but I would love to know the design 
> rationale since I couldn't find anything in the issues or on GitHub about it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking

2021-02-02 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277659#comment-17277659
 ] 

Paul Taylor edited comment on ARROW-10255 at 2/3/21, 3:44 AM:
--

[~bhulette] I vote no on the current PR for 4 reasons: 
# Arrow releases have moved to major-version revs only, so npm won't upgrade 
libs/people by default
# The [very minor 
changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507]
 we had to make to our own tests make me think this shouldn't be a huge pain 
to upgrade for users (not to mention Field and Schema aren't the most common 
APIs to interact with directly)
# These changes are needed by [vis.gl|https://github.com/visgl] to import 
ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're 
reimplementing the bits of Schema/Field/DataType they need because importing 
ours adds ~250k (~24k minified) to their bundle, which is over their size 
budget.
# Adding deprecation warnings will add to the size of the lib, likely in ways 
that can't be tree-shaken. In Python size on-disk isn't an issue, so people 
add deprecation warnings all the time, but without extensive tooling support 
it's difficult to do (or to guide users through) in JS.



was (Author: paul.e.taylor):
[~bhulette] I vote no on the current PR for 3 reasons: 
# Arrow releases have moved to major-version revs only, so npm won't upgrade 
libs/people by default
# The [very minor 
changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507]
 we had to make to our own tests make me think this shouldn't be a huge pain 
to upgrade for users (not to mention Field and Schema aren't the most common 
APIs to interact with directly)
# These changes are needed by [vis.gl|https://github.com/visgl] to import 
ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're 
reimplementing the bits of Schema/Field/DataType they need because importing 
ours adds ~250k (~24k minified) to their bundle, which is over their size 
budget.


> [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
> ---
>
> Key: ARROW-10255
> URL: https://issues.apache.org/jira/browse/ARROW-10255
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: 0.17.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Presently most of our public classes can't be easily 
> [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library 
> consumers. This is a problem for libraries that only need to use parts of 
> Arrow.
> For example, the vis.gl projects have an integration test that imports three 
> of our simpler classes and tests the resulting bundle size:
> {code:javascript}
> import {Schema, Field, Float32} from 'apache-arrow';
> // | Bundle Size| Compressed 
> // | 202KB (207112) KB  | 45KB (46618) KB
> {code}
> We can help solve this with the following changes:
> * Add "sideEffects": false to our ESM package.json
> * Reorganize our imports to only include what's needed
> * Eliminate or move some static/member methods to standalone exported 
> functions
> * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't 
> compile in its own Buffer shim
> * Remove flatbuffers namespaces from generated TS because these defeat 
> Webpack's tree-shaking ability
> Candidate functions for removal/moving to standalone functions:
> * Schema.new, Schema.from, Schema.prototype.compareTo
> * Field.prototype.compareTo
> * Type.prototype.compareTo
> * Table.new, Table.from
> * Column.new
> * Vector.new, Vector.from
> * RecordBatchReader.from
> After applying a few of the above changes to the Schema and flatbuffers 
> files, I was able to reduce vis.gl's import size by 90%:
> {code:javascript}
> // Bundle Size  | Compressed
> // 24KB (24942) KB  | 6KB (6154) KB
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking

2021-02-02 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277659#comment-17277659
 ] 

Paul Taylor commented on ARROW-10255:
-

[~bhulette] I vote no on the current PR for 3 reasons: 
# Arrow releases have moved to major-version revs only, so npm won't upgrade 
libs/people by default
# The [very minor 
changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507]
 we had to make to our own tests make me think this shouldn't be a huge pain 
to upgrade for users (not to mention Field and Schema aren't the most common 
APIs to interact with directly)
# These changes are needed by [vis.gl|https://github.com/visgl] to import 
ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're 
reimplementing the bits of Schema/Field/DataType they need because importing 
ours adds ~250k (~24k minified) to their bundle, which is over their size 
budget.
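
For context, a hedged sketch of the "move static/member methods to standalone 
exported functions" idea from the issue (the compareSchemas name is 
hypothetical, not an Arrow API); bundlers can drop an unused named export, but 
not an unused method on a class that is exported as a whole:
{code:javascript}
// Before: compareTo ships with the Schema class even when never called,
// because tree-shaking operates on exports, not on class members.
export class Schema {
  compareTo(other) { /* deep comparison logic */ }
}

// After: an unused named export can be dropped by the bundler entirely.
export class Schema { /* data members only */ }
export function compareSchemas(a, b) { /* deep comparison logic */ }
{code}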


> [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
> ---
>
> Key: ARROW-10255
> URL: https://issues.apache.org/jira/browse/ARROW-10255
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: 0.17.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Presently most of our public classes can't be easily 
> [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library 
> consumers. This is a problem for libraries that only need to use parts of 
> Arrow.
> For example, the vis.gl projects have an integration test that imports three 
> of our simpler classes and tests the resulting bundle size:
> {code:javascript}
> import {Schema, Field, Float32} from 'apache-arrow';
> // | Bundle Size| Compressed 
> // | 202KB (207112) KB  | 45KB (46618) KB
> {code}
> We can help solve this with the following changes:
> * Add "sideEffects": false to our ESM package.json
> * Reorganize our imports to only include what's needed
> * Eliminate or move some static/member methods to standalone exported 
> functions
> * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't 
> compile in its own Buffer shim
> * Remove flatbuffers namespaces from generated TS because these defeat 
> Webpack's tree-shaking ability
> Candidate functions for removal/moving to standalone functions:
> * Schema.new, Schema.from, Schema.prototype.compareTo
> * Field.prototype.compareTo
> * Type.prototype.compareTo
> * Table.new, Table.from
> * Column.new
> * Vector.new, Vector.from
> * RecordBatchReader.from
> After applying a few of the above changes to the Schema and flatbuffers 
> files, I was able to reduce vis.gl's import size by 90%:
> {code:javascript}
> // Bundle Size  | Compressed
> // 24KB (24942) KB  | 6KB (6154) KB
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking

2020-10-09 Thread Paul Taylor (Jira)
Paul Taylor created ARROW-10255:
---

 Summary: [JS] Reorganize imports and exports to be more friendly 
to ESM tree-shaking
 Key: ARROW-10255
 URL: https://issues.apache.org/jira/browse/ARROW-10255
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: 0.17.1
Reporter: Paul Taylor
Assignee: Paul Taylor


Presently most of our public classes can't be easily 
[tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library consumers. 
This is a problem for libraries that only need to use parts of Arrow.

For example, the vis.gl projects have an integration test that imports three of 
our simpler classes and tests the resulting bundle size:

{code:javascript}
import {Schema, Field, Float32} from 'apache-arrow';

// | Bundle Size| Compressed 
// | 202KB (207112) KB  | 45KB (46618) KB
{code}

We can help solve this with the following changes:
* Add "sideEffects": false to our ESM package.json
* Reorganize our imports to only include what's needed
* Eliminate or move some static/member methods to standalone exported functions
* Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't compile 
in its own Buffer shim
* Remove flatbuffers namespaces from generated TS because these defeat 
Webpack's tree-shaking ability

Candidate functions for removal/moving to standalone functions:
* Schema.new, Schema.from, Schema.prototype.compareTo
* Field.prototype.compareTo
* Type.prototype.compareTo
* Table.new, Table.from
* Column.new
* Vector.new, Vector.from
* RecordBatchReader.from

After applying a few of the above changes to the Schema and flatbuffers files, 
I was able to reduce vis.gl's import size by 90%:
{code:javascript}
// Bundle Size  | Compressed
// 24KB (24942) KB  | 6KB (6154) KB
{code}
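
For illustration, the first bullet amounts to a one-line package.json change; 
the other fields in this sketch are placeholders:
{code:javascript}
{
  "name": "@apache-arrow/es2015-esm",
  "module": "Arrow.mjs",
  "sideEffects": false
}
{code}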



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9982) [JS] IterableArrayLike should support map

2020-09-26 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202664#comment-17202664
 ] 

Paul Taylor edited comment on ARROW-9982 at 9/26/20, 7:44 PM:
--

{code:javascript}¯\_(ツ)_/¯{code}

We're getting complaints about how large and un-tree-shakeable the library is 
already. I don't see a reason to add more unrelated functionality, especially 
when it already exists in other tree-shakeable libraries like Ix.

Otherwise, you can do `for...of` over any iterable type already, or have your 
own map implementation (no need to use Ix for something so simple)


{code:javascript}
function* map(source, project) { for (let x of source) yield project(x);  }

for (let value of map(vector, (x) => x + 1)) {
  console.log(value);
}
{code}



was (Author: paul.e.taylor):
¯\_(ツ)_/¯

We're getting complaints about how large and un-tree-shakeable the library is 
already. I don't see a reason to add more unrelated functionality, especially 
when it already exists in other tree-shakeable libraries like Ix.

Otherwise, you can do `for...of` over any iterable type already, or have your 
own map implementation (no need to use Ix for something so simple)


{code:javascript}
function* map(source, project) { for (let x of source) yield project(x);  }

for (let value of map(vector, (x) => x + 1)) {
  console.log(value);
}
{code}


> [JS] IterableArrayLike should support map
> -
>
> Key: ARROW-9982
> URL: https://issues.apache.org/jira/browse/ARROW-9982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Minor
>
> `table.toArray()` returns an `IterableArrayLike` and I would like to be able 
> to `map` a function to it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9982) [JS] IterableArrayLike should support map

2020-09-26 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202664#comment-17202664
 ] 

Paul Taylor commented on ARROW-9982:


¯\_(ツ)_/¯

We're getting complaints about how large and un-tree-shakeable the library is 
already. I don't see a reason to add more unrelated functionality, especially 
when it already exists in other tree-shakeable libraries like Ix.

Otherwise, you can do `for...of` over any iterable type already, or have your 
own map implementation (no need to use Ix for something so simple)


{code:javascript}
function* map(source, project) { for (let x of source) yield project(x);  }

map(vector, (value) => console.log(value));
{code}


> [JS] IterableArrayLike should support map
> -
>
> Key: ARROW-9982
> URL: https://issues.apache.org/jira/browse/ARROW-9982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Minor
>
> `table.toArray()` returns an `IterableArrayLike` and I would like to be able 
> to `map` a function to it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9982) [JS] IterableArrayLike should support map

2020-09-26 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202664#comment-17202664
 ] 

Paul Taylor edited comment on ARROW-9982 at 9/26/20, 7:42 PM:
--

¯\_(ツ)_/¯

We're getting complaints about how large and un-tree-shakeable the library is 
already. I don't see a reason to add more unrelated functionality, especially 
when it already exists in other tree-shakeable libraries like Ix.

Otherwise, you can do `for...of` over any iterable type already, or have your 
own map implementation (no need to use Ix for something so simple)


{code:javascript}
function* map(source, project) { for (let x of source) yield project(x);  }

for (let value of map(vector, (x) => x + 1)) {
  console.log(value);
}
{code}



was (Author: paul.e.taylor):
¯\_(ツ)_/¯

We're getting complaints about how large and un-tree-shakeable the library is 
already. I don't see a reason to add more unrelated functionality, especially 
when it already exists in other tree-shakeable libraries like Ix.

Otherwise, you can do `for...of` over any iterable type already, or have your 
own map implementation (no need to use Ix for something so simple)


{code:javascript}
function* map(source, project) { for (let x of source) yield project(x);  }

map(vector, (value) => console.log(value));
{code}


> [JS] IterableArrayLike should support map
> -
>
> Key: ARROW-9982
> URL: https://issues.apache.org/jira/browse/ARROW-9982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Minor
>
> `table.toArray()` returns an `IterableArrayLike` and I would like to be able 
> to `map` a function to it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9982) IterableArrayLike should support map

2020-09-19 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198830#comment-17198830
 ] 

Paul Taylor commented on ARROW-9982:


Why not use [IxJS|https://github.com/ReactiveX/IxJS#iterable] or another 
similar library?


{code:javascript}
import { from } from 'ix/iterable';
import { map } from 'ix/iterable/operators';
from(arrowVec)
  .pipe(map((x) => x + 1))
  .forEach(console.log.bind(console))
{code}


> IterableArrayLike should support map
> 
>
> Key: ARROW-9982
> URL: https://issues.apache.org/jira/browse/ARROW-9982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Minor
>
> `table.toArray()` returns an `IterableArrayLike` and I would like to be able 
> to `map` a function to it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9860) [JS] Arrow Flight JavaScript Client or Example

2020-09-19 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198828#comment-17198828
 ] 

Paul Taylor commented on ARROW-9860:


Arrow JS does not support buffer-level compression, and possibly shouldn't 
ever, due to the significant drawbacks of adding JS- or WASM-based compression 
implementations to the browser bundles, such as a perf hit in the 
readers/writers, a significant addition to the library size, etc.

The only widely/natively supported deflate implementation in browsers is gzip 
(and to a lesser extent, brotli), but that is applied in the browser's 
networking stack at the transport level, so compression has to be applied to 
the entire payload.
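
A minimal Node.js sketch of that whole-payload approach (the file name and 
port are placeholders): gzip the entire Arrow IPC stream and let the browser's 
networking stack inflate it transparently via Content-Encoding:
{code:javascript}
const fs = require('fs');
const http = require('http');
const zlib = require('zlib');

http.createServer((req, res) => {
  res.setHeader('Content-Type', 'application/vnd.apache.arrow.stream');
  res.setHeader('Content-Encoding', 'gzip');
  // Compress the whole payload; the browser decompresses it before Arrow
  // (or fetch/XHR) ever sees the bytes.
  fs.createReadStream('table.arrow').pipe(zlib.createGzip()).pipe(res);
}).listen(8080);
{code}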

> [JS] Arrow Flight JavaScript Client or Example
> --
>
> Key: ARROW-9860
> URL: https://issues.apache.org/jira/browse/ARROW-9860
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: JavaScript, Python
>Reporter: Alex Monahan
>Priority: Major
>
> Is it possible to use Apache Arrow Flight to send data from a Python Web 
> Server to a JavaScript browser client? If it is possible, is there a code 
> example to use to get started? 
>  
> If this is not possible, what is the fastest way to send data from a Python 
> Web Server to Apache Arrow in the browser today? Would it be faster to send a 
> Parquet file and unpack it client-side, or send Arrow directly/with gzip/ 
> etc.?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9865) How to Update a Dictionary - Examples needed

2020-09-19 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198826#comment-17198826
 ] 

Paul Taylor commented on ARROW-9865:


Arrow Vectors are immutable; it's not possible to append new elements once 
they've been constructed. You can, however, construct a new Arrow Vector with 
additional elements via the Builder classes (or the related high-level 
`Vector.from()` convenience methods).
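
A hedged sketch of the copy-and-rebuild approach, assuming the 
`Vector.from({ type, values })` form from the 1.x/2.x releases (exact 
signatures vary across Arrow JS versions):
{code:javascript}
const { Vector, Utf8 } = require('apache-arrow');

const oldVec = Vector.from({ type: new Utf8(), values: ['a', 'b'] });
// Vectors are iterable, so spread the old values, append the new one,
// and build a fresh (immutable) vector from the combined list.
const newVec = Vector.from({ type: new Utf8(), values: [...oldVec, 'c'] });
{code}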

> How to Update a Dictionary - Examples needed
> 
>
> Key: ARROW-9865
> URL: https://issues.apache.org/jira/browse/ARROW-9865
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript, Python
>Reporter: Alex Monahan
>Priority: Major
>
> How can I do an in-place update on a specific value in a dictionary encoded 
> column in an Arrow Table?
>  
> I have searched all issues and all of the documentation that I can find, and 
> I still can't figure out how to add a new item to a dictionary. The new item 
> I would like to add is different than any previously added value, so it is 
> not already in the dictionary keys. Is this possible? 
> I am working in the JS client, but a Python example may be enough to point me 
> in the right direction.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-17 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198118#comment-17198118
 ] 

Paul Taylor commented on ARROW-8394:


[~pprice] [~timconkling] [~Costa] PR is up @ 
https://github.com/apache/arrow/pull/8216

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-17 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195668#comment-17195668
 ] 

Paul Taylor edited comment on ARROW-8394 at 9/18/20, 5:30 AM:
--

I've started work on a branch in my fork here[1], but have been occupied the 
last few weeks (work, moving, back injury, etc.). There's not much left to do, 
so I think I should be able to get it finished and PR'd this week.

1. https://github.com/trxcllnt/arrow/tree/fix/typescript-3.9-errors


was (Author: paul.e.taylor):
I've started work on a branch in my fork here[1], but have been occupied the 
last few weeks (work, moving, back injury, etc.). There's not much left to do, 
so I think I should be able to get it finished and PR'd this week.

1. https://github.com/trxcllnt/arrow/tree/typescript-3.9

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-14 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195668#comment-17195668
 ] 

Paul Taylor commented on ARROW-8394:


I've started work on a branch in my fork here[1], but have been occupied the 
last few weeks (work, moving, back injury, etc.). There's not much left to do, 
so I think I should be able to get it finished and PR'd this week.

1. https://github.com/trxcllnt/arrow/tree/typescript-3.9

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9659) [C++] RecordBatchStreamReader throws on CUDA device buffers

2020-08-05 Thread Paul Taylor (Jira)
Paul Taylor created ARROW-9659:
--

 Summary: [C++] RecordBatchStreamReader throws on CUDA device 
buffers
 Key: ARROW-9659
 URL: https://issues.apache.org/jira/browse/ARROW-9659
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 1.0.1


Prior to 1.0.0, the RecordBatchStreamReader was capable of reading source 
CudaBuffers wrapped in a CudaBufferReader. In 1.0.0, the Array validation 
routines call into Buffer::data(), which throws an error if the source isn't in 
host memory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)