[ 
https://issues.apache.org/jira/browse/ARROW-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279371#comment-17279371
 ] 

Paul Taylor commented on ARROW-10450:
-------------------------------------

Yeah, this is unfortunately a tricky spot with the current Chunked vectors. The 
`.data` getter on Chunked only returns the data field of the first chunk. 
Table.fromStruct() doesn't expect to get a ChunkedVector as input; it expects a 
single-chunk StructVector.
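
To make the truncation concrete, here is a toy model of the behavior described 
above. It is only a sketch and not Arrow's implementation: `ToyChunked` and 
`toyFromStruct` are hypothetical names, and the 1000/500 split is assumed to 
match the numbers in the report.

```javascript
// Toy model only: ToyChunked and toyFromStruct are hypothetical names,
// not part of apache-arrow.
class ToyChunked {
  constructor(chunks) {
    this.chunks = chunks;
  }
  // The total length sums every chunk.
  get length() {
    return this.chunks.reduce((n, c) => n + c.length, 0);
  }
  // Mirrors the behavior described above: only the first chunk is exposed.
  get data() {
    return this.chunks[0];
  }
}

// A consumer that assumes single-chunk input reads `.data` and stops there.
function toyFromStruct(vector) {
  return { length: vector.data.length };
}

const rows = Array.from({ length: 1500 }, (_, i) => i);
const chunked = new ToyChunked([rows.slice(0, 1000), rows.slice(1000)]);

console.log(chunked.length);                // 1500
console.log(toyFromStruct(chunked).length); // 1000
```

The vector reports all 1500 rows, but anything that goes through `.data` only 
ever sees the first 1000.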

Your `Vector.from({values: <JS objects>})` call runs those JS objects through 
the Arrow Struct Builder, which serializes them into binary Arrow vectors.

The `highWaterMark` defaults to 1000 to avoid the case where someone tries to 
serialize lots of data, and the builder has to grow allocations past the 2GB 
limit. Builder internal buffers grow geometrically, so this is relatively easy 
to do with strings.
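
A rough back-of-the-envelope sketch of why geometric growth gets there quickly. 
The doubling factor, the 1 KiB starting capacity, and the exact limit of 
2^31 - 1 bytes are illustrative assumptions, not values taken from the Arrow 
source; `capacityFor` is a hypothetical helper.

```javascript
// Illustrative assumptions: doubling growth, a 1 KiB starting capacity, and a
// limit of 2^31 - 1 bytes. None of these are taken from the Arrow source.
const LIMIT = 2 ** 31 - 1;

// Smallest capacity reachable by repeated doubling that holds `needed` bytes.
function capacityFor(needed, initial = 1024, factor = 2) {
  let capacity = initial;
  while (capacity < needed) {
    capacity *= factor;
  }
  return capacity;
}

const needed = 1200 * 2 ** 20;          // 1200 MiB of string bytes
const capacity = capacityFor(needed);

console.log(capacity);                  // 2147483648
console.log(capacity > LIMIT);          // true
```

Under these assumptions, a payload only a little over 1GB is already enough to 
make the next allocation cross the limit.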

As you noted, you don't run into this issue when you do `Table.new()` because 
that method expects that its input may be split across multiple chunks. The 
only downside is that you now have a Table of struct of fields, rather than a 
Table of fields.

> [Javascript] Table.fromStruct() silently truncates vectors to the first chunk
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10450
>                 URL: https://issues.apache.org/jira/browse/ARROW-10450
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: JavaScript
>    Affects Versions: 2.0.0
>            Reporter: David Saslawsky
>            Priority: Minor
>
> Table.fromStruct() only uses the first chunk from the input vector.
> {code:javascript}
> import { Bool, Field, Int32, Struct, Table, Vector } from "apache-arrow";
> const myStruct = new Struct([
>   Field.new({ name: "over", type: new Int32() }),
>   Field.new({ name: "out", type: new Bool() })
> ]);
> const data = [];
> for (let i = 0; i < 1500; i++) {
>   data.push({ over: i, out: i % 2 === 0 });
> }
> // create a vector with two chunks
> const victor = Vector.from({
>   type: myStruct,
>   /*highWaterMark: Infinity,*/
>   values: data
> });
> console.log(victor.length);  // 1500 
> const table = Table.fromStruct(victor);
> console.log(table.length);   // 1000
> {code}
> The workaround is to set highWaterMark to Infinity.
>
> Table.new() works as expected:
> {code:javascript}
> const int32Array = new Int32Array(1500);
> for (let i = 0; i < 1500; i++) {
>   int32Array[i] = i;
> }
> const intVector = Vector.from({ type: new Int32(), values: int32Array });
> console.log(intVector.length);  // 1500
> const intTable = Table.new({ intColumn: intVector });
> console.log(intTable.length);   // 1500
> {code}
>  
> The origin seems to be in Chunked.data(), but I don't understand the code 
> well enough to propose a fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
