[jira] [Updated] (ARROW-7048) [Java] Support for combining multiple vectors under VectorSchemaRoot

2020-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7048:
--
Labels: pull-request-available  (was: )

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> 
>
> Key: ARROW-7048
> URL: https://issues.apache.org/jira/browse/ARROW-7048
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Yogesh Tewari
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> Hi,
>  
> pyarrow.Table.combine_chunks provides a convenient way to combine multiple 
> record batches into a single pyarrow.Table.
>  
> I am currently working on a downstream application which reads data from 
> BigQuery. The BigQuery Storage API supports output in Arrow format, but it 
> streams the data as many batches of 1024 or fewer rows each.
> It would be really nice if the Arrow Java API provided this functionality 
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfi...@gmail.com], I tried to write my own 
> implementation by copying data vector by vector using TransferPair's 
> copyValueSafe.
> But, unless I am missing something obvious, it turns out this copies only one 
> value at a time. That means a lot of looping, calling copyValueSafe for 
> millions of rows from source vector index to target vector index. Ideally I 
> would want to concatenate/link the underlying buffers rather than copying one 
> cell at a time.
>  
> E.g., if I have:
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(
>         new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     Schema schema = reader.getVectorSchemaRoot().getSchema();
>     for (int i = 0; i < 5; i++) {
>         // The same root instance is reloaded with new values on every
>         // call to loadNextBatch
>         VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>         reader.loadNextBatch();
>         batchList.add(readBatch);
>     }
> }
> // VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}
>  
> A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on 
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the 
> right thing to use here.
>  
>  
> PS. Feel free to update the title of this feature request with more 
> appropriate wording.
>  
> Cheers,
> Yogesh
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7048) [Java] Support for combining multiple vectors under VectorSchemaRoot

2019-11-01 Thread Yogesh Tewari (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yogesh Tewari updated ARROW-7048:
-
Description: 
Hi,

 

pyarrow.Table.combine_chunks provides a convenient way to combine multiple 
record batches into a single pyarrow.Table.

 

I am currently working on a downstream application which reads data from 
BigQuery. The BigQuery Storage API supports output in Arrow format, but it 
streams the data as many batches of 1024 or fewer rows each.

It would be really nice if the Arrow Java API provided this functionality under 
an abstraction like VectorSchemaRoot.

After getting guidance from [~emkornfi...@gmail.com], I tried to write my own 
implementation by copying data vector by vector using TransferPair's 
copyValueSafe.

But, unless I am missing something obvious, it turns out this copies only one 
value at a time. That means a lot of looping, calling copyValueSafe for 
millions of rows from source vector index to target vector index. Ideally I 
would want to concatenate/link the underlying buffers rather than copying one 
cell at a time.

 

E.g., if I have:
{code:java}
List<VectorSchemaRoot> batchList = new ArrayList<>();
try (ArrowStreamReader reader = new ArrowStreamReader(
        new ByteArrayInputStream(out.toByteArray()), allocator)) {
    Schema schema = reader.getVectorSchemaRoot().getSchema();
    for (int i = 0; i < 5; i++) {
        // The same root instance is reloaded with new values on every
        // call to loadNextBatch
        VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
        reader.loadNextBatch();
        batchList.add(readBatch);
    }
}

// VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}
 

A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
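In the absence of such a method, the per-value workaround described above could be sketched roughly as follows. This is a hypothetical helper, not part of the Arrow Java API: the class name CombineByCopy and the combine method are illustrative; only TransferPair.copyValueSafe and the VectorSchemaRoot accessors it calls are real API.

```java
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ValueVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.arrow.vector.util.TransferPair;

// Hypothetical helper illustrating the copy-based workaround described
// above. Assumes every batch in the list shares the same schema.
public final class CombineByCopy {

  public static VectorSchemaRoot combine(List<VectorSchemaRoot> batches,
                                         BufferAllocator allocator) {
    Schema schema = batches.get(0).getSchema();
    VectorSchemaRoot target = VectorSchemaRoot.create(schema, allocator);

    int offset = 0;  // next free row index in the target
    for (VectorSchemaRoot batch : batches) {
      for (int col = 0; col < schema.getFields().size(); col++) {
        ValueVector from = batch.getVector(col);
        ValueVector to = target.getVector(col);
        TransferPair pair = from.makeTransferPair(to);
        // copyValueSafe moves exactly one cell per call, which is the
        // per-row loop this feature request wants to avoid.
        for (int row = 0; row < batch.getRowCount(); row++) {
          pair.copyValueSafe(row, offset + row);
        }
      }
      offset += batch.getRowCount();
    }
    target.setRowCount(offset);
    return target;
  }
}
```

Because copyValueSafe copies a single cell per call, this costs on the order of rows × columns method invocations; a combineChunks that stitches or concatenates the underlying buffers would avoid the inner loop entirely.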

I did read the VectorSchemaRoot discussion on 
https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the 
right thing to use here.

 

 

PS. Feel free to update the title of this feature request with more appropriate 
wording.

 

Cheers,

Yogesh

 

 


> [Java] Support for combining multiple vectors under VectorSchemaRoot
> 
>
> Key: ARROW-7048
> URL: https://issues.apache.org/jira/browse/ARROW-7048
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Yogesh Tewari
>Priority: Major