[ https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046623#comment-17046623 ]
Liya Fan commented on ARROW-7048:
---------------------------------

[~yogeshtewari] Sorry for the long wait.
We have provided a PR for this issue. Would you please take a look and check whether it is what you want?

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> --------------------------------------------------------------------
>
>                 Key: ARROW-7048
>                 URL: https://issues.apache.org/jira/browse/ARROW-7048
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Yogesh Tewari
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
>
> pyarrow.Table.combine_chunks provides the nice functionality of combining
> multiple record batches under a single pyarrow.Table.
>
> I am currently working on a downstream application which reads data from
> BigQuery. The BigQuery Storage API supports data output in Arrow format but
> streams data in many batches of 1024 rows or fewer.
> It would be really nice to have the Arrow Java API provide this functionality
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfi...@gmail.com], I tried to write my own
> implementation by copying data vector by vector using TransferPair's
> copyValueSafe.
> But, unless I am missing something obvious, it turns out that it only copies
> one value at a time. That means a lot of looping, calling copyValueSafe for
> millions of rows from source vector index to target vector index. Ideally I
> would want to concatenate/link the underlying buffers rather than copying one
> cell at a time.
>
> E.g., if I have:
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     Schema schema = reader.getVectorSchemaRoot().getSchema();
>     for (int i = 0; i < 5; i++) {
>         // This will be loaded with new values on every call to loadNextBatch
>         VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>         reader.loadNextBatch();
>         batchList.add(readBatch);
>     }
> }
> //VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);
> {code}
>
> A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the
> right thing to use here.
>
>
> PS. Feel free to update the title of this feature request with more
> appropriate wording.
>
> Cheers,
> Yogesh
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
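
For readers following the request quoted above, the difference between copying one value at a time and appending whole batches can be sketched without Arrow itself. The snippet below is a simplified model, not the Arrow Java API and not the implementation in the PR: each batch is represented by a plain `int[]`, and the hypothetical `ChunkCombiner.combine` sizes the target once and then copies each batch with a single bulk `System.arraycopy`. Real Arrow vectors also carry validity and offset buffers, which this sketch ignores.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it does not use or mirror the Arrow Java API.
final class ChunkCombiner {

    private ChunkCombiner() {
    }

    // Concatenates all chunks into one array, using one bulk copy per chunk.
    static int[] combine(List<int[]> chunks) {
        // First pass: compute the total row count so the target is allocated once.
        int total = 0;
        for (int[] chunk : chunks) {
            total = Math.addExact(total, chunk.length);
        }

        int[] combined = new int[total];
        int offset = 0;
        // Second pass: copy each chunk's values in a single call.
        for (int[] chunk : chunks) {
            System.arraycopy(chunk, 0, combined, offset, chunk.length);
            offset += chunk.length;
        }
        return combined;
    }

    public static void main(String[] args) {
        // Three small "batches" standing in for the record batches read from a stream.
        List<int[]> batches = Arrays.asList(new int[] {1, 2}, new int[] {3}, new int[] {4, 5});
        System.out.println(Arrays.toString(combine(batches)));
    }
}
```

The work per batch is one copy call rather than one call per row, which is the behaviour the reporter asks for; the data is still copied, not linked, so this only approximates the "concatenate/link the underlying buffers" idea.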