> Maybe Java can add the concept of Tables and ChunkedArrays sometime in the
> future.

Is there a concrete use-case here? It might pay to open up some JIRAs. I'm
still not 100% clear on the rationale for the way VectorSchemaRoot is
designed and how that would relate to Table/ChunkedArrays (or maybe they are
completely separate)?

On Tue, Nov 12, 2019 at 11:28 AM Bryan Cutler <cutl...@gmail.com> wrote:

> Yes, you are correct. I think I was mixing up a couple different things. I
> like the way C++/Python distinguishes it where a RecordBatch is contiguous
> memory and a Table can be chunked. So since you are just talking about
> RecordBatches, I think we should keep it contiguous and concat would
> require memcpy. Maybe Java can add the concept of Tables and ChunkedArrays
> sometime in the future.
>
> On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>>> I think having a chunked array with multiple vector buffers would be
>>> ideal, similar to C++. It might take a fair amount of work to add this but
>>> would open up a lot more functionality.
>>
>>
>> There are potentially two different use-cases. ChunkedArray is
>> logical/lazy concatenation, whereas concat physically rebuilds the vectors
>> to be a single vector.
>>
>> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> I think having a chunked array with multiple vector buffers would be
>>> ideal, similar to C++. It might take a fair amount of work to add this but
>>> would open up a lot more functionality. As for the API,
>>> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
>>>
>>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>>
>>>> Hi Micah,
>>>>
>>>> Thanks for bringing this up.
>>>>
>>>> > 1. An efficient solution already exists? It seems like TransferPair
>>>> > implementations could possibly be improved upon or have they already been
>>>> > optimized?
>>>>
>>>> Fundamentally, memory copy is unavoidable, IMO, because the source and
>>>> target memory regions are likely to be in non-contiguous regions.
>>>> An alternative is to make ArrowBuf support a number of non-contiguous
>>>> memory regions. However, that would harm the performance of ArrowBuf, and
>>>> ArrowBuf is the core of the Arrow library.
>>>>
>>>> > 2. What the preferred API for doing this would be? Some options I can
>>>> > think of:
>>>>
>>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>>>
>>>> IMO, option 1 is required, as we have scenarios that need to concatenate
>>>> vectors/VectorSchemaRoots (e.g. restore the complete dictionary from delta
>>>> dictionaries).
>>>> Options 2 and 3 are optional for us.
>>>>
>>>> Best,
>>>> Liya Fan
>>>>
>>>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi,
>>>> > A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048 for
>>>> > having similar functionality to the Python APIs that allow for creating one
>>>> > larger data structure from a series of record batches. I just wanted to
>>>> > surface it here in case:
>>>> > 1. An efficient solution already exists? It seems like TransferPair
>>>> > implementations could possibly be improved upon or have they already been
>>>> > optimized?
>>>> > 2. What the preferred API for doing this would be? Some options I can
>>>> > think of:
>>>> >
>>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>>> >
>>>> > Thanks,
>>>> > Micah
>>>> >
>>>>
>>>
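
The distinction drawn in this thread between logical (chunked) concatenation and physical (memcpy) concatenation can be sketched in plain Java. This is an illustration only, not Arrow code: an int[] stands in for a vector's data buffer, and the names ChunkedVsConcat, ChunkedInts, and concat are hypothetical and do not exist in the Arrow Java API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkedVsConcat {

    /** Logical/lazy concatenation: holds references to the chunks, copies nothing. */
    static final class ChunkedInts {
        private final List<int[]> chunks = new ArrayList<>();
        private final int length;

        ChunkedInts(List<int[]> chunks) {
            int total = 0;
            for (int[] c : chunks) {
                this.chunks.add(c); // keep a reference to the chunk, not a copy
                total += c.length;
            }
            this.length = total;
        }

        int length() {
            return length;
        }

        /** Walks the chunks to find the one that contains the logical index. */
        int get(int index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            int offset = index;
            for (int[] c : chunks) {
                if (offset < c.length) {
                    return c[offset];
                }
                offset -= c.length;
            }
            throw new IllegalStateException("unreachable");
        }
    }

    /** Physical concatenation: copies every chunk into one contiguous array. */
    static int[] concat(List<int[]> chunks) {
        int total = 0;
        for (int[] c : chunks) {
            total += c.length;
        }
        int[] out = new int[total];
        int pos = 0;
        for (int[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length); // the unavoidable memory copy
            pos += c.length;
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> batches = Arrays.asList(new int[] {1, 2, 3}, new int[] {4, 5});
        ChunkedInts chunked = new ChunkedInts(batches);
        int[] contiguous = concat(batches);
        System.out.println(chunked.length() + " elements, element 3 = " + chunked.get(3));
        System.out.println(contiguous.length + " elements, element 3 = " + contiguous[3]);
    }
}
```

Both views return the same values for the same logical index. The chunked view pays for it with a lookup across chunks on each access, while concat pays once up front with a copy and yields contiguous memory afterwards.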