Re: [Java] Append multiple record batches together?

Micah Kornfield Thu, 14 Nov 2019 22:15:09 -0800

>
> Maybe Java can add the concept of Tables and ChunkedArrays sometime in the
> future.



Is there a concrete use-case here?  It might pay to open up some JIRAs.
I'm still not 100% clear on the rationale for the way VectorSchemaRoot is
designed and how that would relate to Table/ChunkedArrays (or maybe they
are completely separate)?
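
To make the distinction under discussion concrete, here is a minimal sketch in
plain Java. It does not use the Arrow API: int[] arrays stand in for vectors,
and the names ChunkedVsConcat, ChunkedInts and concat are hypothetical, chosen
only for illustration. It contrasts a chunked (logical) view, which stores
references to the existing batches, with a physical concat, which copies every
value into one contiguous array.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: plain int[] arrays stand in for Arrow vectors. */
final class ChunkedVsConcat {

    /** Logical (lazy) concatenation: keeps references to the chunks, copies nothing. */
    static final class ChunkedInts {
        private final List<int[]> chunks = new ArrayList<>();
        private int length = 0;

        void addChunk(int[] chunk) {
            chunks.add(chunk);        // only the reference is stored
            length += chunk.length;
        }

        int length() {
            return length;
        }

        /** Random access first has to locate the chunk that holds the index. */
        int get(int index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            int remaining = index;
            for (int[] chunk : chunks) {
                if (remaining < chunk.length) {
                    return chunk[remaining];
                }
                remaining -= chunk.length;
            }
            throw new AssertionError("unreachable");
        }
    }

    /** Physical concatenation: one contiguous array, every value is copied once. */
    static int[] concat(List<int[]> batches) {
        int total = 0;
        for (int[] batch : batches) {
            total += batch.length;
        }
        int[] out = new int[total];
        int offset = 0;
        for (int[] batch : batches) {
            System.arraycopy(batch, 0, out, offset, batch.length);
            offset += batch.length;
        }
        return out;
    }
}
```

In this sketch the chunked view is cheap to build but pays a lookup cost on
each access and shares memory with its inputs, while concat pays the copy once
and yields contiguous, independent storage.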

On Tue, Nov 12, 2019 at 11:28 AM Bryan Cutler <cutl...@gmail.com> wrote:

> Yes, you are correct. I think I was mixing up a couple different things. I
> like the way C++/Python distinguishes it where a RecordBatch is contiguous
> memory and a Table can be chunked. So since you are just talking about
> RecordBatches, I think we should keep it contiguous and concat would
> require memcpy. Maybe Java can add the concept of Tables and ChunkedArrays
> sometime in the future.
>
> On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>>> I think having a chunked array with multiple vector buffers would be
>>> ideal, similar to C++. It might take a fair amount of work to add this but
>>> would open up a lot more functionality.
>>
>>
>> There are potentially two different use-cases.  ChunkedArray is
>> logical/lazy concatenation, whereas concat physically rebuilds the vectors
>> into a single vector.
>>
>> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> I think having a chunked array with multiple vector buffers would be
>>> ideal, similar to C++. It might take a fair amount of work to add this but
>>> would open up a lot more functionality. As for the API,
>>> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
>>>
>>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>>
>>>> Hi Micah,
>>>>
>>>> Thanks for bringing this up.
>>>>
>>>> > 1.  An efficient solution already exists? It seems like TransferPair
>>>> > implementations could possibly be improved upon or have they already
>>>> > been optimized?
>>>>
>>>> Fundamentally, memory copy is unavoidable, IMO, because the source and
>>>> target memory regions are likely to be non-contiguous.
>>>> An alternative is to make ArrowBuf support a number of non-contiguous
>>>> memory regions. However, that would harm the performance of ArrowBuf, and
>>>> ArrowBuf is the core of the Arrow library.
>>>>
>>>> > 2.  What the preferred API for doing this would be?  Some options I
>>>> > can think of:
>>>>
>>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>>>
>>>> IMO, option 1 is required, as we have scenarios that need to concatenate
>>>> vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from
>>>> delta dictionaries).
>>>> Options 2 and 3 are optional for us.
>>>>
>>>> Best,
>>>> Liya Fan
>>>>
>>>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi,
>>>> > A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048
>>>> > for having similar functionality to the Python APIs that allow for
>>>> > creating one larger data structure from a series of record batches.
>>>> > I just wanted to surface it here in case:
>>>> > 1.  An efficient solution already exists? It seems like TransferPair
>>>> > implementations could possibly be improved upon or have they already
>>>> > been optimized?
>>>> > 2.  What the preferred API for doing this would be?  Some options I
>>>> > can think of:
>>>> >
>>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>>> >
>>>> > Thanks,
>>>> > Micah
>>>> >
>>>>
>>>
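
The delta-dictionary scenario mentioned in the thread can be sketched in plain
Java as well. This is an illustration under stated assumptions, not Arrow code:
List<String> stands in for a dictionary vector, and DeltaDictionary and restore
are hypothetical names. It relies only on the fact that a delta dictionary's
entries are appended after the existing ones, so indices encoded against the
earlier dictionary stay valid once the pieces are concatenated.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: List<String> stands in for a dictionary vector. */
final class DeltaDictionary {

    /** Rebuilds the complete dictionary by appending each delta, in order, to the base. */
    static List<String> restore(List<String> base, List<List<String>> deltas) {
        List<String> full = new ArrayList<>(base);   // copy, so the base stays untouched
        for (List<String> delta : deltas) {
            full.addAll(delta);                      // new entries go after the existing ones
        }
        return full;
    }
}
```

Because entries are only ever appended, an index that pointed at a value in the
base dictionary resolves to the same value in the restored one.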
