One use-case for ChunkedArray that comes to mind is external sorting of
large vectors.
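
To make that concrete, here is a minimal sketch of the idea, assuming plain
int[] chunks stand in for vector chunks. The class and method names are
hypothetical and are not Arrow APIs. Each chunk is sorted on its own, and the
sorted runs are then merged with a min-heap. A real external sort would spill
the sorted runs to disk, whereas this sketch keeps everything in memory.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; plain int[] chunks stand in for vector chunks.
final class ChunkedSort {
    // Sorts each chunk in place, then merges the sorted runs with a min-heap.
    static int[] sortChunks(List<int[]> chunks) {
        int total = 0;
        for (int[] c : chunks) {
            Arrays.sort(c);          // each chunk becomes one sorted run
            total += c.length;
        }
        // Heap entries are {chunkIndex, offsetInChunk}, ordered by the value
        // they currently point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(chunks.get(a[0])[a[1]],
                                      chunks.get(b[0])[b[1]]));
        for (int i = 0; i < chunks.size(); i++) {
            if (chunks.get(i).length > 0) {
                heap.add(new int[] {i, 0});
            }
        }
        int[] out = new int[total];
        int pos = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            int[] chunk = chunks.get(top[0]);
            out[pos++] = chunk[top[1]];
            // Advance within the same chunk if it has more elements.
            if (top[1] + 1 < chunk.length) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return out;
    }
}
```

The chunks never need to be concatenated before sorting, which is the
property that makes a chunked representation attractive here.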

Best,
Liya Fan

On Fri, Nov 15, 2019 at 2:14 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> >
> > Maybe Java can add the concept of Tables and ChunkedArrays sometime in the
> > future.
>
>
> Is there a concrete use-case here?  It might pay to open up some JIRAs.
> I'm still not 100% clear on the rationale for the way VectorSchemaRoot is
> designed and how that would relate to Table/ChunkedArrays (or maybe they
> are completely separate)?
>
> On Tue, Nov 12, 2019 at 11:28 AM Bryan Cutler <cutl...@gmail.com> wrote:
>
> > Yes, you are correct. I think I was mixing up a couple of different things.
> > I like the way C++/Python distinguishes it, where a RecordBatch is
> > contiguous memory and a Table can be chunked. So since you are just talking
> > about RecordBatches, I think we should keep it contiguous and concat would
> > require memcpy. Maybe Java can add the concept of Tables and ChunkedArrays
> > sometime in the future.
> >
> > On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >>> I think having a chunked array with multiple vector buffers would be
> >>> ideal, similar to C++. It might take a fair amount of work to add this but
> >>> would open up a lot more functionality.
> >>
> >>
> >> There are potentially two different use-cases.  ChunkedArray is a
> >> logical/lazy concatenation, whereas concat physically rebuilds the vectors
> >> into a single vector.
> >>
> >> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cutl...@gmail.com> wrote:
> >>
> >>> I think having a chunked array with multiple vector buffers would be
> >>> ideal, similar to C++. It might take a fair amount of work to add this but
> >>> would open up a lot more functionality. As for the API,
> >>> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
> >>>
> >>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >>>
> >>>> Hi Micah,
> >>>>
> >>>> Thanks for bringing this up.
> >>>>
> >>>> > 1.  An efficient solution already exists? It seems like TransferPair
> >>>> > implementations could possibly be improved upon or have they already been
> >>>> > optimized?
> >>>>
> >>>> Fundamentally, memory copy is unavoidable, IMO, because the source and
> >>>> target memory regions are likely to be non-contiguous.
> >>>> An alternative is to make ArrowBuf support a number of non-contiguous
> >>>> memory regions. However, that would harm the performance of ArrowBuf, and
> >>>> ArrowBuf is the core of the Arrow library.
> >>>>
> >>>> > 2.  What the preferred API for doing this would be?  Some options I can
> >>>> > think of:
> >>>>
> >>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
> >>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
> >>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
> >>>>
> >>>> IMO, option 1 is required, as we have scenarios that need to concatenate
> >>>> vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from
> >>>> delta dictionaries).
> >>>> Options 2 and 3 are optional for us.
> >>>>
> >>>> Best,
> >>>> Liya Fan
> >>>>
> >>>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <emkornfi...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> > Hi,
> >>>> > A colleague opened up
> >>>> > https://issues.apache.org/jira/browse/ARROW-7048 for
> >>>> > having similar functionality to the Python APIs that allow for creating one
> >>>> > larger data structure from a series of record batches.  I just wanted to
> >>>> > surface it here in case:
> >>>> > 1.  An efficient solution already exists? It seems like TransferPair
> >>>> > implementations could possibly be improved upon or have they already been
> >>>> > optimized?
> >>>> > 2.  What the preferred API for doing this would be?  Some options I can
> >>>> > think of:
> >>>> >
> >>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
> >>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
> >>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
> >>>> >
> >>>> > Thanks,
> >>>> > Micah
> >>>> >
> >>>>
> >>>
>
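
The distinction drawn in this thread between a ChunkedArray (logical/lazy
concatenation) and concat (physically rebuilding one contiguous vector with a
memory copy) can be illustrated with a small sketch. It assumes plain int[]
arrays in place of Arrow vectors, and ChunkedInts is a hypothetical name, not
an Arrow class. Appending a chunk only records a reference, element access
pays for a binary search over chunk offsets, and concat() performs one copy
per chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Arrow APIs.
final class ChunkedInts {
    private final List<int[]> chunks = new ArrayList<>();
    // Logical start index of each chunk within the whole sequence.
    private final List<Integer> starts = new ArrayList<>();
    private int length = 0;

    // Logical concatenation: keep a reference to the chunk, copy nothing.
    void append(int[] chunk) {
        starts.add(length);
        chunks.add(chunk);
        length += chunk.length;
    }

    int length() {
        return length;
    }

    // Random access: binary search for the last chunk whose start <= index.
    int get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int lo = 0;
        int hi = chunks.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts.get(mid) <= index) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return chunks.get(lo)[index - starts.get(lo)];
    }

    // Physical concatenation: one contiguous array, one copy per chunk.
    int[] concat() {
        int[] out = new int[length];
        int pos = 0;
        for (int[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length);
            pos += c.length;
        }
        return out;
    }
}
```

The trade-off is the one discussed above: the chunked form avoids the copy
but makes every access slightly more expensive, while the contiguous form
pays the copy once and keeps access a plain array index.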
