@Wes McKinney,

Thanks a lot for your comments and effort.
The JIRA looks good. I will track it.

Best,
Liya Fan

On Fri, Jul 12, 2019 at 10:31 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Liya -- yes, it seems reasonable to defer the conversion from your
> pointer-based extension representation to a proper VarCharVector until
> you need to send over IPC.
>
> Note that there is no mechanism yet in Java with extension types to
> cause a conversion to take place when the IPC step is reached.
>
> I just opened https://issues.apache.org/jira/browse/ARROW-5929 to try
> to explain this issue. Let me know if it is not clear.
>
> I'm interested to experiment with the same thing in C++. We would have
> an ExtensionArray in C++ whose values are string_view referencing
> external memory, for example.
>
> - Wes
>
> On Thu, Jul 11, 2019 at 10:16 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > @Wes McKinney,
> >
> > Thanks a lot for the brainstorming. I think your ideas are reasonable and
> > feasible.
> > About IPC, my idea is that we can send the vector as a PointerStringVector,
> > and receive it as a VarCharVector, so that the overhead of memory
> > compaction can be hidden.
> > What do you think?
> >
> > Best,
> > Liya Fan
> >
> > On Fri, Jul 12, 2019 at 11:07 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > > @Uwe L. Korn
> > >
> > > Thanks a lot for the suggestion. I think this is exactly what we are
> > > doing right now.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > >> hi Liya -- have you thought about implementing this as an
> > >> ExtensionType / ExtensionVector? You actually can already do this, so
> > >> if this helps you reference strings stored in some external memory
> > >> then that seems reasonable. Such a PointerStringVector could have a
> > >> method that converts it into the Arrow varbinary columnar
> > >> representation.
> > >>
> > >> You wouldn't be able to put such an object into the IPC binary
> > >> protocol, though. If that's a requirement (being able to use the IPC
> > >> protocol) for this kind of data, before going any further in the
> > >> discussion I would suggest that you work out exactly how such data
> > >> would be moved from one process address space to another (using
> > >> Buffers).
> > >>
> > >> - Wes
> > >>
> > >> On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn <uw...@xhochy.com> wrote:
> > >> >
> > >> > Hello Liya Fan,
> > >> >
> > >> > here your best approach is to copy into the Arrow format, as you can
> > >> > then use this as the basis for working with the Arrow-native
> > >> > representation as well as your internal representation. You will have
> > >> > to use two different offset vectors, as those two will always differ.
> > >> > Your internal representation does not have Arrow's requirement of
> > >> > consecutive data, but you can still work with the strings just as
> > >> > before even when they are stored consecutively.
> > >> >
> > >> > Uwe
> > >> >
> > >> > On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> > >> > > Hi Korn,
> > >> > >
> > >> > > Thanks a lot for your comments.
> > >> > >
> > >> > > Your comments make sense to me. Allowing non-consecutive
> > >> > > memory segments will break some good design choices of Arrow.
> > >> > > However, there are widespread user requirements for non-consecutive
> > >> > > memory segments. I am wondering how we can help such users. What
> > >> > > advice can we give them?
> > >> > >
> > >> > > Memory copy/move can be a solution, but is there a better solution?
> > >> > > Is there a third alternative? Can we virtualize the non-consecutive
> > >> > > memory segments into a consecutive one? (Although performance overhead
> > >> > > is unavoidable.)
> > >> > >
> > >> > > What do you think? Let's brain-storm it.
> > >> > >
> > >> > > Best,
> > >> > > Liya Fan
> > >> > >
> > >> > >
> > >> > > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> > >> > >
> > >> > > > Hello Liya,
> > >> > > >
> > >> > > > I'm quite -1 on this type as Arrow is about efficient columnar
> > >> > > > structures. We have opened the standard also to matrix-like types
> > >> > > > but always keep the constraint of consecutive memory. Now also
> > >> > > > adding types where memory is no longer consecutive but spread in
> > >> > > > the heap will make the scope of the project much wider (It seems
> > >> > > > that we then just turn into a general serialization framework).
> > >> > > >
> > >> > > > One of the ideas of a common standard is that some need to make
> > >> > > > compromises. I think in this case it is a necessary compromise to
> > >> > > > not allow all kinds of string representations.
> > >> > > >
> > >> > > > Uwe
> > >> > > >
> > >> > > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > >> > > > > Hi all,
> > >> > > > >
> > >> > > > >
> > >> > > > > We are thinking of providing varchar/varbinary vectors with a
> > >> > > > > different memory layout which exists in a wide range of systems.
> > >> > > > > The memory layout is different from that of VarCharVector in the
> > >> > > > > following ways:
> > >> > > > >
> > >> > > > >
> > >> > > > >    1. Instead of storing (start offset, end offset), the new layout
> > >> > > > >       stores (start offset, length).
> > >> > > > >    2. The content of varchars may not be in a consecutive memory
> > >> > > > >       region. Instead, it can be at arbitrary memory addresses.
> > >> > > > >
> > >> > > > >
> > >> > > > > Due to these differences in memory layout, converting data between
> > >> > > > > existing systems and VarCharVectors incurs performance overhead.
> > >> > > > >
> > >> > > > > Difference 1 above seems insignificant, while difference 2 is
> > >> > > > > difficult to overcome. However, the scenario of difference 2 is
> > >> > > > > prevalent in practice: for example, we store strings in a series of
> > >> > > > > memory segments. Whenever a segment is full, we request a new one.
> > >> > > > > However, these memory segments may not be consecutive, because
> > >> > > > > other processes/threads are also requesting/releasing memory
> > >> > > > > segments in the meantime.
> > >> > > > >
> > >> > > > > So we are wondering if it is possible to support such a memory
> > >> > > > > layout in Arrow. I think there are more systems that are trying to
> > >> > > > > adopt Arrow, but are hindered by this difficulty.
> > >> > > > >
> > >> > > > > Would you please give your valuable feedback?
> > >> > > > >
> > >> > > > >
> > >> > > > > Best,
> > >> > > > >
> > >> > > > > Liya Fan
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> > >
>
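
For readers following the thread, the conversion discussed above (a
pointer-based vector with a method that produces the consecutive varbinary
layout) might look roughly like the sketch below. This is only an illustration
under stated assumptions, not Arrow's API and not the implementation anyone in
the thread proposed: the class name PointerStringVector is borrowed from the
discussion, but its fields, the (segment, start, length) triple, and the
method names append, get, and to_varbinary are hypothetical. Plain Python
bytes objects stand in for memory segments, and null handling (the validity
bitmap) is left out.

```python
from typing import List, Tuple


class PointerStringVector:
    """Hypothetical sketch: values point into non-consecutive segments.

    Each value is a (segment index, start offset, length) triple, which
    mirrors differences 1 and 2 from the original proposal.
    """

    def __init__(self, segments: List[bytes]) -> None:
        self.segments = segments
        self.pointers: List[Tuple[int, int, int]] = []

    def append(self, segment: int, start: int, length: int) -> None:
        # Reject pointers that fall outside the referenced segment.
        if start < 0 or length < 0 or start + length > len(self.segments[segment]):
            raise ValueError("pointer out of segment bounds")
        self.pointers.append((segment, start, length))

    def get(self, i: int) -> bytes:
        # Values are readable in place; no copy into a shared buffer is needed.
        segment, start, length = self.pointers[i]
        return self.segments[segment][start:start + length]

    def to_varbinary(self) -> Tuple[List[int], bytes]:
        """Compact into an Arrow-style layout.

        Returns n + 1 offsets plus one consecutive data buffer, so value i
        occupies data[offsets[i]:offsets[i + 1]]. This is the memory copy
        the thread suggests deferring until it is actually needed.
        """
        offsets = [0]
        data = bytearray()
        for i in range(len(self.pointers)):
            data += self.get(i)
            offsets.append(len(data))
        return offsets, bytes(data)


# Two separate "segments"; the values are stored in neither order nor one place.
vec = PointerStringVector([b"helloworld", b"arrow"])
vec.append(1, 0, 5)  # "arrow"
vec.append(0, 5, 5)  # "world"
vec.append(0, 0, 5)  # "hello"
offsets, data = vec.to_varbinary()
```

In this sketch the compaction cost is paid only when to_varbinary is called,
which corresponds to deferring the conversion until the data has to be sent
over IPC. Real Arrow varbinary arrays store the offsets as 32-bit integers in
a buffer rather than in a Python list.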
