Hi Liya -- have you thought about implementing this as an
ExtensionType / ExtensionVector? You actually can already do this, so
if this helps you reference strings stored in some external memory
then that seems reasonable. Such a PointerStringVector could have a
method that converts it into the Arrow varbinary columnar
representation.
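
As a rough illustration of what such a conversion method might do, here is a
minimal sketch in plain Python. It does not use the Arrow API; the names
to_varbinary, segments, and pointers are hypothetical, and the validity
information is kept as a simple list of booleans rather than Arrow's packed
bitmap. It copies strings referenced by (segment, start, length) out of
non-consecutive segments into one consecutive data buffer with int32 offsets.

```python
import struct


def to_varbinary(segments, pointers):
    """Copy scattered strings into one data buffer plus an offsets buffer.

    segments: list of bytes objects standing in for external memory segments
    pointers: list of (segment_index, start, length) tuples, or None for null
    """
    data = bytearray()
    offsets = [0]          # n + 1 offsets for n values
    validity = []          # simplified: Arrow itself uses a packed bitmap
    for p in pointers:
        if p is None:
            validity.append(False)   # null slot: zero-length, offset repeats
        else:
            seg, start, length = p
            data += segments[seg][start:start + length]
            validity.append(True)
        offsets.append(len(data))
    # Pack the offsets as little-endian 32-bit signed integers.
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return validity, offsets_buf, bytes(data)


# Two non-consecutive segments and four string references into them.
segments = [b"helloworld", b"arrow"]
pointers = [(0, 5, 5), (1, 0, 5), None, (0, 0, 5)]
validity, offsets_buf, data = to_varbinary(segments, pointers)
```

The cost is one copy of the string bytes, but the result is a set of
consecutive buffers that no longer depend on the external memory.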

You wouldn't be able to put such an object into the IPC binary
protocol, though. If that's a requirement (being able to use the IPC
protocol) for this kind of data, before going any further in the
discussion I would suggest that you work out exactly how such data
would be moved from one process address space to another (using
Buffers).

- Wes

On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> Hello Liya Fan,
>
> Here your best approach is to copy the data into the Arrow format. You can
> then use that copy as the basis for working with the Arrow-native
> representation alongside your internal representation. You will have to use
> two different offset vectors, as those two will always differ. Your internal
> representation does not have Arrow's requirement of consecutive data, but you
> can still work with the strings just as before even when they are stored
> consecutively.
>
> Uwe
>
> On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> > Hi Korn,
> >
> > Thanks a lot for your comments.
> >
> > Your comments make sense to me. Allowing non-consecutive
> > memory segments will break some good design choices of Arrow.
> > However, there are widespread user requirements for non-consecutive memory
> > segments. I am wondering how we can help such users. What advice can we
> > give them?
> >
> > Memory copy/move can be a solution, but is there a better solution?
> > Is there a third alternative? Can we virtualize the non-consecutive memory
> > segments into a consecutive one? (Although performance overhead is
> > unavoidable.)
> >
> > What do you think? Let's brainstorm it.
> >
> > Best,
> > Liya Fan
> >
> >
> > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > Hello Liya,
> > >
> > > I'm quite -1 on this type as Arrow is about efficient columnar structures.
> > > We have opened the standard also to matrix-like types but always keep the
> > > constraint of consecutive memory. Now also adding types where memory is no
> > > longer consecutive but spread in the heap will make the scope of the
> > > project much wider (It seems that we then just turn into a general
> > > serialization framework).
> > >
> > > One of the ideas of a common standard is that some need to make
> > > compromises. I think in this case it is a necessary compromise to not
> > > allow all kinds of string representations.
> > >
> > > Uwe
> > >
> > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > > > Hi all,
> > > >
> > > >
> > > > We are thinking of providing varchar/varbinary vectors with a different
> > > > memory layout, one which exists in a wide range of systems. The memory
> > > > layout is different from that of VarCharVector in the following ways:
> > > >
> > > >
> > > >    1. Instead of storing (start offset, end offset), the new layout stores
> > > >       (start offset, length).
> > > >    2. The content of varchars may not be in a consecutive memory region.
> > > >       Instead, it can be at arbitrary memory addresses.
> > > >
> > > >
> > > > Due to these differences in memory layout, converting data between
> > > > existing systems and VarCharVectors incurs performance overhead.
> > > >
> > > > Difference 1 above seems insignificant, while difference 2 is difficult
> > > > to overcome. However, the scenario of difference 2 is prevalent in
> > > > practice: for example, we store strings in a series of memory segments.
> > > > Whenever a segment is full, we request a new one. However, these memory
> > > > segments may not be consecutive, because other processes/threads are also
> > > > requesting/releasing memory segments in the meantime.
> > > >
> > > > So we are wondering if it is possible to support such a memory layout in
> > > > Arrow. I think there are more systems that are trying to adopt Arrow,
> > > > but are hindered by this difficulty.
> > > >
> > > > Would you please give your valuable feedback?
> > > >
> > > >
> > > > Best,
> > > >
> > > > Liya Fan
> > > >
> > >
> >
