Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Uwe L. Korn Thu, 11 Jul 2019 05:35:41 -0700

Hello Liya Fan,

Here, your best approach is to copy the data into the Arrow format; you can
then use that copy as the basis for working with the Arrow-native
representation as well as your internal representation. You will have to use
two different offset vectors, as those two will always differ. Your internal
representation does not have Arrow's requirement of consecutive data, but you
can still work with the strings just as before even when they are stored
consecutively.
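
A minimal sketch of the idea, in plain Python rather than the Arrow API. The
(segment, start, length) entry format and all function names are assumptions
made for illustration only. It copies scattered strings into one consecutive
buffer with Arrow-style n + 1 offsets, then derives (start, length) entries
that point into that same copy:

```python
from typing import List, Tuple

# Hypothetical internal layout: strings live in separate, non-consecutive
# segments, and each entry is (segment index, start offset, length).
Entry = Tuple[int, int, int]


def to_arrow_layout(segments: List[bytes], entries: List[Entry]) -> Tuple[bytes, List[int]]:
    """Copy scattered strings into one consecutive buffer plus n + 1 offsets."""
    data = bytearray()
    offsets = [0]
    for seg, start, length in entries:
        data += segments[seg][start:start + length]
        offsets.append(len(data))  # end of this string, start of the next
    return bytes(data), offsets


def to_internal_entries(offsets: List[int]) -> List[Entry]:
    """Build (segment, start, length) entries over the single copied buffer."""
    return [(0, offsets[i], offsets[i + 1] - offsets[i])
            for i in range(len(offsets) - 1)]


def read(segments: List[bytes], entry: Entry) -> bytes:
    """Read one string through the internal (segment, start, length) view."""
    seg, start, length = entry
    return segments[seg][start:start + length]


# Three strings spread over three unrelated segments.
segments = [b"helloxx", b"arrow", b"__world"]
entries = [(0, 0, 5), (2, 2, 5), (1, 0, 5)]

data, offsets = to_arrow_layout(segments, entries)
internal = to_internal_entries(offsets)
```

After the copy, the internal entries and the Arrow-style offsets are two
different index vectors over the same consecutive bytes, so existing
(start, length) code can keep reading the strings unchanged.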


Uwe

On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> Hi Korn,
> 
> Thanks a lot for your comments.
> 
> Your comments make sense to me. Allowing non-consecutive
> memory segments will break some good design choices of Arrow.
> However, there are widespread user requirements for non-consecutive memory
> segments. I am wondering how we can help such users. What advice can we
> give them?
> 
> Memory copy/move can be a solution, but is there a better solution?
> Is there a third alternative? Can we virtualize the non-consecutive memory
> segments into a consecutive one? (Although performance overhead is
> unavoidable.)
> 
> What do you think? Let's brainstorm it.
> 
> Best,
> Liya Fan
> 
> 
> On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <[email protected]> wrote:
> 
> > Hello Liya,
> >
> > I'm quite -1 on this type as Arrow is about efficient columnar structures.
> > We have also opened the standard to matrix-like types but have always kept
> > the constraint of consecutive memory. Adding types where memory is no
> > longer consecutive but spread across the heap would make the scope of the
> > project much wider (it seems that we would then just turn into a general
> > serialization framework).
> >
> > One of the ideas of a common standard is that some need to make
> > compromises. I think in this case it is a necessary compromise to not allow
> > all kinds of string representations.
> >
> > Uwe
> >
> > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > > Hi all,
> > >
> > >
> > > We are thinking of providing varchar/varbinary vectors with a different
> > > memory layout which exists in a wide range of systems. The memory layout
> > > is different from that of VarCharVector in the following ways:
> > >
> > >
> > >    1. Instead of storing (start offset, end offset), the new layout stores
> > >       (start offset, length).
> > >    2. The content of varchars may not be in a consecutive memory region.
> > >       Instead, it can be at arbitrary memory addresses.
> > >
> > >
> > > Because of these differences in memory layout, converting data between
> > > existing systems and VarCharVectors incurs performance overhead.
> > >
> > > The above difference 1 seems insignificant, while difference 2 is difficult
> > > to overcome. However, the scenario of difference 2 is prevalent in
> > > practice: for example we store strings in a series of memory segments.
> > > Whenever a segment is full, we request a new one. However, these memory
> > > segments may not be consecutive, because other processes/threads are also
> > > requesting/releasing memory segments in the meantime.
> > >
> > > So we are wondering if it is possible to support such a memory layout in
> > > Arrow. I think there are more systems that are trying to adopt Arrow
> > > but are hindered by this difficulty.
> > >
> > > Would you please give your valuable feedback?
> > >
> > >
> > > Best,
> > >
> > > Liya Fan
> > >
> >
>
