@Wes McKinney,

Thanks a lot for the brainstorming. I think your ideas are reasonable and
feasible.
About IPC, my idea is that we can send the vector as a PointerStringVector,
and receive it as a VarCharVector, so that the overhead of memory
compaction can be hidden.
What do you think?

Best,
Liya Fan

On Fri, Jul 12, 2019 at 11:07 AM Fan Liya <liya.fa...@gmail.com> wrote:

> @Uwe L. Korn
>
> Thanks a lot for the suggestion. I think this is exactly what we are doing
> right now.
>
> Best,
> Liya Fan
>
> On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Liya -- have you thought about implementing this as an
>> ExtensionType / ExtensionVector? You actually can already do this, so
>> if this helps you reference strings stored in some external memory
>> then that seems reasonable. Such a PointerStringVector could have a
>> method that converts it into the Arrow varbinary columnar
>> representation.
>>
>> You wouldn't be able to put such an object into the IPC binary
>> protocol, though. If that's a requirement (being able to use the IPC
>> protocol) for this kind of data, before going any further in the
>> discussion I would suggest that you work out exactly how such data
>> would be moved from one process address space to another (using
>> Buffers).
>>
>> - Wes
>>
>> On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>> >
>> > Hello Liya Fan,
>> >
>> > here your best approach is to copy into the Arrow format, as you can
>> > then use this as the basis for working with the Arrow-native
>> > representation as well as your internal representation. You will have
>> > to use two different offset vectors, as those two will always differ.
>> > Your internal representation does not have Arrow's requirement of
>> > consecutive data, but you can still work with the strings just as
>> > before even when they are stored consecutively.
>> >
>> > Uwe
>> >
>> > On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
>> > > Hi Korn,
>> > >
>> > > Thanks a lot for your comments.
>> > >
>> > > Your comments make sense to me. Allowing non-consecutive
>> > > memory segments will break some good design choices of Arrow.
>> > > However, there are widespread user requirements for non-consecutive
>> > > memory segments. I am wondering how we can help such users. What
>> > > advice can we give them?
>> > >
>> > > Memory copy/move can be a solution, but is there a better solution?
>> > > Is there a third alternative? Can we virtualize the non-consecutive
>> > > memory segments into a consecutive one? (Although performance overhead
>> > > is unavoidable.)
>> > >
>> > > What do you think? Let's brain-storm it.
>> > >
>> > > Best,
>> > > Liya Fan
>> > >
>> > >
>> > > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>> > >
>> > > > Hello Liya,
>> > > >
>> > > > I'm quite -1 on this type as Arrow is about efficient columnar
>> > > > structures. We have also opened the standard to matrix-like types
>> > > > but always keep the constraint of consecutive memory. Adding types
>> > > > where memory is no longer consecutive but spread across the heap
>> > > > would make the scope of the project much wider (it seems that we
>> > > > would then just turn into a general serialization framework).
>> > > >
>> > > > One of the ideas of a common standard is that some need to make
>> > > > compromises. I think in this case it is a necessary compromise to
>> > > > not allow all kinds of string representations.
>> > > >
>> > > > Uwe
>> > > >
>> > > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
>> > > > > Hi all,
>> > > > >
>> > > > >
>> > > > > We are thinking of providing varchar/varbinary vectors with a
>> > > > > different memory layout which exists in a wide range of systems.
>> > > > > The memory layout is different from that of VarCharVector in the
>> > > > > following ways:
>> > > > >
>> > > > >
>> > > > >    1. Instead of storing (start offset, end offset), the new
>> > > > >       layout stores (start offset, length).
>> > > > >    2. The content of varchars may not be in a consecutive memory
>> > > > >       region. Instead, it can be at arbitrary memory addresses.
>> > > > >
>> > > > >
>> > > > > Due to these differences in memory layout, converting data
>> > > > > between existing systems and VarCharVectors incurs performance
>> > > > > overhead.
>> > > > >
>> > > > > Difference 1 above seems insignificant, while difference 2 is
>> > > > > difficult to overcome. However, the scenario of difference 2 is
>> > > > > prevalent in practice: for example, we store strings in a series
>> > > > > of memory segments. Whenever a segment is full, we request a new
>> > > > > one. However, these memory segments may not be consecutive,
>> > > > > because other processes/threads are also requesting/releasing
>> > > > > memory segments in the meantime.
>> > > > >
>> > > > > So we are wondering if it is possible to support such a memory
>> > > > > layout in Arrow. I think there are more systems that are trying
>> > > > > to adopt Arrow but are hindered by this difficulty.
>> > > > >
>> > > > > Would you please give your valuable feedback?
>> > > > >
>> > > > >
>> > > > > Best,
>> > > > >
>> > > > > Liya Fan
>> > > > >
>> > > >
>> > >
>>
>
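
For readers following the thread, the conversion under discussion (from
scattered string segments to Arrow's consecutive layout) can be sketched as
below. This is only an illustration in plain Python, not Arrow code: the
`Pointer` triple, the `segments` list, and the function names are
hypothetical stand-ins for the proposed PointerStringVector, and the output
merely mimics the offsets-plus-data shape of a VarCharVector.

```python
from typing import List, Tuple

# Hypothetical "pointer" layout: each string is described by
# (segment index, start offset, length) and may live in any segment.
Pointer = Tuple[int, int, int]


def compact_to_consecutive(
    segments: List[bytes], pointers: List[Pointer]
) -> Tuple[List[int], bytes]:
    """Copy scattered strings into one consecutive buffer.

    Returns (offsets, data), where string i occupies
    data[offsets[i]:offsets[i + 1]], so there are len(pointers) + 1 offsets.
    """
    offsets = [0]
    data = bytearray()
    for segment_index, start, length in pointers:
        # This copy is the "memory compaction" cost mentioned in the thread.
        data += segments[segment_index][start:start + length]
        offsets.append(len(data))
    return offsets, bytes(data)


def get_string(offsets: List[int], data: bytes, i: int) -> bytes:
    """Read string i back from the consecutive layout."""
    return data[offsets[i]:offsets[i + 1]]


# Two independently allocated segments; the strings are referenced out of order.
segments = [b"helloworld", b"arrow"]
pointers = [(0, 5, 5), (1, 0, 5), (0, 0, 5)]

offsets, data = compact_to_consecutive(segments, pointers)
strings = [get_string(offsets, data, i) for i in range(len(pointers))]
```

Because the copy walks the pointers in order and appends to a single output,
the same loop could in principle write straight into an outgoing stream,
which is one way to read the suggestion that the compaction cost be hidden in
the IPC send step. Whether that fits the Arrow IPC format is exactly the open
question raised above.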
