@Wes McKinney, Thanks a lot for your comments and effort. The JIRA looks good. I will track it.
Best,
Liya Fan

On Fri, Jul 12, 2019 at 10:31 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Liya -- yes, it seems reasonable to defer the conversion from your
> pointer-based extension representation to a proper VarCharVector until
> you need to send over IPC.
>
> Note that there is no mechanism yet in Java with extension types to
> cause a conversion to take place when the IPC step is reached.
>
> I just opened https://issues.apache.org/jira/browse/ARROW-5929 to try
> to explain this issue. Let me know if it is not clear.
>
> I'm interested to experiment with the same thing in C++. We would have
> an ExtensionArray in C++ whose values are string_view referencing
> external memory, for example.
>
> - Wes
>
> On Thu, Jul 11, 2019 at 10:16 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > @Wes McKinney,
> >
> > Thanks a lot for the brainstorming. I think your ideas are reasonable and
> > feasible.
> > About IPC, my idea is that we can send the vector as a PointerStringVector,
> > and receive it as a VarCharVector, so that the overhead of memory
> > compaction can be hidden.
> > What do you think?
> >
> > Best,
> > Liya Fan
> >
> > On Fri, Jul 12, 2019 at 11:07 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > > @Uwe L. Korn
> > >
> > > Thanks a lot for the suggestion. I think this is exactly what we are
> > > doing right now.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > >> hi Liya -- have you thought about implementing this as an
> > >> ExtensionType / ExtensionVector? You actually can already do this, so
> > >> if this helps you reference strings stored in some external memory
> > >> then that seems reasonable. Such a PointerStringVector could have a
> > >> method that converts it into the Arrow varbinary columnar
> > >> representation.
> > >>
> > >> You wouldn't be able to put such an object into the IPC binary
> > >> protocol, though. If that's a requirement (being able to use the IPC
> > >> protocol) for this kind of data, before going any further in the
> > >> discussion I would suggest that you work out exactly how such data
> > >> would be moved from one process address space to another (using
> > >> Buffers).
> > >>
> > >> - Wes
> > >>
> > >> On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn <uw...@xhochy.com> wrote:
> > >> >
> > >> > Hello Liya Fan,
> > >> >
> > >> > here your best approach is to copy into the Arrow format, as you can
> > >> > then use this as the basis for working with the Arrow-native
> > >> > representation as well as your internal representation. You will have
> > >> > to use two different offset vectors, as those two will always differ.
> > >> > In the case of your internal representation, you don't have the
> > >> > requirement of consecutive data that Arrow has, but you can still
> > >> > work with the strings just as before even when they are stored
> > >> > consecutively.
> > >> >
> > >> > Uwe
> > >> >
> > >> > On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> > >> > > Hi Korn,
> > >> > >
> > >> > > Thanks a lot for your comments.
> > >> > >
> > >> > > In my opinion, your comments make sense to me. Allowing
> > >> > > non-consecutive memory segments will break some good design choices
> > >> > > of Arrow.
> > >> > > However, there are wide-spread user requirements for non-consecutive
> > >> > > memory segments. I am wondering how we can help such users. What
> > >> > > advice can we give to them?
> > >> > >
> > >> > > Memory copy/move can be a solution, but is there a better solution?
> > >> > > Is there a third alternative? Can we virtualize the non-consecutive
> > >> > > memory segments into a consecutive one? (Although performance
> > >> > > overhead is unavoidable.)
> > >> > >
> > >> > > What do you think? Let's brain-storm it.
> > >> > >
> > >> > > Best,
> > >> > > Liya Fan
> > >> > >
> > >> > >
> > >> > > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> > >> > >
> > >> > > > Hello Liya,
> > >> > > >
> > >> > > > I'm quite -1 on this type as Arrow is about efficient columnar
> > >> > > > structures. We have opened the standard also to matrix-like types
> > >> > > > but always keep the constraint of consecutive memory. Now also
> > >> > > > adding types where memory is no longer consecutive but spread in
> > >> > > > the heap will make the scope of the project much wider (It seems
> > >> > > > that we then just turn into a general serialization framework).
> > >> > > >
> > >> > > > One of the ideas of a common standard is that some need to make
> > >> > > > compromises. I think in this case it is a necessary compromise to
> > >> > > > not allow all kinds of string representations.
> > >> > > >
> > >> > > > Uwe
> > >> > > >
> > >> > > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > >> > > > > Hi all,
> > >> > > > >
> > >> > > > > We are thinking of providing varchar/varbinary vectors with a
> > >> > > > > different memory layout which exists in a wide range of systems.
> > >> > > > > The memory layout is different from that of VarCharVector in the
> > >> > > > > following ways:
> > >> > > > >
> > >> > > > > 1. Instead of storing (start offset, end offset), the new layout
> > >> > > > >    stores (start offset, length)
> > >> > > > > 2. The content of varchars may not be in a consecutive memory
> > >> > > > >    region. Instead, it can be at arbitrary memory addresses.
> > >> > > > >
> > >> > > > > Due to these differences in memory layout, it incurs performance
> > >> > > > > overhead when converting data between existing systems and
> > >> > > > > VarCharVectors.
> > >> > > > >
> > >> > > > > The above difference 1 seems insignificant, while difference 2 is
> > >> > > > > difficult to overcome. However, the scenario of difference 2 is
> > >> > > > > prevalent in practice: for example we store strings in a series
> > >> > > > > of memory segments. Whenever a segment is full, we request a new
> > >> > > > > one. However, these memory segments may not be consecutive,
> > >> > > > > because other processes/threads are also requesting/releasing
> > >> > > > > memory segments in the meantime.
> > >> > > > >
> > >> > > > > So we are wondering if it is possible to support such a memory
> > >> > > > > layout in Arrow. I think there are more systems that are trying
> > >> > > > > to adopt Arrow, but are hindered by such difficulty.
> > >> > > > >
> > >> > > > > Would you please give your valuable feedback?
> > >> > > > >
> > >> > > > > Best,
> > >> > > > >
> > >> > > > > Liya Fan
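
For readers following the thread, the conversion being discussed (from a pointer-style layout, where each string is a (start, length) reference into one of several scattered memory segments, to an Arrow-style layout with one consecutive data buffer and an offsets buffer) can be sketched roughly as below. This is only an illustration in plain Python and does not use any Arrow API: the names `Entry`, `compact`, and `get_value`, and the idea of identifying a segment by its list index, are assumptions made for the example and are not taken from the thread or from the Arrow code base.

```python
from typing import List, Optional, Tuple

# Hypothetical pointer-style entry: (segment index, start offset, length).
# None stands for a null value.
Entry = Optional[Tuple[int, int, int]]


def compact(
    segments: List[bytes], entries: List[Entry]
) -> Tuple[List[int], bytes, List[bool]]:
    """Copy scattered strings into one consecutive buffer.

    Returns (offsets, data, validity), where offsets has len(entries) + 1
    items and value i occupies data[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    data = bytearray()
    validity = []
    for entry in entries:
        if entry is None:
            # A null contributes no bytes, so its start and end offsets match.
            validity.append(False)
        else:
            seg, start, length = entry
            data += segments[seg][start:start + length]
            validity.append(True)
        offsets.append(len(data))
    return offsets, bytes(data), validity


def get_value(
    offsets: List[int], data: bytes, validity: List[bool], i: int
) -> Optional[bytes]:
    """Read value i back from the compacted layout (None for a null)."""
    if not validity[i]:
        return None
    return data[offsets[i]:offsets[i + 1]]
```

The copy inside `compact` is the memory compaction overhead mentioned in the thread. Deferring the call until the data has to cross a process boundary corresponds to the "convert only when sending over IPC" idea, while the pointer-style entries remain usable in-process without any copy.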