Hello Liya,

I'm quite -1 on this type, as Arrow is about efficient columnar structures. We 
have also opened the standard to matrix-like types, but have always kept the 
constraint of consecutive memory. Adding types where memory is no longer 
consecutive but spread across the heap would make the scope of the project 
much wider (it seems that we would then just turn into a general serialization 
framework).
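
To make the difference concrete, here is a minimal sketch in plain Python. It
does not use the Arrow API; the function names and the (segment id, start,
length) entry format are assumptions made up for illustration, with the segment
id standing in for an arbitrary memory address. It shows reading a value from a
consecutive offsets-plus-data layout, reading one from a scattered layout, and
the copy needed to convert the second into the first.

```python
# Hypothetical illustration only: plain Python, not the Arrow API.

def read_contiguous(offsets, data, i):
    """Offsets layout: value i occupies data[offsets[i]:offsets[i + 1]]."""
    return data[offsets[i]:offsets[i + 1]]


def read_scattered(entries, segments, i):
    """Scattered layout: each entry is (segment id, start, length)."""
    seg, start, length = entries[i]
    return segments[seg][start:start + length]


def to_contiguous(entries, segments):
    """Copy every scattered value into one buffer and build the offsets."""
    offsets, data = [0], bytearray()
    for seg, start, length in entries:
        data += segments[seg][start:start + length]
        offsets.append(len(data))
    return offsets, bytes(data)


# Two separately allocated segments holding three string values.
segments = [b"helloarrow", b"columnar"]
entries = [(0, 0, 5), (1, 0, 8), (0, 5, 5)]

offsets, data = to_contiguous(entries, segments)
```

In this sketch the conversion touches every byte once, which is the kind of
overhead the proposal wants to avoid, while the consecutive layout needs only
one data buffer and one offsets buffer per vector.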

One of the ideas of a common standard is that some need to make compromises. I 
think in this case it is a necessary compromise to not allow all kinds of string 
representations.

Uwe

On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> Hi all,
> 
> 
> We are thinking of providing varchar/varbinary vectors with a different
> memory layout which exists in a wide range of systems. The memory layout is
> different from that of VarCharVector in the following ways:
> 
> 
>    1. Instead of storing (start offset, end offset), the new layout stores
>       (start offset, length).
>    2. The content of varchars may not be in a consecutive memory region.
>       Instead, it can be at arbitrary memory addresses.
> 
> 
> Due to these differences in memory layout, converting data between existing
> systems and VarCharVectors incurs performance overhead.
> 
> The above difference 1 seems insignificant, while difference 2 is difficult
> to overcome. However, the scenario of difference 2 is prevalent in
> practice: for example, we store strings in a series of memory segments.
> Whenever a segment is full, we request a new one. However, these memory
> segments may not be consecutive, because other processes/threads are also
> requesting/releasing memory segments in the meantime.
> 
> So we are wondering if it is possible to support such a memory layout in
> Arrow. I think there are more systems that are trying to adopt Arrow,
> but are hindered by this difficulty.
> 
> Would you please give your valuable feedback?
> 
> 
> Best,
> 
> Liya Fan
>
