Thanks a lot!

After more experimentation with liborc::ColumnVectorBatch this morning I found 
that its data is actually spaced (null entries still occupy a slot), so there 
is no need to write another function to efficiently append “compressed” 
values. This also simplifies the Arrow2ORC adapter I’m working on.
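
To make the distinction concrete, here is a minimal sketch in plain C++ (no 
Arrow or ORC dependency). SpaceOut is a hypothetical helper written only for 
illustration; it is not part of either library and not the adapter code. It 
shows what "spacing out" a compressed buffer would mean: each null gets a 
placeholder slot so that the values and the validity bytes have the same 
length, which is the layout AppendValues(values, length, valid_bytes) reads.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (illustration only, not an Arrow or ORC API).
// `compressed` holds only the non-null values, in order.
// `valid_bytes` has one byte per logical entry: nonzero = valid, 0 = null.
// Returns a "spaced" buffer with one slot per logical entry; null slots
// are filled with a placeholder value of 0.
std::vector<int64_t> SpaceOut(const std::vector<int64_t>& compressed,
                              const std::vector<uint8_t>& valid_bytes) {
  std::vector<int64_t> spaced(valid_bytes.size(), 0);
  std::size_t next = 0;  // index of the next unread compressed value
  for (std::size_t i = 0; i < valid_bytes.size(); ++i) {
    // The bounds check guards against a validity buffer that claims more
    // valid entries than there are compressed values.
    if (valid_bytes[i] != 0 && next < compressed.size()) {
      spaced[i] = compressed[next++];
    }
  }
  return spaced;
}
```

For example, SpaceOut({10, 20, 30}, {1, 0, 1, 1, 0}) yields five slots: 
10, 0, 20, 30, 0. Since liborc::ColumnVectorBatch already hands back data in 
this spaced form, no such step is needed in the adapter.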

> On Oct 20, 2020, at 12:55 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> 
> For reference, the code that Parquet uses to space out values is in rle_encoding.h
> [1].  This uses both BitBlockCounter and BitRunReader.  BitBlockCounter is
> faster than BitRunReader but on micro-benchmarks BitRunReader still
> provides some benefits assuming nulls are fairly infrequent.
> 
> It is worth noting that this code assumes preallocated arrays (i.e. it
> doesn't use builders).
> 
> [1]
> https://github.com/apache/arrow/blob/e0a9d0f28affdccb45bf76fde58d0eec1328cd40/cpp/src/arrow/util/rle_encoding.h
> 
> On Sun, Oct 18, 2020 at 10:35 AM Wes McKinney <wesmck...@gmail.com> wrote:
> 
>> hi Ying, the code in adapter_util.cc doesn't look right to me unless
>> the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes
>> where there is a null). We have quite a bit of code in Parquet that
>> deals specifically with this issue -- I'm not sure if we have a
>> ready-made function that will efficiently append the "compressed"
>> values to a builder, but we certainly have all the tools
>> you need to do so (e.g. the BitRunReader is helpful here)
>> 
>> On Sun, Oct 18, 2020 at 12:24 PM Ying Zhou <yzhou7...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Unlike in Arrow, in ORC a null entry is recorded only in the
>> PRESENT stream (equivalent to the validity bitmap in Arrow) and not in any
>> DATA stream, for any type including numeric types. Hence the notNull (aka
>> PRESENT) and data buffers from ORC generally don’t have the same size.
>>> 
>>> However, according to cpp/src/arrow/adapters/orc/adapter_util.cc
>> line 126, it is possible to directly call
>> builder->AppendValues(source, length, valid_bytes), with builder being an
>> Int64Builder and with source and valid_bytes having different sizes, which
>> doesn’t seem reasonable. May I ask whether this is actually valid usage of
>> AppendValues? Thanks!
>>> 
>>> 
>>> Best,
>>> Ying Zhou
>> 
