For reference, the code that Parquet uses to space out values is in
rle_encoding.h [1].  It uses both BitBlockCounter and BitRunReader.
BitBlockCounter is faster than BitRunReader, but in micro-benchmarks
BitRunReader still provides some benefit when nulls are fairly infrequent.

It is worth noting that this code assumes preallocated arrays (i.e. it
doesn't use builders).
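
To make "spacing out" concrete, here is a minimal sketch of the idea in
plain C++ with no Arrow dependency.  It is not the code from [1]:
GetBit and SpaceOutValues are hypothetical names made up for this example,
and the run detection is a naive bit-by-bit scan standing in for what
BitRunReader does much more efficiently.  It assumes an LSB-first validity
bitmap and a preallocated output array, and it writes zero into null slots
(the placeholder value is an arbitrary choice).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not an Arrow function): reads bit i of an LSB-first
// bitmap, the bit order used by Arrow validity bitmaps.
inline bool GetBit(const uint8_t* bitmap, std::size_t i) {
  return (bitmap[i / 8] >> (i % 8)) & 1;
}

// Hypothetical helper (not an Arrow function): copies `dense`, which holds
// only the non-null values, into the preallocated `spaced` array of `length`
// slots, writing a zero placeholder wherever the validity bit is 0.
// Returns the number of dense values consumed.
inline std::size_t SpaceOutValues(const int64_t* dense,
                                  const uint8_t* valid_bits,
                                  std::size_t length, int64_t* spaced) {
  std::size_t in = 0;   // index of the next unread dense value
  std::size_t pos = 0;  // index of the next output slot
  while (pos < length) {
    // Find the end of the current run of identical validity bits.
    const bool valid = GetBit(valid_bits, pos);
    std::size_t end = pos + 1;
    while (end < length && GetBit(valid_bits, end) == valid) ++end;
    const std::size_t run = end - pos;
    if (valid) {
      // A run of non-null slots maps to one contiguous bulk copy.
      std::memcpy(spaced + pos, dense + in, run * sizeof(int64_t));
      in += run;
    } else {
      // A run of null slots gets placeholder values.
      std::memset(spaced + pos, 0, run * sizeof(int64_t));
    }
    pos = end;
  }
  return in;
}
```

Working in runs is what pays off when nulls are infrequent: long runs of
valid slots become a few large copies instead of one branch per value.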

[1]
https://github.com/apache/arrow/blob/e0a9d0f28affdccb45bf76fde58d0eec1328cd40/cpp/src/arrow/util/rle_encoding.h

On Sun, Oct 18, 2020 at 10:35 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Ying, the code in adapter_util.cc doesn't look right to me unless
> the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes
> where there is a null). We have quite a bit of code in Parquet that
> deals specifically with this issue -- I'm not sure if we have a
> ready-made function that will efficiently append the "compressed"
> values to a builder, but we certainly have all the tools
> you need to do so (e.g. the BitRunReader is helpful here)
>
> On Sun, Oct 18, 2020 at 12:24 PM Ying Zhou <yzhou7...@gmail.com> wrote:
> >
> > Hi,
> >
> > Unlike in Arrow, in ORC a null entry is recorded only in the PRESENT
> > stream (equivalent to the validity bitmap in Arrow) and not in any DATA
> > stream, for any type including numeric types. Hence the notNull (aka
> > PRESENT) and data buffers from ORC generally don’t have the same size.
> >
> > However, according to cpp/src/arrow/adapters/orc/adapter_util.cc line 126,
> > it is possible to call builder->AppendValues(source, length, valid_bytes)
> > directly, with builder being an Int64Builder and with source and
> > valid_bytes having different sizes, which doesn’t seem reasonable. May I
> > ask whether this is actually valid usage of AppendValues? Thanks!
> >
> >
> > Best,
> > Ying Zhou
>
