For reference, the code that Parquet uses to space out values is in rle_encoding.h [1]. This uses both BitBlockCounter and BitRunReader. BitBlockCounter is faster than BitRunReader, but on micro-benchmarks BitRunReader still provides some benefits, assuming nulls are fairly infrequent.
It is worth noting that this code assumes preallocated arrays (i.e. it doesn't use builders).

[1] https://github.com/apache/arrow/blob/e0a9d0f28affdccb45bf76fde58d0eec1328cd40/cpp/src/arrow/util/rle_encoding.h

On Sun, Oct 18, 2020 at 10:35 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Ying, the code in adapter_util.cc doesn't look right to me unless
> the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes
> where there is a null). We have quite a bit of code in Parquet that
> deals specifically with this issue -- I'm not sure if we have a
> ready-made function that will efficiently append the "compressed"
> values to a builder, but we certainly have all the tools
> you need to do so (e.g. the BitRunReader is helpful here)
>
> On Sun, Oct 18, 2020 at 12:24 PM Ying Zhou <yzhou7...@gmail.com> wrote:
> >
> > Hi,
> >
> > Unlike Arrow, in ORC when an entry is null it is only recorded in the
> > PRESENT stream (equivalent to the validity bitmap in Arrow) but not in any
> > DATA stream for any type, including numeric types. Hence the notNull (aka
> > PRESENT) and data buffers from ORC generally don’t have the same size.
> >
> > However, according to cpp/src/arrow/adapters/orc/adapter_util.cc line 126,
> > it is possible to directly use AppendValues to call
> > builder->AppendValues(source, length, valid_bytes), with builder being an
> > Int64Builder and with source and valid_bytes having different sizes, which
> > doesn’t seem to be reasonable. May I ask whether this is actually valid
> > usage of AppendValues? Thanks!
> >
> > Best,
> > Ying Zhou
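
To make the "spaced" vs. dense distinction in this thread concrete, here is a minimal sketch in plain C++ of what spacing out values means. It is not the Parquet code from [1] and does not use BitRunReader or BitBlockCounter; SpaceOutValues is a hypothetical helper, and it assumes one validity byte per row (nonzero means not null) with 0 used as the placeholder for nulls. It copies each run of consecutive valid rows in one go, which is roughly the idea behind a run-based reader.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Arrow or ORC): expands a dense value
// buffer, which holds entries only for non-null rows, into a spaced buffer
// with one slot per row. Null rows get a placeholder value of 0.
// Assumption: valid_bytes has one byte per row, nonzero meaning "not null".
std::vector<int64_t> SpaceOutValues(const std::vector<int64_t>& dense,
                                    const std::vector<uint8_t>& valid_bytes) {
  // The dense buffer must hold exactly one value per valid row.
  std::size_t valid_count = 0;
  for (uint8_t v : valid_bytes) {
    if (v != 0) ++valid_count;
  }
  if (valid_count != dense.size()) {
    throw std::invalid_argument("dense size does not match number of valid rows");
  }

  std::vector<int64_t> spaced(valid_bytes.size(), 0);
  std::size_t next_dense = 0;  // index of the next unread dense value
  std::size_t row = 0;
  while (row < valid_bytes.size()) {
    if (valid_bytes[row] == 0) {
      ++row;  // null row: keep the placeholder
      continue;
    }
    // Find the end of the current run of valid rows and copy it in bulk.
    std::size_t run_end = row;
    while (run_end < valid_bytes.size() && valid_bytes[run_end] != 0) {
      ++run_end;
    }
    const std::size_t run_length = run_end - row;
    std::copy(dense.begin() + next_dense,
              dense.begin() + next_dense + run_length,
              spaced.begin() + row);
    next_dense += run_length;
    row = run_end;
  }
  return spaced;
}
```

After spacing, the value buffer and valid_bytes have the same length, which is the layout a call such as builder->AppendValues(values, length, valid_bytes) expects.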