Thanks a lot! After more experimentation with liborc::ColumnVectorBatch this morning I found that its data is actually spaced (null entries still occupy a placeholder slot), so there is no need to write another function to efficiently append “compressed” values. This also simplifies the Arrow2ORC adapter I’m working on.
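For anyone following the thread, here is a minimal sketch of what "spacing out" means and what would have been needed had the batch been dense. It uses only the standard library; `SpaceOut` is a hypothetical helper name, not an Arrow or liborc function, and it assumes one validity byte per row and fills null slots with 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow or liborc): expands densely packed
// non-null values into a "spaced" buffer that has one slot per row.
// valid[i] != 0 means row i is non-null; null slots are filled with 0.
std::vector<int64_t> SpaceOut(const std::vector<int64_t>& dense,
                              const std::vector<uint8_t>& valid) {
  std::vector<int64_t> spaced(valid.size(), 0);
  std::size_t next = 0;  // index of the next unread dense value
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (valid[i]) {
      assert(next < dense.size());  // dense must hold one value per valid row
      spaced[i] = dense[next++];
    }
  }
  return spaced;
}
```

With dense values {10, 20, 30} and validity bytes {1, 0, 1, 1, 0}, the result has five slots with the three values at rows 0, 2 and 3. Since the batch is already laid out this way, the values and validity bytes have the same length and can be handed to AppendValues as they are.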
> On Oct 20, 2020, at 12:55 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> For reference, the code that Parquet uses to space out values is in
> rle_encoding.h [1]. This uses both BitBlockCounter and BitRunReader.
> BitBlockCounter is faster than BitRunReader, but on micro-benchmarks
> BitRunReader still provides some benefits assuming nulls are fairly
> infrequent.
>
> It is worth noting that this code assumes preallocated arrays (i.e. it
> doesn't use builders).
>
> [1]
> https://github.com/apache/arrow/blob/e0a9d0f28affdccb45bf76fde58d0eec1328cd40/cpp/src/arrow/util/rle_encoding.h
>
> On Sun, Oct 18, 2020 at 10:35 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Ying, the code in adapter_util.cc doesn't look right to me unless
>> the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes
>> where there is a null). We have quite a bit of code in Parquet that
>> deals specifically with this issue -- I'm not sure if we have a
>> ready-made function that will efficiently append the "compressed"
>> values to a builder, but we certainly have all the tools
>> you need to do so (e.g. the BitRunReader is helpful here)
>>
>> On Sun, Oct 18, 2020 at 12:24 PM Ying Zhou <yzhou7...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Unlike in Arrow, in ORC a null entry is only recorded in the
>>> PRESENT stream (equivalent to the validity bitmap in Arrow) and not in any
>>> DATA stream, for any type including numeric types. Hence the notNull (aka
>>> PRESENT) and data buffers from ORC generally don’t have the same size.
>>>
>>> However, according to cpp/src/arrow/adapters/orc/adapter_util.cc
>>> line 126, it is possible to call
>>> builder->AppendValues(source, length, valid_bytes)
>>> with builder being an Int64Builder and with source and valid_bytes having
>>> different sizes, which doesn’t seem reasonable. May I ask whether this
>>> is actually valid usage of AppendValues? Thanks!
>>>
>>>
>>> Best,
>>> Ying Zhou
>>