Thanks a lot!
After more experimentation with liborc::ColumnVectorBatch this morning, I found
that its data is actually spaced (null entries keep their placeholder slots), so
there is no need to write another function to efficiently append “compressed”
values. This also simplifies the Arrow2ORC adapter I’m working on.
> On Oct 20, 2020, at
For reference, the code that Parquet uses to space out values is in rle_decoder.h
[1]. This uses both BitBlockCounter and BitRunReader. BitBlockCounter is
faster than BitRunReader but on micro-benchmarks BitRunReader still
provides some benefits assuming nulls are fairly infrequent.
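
To illustrate why run-based reading pays off when nulls are infrequent, here is
a minimal sketch of spacing out values one run at a time. It is not the code in
rle_decoder.h and does not use the real BitRunReader API: GetBit and
SpaceOutByRuns are hypothetical names, the bitmap is assumed to be LSB-first
packed as in Arrow's validity bitmaps, and runs are found bit by bit rather than
a word at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: reads bit i of an LSB-first packed bitmap.
inline bool GetBit(const std::vector<uint8_t>& bitmap, std::size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical sketch: `dense` holds only the non-null values. Each run of
// consecutive valid entries is copied with a single memcpy; null slots keep
// the placeholder value 0.
std::vector<int32_t> SpaceOutByRuns(const std::vector<int32_t>& dense,
                                    const std::vector<uint8_t>& validity,
                                    std::size_t length) {
  std::vector<int32_t> spaced(length, 0);
  std::size_t read = 0;  // next unread position in `dense`
  std::size_t i = 0;
  while (i < length) {
    if (!GetBit(validity, i)) {
      ++i;  // null entry: leave the placeholder in place
      continue;
    }
    const std::size_t run_start = i;
    while (i < length && GetBit(validity, i)) ++i;  // extend the valid run
    const std::size_t run_len = i - run_start;
    assert(read + run_len <= dense.size());
    std::memcpy(spaced.data() + run_start, dense.data() + read,
                run_len * sizeof(int32_t));
    read += run_len;
  }
  return spaced;
}
```

With few nulls the runs are long, so most of the work happens inside memcpy
instead of in per-value branching.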
It is worth noting
Hi Ying, the code in adapter_util.cc doesn't look right to me unless
the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes
where there is a null). We have quite a bit of code in Parquet that
deals specifically with this issue -- I'm not sure if we have a
ready-made function that
Hi,
Unlike in Arrow, in ORC a null entry is recorded only in the PRESENT stream
(equivalent to the validity bitmap in Arrow) and not in any DATA stream, for
any type, including numeric types. Hence the notNull (aka PRESENT) and data
buffers from ORC generally don’t have the same size.
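
To make the size mismatch concrete, here is a minimal sketch of expanding such a
dense data buffer into a spaced one with one slot per entry. SpaceOutValues is a
hypothetical helper, not an Arrow or ORC function, and it assumes notNull holds
one byte per entry (nonzero means present) and that the value type is int64_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow or ORC API): `dense` holds only the
// non-null values, in order. The result has one slot per entry, with
// `placeholder` written wherever not_null[i] is zero.
std::vector<int64_t> SpaceOutValues(const std::vector<int64_t>& dense,
                                    const std::vector<char>& not_null,
                                    int64_t placeholder = 0) {
  std::vector<int64_t> spaced(not_null.size(), placeholder);
  std::size_t next = 0;  // index of the next unread dense value
  for (std::size_t i = 0; i < not_null.size(); ++i) {
    if (not_null[i]) {
      assert(next < dense.size());
      spaced[i] = dense[next++];
    }
  }
  // Every dense value should have been consumed by a present entry.
  assert(next == dense.size());
  return spaced;
}
```

For five entries of which three are present, the dense buffer has three values
while notNull has five bytes; the spaced result has five slots, which matches
what an Arrow array expects next to its validity bitmap.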