Re: [C++] AppendValues for numeric types with invalid slots omitted from source

2020-10-20 Thread Ying Zhou
Thanks a lot! After more experimentation with liborc::ColumnVectorBatch this morning I found that its data is actually spaced, so there is no need to write another function to efficiently append “compressed” values. This also simplifies the Arrow2ORC adapter I’m working on.

Re: [C++] AppendValues for numeric types with invalid slots omitted from source

2020-10-19 Thread Micah Kornfield
For reference, the code that Parquet uses to space out values is in rle_decoder.h [1]. This uses both BitBlockCounter and BitRunReader. BitBlockCounter is faster than BitRunReader, but on micro-benchmarks BitRunReader still provides some benefits, assuming nulls are fairly infrequent. It is worth noting

Re: [C++] AppendValues for numeric types with invalid slots omitted from source

2020-10-18 Thread Wes McKinney
Hi Ying, the code in adapter_util.cc doesn't look right to me unless the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes where there is a null). We have quite a bit of code in Parquet that deals specifically with this issue -- I'm not sure if we have a ready-made function that

[C++] AppendValues for numeric types with invalid slots omitted from source

2020-10-18 Thread Ying Zhou
Hi, Unlike Arrow, ORC records a null entry only in the PRESENT stream (equivalent to the validity bitmap in Arrow) and not in any DATA stream, for any type including numeric types. Hence the notNull (aka PRESENT) and data buffers from ORC generally don’t have the same size.
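
To make the size mismatch described above concrete, here is a minimal sketch of "spacing out" densely packed values so that every logical entry, null or not, gets a slot. SpaceOutValues is a hypothetical helper written for illustration; it is not an Arrow, Parquet, or ORC function, and it walks the validity flags one entry at a time rather than using the faster bitmap-run approach mentioned elsewhere in the thread. It assumes validity is given as one char per entry (nonzero meaning valid), and that null slots may hold a zero placeholder.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow or ORC): expands densely packed
// non-null values into a "spaced" buffer with one slot per logical entry.
// not_null[i] != 0 means entry i is valid; null slots keep a zero placeholder.
template <typename T>
std::vector<T> SpaceOutValues(const std::vector<T>& dense,
                              const std::vector<char>& not_null) {
  std::vector<T> spaced(not_null.size(), T{});
  std::size_t next = 0;  // index of the next unread dense value
  for (std::size_t i = 0; i < not_null.size(); ++i) {
    // Copy a value only for valid entries; stop consuming if the dense
    // buffer is shorter than the number of valid flags.
    if (not_null[i] != 0 && next < dense.size()) {
      spaced[i] = dense[next++];
    }
  }
  return spaced;
}
```

With dense values {10, 20, 30} and validity flags {1, 0, 1, 0, 1}, the spaced result has five slots, with the three values at positions 0, 2, and 4. Once the data is in this form, the value buffer and the validity flags have the same length and can be handed together to an append routine that expects spaced input.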