Thanks a lot, Owen

Very interesting, and a bit different from parquet, where pages respect row
boundaries.

My understanding is that this favors smaller and more stable memory usage
(parquet pages may have arbitrarily large uncompressed data) at the cost
of potentially slower deserialization.

For example, decoding a pre-loaded f32 column is very fast via pointer
arithmetic and "load 4 bytes little endian" ops, which compilers usually
optimize to SIMD. With streaming, the "load 4 bytes" may hit a new
compression unit and require a new decompression round (e.g. 2 bytes are
available, but we need the next 2 bytes). This makes it more difficult to
produce SIMD code. But maybe this isn't so much of an issue, given that
most types are highly encoded anyway (f32 and f64 seem to be the
exception)?
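
To make the concern concrete, here is a minimal sketch (Rust, names made
up by me) of the two situations I have in mind:

    /// Fast path: the whole column is already in one contiguous buffer,
    /// so every f32 is a fixed-offset "load 4 bytes little endian".
    fn decode_contiguous(data: &[u8]) -> Vec<f32> {
        data.chunks_exact(4)
            .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect()
    }

    /// Streaming path: decompressed chunks arrive one at a time, and a
    /// value may straddle a chunk boundary, so leftover bytes must be
    /// carried over to the next chunk before decoding can continue.
    fn decode_streaming<'a>(chunks: impl Iterator<Item = &'a [u8]>) -> Vec<f32> {
        let mut out = Vec::new();
        let mut carry: Vec<u8> = Vec::new();
        for chunk in chunks {
            carry.extend_from_slice(chunk);
            let complete = carry.len() / 4 * 4;
            out.extend(
                carry[..complete]
                    .chunks_exact(4)
                    .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]])),
            );
            carry.drain(..complete); // e.g. 2 bytes left, waiting for 2 more
        }
        out
    }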

Anyways, thank you very much for the in-depth explanation. It makes a lot
of sense.

Best,
Jorge


On Wed, Jul 27, 2022 at 12:56 AM Owen O'Malley <owen.omal...@gmail.com>
wrote:

> Compression in ORC not only crosses rows, but also the row groups (every
> 10k rows) that are the index points. Look at the ORC specification (
> https://orc.apache.org/specification/ORCv1/) on Compression. Compression
> does not cross stripe boundaries, because that would violate the
> constraint that you can read each stripe independently. The expected case
> is to stream through all of the rows in a stripe, so it is optimized for
> improving compression.
>
> Note that the constraints also don't run in the other direction. A single
> value may cross several compression chunks.
>
> All of the kinds of streams use the generic compression (zlib, zstd,
> snappy, etc.) the same way. The generic compression is the last stage of
> the process.
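>
> Roughly, each compressed stream is framed as a sequence of chunks, each
> preceded by a small header giving its length and whether the body was
> left uncompressed; see the spec for the exact layout. A sketch of parsing
> that header (a 3-byte little-endian value of chunk_length * 2 +
> is_original, with made-up names):
>
>     /// Parse one compression chunk header (sketch, not the reader's API).
>     fn chunk_header(header: [u8; 3]) -> (usize, bool) {
>         let raw = u32::from_le_bytes([header[0], header[1], header[2], 0]);
>         let is_original = raw & 1 == 1;   // chunk body stored uncompressed
>         let length = (raw >> 1) as usize; // length of the chunk body in bytes
>         (length, is_original)
>     }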
>
> You should probably look at how seek is done using the indexes. To seek to
> the start of a row group, we keep a list of integers for each stream. For a
> compressed integer stream, the index will have three values:
>
>    - <compressed byte offset from start of stream of a compression chunk>
>    - <uncompressed byte offset from compression chunk of rle block>
>    - <rle offset within rle block>
>
> So to jump to row 10000, you'd use the first number to find how many
> compressed bytes to jump over, and you'd start decompressing from there.
> From the decompressed bytes, you'd skip over the number of bytes given by
> the second value and start the rle decompression. Finally, you'd use the
> third number to skip over that many values in the rle.
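>
> In code, that seek would look something like this (a sketch with made-up
> names; the compression and rle decoders are stand-ins, not the actual
> reader's API):
>
>     /// Jump to a row-group start from the three index values.
>     fn seek_row_group(
>         stream: &[u8],
>         compressed_offset: usize,   // 1) compressed bytes to skip over
>         uncompressed_offset: usize, // 2) decompressed bytes to skip over
>         rle_offset: usize,          // 3) values to skip within the rle run
>         decompress: impl Fn(&[u8]) -> Vec<u8>,
>         rle_decode: impl Fn(&[u8]) -> Vec<i64>,
>     ) -> Vec<i64> {
>         let decompressed = decompress(&stream[compressed_offset..]);
>         let values = rle_decode(&decompressed[uncompressed_offset..]);
>         values[rle_offset..].to_vec() // first element is the row-group start
>     }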
>
> .. Owen
>
>
>
>
> On Tue, Jul 26, 2022 at 6:10 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > In variable-length types like strings, there is the stream kind "Data"
> > containing the concatenated values. When decoding to e.g. a vector of
> > strings, is there any constraint over whether compression breaks the
> values
> > boundaries?
> >
> > I.e. say we have a string column with 2 rows of 100 MB each, [r1, r2]
> > (which are concatenated in "Data"). Can we end up with a compression
> > where r2 is split between two compression chunks?
> >
> > Is this also valid for the stream kind "Length"?
> >
> > More broadly, the question is whether, when deserializing, we need to
> > "concatenate" bytes from parts of the compressed items, or whether we
> > can assume that compression respects row boundaries.
> >
> > Best,
> > Jorge
> >
>
