Hi Phil,

I can only make an educated guess here: I would think that in the design of v2 pages it was indeed expected that RLE compresses the levels well enough on its own, and that not having to decompress them would make certain queries faster. That's why v2 doesn't compress the levels.
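To make that trade-off a bit more concrete, here is a rough back-of-the-envelope sketch (my own illustration, not the actual parquet-java encoder, and the byte counts are approximations) of how the RLE/bit-packed hybrid encoding sizes a repetition-level stream with max level 1. When the level is constant across a whole page it collapses into a single run of a few bytes, which is the case the v2 design seems to optimise for; when every record restarts the pattern, as with the label map you describe, it works out to roughly 4 bytes per row that no amount of RLE can merge across rows.

public class LevelSizeEstimate {

  // Size of one ULEB128 varint.
  static int varIntSize(long v) {
    int n = 1;
    while ((v >>>= 7) != 0) {
      n++;
    }
    return n;
  }

  // All repetition levels in the page share one value: the hybrid encoding
  // collapses this into a single RLE run (varint header + one value byte
  // at bit width 1).
  static long constantLevels(long numValues) {
    long runHeader = numValues << 1;   // RLE run header is (run length << 1)
    return varIntSize(runHeader) + 1;  // + 1 byte for the repeated level
  }

  // Every record restarts the pattern: one level 0 followed by the 1s for the
  // remaining map entries. Per the observation in the quoted message (and its
  // debug output), this costs about 4 bytes per row, and those bytes never
  // collapse across rows.
  static long perRecordPattern(long rows) {
    return rows * 4;
  }

  public static void main(String[] args) {
    // Numbers in the spirit of the quoted example: ~3773 rows, 26 map entries each.
    System.out.println(constantLevels(3773L * 26)); // a handful of bytes
    System.out.println(perRecordPattern(3773));     // ~15092 bytes, matching the v2 debug line
  }
}

That per-row cost is exactly what whole-page compression recovers in v1, since the same 4-byte pattern repeats thousands of times within a page.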
But you are of course right that there are pathological cases, which are actually not too uncommon, where RLE doesn't compress well on its own while a general-purpose codec like ZSTD would compress the levels very well. So one could argue that v2 never compressing the levels is not a good design for this case; it would be better if v2 could opt into compressing them (there is a flag indicating whether the data is compressed, but sadly none for the levels). But v2 is what it is and therefore doesn't compress levels.

So I think your train of thought is correct and valid: for your specific use case, v1 compresses far better than v2 and might therefore be the better choice. Note, though, that v1 is the default for most writers, so using v2 in the first place is still somewhat of a niche thing. It's not so much "that's why v1 pages are still supported" as "there is v2, but most people write v1 anyway". There is nothing wrong with writing v1 pages for your use case; they won't go away anytime soon and will likely be supported by readers forever.

Cheers,
Jan

On Mon, Apr 7, 2025 at 19:16, Phil Langdale <[email protected]> wrote:

> Hi everyone,
>
> This is the first time I've had to look deeply into Parquet internals, so
> let me apologise if this has been discussed elsewhere in the past. I've
> tried to do my due diligence in terms of searching online, but I haven't
> found a clear answer.
>
> I'm currently trying to define a suitable Parquet schema for storing
> Prometheus-originated time-series data, focusing on space efficiency. I
> have what looks like a highly effective solution using V1 data pages, but
> with V2, the lack of compression of repetition levels results in a massive
> loss of comparative efficiency. I'm trying to understand whether this
> behaviour was considered in the original V2 design, and whether I'm
> missing something in how I'm trying to use the format. Is continuing to
> use V1 data pages the correct solution for my use-case?
>
> Here are my specifics:
>
> * I am exporting data from Prometheus that is already well sorted. If it
>   wasn't, I would do the sorting myself. This ensures that metrics are
>   initially sorted by name, then by the set of labels, then by timestamp.
>   This should lead to best-case data for encoding and compression, and my
>   results support this.
>
> I have the following schema:
>
> message spark_schema {
>   required binary metric_name (STRING);
>   required group labels (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   required double value;
>   required int64 timestamp (TIMESTAMP(MILLIS,true));
> }
>
> and I'm explicitly forcing the use of DELTA_BINARY_PACKED for timestamps
> and BYTE_STREAM_SPLIT for the double values.
>
> For metric names and label keys/values, I'm using normal dictionary
> encoding, and the cardinality of these is low in the sample data I'm
> working with. So far so good. In terms of labels, each sample has a few
> tens of labels (e.g. 26 in one of the worked examples below). Due to the
> sorting, each data page will typically be made up of rows where the set
> of labels is identical for every row. This means that the repetition
> level sequence for each row will also look identical. And so, although
> RLE is in use and leads to the repetitionLevels for a given row taking up
> 4 bytes, this 4-byte sequence is then repeated ~3000 times. With v1 data
> pages, the whole-page compression will naturally handle this incredibly
> well, as it's a best-case scenario.
> But with v2 data pages, this block is left uncompressed and ends up being
> the largest contributor to the final file size, leading to files that are
> 15x bigger than with v1 pages.
>
> Here is some data.
>
> With v1 pages
> -------------------
>
> Meta:
>
> Row group 0:  count: 2741760  0.58 B records  start: 4  total(compressed): 1.526 MB  total(uncompressed): 126.067 MB
> ----------------------------------------------------------------------
>                         type    encodings  count     avg size  nulls
> metric_name             BINARY  Z _ R      2741760   0.00 B    0
> labels.key_value.key    BINARY  Z _ R      57358080  0.00 B    0
> labels.key_value.value  BINARY  Z _ R      57358080  0.01 B    0
> value                   DOUBLE  Z          2741760   0.32 B    0
> timestamp               INT64   Z D        2741760   0.03 B    0
>
> and some debug data for a page:
>
> ColumnChunkPageWriteStore: writePageV1: compressor: ZSTD, row count: 3787,
> uncompressed size: 69016, compressed size: 267
>
> With v2 pages
> -------------------
>
> Meta:
>
> Row group 0:  count: 2741760  8.58 B records  start: 4  total(compressed): 22.427 MB  total(uncompressed): 126.167 MB
> ----------------------------------------------------------------------
>                         type    encodings  count     avg size  nulls
> metric_name             BINARY  Z _ R      2741760   0.00 B    0
> labels.key_value.key    BINARY  Z _ R      57358080  0.20 B    0
> labels.key_value.value  BINARY  Z _ R      57358080  0.20 B    0
> value                   DOUBLE  Z          2741760   0.32 B    0
> timestamp               INT64   Z D        2741760   0.03 B    0
>
> and
>
> ColumnChunkPageWriteStore: writePageV2: compressor: ZSTD, row count: 3773,
> uncompressed size: 51612, compressed size: 264, repetitionLevels size
> 15092, definitionLevels size 4
>
> So we can see that all the extra space is due to uncompressed repetition
> levels. Is this use-case considered pathological? I'm not sure how, but
> maybe there's something else that will trip me up down the line that you
> can tell me about. Similarly, maybe I'll discover that the decompression
> overhead of v1 is so painful that this is unusable.
>
> In the end, is this simply a case of "that's why v1 pages are still
> supported" and I move ahead with that, or should it be possible for me to
> use v2 pages, and something else is going wrong?
>
> Thank you for your insights!
>
> --phil
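For reference, a minimal sketch of how one could keep writing v1 pages with ZSTD through parquet-java's example writer. The output path, metric name, and label values are made up; Hadoop setup is omitted; and the per-column DELTA_BINARY_PACKED / BYTE_STREAM_SPLIT overrides mentioned in the quoted message are left to whatever mechanism is already in place.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteV1Pages {
  public static void main(String[] args) throws Exception {
    // Same schema as in the quoted message.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message spark_schema {\n"
      + "  required binary metric_name (STRING);\n"
      + "  required group labels (MAP) {\n"
      + "    repeated group key_value {\n"
      + "      required binary key (STRING);\n"
      + "      optional binary value (STRING);\n"
      + "    }\n"
      + "  }\n"
      + "  required double value;\n"
      + "  required int64 timestamp (TIMESTAMP(MILLIS,true));\n"
      + "}");

    try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("metrics.parquet"))                            // made-up output path
        .withType(schema)
        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0)  // v1 data pages
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build()) {

      // One made-up sample; real code would stream the sorted Prometheus export.
      SimpleGroupFactory factory = new SimpleGroupFactory(schema);
      Group row = factory.newGroup();
      row.add("metric_name", "http_requests_total");
      Group kv = row.addGroup("labels").addGroup("key_value");
      kv.add("key", "instance");
      kv.add("value", "host-1:9100");
      row.add("value", 1.0);
      row.add("timestamp", 1744045000000L);
      writer.write(row);
    }
  }
}

Switching PARQUET_1_0 to PARQUET_2_0 is what flips the writer between the two page formats compared above.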
