Rich-T-kid commented on issue #3520: URL: https://github.com/apache/arrow-rs/issues/3520#issuecomment-4452099993
> > I'm happy to help with this, especially anything related to supporting REE in parquet
>
> Thank you [@albertlockett](https://github.com/albertlockett)
>
> One big missing piece I know if is the ability to read data from Parquet as REE arrays -- even though often the pages are compressed with the REE/Bit packing hybrid
>
> So the idea would be to specify to the parquet reader that we wanted to read data as an REE array and then implement the appropriate decoders to do that
>
> Before doing this it would likely be very helpful to create some sort of benchmark / example where reading an REE array directly would help a lot (maybe a column with very many repeated string values 🤔 )

I'm working on #8016 and realized the read path (parquet → REE) also isn't currently implemented. Reading through this thread, I'm a bit confused about the right approach. Per the [Parquet spec](https://parquet.apache.org/docs/file-format/data-pages/encodings/), RLE for column values is only used for BOOLEAN values and dictionary indices, whereas in Arrow an REE array can wrap values of any type.

What I'm really trying to confirm is the decode flow. Normally, for a column chunk (e.g. strings), you'd decode each value alongside the rep/def levels to handle nulls etc. But since Parquet's RLE for column data only shows up as `RLE_DICTIONARY` on data pages (which requires a dictionary page), I think the REE read path would look more like:

1. Read the dictionary page to get the unique values.
2. Walk the RLE-encoded indices on the data pages, and for each run, append a `(value, run_end)` pair directly into the REE array, rather than materializing every row first.

Is that the right mental model, or is there a different mapping people have in mind?

This feels distinct enough from the write path in #8016 that it probably warrants its own tracking issue; I'm happy to file one.

cc @alamb @albertlockett @Jefffrey @vegarsti
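To make the question concrete, here is a minimal sketch of the two-step flow described in the comment, written in plain Rust without the `arrow` or `parquet` crates. Everything named here is hypothetical: `ReeSketch` is only a stand-in for an Arrow run-end encoded array, and `ree_from_dictionary_runs` assumes the dictionary page has already been decoded into strings and the data-page indices have already been grouped into `(dictionary_index, run_length)` pairs. It ignores nulls (definition levels) and the bit-packed runs of Parquet's RLE/bit-packing hybrid, which a real decoder would need to handle.

```rust
/// Hypothetical stand-in for an Arrow REE array: `run_ends[i]` is the
/// exclusive logical end offset of the run that holds `values[i]`.
#[derive(Debug, Default, PartialEq)]
struct ReeSketch {
    run_ends: Vec<i32>,
    values: Vec<String>,
}

/// Builds run-end encoded output from a decoded dictionary page and
/// `(dictionary_index, run_length)` pairs taken from the data pages.
/// Returns `None` if an index falls outside the dictionary or the
/// total row count does not fit in an `i32` run end.
fn ree_from_dictionary_runs(
    dictionary: &[String],
    runs: &[(usize, usize)],
) -> Option<ReeSketch> {
    let mut out = ReeSketch::default();
    let mut last_index: Option<usize> = None;
    let mut logical_len: usize = 0;

    for &(index, run_len) in runs {
        if run_len == 0 {
            continue; // an empty run contributes no rows
        }
        let value = dictionary.get(index)?;
        logical_len = logical_len.checked_add(run_len)?;
        let run_end = i32::try_from(logical_len).ok()?;

        if last_index == Some(index) {
            // Same value as the previous run (for example, a run split
            // across two data pages): extend the existing run end.
            *out.run_ends.last_mut()? = run_end;
        } else {
            // New value: append one (value, run_end) pair without
            // materializing the individual rows.
            out.values.push(value.clone());
            out.run_ends.push(run_end);
            last_index = Some(index);
        }
    }
    Some(out)
}
```

For a dictionary `["a", "b"]` and runs `[(0, 3), (0, 2), (1, 4)]`, the sketch merges the first two runs and yields values `["a", "b"]` with run ends `[5, 9]`. Merging adjacent runs that share an index is a design choice of this sketch, not something the thread specifies.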
