Rich-T-kid commented on issue #3520:
URL: https://github.com/apache/arrow-rs/issues/3520#issuecomment-4452099993

   > > I'm happy to help with this, especially anything related to supporting 
REE in parquet
   > 
   > Thank you [@albertlockett](https://github.com/albertlockett)
   > 
   > One big missing piece I know of is the ability to read data from Parquet 
as REE arrays -- even though often the pages are compressed with the REE/Bit 
packing hybrid
   > 
   > So the idea would be to specify to the parquet reader that we wanted to 
read data as an REE array and then implement the appropriate decoders to do that
   > 
   > Before doing this it would likely be very helpful to create some sort of 
benchmark / example where reading an REE array directly would help a lot (maybe 
a column with very many repeated string values 🤔 )
   
   I'm working on #8016 and realized that the read path (Parquet → REE) isn't 
implemented yet either.
   
   Reading through this thread, I'm a bit confused about the right approach. Per 
the [Parquet 
spec](https://parquet.apache.org/docs/file-format/data-pages/encodings/), RLE 
is only used for BOOLEAN values and dictionary indices — whereas in Arrow, an 
REE array can wrap values of any type.
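
   To make the Arrow side concrete, here is a minimal sketch of what run-end 
encoding does to a column of any comparable type. It is plain std Rust and 
deliberately does not use the arrow-rs `RunArray` API; the function name 
`run_end_encode` is made up for illustration.

```rust
/// Run-end encode a slice of logical values into `(run_ends, values)`.
/// Illustrative only: plain std Rust, not the arrow-rs `RunArray` API.
/// Each entry in `run_ends` is the exclusive end row of the matching value's run.
fn run_end_encode<T: PartialEq + Clone>(logical: &[T]) -> (Vec<i32>, Vec<T>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, v) in logical.iter().enumerate() {
        if values.last() == Some(v) {
            // Same value as the current run: extend the run by moving its end.
            *run_ends.last_mut().unwrap() = (i + 1) as i32;
        } else {
            // A new value starts a new run that (so far) ends after row `i`.
            values.push(v.clone());
            run_ends.push((i + 1) as i32);
        }
    }
    (run_ends, values)
}
```

   For example, the logical column `["a", "a", "a", "b", "b"]` becomes 
run ends `[3, 5]` with values `["a", "b"]`. Note that this version starts from 
fully materialized rows, which is exactly the step a direct Parquet read path 
would want to skip.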
   
   What I'm really trying to confirm is the decode flow. Normally for a column 
chunk (e.g. strings), you'd decode each value alongside the rep/def levels to 
handle nulls etc. But since Parquet's RLE for column data only shows up as 
`RLE_DICTIONARY` on data pages (which requires a dictionary page), I think the 
REE read path would look more like:
   
   1. Read the dictionary page to get the unique values.
   2. Walk the RLE-encoded indices on the data pages, and for each run, append 
a `(value, run_end)` pair directly into the REE array — rather than 
materializing every row first.
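
   The two steps above could be sketched roughly as follows. This is only a 
model of the idea in plain std Rust, not the parquet or arrow-rs API: 
`IndexRun` and `runs_to_ree` are hypothetical names, the dictionary is assumed 
to be already decoded into strings, and nulls (definition levels), page 
boundaries and run-end overflow are ignored. Bit-packed groups from the 
RLE/bit-packing hybrid would have to be fed in as runs of length 1.

```rust
/// One decoded RLE run from a data page: a dictionary index repeated `len` times.
/// Hypothetical type for illustration; not part of the parquet crate.
struct IndexRun {
    dict_index: usize,
    len: usize,
}

/// Build REE `(run_ends, values)` directly from the dictionary and the index
/// runs, without expanding the runs into individual rows.
/// Assumes every `dict_index` is in bounds and the column has no nulls.
fn runs_to_ree(dictionary: &[&str], runs: &[IndexRun]) -> (Vec<i32>, Vec<String>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<String> = Vec::new();
    let mut last_index: Option<usize> = None;
    let mut row_count: usize = 0;

    for run in runs {
        if run.len == 0 {
            continue; // An empty run contributes no rows.
        }
        row_count += run.len;
        if last_index == Some(run.dict_index) {
            // Adjacent runs with the same index collapse into one REE run.
            *run_ends.last_mut().unwrap() = row_count as i32;
        } else {
            // Step 1 supplies the value; step 2 supplies the run end.
            values.push(dictionary[run.dict_index].to_string());
            run_ends.push(row_count as i32);
            last_index = Some(run.dict_index);
        }
    }
    (run_ends, values)
}
```

   With the dictionary `["a", "b"]` and runs `(0, 3), (1, 2), (1, 4), (0, 1)`, 
this yields run ends `[3, 9, 10]` and values `["a", "b", "a"]`. Merging 
adjacent runs is a design choice here: the encoder may split one logical run 
across several RLE runs, so without merging the result would still be correct 
but could repeat the same value in consecutive REE runs.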
   
   Is that the right mental model, or is there a different mapping people have 
in mind?
   
   This feels distinct enough from the write path in #8016 that it probably 
warrants its own tracking issue. I'm happy to file one.
   
   cc @alamb @albertlockett @Jefffrey @vegarsti 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.