Re: [I] What is the best case performance for the new thrift-remodel / custom thrift parser [arrow-rs]

via GitHub Sat, 18 Oct 2025 16:38:39 -0700


alamb commented on issue #8441:
URL: https://github.com/apache/arrow-rs/issues/8441#issuecomment-3331070043


   > I do feel that the lack of random access is a bit of a red herring. I 
don't know offhand what changes Alkis made to the metadata for the flatbuffer 
test, but if it's the same form as the current metadata I don't see how he'd be 
able to skip decoding the whole thing. 
   
   Yes, I feel the same way.
   
   > I'd also note that the flatbuffers parser totally ignores the page indexes 
AFAICT, and also omits the most expensive structures in the current metadata, 
so I think the 10X number is a bit unfair to bandy about. I'm sure we could 
rewrite the current metadata structures in a more friendly to parse way and see 
similarly spectacular improvements.
   
   Yeah, this is what I would like to try and do -- cook up the best possible 
case. 
   
   For example, I am imagining 10k string columns with column chunk statistics, 
and then using our custom thrift parser to skip allocating strings. 
   
   So for a file with 100 row groups, skipping the stats is going to save 100 * 
10k = 1M allocations. I have to imagine we can see similarly spectacular 
numbers by doing that much less work 😆 
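
To make the arithmetic above concrete, here is a toy sketch of the idea. It is not the arrow-rs thrift parser: `Cursor`, `read_owned`, `skip`, `encode`, `decode_all`, and `skip_all` are all hypothetical names, and the input is assumed to be a flat run of varint-length-prefixed byte strings (similar in spirit to how the thrift compact protocol encodes binary fields), with one statistics value per column chunk.

```rust
/// Hypothetical cursor over a buffer of length-prefixed byte strings.
struct Cursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Cursor { buf, pos: 0 }
    }

    /// Read an unsigned LEB128 varint (used here as the length prefix).
    fn read_varint(&mut self) -> usize {
        let mut value = 0usize;
        let mut shift = 0u32;
        loop {
            let byte = self.buf[self.pos];
            self.pos += 1;
            value |= ((byte & 0x7f) as usize) << shift;
            if byte & 0x80 == 0 {
                return value;
            }
            shift += 7;
        }
    }

    /// Decode one value into an owned Vec: a heap allocation per non-empty value.
    fn read_owned(&mut self) -> Vec<u8> {
        let len = self.read_varint();
        let out = self.buf[self.pos..self.pos + len].to_vec();
        self.pos += len;
        out
    }

    /// Skip one value: read the length and advance the cursor, nothing is copied.
    fn skip(&mut self) {
        let len = self.read_varint();
        self.pos += len;
    }
}

/// Encode `count` copies of `value` (assumed shorter than 128 bytes,
/// so the length prefix fits in a single varint byte).
fn encode(value: &[u8], count: usize) -> Vec<u8> {
    assert!(value.len() < 128);
    let mut buf = Vec::with_capacity(count * (value.len() + 1));
    for _ in 0..count {
        buf.push(value.len() as u8);
        buf.extend_from_slice(value);
    }
    buf
}

/// Materialize every value; returns how many owned values were created.
fn decode_all(buf: &[u8]) -> usize {
    let mut cur = Cursor::new(buf);
    let mut owned = 0;
    while cur.pos < buf.len() {
        let v = cur.read_owned();
        std::hint::black_box(&v); // keep the copy from being optimized away
        owned += 1;
    }
    owned
}

/// Walk the same buffer but skip every value; returns how many were skipped.
fn skip_all(buf: &[u8]) -> usize {
    let mut cur = Cursor::new(buf);
    let mut skipped = 0;
    while cur.pos < buf.len() {
        cur.skip();
        skipped += 1;
    }
    skipped
}

fn main() {
    let row_groups = 100;
    let columns = 10_000;
    // One statistics value per column chunk, as assumed above.
    let buf = encode(b"some-stat-value", row_groups * columns);
    println!("owned values created: {}", decode_all(&buf));
    println!("values skipped:       {}", skip_all(&buf));
}
```

Both passes visit the same 100 * 10k values; the difference is that `decode_all` creates an owned buffer for each one while `skip_all` only moves the cursor. Real column chunk statistics can carry more than one value per chunk, so the count here is only a lower-bound illustration.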


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

