ianmcook commented on PR #569: https://github.com/apache/arrow-site/pull/569#issuecomment-2576208565
> The intro of the blog post points to ser/de as a benefit to the arrow format. I'm curious if a reference exists (and can be, or will eventually be, added) that shows a similar comparison for arrow vs parquet. Mostly in the sense that storage sits in a mechanically similar spot (but the serialization and deserialization have an arbitrarily large time gap between their execution).

Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage: mostly because their higher compression ratios mean lower cost to store data for long periods, which easily outweighs the tradeoff of higher ser/de overhead.

> I realize it's a bit of a scope creep, but I think the comparison of ser/de time and compression size would be really valuable to readers (and I think some naive numbers wouldn't be very time consuming to get?)

Agreed. I'd like to include something like this in the second post too, comparing time and size for Arrow IPC vs. Parquet, ORC, Avro, CSV, and JSON. But there are so many variables at play (network speed, CPU and memory specs, encoding and compression options, how optimized the implementation is, whether or not to aggressively downcast based on the range of values in the data, which column types to use in the example, ...) that I expect it will be impossible to claim that any results we present are representative. So the main message might end up being "YMMV", and we will probably want to provide a repo with some tools that readers can use to experiment for themselves.
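As a rough illustration of the kind of tool such a repo might contain, here is a minimal sketch using pyarrow that times a write and a read and reports the file size for Arrow IPC vs. Parquet. The synthetic table, file paths, and default compression settings here are all placeholder assumptions, not anything from the posts; as noted above, results will vary widely with the data and configuration.

```python
import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical synthetic table; real results depend heavily on the
# column types, value ranges, and size of your actual data.
n = 1_000_000
table = pa.table({
    "id": pa.array(range(n), type=pa.int64()),
    "value": pa.array([float(i % 1000) for i in range(n)], type=pa.float64()),
})

def bench_arrow_ipc(table, path="data.arrow"):
    # Write the table as an Arrow IPC file and time it.
    start = time.perf_counter()
    with pa.OSFile(path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)
    write_s = time.perf_counter() - start
    # Read it back (memory-mapped) and time it.
    start = time.perf_counter()
    with pa.memory_map(path, "rb") as source:
        pa.ipc.open_file(source).read_all()
    read_s = time.perf_counter() - start
    return write_s, read_s, os.path.getsize(path)

def bench_parquet(table, path="data.parquet"):
    # Write with pyarrow's default Parquet settings (snappy compression).
    start = time.perf_counter()
    pq.write_table(table, path)
    write_s = time.perf_counter() - start
    start = time.perf_counter()
    pq.read_table(path)
    read_s = time.perf_counter() - start
    return write_s, read_s, os.path.getsize(path)

for name, fn in [("Arrow IPC", bench_arrow_ipc), ("Parquet", bench_parquet)]:
    write_s, read_s, size = fn(table)
    print(f"{name}: write {write_s:.3f}s, read {read_s:.3f}s, {size / 1e6:.1f} MB")
```

Even a toy harness like this exposes several of the knobs listed above (compression codec, encoding options, column types), which is exactly why any single set of published numbers would be hard to defend as representative.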
