ianmcook commented on PR #569: https://github.com/apache/arrow-site/pull/569#issuecomment-2576208565
> The intro of the blog post points to ser/de as a benefit to the arrow format. I'm curious if a reference exists (and can be, or will eventually be, added) that shows a similar comparison for arrow vs parquet. Mostly in the sense that storage sits in a mechanically similar spot (but the serialization and deserialization have an arbitrarily large time gap between their execution).

Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage: mostly because their higher compression ratios mean lower cost to store data for long periods, which easily outweighs the tradeoff of higher ser/de overhead.

> I realize it's a bit of a scope creep, but I think the comparison of ser/de time and compression size would be really valuable to readers (and I think some naive numbers wouldn't be very time consuming to get?)

Agreed. I'd like to include something like this in the second post too, comparing time and size for Arrow IPC vs. Parquet, ORC, Avro, CSV, and JSON. But there are so many variables at play (network speed, CPU and memory specs, encoding and compression options, how optimized the implementation is, whether or not to aggressively downcast based on the range of values in the data, which column types to use in the example, ...) that I expect it will be impossible to claim that any results we present are representative. So the main message might end up being "YMMV", and we will probably want to provide a repo with some tools that readers can use to experiment for themselves.
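As a rough illustration of the kind of tool such a repo might contain, here is a minimal sketch using pyarrow that times a write and a read and reports the file size for Arrow IPC vs. Parquet. The synthetic table, file paths, and default compression settings here are all placeholder assumptions, not anything from the posts; as noted above, results will vary widely with the data and configuration.

```python
import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical synthetic table; real results depend heavily on the
# column types, value ranges, and size of your actual data.
n = 1_000_000
table = pa.table({
    "id": pa.array(range(n), type=pa.int64()),
    "value": pa.array([float(i % 1000) for i in range(n)], type=pa.float64()),
})

def bench_arrow_ipc(table, path="data.arrow"):
    # Write the table as an Arrow IPC file and time it.
    start = time.perf_counter()
    with pa.OSFile(path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)
    write_s = time.perf_counter() - start
    # Read it back (memory-mapped) and time it.
    start = time.perf_counter()
    with pa.memory_map(path, "rb") as source:
        pa.ipc.open_file(source).read_all()
    read_s = time.perf_counter() - start
    return write_s, read_s, os.path.getsize(path)

def bench_parquet(table, path="data.parquet"):
    # Write with pyarrow's default Parquet settings (snappy compression).
    start = time.perf_counter()
    pq.write_table(table, path)
    write_s = time.perf_counter() - start
    start = time.perf_counter()
    pq.read_table(path)
    read_s = time.perf_counter() - start
    return write_s, read_s, os.path.getsize(path)

for name, fn in [("Arrow IPC", bench_arrow_ipc), ("Parquet", bench_parquet)]:
    write_s, read_s, size = fn(table)
    print(f"{name}: write {write_s:.3f}s, read {read_s:.3f}s, {size / 1e6:.1f} MB")
```

Even a toy harness like this exposes several of the knobs listed above (compression codec, encoding options, column types), which is exactly why any single set of published numbers would be hard to defend as representative.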
