I suppose you already know this, but you can use public datasets as a source of real-world Parquet footers.
For example, the GeoParquet website lists a couple data providers:
https://geoparquet.org/

Regards

Antoine.


On Sun, 18 Aug 2024 14:20:28 +0200
Alkis Evlogimenos <alkis.evlogime...@databricks.com.INVALID> wrote:
> The biggest thing about benchmarks is the data itself. This is why I added
> this binary https://github.com/apache/arrow/pull/42174 to help users donate
> their slow to parse footers for benchmarking purposes.
>
> On Sat, Aug 17, 2024 at 11:43 AM Neelaksh Singh <neelaks...@gmail.com>
> wrote:
>
> > One more point that I would like to mention here is that we have put a
> > lot of effort into REPRODUCIBILITY for this benchmark repo. There have
> > been a lot of great benchmarking efforts that have been done as part of
> > this discussion. However, one limitation is that many of the experiments
> > have not included code or take a fair bit of effort to setup. We've made
> > strong efforts here using Docker and vcpkg to make the setup for these
> > benchmarks as transparent and reproducible as possible. Our hope is that
> > this will provide a useful contribution for others to either reproduce
> > many of the results that have been discussed or easily run their own
> > experiments when trying alternatives. We hope this will help facilitate
> > the discussion with easily shareable experiments.
> >
> > On Thu, Aug 15, 2024, 9:21 PM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > > Alkis, can you elaborate how you brought the size of Flatbuffers
> > > > down?
> > >
> > > I have the internal PR rewritten in separate commits with all the
> > > steps. I plan to publish it to arrow repo as soon as possible. The
> > > heavy things in metadata are statistics, offsets, path_in_schema. It
> > > takes ~10 steps to cut the size down, each of which takes a good chunk
> > > of the original size.
> > >
> > > On Thu, Aug 15, 2024 at 2:43 PM Jan Finis
> > > <jpfinis-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > >
> > > > I guess most close source implementations have done these
> > > > optimizations already, it has just not been done in the open source
> > > > versions. E.g., we switched to a custom-built thrift runtime using
> > > > pool allocators and string views instead of copied strings a few
> > > > years ago, seeing comparable speed-ups. The C++ thrift library is
> > > > just horribly inefficient.
> > > >
> > > > I agree with Alkis though that there are some gains that can be
> > > > achieved by optimizing, but the format has inherent drawbacks.
> > > > Flatbuffers is indeed more efficient but at the cost of increased
> > > > size.
> > > > Alkis, can you elaborate how you brought the size of Flatbuffers
> > > > down?
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > On Thu, 15 Aug 2024 at 13:50, Andrew Lamb <andrewlam...@gmail.com>
> > > > wrote:
> > > >
> > > > > I don't disagree that flatbuffers would be faster than thrift
> > > > > decoding
> > > > >
> > > > > I am trying to say that with software engineering only (no change
> > > > > to the format) it is likely possible to increase parquet thrift
> > > > > metadata parsing speed by 4x.
> > > > >
> > > > > This is not 25x of course, but 4x is non trivial.
> > > > >
> > > > > The fact that no one yet has bothered to invest the time to get
> > > > > the 4x yet in open source implementations of parquet suggests to
> > > > > me that the parsing time may not be as critical an issue as we
> > > > > think
> > > > >
> > > > > Andrew
> > > > >
> > > > > On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos
> > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl...@public.gmane.org>
> > > > > wrote:
> > > > >
> > > > > > The difference in parsing speed between thrift and flatbuffer is >25x.
> > > > > > Thrift has some fundamental design decisions that make decoding
> > > > > > slow:
> > > > > > 1. the thrift compact protocol is very data dependent: uleb
> > > > > > encoding for integers, field ids are deltas from previous. The
> > > > > > data dependencies disallow pipelining of modern cpus
> > > > > > 2. object model does not have a way to use arenas to avoid many
> > > > > > allocations of objects
> > > > > > If we keep thrift, we can potentially get 2 fixed, but fixing 1
> > > > > > requires changes to the thrift serialization protocol. Such a
> > > > > > change is not different from switching serialization format.
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb
> > > > > > <andrewlam...@gmail.com> wrote:
> > > > > >
> > > > > > > I wanted to share some work Xiangpeng Hao did at InfluxData
> > > > > > > this summer on the current (thrift) metadata format[1].
> > > > > > >
> > > > > > > We found that with careful software engineering, we could
> > > > > > > likely improve the speed of reading existing parquet footer
> > > > > > > format by a factor of 4 or more ([2] contains some specific
> > > > > > > ideas). While we analyzed the Rust implementation, I believe
> > > > > > > a similar conclusion applies to C/C++.
> > > > > > >
> > > > > > > I realize that there are certain features that switching to
> > > > > > > an entirely new footer format would achieve, but the cost to
> > > > > > > adopting a new format across the ecosystem is immense (e.g.
> > > > > > > Parquet "version 2.0" etc).
> > > > > > >
> > > > > > > It is my opinion that investing the same effort in software
> > > > > > > optimization that would be required for a new footer format
> > > > > > > would have a much bigger impact
> > > > > > >
> > > > > > > Andrew
> > > > > > >
> > > > > > > [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> > > > > > > [2]: https://github.com/apache/arrow-rs/issues/5853
> > > > > > >
> > > > > > > On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos
> > > > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl...@public.gmane.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Julien.
> > > > > > > >
> > > > > > > > Thank you for reconnecting the threads.
> > > > > > > >
> > > > > > > > I have broken down my experiments in a narrative, commit by
> > > > > > > > commit on how we can go from flatbuffers being ~2x larger
> > > > > > > > than thrift to being smaller (and at times even half) the
> > > > > > > > size of thrift. This is still on an internal branch, I will
> > > > > > > > resume work towards the end of this month to port it to
> > > > > > > > arrow so that folks can look at the progress and share
> > > > > > > > ideas.
> > > > > > > >
> > > > > > > > On the benchmarking front I need to build and share a
> > > > > > > > binary for third parties to donate their footers for
> > > > > > > > analysis.
> > > > > > > >
> > > > > > > > The PR for parquet extensions has gotten a few rounds of
> > > > > > > > reviews:
> > > > > > > > https://github.com/apache/parquet-format/pull/254. I hope
> > > > > > > > it will be merged soon.
> > > > > > > >
> > > > > > > > I missed the sync yesterday - for some reason I didn't
> > > > > > > > receive an invitation. Julien could you add me again to the
> > > > > > > > invite list?
> > > > > > > >
> > > > > > > > On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem
> > > > > > > > <jul...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > This came up in the sync today.
> > > > > > > > >
> > > > > > > > > There are a few concurrent experiments with flatbuffers
> > > > > > > > > for a future Parquet footer replacement. In itself it is
> > > > > > > > > fine and just wanted to reconnect the threads here so
> > > > > > > > > that folks are aware of each other and can share
> > > > > > > > > findings.
> > > > > > > > >
> > > > > > > > > - Neelaksh benchmarking and experiments:
> > > > > > > > > https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
> > > > > > > > > https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
> > > > > > > > >
> > > > > > > > > - Alkis has also been experimenting and led the proposal
> > > > > > > > > for enabling extending the existing footer.
> > > > > > > > > https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
> > > > > > > > >
> > > > > > > > > - Xuwei also shared that he is looking into this.
> > > > > > > > >
> > > > > > > > > I would suggest that you all reply to this thread sharing
> > > > > > > > > your current progress or ideas and a link to your
> > > > > > > > > respective repos for experimenting.
> > > > > > > > >
> > > > > > > > > Best
> > > > > > > > > Julien
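
For anyone who wants to collect footers from public Parquet files for benchmarking, here is a minimal sketch of pulling the raw footer bytes out of a file. It assumes an unencrypted file that has already been downloaded and read into memory. It relies only on the Parquet trailer layout (Thrift-encoded footer, then a 4-byte little-endian footer length, then the `PAR1` magic). The name `extract_footer` and the synthetic bytes are illustrative only; they are not part of any Parquet library, nor of the footer-donation binary in the arrow PR linked in the thread.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def extract_footer(data: bytes) -> bytes:
    """Return the raw (Thrift-encoded) footer bytes of a Parquet file.

    Hypothetical helper for illustration; it does not decode the footer.
    """
    # Smallest possible layout: leading magic + length field + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not an unencrypted Parquet file")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return data[start:-8]


# Synthetic stand-in for a real file: magic, payload, footer, length, magic.
fake_footer = b"\x15\x02\x00"
blob = MAGIC + b"column data" + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
print(len(extract_footer(blob)))
```

With a real file one would read the bytes from disk (or fetch only the tail of the object) and feed the returned footer to whichever decoder is being benchmarked.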