Btw, just to show that these principles hold for Parquet in general, I used the vanilla spark.read.parquet() for illustration (see the sketch below).
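For concreteness, here is a minimal sketch of that kind of vanilla read plus a temporal query; the base path and the `ts` timestamp column name are hypothetical placeholders, not the exact setup from the gists:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-read-demo").getOrCreate()

# Vanilla Parquet read; "/tmp/parquet_demo" is a hypothetical placeholder path.
df = spark.read.parquet("/tmp/parquet_demo")

# A typical temporal query: scan a time range and count matching rows.
df.filter((df.ts >= "2020-07-01") & (df.ts < "2020-07-02")).count()
```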
On Sat, Jul 18, 2020 at 8:07 PM Vinoth Chandar <[email protected]> wrote:

> Hi all,
>
> You might have heard this repeatedly mentioned in tickets, when we talk
> about Hudi paying some "tax" at write time to ensure query performance
> is good.
>
> These are conscious decisions we made while designing Uber's data lake for
> scale, and sometimes they are not appreciated when optimizing
> single Spark jobs, for example.
>
> So, I decided to write a small demo (all running on a MacBook, on some
> 50GB of data) to show how impactful these are. Hopefully you find it
> useful.
>
> TL;DR:
> - Keeping data sorted by time speeds up temporal queries 2-3x.
> - A 20x reduction in file size can cause up to a 3-4x degradation in query
> performance.
>
> https://gist.github.com/vinothchandar/5544a92e616094c049f58c152faf0a53
> https://gist.github.com/vinothchandar/d7fa1338cddfae68390afcdfe310f94e
>
> Now, is anyone interested in turning these into blogs on hudi.apache.org?
> :) Referencing the right config names and showing our users how to nail
> this.
>
> Thanks,
> Vinoth
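A minimal sketch of the two effects in the TL;DR above, assuming an input dataframe with a `ts` timestamp column; the paths, column name, and partition counts are illustrative assumptions, not the exact parameters from the gists:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-layout-demo").getOrCreate()

# Hypothetical input path holding the demo dataset.
df = spark.read.parquet("/tmp/parquet_demo")

# Effect 1: keep data sorted by time. Sorted files let Parquet's min/max
# column statistics on `ts` skip whole row groups for temporal range queries.
(df.sort("ts")
   .write.mode("overwrite")
   .parquet("/tmp/parquet_sorted"))

# Effect 2: file sizing. Writing the same data as many small files
# (here 1000 output files instead of a few dozen) inflates per-file open
# and task-scheduling overhead, degrading query performance.
(df.repartition(1000)
   .write.mode("overwrite")
   .parquet("/tmp/parquet_small_files"))

# Run the same temporal query against each layout to compare timings.
for path in ("/tmp/parquet_sorted", "/tmp/parquet_small_files"):
    out = spark.read.parquet(path)
    out.filter((out.ts >= "2020-07-01") & (out.ts < "2020-07-02")).count()
```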
