Btw, just to show that these principles hold for Parquet in general, I
used vanilla spark.read.parquet() for illustration.
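
For context, the timing pattern is roughly the following; a minimal
sketch for spark-shell (the path and column names are placeholders, not
the exact ones from the gists):

    import org.apache.spark.sql.functions.col

    // plain Spark, no Hudi: scan a parquet dataset and time a temporal query
    val df = spark.read.parquet("/tmp/demo/events")   // hypothetical path
    spark.time {
      df.filter(col("event_ts") >= "2020-07-01" && col("event_ts") < "2020-07-02")
        .count()
    }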

On Sat, Jul 18, 2020 at 8:07 PM Vinoth Chandar <[email protected]> wrote:

> Hi all,
>
> You might have heard this mentioned repeatedly on tickets, when we talk
> about Hudi paying some "tax" at write time to ensure query performance
> is good.
>
> These are conscious decisions we made while designing Uber's data lake
> for scale, and they are sometimes not appreciated when optimizing a
> single Spark job, for example.
>
> So, I decided to write a small demo (all running on a MacBook, against
> some 50GB of data) to show how impactful these choices are. Hopefully
> you find it useful.
>
> TL;DR:
> - Keeping data sorted by time speeds up temporal queries by 2-3x (see
> the first sketch below).
> - A 20x reduction in file sizes can cause up to 3-4x degradation in
> query performance (see the second sketch below).
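>
> A minimal sketch of the first point (spark-shell; event_ts is a
> hypothetical timestamp column): sorting at write time clusters rows so
> parquet row-group min/max stats line up with the filter, letting a
> temporal query skip most of the data:
>
>     import org.apache.spark.sql.functions.col
>
>     // write the same data, sorted by the event timestamp
>     df.sort("event_ts").write.parquet("/tmp/demo/sorted_by_time")
>     // a range predicate now touches only the row groups covering that day
>     spark.time {
>       spark.read.parquet("/tmp/demo/sorted_by_time")
>         .filter(col("event_ts") >= "2020-07-01" && col("event_ts") < "2020-07-02")
>         .count()
>     }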
>
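> And a sketch of the second point (the file counts here are
> illustrative, not the exact sizes benchmarked): writing the same data
> as many small files instead of a few large ones makes the identical
> query pay per-file open and planning overhead:
>
>     // a few large files vs. ~20x more, ~20x smaller files
>     df.coalesce(50).write.parquet("/tmp/demo/large_files")
>     df.repartition(1000).write.parquet("/tmp/demo/small_files")
>     spark.time { spark.read.parquet("/tmp/demo/large_files").count() }
>     spark.time { spark.read.parquet("/tmp/demo/small_files").count() }
>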
> https://gist.github.com/vinothchandar/5544a92e616094c049f58c152faf0a53
> https://gist.github.com/vinothchandar/d7fa1338cddfae68390afcdfe310f94e
>
>
> Now, is anyone interested in turning these into blog posts on
> hudi.apache.org? :) Referencing the right config names and showing our
> users how to nail this.
>
> Thanks
> Vinoth
>
