This is really promising! I think the gains will be much higher if the data is clustered over a larger window of commits. We can keep improving this over time.
I'll be sure to link the results to the doc updates.

On Wed, Jan 20, 2021 at 10:40 PM Satish Kotha <satishko...@uber.com.invalid> wrote:

> Hello everyone,
>
> We see ~60% improvement in query runtime for some datasets. See an example
> documented here
> <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-PerformanceEvaluation>.
> Please try out this feature and share any feedback.
> I have included commands to run async clustering in the example section
> <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-PerformanceEvaluation>.
> You could also set up inline clustering using commands in this section
> <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-Commandstoscheduleandrunclustering>.
>
> Thanks
> Satish
>
> On Tue, Dec 22, 2020 at 10:32 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Please help us test this more, before RC is cut! :)
> >
> > On Tue, Dec 22, 2020 at 10:23 PM Satish Kotha <satishko...@uber.com.invalid> wrote:
> >
> > > Hello all,
> > >
> > > Clustering feature landed <https://github.com/apache/hudi/pull/2263> on
> > > master branch and is available in beta. This feature can be used to do
> > > the following:
> > > 1) Stitch small files into larger files
> > > 2) Change data layout on disk by sorting data using different columns
> > >    (for query/storage optimization)
> > >
> > > If you are interested in the above use cases, appreciate it if you can
> > > try out this feature. I have included commands to run clustering in this
> > > section
> > > <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance#RFC19Clusteringdataforspeedandqueryperformance-Commandstoscheduleandrunclustering>
> > > (along with caveats as this feature is still in beta).
> > >
> > > Any feedback is welcome. I'm also on #general room in slack. Please
> > > feel free to ping me if you have any questions/comments.
> > >
> > > Thanks
> > > Satish