Results look good. +1 to enable this via configuration.

On Fri, May 3, 2019 at 9:29 AM nishith agarwal <[email protected]> wrote:

> Alright, sounds good, looking forward to the PR!
>
> On Fri, May 3, 2019 at 9:27 AM Vinoth Chandar <[email protected]> wrote:
>
> > >> BTW, in case of fewer number of files where sort partitioning may
> > work better, will we see higher stage times ?
> > I think this will always perform equally or faster, since it just trades
> > off more parallelism in opening file handles.
> > Sort-based was being nicer to our HDFS NameNode :)
> >
> > >> We can do this for jobs that are running > 1 cores ?
> > This should have no bearing on this setting. We can discuss more on the PR
> > with adequate context.
> >
> > On Fri, May 3, 2019 at 9:21 AM nishith agarwal <[email protected]>
> > wrote:
> >
> > > Nice, we needed to take a fresher look at the indexing stages, great start!
> > > The results look promising. Looks like the Min and 25th percentile bumped, but
> > > that's expected since the cost moved from the highest tasks to the other
> > > ones. Eventually, this will bring down the skew and stage time.
> > >
> > > BTW, in case of fewer number of files where sort partitioning may work
> > > better, will we see higher stage times? I'm guessing the increase in those
> > > depends on the number of files vs number of records? Is there scope to
> > > parallelize per partition if opening multiple handles is ok? We can do
> > > this for jobs that are running > 1 cores?
> > >
> > > -Nishith
> > >
> > > On Fri, May 3, 2019 at 9:00 AM [email protected] <[email protected]>
> > > wrote:
> > >
> > > >
> > > > This does look very promising. It makes sense to enable this mode through
> > > > configuration.
> > > > Balaji.V
> > > >
> > > > On Friday, May 3, 2019, 8:18:44 AM PDT, Vinoth Chandar <[email protected]> wrote:
> > > >
> > > >  Hello all,
> > > >
> > > > Noticed in a recent run that there were some skews on the bloom filter
> > > > checking stage. I noticed that even though sort-based partitioning
> > > > uniformly distributes the records among partitions, the cost is controlled
> > > > by the number of file groups being checked in one partition.
> > > >
> > > > I chose to prototype a file group based custom partitioner, with the
> > > > intention of distributing this more evenly. I am seeing consistently good
> > > > results
> > > >
> > > > for e.g
> > > > ```
> > > > Metric    Min   25th percentile  Median  75th percentile  Max
> > > > Duration  2 s   14 s             48 s    1.6 min          3.9 min
> > > > ```
> > > > becomes
> > > > ```
> > > > Metric    Min   25th percentile  Median  75th percentile  Max
> > > > Duration  21 s  40 s             44 s    49 s             1.9 min
> > > > ```
> > > > I can just make this a configuration per se. So probably worth getting it
> > > > in and iterating?
> > > > If y'all think so, will prep a PR. HUDI-108 tracks this.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>
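
The skew described at the start of the thread can be illustrated with a small standalone sketch. This is not the HUDI-108 implementation; the function names, the toy data, and the greedy "fewest file groups first" assignment are all assumptions made for illustration. It only shows why equal record counts per partition (sort-based) can still leave one partition checking far more file groups than the others, and how assigning whole file groups to partitions evens that out.

```python
import heapq
from collections import Counter


def sort_based_partitions(pairs, num_partitions):
    """Sort (file_group, key) pairs and cut them into equally sized chunks."""
    ordered = sorted(pairs)
    chunk = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + chunk] for i in range(0, len(ordered), chunk)]


def file_group_partitions(pairs, num_partitions):
    """Hypothetical greedy assignment: give each file group (heaviest first)
    to the partition that currently holds the fewest file groups, breaking
    ties by fewest records."""
    records_per_group = Counter(fg for fg, _ in pairs)
    # Heap entries: (file groups assigned, records assigned, partition id).
    heap = [(0, 0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_partitions)}
    by_weight = sorted(records_per_group.items(), key=lambda kv: (-kv[1], kv[0]))
    for fg, n in by_weight:
        groups, records, p = heapq.heappop(heap)
        assignment[p].append(fg)
        heapq.heappush(heap, (groups + 1, records + n, p))
    return assignment


# Toy input (made up): one large file group with 90 keys to check,
# plus 30 small file groups with a single key each.
pairs = [("fg-000", f"key-{i:03d}") for i in range(90)]
pairs += [(f"fg-{g:03d}", "key-000") for g in range(1, 31)]

# Distinct file groups each partition has to check under each scheme.
sort_counts = [len({fg for fg, _ in part})
               for part in sort_based_partitions(pairs, 4)]
fg_counts = [len(groups)
             for groups in file_group_partitions(pairs, 4).values()]

print("sort-based file groups per partition:      ", sort_counts)
print("file-group-based file groups per partition:", fg_counts)
```

On this toy input every sort-based partition holds 30 records, yet one of them has to check 30 file groups while the other three check a single one. The file-group-based assignment spreads the 31 file groups as 7, 8, 8 and 8. Note that the sketch keeps all records of a file group in one partition, which is a simplification; a real partitioner could also split a heavy file group across partitions.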
