Results look good. +1 to enable this via configuration.

On Fri, May 3, 2019 at 9:29 AM nishith agarwal <[email protected]> wrote:
> Alright, sounds good, looking forward to the PR!
>
> On Fri, May 3, 2019 at 9:27 AM Vinoth Chandar <[email protected]> wrote:
>
> > >> BTW, in case of fewer number of files where sort partitioning may work better, will we see higher stage times ?
> > I think this will always perform equally or faster, since it just trades off more parallelism in opening file handles.
> > Sort-based was being nicer to our HDFS NameNode :)
> >
> > >> We can do this for jobs that are running > 1 cores ?
> > This should have no bearing on this setting.. We can discuss more on the PR with adequate context
> >
> > On Fri, May 3, 2019 at 9:21 AM nishith agarwal <[email protected]> wrote:
> >
> > > Nice, we needed to take a fresher look at the indexing stages, great start!
> > > The results look promising. Looks like the Min and 25th percentile bumped, but
> > > that's expected since the cost moved from the highest tasks to the other
> > > ones. Eventually, this will bring down the skew and stage time.
> > >
> > > BTW, in case of fewer number of files where sort partitioning may work
> > > better, will we see higher stage times ? I'm guessing the increase in those
> > > depends on the number of files vs number of records ? Is there scope to
> > > parallelize per partition if opening multiple handles is ok ? We can do
> > > this for jobs that are running > 1 cores ?
> > >
> > > -Nishith
> > >
> > > On Fri, May 3, 2019 at 9:00 AM [email protected] <[email protected]> wrote:
> > >
> > > >
> > > > This does look very promising. It makes sense to enable this mode through
> > > > configuration.
> > > > Balaji.V
> > > >
> > > > On Friday, May 3, 2019, 8:18:44 AM PDT, Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > Hello all,
> > > >
> > > > Noticed in a recent run that there were some skews on the bloom filter
> > > > checking stage. I noticed that even though sort based partitioning
> > > > uniformly distributes the records among partitions, the cost is controlled
> > > > by number of file groups being checked in one partitioning..
> > > >
> > > > I chose to prototype a file group based custom partitioner, with the
> > > > intention of distributing this more evenly.. I am seeing consistently good
> > > > results
> > > >
> > > > for e.g
> > > > ```
> > > > Metric    Min   25th percentile  Median  75th percentile  Max
> > > > Duration  2 s   14 s             48 s    1.6 min          3.9 min
> > > > ```
> > > > becomes
> > > > ```
> > > > Metric    Min   25th percentile  Median  75th percentile  Max
> > > > Duration  21 s  40 s             44 s    49 s             1.9 min
> > > > ```
> > > > I can just make this a configuration per se.. So probably worth getting it
> > > > in and iterating?
> > > > If y'all think so, will prep a PR. HUDI-108 tracks this
> > > >
> > > > Thanks
> > > > Vinoth
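
For readers following the thread, the idea in the original message (spread the bloom filter checks by file group rather than by record count) can be illustrated with a small sketch. This is not the HUDI-108 implementation, which the thread does not show. It assumes we already have an estimated number of key comparisons per file group and uses a greedy "heaviest first, lightest partition next" packing. The names `assign_file_groups`, `partition_loads`, and the `fg-*` ids are hypothetical.

```python
import heapq


def assign_file_groups(comparisons_per_file_group, num_partitions):
    """Map each file group id to a partition index, balancing estimated work.

    Assumption: `comparisons_per_file_group` is a dict of
    file group id -> estimated number of key comparisons against that group.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    # Min-heap of (current load, partition index); the lightest partition is on top.
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)

    assignment = {}
    # Place the heaviest file groups first; break ties by id for determinism.
    ordered = sorted(comparisons_per_file_group.items(),
                     key=lambda kv: (-kv[1], kv[0]))
    for file_group, cost in ordered:
        load, partition = heapq.heappop(heap)
        assignment[file_group] = partition
        heapq.heappush(heap, (load + cost, partition))
    return assignment


def partition_loads(comparisons_per_file_group, assignment, num_partitions):
    """Total estimated comparisons that land on each partition."""
    loads = [0] * num_partitions
    for file_group, partition in assignment.items():
        loads[partition] += comparisons_per_file_group[file_group]
    return loads


if __name__ == "__main__":
    # Hypothetical workload: four file groups with uneven comparison counts.
    work = {"fg-a": 100, "fg-b": 60, "fg-c": 50, "fg-d": 10}
    plan = assign_file_groups(work, num_partitions=2)
    print(plan)
    print(partition_loads(work, plan, num_partitions=2))
```

In a Spark job, a mapping like this would typically back a custom `Partitioner` keyed on file group id. Whether the prototype does it this way is an open question until the PR is up.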
