> We found (after consulting with the datasketches team) that the
> overhead trying to do hierarchical merging with sketches outweighed
> the performance benefit.

It makes sense that this is a risk. The good news is that the groupBy
merging is less resource intensive on a per-file basis, so you should be
able to have a lot of files open at once. If N-way hierarchical merging of
groupBy spill files existed, and N was set to, say, 10000, then it
shouldn't slow things down too badly.

On Tue, Aug 10, 2021 at 3:32 PM Will Lauer <wla...@verizonmedia.com.invalid>
wrote:

> Interesting. That's worth looking into.
>
> BTW, when we updated to whatever version introduced the hierarchical
> merging, we found that this actually has poor performance with sketches and
> had to disable the hierarchical merging. We found (after consulting with
> the datasketches team) that the overhead trying to do hierarchical merging
> with sketches outweighed the performance benefit. Sketches have a large
> finalization cost, and having to pay that at each level of the hierarchy
> swamped the gains from parallelizing the process.
>
> Given that we just get failures here, it might be worth doing hierarchical
> merging, as slow vs broken is a better tradeoff than slow vs fast.
>
> Will
>
>
> <http://www.verizonmedia.com>
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
> <http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
> <https://www.linkedin.com/company/verizon-media/>
> <http://www.instagram.com/verizonmedia>
>
>
>
> On Tue, Aug 10, 2021 at 12:29 PM Gian Merlino <g...@apache.org> wrote:
>
> > Hey Will,
> >
> > The sorting that happens on the data servers is really useful, because it
> > means the Broker can do its part of the query fully streaming instead of
> > buffering things up.
> >
> > At one point we had a similar problem in ingestion (you could have a ton
> of
> > spill files if you had a lot of sketches) and ended up addressing that by
> > doing the merge hierarchically. Something similar should work in the
> > SpillingGrouper. Instead of opening everything all at once, we could open
> > things in chunks of N, merge those, and then proceed hierarchically. The
> > merge tree would have logarithmic height.
> >
> > The ingestion version of this was:
> >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_druid_pull_10689&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=fOgaA482CvqwfYURHMAIkb0INK1oswSrfJSdXlqXSCA&s=ZJ8LwhSGqznNndW6sDKV97NjCLYxyw-Z_bFTQ7R_VPM&e=
> >
> > On Mon, Aug 9, 2021 at 2:30 PM Will Lauer <wla...@verizonmedia.com
> > .invalid>
> > wrote:
> >
> > > I recently submitted an issue about "Too many open files" in GroupBy
> v2 (
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_druid_issues_11558&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=fOgaA482CvqwfYURHMAIkb0INK1oswSrfJSdXlqXSCA&s=bbH6GMSCU7z1AnJ_UoJjEqn1Ep0xJ_kOxKIB8CCnr3o&e=
> > ) and have been investigating
> > > a
> > > solution. It looked like the problem was happening because the code
> > > preemptively opened all the spill files for reading, which when there
> > are a
> > > huge number of spill files (in our case, a single query is generating
> > 110k
> > > spill files), causes the "too many open files" error when the files
> > ulimit
> > > is set to an otherwise reasonable number. We can work around this for
> now
> > > by setting "ulimit -n" to a huge value (like 1 million), but I was
> hoping
> > > for a better solution.
> > >
> > > In
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_druid_pull_11559&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=fOgaA482CvqwfYURHMAIkb0INK1oswSrfJSdXlqXSCA&s=K6-9nBzX2slHBC7KyzDHEhGfRgI4uIMfNlbUMF1IRh8&e=
> > , I attempted to fix this by
> > > lazily opening files only when they were ready to be read and closing
> > them
> > > immediately after they had finished being read. While this looks like
> it
> > > fixes the issue in some small edge cases, it isn't a general solution
> > > because many queries end up calling CloseableIterators.mergeSorted() to
> > > merge all the spill files together, which due to sorting necessitates
> > > reading all the files at once, causing the "too many files" error
> again.
> > It
> > > looks like mergeSorted() is called because frequently the grouping code
> > is
> > > assuming the results should be sorted and is calling
> > > ConcurrentGrouper.parallelSortAndGetGroupersIterator().
> > >
> > > My question is, can anyone think of a way to avoid the need for sorting
> > at
> > > this level so as to avoid the need for opening all the spill files.
> Given
> > > how sketches work in druid right now, I don't see an easy way to reduce
> > the
> > > number of spill files we are seeing, so I was hoping to address this on
> > the
> > > grouper side, but right now I can't see a solution that makes this any
> > > better. We aren't blocked, because we can set the maximum number of
> files
> > > to a much larger number, but that is an unpalatable long term solution.
> > >
> > > Will
> > >
> > >
> > > <http://www.verizonmedia.com>
> > >
> > > Will Lauer
> > >
> > > Senior Principal Architect, Audience & Advertising Reporting
> > > Data Platforms & Systems Engineering
> > >
> > > M 508 561 6427
> > > 1908 S. First St
> > > Champaign, IL 61822
> > >
> > > <
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.facebook.com_verizonmedia&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=fOgaA482CvqwfYURHMAIkb0INK1oswSrfJSdXlqXSCA&s=bBuBrJqXrK8CF93z_q6ABvtHUEDjtXuRV34rEFfMAS4&e=
> > >   <
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__twitter.com_verizonmedia&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=fOgaA482CvqwfYURHMAIkb0INK1oswSrfJSdXlqXSCA&s=bmqp2pN5AHwhPVzOY0mjfCvqqy6KFneY7dixrJ29BL8&e=
> > >
> > > <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_verizon-2Dmedia_&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=fOgaA482CvqwfYURHMAIkb0INK1oswSrfJSdXlqXSCA&s=8wnli45Segm6Fsd-xVcOj_GutNnuZmHyf2NOOoF8Zv8&e=
> > >
> > > <
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.instagram.com_verizonmedia&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=fOgaA482CvqwfYURHMAIkb0INK1oswSrfJSdXlqXSCA&s=VWIesUn6Op6tEVcT9slFyiFCu_p_6yjrspPlpUpisZg&e=
> > >
> > >
> >
>

Reply via email to