Thanks, Chris. I would still appreciate any anecdotes people have about
sizing these thread pools.

I did a bit more log analysis on bulk import times. With the default of 5
threads, we're seeing an average of roughly 50-100ms per file per thread in
the master bulk import thread pool. This means that a bulk import of 1000
files (e.g. 1 file for each of 1000 tablets) takes about 10-20 seconds
(1000 files x 50-100ms / 5 threads). We're targeting something like 1 bulk
import per second for some of our instances. Most of those imports are much
smaller than 1000 files, but we're still going to need better performance in
that scenario. Our plan is to run with master.bulk.threadpool.size set to
200 threads for a while on one of our larger clusters and see how well that
works. Thoughts?
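
As a rough back-of-the-envelope check on those numbers, the estimate can be
sketched as below. This assumes the files are spread perfectly evenly across
the pool and that per-file latency stays at 50-100ms as the thread count
grows. Both are assumptions, not measurements, and the function name is made
up for illustration.

```python
def estimate_bulk_import_seconds(num_files, ms_per_file, threads):
    """Idealized wall-clock estimate: total per-file work split evenly across threads.

    Assumes perfect parallelism and constant per-file latency (hypothetical).
    """
    return num_files * ms_per_file / threads / 1000.0


# Compare the default pool size (5) with the proposed size (200)
# for a 1000-file bulk import at 50ms and 100ms per file.
for threads in (5, 200):
    low = estimate_bulk_import_seconds(1000, 50, threads)
    high = estimate_bulk_import_seconds(1000, 100, threads)
    print(f"{threads} threads: {low:.2f}-{high:.2f} s")
```

Under those idealized assumptions a 1000-file import would drop from 10-20
seconds to well under a second, but contention on the tservers or elsewhere
could easily keep real numbers from scaling that cleanly.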

Adam


On Thu, Aug 18, 2016 at 10:39 AM, Christopher <ctubb...@apache.org> wrote:

> Bumping this thread up, because I'm also curious if anybody has any
> thoughts on Adam's questions.
>
> On Mon, Aug 15, 2016 at 1:49 PM Adam Fuchs <afu...@apache.org> wrote:
>
> > I've been looking through the bulk load code lately related to some
> > performance issues a customer of ours is experiencing, and I'm perplexed
> > by a couple of things. Between o.a.a.master.tableOps.LoadFiles and
> > o.a.a.server.client.BulkImporter we have 4 thread pools that are used in
> > bulk load. It seems like only the master thread pool gets any parallelism
> > because we always send one file at a time to the tservers
> > (LoadFiles:154). Are the three thread pools in the tserver vestigial? Did
> > we used to send bigger batches to the tservers and find that one at a
> > time was more optimal?
> >
> > Seems like we could greatly simplify the tserver portion of the bulk
> > load. Can anybody think of why that might not be a good idea?
> >
> > Also, has anybody optimized the pool sizes for multiple concurrent large
> > bulk loads, and do you have suggestions on what settings to use (i.e.
> > master.fate.threadpool.size and master.bulk.threadpool.size)?
> >
> > Thanks,
> > Adam
> >
>
