There are two big things required to really scale up bulk loading. Sadly (I guess) they are both things you would need to implement on your own:

1) Avoid lots of small files. Target files as large as you can make them, relative to your ingest latency requirements and your max file size (set on your instance or table).

2) Avoid having to import one file into multiple tablets. Remember that the bulk of the metadata update in Accumulo is adding the new file to each tablet's row. When a single file spans many tablets, you create N metadata updates instead of just one. When you create the files, take the split points of your table into account and use them to try to target one file per tablet (see the rough sketch below).
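To illustrate the second point, here is a rough sketch (not a complete ingest job) of bucketing keys by the table's current split points so that each output file maps onto a single tablet. It assumes you already have a Connector; the TabletBucketer class name and the idea of writing one RFile per bucket index are just for illustration:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Key;
    import org.apache.hadoop.io.Text;

    public class TabletBucketer {
      private final List<Text> splits;

      // Fetch the table's current split points; each split is a tablet's end row.
      public TabletBucketer(Connector conn, String table) throws Exception {
        splits = new ArrayList<Text>(conn.tableOperations().listSplits(table));
        Collections.sort(splits);
      }

      // Return the index of the tablet a key's row falls into. Writing all keys
      // with the same index to the same RFile keeps each file within one tablet.
      public int bucketFor(Key key) {
        Text row = key.getRow();
        int pos = Collections.binarySearch(splits, row);
        // A row equal to a split point belongs to the tablet ending at that split;
        // otherwise binarySearch returns (-(insertion point) - 1).
        return pos >= 0 ? pos : -pos - 1;
      }
    }

If you generate the files in MapReduce, the same idea applies: partition on the split points so each reducer's output file lands in exactly one tablet. Keep in mind the splits can change between when you read them and when you import, so treat this as targeting, not a guarantee.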

Roshan Punnoose wrote:
We are trying to perform bulk ingest at scale and wanted to get some
quick thoughts on how to increase performance and stability. One of the
problems we have is that we sometimes import thousands of small files,
and I don't believe there is a good way around this in the architecture
as of yet. Already I have run into an RPC timeout issue because the
import process is taking longer than 5m. And another issue where we have
so many files after a bulk import that we have had to bump the
tserver.scan.files.open.max to 1K.

Here are some other configs that we have been toying with:
- master.fate.threadpool.size: 20
- master.bulk.threadpool.size: 20
- master.bulk.timeout: 20m
- tserver.bulk.process.threads: 20
- tserver.bulk.assign.threads: 20
- tserver.bulk.timeout: 20m
- tserver.compaction.major.concurrent.max: 20
- tserver.scan.files.open.max: 1200
- tserver.server.threads.minimum: 64
- table.file.max: 64
- table.compaction.major.ratio: 20

(HDFS)
- dfs.namenode.handler.count: 100
- dfs.datanode.handler.count: 50

Just want to get any quick ideas for performing bulk ingest at scale.
Thanks guys

p.s. This is on Accumulo 1.6.5
