There are two big things required to really scale up bulk loading. Sadly (I guess) they are both things you would need to implement on your own:

1) Avoid lots of small files. Target files as large as you can make them, relative to your ingest latency requirements and your max file size (set on your instance or table).

2) Avoid having to import one file into multiple tablets. Remember that the bulk of the metadata update in Accumulo is adding the new file to each tablet's row. When a single file spans many tablets, you create N metadata updates instead of just one. When you create the files, take the split points of your table into account and use them to try to target one file per tablet (see the rough sketch below).
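To illustrate the second point, here is a rough sketch (not a complete ingest job) of bucketing keys by the table's current split points so that each output file maps onto a single tablet. It assumes you already have a Connector; the TabletBucketer class name and the idea of writing one RFile per bucket index are just for illustration:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Key;
    import org.apache.hadoop.io.Text;

    public class TabletBucketer {
      private final List<Text> splits;

      // Fetch the table's current split points; each split is a tablet's end row.
      public TabletBucketer(Connector conn, String table) throws Exception {
        splits = new ArrayList<Text>(conn.tableOperations().listSplits(table));
        Collections.sort(splits);
      }

      // Return the index of the tablet a key's row falls into. Writing all keys
      // with the same index to the same RFile keeps each file within one tablet.
      public int bucketFor(Key key) {
        Text row = key.getRow();
        int pos = Collections.binarySearch(splits, row);
        // A row equal to a split point belongs to the tablet ending at that split;
        // otherwise binarySearch returns (-(insertion point) - 1).
        return pos >= 0 ? pos : -pos - 1;
      }
    }

If you generate the files in MapReduce, the same idea applies: partition on the split points so each reducer's output file lands in exactly one tablet. Keep in mind the splits can change between when you read them and when you import, so treat this as targeting, not a guarantee.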

Roshan Punnoose wrote:
We are trying to perform bulk ingest at scale and wanted to get some
quick thoughts on how to increase performance and stability. One of the
problems we have is that we sometimes import thousands of small files,
and I don't believe there is a good way around this in the architecture
as of yet. Already I have run into an RPC timeout issue because the
import process is taking longer than 5m. And another issue where we have
so many files after a bulk import that we have had to bump the
tserver.scan.files.open.max to 1K.

Here are some other configs that we have been toying with:
- master.fate.threadpool.size: 20
- master.bulk.threadpool.size: 20
- master.bulk.timeout: 20m
- tserver.bulk.process.threads: 20
- tserver.bulk.assign.threads: 20
- tserver.bulk.timeout: 20m
- tserver.compaction.major.concurrent.max: 20
- tserver.scan.files.open.max: 1200
- tserver.server.threads.minimum: 64
- table.file.max: 64
- table.compaction.major.ratio: 20

(HDFS)
- dfs.namenode.handler.count: 100
- dfs.datanode.handler.count: 50

Just want to get any quick ideas for performing bulk ingest at scale.
Thanks guys

p.s. This is on Accumulo 1.6.5
