Re: stupid/dangerous batch load question

Seidl, Ed Wed, 28 May 2014 11:33:06 -0700

It's a small cluster, only 7 tservers.

While the importdirectory is running, no other MR jobs are running.  The 
number of queued major compactions increases into the thousands.  I have the 
number of bulk import threads set to 10, so there will be 70 concurrent major 
compactions shown on the monitor page.  System load is actually pretty low…~40 
or so.  The rows themselves are not very large, but the row IDs are on the 
order of 150 bytes.  Overridden bits of config below:


default  | general.rpc.timeout ............................. | 120s
site     |    @override .................................... | 300s
default  | master.bulk.threadpool.size ..................... | 5
system   |    @override .................................... | 32
default  | master.bulk.timeout ............................. | 5m
system   |    @override .................................... | 60m
default  | tserver.bulk.assign.threads ..................... | 1
system   |    @override .................................... | 10
default  | tserver.bulk.process.threads .................... | 1
system   |    @override .................................... | 10
default  | tserver.bulk.timeout ............................ | 5m
system   |    @override .................................... | 60m
default  | tserver.cache.data.size ......................... | 128M
site     |    @override .................................... | 256M
default  | tserver.compaction.major.concurrent.max ......... | 3
system   |    @override .................................... | 10
default  | tserver.compaction.minor.concurrent.max ......... | 4
site     |    @override .................................... | 10
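
For reference, the 70 figure is presumably the per-tserver limit of 10 (the tserver.compaction.major.concurrent.max override above) multiplied by the 7 tservers. A minimal sketch of that arithmetic follows. The queue depth of 3000 is a made-up stand-in for "into the thousands", and the "rounds" estimate assumes every compaction takes about the same time, which is unlikely to hold exactly in practice.

```python
import math

def cluster_majc_slots(tservers, per_tserver_max):
    """Cluster-wide concurrent major compactions: per-tserver limit times tserver count."""
    return tservers * per_tserver_max

def rounds_to_drain(queued, slots):
    """Rough number of 'waves' needed to work through a compaction queue."""
    return math.ceil(queued / slots)

slots = cluster_majc_slots(tservers=7, per_tserver_max=10)
print(slots)                          # 70
print(rounds_to_drain(3000, slots))   # 43 (3000 is a hypothetical queue depth)
```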

Thanks,
Ed

From: David Medinets <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, May 28, 2014 11:16 AM
To: accumulo-user <[email protected]>
Subject: Re: stupid/dangerous batch load question

Lots of questions can be asked:

How many servers?
How many compactions are being run at once?
What is the size of the mutations?

What does the Accumulo monitor page say during the ingest process? Does it 
indicate high load?

Are you running map-reduce jobs at the same time as the bulk ingest?

I think there is a setting to change the number of threads used by bulk ingest. 
Can you run 'config -t' and post the results?

I've used tables with thousands of tablets, and I can't remember having to wait 
for a bulk ingest to process.



On Wed, May 28, 2014 at 1:49 PM, Seidl, Ed <[email protected]> wrote:
I have a large amount of data that I am batch loading into Accumulo.  I'm using 
mapreduce to read in chunks of data and write out rfiles to be loaded with 
importdirectory.  I've noticed that the import will hang for longer and longer 
times as more data is added.  For instance, one table, which currently has 
~2500 tablets, now takes around 2 hours to process the importdirectory.

In poking around in the source for TableOperationsImpl (1.5.0), I see that 
there is an option to not wait on certain operations (like compact).  Would it 
be dangerous to (optionally) return immediately from importdirectory, and 
instead check the fail directory to detect errors in the import?  I know this 
will eventually cause a backup in the staging directories, but is there any 
potential to corrupt the tables?
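
To make the idea concrete, here is a rough sketch of the pattern being asked about: hand the blocking import call to a background thread, carry on, and inspect the failure directory afterwards. It is not Accumulo code. blocking_import is a hypothetical stand-in for whatever client call performs the import, fake_import simulates it by moving any file whose name starts with "bad" into the failure directory, and local directories stand in for HDFS. Whether the real import is safe to leave unattended is exactly the open question above, and this sketch does not answer it.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def import_in_background(executor, blocking_import, import_dir, failure_dir):
    """Submit the blocking import and return a Future right away."""
    return executor.submit(blocking_import, import_dir, failure_dir)

def failed_files(failure_dir):
    """Names of files left in the failure directory (empty list means none failed)."""
    return sorted(os.listdir(failure_dir))

def fake_import(import_dir, failure_dir):
    """Hypothetical stand-in for the real import: rejects files named bad*."""
    for name in os.listdir(import_dir):
        if name.startswith("bad"):
            shutil.move(os.path.join(import_dir, name),
                        os.path.join(failure_dir, name))

def demo():
    import_dir = tempfile.mkdtemp()
    failure_dir = tempfile.mkdtemp()
    for name in ("good.rf", "bad.rf"):
        open(os.path.join(import_dir, name), "w").close()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = import_in_background(pool, fake_import, import_dir, failure_dir)
        # The caller is free to do other work here instead of blocking.
        future.result()  # the failure directory is only meaningful once the import is done
    return failed_files(failure_dir)

print(demo())  # ['bad.rf']
```

Note that the failure directory can only be trusted after the import has actually finished, so the caller still has to track completion somehow (here, through the Future).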

Thanks,
Ed
