Hi Dave,

The table is currently organizing netflow data with its rowID of 
timestamp_netflowRecordID, some columns corresponding to various netflow 
quantites, and one column representing the entire netflow in binary form.

The table is about 1.2 TB, and I am scanning 5-40 GB per scan, which scans 
about 7-28 tablets.

What do you mean by a custom load balancer? Do you mean balancing the data on 
ingest, or balancing the query load? What would you recommend for balancing the 
query load if I can only retrieve the data from a particular tablet?

I've played with index/data caches, though I haven't used readahead threads or 
max open files. Is that referring to rfiles?

I'm noticing that most of the queries are CPU bound, and that read i/o is not 
being hit very hard. Is that a typical behavior for scans?

Thanks,
David

From: Dave Marion [mailto:[email protected]]
Sent: Wednesday, August 21, 2013 7:29 PM
To: [email protected]
Subject: RE: Straggler problem in Accumulo BatchScans

How is the table organized?
What percent of the table are you scanning in these large operations?
Have you considered writing a custom load balancer?

I don't think that a tablet can be hosted on multiple servers. But you might be 
able to play around with the index/data caches, readahead threads (concurrent 
queries), and max open files to achieve better performance.

From: Slater, David M. [mailto:[email protected]]
Sent: Wednesday, August 21, 2013 7:09 PM
To: [email protected]<mailto:[email protected]>
Subject: Straggler problem in Accumulo BatchScans

Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.

When I run large BatchScanner operations, the number of tablets scanned per 
node is not uniform, leading to the overloaded nodes taking much longer to 
finish than the others. For queries that require all of the scans to finish 
before returning, this is a major latency issue. What are some practical means 
of load-balancing this to reduce delay?

Is it possible for tablets to be hosted on multiple tablet servers, up to the 
replication factor of the underlying hdfs? Are there reasons this might be an 
undesirable design?

Thanks in advance,
David

Reply via email to