David,

Each tablet is hosted by exactly one tablet server, and there's no way around that. (This is actually quite reasonable; otherwise, we would receive duplicate results from multiple tablet servers.)

One strategy to deal with imbalanced data is to add a random partition prefix to your row IDs. This does complicate building queries, but in general, you'll be able to leverage all of your nodes. I did some testing with the number of such random shard IDs, and it seems like having 1-2x as many shards as tablet servers worked pretty well. (I'd suggest 2x in case you ever grow your cloud.)

In particular, if you can reingest your data, prepend a random prefix from "01~" to "14~" to the beginning of each row ID, and see if that helps. After that, you can "help" Accumulo decide where it should split tablets with addSplits 01 02 <etc> 14 from the Accumulo shell (or programmatically with TableOperations.addSplits, linked below). Then you can check that your 14+ splits are distributed across the 7 nodes in a reasonable way.

I hope that helps,
Jim

http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#addSplits%28java.lang.String,%20java.util.SortedSet%29

On Wed, Aug 21, 2013 at 7:09 PM, Slater, David M. <[email protected]> wrote:

> Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.
>
> When I run large BatchScanner operations, the number of tablets scanned
> per node is not uniform, leading to the overloaded nodes taking much longer
> to finish than the others. For queries that require all of the scans to
> finish before returning, this is a major latency issue. What are some
> practical means of load-balancing this to reduce delay?
>
> Is it possible for tablets to be hosted on multiple tablet servers, up to
> the replication factor of the underlying hdfs? Are there reasons this might
> be an undesirable design?
>
> Thanks in advance,
> David
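
For reference, the shard-prefix idea from the reply above could be sketched roughly as follows. This is only an illustration in plain Java with no Accumulo dependency: the class ShardedRowIds and its methods are hypothetical names made up for this sketch, and the two-digit "NN~" prefix format is just one reading of the "01-14~" suggestion. Converting the split strings to Hadoop Text objects and passing them to TableOperations.addSplits(String, SortedSet<Text>) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper (not part of Accumulo): builds shard-prefixed row IDs
// and the matching split points for a fixed number of shards.
final class ShardedRowIds {
    private final int numShards;

    ShardedRowIds(int numShards) {
        // Two-digit labels only cover 1..99 shards in this sketch.
        if (numShards < 1 || numShards > 99) {
            throw new IllegalArgumentException("numShards must be between 1 and 99");
        }
        this.numShards = numShards;
    }

    // Zero-padded label so that shard labels sort lexicographically ("01" < "02" < ... < "14").
    static String label(int shard) {
        return String.format(Locale.ROOT, "%02d", shard);
    }

    // Prepend a randomly chosen shard prefix such as "07~" to the original row ID.
    String shard(String rowId) {
        int shard = ThreadLocalRandom.current().nextInt(1, numShards + 1);
        return label(shard) + "~" + rowId;
    }

    // Split points "01" .. "NN", one per shard, in sorted order.
    SortedSet<String> splitPoints() {
        SortedSet<String> splits = new TreeSet<>();
        for (int s = 1; s <= numShards; s++) {
            splits.add(label(s));
        }
        return splits;
    }

    // Because the prefix is random, looking up one original row ID
    // means probing that ID under every shard prefix.
    List<String> candidateRows(String rowId) {
        List<String> rows = new ArrayList<>();
        for (String label : splitPoints()) {
            rows.add(label + "~" + rowId);
        }
        return rows;
    }
}
```

With 7 tablet servers and the suggested 2x factor, new ShardedRowIds(14) would produce the split points 01 through 14. The candidateRows method shows the query-side cost mentioned in the reply: a lookup for a single original row ID turns into one range per shard, which a BatchScanner can then fetch in parallel.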
