Yes, you can and should do that. Just keep an eye on how much work each map/reduce task needs to do: you want to give each task enough work that you aren't spending 30 seconds starting a process that finishes in 10 seconds.
-Eric

On Mon, May 21, 2012 at 12:48 PM, Perko, Ralph J <[email protected]> wrote:

> Eric,
>
> Thanks for the quick reply. Another question - my cluster has over 80
> CPUs available. Suppose I create something like 50 splits across the 7
> servers - I will increase my map job count accordingly. What are your
> thoughts on this?
>
> Thanks,
> Ralph
>
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory
>
>
> From: Eric Newton <[email protected]>
> Reply-To: "[email protected]"
> To: "[email protected]"
> Subject: Re: table splits
>
> You need to estimate the size of the split. First, get the id of the
> table with "tables -l" in the accumulo shell.
>
> Then, find out the size of the table in HDFS:
>
>   $ hadoop fs -dus /accumulo/tables/<id>
>
> Divide by 7, and use that as the split size:
>
>   shell> config -t mytable -s table.split.threshold=newsize
>
> The table will automatically split out. Afterwards, you can raise the
> split threshold to keep the table from splitting again until it gets
> much bigger:
>
>   shell> config -t mytable -s table.split.threshold=1G
>
> -Eric
>
> On Mon, May 21, 2012 at 12:24 PM, Perko, Ralph J <[email protected]> wrote:
>
> Hi,
>
> I am looking for advice on how best to lay out my table splits. I have a
> 7-node cluster and my table contains ~10M records. I would like to split
> the table equally across all the servers, but I see no utility to do this.
> I understand I can create splits for some letter range, but I was hoping
> for some way to have Accumulo create "n" equal splits. Is this possible?
> Right now the best way I see to handle this is to write a utility that
> iterates the table, keeps a count, and at some given value (table size /
> split count) spits out the beginning and end row, and then I create the
> splits manually.
>
> Thanks,
> Ralph
>
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory
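Eric's procedure above (measure the table in HDFS, divide by the node count, set that as the split threshold) can be sketched as a small shell script. This is a hedged illustration, not part of the original thread: the table size is hardcoded here because in real use it would come from `hadoop fs -dus /accumulo/tables/<id>` (with the table id taken from `tables -l` in the accumulo shell), and `mytable` is just an example table name.

```shell
#!/bin/sh
# Sketch of the split-threshold calculation from the thread.
# In practice, obtain the size from HDFS, e.g.:
#   TABLE_BYTES=$(hadoop fs -dus /accumulo/tables/<id> | awk '{print $1}')
# Hardcoded here purely for illustration.
TABLE_BYTES=7000000000   # example: a ~7 GB table
SERVERS=7                # number of tablet servers in the cluster

# Divide the table size evenly across the servers to get the threshold.
THRESHOLD=$((TABLE_BYTES / SERVERS))

# Emit the accumulo shell command that would trigger the splits;
# afterwards the threshold would be raised again (e.g. to 1G).
echo "config -t mytable -s table.split.threshold=$THRESHOLD"
```

Running this prints the shell command to paste into accumulo; the same arithmetic is what Eric describes as "divide by 7, and use that as the split size."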
