Michael,

This question should be addressed to the hbase-user mailing list, as it is strictly about HBase's usage of MapReduce; the MapReduce framework itself doesn't have any knowledge of how the region servers are configured. I CC'd it.
Uploading into an empty table is always a problem, as you saw, since there's no load distribution. I would recommend instead writing directly into HFiles as documented here:

http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk

Other useful information for us: HBase/Hadoop versions, hardware used, optimizations used to do the insert, configuration files.

Thx,

J-D

On Tue, Jan 12, 2010 at 1:35 PM, Clements, Michael <[email protected]> wrote:
> This leads to one quick & easy question: how does one reduce the number
> of map tasks for a job? My goal is to limit the number of map tasks so
> they don't overwhelm the HBase region servers.
>
> The docs point in several directions.
>
> There's a method job.setNumReduceTasks(), but no setNumMapTasks().
>
> There is a job configuration setting setNumMapTasks(), but it's
> deprecated, and the docs say it can only increase, not reduce, the
> number of tasks.
>
> There's InputFormat and its subclasses, which do the actual file splits,
> but there is no single method to simply set the number of splits. One
> would have to write his own subclass to measure the total size of all
> input files, divide by the desired number of mappers, and split it all
> up.
>
> The last option is not trivial, but it is doable. Before I jump in, I
> figured I'd ask if there is an easier way.
>
> Thanks
>
> -----Original Message-----
> From: mapreduce-user-return-267-michael.clements=disney....@hadoop.apache.org
> [mailto:[email protected]]
> On Behalf Of Clements, Michael
> Sent: Tuesday, January 12, 2010 10:53 AM
> To: [email protected]
> Subject: how to load big files into HBase without crashing?
>
> I have a 15-node Hadoop cluster that is working for most jobs. But every
> time I upload large data files into HBase, the job fails.
>
> I surmise that this file (15 GB in size) is big enough, and there are so
> many tasks (about 55 at once), that they swamp the region server
> processes.
>
> Each cluster node is also an HBase region server, so there is a minimum
> of about 4 tasks for each region server. But when the table is small,
> there are few regions, so each region server is hosting many more tasks.
> For example, if the table starts out empty there is a single region, so
> a single region server has to handle calls from all 55 tasks. It can't
> handle this, the tasks give up, and the job fails.
>
> This is just conjecture on my part. Does it sound reasonable?
>
> If so, what methods are there to prevent this? Limiting the number of
> tasks for the upload job is one obvious solution, but what is a good
> limit? The more general question is: how many map tasks can a typical
> region server support?
>
> Limiting the number of tasks is tedious and error-prone, as it requires
> somebody to look at the HBase table, see how many regions it has and on
> which servers, and manually configure the job accordingly. If the job is
> big enough, the number of regions will grow during the job and the
> initial task counts won't be ideal anymore.
>
> Ideally, the Hadoop framework would be smart enough to look at how many
> regions and region servers exist and dynamically allocate a reasonable
> number of tasks.
>
> Does the community have any knowledge or techniques to handle this?
>
> Thanks
>
> Michael Clements
> Solutions Architect
> [email protected]
> 206 664-4374 office
> 360 317 5051 mobile
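
As a rough illustration of the bulk-load approach recommended above, here is a minimal sketch of a MapReduce job that writes HFiles via HFileOutputFormat instead of sending Puts to the region servers. It assumes an HBase release that ships HFileOutputFormat.configureIncrementalLoad (newer than the 0.20.2 docs linked above, which describe a more manual variant of the same workflow); the table name "mytable", the family "cf", and the tab-separated input are made up for illustration.

// Sketch: a MapReduce job that writes HFiles for bulk loading instead of
// issuing Puts against the region servers. Assumes an HBase release that
// ships HFileOutputFormat.configureIncrementalLoad (later than 0.20.2);
// the table "mytable", family "cf", and input layout are hypothetical.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

  // Parses "rowkey<TAB>value" lines into KeyValues for one column.
  static class LineMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
          Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-sketch");
    job.setJarByClass(BulkLoadSketch.class);
    job.setMapperClass(LineMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sets up the TotalOrderPartitioner, the sorting reducer, and one
    // reduce task per region of the target table.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    // After the job finishes, the HFiles under args[1] still have to be
    // handed to the cluster (completebulkload in later releases,
    // loadtable.rb on 0.20.x).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}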
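
On the empty-table problem mentioned above, another option is to create the table pre-split into several regions so the first writes are spread across region servers rather than all landing on a single region. A small sketch, assuming a client API whose HBaseAdmin.createTable accepts split keys (HBase 0.90 and later) and assuming row keys that are roughly uniform hex strings; the table and family names are hypothetical.

// Sketch: create a table pre-split into N regions so an initial bulk insert
// is spread across region servers instead of hammering one region.
// Assumes HBaseAdmin.createTable(desc, splitKeys) is available (0.90+) and
// roughly uniform hex-string row keys; names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));

    // 15 split points -> 16 regions, roughly one per region server on a
    // 15-node cluster; adjust the count and keys to your key distribution.
    int numRegions = 16;
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("%02x", i * 256 / numRegions));
    }
    admin.createTable(desc, splits);
  }
}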

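On the question in the quoted thread about reducing the number of map tasks: with FileInputFormat the number of mappers equals the number of input splits, so raising the minimum split size yields fewer, larger splits without writing a custom InputFormat. A minimal sketch of that knob using the new (org.apache.hadoop.mapreduce) API; the 512 MB figure is only an illustrative value, not a recommendation.

// Sketch: fewer map tasks by forcing larger input splits.
// With FileInputFormat, one mapper runs per split, and a split is normally
// one HDFS block; raising the minimum split size merges blocks into fewer,
// larger splits. The 512 MB value below is illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class FewerMappersSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "upload-with-fewer-mappers");
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Minimum split size of 512 MB: splits (and therefore map tasks)
    // become larger and fewer; tune to what the region servers can take.
    FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);

    // ... mapper/reducer/output setup for the actual upload goes here ...
  }
}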