Mat, do you have DataNodes hosted on the same machines as your RegionServers?
Is this import job running as MapReduce? You have 4 maps and 4 reduces per node, plus the DN and the RS. I'd recommend at least 4 cores, or 8 if you have CPU-intensive MR jobs. Before memory becomes an issue, you're going to be CPU-bound quickly with all three of these things running on a single core (hyperthreaded or not; even 2 cores may not be sufficient).

I have had some luck with splitting my tables early on in the import, but this will only make a difference if you have fully randomized the insert order of your keys, as Ryan pointed out. Either way, you should probably have max map and reduce tasks set to 1 each per node.

Another idea: since you have a decent number of nodes, you could segment your cluster a bit to prevent starvation and contention between 4+ JVMs on a core. Run HDFS separate from HBase and MR. I'd have to know more about what you're trying to do to help you figure out the best distribution.

JG

> -----Original Message-----
> From: Mat Hofschen [mailto:[email protected]]
> Sent: Wednesday, March 11, 2009 1:15 AM
> To: [email protected]
> Subject: Import into empty table
>
> Hi all,
> I am having trouble importing a medium-sized dataset into an empty new
> table. The import runs for about 60 minutes. There are a lot of
> failed/killed tasks in this scenario, and sometimes the import fails
> altogether.
>
> If I import a smaller subset into the empty table, then perform a
> manual split of regions (via the split button on the web page), and
> then import the larger dataset, the import runs for about 10 minutes.
>
> It seems to me that the performance bottleneck during the first import
> is the single region on the single cluster machine. This machine is
> heavily loaded. So my question is whether I can force HBase to split
> faster during heavy write operations, and what tuning parameters may
> be affecting this scenario.
>
> Thanks for your help,
> Matthias
>
> p.s. here are the details
>
> Details:
> 33 cluster machines in testlab (3-year-old servers with hyperthreaded
> single-core CPUs), 1.5 GB of memory, Debian 5 Lenny 32-bit
> hadoop 0.19.0, hbase 0.19.0
> -Xmx 500m for java processes
>
> hadoop
> mapred.map.tasks=20
> mapred.reduce.tasks=15
> dfs.block.size=16777216
> mapred.tasktracker.map.tasks.maximum=4
> mapred.tasktracker.reduce.tasks.maximum=4
>
> hbase
> hbase.hregion.max.filesize=67108864
>
> hbase table
> 3 column families
>
> import file
> 5 million records with 18 columns (6 columns per family)
> file size 1.1 GB, CSV file
> import via the provided Java SampleUploader
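For the "max map and reduce tasks set to 1 each per node" suggestion above, a minimal sketch of the two TaskTracker settings involved — assuming Hadoop 0.19, where these would go in hadoop-site.xml (they override the defaults of 2, and your cluster currently sets them to 4):

```xml
<!-- Limit each TaskTracker to one concurrent map and one concurrent
     reduce, so the DN, RS, and MR tasks aren't all fighting for a
     single core. TaskTrackers must be restarted to pick this up. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
```

Note these are per-node slot limits, not job-level settings, so they take effect cluster-wide for every MR job on those nodes.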
