I have generally found in my imports that trying to split aggressively doesn't seem to help.
I did write a file randomizer which seems to help. It's a simple map-reduce... I should post it sometime.

-ryan

On Wed, Mar 11, 2009 at 1:14 AM, Mat Hofschen <hofsc...@gmail.com> wrote:
> Hi all,
> I am having trouble with importing a medium dataset into an empty new
> table. The import runs for about 60 minutes. There are a lot of
> failed/killed tasks in this scenario, and sometimes the import fails
> altogether.
>
> If I import a smaller subset into the empty table, then perform a manual
> split of regions (via the split button on the web page), and then import
> the larger dataset, the import runs for about 10 minutes.
>
> It seems to me that the performance bottleneck during the first import is
> the single region on the single cluster machine. This machine is heavily
> loaded. So my question is whether I can force HBase to split faster during
> heavy write operations, and which tuning parameters may affect this
> scenario.
>
> Thanks for your help,
> Matthias
>
> p.s. here are the details
>
> Details:
> 33 cluster machines in the test lab (3-year-old servers with
> hyperthreading, single-core CPU), 1.5 GB of memory, Debian 5 Lenny 32-bit
> hadoop 0.19.0, hbase 0.19.0
> -Xmx 500mb for java processes
>
> hadoop
> mapred.map.tasks=20
> mapred.reduce.tasks=15
> dfs.block.size=16777216
> mapred.tasktracker.map.tasks.maximum=4
> mapred.tasktracker.reduce.tasks.maximum=4
>
> hbase
> hbase.hregion.max.filesize=67108864
>
> hbase table
> 3 column families
>
> import file
> 5 Mill records with 18 columns (6 columns per family)
> filesize 1.1 gig csv-file
> import via provided java SampleUploader
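The "file randomizer" Ryan mentions is not posted in the thread, but the idea behind it — a map-reduce that shuffles input records so writes spread across the key space instead of hammering one region in sorted order — can be sketched as below. This is an illustrative Python simulation of the map and reduce phases, not his actual Hadoop job; the function names are made up for this sketch:

```python
import random

def map_phase(lines, seed=None):
    """Map step: tag each input line with a random key.

    In a real Hadoop job the mapper would emit (random_key, line)
    pairs; the framework's shuffle/sort then orders records by key.
    """
    rng = random.Random(seed)
    return [(rng.random(), line) for line in lines]

def reduce_phase(keyed_lines):
    """Reduce step: sort by the random key and emit only the lines,
    leaving the records in a uniformly shuffled order."""
    return [line for _, line in sorted(keyed_lines)]

def randomize(lines, seed=None):
    """Run both phases: a permutation of the input in random order."""
    return reduce_phase(map_phase(lines, seed))
```

With the input no longer arriving in row-key order, consecutive puts land in different regions once the table has split at all, instead of all queueing on the last region.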
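The manual pre-split Matthias describes (import a small subset, then press the split button) can be made more systematic by choosing split boundaries up front from a sample of the row keys. A minimal sketch of picking evenly spaced boundary keys, assuming the row key is available as a string; `split_points` and `num_regions` are names invented here, not part of the HBase API:

```python
def split_points(row_keys, num_regions):
    """Pick num_regions - 1 boundary keys that divide the sorted key
    space into roughly equal-sized ranges.

    row_keys can be a sample of the full dataset; the boundaries are
    the keys at evenly spaced positions in the sorted sample.
    """
    keys = sorted(row_keys)
    step = len(keys) / num_regions
    return [keys[int(step * i)] for i in range(1, num_regions)]
```

For example, sampling keys from the 1.1 GB CSV and asking for one region per region server would give 32 boundary keys for a 33-node cluster, so the first import starts against many regions instead of one.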