Hi all, I am having trouble importing a medium-sized dataset into a new, empty table. The import runs for about 60 minutes, produces a lot of failed/killed tasks, and sometimes fails altogether.
If I first import a smaller subset into the empty table, manually split the regions (via the split button on the web UI), and then import the larger dataset, the import runs in about 10 minutes. It seems to me that the performance bottleneck during the first import is the single region hosted on a single cluster machine, which is heavily loaded. So my question is: can I force HBase to split faster during heavy write operations, and which tuning parameters affect this scenario?

Thanks for your help,
Matthias

P.S. Here are the details:

Cluster: 33 machines in our test lab (3-year-old servers, single-core CPUs with hyperthreading, 1.5 GB of memory), Debian 5 (Lenny) 32-bit
Versions: Hadoop 0.19.0, HBase 0.19.0, -Xmx 500 MB for the Java processes

Hadoop settings:
  mapred.map.tasks=20
  mapred.reduce.tasks=15
  dfs.block.size=16777216
  mapred.tasktracker.map.tasks.maximum=4
  mapred.tasktracker.reduce.tasks.maximum=4

HBase settings:
  hbase.hregion.max.filesize=67108864

Table: 3 column families
Import file: 5 million records with 18 columns (6 columns per family), 1.1 GB CSV file, imported via the provided Java SampleUploader
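P.P.S. For reference, here is roughly how I pick the split points I apply manually after the small initial import. This is only a sketch: it assumes the row keys are zero-padded decimal record numbers distributed roughly uniformly, which may not match everyone's schema; the key format and region count here are illustrative, not from the actual table.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: compute evenly spaced split points over a numeric row-key
// space, so a table can be split into N regions before a bulk import
// hits a single region. Assumes zero-padded decimal row keys
// (hypothetical; adapt to your real key format).
public class SplitKeys {
    public static List<String> splitPoints(long maxKey, int regions) {
        List<String> points = new ArrayList<String>();
        long step = maxKey / regions;
        // N regions need N-1 boundary keys between 0 and maxKey
        for (int i = 1; i < regions; i++) {
            points.add(String.format("%010d", i * step));
        }
        return points;
    }

    public static void main(String[] args) {
        // e.g. 5 million records spread over 10 regions -> 9 split points
        for (String p : splitPoints(5000000L, 10)) {
            System.out.println(p);
        }
    }
}
```

Each printed key is then used as the boundary for one manual split. (I gather newer HBase releases let you pass split keys directly when creating the table through the Java admin API, which would avoid the manual step entirely, but I have not tried that on 0.19.)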
