In my imports, I have generally found that trying to split aggressively
doesn't seem to help.

I did write a file randomizer, which seems to help.  It's a simple
map-reduce... I should post it sometime.
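
Roughly, it looks like this (a from-memory sketch against the old
org.apache.hadoop.mapred API that ships with Hadoop 0.19; the class and
job names here are made up):

import java.io.IOException;
import java.util.Iterator;
import java.util.Random;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class FileRandomizer {

  // map: tag each input line with a random key so the shuffle phase
  // scatters sorted neighbors across reducers, instead of the import
  // hammering one region with a long run of adjacent keys
  public static class RandomKeyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random rand = new Random();
    public void map(LongWritable offset, Text line,
        OutputCollector<IntWritable, Text> out, Reporter r)
        throws IOException {
      out.collect(new IntWritable(rand.nextInt(Integer.MAX_VALUE)), line);
    }
  }

  // reduce: drop the random key and write the lines back out,
  // now in effectively random order
  public static class DropKeyReducer extends MapReduceBase
      implements Reducer<IntWritable, Text, NullWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> lines,
        OutputCollector<NullWritable, Text> out, Reporter r)
        throws IOException {
      while (lines.hasNext()) {
        out.collect(NullWritable.get(), lines.next());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FileRandomizer.class);
    conf.setJobName("file-randomizer");
    conf.setMapperClass(RandomKeyMapper.class);
    conf.setReducerClass(DropKeyReducer.class);
    conf.setMapOutputKeyClass(IntWritable.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}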

-ryan

On Wed, Mar 11, 2009 at 1:14 AM, Mat Hofschen <hofsc...@gmail.com> wrote:

> Hi all,
> I am having trouble importing a medium-sized dataset into an empty, new
> table.
> The import runs for about 60 minutes.
> There are a lot of failed/killed tasks in this scenario, and sometimes the
> import fails altogether.
>
> If I instead import a smaller subset into the empty table, perform a manual
> split of the regions (via the split button on the web page), and then import
> the larger dataset, the import runs for about 10 minutes.
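>
> Ideally I would trigger those splits from code rather than clicking through
> the web UI. Something like the following is what I have in mind (just a
> sketch; I am assuming an HBaseAdmin.split() call, which I believe only
> newer HBase clients expose, and "mytable" is a placeholder):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HBaseAdmin;
>
> public class ForceSplit {
>   public static void main(String[] args) throws Exception {
>     HBaseConfiguration conf = new HBaseConfiguration();
>     HBaseAdmin admin = new HBaseAdmin(conf);
>     // ask the master to split each region of the table; with no
>     // explicit split point, a region splits at its midpoint
>     admin.split("mytable");
>   }
> }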
>
> It seems to me that the performance bottleneck during the first import is
> the single region on a single cluster machine; that machine is heavily
> loaded. So my question is whether I can force HBase to split sooner during
> heavy write operations, and which tuning parameters affect this
> scenario.
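>
> For example, I assume that lowering hbase.hregion.max.filesize in
> hbase-site.xml would make regions split sooner under write load, say 16 MB
> instead of our current 64 MB (that is just my guess at the relevant knob,
> not a verified fix):
>
> <property>
>   <name>hbase.hregion.max.filesize</name>
>   <!-- split a region once a store grows past 16 MB -->
>   <value>16777216</value>
> </property>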
>
> Thanks for your help,
> Matthias
>
> p.s. here are the details
>
> Details:
> 33 cluster machines in the test lab (3-year-old servers with hyperthreaded,
> single-core CPUs), 1.5 GB of memory, Debian 5 "Lenny" 32-bit
> Hadoop 0.19.0, HBase 0.19.0
> -Xmx500m for the Java processes
> hadoop
> mapred.map.tasks=20
> mapred.reduce.tasks=15
> dfs.block.size=16777216 (16 MB)
> mapred.tasktracker.map.tasks.maximum=4
> mapred.tasktracker.reduce.tasks.maximum=4
>
> hbase
> hbase.hregion.max.filesize=67108864 (64 MB)
>
> hbase table
> 3 column families
>
> import file
> 5 million records with 18 columns, 6 columns per family
> file size: 1.1 GB CSV file
> import via the provided Java SampleUploader
>
