You might have to delve into tweaking the GC settings.  Here is what I am
setting:

export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:NewSize=12m -XX:MaxNewSize=12m \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/export/hadoop/logs/gc.log"

What I found is that once your in-memory footprint grows, the JVM resizes the
ParNew/new-gen space up and up, which lengthens the so-called 'minor
collections'.  At one point, with a 150 MB ParNew, I was typically seeing
100-150ms pauses, with outliers of 400ms, for 'minor' GC.  If all the machines
in your cluster are pausing for 100ms at semi-random intervals, that holds up
your import because the clients are waiting on the paused JVM to continue.
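
If you want to watch this happen, the options above log every collection, and
with -XX:+PrintGCDetails each ParNew line shows the pause time in seconds, so
a plain grep over that log is enough to see the minor collections creep up
(nothing HBase-specific here, just the log path from the options above):

grep ParNew /export/hadoop/logs/gc.log | tail -n 20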

The key thing I had to do was set -XX:MaxNewSize=12m (about 2 * my L2 cache on
the Xeons).  You get more frequent, but smaller, GCs.  Your heap also tends to
grow larger than before, but with CMS that doesn't result in longer VM pauses,
just more RAM usage.  I personally use -Xmx4500m.   With your machines, I'd
consider a setting of at least 1500m, preferably 2000-2500m.  Of course I have
a ton of heap to chuck at it, so CMS collections don't happen all the time
(but when they do, they can prune 1500 MB of garbage).
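
For what it is worth, on your 4 GB nodes a first cut might be to append that
heap cap to the options above (e.g. in hbase-env.sh).  Just a sketch -- the
2000m is the suggestion from this mail, not something I have benchmarked on
your hardware:

# sketch: cap the heap at 2000m for a 4 GB node, appended to the
# HBASE_OPTS line shown earlier
export HBASE_OPTS="${HBASE_OPTS} -Xmx2000m"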

Since you are on a 2-core box, you will probably have to switch CMS to
incremental mode:

-XX:+CMSIncrementalMode

This prevents the CMS GC from starving out your main threads.
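
That is just one more flag appended to the same line; roughly:

# sketch: incremental CMS does the old-gen collection work in small
# chunks so it does not monopolize one of your two cores
export HBASE_OPTS="${HBASE_OPTS} -XX:+CMSIncrementalMode"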

Good luck with it!
-ryan

On Wed, Apr 29, 2009 at 3:33 PM, Jim Twensky <jim.twen...@gmail.com> wrote:

> Hi,
>
> I'm doing some experiments to import large datasets to Hbase using a Map
> job. Before posting some numbers, here is a summary of my test cluster:
>
> I have 7 regionservers and 1 master. I also run HDFS datanodes and Hadoop
> tasktrackers on the same 7 regionservers. Similarly, I run the Hadoop
> namenode on the same machine as the Hbase master. Each machine is an IBM
> e325 node with two 2.2 GHz AMD64 processors, 4 GB RAM, and 80 GB of local
> disk.
>
> My dataset is simply the output of another map reduce job, consisting of 7
> sequence files with a total size of 40 GB. Each file contains key, value
> records of the form (Text, LongWritable). The keys are sentences or phrases
> extracted from sentences and the values are frequencies. The total number of
> records is roughly 420m, and an average key is around 100 bytes (40 GB /
> 420m, ignoring the long writables).
>
> I tried to randomize the (key, value) pairs with another map reduce job, and
> I also set:
>
>            table.setAutoFlush(false);
>            table.setWriteBufferSize(1024*1024*10);
>
> based on some advice that I read earlier on the list. My Map function that
> imports data into Hbase is as follows:
>
>    public void map(Text key, LongWritable value,
>                    OutputCollector<NullWritable, NullWritable> output,
>                    Reporter reporter) throws IOException {
>
>        BatchUpdate update = new BatchUpdate(key.toString());
>        update.put("frequency:value", Bytes.toBytes(value.get()));
>
>        table.commit(update);
>    }
>
> So far I can hit 20% of the import in 40-45 minutes, so importing the whole
> data set will presumably take more than 3.5 hours. I tried different write
> buffer sizes between 5 MB and 20 MB and didn't get any significant
> improvements. I ran my experiments with 1 or 2 mappers per node, although 1
> mapper per node seemed to do better than 2. When I refresh the Hbase master
> web interface during my imports, I see the requests are generally divided
> equally among the 7 regionservers, and as I keep hitting the refresh button,
> I can see that I get 10000 to 70000 requests at once.
>
> I read some earlier posts from Ryan and Stack, and I was actually expecting
> at least twice the performance, so I decided to ask the list whether this is
> the expected performance or way below it.
>
> I'd appreciate any comments/suggestions.
>
>
> Thanks,
> Jim
>
