Thanks for this info, it's great! On the memory issue, I was talking to the Azul guys and they said even a single swapped-out page can destroy Java performance. This is because the Java GC algorithm is at complete odds with the LRU swapper... the GC sweeper is always going to the next-coldest page, and the LRU swapper is always swapping out the next-coldest page :-) In other words, never, ever, let Java swap. With a fixed-size Xmx this is somewhat manageable, but then you get no ability to grow slightly during high load.
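To make that concrete, the usual defence is to fix the heap size up front and turn down the kernel's appetite for swapping. Just a sketch, not what Vidhya ran: the 8 GB heap is an illustrative number, and you still have to size heap + OS cache under physical RAM yourself.

  # hbase-env.sh: fix the heap size so it is allocated up front and never grows
  export HBASE_HEAPSIZE=8000                   # becomes -Xmx8000m (value is in MB)
  export HBASE_OPTS="-Xms8000m $HBASE_OPTS"    # -Xms == -Xmx: no lazy growth, no surprise paging

  # on each regionserver box, discourage the kernel from swapping anonymous (heap) pages
  sysctl -w vm.swappiness=0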
Thanks again,
-ryan

On Fri, Aug 13, 2010 at 3:02 PM, Vidhyashankar Venkataraman <[email protected]> wrote:
> 200 node experiment on bulk loads and scans:
> 30 KB rows, uncompressed, 1 column family. I generate random data on the fly.
> 4 gig regions, 1 MB HBase block size.
>
> A) Bulk loads: around 30 MBps per node. Most of the time was taken by the
> parts of my map reduce job not using the HBase API.
>
> * I had initially encountered GC issues when I was running MR apps while
> using the HBase API. It is the oft-repeated issue of GC taking up a lot of
> time (swap-ins and swap-outs), which results in the RS failing to report to
> ZK about its liveness. This results in an exception at the RS.
> * The way I fixed the GC problems was by:
>
> 1) JD's suggestion: to change the GC params in hbase-env.sh:
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> 2) Increasing the ZK timeout and the tick period. This is already
> covered in the HBase FAQs.
> Another suggestion by JD and stack was the vm swappiness
> parameter, but I didn't have permissions to reset it.
>
> Update API:
> I issued some updates through an MR job that issued batches of puts on
> existing and new rows. It added around 17 TB of new data, with between 2 and
> 3 storefiles per region.
>
> B) Scans: I used TableInputFormat. I got a rate of 28 MBps per node (average)
> and 43 MBps per node (median). There were 700 tasks running at a time, 47000
> tasks in total. Map completion time was 6 minutes on average and 4 minutes at
> the median; the 95th percentile is 20 mins. The max completion time of some
> tasks was 10 hours though!
> So there is a very small number of tasks that take a really long time to
> finish, and I haven't had enough time to figure out why since I have to rerun
> the tests now. I will try to fix this issue and let you guys know.
>
> C) I issued a major_compact on the entire table:
> The entire set of 117 TB (after some updates) got read and rewritten.
> Finished in 28 hours: roughly 7 MBps per node. With no concurrent compactions
> on any node, I think this is a good number.
>
> D) There are more experiments that need to be done, but only next week. I
> will post any updates.
> For example, I want to see the effect of the number of storefiles per region
> and of column families on scans and compactions.
>
> Summarizing, for the access patterns that we have been testing, memory
> sensitivity is the one major problem that we have faced. Can you guys let me
> know if any of the numbers sound a little off?
>
> Thank you
> Vidhya
>
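For anyone wanting to reproduce the two GC fixes in Vidhya's mail, here is roughly how they land in the config files. This is only a sketch: the property names are the stock hbase-env.sh / hbase-site.xml ones, the 60-second session timeout is an illustrative value rather than Vidhya's actual setting, and the tickTime override only matters if HBase manages the ZK quorum (otherwise tickTime lives in zoo.cfg).

  # hbase-env.sh -- fix 1: the GC flags quoted above
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m \
    -XX:CMSInitiatingOccupancyFraction=88 -XX:+DoEscapeAnalysis -XX:+AggressiveOpts \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps $HBASE_OPTS"

  <!-- hbase-site.xml -- fix 2: give the RS longer to report in before ZK expires its session -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>   <!-- ms; illustrative, not Vidhya's number -->
  </property>
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>6000</value>    <!-- ms; ZK caps the session timeout at 20 * tickTime -->
  </property>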
