Oh! I never looked at this part ;) Ok. I have it. Here are the numbers for one server before the read:
blockCacheSizeMB=186.28 blockCacheFreeMB=55.4 blockCacheCount=2923
blockCacheHitCount=195999 blockCacheMissCount=89297
blockCacheEvictedCount=69858 blockCacheHitRatio=68%
blockCacheHitCachingRatio=72%

And here are the numbers after 100 iterations of 1000 gets for the same server:

blockCacheSizeMB=194.44 blockCacheFreeMB=47.25 blockCacheCount=3052
blockCacheHitCount=232034 blockCacheMissCount=103250
blockCacheEvictedCount=83682 blockCacheHitRatio=69%
blockCacheHitCachingRatio=72%

Don't forget that there are between 40B and 50B rows in the table, so I
don't think the servers can store all of them in memory. And since I'm
accessing based on a random key, the odds of having the right row in
memory are small, I think.

JM

2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
> In 0.94
>
> The UI of the RS has a metrics table. In that you can see
> blockCacheHitCount, blockCacheMissCount etc. Maybe there is a variation
> when you do scan() and get() here.
>
> Regards
> Ram
>
>> -----Original Message-----
>> From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
>> Sent: Thursday, June 28, 2012 4:44 PM
>> To: user@hbase.apache.org
>> Subject: Re: Scan vs Put vs Get
>>
>> Wow. First, thanks a lot to all for jumping into this.
>>
>> Let me try to reply to everyone in a single post.
>>
>> > How many Gets you batch together in one call
>> I tried multiple different values from 10 to 3000, with similar
>> results.
>> Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
>> Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
>> Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
>> Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)
>>
>> > Is this equal to the Scan#setCaching () that u are using?
>> The scan call is done after the get test, so I can't set the cache for
>> the scan before I do the gets. Also, I tried to run them separately
>> (one time only the put, one time only the get, etc.) and I did not
>> find a way to set up the cache for the get.
>>
>> > If both are same u can be sure that the number of NW calls is
>> coming almost same.
>> Here are the results for 10 000 gets and 10 000 scan.next(). Each time
>> I access the result to be sure they are sent to the client.
>> (gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
>> (scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)
>>
>> > [Block caching is enabled?]
>> Good question. I don't know :( Is it enabled by default? How can I
>> verify or activate it?
>>
>> > Also have you tried using Bloom filters?
>> Not yet. They are on page 381 of Lars' book and I'm only on page 168 ;)
>>
>> > What's the hbase version you're using?
>> I manually installed 0.94.0. I can try with another version.
>>
>> > Is it repeatable?
>> Yes. I tried many, many times, adding some options, closing some
>> processes on the server side, removing one datanode, adding one, etc. I
>> can see some small variations, but still in the same range. I was able
>> to move from 200 rows/second to 300 rows/second, but that's not
>> really a significant improvement. Also, here are the results for 7
>> iterations of the same code.
>>
>> Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
>> Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
>> Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
>> Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
>> Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
>> Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
>> Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
>>
>> > If the locations are wrong (region moved) you will have a retry loop
>> I have one dead region server. It's a server I brought down a few days
>> ago because it was too slow, but it's still on the HBase web interface.
>> However, if I look at the table, there is no table region hosted on
>> this server.
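For anyone reading along: blockCacheHitRatio in the metrics quoted above is just hits / (hits + misses) expressed as a whole percent. That it is truncated rather than rounded is an assumption on my part, but it is the interpretation that reproduces both quoted values (68.7% shows as 68%). A standalone sketch, plain Java with no HBase needed:

```java
public class HitRatio {
    // Block cache hit ratio as a truncated whole percent, the way it
    // is read off the region server metrics quoted above.
    static long ratioPercent(long hits, long misses) {
        return (100 * hits) / (hits + misses); // integer division truncates
    }

    public static void main(String[] args) {
        // Counters quoted above, before and after the 100 x 1000 gets:
        System.out.println(ratioPercent(195999, 89297));  // prints 68
        System.out.println(ratioPercent(232034, 103250)); // prints 69
    }
}
```

Note that blockCacheHitCachingRatio (72% in both snapshots) is a different counter, restricted to reads with caching enabled, and cannot be derived from the numbers above.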
>> Hadoop was also removed from it, so it's showing one dead node.
>>
>> > Do you have anything in the logs?
>> Nothing special. Only some "Block cache LRU eviction" entries.
>>
>> > Could you share as well the code
>> Everything is at the end of this post.
>>
>> > You can also check the cache hit and cache miss statistics that
>> appear on the UI?
>> Can you please tell me how I can find that? I was not able to find
>> that on the web UI. Where should I look?
>>
>> > In your random scan how many Regions are scanned
>> I only have 5 region servers and 12 table regions. So I guess all the
>> servers are called.
>>
>> So here is the code for the gets. I removed the KeyOnlyFilter because
>> it's not improving the results.
>>
>> JM
>>
>> http://pastebin.com/K75nFiQk (for syntax highlighting)
>>
>> HTable table = new HTable(config, "test3");
>>
>> for (int iteration = 0; iteration < 10; iteration++)
>> {
>>   final int linesToRead = 1000;
>>   System.out.println(new java.util.Date() + " Processing iteration "
>>       + iteration + "... ");
>>   Vector<Get> gets = new Vector<Get>(linesToRead);
>>
>>   for (long l = 0; l < linesToRead; l++)
>>   {
>>     byte[] array1 = new byte[24];
>>     for (int i = 0; i < array1.length; i++)
>>       array1[i] = (byte)Math.floor(Math.random() * 256);
>>     Get g = new Get(array1);
>>     gets.addElement(g);
>>
>>     processed++;
>>   }
>>   Object[] results = new Object[gets.size()];
>>
>>   long timeBefore = System.currentTimeMillis();
>>   table.batch(gets, results);
>>   long timeAfter = System.currentTimeMillis();
>>
>>   float duration = timeAfter - timeBefore;
>>   System.out.println("Time to read " + gets.size() + " lines : "
>>       + duration + " mseconds (" + Math.round(((float)linesToRead
>>       / (duration / 1000))) + " lines/seconds)");
>>
>>   for (int i = 0; i < results.length; i++)
>>   {
>>     if (results[i] instanceof KeyValue)
>>       if (!((KeyValue)results[i]).isEmptyColumn())
>>         System.out.println("Result[" + i + "]: " + results[i]);
>>         // co BatchExample-9-Dump Print all results.
>>   }
>> }
>>
>> 2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
>> > Hi
>> >
>> > You can also check the cache hit and cache miss statistics that
>> > appear on the UI?
>> >
>> > In your random scan how many Regions are scanned, whereas in gets
>> > maybe many, due to randomness.
>> >
>> > Regards
>> > Ram
>> >
>> >> -----Original Message-----
>> >> From: N Keywal [mailto:nkey...@gmail.com]
>> >> Sent: Thursday, June 28, 2012 2:00 PM
>> >> To: user@hbase.apache.org
>> >> Subject: Re: Scan vs Put vs Get
>> >>
>> >> Hi Jean-Marc,
>> >>
>> >> Interesting.... :-)
>> >>
>> >> Added to Anoop's questions:
>> >>
>> >> What's the hbase version you're using?
>> >>
>> >> Is it repeatable? I mean, if you try the same "gets" twice with the
>> >> same client, do you have the same results? I'm asking because the
>> >> client caches the locations.
>> >>
>> >> If the locations are wrong (region moved) you will have a retry
>> >> loop, and it includes a sleep. Do you have anything in the logs?
>> >>
>> >> Could you share as well the code you're using to get the ~100 ms
>> >> time?
>> >>
>> >> Cheers,
>> >>
>> >> N.
>> >>
>> >> On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John <anoo...@huawei.com>
>> >> wrote:
>> >> > Hi
>> >> > How many Gets do you batch together in one call? Is this equal to
>> >> > the Scan#setCaching() that u are using?
>> >> > If both are same u can be sure that the number of NW calls is
>> >> > coming almost same.
>> >> >
>> >> > Also you are giving random keys in the Gets. The scan will always
>> >> > be sequential. Seems in your get scenario it is very, very random
>> >> > reads, resulting in too many reads of HFile blocks from HDFS.
>> >> > [Block caching is enabled?]
>> >> >
>> >> > Also have you tried using Bloom filters? ROW blooms might improve
>> >> > your get performance.
>> >> >
>> >> > -Anoop-
>> >> > ________________________________________
>> >> > From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
>> >> > Sent: Thursday, June 28, 2012 5:04 AM
>> >> > To: user
>> >> > Subject: Scan vs Put vs Get
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have a small piece of code, for testing, which is putting 1B
>> >> > lines in an existing table, getting 3000 lines and scanning 10000.
>> >> >
>> >> > The table is one family, one column.
>> >> >
>> >> > Everything is done randomly: put with a random key (24 bytes),
>> >> > fixed family and fixed column names, with random content (24
>> >> > bytes).
>> >> >
>> >> > Get (batch) is done with random keys, and scan with
>> >> > RandomRowFilter.
>> >> >
>> >> > And here are the results.
>> >> > Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
>> >> > That's correct for my needs, given the poor performance of the
>> >> > servers in the cluster. I'm fine with the results.
>> >> >
>> >> > Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
>> >> > This is way too low, and I don't understand why. So I tried the
>> >> > random scan because I'm not able to figure out the issue.
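For reference, Anoop's two suggestions (block caching and ROW blooms) could be applied from the HBase shell roughly as below. This is only a sketch: the family name 'cf' is an assumption, since the thread never names the family, and in 0.94 BLOCKCACHE already defaults to true, so only the bloom filter line actually changes anything.

```
hbase> disable 'test3'
hbase> alter 'test3', {NAME => 'cf', BLOOMFILTER => 'ROW', BLOCKCACHE => 'true'}
hbase> enable 'test3'
```

The disable/enable pair is the safe path in 0.94, where online schema changes are not enabled by default; blooms only take effect on newly written HFiles (or after a major compaction).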
>> >> >
>> >> > Time to read 10000 lines : 108.0 mseconds (92593 lines/seconds)
>> >> > This is impressive! I added that after I failed with the gets. I
>> >> > moved from 262 lines per second to almost 100K lines/second!!!
>> >> > It's awesome!
>> >> >
>> >> > However, I'm still wondering what's wrong with my gets.
>> >> >
>> >> > The code is very simple. I'm using Get objects that I'm executing
>> >> > in a batch. I tried to add a filter but it's not helping. Here is
>> >> > an extract of the code.
>> >> >
>> >> > for (long l = 0; l < linesToRead; l++)
>> >> > {
>> >> >   byte[] array1 = new byte[24];
>> >> >   for (int i = 0; i < array1.length; i++)
>> >> >     array1[i] = (byte)Math.floor(Math.random() * 256);
>> >> >   Get g = new Get(array1);
>> >> >   gets.addElement(g);
>> >> > }
>> >> > Object[] results = new Object[gets.size()];
>> >> > System.out.println(new java.util.Date() + " \"gets\" created.");
>> >> > long timeBefore = System.currentTimeMillis();
>> >> > table.batch(gets, results);
>> >> > long timeAfter = System.currentTimeMillis();
>> >> >
>> >> > float duration = timeAfter - timeBefore;
>> >> > System.out.println("Time to read " + gets.size() + " lines : "
>> >> >     + duration + " mseconds (" + Math.round(((float)linesToRead
>> >> >     / (duration / 1000))) + " lines/seconds)");
>> >> >
>> >> > What's wrong with it? I can't add setBatch, nor setCaching,
>> >> > because it's not a scan. I tried with different numbers of gets,
>> >> > but it's almost always the same speed. Am I using it the wrong
>> >> > way? Does anyone have any advice to improve that?
>> >> >
>> >> > Thanks,
>> >> >
>> >> > JM
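The lines/seconds figures quoted throughout the thread all come from the Math.round computation shown in the snippets above. A standalone sketch reproducing two of the quoted numbers (the class name Throughput is mine; the arithmetic is taken verbatim from the thread's code):

```java
public class Throughput {
    // Same computation as the println in the thread's code:
    // round(lines / (durationMs / 1000)) gives lines per second.
    static long rate(int lines, float durationMs) {
        return Math.round((float) lines / (durationMs / 1000));
    }

    public static void main(String[] args) {
        System.out.println(rate(3000, 11444.0f)); // prints 262, as quoted
        System.out.println(rate(10000, 108.0f));  // prints 92593, as quoted
    }
}
```

Plugging in the scan result from the same email (10000 lines in 108 ms) against the get result (3000 lines in 11444 ms) makes the ~350x gap between sequential scanning and random batched gets concrete.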