Hi Anoop,

Are Bloom filters for columns? If I add "g.setFilter(new
KeyOnlyFilter());", does that mean I can't use bloom filters?
Basically, what I'm doing here is something like
"existKey(byte[]):boolean", where I try to see if a key exists in the
database without taking into consideration whether there is any column
content or not. This should be very fast, even faster than the scan,
which needs to keep track of where I'm reading for the next row.
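
Roughly what I have in mind, as a sketch (the helper and variable names
are just for illustration, and "table" is an already-opened HTable):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

// existKey(): ask the region server whether anything exists for the key,
// without shipping any cell content back to the client.
public boolean existKey(HTable table, byte[] key) throws IOException {
    Get g = new Get(key);
    // Strip the values server side; we only care about existence.
    g.setFilter(new KeyOnlyFilter());
    return table.exists(g);
}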

JM

2012/6/28, Anoop Sam John <anoo...@huawei.com>:
>>blockCacheHitRatio=69%
> Seems you are getting blocks from the cache.
> You can also check with Blooms once.
>
> You can enable the usage of blooms using the config param
> "io.storefile.bloom.enabled" set to true. This will enable the usage of
> blooms globally.
> Now you need to set the bloom type for your CF with
> HColumnDescriptor#setBloomFilterType(). You can check with type
> BloomType.ROW.
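>
> Something like this, as a rough sketch (the table and CF names below are
> placeholders; the global switch goes into hbase-site.xml on the region
> servers, not into client code):
>
> Configuration conf = HBaseConfiguration.create();
> HBaseAdmin admin = new HBaseAdmin(conf);
> admin.disableTable("test3");
> // Start from the existing CF descriptor so other settings are kept.
> HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("test3"));
> HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("cf"));
> cf.setBloomFilterType(StoreFile.BloomType.ROW); // ROW bloom for this CF
> admin.modifyColumn(Bytes.toBytes("test3"), cf);
> admin.enableTable("test3");
> admin.close();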
>
> -Anoop-
>
> _____________________________________
> From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
> Sent: Thursday, June 28, 2012 5:42 PM
> To: user@hbase.apache.org
> Subject: Re: Scan vs Put vs Get
>
> Oh! I never looked at this part ;) Ok. I have it.
>
> Here are the numbers for one server before the read:
>
> blockCacheSizeMB=186.28
> blockCacheFreeMB=55.4
> blockCacheCount=2923
> blockCacheHitCount=195999
> blockCacheMissCount=89297
> blockCacheEvictedCount=69858
> blockCacheHitRatio=68%
> blockCacheHitCachingRatio=72%
>
> And here are the numbers after 100 iterations of 1000 gets for the same
> server:
>
> blockCacheSizeMB=194.44
> blockCacheFreeMB=47.25
> blockCacheCount=3052
> blockCacheHitCount=232034
> blockCacheMissCount=103250
> blockCacheEvictedCount=83682
> blockCacheHitRatio=69%
> blockCacheHitCachingRatio=72%
>
> Don't forget that there are between 40B and 50B lines in the table,
> so I don't think the servers can store all of them in memory. And
> since I'm accessing based on a random key, I think the odds of having
> the right row in memory are small.
>
> JM
>
> 2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
>> In 0.94
>>
>> The UI of the RS has a metrics table. In it you can see
>> blockCacheHitCount, blockCacheMissCount, etc. Maybe there is a variation
>> when you do scan() and get() here.
>>
>> Regards
>> Ram
>>
>>
>>
>>> -----Original Message-----
>>> From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
>>> Sent: Thursday, June 28, 2012 4:44 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: Scan vs Put vs Get
>>>
>>> Wow. First, thanks a lot all for jumping into this.
>>>
>>> Let me try to reply to everyone in a single post.
>>>
>>> > How many Gets do you batch together in one call?
>>> I tried with multiple different values from 10 to 3000 with similar
>>> results.
>>> Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
>>> Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
>>> Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
>>> Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)
>>>
>>> > Is this equal to the Scan#setCaching() that you are using?
>>> The scan call is done after the get test, so I can't set the cache for
>>> the scan before I do the gets. Also, I tried to run them separately (one
>>> time only the put, one time only the get, etc.) and I did not find a
>>> way to set up the cache for the get.
>>>
>>> > If both are the same, you can be sure that the number of NW calls is
>>> > almost the same.
>>> Here are the results for 10 000 gets and 10 000 scan.next(). Each time
>>> I access the results to be sure they are sent to the client.
>>> (gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
>>> (scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)
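>>>
>>> For reference, the scan side is roughly this (a simplified sketch; the
>>> caching value and sampling chance below are placeholders, not my exact
>>> settings):
>>>
>>> Scan scan = new Scan();
>>> scan.setCaching(1000);                     // rows fetched per RPC
>>> scan.setFilter(new RandomRowFilter(0.5f)); // random row sampling
>>> ResultScanner scanner = table.getScanner(scan);
>>> int read = 0;
>>> for (Result r = scanner.next(); r != null && read < 10000; r = scanner.next()) {
>>>     read++; // touch the result so it is really fetched
>>> }
>>> scanner.close();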
>>>
>>> >[Block caching is enabled?]
>>> Good question. I don't know :( Is it enabled by default? How can I
>>> verify or activate it?
>>>
>>> > Also have you tried using Bloom filters?
>>> Not yet. They are on page 381 of Lars' book and I'm only on page 168 ;)
>>>
>>>
>>> > What's the hbase version you're using?
>>> I manually installed 0.94.0. I can try with another version.
>>>
>>> > Is it repeatable?
>>> Yes. I tried many times by adding some options, closing some
>>> processes on the server side, removing one datanode, adding one, etc. I
>>> can see some small variations, but still in the same range. I was able
>>> to move from 200 rows/second to 300 rows/second, but that's not
>>> really a significant improvement. Also, here are the results for 7
>>> iterations of the same code.
>>>
>>> Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
>>> Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
>>> Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
>>> Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
>>> Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
>>> Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
>>> Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
>>>
>>> >If the locations are wrong (region moved) you will have a retry loop
>>> I have one dead region server. It's a server I brought down a few days
>>> ago because it was too slow, but it's still shown on the HBase web
>>> interface. However, if I look at the table, there is no table region
>>> hosted on this server. Hadoop was also removed from it, so it's
>>> reporting one dead node.
>>>
>>> >Do you have anything in the logs?
>>> Nothing special. Only some "Block cache LRU eviction" entries.
>>>
>>> > Could you share as well the code
>>> Everything is at the end of this post.
>>>
>>> > You can also check the cache hit and cache miss statistics that
>>> > appear on the UI?
>>> Can you please tell me how I can find that? I was not able to find
>>> that on the web UI. Where should I look?
>>>
>>> > In your random scan how many Regions are scanned
>>> I only have 5 region servers and 12 table regions, so I guess all the
>>> servers are called.
>>>
>>>
>>> So here is the code for the gets. I removed the KeyOnlyFilter because
>>> it's not improving the results.
>>>
>>> JM
>>>
>>>
>>>
>>>
>>> http://pastebin.com/K75nFiQk (for syntax highlighting)
>>>
>>> HTable table = new HTable(config, "test3");
>>>
>>> for (int iteration = 0; iteration < 10; iteration++)
>>> {
>>>     final int linesToRead = 1000;
>>>     System.out.println(new java.util.Date() + " Processing iteration "
>>>         + iteration + "... ");
>>>     Vector<Get> gets = new Vector<Get>(linesToRead);
>>>
>>>     // Build one Get per random 24-byte key.
>>>     for (long l = 0; l < linesToRead; l++)
>>>     {
>>>         byte[] array1 = new byte[24];
>>>         for (int i = 0; i < array1.length; i++)
>>>             array1[i] = (byte)Math.floor(Math.random() * 256);
>>>         Get g = new Get(array1);
>>>         gets.addElement(g);
>>>
>>>         processed++; // running total, declared elsewhere
>>>     }
>>>
>>>     Object[] results = new Object[gets.size()];
>>>
>>>     // Time the batched gets.
>>>     long timeBefore = System.currentTimeMillis();
>>>     table.batch(gets, results);
>>>     long timeAfter = System.currentTimeMillis();
>>>
>>>     float duration = timeAfter - timeBefore;
>>>     System.out.println("Time to read " + gets.size() + " lines : "
>>>         + duration + " mseconds (" + Math.round(((float)linesToRead
>>>         / (duration / 1000))) + " lines/seconds)");
>>>
>>>     // Touch the results to make sure they really reached the client.
>>>     for (int i = 0; i < results.length; i++)
>>>     {
>>>         if (results[i] instanceof KeyValue)
>>>             if (!((KeyValue)results[i]).isEmptyColumn())
>>>                 System.out.println("Result[" + i + "]: " + results[i]);
>>>                 // co BatchExample-9-Dump Print all results.
>>>     }
>>> }
>>>
>>> 2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
>>> > Hi
>>> >
>>> > You can also check the cache hit and cache miss statistics that
>>> > appear on the UI?
>>> >
>>> > In your random scan, how many regions are scanned? In gets it may be
>>> > many, due to randomness.
>>> >
>>> > Regards
>>> > Ram
>>> >
>>> >> -----Original Message-----
>>> >> From: N Keywal [mailto:nkey...@gmail.com]
>>> >> Sent: Thursday, June 28, 2012 2:00 PM
>>> >> To: user@hbase.apache.org
>>> >> Subject: Re: Scan vs Put vs Get
>>> >>
>>> >> Hi Jean-Marc,
>>> >>
>>> >> Interesting.... :-)
>>> >>
>>> >> Added to Anoop questions:
>>> >>
>>> >> What's the hbase version you're using?
>>> >>
>>> >> Is it repeatable? I mean, if you try the same "gets" twice with the
>>> >> same client, do you have the same results? I'm asking because the
>>> >> client caches the locations.
>>> >>
>>> >> If the locations are wrong (region moved) you will have a retry loop,
>>> >> and it includes a sleep. Do you have anything in the logs?
>>> >>
>>> >> Could you also share the code you're using to get the ~100 ms time?
>>> >>
>>> >> Cheers,
>>> >>
>>> >> N.
>>> >>
>>> >> On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John <anoo...@huawei.com>
>>> >> wrote:
>>> >> > Hi
>>> >> >     How many Gets do you batch together in one call? Is this equal
>>> >> > to the Scan#setCaching() that you are using?
>>> >> > If both are the same, you can be sure that the number of NW calls is
>>> >> > almost the same.
>>> >> >
>>> >> > Also, you are giving random keys in the Gets. The scan will always
>>> >> > be sequential. It seems that in your get scenario there are very
>>> >> > random reads, resulting in too many reads of HFile blocks from HDFS.
>>> >> > [Is block caching enabled?]
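>>> >> >
>>> >> > One way to check that on the CF, as a rough sketch (placeholder
>>> >> > table/CF names; the block cache is on by default):
>>> >> >
>>> >> > HBaseAdmin admin = new HBaseAdmin(conf);
>>> >> > HTableDescriptor td = admin.getTableDescriptor(Bytes.toBytes("test3"));
>>> >> > System.out.println(td.getFamily(Bytes.toBytes("cf")).isBlockCacheEnabled());
>>> >> > admin.close();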
>>> >> >
>>> >> > Also, have you tried using Bloom filters? ROW blooms might improve
>>> >> > your get performance.
>>> >> >
>>> >> > -Anoop-
>>> >> > ________________________________________
>>> >> > From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
>>> >> > Sent: Thursday, June 28, 2012 5:04 AM
>>> >> > To: user
>>> >> > Subject: Scan vs Put vs Get
>>> >> >
>>> >> > Hi,
>>> >> >
>>> >> > I have a small piece of code, for testing, which is putting 1B lines
>>> >> > in an existing table, getting 3000 lines, and scanning 10000.
>>> >> >
>>> >> > The table is one family, one column.
>>> >> >
>>> >> > Everything is done randomly: Put with a random key (24 bytes), fixed
>>> >> > family and fixed column names, with random content (24 bytes).
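>>> >> >
>>> >> > The put side is roughly this (a sketch; the family and qualifier
>>> >> > names are placeholders, not my real ones):
>>> >> >
>>> >> > byte[] key = new byte[24];
>>> >> > byte[] value = new byte[24];
>>> >> > for (int i = 0; i < 24; i++) {
>>> >> >     key[i] = (byte)Math.floor(Math.random() * 256);
>>> >> >     value[i] = (byte)Math.floor(Math.random() * 256);
>>> >> > }
>>> >> > Put p = new Put(key);
>>> >> > p.add(Bytes.toBytes("f"), Bytes.toBytes("c"), value); // fixed family/column
>>> >> > table.put(p);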
>>> >> >
>>> >> > Get (batch) is done with random keys and the scan with RandomRowFilter.
>>> >> >
>>> >> > And here are the results.
>>> >> > Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
>>> >> > That's acceptable for my needs, given the poor performance of the
>>> >> > servers in the cluster. I'm fine with the results.
>>> >> >
>>> >> > Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
>>> >> > This is way too low and I don't understand why, so I tried the random
>>> >> > scan because I'm not able to figure out the issue.
>>> >> >
>>> >> > Time to read 10000 lines: 108.0 mseconds (92593 lines/seconds)
>>> >> > This is impressive! I added that after I failed with the get. I
>>> >> > moved from 262 lines per second to almost 100K lines/second!!! It's
>>> >> > awesome!
>>> >> >
>>> >> > However, I'm still wondering what's wrong with my gets.
>>> >> >
>>> >> > The code is very simple. I'm using Get objects that I'm executing in
>>> >> > a batch. I tried to add a filter but it's not helping. Here is an
>>> >> > extract of the code.
>>> >> >
>>> >> > for (long l = 0; l < linesToRead; l++)
>>> >> > {
>>> >> >     byte[] array1 = new byte[24];
>>> >> >     for (int i = 0; i < array1.length; i++)
>>> >> >         array1[i] = (byte)Math.floor(Math.random() * 256);
>>> >> >     Get g = new Get(array1);
>>> >> >     gets.addElement(g);
>>> >> > }
>>> >> >
>>> >> > Object[] results = new Object[gets.size()];
>>> >> > System.out.println(new java.util.Date() + " \"gets\" created.");
>>> >> >
>>> >> > long timeBefore = System.currentTimeMillis();
>>> >> > table.batch(gets, results);
>>> >> > long timeAfter = System.currentTimeMillis();
>>> >> >
>>> >> > float duration = timeAfter - timeBefore;
>>> >> > System.out.println("Time to read " + gets.size() + " lines : "
>>> >> >     + duration + " mseconds (" + Math.round(((float)linesToRead
>>> >> >     / (duration / 1000))) + " lines/seconds)");
>>> >> >
>>> >> > What's wrong with it? I can't add setBatch, nor can I add setCaching,
>>> >> > because it's not a scan. I tried with different numbers of gets but
>>> >> > it's almost always the same speed. Am I using it the wrong way? Does
>>> >> > anyone have any advice to improve that?
>>> >> >
>>> >> > Thanks,
>>> >> >
>>> >> > JM
>>> >
>>> >
>>
>>
