Hi N Keywal,

This result:
Time to read 10000 lines : 122.0 mseconds (81967 lines/seconds)

is obtained with this code:

HTable table = new HTable(config, "test3");
final int linesToRead = 10000;
System.out.println(new java.util.Date() + " Processing iteration " + iteration + "... ");
RandomRowFilter rrf = new RandomRowFilter();
KeyOnlyFilter kof = new KeyOnlyFilter();
Scan scan = new Scan();
// Combine the two filters; calling setFilter() twice keeps only the last one.
FilterList filters = new FilterList();
filters.addFilter(rrf);
filters.addFilter(kof);
scan.setFilter(filters);
scan.setBatch(Math.min(linesToRead, 1000));
scan.setCaching(Math.min(linesToRead, 1000));
ResultScanner scanner = table.getScanner(scan);
processed = 0;
long timeBefore = System.currentTimeMillis();
for (Result result : scanner.next(linesToRead))
{
  if (result != null)
    processed++;
}
scanner.close();
long timeAfter = System.currentTimeMillis();
float duration = timeAfter - timeBefore;
System.out.println("Time to read " + linesToRead + " lines : " + duration
  + " mseconds (" + Math.round((float)linesToRead / (duration / 1000)) + " lines/seconds)");
table.close();

This is with the scan.

scan > 80 000 lines/seconds
put  > 20 000 lines/seconds
get  <    300 lines/seconds

2012/6/28, Jean-Marc Spaggiari <jean-m...@spaggiari.org>:
> Hi Anoop,
>
> Are Bloom filters for columns? If I add "g.setFilter(new
> KeyOnlyFilter());", does that mean I can't use bloom filters?
> Basically, what I'm doing here is something like
> "existKey(byte[]):boolean" where I try to see if a key exists in the
> database without taking into consideration whether there is any column
> content or not. This should be very fast. Even faster than the scan,
> which needs to keep track of where it's reading for the next row.
>
> JM
>
> 2012/6/28, Anoop Sam John <anoo...@huawei.com>:
>>> blockCacheHitRatio=69%
>> Seems the blocks you are reading are coming from the cache.
>> You can check with Blooms also once.
>>
>> You can enable the usage of bloom filters using the config param
>> "io.storefile.bloom.enabled" set to true.
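[The config param mentioned just above goes in hbase-site.xml. A minimal fragment might look like this (a sketch; in recent versions this param reportedly defaults to true, so it may already be enabled, and the per-family bloom type still has to be set separately):]

```xml
<property>
  <name>io.storefile.bloom.enabled</name>
  <value>true</value>
</property>
```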
>> This will enable the usage of bloom filters globally.
>> Now you need to set the bloom type for your CF:
>> HColumnDescriptor#setBloomFilterType(). You can check with type
>> BloomType.ROW.
>>
>> -Anoop-
>>
>> _____________________________________
>> From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
>> Sent: Thursday, June 28, 2012 5:42 PM
>> To: user@hbase.apache.org
>> Subject: Re: Scan vs Put vs Get
>>
>> Oh! I never looked at this part ;) Ok. I have it.
>>
>> Here are the numbers for one server before the read:
>>
>> blockCacheSizeMB=186.28
>> blockCacheFreeMB=55.4
>> blockCacheCount=2923
>> blockCacheHitCount=195999
>> blockCacheMissCount=89297
>> blockCacheEvictedCount=69858
>> blockCacheHitRatio=68%
>> blockCacheHitCachingRatio=72%
>>
>> And here are the numbers after 100 iterations of 1000 gets for the same
>> server:
>>
>> blockCacheSizeMB=194.44
>> blockCacheFreeMB=47.25
>> blockCacheCount=3052
>> blockCacheHitCount=232034
>> blockCacheMissCount=103250
>> blockCacheEvictedCount=83682
>> blockCacheHitRatio=69%
>> blockCacheHitCachingRatio=72%
>>
>> Don't forget that there are between 40B and 50B lines in the table,
>> so I don't think the servers can store all of them in memory. And
>> since I'm accessing based on a random key, the odds of having the right
>> row in memory are small, I think.
>>
>> JM
>>
>> 2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
>>> In 0.94
>>>
>>> The UI of the RS has a metrics table. In that you can see
>>> blockCacheHitCount, blockCacheMissCount etc. Maybe there is a
>>> variation when you do scan() and get() here.
>>>
>>> Regards
>>> Ram
>>>
>>>> -----Original Message-----
>>>> From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
>>>> Sent: Thursday, June 28, 2012 4:44 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: Scan vs Put vs Get
>>>>
>>>> Wow. First, thanks a lot all for jumping into this.
>>>>
>>>> Let me try to reply to everyone in a single post.
>>>>
>>>> > How many Gets you batch together in one call
>>>> I tried with multiple different values from 10 to 3000, with similar
>>>> results:
>>>> Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
>>>> Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
>>>> Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
>>>> Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)
>>>>
>>>> > Is this equal to the Scan#setCaching() that you are using?
>>>> The scan call is done after the get test, so I can't set the cache for
>>>> the scan before I do the gets. Also, I tried to run them separately
>>>> (one time only the put, one time only the get, etc.) and I did not
>>>> find a way to set up the cache for the get.
>>>>
>>>> > If both are same u can be sure that the number of NW calls is
>>>> coming almost same.
>>>> Here are the results for 10 000 gets and 10 000 scan.next(). Each time
>>>> I access the result to be sure it is sent to the client.
>>>> (gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
>>>> (scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)
>>>>
>>>> > [Block caching is enabled?]
>>>> Good question. I don't know :( Is it enabled by default? How can I
>>>> verify or activate it?
>>>>
>>>> > Also have you tried using Bloom filters?
>>>> Not yet. They are on page 381 of Lars' book and I'm only on page 168 ;)
>>>>
>>>> > What's the hbase version you're using?
>>>> I manually installed 0.94.0. I can try with another version.
>>>>
>>>> > Is it repeatable?
>>>> Yes. I tried many, many times, adding some options, closing some
>>>> processes on the server side, removing one datanode, adding one, etc.
>>>> I can see some small variations, but still in the same range. I was
>>>> able to move from 200 rows/second to 300 rows/second, but that's not
>>>> really a significant improvement. Also, here are the results for 7
>>>> iterations of the same code.
>>>>
>>>> Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
>>>> Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
>>>> Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
>>>> Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
>>>> Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
>>>> Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
>>>> Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
>>>>
>>>> > If the locations are wrong (region moved) you will have a retry loop
>>>> I have one dead region server. It's a server I brought down a few days
>>>> ago because it was too slow, but it's still showing on the hbase web
>>>> interface. However, if I look at the table, there is no table region
>>>> hosted on this server. Hadoop was also removed from it, so it's
>>>> reporting one dead node.
>>>>
>>>> > Do you have anything in the logs?
>>>> Nothing special. Only some "Block cache LRU eviction" entries.
>>>>
>>>> > Could you share as well the code
>>>> Everything is at the end of this post.
>>>>
>>>> > You can also check the cache hit and cache miss statistics that
>>>> appears on the UI?
>>>> Can you please tell me how I can find that? I was not able to find it
>>>> on the web UI. Where should I look?
>>>>
>>>> > In your random scan how many Regions are scanned
>>>> I only have 5 region servers and 12 table regions, so I guess all the
>>>> servers are called.
>>>>
>>>> So here is the code for the gets. I removed the KeyOnlyFilter because
>>>> it's not improving the results.
>>>>
>>>> JM
>>>>
>>>> http://pastebin.com/K75nFiQk (for syntax highlighting)
>>>>
>>>> HTable table = new HTable(config, "test3");
>>>>
>>>> for (int iteration = 0; iteration < 10; iteration++)
>>>> {
>>>>   final int linesToRead = 1000;
>>>>   System.out.println(new java.util.Date() + " Processing iteration " +
>>>> iteration + "... ");
>>>>   Vector<Get> gets = new Vector<Get>(linesToRead);
>>>>
>>>>   for (long l = 0; l < linesToRead; l++)
>>>>   {
>>>>     byte[] array1 = new byte[24];
>>>>     for (int i = 0; i < array1.length; i++)
>>>>       array1[i] = (byte)Math.floor(Math.random() * 256);
>>>>     Get g = new Get(array1);
>>>>     gets.addElement(g);
>>>>
>>>>     processed++;
>>>>   }
>>>>   Object[] results = new Object[gets.size()];
>>>>
>>>>   long timeBefore = System.currentTimeMillis();
>>>>   table.batch(gets, results);
>>>>   long timeAfter = System.currentTimeMillis();
>>>>
>>>>   float duration = timeAfter - timeBefore;
>>>>   System.out.println("Time to read " + gets.size() + " lines : " +
>>>> duration + " mseconds (" + Math.round((float)linesToRead / (duration
>>>> / 1000)) + " lines/seconds)");
>>>>
>>>>   for (int i = 0; i < results.length; i++)
>>>>   {
>>>>     // batch() fills results with Result (or Throwable), not KeyValue
>>>>     if (results[i] instanceof Result)
>>>>       if (!((Result)results[i]).isEmpty())
>>>>         System.out.println("Result[" + i + "]: " + results[i]); // co
>>>> BatchExample-9-Dump Print all results.
>>>>   }
>>>> }
>>>>
>>>> 2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
>>>> > Hi
>>>> >
>>>> > You can also check the cache hit and cache miss statistics that
>>>> appears on the UI?
>>>> >
>>>> > In your random scan how many Regions are scanned? Whereas in gets it
>>>> > may be many due to randomness.
>>>> >
>>>> > Regards
>>>> > Ram
>>>> >
>>>> >> -----Original Message-----
>>>> >> From: N Keywal [mailto:nkey...@gmail.com]
>>>> >> Sent: Thursday, June 28, 2012 2:00 PM
>>>> >> To: user@hbase.apache.org
>>>> >> Subject: Re: Scan vs Put vs Get
>>>> >>
>>>> >> Hi Jean-Marc,
>>>> >>
>>>> >> Interesting.... :-)
>>>> >>
>>>> >> Added to Anoop's questions:
>>>> >>
>>>> >> What's the hbase version you're using?
>>>> >>
>>>> >> Is it repeatable? I mean, if you try the same "gets" twice with the
>>>> >> same client, do you have the same results? I'm asking because the
>>>> >> client caches the locations.
>>>> >>
>>>> >> If the locations are wrong (region moved) you will have a retry
>>>> >> loop, and it includes a sleep. Do you have anything in the logs?
>>>> >>
>>>> >> Could you share as well the code you're using to get the ~100 ms
>>>> >> time?
>>>> >>
>>>> >> Cheers,
>>>> >>
>>>> >> N.
>>>> >>
>>>> >> On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John <anoo...@huawei.com>
>>>> >> wrote:
>>>> >> > Hi
>>>> >> > How many Gets do you batch together in one call? Is this equal to
>>>> >> the Scan#setCaching() that you are using?
>>>> >> > If both are same u can be sure that the number of NW calls is
>>>> >> coming almost same.
>>>> >> >
>>>> >> > Also you are giving random keys in the Gets. The scan will always
>>>> >> be sequential. Seems in your get scenario it is very, very random
>>>> >> reads, resulting in too many reads of HFile blocks from HDFS.
>>>> >> [Block caching is enabled?]
>>>> >> >
>>>> >> > Also have you tried using Bloom filters? ROW blooms might improve
>>>> >> your get performance.
>>>> >> >
>>>> >> > -Anoop-
>>>> >> > ________________________________________
>>>> >> > From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
>>>> >> > Sent: Thursday, June 28, 2012 5:04 AM
>>>> >> > To: user
>>>> >> > Subject: Scan vs Put vs Get
>>>> >> >
>>>> >> > Hi,
>>>> >> >
>>>> >> > I have a small piece of code, for testing, which is putting 1B
>>>> >> lines
>>>> >> > in an existing table, getting 3000 lines and scanning 10000.
>>>> >> >
>>>> >> > The table is one family, one column.
>>>> >> >
>>>> >> > Everything is done randomly. Put with random key (24 bytes), fixed
>>>> >> > family and fixed column names with random content (24 bytes).
>>>> >> >
>>>> >> > Get (batch) is done with random keys, and scan with
>>>> >> RandomRowFilter.
>>>> >> >
>>>> >> > And here are the results.
>>>> >> > Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
>>>> >> > That's correct for my needs based on the poor performance of the
>>>> >> > servers in the cluster. I'm fine with the results.
>>>> >> >
>>>> >> > Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
>>>> >> > This is way too low. I don't understand why. So I tried the random
>>>> >> scan
>>>> >> > because I'm not able to figure out the issue.
>>>> >> >
>>>> >> > Time to read 10000 lines: 108.0 mseconds (92593 lines/seconds)
>>>> >> > This is impressive! I added that after I failed with the get. I
>>>> >> > moved from 262 lines per second to almost 100K lines/second!!!
>>>> >> It's
>>>> >> > awesome!
>>>> >> >
>>>> >> > However, I'm still wondering what's wrong with my gets.
>>>> >> >
>>>> >> > The code is very simple. I'm using Get objects that I'm executing
>>>> >> in a
>>>> >> > Batch. I tried to add a filter but it's not helping. Here is an
>>>> >> > extract of the code.
>>>> >> >
>>>> >> > for (long l = 0; l < linesToRead; l++)
>>>> >> > {
>>>> >> >   byte[] array1 = new byte[24];
>>>> >> >   for (int i = 0; i < array1.length; i++)
>>>> >> >     array1[i] = (byte)Math.floor(Math.random() * 256);
>>>> >> >   Get g = new Get(array1);
>>>> >> >   gets.addElement(g);
>>>> >> > }
>>>> >> > Object[] results = new Object[gets.size()];
>>>> >> > System.out.println(new java.util.Date() + " \"gets\" created.");
>>>> >> > long timeBefore = System.currentTimeMillis();
>>>> >> > table.batch(gets, results);
>>>> >> > long timeAfter = System.currentTimeMillis();
>>>> >> >
>>>> >> > float duration = timeAfter - timeBefore;
>>>> >> > System.out.println("Time to read " + gets.size() + " lines : "
>>>> >> > + duration + " mseconds (" + Math.round((float)linesToRead /
>>>> >> > (duration / 1000)) + " lines/seconds)");
>>>> >> >
>>>> >> > What's wrong with it? I can't add setBatch, nor can I add
>>>> >> > setCaching, because it's not a scan. I tried with different
>>>> >> numbers of
>>>> >> > gets but it's almost always the same speed. Am I using it the
>>>> >> wrong
>>>> >> > way? Does anyone have any advice to improve that?
>>>> >> >
>>>> >> > Thanks,
>>>> >> >
>>>> >> > JM
>>>> >
>>>> >
>>>
>>>
>
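[As an aside, all the lines/seconds figures in this thread come from the same one-line formula used in the snippets above. A tiny standalone check (plain Java, no HBase needed) reproduces the reported scan and get rates:]

```java
public class ThroughputCheck {
    // Same formula as in the snippets above:
    // rate = round(linesToRead / (durationMs / 1000))
    static long rate(int linesToRead, float durationMs) {
        return Math.round((float) linesToRead / (durationMs / 1000));
    }

    public static void main(String[] args) {
        // 10000 lines in 122 ms, as reported for the scan
        System.out.println(rate(10000, 122.0f)); // prints 81967
        // 3000 lines in 13582 ms, as reported for the batched gets
        System.out.println(rate(3000, 13582.0f)); // prints 221
    }
}
```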