RE: Scan vs Put vs Get

2012-06-28 Thread Ramkrishna.S.Vasudevan
Hi

You could also check the cache hit and cache miss statistics that appear on
the UI.

In your random scan, how many regions are scanned? In the gets it may be
many, due to the randomness.

Regards
Ram

 -Original Message-
 From: N Keywal [mailto:nkey...@gmail.com]
 Sent: Thursday, June 28, 2012 2:00 PM
 To: user@hbase.apache.org
 Subject: Re: Scan vs Put vs Get
 
 Hi Jean-Marc,
 
 Interesting :-)
 
 In addition to Anoop's questions:
 
 What's the hbase version you're using?
 
 Is it repeatable? I mean, if you run the same gets twice with the
 same client, do you get the same results? I'm asking because the
 client caches the region locations.
 
 If the locations are wrong (region moved) you will have a retry loop,
 and it includes a sleep. Do you have anything in the logs?
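
A minimal sketch of the client-side knobs behind that retry loop (0.94-era
configuration names; the values shown are the usual defaults, so treat them
as illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientRetryConfig {
  public static Configuration create() {
    Configuration config = HBaseConfiguration.create();
    // Maximum number of retries for a client operation.
    config.setInt("hbase.client.retries.number", 10);
    // Base pause in milliseconds between retries.
    config.setLong("hbase.client.pause", 1000);
    return config;
  }
}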
 
 Could you share as well the code you're using to get the ~100 ms time?
 
 Cheers,
 
 N.
 
 On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John anoo...@huawei.com
 wrote:
 Hi
     How many Gets do you batch together in one call? Is this equal to
 the Scan#setCaching() that you are using?
 If both are the same, you can be sure the number of NW calls is
 almost the same.
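
As a minimal sketch of the comparison being made here (0.94-era client API;
the table name "test3" and batch size 1000 are taken from later in the
thread, the rest is illustrative): a batch of n gets is grouped by region
server into multi-get RPCs, while a scan with caching n fetches n rows per
scanner RPC, so the two make a comparable number of network calls.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;

public class BatchVsCaching {
  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    HTable table = new HTable(config, "test3");
    final int n = 1000;

    // n gets in one client call; the client splits them per region server.
    List<Get> gets = new ArrayList<Get>(n);
    for (int i = 0; i < n; i++) {
      byte[] key = new byte[24];
      for (int j = 0; j < key.length; j++)
        key[j] = (byte) Math.floor(Math.random() * 256);
      gets.add(new Get(key));
    }
    Object[] results = new Object[gets.size()];
    table.batch(gets, results);

    // A scan with caching n fetches up to n rows per scanner RPC.
    Scan scan = new Scan();
    scan.setCaching(n);
    table.getScanner(scan).close();
    table.close();
  }
}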
 
 Also, you are giving random keys in the Gets, while the scan is always
 sequential. It seems your get scenario does very random reads,
 resulting in too many reads of HFile blocks from HDFS. [Is block
 caching enabled?]
 
 Also, have you tried using Bloom filters? ROW blooms might improve
 your get performance.
 
  -Anoop-
  
  From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
  Sent: Thursday, June 28, 2012 5:04 AM
  To: user
  Subject: Scan vs Put vs Get
 
  Hi,
 
 I have a small piece of code, for testing, which is putting 1M lines
 in an existing table, getting 3000 lines and scanning 10 000.
 
  The table is one family, one column.
 
 Everything is done randomly. Put with a random key (24 bytes), fixed
 family and fixed column names with random content (24 bytes).
 
  Get (batch) is done with random keys and scan with RandomRowFilter.
 
  And here are the results.
 Time to insert 1 000 000 lines: 43 seconds (23255 lines/seconds)
 That's adequate for my needs, given the poor performance of the
 servers in the cluster. I'm fine with these results.
 
 Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
 This is way too slow. I don't understand why. So I tried the random
 scan because I'm not able to figure out the issue.
 
 Time to read 10 000 lines: 108.0 mseconds (92593 lines/seconds)
 This is impressive! I added that after I failed with the gets. I
 moved from 262 lines per second to almost 100K lines/second!!! It's
 awesome!
 
  However, I'm still wondering what's wrong with my gets.
 
 The code is very simple. I'm using Get objects that I'm executing in
 a batch. I tried to add a filter but it's not helping. Here is an
 extract of the code.
 
for (long l = 0; l < linesToRead; l++)
{
        byte[] array1 = new byte[24];
        for (int i = 0; i < array1.length; i++)
                array1[i] = (byte)Math.floor(Math.random() * 256);
        Get g = new Get(array1);
        gets.addElement(g);
}
Object[] results = new Object[gets.size()];
System.out.println(new java.util.Date() + " \"gets\" created.");
long timeBefore = System.currentTimeMillis();
table.batch(gets, results);
long timeAfter = System.currentTimeMillis();

float duration = timeAfter - timeBefore;
System.out.println("Time to read " + gets.size() + " lines : "
        + duration + " mseconds (" + Math.round(((float)linesToRead
        / (duration / 1000))) + " lines/seconds)");
 
 What's wrong with it? I can't use setBatch, nor setCaching, because
 it's not a scan. I tried with different numbers of gets but it's
 almost always the same speed. Am I using it the wrong way? Does
 anyone have any advice to improve that?
 
  Thanks,
 
  JM



RE: Scan vs Put vs Get

2012-06-28 Thread Ramkrishna.S.Vasudevan
In 0.94

The UI of the RS has a metrics table. In it you can see blockCacheHitCount,
blockCacheMissCount, etc. Maybe there is a variation between scan() and
get() here.

Regards
Ram



 -Original Message-
 From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
 Sent: Thursday, June 28, 2012 4:44 PM
 To: user@hbase.apache.org
 Subject: Re: Scan vs Put vs Get
 
 Wow. First, thanks a lot all for jumping into this.
 
 Let me try to reply to everyone in a single post.
 
  How many Gets you batch together in one call
 I tried with multiple different values from 10 to 3000 with similar
 results.
 Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
 Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
 Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
 Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)
 
  Is this equal to the Scan#setCaching () that u are using?
 The scan call is done after the get test, so I can't set the cache for
 the scan before I do the gets. Also, I tried to run them separately
 (one time only the put, one time only the get, etc.), and I did not
 find a way to set up a cache for the gets.
 
 If both are the same, you can be sure the number of NW calls is
 almost the same.
 Here are the results for 10 000 gets and 10 000 scan.next(). Each time
 I access the result to be sure they are sent to the client.
 (gets) Time to read 10 000 lines : 36620.0 mseconds (273 lines/seconds)
 (scan) Time to read 10 000 lines : 119.0 mseconds (84034 lines/seconds)
 
 [Block caching is enabled?]
 Good question. I don't know :( Is it enabled by default? How can I
 verify or activate it?
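
One way to verify, as a minimal sketch against the 0.94 client API (block
caching is a per-family attribute and defaults to enabled):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HTable;

public class CheckBlockCache {
  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    HTable table = new HTable(config, "test3");
    // BLOCKCACHE is stored per column family in the table descriptor.
    for (HColumnDescriptor cf : table.getTableDescriptor().getColumnFamilies())
      System.out.println(cf.getNameAsString() + ": blockcache="
          + cf.isBlockCacheEnabled());
    table.close();
  }
}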
 
  Also have you tried using Bloom filters?
 Not yet. They are on page 381 of Lars' book and I'm only on page 168 ;)
 
 
  What's the hbase version you're using?
 I manually installed 0.94.0. I can try with another version.
 
  Is it repeatable?
 Yes. I tried many, many times, adding some options, closing some
 processes on the server side, removing one datanode, adding one, etc.
 I can see some small variations, but still in the same range. I was
 able to move from 200 rows/second to 300 rows/second, but that's not
 really a significant improvement. Also, here are the results for 7
 iterations of the same code.
 
 Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
 Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
 Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
 Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
 Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
 Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
 Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
 
 If the locations are wrong (region moved) you will have a retry loop
 I have one dead region server. It's a server I brought down a few days
 ago because it was too slow. But it's still on the HBase web interface.
 However, if I look at the table, there is no table region hosted on
 this server. Hadoop was also removed from it, so it's reporting one
 dead node.
 
 Do you have anything in the logs?
 Nothing special. Only some Block cache LRU eviction entries.
 
  Could you share as well the code
 Everything is at the end of this post.
 
 You can also check the cache hit and cache miss statistics that
 appears on the UI?
 Can you please tell me how I can find that? I was not able to find
 that on the web UI. Where should I look?
 
  In your random scan how many Regions are scanned
 I only have 5 region servers and 12 table regions. So I guess all the
 servers are called.
 
 
 So here is the code for the gets. I removed the KeyOnlyFilter because
 it's not improving the results.
 
 JM
 
 
 
 
 http://pastebin.com/K75nFiQk (for syntax highlighting)
 
 HTable table = new HTable(config, "test3");
 
 for (int iteration = 0; iteration < 10; iteration++)
 {
   final int linesToRead = 1000;
   System.out.println(new java.util.Date() + " Processing iteration "
       + iteration + "... ");
   Vector<Get> gets = new Vector<Get>(linesToRead);
 
   for (long l = 0; l < linesToRead; l++)
   {
     byte[] array1 = new byte[24];
     for (int i = 0; i < array1.length; i++)
       array1[i] = (byte)Math.floor(Math.random() * 256);
     Get g = new Get(array1);
     gets.addElement(g);
 
     processed++;
   }
   Object[] results = new Object[gets.size()];
 
   long timeBefore = System.currentTimeMillis();
   table.batch(gets, results);
   long timeAfter = System.currentTimeMillis();
 
   float duration = timeAfter - timeBefore;
   System.out.println("Time to read " + gets.size() + " lines : " +
       duration + " mseconds (" + Math.round(((float)linesToRead / (duration
       / 1000))) + " lines/seconds)");
 
   for (int i = 0; i < results.length; i++)
   {
     if (results[i] instanceof KeyValue)
       if (!((KeyValue)results[i]).isEmptyColumn())
         System.out.println("Result[" + i + "]: " + results[i]); // co BatchExample-9-Dump

Re: Scan vs Put vs Get

2012-06-28 Thread Jean-Marc Spaggiari
Oh! I never looked at this part ;) Ok. I have it.

Here are the numbers for one server before the read:

blockCacheSizeMB=186.28
blockCacheFreeMB=55.4
blockCacheCount=2923
blockCacheHitCount=195999
blockCacheMissCount=89297
blockCacheEvictedCount=69858
blockCacheHitRatio=68%
blockCacheHitCachingRatio=72%

And here are the numbers after 100 iterations of 1000 gets for the same server:

blockCacheSizeMB=194.44
blockCacheFreeMB=47.25
blockCacheCount=3052
blockCacheHitCount=232034
blockCacheMissCount=103250
blockCacheEvictedCount=83682
blockCacheHitRatio=69%
blockCacheHitCachingRatio=72%

Don't forget that there are between 40B and 50B lines in the table,
so I don't think the servers can store all of them in memory. And
since I'm accessing by random key, the odds of having the right row
in memory are small, I think.
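
(As a rough cross-check of that intuition from the numbers above: ~186 MB of
block cache divided by the 64 KB default block size gives about 2 900 blocks,
which matches blockCacheCount=2923. A few thousand cached blocks against a
table this size means a randomly chosen key is almost never in cache, so most
gets pay a disk read.)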

JM


RE: Scan vs Put vs Get

2012-06-28 Thread Anoop Sam John
blockCacheHitRatio=69%
Seems the blocks you are getting are from the cache.
You can also check with blooms once.

You can enable the usage of blooms using the config param
io.storefile.bloom.enabled set to true. This will enable blooms
globally.
Now you need to set the bloom type for your CF via
HColumnDescriptor#setBloomFilterType(). You can check with type BloomType.ROW.

-Anoop-



Re: Scan vs Put vs Get

2012-06-28 Thread Jean-Marc Spaggiari
Hi Anoop,

Are Bloom filters for columns? If I add g.setFilter(new
KeyOnlyFilter()), does that mean I can't use bloom filters?
Basically, what I'm doing here is something like
existKey(byte[]):boolean, where I try to see if a key exists in the
database without taking into consideration whether there is any column
content or not. This should be very fast. Even faster than the scan,
which needs to keep track of where I'm reading for the next row.

JM
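
A sketch of that existence-check idea against the 0.94 client API:
HTable#exists(Get) asks the server whether the row has any cell without
shipping the data back. Whether it beats batched gets here is untested in
this thread.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;

public class ExistKey {
  // existKey(byte[]):boolean as described above: true if the row exists.
  static boolean existKey(HTable table, byte[] key) throws IOException {
    return table.exists(new Get(key));
  }
}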


Re: Scan vs Put vs Get

2012-06-28 Thread Jean-Marc Spaggiari
Hi N Keywal,

This result:
Time to read 10 000 lines : 122.0 mseconds (81967 lines/seconds)

is obtained with this code:
HTable table = new HTable(config, "test3");
final int linesToRead = 10000;
System.out.println(new java.util.Date() + " Processing iteration " +
    iteration + "... ");
RandomRowFilter rrf = new RandomRowFilter();
KeyOnlyFilter kof = new KeyOnlyFilter();

Scan scan = new Scan();
scan.setFilter(rrf);
scan.setFilter(kof);
scan.setBatch(Math.min(linesToRead, 1000));
scan.setCaching(Math.min(linesToRead, 1000));
ResultScanner scanner = table.getScanner(scan);
processed = 0;
long timeBefore = System.currentTimeMillis();
for (Result result : scanner.next(linesToRead))
{
    if (result != null)
        processed++;
}
scanner.close();
long timeAfter = System.currentTimeMillis();

float duration = timeAfter - timeBefore;
System.out.println("Time to read " + linesToRead + " lines : " +
    duration + " mseconds (" + Math.round(((float)linesToRead / (duration
    / 1000))) + " lines/seconds)");
table.close();

This is with the scan.

scan: 80 000 lines/seconds
put: 20 000 lines/seconds
get: 300 lines/seconds


Re: Scan vs Put vs Get

2012-06-28 Thread N Keywal
Thank you. It's clearer now. From the code you sent, RandomRowFilter
is not used. You're only using the KeyOnlyFilter (the second setFilter
replaces the first one; you need to use something like FilterList to
combine filters). (Note as well that you need to initialize
RandomRowFilter#chance, otherwise all the rows will be filtered out.)
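
Putting both fixes together, a sketch of the corrected scan setup (0.94 API;
the 0.5f chance is illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.RandomRowFilter;

public class CombinedFilters {
  public static Scan randomKeyOnlyScan() {
    // chance must be set, otherwise every row is filtered out.
    RandomRowFilter rrf = new RandomRowFilter(0.5f);
    KeyOnlyFilter kof = new KeyOnlyFilter();
    // A FilterList applies both filters; a second setFilter() call
    // would simply replace the first one.
    List<Filter> filters = new ArrayList<Filter>();
    filters.add(rrf);
    filters.add(kof);
    Scan scan = new Scan();
    scan.setFilter(new FilterList(filters));
    return scan;
  }
}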

So, in one case -list of gets-, you're reading a well-defined set of
rows (defined randomly, but well defined :-), and this set spreads over
all the regions.
In the second one (KeyOnlyFilter), you're reading the first 1K rows
you could get from the cluster.

This explains the difference between the results. Activating
RandomRowFilter should not change the results much, as selecting a
random subset of rows while scanning is different from fetching a set
of rows whose keys were chosen randomly (don't know if I'm clear
here...).

Unfortunately you're likely more interested in the performance when
there is a real selection. Your code for the list of gets was correct,
imho. I'm interested in the results if you activate bloom filters.

Cheers,

N.


Re: Scan vs Put vs Get

2012-06-28 Thread Jean-Marc Spaggiari
Oh! I see! KeyOnlyFilter is overwriting the RandomRowFilter! Bad. I
mean, bad that I did not figure that out. Thanks for pointing it out.
That definitely explains the difference in performance.

I have activated the bloom filters with this code:
HBaseAdmin admin = new HBaseAdmin(config);
HTable table = new HTable(config, "test3");
System.out.println(table.getTableDescriptor().getColumnFamilies()[0]);
HColumnDescriptor cd = table.getTableDescriptor().getColumnFamilies()[0];
cd.setBloomFilterType(BloomType.ROW);
admin.disableTable("test3");
admin.modifyColumn("test3", cd);
admin.enableTable("test3");
System.out.println(table.getTableDescriptor().getColumnFamilies()[0]);

And here is the result for the first attempt (using gets):
{NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS =>
'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK =>
'true', BLOCKCACHE => 'true'}
{NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS =>
'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK =>
'true', BLOCKCACHE => 'true'}
Thu Jun 28 11:08:59 EDT 2012 Processing iteration 0...
Time to read 1000 lines : 40177.0 mseconds (25 lines/seconds)

2nd: Time to read 1000 lines : 7621.0 mseconds (131 lines/seconds)
3rd: Time to read 1000 lines : 7659.0 mseconds (131 lines/seconds)
After a few more iterations (about 30), I'm between 200 and 250
lines/seconds, like before.

Regarding the FilterList, I tried it, but now I'm getting this error
from the servers:
org.apache.hadoop.hbase.regionserver.LeaseException:
org.apache.hadoop.hbase.regionserver.LeaseException: lease
'-6376193724680783311' does not exist
Here is the code:
final int linesToRead = 10000;
System.out.println(new java.util.Date() + " Processing iteration " +
    iteration + "... ");
RandomRowFilter rrf = new RandomRowFilter();
KeyOnlyFilter kof = new KeyOnlyFilter();
Scan scan = new Scan();
List<Filter> filters = new ArrayList<Filter>();
filters.add(rrf);
filters.add(kof);
FilterList filterList = new FilterList(filters);
scan.setFilter(filterList);
scan.setBatch(Math.min(linesToRead, 1000));
scan.setCaching(Math.min(linesToRead, 1000));
ResultScanner scanner = table.getScanner(scan);
processed = 0;
long timeBefore = System.currentTimeMillis();
for (Result result : scanner.next(linesToRead))
{
    System.out.println("Result: " + result); //
    if (result != null)
        processed++;
}
scanner.close();

It's failing when I try to do for (Result result :
scanner.next(linesToRead)). I tried with linesToRead=1000, 100, 10 and
1 with the same result :(

I will try to find the root cause, but if you have any hint, it's welcome.

JM


Re: Scan vs Put vs Get

2012-06-28 Thread N Keywal
For the filter list my guess is that you're filtering out all rows
because RandomRowFilter#chance is not initialized (it should be
something like RandomRowFilter rrf = new RandomRowFilter(0.5f);).
But note that this test will never be comparable to the test with a
list of gets. You can make it as slow/fast as you want by playing with
the 'chance' parameter.

The results with gets and bloom filters are also in the interesting
category; hopefully an expert will get in the loop...





Re: Scan vs Put vs Get

2012-06-28 Thread Jean-Marc Spaggiari
Oh, sorry. You're right. You already said that and I forgot to update
it. It's working fine when I add this parameter. And as you are
saying, I can get the response time I want by playing with the
chance...

I get 34758 lines/second with 0.99 as the chance, and only 7564
lines/second with 0.09... But that's still better than the gets.

I just retried the gets, to see if the performance changes after many
table accesses, but the results are still almost the same.

I also tried to read 100 000 rows in a row with a random start key,
and the performance is close to the random filter (35273
lines/seconds). So it's really the gets that are giving me a
headache...
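
For reference, a sketch of that last test (a sequential scan from a random
start key, 0.94 API; the key generation is copied from the earlier snippets):

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class RandomStartScan {
  static int readFromRandomStart(HTable table, int linesToRead) throws Exception {
    byte[] startKey = new byte[24];
    for (int i = 0; i < startKey.length; i++)
      startKey[i] = (byte) Math.floor(Math.random() * 256);
    Scan scan = new Scan(startKey); // sequential read from a random position
    scan.setCaching(Math.min(linesToRead, 1000));
    ResultScanner scanner = table.getScanner(scan);
    int processed = 0;
    for (Result result : scanner.next(linesToRead))
      if (result != null)
        processed++;
    scanner.close();
    return processed;
  }
}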


