Re: scan performance improvement

Friso van Vollenhoven Thu, 11 Nov 2010 05:28:44 -0800

256M = default MAX_FILESIZE (the size at which a region is split)
64K = default HBase block size
64M = HDFS default block size
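
To put those three sizes side by side, here is a rough back-of-the-envelope sketch. The row size is an assumption taken from the "~ 4k" record mentioned further down the thread, and it ignores key/value overhead and LZO compression, so treat the result as an order of magnitude only.

```python
# Sizes quoted in this thread, in bytes.
HBASE_BLOCK_SIZE = 64 * 1024         # BLOCKSIZE => '65536'
HBASE_MAX_FILESIZE = 256 * 1024**2   # MAX_FILESIZE => '268435456'
HDFS_BLOCK_SIZE = 64 * 1024**2       # HDFS default block size

# Assumption: one row is about 4 KB, with no per-cell overhead or compression.
ROW_SIZE = 4 * 1024

rows_per_hbase_block = HBASE_BLOCK_SIZE // ROW_SIZE
hbase_blocks_per_hdfs_block = HDFS_BLOCK_SIZE // HBASE_BLOCK_SIZE

print(rows_per_hbase_block)         # 16
print(hbase_blocks_per_hdfs_block)  # 1024
```

So under these assumptions a single HBase block holds only about 16 rows, while one HDFS block holds about a thousand HBase blocks.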


If you look at a table definition in the HBase master UI, you can see the 
settings for your table, like this:
{NAME => 'inrdb_rir_stats', MAX_FILESIZE => '268435456', FAMILIES => [{NAME => 
'data', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'LZO', 
VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 
'false', BLOCKCACHE => 'true'}, {NAME => 'meta', BLOOMFILTER => 'NONE', 
REPLICATION_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '1', TTL => 
'2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 
'true'}]}

Also, have a look here to see how HBase stores data: 
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
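
As a rough illustration of the caching discussion quoted below, here is a toy model of a full scan. It is only a sketch under stated assumptions: the function name scan_cost is made up for this example, 16 rows per block assumes ~4k rows in a 64K block, and it ignores compression, block cache hits and region boundaries.

```python
import math

def scan_cost(total_rows, rows_per_block, caching):
    """Return (scanner round trips, blocks read) for a full scan in this toy model."""
    round_trips = math.ceil(total_rows / caching)       # one next() call per `caching` rows
    blocks_read = math.ceil(total_rows / rows_per_block)  # every block is read exactly once
    return round_trips, blocks_read

TOTAL_ROWS = 9_000_000  # "~ 9 million" records from the original question
ROWS_PER_BLOCK = 16     # assumption: 64K block / ~4k row

for caching in (1, 50, 1000):
    round_trips, blocks_read = scan_cost(TOTAL_ROWS, ROWS_PER_BLOCK, caching)
    print(caching, round_trips, blocks_read)
```

In this model the number of blocks read stays the same whatever the caching value is; only the number of round trips between client and region server changes.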




On 11 nov 2010, at 14:11, Michael Segel wrote:

> 
> Correct me if I'm wrong, but isn't hbase's default block size 256MB while 
> hadoop's default blocksize is 64MB?
> 
> 
>> From: fvanvollenho...@xebia.com
>> To: user@hbase.apache.org
>> Subject: Re: scan performance improvement
>> Date: Thu, 11 Nov 2010 13:08:56 +0000
>> 
>> I don't mean that block size (that's the HDFS one), but the HBase block 
>> size. You set it at table creation; otherwise it uses the default of 64K.
>> 
>> The description of hbase.client.scanner.caching says:
>> Number of rows that will be fetched when calling next
>> on a scanner if it is not served from memory. Higher caching values
>> will enable faster scanners but will eat up more memory and some
>> calls of next may take longer and longer times when the cache is empty.
>> 
>> That means it will pre-fetch that number of rows whenever the next row does 
>> not come from memory. So if your rows are small enough to fit 100 of them in 
>> one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it 
>> will only go to disk when it has exhausted the whole block, which stays in 
>> the block cache. So it will still fetch the same amount of data from disk 
>> every time. If you increase the number to a value that is certain to load 
>> multiple blocks at a time from disk, it will increase performance.
>> 
>> 
>> 
>> On 11 nov 2010, at 12:55, Oleg Ruchovets wrote:
>> 
>>> Yes, I thought about a larger number, and you said it depends on the block
>>> size. Good point.
>>> 
>>> One record is ~ 4k,
>>> and the block size is:
>>> 
>>> <property>
>>> <name>dfs.block.size</name>
>>> <value>268435456</value>
>>> <description>HDFS blocksize of 256MB for large file-systems.
>>> </description>
>>> </property>
>>> 
>>> Which number should I choose?
>>> I am afraid that using a number equal to one block will lead to a
>>> SocketTimeoutException. Am I right?
>>> 
>>> Thanks Oleg.
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven <
>>> fvanvollenho...@xebia.com> wrote:
>>> 
>>>> How small is small? If it is bytes, then setting the value to 50 is not so
>>>> much different from 1, I suppose. If 50 rows fit in one block, it will just
>>>> fetch one block whether the setting is 1 or 50. You might want to try a
>>>> larger value. It should be fine if the records are small and you need them
>>>> all on the client side anyway.
>>>> 
>>>> It also depends on the block size, of course. When you only ever do full
>>>> scans on a table and little random access, you might want to increase that.
>>>> 
>>>> Friso
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 11 nov 2010, at 12:15, Oleg Ruchovets wrote:
>>>> 
>>>>> Hi,
>>>>> To improve client performance I changed
>>>>> hbase.client.scanner.caching from 1 to 50.
>>>>> After running the client with the new value
>>>>> (hbase.client.scanner.caching = 50) it didn't improve execution time
>>>>> at all.
>>>>> 
>>>>> I have ~ 9 million small records.
>>>>> I have to do a full scan, so it brings all 9 million records to the
>>>>> client. My assumption was that this change would bring a significant
>>>>> improvement, but it does not.
>>>>> 
>>>>> Additional information:
>>>>> I scan a table which has 100 regions
>>>>> 5 servers
>>>>> 20 maps
>>>>> 4 concurrent maps
>>>>> The scan takes 5.5 - 6 hours. To me that seems like too much time.
>>>>> Am I right? And how can I improve it?
>>>>> 
>>>>> 
>>>>> I changed the value in all hbase-site.xml files and restarted HBase.
>>>>> 
>>>>> Any suggestions?
>>>> 
>>>> 
>> 
>
