I'd be careful about adjusting the HFile block size. We settled on 64k
after benchmarking a bunch of things, and it seemed to be a good
performance point.

As for scanning small rows, I'd go with a caching size of 1000-3000.
When I set my scanners to that, I can pull 50k+ rows/sec from 1
client.
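
For example, a minimal sketch of a scan with caching set to 1000, using
the table name from this thread (assumes the 0.90-era Java client API;
error handling kept minimal):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CachedScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "URLs_sanity");
    Scan scan = new Scan();
    scan.setCaching(1000); // rows fetched per next() RPC; default is 1
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process each row here
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}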

On Thu, Nov 11, 2010 at 7:36 AM, Friso van Vollenhoven
<fvanvollenho...@xebia.com> wrote:
>> Great, thank you for the explanation.
>>
>> My table schema is:
>>
>>         {NAME => 'URLs_sanity', FAMILIES => [
>>           {NAME => 'gs', VERSIONS => '1', COMPRESSION => 'NONE',
>>            TTL => '2147483647', BLOCKSIZE => '65536',
>>            IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>>           {NAME => 'meta-data', VERSIONS => '1', COMPRESSION => 'NONE',
>>            TTL => '2147483647', BLOCKSIZE => '65536',
>>            IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>>           {NAME => 'snt', VERSIONS => '1', COMPRESSION => 'NONE',
>>            TTL => '2147483647', BLOCKSIZE => '65536',
>>            IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>>
>> A couple of questions:
>>     1) How do I know the optimal BLOCKSIZE? What is the best practice
>> regarding this setting?
>
> Check the link I sent. There is an explanation on this setting in there.
>
>>     2) Assuming my records are ~4 KB and I changed caching to 50 -->
>> 4 KB * 50 = 200 KB, which is ~3 blocks, performance should have
>> improved, but execution time was the same.
>
> There is of course more involved than just this. Also, you may already
> be getting the most out of what your hardware can give you. You should
> try to find out where your bottleneck is (IO, CPU, or network). Hadoop
> and HBase have many settings; there is no single magic knob that makes
> things fast or slow.
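>
> As a rough sketch of how to check that (standard Linux tools, run on a
> region server while the scan is going): top or vmstat 5 for CPU,
> iostat -x 5 for disk utilization, and sar -n DEV 5 for network
> throughput. If none of those is saturated, the bottleneck is more
> likely in client or scanner settings.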
>
>>
>> Oleg.
>>
>>
>> On Thu, Nov 11, 2010 at 3:08 PM, Friso van Vollenhoven
>> <fvanvollenho...@xebia.com> wrote:
>>
>>> Not that block size (that's the HDFS one), but the HBase block size. You
>>> set it at table creation or it uses the default of 64K.
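>>>
>>> For example, from the HBase shell the block size is set per column
>>> family at creation time (table and family names below are just
>>> placeholders):
>>>
>>>   create 'mytable', {NAME => 'cf', BLOCKSIZE => '131072'}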
>>>
>>> The description of hbase.client.scanner.caching says:
>>> Number of rows that will be fetched when calling next
>>> on a scanner if it is not served from memory. Higher caching values
>>> will enable faster scanners but will eat up more memory and some
>>> calls of next may take longer and longer times when the cache is empty.
>>>
>>> That means it will pre-fetch that number of rows if the next row does
>>> not come from memory. So if your rows are small enough that 100 of
>>> them fit in one block, it doesn't matter whether you pre-fetch 1, 50,
>>> or 99, because it will only go to disk when it exhausts the whole
>>> block, which stays in the block cache. It will still fetch the same
>>> amount of data from disk every time. If you increase the number to a
>>> value that is certain to load multiple blocks at a time from disk, it
>>> will improve performance.
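>>>
>>> As a worked example under those assumptions: with ~4 KB rows and the
>>> default 64 KB block size, one block holds roughly 16 rows, so a
>>> caching value of 50 spans only about 3 blocks (50 * 4 KB = 200 KB)
>>> per fetch. A value of 1000 would span ~4 MB (around 63 blocks), which
>>> amortizes the per-RPC overhead much better.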
>>>
>>>
>>>
>>> On 11 nov 2010, at 12:55, Oleg Ruchovets wrote:
>>>
>>>> Yes, I thought about a larger number; as you said, it depends on the
>>>> block size. Good point.
>>>>
>>>> I have one record of ~4 KB,
>>>> and the block size is:
>>>>
>>>> <property>
>>>> <name>dfs.block.size</name>
>>>> <value>268435456</value>
>>>> <description>HDFS blocksize of 256MB for large file-systems.
>>>> </description>
>>>> </property>
>>>>
>>>> What number should I choose? I am afraid that using a number of rows
>>>> equal to a whole block will cause a SocketTimeoutException. Am I
>>>> right?
>>>>
>>>> Thanks, Oleg.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven
>>>> <fvanvollenho...@xebia.com> wrote:
>>>>
>>>>> How small is small? If it is bytes, then setting the value to 50 is
>>>>> not so much different from 1, I suppose. If 50 rows fit in one block,
>>>>> it will just fetch one block whether the setting is 1 or 50. You
>>>>> might want to try a larger value. It should be fine if the records
>>>>> are small and you need them all on the client side anyway.
>>>>>
>>>>> It also depends on the block size, of course. When you only ever do
>>>>> full scans on a table and little random access, you might want to
>>>>> increase that.
>>>>>
>>>>> Friso
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 11 nov 2010, at 12:15, Oleg Ruchovets wrote:
>>>>>
>>>>>> Hi,
>>>>>> To improve client performance I changed
>>>>>> hbase.client.scanner.caching from 1 to 50.
>>>>>> After running the client with the new value
>>>>>> (hbase.client.scanner.caching = 50), execution time did not
>>>>>> improve at all.
>>>>>>
>>>>>> I have ~9 million small records.
>>>>>> I have to do a full scan, so it brings all 9 million records to
>>>>>> the client.
>>>>>> My assumption was that this change would bring a significant
>>>>>> improvement, but it did not.
>>>>>>
>>>>>> Additional information:
>>>>>> I scan a table which has 100 regions
>>>>>> 5 servers
>>>>>> 20 maps
>>>>>> 4 concurrent maps
>>>>>> The scan takes 5.5-6 hours. That seems like too much time to me.
>>>>>> Am I right? And how can I improve it?
>>>>>>
>>>>>>
>>>>>> I changed the value in all hbase-site.xml files and restarted HBase.
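>>>>>>
>>>>>> For reference, the property I set looks like this (note that a
>>>>>> per-scan setCaching() in client code would override it):
>>>>>>
>>>>>> <property>
>>>>>> <name>hbase.client.scanner.caching</name>
>>>>>> <value>50</value>
>>>>>> </property>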
>>>>>>
>>>>>> Any suggestions?
>>>>>
>>>>>
>>>
>>>
>
>
