[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348304#comment-14348304
 ] 

Eshcar Hillel commented on HBASE-13071:
---------------------------------------

I will work on a new version following the comments above (will take a few 
days).

[~stack] I will get back to you with a full answer to your questions; first I 
want to do some additional perf tests on my side.
The tall humps may be rooted in the way you performed the tests. What is the 
size of the prefetch? 30?
If the tests simply call next in a loop without actually processing the data 
(processing is simulated with delays in my tests), then the user exhausts the 
cache very quickly even though the prefetch is done in the background. The 
behavior is then equivalent to a sync scan: the app needs to wait for the 
current prefetch to complete.
It doesn't need to wait for the prefetch thread to finish loading the cache 
at the client side, but this is minor compared to the round-trip time at 
the server side.
As I mentioned before, the assumption underlying this new feature is that the 
processing time at the client side can be balanced by the network and IO at the 
server side. If the processing is short, then the network+IO is still the 
bottleneck. Does that make sense?
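
To make the balance argument concrete, here is a rough cost model in Java. It 
is only a sketch under simplifying assumptions that are mine and not taken 
from the patch: every batch has the same fetch time (network+IO) and the same 
client processing time, and the prefetch cache is large enough that the 
background fetch never stalls on a full cache. The class and method names are 
hypothetical.

```java
// Hypothetical cost model: names and formulas are illustrative assumptions,
// not code from the HBASE-13071 patch.
final class ScanCostModel {
    private ScanCostModel() {}

    /** Sync scan: each batch is fetched and then processed, strictly in sequence. */
    static long syncMillis(int batches, long fetchMs, long processMs) {
        return batches * (fetchMs + processMs);
    }

    /**
     * Async scan: the first fetch cannot be overlapped with anything, each later
     * fetch overlaps with the processing of the previous batch, and the last
     * batch still has to be processed after its fetch completes.
     */
    static long asyncMillis(int batches, long fetchMs, long processMs) {
        if (batches == 0) {
            return 0;
        }
        return fetchMs + (batches - 1) * Math.max(fetchMs, processMs) + processMs;
    }
}
```

Under this model, 10 batches with 50 ms fetch and 50 ms processing take 1000 ms 
sync and 550 ms async. With zero processing time both take 500 ms, which is the 
case where calling next in a tight loop shows no gain.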

> Hbase Streaming Scan Feature
> ----------------------------
>
>                 Key: HBASE-13071
>                 URL: https://issues.apache.org/jira/browse/HBASE-13071
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 0.98.11
>            Reporter: Eshcar Hillel
>         Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
> HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
> HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
> HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
> gc.eshcar.png, hits.eshcar.png, network.png
>
>
> A scan operation iterates over all rows of a table or a subrange of the 
> table. The synchronous nature in which the data is served at the client side 
> limits the speed at which the application traverses the data: it increases 
> the overall processing time, and may cause large variance in the time the 
> application waits for the next piece of data.
> The scanner next() method at the client side invokes an RPC to the 
> regionserver and then stores the results in a cache. The application can 
> specify how many rows will be transmitted per RPC; by default this is set to 
> 100 rows. 
> The cache can be considered as a producer-consumer queue, where the hbase 
> client pushes the data to the queue and the application consumes it. 
> Currently this queue is synchronous, i.e., blocking. More specifically, when 
> the application has consumed all the data from the cache (so the cache is 
> empty), the hbase client retrieves additional data from the server and 
> refills the cache with new data. During this time the application is blocked.
> Under the assumption that the application processing time can be balanced by 
> the time it takes to retrieve the data, an asynchronous approach can reduce 
> the time the application is waiting for data.
> We attach a design document.
> We also have a patch that is based on a private branch, and some evaluation 
> results of this code.
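
The producer-consumer view in the description above can be sketched in plain 
Java as follows. This is a minimal illustration and not the code in the 
attached patches: the class name, the Supplier-based fetch callback, and the 
convention that an empty batch marks the end of the scan are all assumptions 
made for the example.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical sketch of an asynchronous scanner cache; not the HBase client API.
final class PrefetchingScanner<T> {
    private final Object endMarker = new Object(); // signals that the scan is exhausted
    private final BlockingQueue<Object> queue;     // bounded client-side cache

    /**
     * @param fetchBatch stands in for one RPC to the server; an empty list
     *                   means there are no more rows (assumed convention)
     * @param capacity   maximum number of rows held in the cache, at least 1
     */
    PrefetchingScanner(Supplier<List<T>> fetchBatch, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                List<T> batch;
                while (!(batch = fetchBatch.get()).isEmpty()) {
                    for (T row : batch) {
                        queue.put(row); // blocks while the cache is full
                    }
                }
                queue.put(endMarker);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    /** Returns the next row, or null once the scan is exhausted. */
    @SuppressWarnings("unchecked")
    T next() {
        try {
            Object item = queue.take(); // blocks only while the cache is empty
            if (item == endMarker) {
                queue.put(endMarker);   // keep returning null on later calls
                return null;
            }
            return (T) item;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for rows", e);
        }
    }
}
```

The application thread blocks in next() only when the cache is empty, so the 
fetch of the following batch overlaps with whatever processing the application 
does between calls.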



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
