[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525825#comment-14525825 ]

Edward Bortnikov commented on HBASE-13071:
------------------------------------------


1. No problem having a per-scan parameter. The assumption is that scans should 
be large in order for the feature to be efficient. 
2. No problem moving to a size-in-bytes parameter. The API should be 
identical for the synchronous and asynchronous clients (see the sketch below). 
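
To make (2) concrete, the per-scan knob could piggyback on the existing Scan 
object. A rough sketch against the 1.0+ client API (setCaching() and 
setMaxResultSize() already exist on Scan; setAsyncPrefetch() is only a 
placeholder for whatever per-scan switch we end up with):

{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class PrefetchScanSketch {
  static long scanWithPrefetch(Connection connection) throws IOException {
    Scan scan = new Scan();
    scan.setCaching(1000);                    // rows fetched per RPC (existing knob)
    scan.setMaxResultSize(2L * 1024 * 1024);  // per-RPC size bound in bytes (existing knob)
    // scan.setAsyncPrefetch(true);           // placeholder for the per-scan streaming switch

    long rows = 0;
    try (Table table = connection.getTable(TableName.valueOf("t"));
         ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {
        rows++;                               // application work would overlap with prefetch here
      }
    }
    return rows;
  }
}
{code}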

Let's agree on the semantics of the upper-bound parameter (whether in rows or 
bytes). Should it be conservative or optimistic? In the optimistic 
interpretation, the client would relay the API parameter to the server as-is. A 
new prefetch request is issued when 50% of the old buffer is consumed, so when 
the new buffer arrives the old one might not be released yet. This overlap 
should be short, but the bound's semantics are soft (best-effort). In the 
conservative interpretation, the client would adapt the API parameter and issue 
requests for less data, to prevent any overflow. For legacy scans there is no 
difference, because the prefetch and computation parts do not overlap. 
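
To put numbers on the two interpretations: with the 50% trigger, a new buffer 
can arrive while up to half of the old one is still held, so peak usage is 
roughly 1.5x the per-request size. The sketch below is just my reading of that 
arithmetic, not a proposal for the exact formula:

{code:java}
public class BoundSemantics {
  public static void main(String[] args) {
    // With the 50% trigger, a new buffer can arrive while up to half of the
    // old one is still unreleased, so peak usage ~= 1.5 x the per-request size.
    long bound = 6L * 1024 * 1024;               // caller-supplied upper bound, in bytes

    // Optimistic: relay the bound as-is; the bound is best-effort and the
    // transient peak may reach ~1.5 x bound.
    long optimisticRequest = bound;

    // Conservative: shrink each request so that 1.5 x request <= bound.
    long conservativeRequest = (2 * bound) / 3;

    System.out.println("optimistic per-request size:   " + optimisticRequest);
    System.out.println("conservative per-request size: " + conservativeRequest);
  }
}
{code}

In other words, the optimistic variant keeps the request size equal to the 
bound and lets the peak exceed it briefly, while the conservative variant 
shrinks each request to roughly two thirds of the bound so the peak never 
exceeds it.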

Which approach would be better? 

> Hbase Streaming Scan Feature
> ----------------------------
>
>                 Key: HBASE-13071
>                 URL: https://issues.apache.org/jira/browse/HBASE-13071
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Eshcar Hillel
>         Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
> HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
> HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
> HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
> HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
> HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
> HBASE-13071_trunk_rebase_1.0.patch, HBaseStreamingScanDesign.pdf, 
> HbaseStreamingScanEvaluation.pdf, 
> HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
> gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
> latency.delay.png, latency.png, network.png
>
>
> A scan operation iterates over all rows of a table or over a subrange of the 
> table. The synchronous way in which the data is served at the client side 
> limits the speed at which the application traverses the data: it increases the 
> overall processing time, and may cause great variance in the time the 
> application waits for the next piece of data.
> The scanner next() method at the client side invokes an RPC to the 
> regionserver and then stores the results in a cache. The application can 
> specify how many rows will be transmitted per RPC; by default this is set to 
> 100 rows. 
> The cache can be viewed as a producer-consumer queue, where the HBase 
> client pushes the data into the queue and the application consumes it. 
> Currently this queue is synchronous, i.e., blocking. More specifically, when 
> the application has consumed all the data from the cache (so the cache is 
> empty), the HBase client retrieves additional data from the server and 
> re-fills the cache with new data. During this time the application is blocked.
> Under the assumption that the application's processing time can be balanced 
> with the time it takes to retrieve the data, an asynchronous approach can 
> reduce the time the application spends waiting for data.
> We attach a design document.
> We also have a patch that is based on a private branch, and some evaluation 
> results of this code.
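
To spell out the producer-consumer picture in the description above, here is a 
toy model contrasting the current blocking refill with a background prefetch. 
This is an illustration only, not the actual client code; fetchBatchFromServer() 
stands in for the scan RPC:

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ClientCacheToyModel {
  private final BlockingQueue<String> cache = new ArrayBlockingQueue<>(100);

  // Current behaviour: when the cache is empty, next() refills it inline,
  // so the application blocks for the full duration of the RPC.
  String nextSynchronous() {
    if (cache.isEmpty()) {
      cache.addAll(fetchBatchFromServer());   // application waits here
    }
    return cache.poll();
  }

  // Proposed behaviour: a background thread keeps the cache topped up,
  // so next() usually finds data already waiting.
  void startPrefetching() {
    Thread prefetcher = new Thread(() -> {
      try {
        while (true) {
          for (String row : fetchBatchFromServer()) {
            cache.put(row);                   // blocks only when the cache is full
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    prefetcher.setDaemon(true);
    prefetcher.start();
  }

  String nextAsynchronous() throws InterruptedException {
    return cache.take();                      // waits only if prefetch fell behind
  }

  // Stand-in for the scan RPC to the region server.
  private List<String> fetchBatchFromServer() {
    return Arrays.asList("row1", "row2", "row3");
  }
}
{code}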



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
