[ 
https://issues.apache.org/jira/browse/FLINK-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119736#comment-15119736
 ] 

PJ Van Aeken commented on FLINK-2055:
-------------------------------------

What is the latest news on this?

Having read through the mailing list thread and the corresponding code, it 
seems like the current solution is more of a workaround. I can understand the 
desire to reuse what is already out there, but reusing the HBase 
TableOutputFormat feels a bit like making a sacrifice. I haven't had time to 
thoroughly investigate my suspicions, though, and I am very interested to learn 
whether anyone else has. I am by no means an HBase expert, but based on what I 
think I know about HBase, this is the sacrifice I think we're making here:

The native HBase TableOutputFormat was built for use in batch jobs. It uses the 
BufferedMutator under the hood, which, as far as I understand, decides when to 
flush based on constraints determined by HBase itself, such as the cumulative 
size of the Puts. That means that while we may "write" to our 
TableOutputFormat every X milliseconds, HBase will still decide on its own when 
to actually flush the records. HBase, in order to avoid a large number of small 
files, also groups the Puts together, but in the meantime exposes them through 
a component called the memstore, making them available before the flush. I 
believe that by using the TableOutputFormat with the BufferedMutator, we are 
skipping the memstore, and therefore new Puts remain unavailable until the 
flush. We could of course configure HBase to flush to disk more frequently, but 
should we really do that if we have an alternative?
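
To make the concern concrete, here is a minimal sketch of a size-triggered 
write buffer. It is not the HBase BufferedMutator API: the class 
SizeBufferedWriter, its threshold parameter, and the in-memory "store" are all 
hypothetical names I made up for illustration. It only models the behaviour I 
am assuming above, namely that records written by the sink stay invisible to 
readers until the buffered bytes cross a threshold or someone flushes 
explicitly.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for a size-triggered write buffer.
 * NOT the HBase BufferedMutator; it only models "flush once the
 * buffered bytes reach a threshold".
 */
class SizeBufferedWriter {
    private final long flushThresholdBytes;
    private final List<byte[]> buffer = new ArrayList<>(); // pending, not yet visible
    private final List<byte[]> store = new ArrayList<>();  // what readers can see
    private long bufferedBytes = 0;

    SizeBufferedWriter(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    /** Buffers a record; flushes only once the size threshold is reached. */
    void write(byte[] record) {
        buffer.add(record);
        bufferedBytes += record.length;
        if (bufferedBytes >= flushThresholdBytes) {
            flush();
        }
    }

    /** Explicit flush, e.g. one a streaming sink could trigger on a timer. */
    void flush() {
        store.addAll(buffer);
        buffer.clear();
        bufferedBytes = 0;
    }

    int visibleCount() { return store.size(); }

    int pendingCount() { return buffer.size(); }
}
```

With a threshold of 10 bytes, two 4-byte writes leave both records pending and 
nothing visible; only the third write pushes the buffer over the threshold. A 
streaming sink built on such a buffer would have to call flush() itself to 
bound the latency, which is the trade-off I am questioning.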

Now, as mentioned, I'm not sure I have fully grasped the inner workings of 
HBase, so if I have made some false assumptions, I apologize. But based on what 
I think I know now, it seems like we're making an unnecessary sacrifice here.

> Implement Streaming HBaseSink
> -----------------------------
>
>                 Key: FLINK-2055
>                 URL: https://issues.apache.org/jira/browse/FLINK-2055
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming, Streaming Connectors
>    Affects Versions: 0.9
>            Reporter: Robert Metzger
>            Assignee: Hilmi Yildirim
>
> As per:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Write-Stream-to-HBase-td1300.html


