[jira] [Commented] (FLUME-1275) Add Regex Serializer for HBaseSink

Patrick Wendell (JIRA) Thu, 14 Jun 2012 10:42:44 -0700

    [ 
https://issues.apache.org/jira/browse/FLUME-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295183#comment-13295183
 ]


Patrick Wendell commented on FLUME-1275:
----------------------------------------

Thanks Hari,

I just borrowed the use of "UTF8" from the existing HBase class, but I will 
change it to Charsets.UTF_8. It was never meant to be end-user configurable; I 
just wanted to avoid having a magic string in the code.

Will add documentation.

The timestamp uniqueness problem seems like a big issue regardless of whether 
this particular serializer is being used. For instance, if you have multiple 
sinks writing to HBase at the same time, even if each uses its own independent 
nonce, you can never guarantee that the keys won't overlap (and an overlapping 
key means you would lose data). I see a few options:

1) Use HBase atomic increments to actually guarantee that no two puts will have 
the same key. This would need a first phase where you atomically increment a 
common cell, then use the value of that cell in the row-key. This seems like 
high overhead and would prevent certain optimizations like batching.

2) Use a key value that is very unlikely to overlap between concurrent HBase 
sinks... for instance:

[timestamp].[machine IP].[nonce]

The nonce would reset each time the process starts... but would just ensure 
that within a given writer, a duplicate timestamp does not cause data loss.

3) Use randomness and hope for the best:
[timestamp].[random value]

This is simple but risks losing data (though maybe we just choose a large enough 
random value that this is very unlikely).
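
To make options 2 and 3 concrete, here is a minimal sketch of both key 
layouts. It is not code from the patch: the class name RowKeySketch, its 
method names, and the choice of a 64-bit random suffix are all assumptions 
made for illustration. The collisionProbability helper uses the standard 
birthday-bound approximation to estimate how likely option 3 is to produce 
a duplicate among writes that share one timestamp.

```java
import java.security.SecureRandom;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Flume or the FLUME-1275 patch.
final class RowKeySketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    private final String machineIp;
    // Per-process counter; starts again at 0 whenever the process restarts.
    private final AtomicLong nonce = new AtomicLong();

    RowKeySketch(String machineIp) {
        this.machineIp = machineIp;
    }

    // Option 2: [timestamp].[machine IP].[nonce]
    // Two calls on the same instance never return the same key,
    // even when the timestamp is identical.
    String hostNonceKey(long timestampMillis) {
        return timestampMillis + "." + machineIp + "." + nonce.getAndIncrement();
    }

    // Option 3: [timestamp].[random value], here with 64 random bits in hex.
    static String randomKey(long timestampMillis) {
        return timestampMillis + "." + Long.toHexString(RANDOM.nextLong());
    }

    // Birthday-bound approximation: chance that at least two of `writes`
    // keys sharing one timestamp draw the same `randomBits`-bit suffix,
    // computed as 1 - exp(-n(n-1) / (2 * 2^bits)).
    static double collisionProbability(long writes, int randomBits) {
        double space = Math.pow(2.0, randomBits);
        double exponent = (double) writes * (writes - 1) / (2.0 * space);
        return -Math.expm1(-exponent);
    }
}
```

With this sketch, option 2 only protects against duplicates within one 
writer, as described above, while option 3 trades that guarantee for a 
probability that shrinks as the random suffix grows.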

Am I missing something - or is this a hard problem?
                
> Add Regex Serializer for HBaseSink
> ----------------------------------
>
>                 Key: FLUME-1275
>                 URL: https://issues.apache.org/jira/browse/FLUME-1275
>             Project: Flume
>          Issue Type: Improvement
>            Reporter: Patrick Wendell
>         Attachments: FLUME-1275.patch.v1.txt
>
>
> It would be nice to have an "out of the box" HBase serializer that can 
> extract column data from a regular expression. This is a feature in Hive and 
> it is widely used:
> https://issues.apache.org/jira/browse/HIVE-167

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
