[ 
https://issues.apache.org/jira/browse/FLUME-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13296006#comment-13296006
 ] 

Patrick Wendell edited comment on FLUME-1275 at 6/15/12 11:29 PM:
------------------------------------------------------------------

This patch addresses the issues discussed so far. I did not create a generic 
interface for row-key generators, though that is probably a good idea and may 
warrant a separate JIRA. I'd like to keep the scope of this JIRA limited to 
just the serializer in question. Below is the doc comment describing how 
row-key generation works in this patch:

{noformat}
  /**
   * Returns a row-key with the following format:
   * [time in millis]-[random key]-[nonce]
   */
  protected byte[] getRowKey(Calendar cal) {
    /* NOTE: This key generation strategy has the following properties:
     *
     * 1) Within a single JVM, the same row key will never be duplicated.
     * 2) Between any two JVMs operating at different time periods (according
     *    to their respective clocks), the same row key will never be
     *    duplicated.
     * 3) Between any two JVMs operating concurrently (according to their
     *    respective clocks), the odds of duplicating a row key are non-zero
     *    but infinitesimal. This would require a simultaneous collision in
     *    (a) the timestamp, (b) the respective nonce, and (c) the random
     *    string. The random string is necessary because (a) and (b) could
     *    collide if a fleet of Flume agents is restarted in tandem.
     *
     * Row-key uniqueness is important because conflicting row keys will
     * cause data loss. */
{noformat}
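
For readers without the patch at hand, here is a minimal sketch of a generator that produces keys in the format described above. It is not the code from the attached patch. The class name `RowKeySketch`, the 10-character random string, the alphabet, and the use of `AtomicLong` and `SecureRandom` are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Calendar;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: builds keys shaped like [time in millis]-[random key]-[nonce]. */
public class RowKeySketch {
    // Assumed alphabet; it contains no '-' so the three fields stay separable.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    // JVM-wide counter: guarantees uniqueness within one JVM (property 1).
    // It restarts at 0 whenever the JVM restarts.
    private static final AtomicLong NONCE = new AtomicLong(0);

    // Random string chosen once per generator; guards against two agents that
    // restart in tandem and therefore agree on both timestamp and nonce.
    private final String randomKey;

    public RowKeySketch() {
        SecureRandom rng = new SecureRandom();
        StringBuilder sb = new StringBuilder(10);
        for (int i = 0; i < 10; i++) {
            sb.append(ALPHABET.charAt(rng.nextInt(ALPHABET.length())));
        }
        this.randomKey = sb.toString();
    }

    /** Returns the row key as UTF-8 bytes. */
    public byte[] getRowKey(Calendar cal) {
        String key = cal.getTimeInMillis() + "-" + randomKey + "-" + NONCE.getAndIncrement();
        return key.getBytes(StandardCharsets.UTF_8);
    }
}
```

In this sketch, two calls with the same `Calendar` still yield different keys because the nonce advances on every call. Two separate JVMs would have to agree on the timestamp, the nonce, and the random string at the same time to collide.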
                
> Add Regex Serializer for HBaseSink
> ----------------------------------
>
>                 Key: FLUME-1275
>                 URL: https://issues.apache.org/jira/browse/FLUME-1275
>             Project: Flume
>          Issue Type: Improvement
>            Reporter: Patrick Wendell
>            Assignee: Patrick Wendell
>         Attachments: FLUME-1275.patch.v1.txt, FLUME-1275.patch.v2.txt
>
>
> It would be nice to have an "out of the box" HBase serializer that can 
> extract column data from a regular expression. This is a feature in Hive and 
> it is widely used:
> https://issues.apache.org/jira/browse/HIVE-167

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
