[ 
https://issues.apache.org/jira/browse/FALCON-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886193#comment-13886193
 ] 

Srikanth Sundarrajan commented on FALCON-267:
---------------------------------------------

Falcon already has a crude form of this feature. The feed definition has a 
"late-cut-off" tag, which sets the time limit within which changes to the feed 
are monitored. When a process has a late input (meaning the feed changed), the 
process is re-executed. I had proposed creating recipes on top of the Falcon 
system to achieve common data management objectives, and this seems like a nice 
fit. I will write down my thoughts and share them on the dev-list.

> Add CDC feature
> ---------------
>
>                 Key: FALCON-267
>                 URL: https://issues.apache.org/jira/browse/FALCON-267
>             Project: Falcon
>          Issue Type: New Feature
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Jean-Baptiste Onofré
>
> I propose to add a Change Data Capture feature in Falcon.
> The idea is to catch changes, initially on HDFS files, and publish the 
> identified diff to a messaging topic.
> Here is what I would like to PoC:
> - in a feed definition, we add a <capture/> element defining the change check 
> interval.
> - we create a coordinator in Oozie which executes the following workflow at 
> each capture interval:
> - in the Falcon staging "capture" location on HDFS, we keep the first state 
> of the feed. We compare (diff) the current content with the staging location, 
> and write the diff to the Falcon staging. If the file is binary, we can 
> still detect a change (using MD5 for instance), and the diff is the complete 
> file (like in svn, git, etc).
> - if we have a diff, we publish a message to the Falcon "capture" topic, 
> containing a set of JMS properties; the message body contains the link to 
> the diff (on HDFS, in the Falcon staging). The "stream" copy is overwritten by 
> the new one.
> The purpose of this CDC is:
> 1/ thanks to the publication on the topic, to be able to use "external" tools 
> to "react" when a change occurs. For instance, I plan to make a demo with an 
> Apache Camel route (sending e-mails, for example) when data changes.
> 2/ staying in falcon/oozie/hadoop, to be able to set up a pipeline triggered 
> by data change: for instance, trigger a job when the data changes.
> The first PoC is HDFS/fs centric, but I think we could also diff HBase or 
> Hive data.
> Thoughts?
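
The capture workflow described in the proposal above could be sketched roughly 
as follows. This is only an illustration under several assumptions, not Falcon 
or Oozie code: the local filesystem stands in for HDFS, a plain callback named 
publish stands in for the JMS "capture" topic, and the names capture, md5_of, 
is_text, and the ".snapshot"/".diff" file suffixes are all hypothetical.

```python
import difflib
import hashlib
import shutil
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def is_text(data: bytes) -> bool:
    """Heuristic: text means no NUL byte and valid UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def capture(feed_file: Path, staging_dir: Path, publish) -> bool:
    """Run one capture interval; return True if a change was published.

    `publish(properties, diff_link)` is a hypothetical stand-in for sending
    a message to the "capture" topic.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    snapshot = staging_dir / (feed_file.name + ".snapshot")
    diff_path = staging_dir / (feed_file.name + ".diff")

    if not snapshot.exists():
        # First run: keep the first state of the feed; nothing to compare yet.
        shutil.copyfile(feed_file, snapshot)
        return False

    if md5_of(feed_file) == md5_of(snapshot):
        return False  # digests match, so no change was detected

    old, new = snapshot.read_bytes(), feed_file.read_bytes()
    if is_text(old) and is_text(new):
        # Text content: store a unified diff against the previous state.
        diff = "".join(difflib.unified_diff(
            old.decode("utf-8").splitlines(keepends=True),
            new.decode("utf-8").splitlines(keepends=True),
            fromfile="snapshot", tofile=feed_file.name))
        diff_path.write_text(diff, encoding="utf-8")
    else:
        # Binary content: the "diff" is the complete new file.
        shutil.copyfile(feed_file, diff_path)

    # The message carries properties plus a link to the diff, not the diff.
    publish({"feed": feed_file.name, "md5": md5_of(feed_file)}, str(diff_path))
    # The stored state is overwritten by the new one.
    shutil.copyfile(feed_file, snapshot)
    return True
```

In this sketch a scheduler (the Oozie coordinator in the proposal) would call 
capture once per interval. Comparing digests first keeps the unchanged case 
cheap, and the diff is only computed when the digests differ.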



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
