Jean-Baptiste Onofré created FALCON-267:
-------------------------------------------

             Summary: Add CDC feature
                 Key: FALCON-267
                 URL: https://issues.apache.org/jira/browse/FALCON-267
             Project: Falcon
          Issue Type: New Feature
            Reporter: Jean-Baptiste Onofré
            Assignee: Jean-Baptiste Onofré


I propose adding a Change Data Capture (CDC) feature to Falcon.

The idea is to detect changes, initially on HDFS files, and publish the 
identified difference (diff) to a messaging topic.

Here is what I would like to PoC:
- in a feed definition, we add a <capture/> element defining the change check 
interval.
- we create a coordinator in Oozie which executes the following workflow at 
each capture interval.
- in the Falcon staging "capture" location on HDFS, we keep the first state of 
the feed. We compare (diff) the current content with the staging location, and 
write the diff in the Falcon staging. If the file is binary, we can detect a 
change (using MD5 for instance) and the diff is the complete file (like in svn, 
git, etc).
- if we have a diff, we publish a message in the Falcon "capture" topic 
(containing a set of JMS properties; the message body contains the link to 
the diff on HDFS, in the Falcon staging). The "stream" copy is overwritten by 
the new one.
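
The compare step described above could be sketched roughly as follows. This is only an illustration in Python, run against a local directory standing in for HDFS; it is not Falcon code. The names capture, md5_of and is_binary, the ".snapshot" and ".diff" suffixes, the NUL-byte binary check and the message property names are all assumptions made for the example.

```python
import difflib
import hashlib
import shutil
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_binary(path):
    """Crude heuristic (an assumption): a NUL byte in the first 8 KiB."""
    with open(path, "rb") as handle:
        return b"\0" in handle.read(8192)


def capture(feed_file, staging_dir):
    """Compare feed_file with its staged copy.

    Returns a message dict (properties + link to the diff) when a change
    is found, or None on the first run or when nothing changed.
    """
    feed_file = Path(feed_file)
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    snapshot = staging_dir / (feed_file.name + ".snapshot")
    diff_path = staging_dir / (feed_file.name + ".diff")

    if not snapshot.exists():
        # First run: keep the initial state of the feed, nothing to publish.
        shutil.copyfile(feed_file, snapshot)
        return None

    if md5_of(feed_file) == md5_of(snapshot):
        return None  # digests match, no change detected

    if is_binary(feed_file) or is_binary(snapshot):
        # Binary content: the "diff" is the complete new file.
        shutil.copyfile(feed_file, diff_path)
        kind = "binary"
    else:
        old = snapshot.read_text().splitlines(keepends=True)
        new = feed_file.read_text().splitlines(keepends=True)
        diff = difflib.unified_diff(old, new, fromfile="staged", tofile="current")
        diff_path.write_text("".join(diff))
        kind = "text"

    # Overwrite the staged copy with the new state.
    shutil.copyfile(feed_file, snapshot)
    return {
        "properties": {"feed": feed_file.name, "changeType": kind},
        "body": str(diff_path),  # link to the diff in the staging location
    }
```

Calling capture() once per capture interval returns None until the file changes; the returned dict is what would then be published to the "capture" topic.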

The purpose of this CDC is:
1/ thanks to the publication on the topic, to be able to use "external" tools 
to "react" when a change occurs. For instance, I plan to make a demo with an 
Apache Camel route (sending e-mails for example) when data changes.
2/ staying in falcon/oozie/hadoop, to be able to set up a pipeline triggered by 
data changes: for instance, trigger a job when the data changes.
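
To make the two use cases a bit more concrete, here is a rough Python sketch of consumers reacting to a message on the topic. The Topic class is an in-memory stand-in, not a JMS client, and the handler names, message property names and HDFS path are hypothetical.

```python
class Topic:
    """In-memory stand-in for a messaging topic (not a real JMS client)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        """Register a callable that receives every published message."""
        self._subscribers.append(handler)

    def publish(self, message):
        """Deliver the message to each subscriber and collect their results."""
        return [handler(message) for handler in self._subscribers]


def notify_by_mail(message):
    # Use case 1: an external tool reacts, e.g. by sending an e-mail.
    feed = message["properties"]["feed"]
    return "mail: feed {} changed, diff at {}".format(feed, message["body"])


def trigger_job(message):
    # Use case 2: a pipeline step is triggered by the data change.
    return "job started for feed {}".format(message["properties"]["feed"])


capture_topic = Topic()
capture_topic.subscribe(notify_by_mail)
capture_topic.subscribe(trigger_job)

# Hypothetical message: a few properties plus a link to the diff as the body.
results = capture_topic.publish(
    {"properties": {"feed": "clicks"}, "body": "/staging/capture/clicks.diff"}
)
```

Both consumers only need the properties and the link in the body, so they can stay independent of how the diff itself was computed.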

The first PoC is HDFS/fs centric, but I think we can also do diffs on HBase or Hive.

Thoughts?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
