[
https://issues.apache.org/jira/browse/FALCON-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886193#comment-13886193
]
Srikanth Sundarrajan commented on FALCON-267:
---------------------------------------------
This feature already exists in a crude form in Falcon today. The feed
definition has a tag called "late-cut-off", which sets the time limit within
which changes to the feed are monitored. When a process has a late input (which
means the feed changed), the process is re-executed. I had proposed the idea of
creating recipes on top of the Falcon system to achieve some common data
management objectives, and this seems a nice fit. I will pen down my thoughts
and share them on the dev-list.
> Add CDC feature
> ---------------
>
> Key: FALCON-267
> URL: https://issues.apache.org/jira/browse/FALCON-267
> Project: Falcon
> Issue Type: New Feature
> Reporter: Jean-Baptiste Onofré
> Assignee: Jean-Baptiste Onofré
>
> I propose to add a Change Data Capture feature in Falcon.
> The idea is to catch changes, at first on HDFS files, and publish the
> identified gap to a messaging topic.
> Here is what I would like to PoC:
> - in a feed definition, we add a <capture/> element defining the change check
> interval.
> - we create a coordinator in Oozie which executes the following workflow at
> each capture interval
> - in the Falcon staging "capture" location on HDFS, we keep the first state
> of the feed. We compare (diff) the current content with the staging location,
> and write the diff in the Falcon staging. If the file is binary, we can
> detect a change (using MD5 for instance) and the diff is the complete file
> (like in svn, git, etc).
> - if we have a diff, we publish a message in the Falcon "capture" topic
> (containing a set of JMS properties; the message body contains the link to
> the diff on HDFS, in the Falcon staging). The "stream" copy is overwritten by
> the new one.
> The purpose of this CDC is:
> 1/ thanks to the publication on the topic, to be able to use "external" tools
> to "react" when a change occurs. For instance, I plan to make a demo with an
> Apache Camel route (sending e-mails for example) when data changes.
> 2/ staying in Falcon/Oozie/Hadoop, to be able to set up a pipeline triggered
> by data change: for instance, trigger a job when the data changes.
> The first PoC is HDFS/fs centric but I think we can do diff on HBase or Hive.
> Thoughts?
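
The capture workflow proposed in the description above could be sketched
roughly as follows. This is only an illustration under stated assumptions, not
Falcon code: the local filesystem stands in for HDFS, the returned dict stands
in for the JMS message, and the names capture_change, md5_of, is_text, and the
".snapshot" / ".diff" file suffixes are all hypothetical.

```python
import difflib
import hashlib
import shutil
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def is_text(data: bytes) -> bool:
    """Heuristic (an assumption): text has no NUL byte and decodes as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def capture_change(current: Path, staging_dir: Path):
    """Run one capture interval for a single feed file.

    Returns None when nothing changed (or on the first run), otherwise a
    dict standing in for the message published to the "capture" topic.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    snapshot = staging_dir / (current.name + ".snapshot")  # hypothetical name
    diff_path = staging_dir / (current.name + ".diff")     # hypothetical name

    # First run: keep the first state of the feed, nothing to publish yet.
    if not snapshot.exists():
        shutil.copyfile(current, snapshot)
        return None

    # Cheap change detection via MD5, as suggested in the proposal.
    if md5_of(current) == md5_of(snapshot):
        return None

    old_bytes = snapshot.read_bytes()
    new_bytes = current.read_bytes()
    if is_text(old_bytes) and is_text(new_bytes):
        # Text: write a unified diff between the staged and current content.
        diff_lines = difflib.unified_diff(
            old_bytes.decode("utf-8").splitlines(keepends=True),
            new_bytes.decode("utf-8").splitlines(keepends=True),
            fromfile="staged",
            tofile="current",
        )
        diff_path.write_text("".join(diff_lines), encoding="utf-8")
        kind = "text"
    else:
        # Binary: the "diff" is the complete new file.
        diff_path.write_bytes(new_bytes)
        kind = "binary"

    # The staged copy is overwritten by the new state.
    shutil.copyfile(current, snapshot)

    # Stand-in for the JMS message: properties plus a body linking to the diff.
    return {
        "properties": {"feed": current.name, "kind": kind},
        "body": str(diff_path),
    }
```

In this sketch a coordinator would call capture_change once per capture
interval and publish the returned dict only when it is not None.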