Hi all,
I'm hoping to get some input on a redesign of the way we run our data
pipeline with Oozie. We have a use case where we frequently receive
delayed data after we've already processed a particular time window--that
is, we can run a workflow on a given hour of data, receive new input for
that hour, and then need to reprocess that hour. To give a more concrete
example, say we have a coordinator application with the following
datasets and input events:
<datasets>
  <dataset name="input1" frequency="60"
           initial-instance="2015-05-14T19:00Z" timezone="UTC">
    <uri-template>${hdfs}/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
  </dataset>
</datasets>
<input-events>
  <data-in name="coordInput1" dataset="input1">
    <start-instance>${coord:current(-1)}</start-instance>
    <end-instance>${coord:current(0)}</end-instance>
  </data-in>
</input-events>
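(For reference, we rely on Oozie's default done-flag, so an hourly
instance counts as available once a _SUCCESS file appears in that
directory; written out explicitly inside the <dataset> element, that
would be:

  <done-flag>_SUCCESS</done-flag>
)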
Once the coordinator sees the done-flag for the 2015-05-14T19:00Z
instance of revenue_feed, it kicks off its job. However, if new data
later arrives in revenue_feed for that hour, it won't kick off another
job to handle it (as far as I know). As a result, the datasets
downstream of this coordinator will remain out of date.
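The only workaround we've found so far is to notice the late data
ourselves and manually rerun the affected coordinator action with
-refresh, so that Oozie re-resolves the input dependencies before
rerunning, along the lines of (job id and action number are
placeholders):

  oozie job -rerun <coord-job-id> -refresh -action <action-number>

But that's a manual step, and it doesn't cascade to the coordinators
downstream of this one.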
Does Oozie provide any means of handling this kind of scenario? As far
as I can tell, once a given coordinator has processed an hour, that hour
stays processed, and the coordinator won't rerun it even if new input
data arrives--is that understanding correct?
Thank you very much for your help!
Andrew