Hi all,

I'm hoping to get some input on a redesign of the way we run our data pipeline with Oozie. We have a use case where we frequently receive delayed data after we've already processed a particular time window--that is, we can run a workflow on a given hour of data, receive new input for that hour, and then need to reprocess that hour. To give a more concrete example, say we have a coordinator application with the following datasets and input events:

   <datasets>
      <dataset name="input1" frequency="60" initial-instance="2015-05-14T19:00Z" timezone="UTC">
         <uri-template>${hdfs}/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      </dataset>
   </datasets>
   <input-events>
      <data-in name="coordInput1" dataset="input1">
         <start-instance>${coord:current(-1)}</start-instance>
         <end-instance>${coord:current(0)}</end-instance>
      </data-in>
   </input-events>

If the coordinator sees a done-flag for the 2015-05-14T19:00Z instance of revenue_feed, it will kick off its job. However, if new data later arrives in revenue_feed for that hour, it won't kick off another job to handle it (as far as I know). As a result, the datasets downstream of this coordinator will remain out of date.
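(For reference, since the dataset doesn't declare a <done-flag>, my understanding is that Oozie falls back to the default _SUCCESS marker, so for the 19:00 action the two input dependencies would resolve to something like:

   ${hdfs}/tmp/revenue_feed/2015/05/14/18/_SUCCESS   <!-- coord:current(-1) -->
   ${hdfs}/tmp/revenue_feed/2015/05/14/19/_SUCCESS   <!-- coord:current(0) -->

and once both files exist, the action fires exactly once.)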

Does Oozie provide any means of handling this kind of scenario? As far as I can tell, once a coordinator action for a given hour has run, that hour stays processed, and the coordinator won't rerun it even if new input data arrives--is that understanding correct?
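Or is the expectation that we detect late data ourselves and manually rerun the affected actions? The closest thing I've found is the coordinator rerun command with -refresh to re-evaluate the input dependencies, something like (placeholders are ours):

   oozie job -rerun <coord-job-id> -action <action-number> -refresh

but that still requires something external to notice the late data and issue the rerun.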

Thank you very much for your help!

Andrew
