Hi all,
I'm hoping to get some input on a redesign of the way we run our data
pipeline with Oozie. We have a use case where we frequently receive
delayed data after we've already processed a particular time window--that
is, we can run a workflow on a given hour of data, receive new input for
that hour, and then need to reprocess that hour. To give a more concrete
example, say we have a coordinator application with the following
datasets and input events:
<datasets>
  <dataset name="input1" frequency="60"
           initial-instance="2015-05-14T19:00Z" timezone="UTC">
    <uri-template>${hdfs}/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
  </dataset>
</datasets>
<input-events>
  <data-in name="coordInput1" dataset="input1">
    <start-instance>${coord:current(-1)}</start-instance>
    <end-instance>${coord:current(0)}</end-instance>
  </data-in>
</input-events>
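(For reference, we rely on Oozie's default done-flag, so an hourly
instance counts as available once a _SUCCESS file appears in that
directory; written out explicitly inside the <dataset> element, that
would be:

  <done-flag>_SUCCESS</done-flag>
)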
Once the coordinator sees the done-flag for the 2015-05-14T19:00Z
instance of revenue_feed, it kicks off its job. However, if new data
later arrives in revenue_feed for that hour, it won't kick off another
job to handle it (as far as I know). As a result, the datasets
downstream of this coordinator will remain out of date.
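The only workaround we've found so far is to notice the late data
ourselves and manually rerun the affected coordinator action with
-refresh, so that Oozie re-resolves the input dependencies before
rerunning, along the lines of (job id and action number are
placeholders):

  oozie job -rerun <coord-job-id> -refresh -action <action-number>

But that's a manual step, and it doesn't cascade to the coordinators
downstream of this one.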
Does Oozie provide any means of handling this kind of scenario? As far
as I can tell, once a given coordinator has processed an hour, that hour
stays processed, and the coordinator won't rerun it even if new input
data arrives--is that understanding correct?
Thank you very much for your help!
Andrew