Hi James. Sounds good! I'll give this a try. I have a follow-up question related to this DAG ...
The DAG currently downloads data for 14 countries. On several occasions, I've gotten requests to pull data for additional countries. Before Airflow, I handled these manually by running a script that pulls data for a specific country going back to the start date. Is there a more elegant way to do this in Airflow than kicking off an ad hoc DAG for the additional countries?

Also, I am thinking about using Redshift Spectrum against the resulting files, so having each country in a separate file would be desirable for partitioning. Is it a good pattern to create parallel tasks for each country (see the sketch below the quoted thread)? This would significantly increase the number of tasks in the DAG. I am currently generating a single set of tasks that pulls data for all the required countries.

Thanks!

On Thu, Jun 7, 2018, 4:42 AM James Meickle <jmeic...@quantopian.com> wrote:

> One way to do this would be to have your DAG file return two nearly
> identical DAGs (e.g. put it in a factory function and call it twice). The
> difference would be that the "final" run would have a conditional to add an
> extra time sensor at the DAG root, to wait N days for the data to finalize.
> The effect would be that two DAG runs would start each day, but the latter
> would stay uncompleted for a while.
>
> Doing it that way has the advantage of still aligning with daily updates
> rather than having a daily DAG and a weekly DAG; keeping execution dates
> aligned with the data being processed; and not having the primary DAG runs
> stay open for days (better for status tracking).
>
> It has the disadvantage that any extremely long-running tasks (including
> sensors) will be more likely to fail, and that you would have several more
> tasks and DAG runs open and taking resources/concurrency.
>
> On Jun 6, 2018 3:34 PM, "Pedro Machado" <pe...@205datalab.com> wrote:
>
> This is a similar case. My idea was to rerun the whole data pull. The
> current DAG is idempotent, so there is no issue with inserting duplicates.
> Now I'm trying to figure out the best way to code it in Airflow. Thanks
>
> On Wed, Jun 6, 2018 at 2:29 PM Ben Gregory <b...@astronomer.io> wrote:
>
> > I've seen a similar use case with DoubleClick/Google Analytics
> > (https://support.google.com/ds/answer/2791195?hl=en), where the reporting
> > metrics have a "lookback window" of up to 30 days to mark conversion
> > attribution (so if a user converts on the 14th day after clicking on an
> > ad, it will still be counted).
> >
> > What we ended up doing in that case is setting the start/stop params of
> > the query to the full window (pulled daily) and then upserting in Redshift
> > based on a primary key (in our case actually a composite key with multiple
> > attributes). So you end up pulling a lot of redundant data, but since
> > there's no way to pull only updated records (which sounds like the case
> > you're in), it's the best way to ensure your reporting is up to date.
> >
> > On Wed, Jun 6, 2018 at 1:10 PM Pedro Machado <pe...@205datalab.com> wrote:
> >
> > > Yes. It's literally the same API calls with the same dates, only done a
> > > few days later. It's just redoing the same data pull, but instead of
> > > pulling one date each DAG run, it would pull all dates for the previous
> > > week on Tuesdays.
> > >
> > > Thanks!
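A minimal sketch of the parallel per-country pattern asked about above, assuming Airflow 1.x and a hypothetical pull_country_data() helper; the DAG id, country codes, and output layout are illustrative, not from the thread:

    # Sketch only: one task per country, generated from a list.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    COUNTRIES = ["us", "ca", "mx", "br"]  # extend this list to add a country

    def pull_country_data(country, ds, **kwargs):
        # Hypothetical pull for one country and one execution date. Writing one
        # file per country/date (e.g. .../country=us/ds=2018-06-06/) keeps the
        # output partition-friendly for Redshift Spectrum.
        pass

    dag = DAG(
        dag_id="country_pull",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    for country in COUNTRIES:
        PythonOperator(
            task_id="pull_" + country,
            python_callable=pull_country_data,
            op_kwargs={"country": country},
            provide_context=True,
            dag=dag,
        )

With one task per country, adding a new country would mean extending the list and backfilling only that task, e.g. something like `airflow backfill country_pull -t pull_xx -s 2018-01-01 -e 2018-06-06` (flag names vary by Airflow version, so check `airflow backfill --help`), rather than kicking off a separate ad hoc DAG.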
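Along the same lines, a minimal sketch of the two-DAG factory idea James describes, assuming Airflow 1.x imports; the function name, DAG ids, and the 7-day delay are illustrative assumptions:

    # Sketch only: one factory, called twice, to get a preliminary and a
    # "final" DAG that waits for the data to settle before pulling.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.sensors import TimeDeltaSensor

    def build_dag(dag_id, wait_days=0):
        dag = DAG(
            dag_id=dag_id,
            start_date=datetime(2018, 1, 1),
            schedule_interval="@daily",
        )
        # Stand-in for the real data-pull task(s).
        pull = DummyOperator(task_id="pull_data", dag=dag)
        if wait_days:
            # The "final" variant adds a time sensor at the DAG root, so its
            # runs stay open until N days past the schedule before pulling.
            wait = TimeDeltaSensor(
                task_id="wait_for_final_data",
                delta=timedelta(days=wait_days),
                dag=dag,
            )
            wait >> pull
        return dag

    # Two nearly identical DAGs from the same factory.
    preliminary_dag = build_dag("country_pull_preliminary")
    final_dag = build_dag("country_pull_final", wait_days=7)

Both DAG objects need to be module-level globals in the DAG file so the scheduler picks them up; each day the preliminary run completes normally while the final run sits on the sensor until the lookback window has passed.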