1. The external system I am thinking of doesn't have the ability to run code.
2. If we use the first task as a sensor, how does the dag run get created in the first place so that the sensor task can run? (A sketch of the pattern follows.)
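For reference, here is roughly what the sensor-as-first-task pattern from
Ash's mail below looks like (a minimal sketch with Airflow 1.10-era imports;
the dag id, schedule and file path are illustrative assumptions). Note that
the dag run itself is still created by schedule_interval; the sensor only
gates the downstream tasks within each run:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.sensors.file_sensor import FileSensor
    from airflow.operators.bash_operator import BashOperator

    with DAG(
        dag_id="weekly_dataset",            # illustrative name
        start_date=datetime(2020, 1, 1),
        # Runs are still created on this schedule; the sensor below
        # only delays the downstream tasks within each run.
        schedule_interval="@weekly",
    ) as dag:
        wait_for_file = FileSensor(
            task_id="wait_for_dataset",
            filepath="/data/incoming/weekly.csv",  # hypothetical path
            mode="reschedule",       # frees the worker slot between pokes
            poke_interval=600,       # check every 10 minutes
            timeout=72 * 60 * 60,    # give up after a 72-hour window
        )
        process = BashOperator(
            task_id="process_dataset",
            bash_command="echo processing",  # placeholder for real work
        )
        wait_for_file >> process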
Running user code on the scheduler does seem problematic; I will think about
some other way to do that.

Thanks,
Bharath

On Fri, Feb 14, 2020 at 3:57 PM Ash Berlin-Taylor <[email protected]> wrote:
> If you have the ability to run code from the external system, you might
> want to consider using the ("experimental") API to trigger the dag run
> from the external system?
>
> http://airflow.apache.org/docs/stable/api.html#post--api-experimental-dags--DAG_ID--dag_runs
>
> When using the API doesn't work for you, the common approach I have seen
> is, as you hint at, having a "trigger" dag that runs (frequently,
> depending on your needs), checks the external condition and uses
> TriggerDagRunOperator.
>
> The other way I have seen this done is to just have the first task of
> your dag be a sensor that checks/waits on the external resource. With the
> recently added "reschedule" mode of sensors, this also doesn't tie up a
> worker slot when the sensor isn't running. This is the approach I have
> used in the past when processing weekly datasets that would appear
> anywhere in a 72-hour window after the expected delivery time.
>
> Given these options exist, I'm not quite sure I see the need for a new
> parameter on the DAG (especially one which runs user code in the
> scheduler; that gets quite a strong no from me). Could you perhaps
> explain your idea in more detail, specifically how it fits into your
> workflow, and why you don't want to use the two methods I talked about
> here?
>
> Thanks,
> Ash
>
> On Feb 14 2020, at 10:10 am, bharath palaksha <[email protected]>
> wrote:
> > Hi,
> >
> > I have been using Airflow extensively in my current work at Walmart
> > Labs. While working on our requirements, I came across a piece of
> > functionality which is missing in Airflow and which, if implemented,
> > would be very useful.
> >
> > Currently, Airflow is a schedule-based workflow management system: a
> > cron expression defines the creation of dag runs. If there is a
> > dependency on a different dag, TriggerDagRunOperator helps in creating
> > dag runs. Suppose there is a dependency which lives outside of the
> > Airflow cluster, e.g. a different database, a filesystem, or an event
> > from an API which is an upstream dependency. There is no way in
> > Airflow to achieve this unless we schedule a DAG at a very short
> > interval and allow it to poll.
> >
> > To solve the above issue, what if Airflow took 2 different args:
> > schedule_interval and trigger_sensor?
> >
> > - schedule_interval - works the same way as it does now
> > - trigger_sensor - accepts a sensor which returns true when an event
> > is sensed, and this in turn creates a dag run
> >
> > If you specify both arguments, schedule_interval takes precedence.
> > The scheduler parses all DAGs in a loop on every heartbeat and creates
> > a DAG run for each DAG that has reached its scheduled time; the same
> > loop could also check for trigger_sensor and, if the argument is set,
> > create a dag run whenever the sensor returns true. This might slow the
> > scheduler down since it now has to execute sensors, but we can find
> > some other way to avoid that slowness.
> >
> > Can we create an AIP for this? Any thoughts?
> >
> > Thanks,
> > Bharath
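Similarly, a minimal sketch of the "trigger" dag pattern Ash describes
above: a frequently scheduled dag whose sensor checks the external condition
and then fires TriggerDagRunOperator (Airflow 1.10-era imports; the dag ids,
schedule and the choice of FileSensor are illustrative assumptions):

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.sensors.file_sensor import FileSensor
    from airflow.operators.dagrun_operator import TriggerDagRunOperator

    with DAG(
        dag_id="watch_external_event",     # illustrative name
        start_date=datetime(2020, 1, 1),
        schedule_interval="*/10 * * * *",  # poll every 10 minutes
        max_active_runs=1,                 # don't stack polling runs
    ) as dag:
        # Any sensor works here; FileSensor merely stands in for
        # "check the external condition".
        condition_met = FileSensor(
            task_id="check_external_condition",
            filepath="/data/incoming/ready.flag",  # hypothetical path
            mode="reschedule",
            poke_interval=60,
        )
        # Once the condition holds, create a run of the real dag.
        trigger = TriggerDagRunOperator(
            task_id="trigger_target_dag",
            trigger_dag_id="target_dag",   # illustrative target dag id
        )
        condition_met >> trigger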

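And for completeness, a hedged sketch of the experimental API option from
the link above, for external systems that can make an HTTP call even if they
can't run arbitrary code (the host, dag id and conf payload are illustrative
assumptions, and depending on configuration the endpoint may require
authentication):

    import requests

    # POST to the experimental endpoint documented at the link above to
    # create a dag run; host, dag id and payload are illustrative.
    resp = requests.post(
        "http://airflow-webserver:8080/api/experimental"
        "/dags/target_dag/dag_runs",
        json={"conf": {"triggered_by": "external_system"}},
    )
    resp.raise_for_status()
    print(resp.json())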