Wow, great work from Seelmann! Thanks, Fokko, for letting us know about it. We are super happy to have this feature.
On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko <[email protected]> wrote:
> Thanks for bringing this up. I've added a comment on the Wiki:
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization
>
> Have you looked into the work by Seelmann? Recently he introduced the
> ability to reschedule sensors. When rescheduling, the slot will be given
> back to the scheduler after a poke operation. Therefore the slot won't be
> occupied all the time. The details are in the PR
> https://github.com/apache/airflow/pull/3596
>
> I would propose to make this the default behavior in Airflow 2.0.
>
> Cheers, Fokko
>
> On Wed, Mar 6, 2019 at 15:32, Yingbo Wang <[email protected]> wrote:
>
> > Hi,
> >
> > I would like to open an AIP for Airflow sensor optimization.
> >
> > *Motivation*:
> >
> > Low efficiency in Airflow Sensor Implementation
> >
> > Sensors are a special kind of operator that will keep running until a
> > certain criterion is met. Examples include a specific file landing in
> > HDFS or S3, a partition appearing in Hive, or a specific time of the
> > day. Sensors are derived from BaseSensorOperator and run a poke method
> > at a specified poke_interval until it returns True.
> >
> > Sensor tasks are inefficient because, in the current design, we spawn a
> > separate worker process for each partition sensor. This worker might
> > last a long time, until the target partition is available. When many
> > sensor tasks need to run within certain time limits, we have to
> > allocate a lot of resources to have enough workers for the sensor
> > tasks.
> >
> > *Idea:*
> >
> > We propose two approaches that could address this issue: batch-sensor
> > and smart-sensor.
> >
> > Batch-sensor
> >
> > The basic idea of batch-sensor is to batch process sensor tasks to save
> > resources. While running, a batch-sensor will take N partition sensor
> > requests as input and poke those N partitions periodically. If the
> > batch-sensor finds that the criterion of a sensor task is met, it will
> > update the database for that sensor task.
> >
> > To do this, we need to create a sensor base class called 'batchable'
> > and make all sensors inherit from this base class. We also need to
> > change the behavior of the scheduler regarding batchable sensor tasks.
> > The scheduler will find as many batchable sensor tasks as possible and
> > run those tasks in a batch.
> >
> > Smart-sensor
> >
> > Smart-sensor is an improvement on top of batch-sensor.
> >
> > The idea of smart-sensor is that the worker process of smart-sensor
> > will run like a service. To do this, we need to persist sensor details
> > in the Airflow DB. The worker process periodically queries the
> > task-instance table to find sensor tasks, pokes the metastore, and
> > updates the task-instance table if it detects that a certain partition
> > or file has been created.
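
For readers following the thread, here is a rough sketch of the batch-sensor idea from the quoted proposal: one process pokes N sensor conditions per round instead of one worker per sensor. It is plain Python with no Airflow dependency and is not the proposed implementation. The names run_batch_sensor and ready_after are hypothetical, and adding a task id to a set stands in for "update the database".

```python
import time
from typing import Callable, Dict, Set


def run_batch_sensor(
    pokes: Dict[str, Callable[[], bool]],
    max_rounds: int,
    poke_interval: float = 0.0,
) -> Set[str]:
    """Poke every pending sensor once per round; return the ids whose criterion was met."""
    pending = dict(pokes)
    done: Set[str] = set()
    for _ in range(max_rounds):
        for task_id in list(pending):
            if pending[task_id]():
                # Stand-in for updating the task state in the database.
                done.add(task_id)
                del pending[task_id]
        if not pending:
            break
        time.sleep(poke_interval)
    return done


def ready_after(n: int) -> Callable[[], bool]:
    """Hypothetical poke function that returns True from its n-th call onward."""
    calls = {"count": 0}

    def poke() -> bool:
        calls["count"] += 1
        return calls["count"] >= n

    return poke


if __name__ == "__main__":
    sensors = {
        "partition_a": ready_after(1),
        "partition_b": ready_after(3),
        "partition_c": ready_after(10),
    }
    print(sorted(run_batch_sensor(sensors, max_rounds=3)))
```

In this sketch a single loop holds one worker for all N sensors. A sensor whose criterion is not met within max_rounds stays pending, which roughly corresponds to a sensor timing out.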
