Hi, I would like to open an AIP for Airflow sensor optimization.
*Motivation*: Low efficiency in the Airflow sensor implementation

Sensors are a special kind of operator that keeps running until a certain criterion is met. Examples include a specific file landing in HDFS or S3, a partition appearing in Hive, or a specific time of day. Sensors are derived from BaseSensorOperator and run a poke method every poke_interval until it returns True.

Sensor tasks are inefficient because, in the current design, we spawn a separate worker process for each partition sensor. This worker may run for a long time, until the target partition is available. When many sensor tasks need to run within certain time limits, we have to allocate a lot of resources to have enough workers for them.

*Idea:* We propose two approaches that could address this issue: batch-sensor and smart-sensor.

Batch-sensor

The basic idea of batch-sensor is to process sensor tasks in batches to save resources. While running, a batch-sensor takes N partition sensor requests as input and pokes those N partitions periodically. If the batch-sensor finds that the criterion of a sensor task is met, it updates the database record for that sensor task. To do this, we need to create a sensor base class called 'batchable' and make all sensors inherit from it. We also need to change the scheduler's behavior for batchable sensor tasks: the scheduler will find as many batchable sensor tasks as possible and run them in a batch.

Smart-sensor

Smart-sensor is an improvement on top of batch-sensor. The idea is that the smart-sensor worker process runs like a service. To do this, we need to persist sensor details in the Airflow DB. The worker process periodically queries the task-instance table to find sensor tasks, pokes the metastore, and updates the task-instance table when it detects that a partition or file has been created.
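To make the batch-sensor idea a bit more concrete, here is a rough sketch in plain Python. It does not use Airflow's real classes: SensorRequest, BatchSensor and mark_success are hypothetical names chosen for illustration, and mark_success only stands in for the database update described above. Treat it as one possible shape under those assumptions, not as the proposed implementation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class SensorRequest:
    """One pending sensor task (hypothetical shape, not an Airflow class)."""
    task_id: str
    poke: Callable[[], bool]  # returns True once the criterion is met


class BatchSensor:
    """Pokes N sensor requests from a single worker process."""

    def __init__(
        self,
        requests: Iterable[SensorRequest],
        mark_success: Callable[[str], None],  # stand-in for the DB update
        poke_interval: float = 60.0,
    ) -> None:
        self.pending: Dict[str, SensorRequest] = {r.task_id: r for r in requests}
        self.mark_success = mark_success
        self.poke_interval = poke_interval

    def poke_once(self) -> List[str]:
        """Poke every pending request once; return the task ids that finished."""
        done = [tid for tid, req in self.pending.items() if req.poke()]
        for tid in done:
            self.mark_success(tid)
            del self.pending[tid]
        return done

    def run(self, timeout: float) -> bool:
        """Poke until all requests are met or the timeout expires."""
        deadline = time.monotonic() + timeout
        while self.pending and time.monotonic() < deadline:
            self.poke_once()
            if self.pending:
                time.sleep(self.poke_interval)
        return not self.pending
```

In this sketch one process holds all N requests, so the cost of waiting is one worker rather than N. A smart-sensor along the lines described above would differ mainly in where the requests come from: instead of receiving them at construction time, the long-running worker would reload them from the task-instance table on each cycle.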