[Discuss] Airflow sensor optimization

Yingbo Wang Wed, 06 Mar 2019 06:32:38 -0800

Hi,

I would like to open an AIP for Airflow sensor optimization.



*Motivation*:

Low efficiency in Airflow Sensor Implementation

Sensors are a special kind of operator that will keep running until a
certain criterion is met. Examples include a specific file landing in HDFS
or S3, a partition appearing in Hive, or a specific time of the day.
Sensors are derived from BaseSensorOperator and run a poke method at a
specified poke_interval until it returns True.
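
As a rough illustration of the poke loop described above, here is a minimal
sketch in plain Python. It does not use the real BaseSensorOperator; the
class names BaseSensorSketch and CountdownSensor, the timeout parameter, and
the injectable sleep/clock arguments are assumptions made for the example.

```python
import time


class BaseSensorSketch:
    """Simplified stand-in for a sensor base class (not the Airflow class)."""

    def __init__(self, poke_interval=60.0, timeout=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.poke_interval = poke_interval
        self.timeout = timeout
        # sleep and clock are injectable so the loop can be tested quickly.
        self.sleep = sleep
        self.clock = clock

    def poke(self):
        """Return True once the awaited criterion is met."""
        raise NotImplementedError

    def execute(self):
        # The worker process stays busy in this loop until poke() succeeds,
        # which is the source of the inefficiency discussed in this thread.
        started = self.clock()
        while not self.poke():
            if self.clock() - started > self.timeout:
                raise TimeoutError("sensor timed out")
            self.sleep(self.poke_interval)
        return True


class CountdownSensor(BaseSensorSketch):
    """Hypothetical sensor whose criterion is met on the Nth poke."""

    def __init__(self, pokes_needed, **kwargs):
        super().__init__(**kwargs)
        self.remaining = pokes_needed
        self.poke_count = 0

    def poke(self):
        self.poke_count += 1
        self.remaining -= 1
        return self.remaining <= 0
```

Each running sensor of this kind holds one worker for the whole wait, no
matter how cheap the individual poke is.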

Sensor tasks are inefficient because, in the current design, we spawn a
separate worker process for each partition sensor. This worker may run for
a long time, until the target partition is available. When many sensor
tasks need to run within certain time limits, we have to allocate a lot of
resources to have enough workers for them.

*Idea:*

We propose two approaches that could address this issue: batch-sensor
and smart-sensor.



Batch-sensor

The basic idea of batch-sensor is to process sensor tasks in batches to
save resources. While running, a batch-sensor takes N partition sensor
requests as input and pokes those N partitions periodically. When the
batch-sensor finds that the criterion of a sensor task is met, it updates
the database record for that sensor task.


To do this, we need to create a sensor base class called ‘batchable’ and
make all sensors inherit from this base class. We also need to change the
behavior of the scheduler regarding batchable sensor tasks. The scheduler
will find as many batchable sensor tasks as possible and run those tasks in
a batch.
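
One possible shape for the batch-sensor idea, sketched in plain Python. All
names here (Batchable, PartitionSensor, run_batch, mark_success, max_rounds)
are hypothetical and only illustrate the proposal; the set of available
partitions stands in for a real metastore lookup.

```python
import time


class Batchable:
    """Hypothetical base class: anything with a cheap poke() can be batched."""

    def poke(self):
        raise NotImplementedError


class PartitionSensor(Batchable):
    """Hypothetical partition sensor; `available` stands in for a metastore."""

    def __init__(self, task_id, partition, available):
        self.task_id = task_id
        self.partition = partition
        self.available = available

    def poke(self):
        return self.partition in self.available


def run_batch(sensors, mark_success, max_rounds,
              poke_interval=0.0, sleep=time.sleep):
    """Poke N sensors from one process; return those still unmet."""
    pending = list(sensors)
    for _ in range(max_rounds):
        still_pending = []
        for sensor in pending:
            if sensor.poke():
                # In the proposal this would be a database update.
                mark_success(sensor.task_id)
            else:
                still_pending.append(sensor)
        pending = still_pending
        if not pending:
            break
        sleep(poke_interval)
    return pending
```

With this layout one worker process serves the whole batch, instead of one
worker per sensor task.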


Smart-sensor

Smart-sensor is an improvement on top of batch-sensor.

The idea of smart-sensor is that its worker process runs like a service.
To do this, we need to persist sensor details in the Airflow DB. The worker
process periodically queries the task-instance table to find sensor tasks,
pokes the metastore, and updates the task-instance table when it detects
that a certain partition or file has been created.
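
A minimal sketch of one smart-sensor cycle, using an in-memory SQLite
database as a stand-in for the Airflow DB. The table layout, the 'sensing'
state name, and the functions setup_db and poke_once are assumptions made
for illustration, not the actual Airflow schema or API.

```python
import sqlite3


def setup_db():
    """Create a toy task_instance table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance ("
        "task_id TEXT PRIMARY KEY, state TEXT, partition_name TEXT)"
    )
    return conn


def poke_once(conn, partition_exists):
    """One service cycle: find waiting sensor tasks, poke, update the table.

    `partition_exists` is a callable standing in for a metastore lookup.
    Returns the task ids that were marked successful in this cycle.
    """
    rows = conn.execute(
        "SELECT task_id, partition_name FROM task_instance "
        "WHERE state = 'sensing' ORDER BY task_id"
    ).fetchall()
    finished = []
    for task_id, partition_name in rows:
        if partition_exists(partition_name):
            conn.execute(
                "UPDATE task_instance SET state = 'success' "
                "WHERE task_id = ?",
                (task_id,),
            )
            finished.append(task_id)
    conn.commit()
    return finished
```

A long-running service would call poke_once in a loop with a sleep between
cycles, so that new sensor tasks are picked up from the table without
starting a new worker for each one.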
