I would still like to get some feedback on the batch sensor/smart sensor
idea after viewing the sensor rescheduling PR. The reschedule mode does
not reduce the number of worker processes for sensors; the batch sensor
idea is proposed for that purpose and should work well with reschedule
mode.
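
To make the batch idea from the proposal quoted below a bit more concrete,
here is a rough sketch of one process poking N sensors in a loop. It is only
an illustration under my own assumptions: run_batch_sensor, pokes and
on_success are hypothetical names, not existing Airflow APIs, and on_success
stands in for "update the database about this sensor task".

```python
import time
from typing import Callable, Dict, Optional, Set


def run_batch_sensor(
    pokes: Dict[str, Callable[[], bool]],
    poke_interval: float = 60.0,
    timeout: float = 3600.0,
    on_success: Optional[Callable[[str], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Set[str]:
    """Poke every pending sensor from a single process.

    `pokes` maps a (hypothetical) task id to its poke callable. Returns the
    ids whose criterion was met before the timeout.
    """
    pending = dict(pokes)  # copy so the caller's mapping is not mutated
    done: Set[str] = set()
    deadline = clock() + timeout

    while pending:
        # Iterate over a snapshot of the keys because we delete while looping.
        for task_id in list(pending):
            if pending[task_id]():
                done.add(task_id)
                del pending[task_id]
                if on_success is not None:
                    # Placeholder for marking the sensor task as succeeded.
                    on_success(task_id)
        if not pending or clock() >= deadline:
            break
        sleep(poke_interval)

    return done
```

The point of the sketch is only that N sensor checks share one worker
process instead of N; how the scheduler would group tasks into a batch is
left open here.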

On Wed, Mar 6, 2019 at 11:30 AM Yingbo Wang <ybw...@gmail.com> wrote:

> Wow, great work from Seelmann! Thanks, Fokko, for letting us know about
> it. We are super happy to have this feature.
>
> On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
>> Thanks for bringing this up. I've added a comment on the Wiki:
>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization
>>
>> Have you looked into the work by Seelmann? Recently he introduced the
>> ability to reschedule sensors. When rescheduling, the slot will be given
>> back to the scheduler after a poke operation. Therefore the slot won't be
>> occupied all the time. The details are in the PR
>> https://github.com/apache/airflow/pull/3596
>>
>> I would propose to make this the default behavior in Airflow 2.0.
>>
>> Cheers, Fokko
>>
>> On Wed, Mar 6, 2019 at 15:32, Yingbo Wang <ybw...@gmail.com> wrote:
>>
>> > hi,
>> >
>> > I would like to open an AIP for Airflow sensor optimization.
>> >
>> >
>> > *Motivation*:
>> >
>> > Low efficiency in Airflow Sensor Implementation
>> >
>> > Sensors are a special kind of operator that will keep running until a
>> > certain criterion is met. Examples include a specific file landing in
>> > HDFS or S3, a partition appearing in Hive, or a specific time of the day.
>> > Sensors are derived from BaseSensorOperator and run a poke method at a
>> > specified poke_interval until it returns True.
>> >
>> > Sensor tasks are inefficient because, in the current design, we spawn
>> > a separate worker process for each partition sensor. This worker might
>> > last a long time, until the target partition is available. In the case
>> > where there are many sensor tasks that need to run within certain time
>> > limits, we have to allocate a lot of resources to have enough workers
>> > for the sensor tasks.
>> >
>> > *Idea:*
>> >
>> > We propose two approaches that could address this issue: batch-sensor
>> > and smart-sensor.
>> >
>> >
>> >
>> > Batch-sensor
>> >
>> > The basic idea of batch-sensor is to batch process sensor tasks to save
>> > resources. While running, a batch-sensor takes N partition sensor
>> > requests as input and pokes those N partitions periodically. If the
>> > batch-sensor finds that the criterion of some sensor task is met, it
>> > updates the database for that sensor task.
>> >
>> >
>> > To do this, we need to create a sensor base class called ‘batchable’ and
>> > make all sensors inherit from this base class. We also need to change the
>> > behavior of the scheduler regarding batchable sensor tasks. The scheduler
>> > will find as many batchable sensor tasks as possible and run those tasks
>> > in a batch.
>> >
>> >
>> > Smart-sensor
>> >
>> > Smart-sensor is an improvement on top of batch-sensor.
>> >
>> > The idea of smart-sensor is that the worker process of smart-sensor will
>> > run like a service. To do this, we need to persist sensor details in the
>> > Airflow DB. The worker process periodically queries the task-instance
>> > table to find sensor tasks, pokes the metastore, and updates the
>> > task-instance table if it detects that a certain partition or file has
>> > been created.
>> >
>>
>
