The sensor-service idea seems to open the door to making sensors pub/sub-style where possible. For example, with Hive, you can keep an in-memory registry of the partitions to sense for, and tail the audit log to see when they are populated, instead of polling.
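To make the registry idea concrete, here is a rough sketch, not a real Hive or Airflow integration. The event shape (`action`, `partition`), the `partition_populated` value, and the `PartitionRegistry` class are all hypothetical names I made up for illustration; whatever actually tails the audit log would need to produce events in some such form.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Hypothetical event shape: one entry per populated partition.
# The field names are assumptions, not the real Hive audit log format.
AuditEvent = Dict[str, str]


class PartitionRegistry:
    """In-memory map from partition key to the callbacks waiting on it."""

    def __init__(self) -> None:
        self._waiters: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, partition: str, callback: Callable[[str], None]) -> None:
        """Register interest in a partition instead of polling for it."""
        self._waiters[partition].append(callback)

    def pending(self) -> List[str]:
        """Partitions that still have at least one waiter."""
        return sorted(self._waiters)

    def consume(self, events: Iterable[AuditEvent]) -> int:
        """Notify and drop waiters for each populated partition; return how many fired."""
        fired = 0
        for event in events:
            if event.get("action") != "partition_populated":
                continue
            # pop() with a default does not create a new key in the defaultdict.
            for callback in self._waiters.pop(event["partition"], []):
                callback(event["partition"])
                fired += 1
        return fired
```

A sensor would call `subscribe` once, and the service would feed each new chunk of the tailed log to `consume`. Sharding would presumably be a matter of splitting the registry by partition key.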
On Wed, Mar 6, 2019 at 1:51 PM Alex Guziel <alex.guz...@airbnb.com> wrote:

> Smart sensor seems like a good idea, but I wonder how much performance
> will be improved in practice. And of course, one must think about sharding
> and such.
>
> I'm not sure how helpful rescheduling sensors is, since it will add
> scheduler and DB load seemingly, which is already a bottleneck.
>
> On Wed, Mar 6, 2019 at 12:43 PM Yingbo Wang <ybw...@gmail.com> wrote:
>
>> I would still like to get some feedback on the batch sensor/smart sensor
>> idea after viewing the sensor rescheduling PR, since the reschedule mode
>> does not reduce the number of worker processes for sensors. The batch
>> sensor idea is proposed for this purpose and should work well with
>> reschedule mode.
>>
>> On Wed, Mar 6, 2019 at 11:30 AM Yingbo Wang <ybw...@gmail.com> wrote:
>>
>> > Wow, great work from Seelmann! Thanks Fokko for letting us know about
>> > it. We are super happy to have this feature.
>> >
>> > On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko <fo...@driesprong.frl>
>> > wrote:
>> >
>> >> Thanks for bringing this up. I've added a comment on the Wiki:
>> >>
>> >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization
>> >>
>> >> Have you looked into the work by Seelmann? Recently he introduced the
>> >> ability to reschedule sensors. When rescheduling, the slot will be given
>> >> back to the scheduler after a poke operation. Therefore the slot won't
>> >> be occupied all the time. The details are in the PR
>> >> https://github.com/apache/airflow/pull/3596
>> >>
>> >> I would propose to make this the default behavior in Airflow 2.0.
>> >>
>> >> Cheers, Fokko
>> >>
>> >> On Wed, Mar 6, 2019 at 15:32, Yingbo Wang <ybw...@gmail.com> wrote:
>> >>
>> >> > hi,
>> >> >
>> >> > I would like to open an AIP for Airflow sensor optimization.
>> >> >
>> >> > *Motivation*:
>> >> >
>> >> > Low efficiency in Airflow Sensor Implementation
>> >> >
>> >> > Sensors are a special kind of operator that will keep running until a
>> >> > certain criterion is met. Examples include a specific file landing in
>> >> > HDFS or S3, a partition appearing in Hive, or a specific time of the
>> >> > day. Sensors are derived from BaseSensorOperator and run a poke method
>> >> > at a specified poke_interval until it returns True.
>> >> >
>> >> > Sensor tasks are inefficient because, in the current design, we spawn a
>> >> > separate worker process for each partition sensor. This worker might
>> >> > last a long time, until the target partition is available. When many
>> >> > sensor tasks need to run within certain time limits, we have to
>> >> > allocate a lot of resources to have enough workers for the sensor
>> >> > tasks.
>> >> >
>> >> > *Idea:*
>> >> >
>> >> > We propose two approaches that could address this issue, batch-sensor
>> >> > and smart-sensor.
>> >> >
>> >> > Batch-sensor
>> >> >
>> >> > The basic idea of batch-sensor is to batch process sensor tasks to save
>> >> > resources. While running, a batch-sensor takes N partition sensor
>> >> > requests as input and pokes those N partitions periodically. If the
>> >> > batch-sensor finds that the criterion of some sensor task is met, it
>> >> > updates the database for that sensor task.
>> >> >
>> >> > To do this, we need to create a sensor base class called ‘batchable’
>> >> > and make all sensors inherit from this base class. We also need to
>> >> > change the behavior of the scheduler regarding batchable sensor tasks.
>> >> > The scheduler will find as many batchable sensor tasks as possible and
>> >> > run those tasks in a batch.
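Inline, to check my reading of the batch-sensor proposal: a minimal sketch of one worker poking N sensors in a loop, assuming each sensor can be reduced to a zero-argument poke callable. `poke_batch`, `run_batch`, and `mark_success` are hypothetical names, not Airflow APIs; `mark_success` stands in for the database update the proposal mentions.

```python
import time
from typing import Callable, Dict, List, Optional


def poke_batch(pending: Dict[str, Callable[[], bool]]) -> List[str]:
    """Poke every pending sensor once; remove and return the ids whose criterion is met."""
    satisfied = [task_id for task_id, poke in pending.items() if poke()]
    for task_id in satisfied:
        del pending[task_id]
    return satisfied


def run_batch(
    pending: Dict[str, Callable[[], bool]],
    mark_success: Callable[[str], None],
    poke_interval: float = 60.0,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poke the whole batch every poke_interval seconds in a single process.

    Stops when the batch is empty or max_rounds is reached; returns rounds run.
    """
    rounds = 0
    while pending and (max_rounds is None or rounds < max_rounds):
        for task_id in poke_batch(pending):
            mark_success(task_id)  # stand-in for the database update
        rounds += 1
        if pending:
            sleep(poke_interval)
    return rounds
```

The point of the sketch is only that the number of worker processes no longer scales with the number of sensors. How the scheduler picks which tasks go into `pending` is the part the proposal still has to define.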
>> >> >
>> >> > Smart-sensor
>> >> >
>> >> > Smart-sensor is an improvement on top of batch-sensor.
>> >> >
>> >> > The idea of smart-sensor is that its worker process runs like a
>> >> > service. To do this, we need to persist sensor details in the Airflow
>> >> > DB. The worker process periodically queries the task-instance table to
>> >> > find sensor tasks, pokes the metastore, and updates the task-instance
>> >> > table if it detects that a certain partition or file has been created.
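For the smart-sensor description above, one pass of such a service could look roughly like the following. This is a sketch under assumptions: it uses an in-process SQLite table named `sensor_task` with made-up columns and states rather than the real Airflow task-instance schema, and `exists` is a hypothetical stand-in for the metastore poke.

```python
import sqlite3
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical table that persists sensor details."""
    conn.execute(
        "CREATE TABLE sensor_task ("
        "task_id TEXT PRIMARY KEY, target TEXT NOT NULL, state TEXT NOT NULL)"
    )


def register_sensor(conn: sqlite3.Connection, task_id: str, target: str) -> None:
    """Persist one sensor so the service can pick it up on its next pass."""
    conn.execute(
        "INSERT INTO sensor_task (task_id, target, state) VALUES (?, ?, 'sensing')",
        (task_id, target),
    )


def service_pass(conn: sqlite3.Connection, exists: Callable[[str], bool]) -> int:
    """Query waiting sensors, poke each target, and mark the satisfied ones.

    Returns the number of sensors marked as successful in this pass.
    """
    rows = conn.execute(
        "SELECT task_id, target FROM sensor_task WHERE state = 'sensing'"
    ).fetchall()
    met = [(task_id,) for task_id, target in rows if exists(target)]
    conn.executemany("UPDATE sensor_task SET state = 'success' WHERE task_id = ?", met)
    conn.commit()
    return len(met)
```

Running `service_pass` on a timer gives the service-like behavior described. Note that each pass adds one read and one batched write against the DB, which is relevant to the DB-load concern raised earlier in the thread.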