Re: [Discuss] Airflow sensor optimization

2019-03-07 Thread Kevin Yang
Thanks, Yingbo, for starting this, and everyone for joining the discussion; great point about sharding. This would be really useful for large-scale clusters. I imagine at the first stage we can reuse the existing logic and make the smart sensor a special kind of operator (maybe even make scheduler treat

Re: [Discuss] Airflow sensor optimization

2019-03-07 Thread Elser Rosa Leiva
On 2019/03/06 14:31:57, Yingbo Wang wrote:
> hi,
>
> I would like to open an AIP for Airflow sensor optimization.
>
> *Motivation*:
> Low efficiency in Airflow Sensor Implementation
> Sensors are a special kind of operator that will keep running until a
> certain criterion is met. Examp

Re: [Discuss] Airflow sensor optimization

2019-03-07 Thread Yingbo Wang
There are two dimensions along which to evaluate how much resource sensors take in Airflow: the number of sensors and the duration of each sensor task. The batch/smart sensor idea is proposed for the first, and rescheduling for the second. For an Airflow cluster running a large number of sensor ta

Re: [Discuss] Airflow sensor optimization

2019-03-07 Thread Ash Berlin-Taylor
Rescheduling is of massive use for a DAG where we are waiting for a weekly S3 file delivery from a third party supplier with _massive_ variance in the delivery time. It'll appear at some point between Thursday AM and Sunday evening. Not having an executor slot tied up with the S3KeySensor is gre

Re: [Discuss] Airflow sensor optimization

2019-03-06 Thread Alex Guziel
Sensor-service thing seems to open the door to make sensors a pubsub-type deal where possible. For example, in Hive, you can keep an in-memory registry of what partitions to sense for, and tail the audit log to see when they are populated, instead of polling. On Wed, Mar 6, 2019 at 1:51 PM Alex Gu
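A minimal sketch of the pub/sub idea in the message above, assuming partition-creation events can be read as a stream of partition names. `PartitionRegistry`, `consume`, the sensor ids, and the partition strings are all hypothetical names for illustration; none of them is an Airflow or Hive API.

```python
from typing import Dict, Iterable, List, Set


class PartitionRegistry:
    """In-memory map from awaited partition to the sensors waiting on it (hypothetical)."""

    def __init__(self) -> None:
        self._waiters: Dict[str, List[str]] = {}
        self.satisfied: Set[str] = set()

    def register(self, partition: str, sensor_id: str) -> None:
        # Record interest once; nothing polls after this point.
        self._waiters.setdefault(partition, []).append(sensor_id)

    def on_event(self, partition: str) -> List[str]:
        # Called for each audit-log entry; releases every sensor waiting on it.
        released = self._waiters.pop(partition, [])
        self.satisfied.update(released)
        return released

    def pending(self) -> List[str]:
        # Partitions that still have at least one waiting sensor.
        return sorted(self._waiters)


def consume(registry: PartitionRegistry, events: Iterable[str]) -> List[str]:
    """Feed a stream of partition names (e.g. a tailed audit log) to the registry."""
    released: List[str] = []
    for partition in events:
        released.extend(registry.on_event(partition))
    return released


registry = PartitionRegistry()
registry.register("db.table/ds=2019-03-06", "sensor_a")
registry.register("db.table/ds=2019-03-06", "sensor_b")
registry.register("db.other/ds=2019-03-06", "sensor_c")

# Assumed stand-in for tailing the audit log.
audit_log = ["db.unrelated/ds=2019-03-06", "db.table/ds=2019-03-06"]
released = consume(registry, audit_log)
```

The work done here scales with the number of events rather than with the number of sensors times the number of polls, which is the point of replacing polling with a registry.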

Re: [Discuss] Airflow sensor optimization

2019-03-06 Thread Alex Guziel
Smart sensor seems like a good idea, but I wonder how much performance will be improved in practice. And of course, one must think about sharding and such. I'm not sure how helpful rescheduling sensors is, since it will seemingly add scheduler and DB load, which is already a bottleneck. On Wed, M

Re: [Discuss] Airflow sensor optimization

2019-03-06 Thread Yingbo Wang
I would still like to get some feedback on the batch sensor/smart sensor idea after viewing the sensor rescheduling PR, since the reschedule mode does not reduce the number of worker processes for sensors. The batch sensor idea is proposed for this purpose and should work well with reschedule mode.

Re: [Discuss] Airflow sensor optimization

2019-03-06 Thread Yingbo Wang
Wow, great work from Seelmann! Thanks, Fokko, for letting us know about it. We are super happy to have this feature.
On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko wrote:
> Thanks for bringing this up. I've added a comment on the Wiki:
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Ai

Re: [Discuss] Airflow sensor optimization

2019-03-06 Thread Driesprong, Fokko
Thanks for bringing this up. I've added a comment on the Wiki: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization Have you looked into the work by Seelmann? Recently he introduced the ability to reschedule sensors. When rescheduling, the slot will be given back
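A back-of-envelope sketch of why giving the slot back matters, under a simplified model that is an assumption here and not Airflow's actual accounting: checks happen at fixed intervals, each check costs a fixed time, and scheduler or DB overhead is ignored. `slot_seconds` and its parameters are hypothetical names; the mode labels only loosely mirror the poke and reschedule behaviours discussed in the thread.

```python
def slot_seconds(wait_seconds: int, poke_interval: int, poke_cost: int, mode: str) -> int:
    """Seconds a worker slot is held until the sensed condition is observed.

    Simplified model: checks run at t = 0, poke_interval, 2 * poke_interval, ...
    and the first check at or after wait_seconds succeeds.
    """
    # Ceiling division with integers, plus the initial check at t = 0.
    pokes = -(-wait_seconds // poke_interval) + 1
    if mode == "poke":
        # The slot is held across every sleep between checks.
        return (pokes - 1) * poke_interval + poke_cost
    if mode == "reschedule":
        # The slot is held only while a check is actually running.
        return pokes * poke_cost
    raise ValueError(f"unknown mode: {mode}")


# One hour of waiting, checking every five minutes, one second per check.
held_poke = slot_seconds(3600, 300, 1, "poke")
held_reschedule = slot_seconds(3600, 300, 1, "reschedule")
```

The model leaves out the extra scheduling work each re-queue causes, so it shows only the slot-occupancy side of the trade-off.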

[Discuss] Airflow sensor optimization

2019-03-06 Thread Yingbo Wang
hi,
I would like to open an AIP for Airflow sensor optimization.
*Motivation*:
Low efficiency in Airflow Sensor Implementation
Sensors are a special kind of operator that will keep running until a certain criterion is met. Examples include a specific file landing in HDFS or S3, a partition app