That looks interesting. My main worry is how to handle multiple executions of 
Signal based operators. If I follow your definition correctly, the idea is to 
run multiple evaluations of the online trained model (on a permanently running 
DAGRun). So what happens when you have triggered the downstream operators using 
a signal once? Do you clear state of the operators once a new signal is 
generated? Do you generate a new DAGRun?

How do you evaluate the model several times while keeping a history of operator 
execution?

Related to this, TFX has a similar proposal for ASYNC DAGs, which basically 
describe something similar to what you are proposing in this AIP: 
https://github.com/tensorflow/community/pull/253 (interesting read as it’s also 
related to the ML field).

My main concern would be that you may need to change a few more things to 
support signal based triggers, as the relationship between upstream tasks to 
downstream tasks is no longer 1:1 but 1:many.

Gerard Casas Saez
Twitter | Cortex | @casassaez
On Jun 16, 2020, 7:00 AM -0600, 蒋晓峰 <thanosxnicho...@gmail.com>, wrote:
> Hello everyone,
>
> Sending a message to everyone and collecting feedback on the AIP-35 on
> adding signal-based scheduling. This was previously briefly mentioned in
> the discussion of development slack channel. The key motivation of this
> proposal is to support a mixture of batch and stream jobs in the same
> workflow.
>
> In practice, there are many business logics that need collaboration between
> streaming and batch jobs. For example, in machine learning, an online
> learning job is a streaming job running forever. It emits a model
> periodically and a batch evaluate job should run to validate each model. So
> this is a streaming-triggered-batch case. If we continue from the above
> example, once the model passes the validation, the model needs to be
> deployed to a serving job. That serving job could be a streaming job that
> keeps polling records from Kafka and uses a model to do prediction. And the
> availability of a new model requires that streaming prediction job either
> to restart, or to reload the new model on the fly. In either case, it is a
> batch-to-streaming job triggering.
>
> At this point above, people basically have to invent their own way to deal
> with such workflows consisting of both batch and streaming jobs. I think
> having a system that can help smoothly work with a mixture of streaming and
> batch jobs is valuable here.
>
> This AIP-35 focuses on signal-based scheduling to let the operators send
> signals to the scheduler to trigger a scheduling action, such as starting
> jobs, stopping jobs and restarting jobs. With the change of
> scheduling mechanism, a streaming job can send signals to the scheduler to
> indicate the state change of the dependencies.
>
> Signal-based scheduling allows the scheduler to know the change of the
> dependency state immediately without periodically querying the metadata
> database. This also allows potential support for richer scheduling
> semantics such as periodic execution and manual trigger at per operator
> granularity.
>
> Changes proposed:
>
> - Signals are used to define conditions that must be met to run an
> operator. State change of the upstream tasks is one type of the signals.
> There may be other types of signals. The scheduler may take different
> actions when receiving different signals. To let the operators take signals
> as their starting condition, we propose to introduce SignalOperator which
> is mentioned in the public interface section.
> - A notification service is necessary to help receive and propagate the
> signals from the operators and other sources to the scheduler. Upon
> receiving a signal, the scheduler can take action according to the
> predefined signal-based conditions on the operators. Therefore we propose
> to introduce a Signal Notification Service component to Airflow.
>
> Please see related documents and PRs for details:
>
> AIP:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-35+Add+Signal+Based+Scheduling+To+Airflow
> Issue: https://github.com/apache/airflow/issues/9321
>
> Please let me know if there are any aspects that you agree/disagree with or
> need more clarification (especially the SignalOperator and SignalService).
> Any comments are welcome and I am looking forward to it!
>
> Thanks,
> Nicholas

Reply via email to