when only one POD finishes its work for example -> when only one *container* finishes its work for example
On Tue, Nov 26, 2019 at 12:41 PM Jarek Potiuk <[email protected]> wrote:

> Another thought.
>
> It looks like a "Sub-task operator" - a kind of special "Operator" to
> handle such a case (for example Docker-Compose or Kubernetes-POD driven).
>
> Currently we can trigger other tasks at the same time (in parallel) or
> sequentially (with a dependency on the task finishing or failing). But we
> could also have dependencies that are triggered DURING the task execution
> (when some of the data has already been produced but some other processes
> are still running inside the task - and another dependency is triggered
> after it finishes).
>
> We could have the operator trigger other tasks during the run (when only
> one POD finishes its work, for example). This could be far less
> paradigm-changing and could work with all the executors, I think. It's
> still a bit of a niche case, though, and maybe an unnecessary
> optimization.
>
> J.
>
> On Tue, Nov 26, 2019 at 12:25 PM Ash Berlin-Taylor <[email protected]> wrote:
>
>> My gut reaction to this is no, not as a general-purpose thing. It would
>> only work reliably with LocalExecutor - it's never going to work with the
>> Kube executor, and will almost never work in Celery.
>>
>> Also, the cases where the data is small enough to fit in memory but not
>> so large that you need to put it in S3/HDFS etc. don't seem worth
>> special-casing?
>>
>> > On 26 Nov 2019, at 11:21, Jarek Potiuk <[email protected]>
>> wrote:
>> >
>> > *TL;DR: Discuss whether shared-memory data sharing for some tasks is an
>> > interesting feature for future Airflow.*
>> >
>> > I had a few discussions recently with several Airflow users (including
>> > on Slack [1] and in person at the Warsaw Airflow meetup) about using
>> > shared memory for inter-task communication.
>> >
>> > Airflow is not currently good at such a case. It sounds doable, but
>> > fairly complex to implement (and it modifies the Airflow paradigm a
>> > bit). I am not 100% sure it's a good idea to have such a feature in the
>> > future.
>> >
>> > I see the need for it and I like it; however, I would love to ask for
>> > your opinions.
>> >
>> > *Context*
>> >
>> > The case is to have several independent tasks using a lot of temporary
>> > data in memory. They either run in parallel and share loaded data, or
>> > use shared memory to pass results between tasks. An example is machine
>> > learning (audio processing, say), where it makes sense to load the
>> > audio files into memory only once and run several tasks on that loaded
>> > data.
>> >
>> > The best way to achieve this now is to combine such memory-sharing
>> > tasks into a single operator (Docker-compose, for example?) and run
>> > them as a single Airflow task. But maybe those tasks could still be
>> > modelled as separate tasks in the Airflow DAG. One benefit is that
>> > different tasks might have different dependencies; the results of some
>> > tasks could be sent onward independently using different - existing -
>> > operators.
>> >
>> > As a workaround, we can play with queues and have one dedicated machine
>> > run all such tasks, but that has multiple limitations.
>> >
>> > *High-level idea*
>> >
>> > At a high level, it would require defining some affinity between tasks
>> > to make sure that:
>> >
>> > 1) they are all executed on the same worker machine
>> > 2) the processes remain in memory until all tasks finish, for data
>> > sharing (even if there is a dependency between the tasks)
>> > 3) back-filling acts on the whole group of such tasks as a "single
>> > unit".
>> >
>> > I would love to hear your feedback.
>> >
>> > J.
>> >
>> > [1] Slack discussion on shared memory:
>> > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200

--

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
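
A minimal sketch of the "combine into a single Airflow task" workaround from
the original mail: all memory-sharing steps run inside one PythonOperator, so
the loaded audio is shared through ordinary process memory instead of XCom or
external storage. The helpers (load_audio, extract_features, classify) are
hypothetical stand-ins for real work, and the import path assumes Airflow
1.10.x, the version current when this thread was written.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # 1.10.x path


    def load_audio(path):
        """Stand-in for expensive audio loading into memory."""
        return bytes(1024)  # pretend this is a large decoded waveform


    def extract_features(audio):
        """Stand-in for feature extraction over the in-memory audio."""
        return {"length": len(audio)}


    def classify(features):
        """Stand-in for classification; the result is small enough for XCom."""
        return "speech" if features["length"] > 0 else "silence"


    def process_audio_in_one_process():
        # All steps run in one process, so `audio` is shared via plain
        # memory rather than XCom, S3/HDFS or any shared-memory mechanism.
        audio = load_audio("/data/input.wav")
        return classify(extract_features(audio))


    with DAG(
        dag_id="shared_memory_as_single_task",
        start_date=datetime(2019, 11, 1),
        schedule_interval=None,
    ) as dag:
        # One Airflow task wrapping what could otherwise be three DAG tasks.
        PythonOperator(
            task_id="process_audio",
            python_callable=process_audio_in_one_process,
        )

The trade-off is exactly the one raised in the thread: the sub-steps are
invisible to the scheduler, so they cannot have separate dependencies,
retries or back-fills.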

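A sketch of the queue workaround is below. It relies on the existing `queue`
argument on operators: with CeleryExecutor, a task is routed only to workers
subscribed to its queue, so starting a single dedicated worker with
`airflow worker --queues shared_mem` (1.10.x CLI) on the machine holding the
data sends all such tasks there. The queue name and placeholder callables are
illustrative assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    with DAG(
        dag_id="shared_memory_via_queue",
        start_date=datetime(2019, 11, 1),
        schedule_interval=None,
    ) as dag:
        load = PythonOperator(
            task_id="load_audio",
            python_callable=lambda: None,  # placeholder: populate shared memory
            queue="shared_mem",            # routed only to workers on this queue
        )
        analyze = PythonOperator(
            task_id="analyze_audio",
            python_callable=lambda: None,  # placeholder: read the shared memory
            queue="shared_mem",            # same queue, same pool of workers
        )
        load >> analyze

Note the limitations alluded to in the thread: the queue only pins tasks to
the set of subscribed workers (it acts as "one dedicated machine" only while
a single worker listens on shared_mem), and nothing in Airflow keeps the
loaded data alive between the separate task processes.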