Hi, I'm the guy who gave the Movie Magic talk. Because Beam lets you write stateful transforms, it is capable of some very sophisticated flow control, and I've not seen a Python framework that combines this with streaming data nearly as well. That said, there aren't many good working examples of transforms that do sophisticated flow control, and I feel like we're always wrestling with differences in behavior between the direct runner and Dataflow. There was a thread about polling patterns [1] on this list that never really got a satisfying resolution. Likewise, there was a thread about using an SDF with an unbounded source [2] that also didn't get fully resolved.
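To make concrete what I mean by a polling pattern, here is the bare control flow in plain Python (standard library only, no Beam). This is a rough sketch, not what we run: `poll_until` and `check` are names made up for illustration, and in a Beam pipeline the blocking loop would have to be expressed with state and timers or an SDF instead of `time.sleep`.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    check: Callable[[], Optional[T]],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> T:
    """Call `check` repeatedly until it returns something other than None.

    `check` is a hypothetical callback, e.g. "has the external job finished?".
    Raises TimeoutError if no result arrives within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before the timeout")
        # Blocking wait between attempts; fine in a script, not in a DoFn.
        time.sleep(interval_s)
```

The hard part in Beam is exactly the last line: the wait between attempts has to survive across bundles and workers, which is where the runner differences show up.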
[1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
[2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb

On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett <aus...@apache.org> wrote:

> https://beamsummit.org/sessions/event-driven-movie-magic/
>
> ^^ the question made me think of that use case. Though, unclear how close
> it is to what you're thinking about.
>
> Cheers -
>
> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <user@beam.apache.org>
> wrote:
>
>> As Jan says, theoretically possible? Sure. That particular set of
>> operations? Overkill. If you don't have it already set up I'd say even
>> something like Airflow is overkill here. If all you need to do is "launch
>> job and wait" when a file arrives... that's a small script and not
>> something that particularly requires a distributed data processing system.
>>
>> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi,
>>>
>>> Apache Beam describes itself as "Apache Beam is an open-source, unified
>>> programming model for batch and streaming data processing pipelines, ...".
>>> As such, it is possible to use it to express essentially arbitrary logic
>>> and run it as a streaming pipeline. A streaming pipeline processes input
>>> data and produces output data and/or actions. Given these assumptions, it
>>> is technically feasible to use Apache Beam for orchestrating other
>>> workflows, the problem is that it will very much likely not be efficient.
>>> Apache Beam has a lot of heavy-lifting related to the fact it is designed
>>> to process large volumes of data in a scalable way, which is probably not
>>> what would one need for workflow orchestration. So, my two cents would be,
>>> that although it _could_ be done, it probably _should not_ be done.
>>>
>>> Best,
>>>
>>> Jan
>>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>>
>>> Hello,
>>> I think this page
>>> https://beam.apache.org/documentation/ml/orchestration/ might answer
>>> your question.
>>> Frankly speaking: GCP Workflows and Apache Airflow.
>>> But Beam itself is a data-stream/flow or batch processor; not a workflow
>>> engine (IMHO).
>>>
>>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <dataner...@gmail.com>
>>> wrote:
>>>
>>>> I know it is technically possible, but my case may be a little special.
>>>> Say I have 3 steps for my control flow (ETL workflow):
>>>> Step 1. upstream file watching
>>>> Step 2. call some external service to run one job, e.g. run a notebook,
>>>> run a python script
>>>> Step 3. notify downstream workflow
>>>> Can I use apache beam to build a DAG with 3 nodes and run this as
>>>> either flink or spark job. It might be a little weird, but I just want to
>>>> learn from the community whether this is the right way to use apache beam,
>>>> and has anyone done this before? Thanks
>>>>
>>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>>>> user@beam.apache.org> wrote:
>>>>
>>>>> It’s technically possible but the closest thing I can think of would
>>>>> be triggering things based on things like file watching.
>>>>>
>>>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <dataner...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Not using beam as time-based scheduler, but just use it to control
>>>>>> execution orders of ETL workflow DAG, because beam's abstraction is also
>>>>>> a DAG.
>>>>>> I know it is a little weird, just want to confirm with the community,
>>>>>> has anyone used beam like this before?
>>>>>>
>>>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <je...@seznam.cz>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> can you give an example of what you mean for better understanding? Do
>>>>>>> you mean using Beam as a scheduler of other ETL workflows?
>>>>>>>
>>>>>>> Jan
>>>>>>>
>>>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > I am new to apache beam, and am very excited to find beam in apache
>>>>>>> > community. I see lots of use cases of using apache beam for data flow
>>>>>>> > (process large amount of batch/streaming data). I am just wondering
>>>>>>> > whether I can use apache beam for control flow (ETL workflow). I don't
>>>>>>> > mean the spark/flink job in the ETL workflow, I mean the ETL workflow
>>>>>>> > itself. Because ETL workflow is also a DAG which is very similar as
>>>>>>> > the abstraction of apache beam, but unfortunately I didn't find such
>>>>>>> > use cases on internet. So I'd like to ask this question in beam
>>>>>>> > community to confirm whether I can use apache beam for control flow
>>>>>>> > (ETL workflow). If yes, please let me know some success stories of
>>>>>>> > this. Thanks
>>>>>>> >
>>>>>>>
>>>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev