Hi,
I'm the guy who gave the Movie Magic talk.  Since it's possible to write
stateful transforms with Beam, it is capable of some very sophisticated
flow control.  I've not seen a Python framework that combines this with
streaming data nearly as well.  That said, there aren't a lot of great
working examples out there for transforms that do sophisticated flow
control, and I feel like we're always wrestling with differences in
behavior between the direct runner and Dataflow.  There was a thread about
polling patterns [1] on this list that never really got a satisfying
resolution.  Likewise, there was a thread about using an SDF with an
unbounded source [2] that also didn't get fully resolved.

[1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
[2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb
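
For anyone who finds this thread later, here is a rough, untested sketch of
the kind of stateful flow control I mean: buffer elements per key and only
release them once enough have arrived.  The keys and batch size are made up,
and a real pipeline would add a timer to flush whatever is left over.

import apache_beam as beam
from apache_beam.coders import StrUtf8Coder
from apache_beam.transforms.userstate import BagStateSpec, CombiningValueStateSpec

class ReleaseInBatches(beam.DoFn):
    # Buffer values per key in state and only emit them once BATCH_SIZE
    # have arrived -- a toy stand-in for fancier gating logic.
    BATCH_SIZE = 3
    BUFFER = BagStateSpec('buffer', StrUtf8Coder())
    COUNT = CombiningValueStateSpec('count', sum)

    def process(self,
                element,
                buffer=beam.DoFn.StateParam(BUFFER),
                count=beam.DoFn.StateParam(COUNT)):
        key, value = element
        buffer.add(value)
        count.add(1)
        if count.read() >= self.BATCH_SIZE:
            yield key, list(buffer.read())
            buffer.clear()
            count.clear()

with beam.Pipeline() as p:
    (p
     | beam.Create([('job', 'a'), ('job', 'b'), ('job', 'c'), ('job', 'd')])
     | beam.ParDo(ReleaseInBatches())
     | beam.Map(print))
    # prints ('job', ['a', 'b', 'c']); 'd' stays buffered until a timer
    # (not shown here) would flush it.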



On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett <aus...@apache.org> wrote:

> https://beamsummit.org/sessions/event-driven-movie-magic/
>
> ^^ the question made me think of that use case.  Though, unclear how close
> it is to what you're thinking about.
>
> Cheers -
>
> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <user@beam.apache.org>
> wrote:
>
>> As Jan says, theoretically possible? Sure. That particular set of
>> operations? Overkill. If you don't have it already set up I'd say even
>> something like Airflow is overkill here. If all you need to do is "launch
>> job and wait" when a file arrives... that's a small script and not
>> something that particularly requires a distributed data processing system.
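>>
>> Something along these lines would do it (the landing directory and the job
>> command are made up, just to show how small it stays):
>>
>> import subprocess
>> import time
>> from pathlib import Path
>>
>> WATCH_DIR = Path('/data/incoming')   # made-up landing directory
>> seen = set()
>>
>> while True:
>>     for f in sorted(WATCH_DIR.glob('*.csv')):
>>         if f in seen:
>>             continue
>>         seen.add(f)
>>         # launch the job and wait for it to finish
>>         subprocess.run(['python', 'run_etl.py', str(f)], check=True)
>>         # notify downstream here (webhook, message, email, ...)
>>     time.sleep(30)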
>>
>> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi,
>>>
>>> Apache Beam describes itself as "Apache Beam is an open-source, unified
>>> programming model for batch and streaming data processing pipelines, ...".
>>> As such, it is possible to use it to express essentially arbitrary logic
>>> and run it as a streaming pipeline. A streaming pipeline processes input
>>> data and produces output data and/or actions. Given these assumptions, it
>>> is technically feasible to use Apache Beam for orchestrating other
>>> workflows; the problem is that it will very likely not be efficient.
>>> Apache Beam does a lot of heavy lifting because it is designed to process
>>> large volumes of data in a scalable way, which is probably not what one
>>> would need for workflow orchestration. So, my two cents would be that
>>> although it _could_ be done, it probably _should not_ be done.
>>>
>>> Best,
>>>
>>>  Jan
>>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>>
>>> Hello,
>>> I think this page
>>> https://beam.apache.org/documentation/ml/orchestration/ might answer
>>> your question.
>>> Frankly speaking: GCP Workflows and Apache Airflow.
>>> But Beam itself is a streaming/batch data processor, not a workflow
>>> engine (IMHO).
>>>
>>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <dataner...@gmail.com>
>>> wrote:
>>>
>>>> I know it is technically possible, but my case may be a little special.
>>>> Say I have 3 steps for my control flow (ETL workflow):
>>>> Step 1. upstream file watching
>>>> Step 2. call some external service to run one job, e.g. run a notebook,
>>>> run a python script
>>>> Step 3. notify downstream workflow
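>>>>
>>>> In Beam terms I'm imagining roughly this shape (the bucket path and the
>>>> two callables are just placeholders):
>>>>
>>>> import logging
>>>> import apache_beam as beam
>>>> from apache_beam.io import fileio
>>>> from apache_beam.options.pipeline_options import PipelineOptions
>>>>
>>>> def run_external_job(file_metadata):
>>>>     # placeholder: call the notebook / script runner service here
>>>>     logging.info('launching job for %s', file_metadata.path)
>>>>     return file_metadata.path
>>>>
>>>> def notify_downstream(path):
>>>>     # placeholder: hit a webhook or publish a message here
>>>>     logging.info('notifying downstream about %s', path)
>>>>
>>>> with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
>>>>     (p
>>>>      | 'WatchUpstream' >> fileio.MatchContinuously(
>>>>          'gs://my-bucket/incoming/*', interval=60)
>>>>      | 'RunJob' >> beam.Map(run_external_job)
>>>>      | 'Notify' >> beam.Map(notify_downstream))
>>>>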
>>>> Can I use Apache Beam to build a DAG with 3 nodes and run it as either a
>>>> Flink or Spark job?  It might be a little weird, but I just want to learn
>>>> from the community whether this is the right way to use Apache Beam, and
>>>> whether anyone has done this before. Thanks
>>>>
>>>>
>>>>
>>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>>>> user@beam.apache.org> wrote:
>>>>
>>>>> It’s technically possible but the closest thing I can think of would
>>>>> be triggering jobs based on something like file watching.
>>>>>
>>>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <dataner...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Not using Beam as a time-based scheduler, but just using it to control
>>>>>> the execution order of an ETL workflow DAG, because Beam's abstraction
>>>>>> is also a DAG.
>>>>>> I know it is a little weird, just want to confirm with the community:
>>>>>> has anyone used Beam like this before?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <je...@seznam.cz>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> can you give an example of what you mean for better understanding? Do
>>>>>>> you mean using Beam as a scheduler of other ETL workflows?
>>>>>>>
>>>>>>>   Jan
>>>>>>>
>>>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > I am new to Apache Beam, and am very excited to find Beam in the
>>>>>>> > Apache community. I see lots of use cases of using Apache Beam for
>>>>>>> > data flow (processing large amounts of batch/streaming data). I am
>>>>>>> > just wondering whether I can use Apache Beam for control flow (ETL
>>>>>>> > workflow). I don't mean the Spark/Flink job in the ETL workflow, I
>>>>>>> > mean the ETL workflow itself. An ETL workflow is also a DAG, which is
>>>>>>> > very similar to the abstraction of Apache Beam, but unfortunately I
>>>>>>> > didn't find such use cases on the internet. So I'd like to ask this
>>>>>>> > question in the Beam community to confirm whether I can use Apache
>>>>>>> > Beam for control flow (ETL workflow). If yes, please let me know some
>>>>>>> > success stories of this. Thanks
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>>
>>>
