Hi,

Apache Beam describes itself as "an open-source, unified programming model for batch and streaming data processing pipelines, ...". As such, it can express essentially arbitrary logic and run it as a streaming pipeline, and a streaming pipeline processes input data and produces output data and/or actions. Given that, it is technically feasible to use Apache Beam to orchestrate other workflows; the problem is that it will very likely not be efficient. Apache Beam carries a lot of heavy lifting because it is designed to process large volumes of data in a scalable way, which is probably not what one needs for workflow orchestration. So my two cents would be that although it _could_ be done, it probably _should not_ be done.
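
Just for illustration, a rough sketch of what forcing a control flow like the three steps discussed below into a single Beam pipeline could look like (the step bodies are hypothetical placeholders, not real implementations):

import apache_beam as beam


class WaitForUpstreamFile(beam.DoFn):
    def process(self, element):
        # Hypothetical: poll until the upstream file exists.
        yield element


class RunExternalJob(beam.DoFn):
    def process(self, element):
        # Hypothetical: call an external service (notebook, Python script).
        yield element


class NotifyDownstream(beam.DoFn):
    def process(self, element):
        # Hypothetical: notify the downstream workflow.
        yield element


with beam.Pipeline() as p:
    (p
     | "Start" >> beam.Create([None])  # a single dummy token just to drive the "DAG"
     | "Step1" >> beam.ParDo(WaitForUpstreamFile())
     | "Step2" >> beam.ParDo(RunExternalJob())
     | "Step3" >> beam.ParDo(NotifyDownstream()))

This runs, but every node is just a ParDo over one dummy element, and you pay the full cost of a distributed data-processing runner merely to sequence three calls.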

Best,

 Jan

On 12/15/23 13:39, Mikhail Khludnev wrote:
Hello,
I think this page https://beam.apache.org/documentation/ml/orchestration/ might answer your question.
Frankly speaking: for orchestration, look at GCP Workflows or Apache Airflow.
But Beam itself is a data-stream/flow or batch processor, not a workflow engine (IMHO).
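
For the Airflow route, a minimal sketch, assuming Airflow 2.x with the apache-airflow-providers-apache-beam package installed (the file path, runner, and pipeline options below are hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.beam.operators.beam import (
    BeamRunPythonPipelineOperator,
)

# The Beam pipeline is just one task inside a larger workflow DAG.
with DAG(
    dag_id="etl_workflow",
    start_date=datetime(2023, 12, 1),
    schedule_interval=None,
) as dag:
    run_beam_job = BeamRunPythonPipelineOperator(
        task_id="run_beam_job",
        py_file="/path/to/my_pipeline.py",  # hypothetical pipeline file
        runner="FlinkRunner",               # or SparkRunner / DataflowRunner
        pipeline_options={"parallelism": "2"},  # hypothetical options
    )

The file-watching and notification steps would then be ordinary Airflow tasks (e.g. a sensor and a notification operator) wired upstream and downstream of this one.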

On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <dataner...@gmail.com> wrote:

    I know it is technically possible, but my case may be a little
    special. Say I have 3 steps in my control flow (ETL workflow):
    Step 1. upstream file watching
    Step 2. call some external service to run a job, e.g. run a
    notebook or a Python script
    Step 3. notify the downstream workflow
    Can I use Apache Beam to build a DAG with these 3 nodes and run it
    as either a Flink or Spark job? It might be a little weird, but I
    just want to learn from the community whether this is the right way
    to use Apache Beam, and whether anyone has done this before. Thanks



    On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user
    <user@beam.apache.org> wrote:

        It’s technically possible, but the closest thing I can think of
        would be triggering pipelines based on something like file watching.
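
        For example, a hedged sketch using the Beam Python SDK's built-in
        file watching transform, fileio.MatchContinuously (the file pattern
        and the downstream action here are hypothetical):

        import apache_beam as beam
        from apache_beam.io import fileio
        from apache_beam.options.pipeline_options import PipelineOptions

        # Streaming pipeline that polls a file pattern and reacts to new matches.
        options = PipelineOptions(streaming=True)
        with beam.Pipeline(options=options) as p:
            (p
             | "WatchForFiles" >> fileio.MatchContinuously(
                 "gs://my-bucket/incoming/*.csv",  # hypothetical pattern
                 interval=60)                      # poll every 60 seconds
             | "TriggerAction" >> beam.Map(
                 lambda metadata: print(f"new file: {metadata.path}")))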

        On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666
        <dataner...@gmail.com> wrote:

            I'm not using Beam as a time-based scheduler, just using it
            to control the execution order of an ETL workflow DAG,
            because Beam's abstraction is also a DAG.
            I know it is a little weird; I just want to confirm with the
            community: has anyone used Beam like this before?



            On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský
            <je...@seznam.cz> wrote:

                Hi,

                can you give an example of what you mean for better
                understanding? Do
                you mean using Beam as a scheduler of other ETL workflows?

                  Jan

                On 12/14/23 13:17, data_nerd_666 wrote:
                > Hi all,
                >
                > I am new to Apache Beam, and am very excited to find Beam in the
                > Apache community. I see lots of use cases of using Apache Beam
                > for data flow (processing large amounts of batch/streaming data).
                > I am just wondering whether I can use Apache Beam for control
                > flow (ETL workflow). I don't mean the Spark/Flink jobs in the ETL
                > workflow, I mean the ETL workflow itself. An ETL workflow is also
                > a DAG, which is very similar to the abstraction of Apache Beam,
                > but unfortunately I didn't find such use cases on the internet.
                > So I'd like to ask this question in the Beam community to confirm
                > whether I can use Apache Beam for control flow (ETL workflow). If
                > yes, please let me know some success stories of this. Thanks



--
Sincerely yours
Mikhail Khludnev
