Hi,
Apache Beam describes itself as "an open-source, unified programming
model for batch and streaming data processing pipelines, ...". As such,
it is possible to use it to express essentially arbitrary logic and run
it as a streaming pipeline. A streaming pipeline processes input data
and produces output data and/or actions. Given these assumptions, it is
technically feasible to use Apache Beam to orchestrate other workflows;
the problem is that it will most likely not be efficient. Apache Beam
carries a lot of heavy lifting because it is designed to process large
volumes of data in a scalable way, which is probably not what one needs
for workflow orchestration. So, my two cents would be that although it
_could_ be done, it probably _should not_ be done.
Best,
Jan
On 12/15/23 13:39, Mikhail Khludnev wrote:
Hello,
I think this page
https://beam.apache.org/documentation/ml/orchestration/ might answer
your question.
Frankly speaking, GCP Workflows and Apache Airflow are the better fit.
Beam itself is a data-stream/flow or batch processor, not a workflow
engine (IMHO).
On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <dataner...@gmail.com>
wrote:
I know it is technically possible, but my case may be a little
special. Say I have 3 steps in my control flow (ETL workflow):
Step 1. upstream file watching
Step 2. call some external service to run a job, e.g. run a
notebook or a Python script
Step 3. notify the downstream workflow
Can I use Apache Beam to build a DAG with these 3 nodes and run it
as either a Flink or a Spark job? It might be a little weird, but I
just want to learn from the community whether this is the right way
to use Apache Beam, and whether anyone has done this before. Thanks
On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user
<user@beam.apache.org> wrote:
It’s technically possible, but the closest thing I can think of
would be triggering work based on something like file watching.
On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666
<dataner...@gmail.com> wrote:
Not using Beam as a time-based scheduler, but just using it to
control the execution order of an ETL workflow DAG, because
Beam's abstraction is also a DAG.
I know it is a little weird; I just want to confirm with the
community: has anyone used Beam like this before?
On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský
<je...@seznam.cz> wrote:
Hi,
can you give an example of what you mean, for better understanding?
Do you mean using Beam as a scheduler for other ETL workflows?
Jan
On 12/14/23 13:17, data_nerd_666 wrote:
> Hi all,
>
> I am new to Apache Beam, and am very excited to find Beam in the
> Apache community. I see lots of use cases of using Apache Beam for
> data flow (processing large amounts of batch/streaming data). I am
> just wondering whether I can use Apache Beam for control flow (ETL
> workflow). I don't mean the Spark/Flink jobs in the ETL workflow; I
> mean the ETL workflow itself. An ETL workflow is also a DAG, which
> is very similar to the abstraction of Apache Beam, but unfortunately
> I didn't find such use cases on the internet. So I'd like to ask
> this question in the Beam community to confirm whether I can use
> Apache Beam for control flow (ETL workflow). If yes, please let me
> know about some success stories. Thanks
--
Sincerely yours
Mikhail Khludnev