Hi Team,
I am currently using Apache Beam in my project with the Dataflow runner.
I am trying to create a pipeline where the data flows from a source table to
a staging table and then to a target table, like this:
A (Source) -> B (Staging) -> C (Target)
When I create a pipeline as below:
PCollection<TableRow> table_A_records =
    p.apply(BigQueryIO.readTableRows().from("project:dataset.table_A"));

// A -> B
table_A_records.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_B")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

PCollection<TableRow> table_B_records =
    p.apply(BigQueryIO.readTableRows().from("project:dataset.table_B"));

// B -> C
table_B_records.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_C")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

p.run().waitUntilFinish();
Instead of the chained transformation I expected, it creates two parallel,
independent job graphs in Dataflow:
A -> B
B -> C
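If I understand Beam's graph construction correctly (and I may well be
wrong), that is because each p.apply(BigQueryIO.readTableRows()) starts a
new root in the graph, so my code above is equivalent to the following
(the transform names are just labels I added for clarity):

// Root 1: ReadA -> WriteB
p.apply("ReadA", BigQueryIO.readTableRows().from("project:dataset.table_A"))
 .apply("WriteB", BigQueryIO.writeTableRows()
     .to("project:dataset.table_B")
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

// Root 2: ReadB -> WriteC. No edge ties "ReadB" to "WriteB", so the
// runner is free to execute both roots concurrently.
p.apply("ReadB", BigQueryIO.readTableRows().from("project:dataset.table_B"))
 .apply("WriteC", BigQueryIO.writeTableRows()
     .to("project:dataset.table_C")
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));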
What I actually need is a pipeline that flows the data through the stages in
a chain, with a fan-out after the target, like:
            D
           /
A -> B -> C
           \
            E
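One idea I am considering, sketched below against the same p as above (the
pass-through DoFn and the table_D/table_E names are placeholders I made up),
is to keep the records in memory and branch the PCollection instead of
re-reading the staging table:

PCollection<TableRow> aRows =
    p.apply(BigQueryIO.readTableRows().from("project:dataset.table_A"));

// Side branch: persist the staging copy to table_B.
aRows.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_B")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

// Continue the chain from the in-memory PCollection instead of
// re-reading table_B, so the graph stays connected: A -> B -> C.
PCollection<TableRow> cRows = aRows.apply(ParDo.of(
    new DoFn<TableRow, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element()); // placeholder for the real staging -> target logic
      }
    }));

cRows.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_C")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

// Fan-out: applying further writes to the same PCollection creates
// the D and E branches off C.
cRows.apply(BigQueryIO.writeTableRows().to("project:dataset.table_D")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
cRows.apply(BigQueryIO.writeTableRows().to("project:dataset.table_E")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

The catch is that with this shape the writes to C, D, and E never wait for
the table_B load to finish, so I am not sure it is the right pattern.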
Is this a reasonable approach, or is there a better way to achieve this
chained transformation between the source and target tables?
Thanks,
Ravi