Hi Team,
I am currently using Apache Beam in my project with the Dataflow runner.
I am trying to create a pipeline where the data flows from a source table to
a staging table and then to a target table, like this:
A (Source) -> B (Staging) -> C (Target)
When I create a pipeline as below:
PCollection<TableRow> table_A_records =
    p.apply(BigQueryIO.readTableRows().from("project:dataset.table_A"));

// A -> B
table_A_records.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_B")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

PCollection<TableRow> table_B_records =
    p.apply(BigQueryIO.readTableRows().from("project:dataset.table_B"));

// B -> C
table_B_records.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_C")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

p.run().waitUntilFinish();
Instead of the chained transformation I expected, it creates two parallel,
independent job graphs in Dataflow:
A -> B
B -> C
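If I understand Beam's graph construction correctly (and I may well be
wrong), that is because each p.apply(BigQueryIO.readTableRows()) starts a
new root in the graph, so my code above is equivalent to the following
(the transform names are just labels I added for clarity):

// Root 1: ReadA -> WriteB
p.apply("ReadA", BigQueryIO.readTableRows().from("project:dataset.table_A"))
 .apply("WriteB", BigQueryIO.writeTableRows()
     .to("project:dataset.table_B")
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

// Root 2: ReadB -> WriteC. No edge ties "ReadB" to "WriteB", so the
// runner is free to execute both roots concurrently.
p.apply("ReadB", BigQueryIO.readTableRows().from("project:dataset.table_B"))
 .apply("WriteC", BigQueryIO.writeTableRows()
     .to("project:dataset.table_C")
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));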
What I actually need is a pipeline that flows the data through the stages in
a chain, with a fan-out after the target, like:
            D
           /
A -> B -> C
           \
            E
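One idea I am considering, sketched below against the same p as above (the
pass-through DoFn and the table_D/table_E names are placeholders I made up),
is to keep the records in memory and branch the PCollection instead of
re-reading the staging table:

PCollection<TableRow> aRows =
    p.apply(BigQueryIO.readTableRows().from("project:dataset.table_A"));

// Side branch: persist the staging copy to table_B.
aRows.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_B")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

// Continue the chain from the in-memory PCollection instead of
// re-reading table_B, so the graph stays connected: A -> B -> C.
PCollection<TableRow> cRows = aRows.apply(ParDo.of(
    new DoFn<TableRow, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element()); // placeholder for the real staging -> target logic
      }
    }));

cRows.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_C")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

// Fan-out: applying further writes to the same PCollection creates
// the D and E branches off C.
cRows.apply(BigQueryIO.writeTableRows().to("project:dataset.table_D")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
cRows.apply(BigQueryIO.writeTableRows().to("project:dataset.table_E")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

The catch is that with this shape the writes to C, D, and E never wait for
the table_B load to finish, so I am not sure it is the right pattern.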
Is this a reasonable approach, or is there a better way to achieve this
chained transformation between the source and target tables?
Thanks,
Ravi