Hi Daniel,
I have a use case where I join two tables, say A and B, and write the joined
collection to C.
Then I would like to filter some records from C and write them to another
table, say D.
So the pipeline in the Dataflow UI should look like this, right?
A
 \
  C -> D
 /
B
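In code, the shape is roughly this (a simplified sketch: here the join is
pushed down to BigQuery as a query, the column names and the filter predicate
are just placeholders, and `p` is the Pipeline as in my earlier snippet below):

// Join A and B and read the joined rows.
PCollection<TableRow> joined = p.apply("JoinAB", BigQueryIO.readTableRows()
    .fromQuery("SELECT a.id, a.name, b.amount "
        + "FROM `project.dataset.table_A` a "
        + "JOIN `project.dataset.table_B` b ON a.id = b.id")
    .usingStandardSql());

// Write the joined collection to C.
joined.apply("WriteC", BigQueryIO.writeTableRows()
    .to("project:dataset.table_C")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

// Filter "on C": this branch reads table_C back from BigQuery and writes the
// filtered rows to D.
p.apply("ReadC", BigQueryIO.readTableRows().from("project:dataset.table_C"))
    .apply("FilterForD", Filter.by((TableRow row) -> row.get("amount") != null))
    .apply("WriteD", BigQueryIO.writeTableRows()
        .to("project:dataset.table_D")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

p.run().waitUntilFinish();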
However, the pipeline runs the C -> D branch in parallel with the join.
How can that branch run in parallel when the data has not yet been pushed to C
by the previous step?
And indeed, when I ran the pipeline, table D did not get any records, which
follows from the above.
Can you help me with this use case?
Thanks,
Ravi
On Tue, Jun 14, 2022 at 9:01 PM Daniel Collins <[email protected]> wrote:
> Can you speak to what specifically you want to be different? The job graph
> you see, with the A -> B and B -> C pieces being separate, is an accurate
> reflection of your pipeline. table_B is outside of the Beam model; because
> your data is pushed there, Dataflow has no way to identify that no
> manipulation of the data happens at table_B.
>
> If you just want to process data from A to destinations D and E while
> writing an intermediate output to table_B, you should remove the read from
> table_B and apply your further transforms to table_A_records directly (see
> the sketch below). If that is not what you want, you would need to explain
> more specifically what you want to be different. Is it a pure UI change, or
> a functional change?
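>
> Roughly, the first option (dropping the read of table_B) would look like
> this (an untested sketch reusing your identifiers; the Filter step just
> stands in for whatever transform you want between the staging and target
> tables):
>
> PCollection<TableRow> table_A_records = p.apply(BigQueryIO.readTableRows()
>     .from("project:dataset.table_A"));
>
> // Persist the intermediate output to table_B, but keep using the in-memory
> // PCollection for the downstream steps.
> table_A_records.apply("WriteB", BigQueryIO.writeTableRows()
>     .to("project:dataset.table_B")
>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>
> // Downstream steps branch off table_A_records directly instead of
> // re-reading table_B, so they show up in the same connected job graph.
> table_A_records
>     .apply("FilterForC", Filter.by((TableRow row) -> row.get("id") != null))
>     .apply("WriteC", BigQueryIO.writeTableRows()
>         .to("project:dataset.table_C")
>         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));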
>
> -Daniel
>
> On Tue, Jun 14, 2022 at 11:12 AM Ravi Kapoor <[email protected]>
> wrote:
>
>> Team,
>> Any update on this?
>>
>> On Mon, Jun 13, 2022 at 8:39 PM Ravi Kapoor <[email protected]>
>> wrote:
>>
>>> Hi Team,
>>>
>>> I am currently using Beam in my project with the Dataflow runner.
>>> I am trying to create a pipeline where the data flows from source to
>>> staging and then to target, such as:
>>>
>>> A (Source) -> B (Staging) -> C (Target)
>>>
>>> When I create a pipeline as below:
>>>
>>> PCollection<TableRow> table_A_records = p.apply(BigQueryIO.readTableRows()
>>>     .from("project:dataset.table_A"));
>>>
>>> table_A_records.apply(BigQueryIO.writeTableRows()
>>>     .to("project:dataset.table_B")
>>>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>>>
>>> PCollection<TableRow> table_B_records = p.apply(BigQueryIO.readTableRows()
>>>     .from("project:dataset.table_B"));
>>>
>>> table_B_records.apply(BigQueryIO.writeTableRows()
>>>     .to("project:dataset.table_C")
>>>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>>>
>>> p.run().waitUntilFinish();
>>>
>>>
>>> It basically creates two parallel job graphs in Dataflow instead of
>>> chaining the transforms as expected:
>>> A -> B
>>> B -> C
>>> I need to create a data pipeline in which the data flows as a chain, like:
>>>
>>>             D
>>>            /
>>> A -> B -> C
>>>            \
>>>             E
>>> Is there a way to achieve this chained flow between the source and target
>>> tables?
>>>
>>> Thanks,
>>> Ravi
>>>
>>
>>
>> --
>> Thanks,
>> Ravi Kapoor
>> +91-9818764564
>> [email protected]
>>
>
--
Thanks,
Ravi Kapoor
+91-9818764564
[email protected]