Another way to put this question: how do we write a Beam pipeline for an
existing pipeline (in Java) that has a dozen custom objects, where you
have to work with multiple HashMaps of those custom objects in order to
transform them? Currently, I am writing the Beam pipeline using the same
custom objects, getters and setters, and HashMap<CustomObject> *but inside
a DoFn*. Is this the optimal way, or does Beam offer something else?
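To make the question concrete, here is a minimal sketch (all names here are
hypothetical, not from the actual pipeline) of how that getter/setter and
HashMap logic can live in an ordinary static method that a DoFn's
@ProcessElement simply delegates to:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one of the pipeline's custom objects.
class CustomObject {
    private final String id;
    private String value;

    public CustomObject(String id, String value) {
        this.id = id;
        this.value = value;
    }

    public String getId() { return id; }
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}

public class EnrichmentSketch {
    // Mirrors what a DoFn's @ProcessElement would do per element:
    // enrich one CustomObject using a template map (the "Map 2" idea),
    // which in Beam would typically arrive as a side input.
    public static CustomObject enrich(CustomObject element,
                                      Map<String, String> template) {
        String extra = template.getOrDefault(element.getId(), "unknown");
        element.setValue(element.getValue() + ":" + extra);
        return element;
    }

    public static void main(String[] args) {
        Map<String, String> template = new HashMap<>();
        template.put("a", "enriched");
        CustomObject out = enrich(new CustomObject("a", "raw"), template);
        System.out.println(out.getValue()); // prints raw:enriched
    }
}
```

Keeping the per-element logic in a plain method like this means the DoFn
itself stays thin; in Beam the template map could be broadcast to the DoFn
via a side input (e.g. View.asMap) instead of being rebuilt inside it.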

On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan <
harish.prav...@gmail.com> wrote:

> Hi Luke,
>
> We can think of Map 2 as a kind of template with which you enrich the
> data in Map 1. As I mentioned in my previous post, this is a high-level
> scenario.
>
> All this logic is spread across several classes (~4K lines of code in
> total). As in any Java application:
>
> 1. The code is modularized into multiple method calls.
> 2. HashMap<CustomObject> instances are passed as arguments between methods.
> 3. Attributes of the custom objects are accessed through getters and setters.
>
> This is a common pattern in a normal Java application, but I have not seen
> an example of such code in Beam.
>
>
> On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik <lc...@google.com> wrote:
>
>> Who reads map 1?
>> Can it be stale?
>>
>> It is unclear what you are trying to do in parallel and why you wouldn't
>> stick all this logic into a single DoFn / stateful DoFn.
>>
>> On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan <
>> harish.prav...@gmail.com> wrote:
>>
>>> Hello Everyone,
>>>
>>> I am in the process of implementing an existing pipeline (written using
>>> Java and Kafka) in Apache Beam. The data from the source stream is
>>> contrived and has to go through several steps of enrichment via REST API
>>> calls and parsing of JSON data. The key transformation in the existing
>>> pipeline is shown below (a super high-level flow):
>>>
>>> *Method A*
>>> ----Calls *Method B*
>>>       ----Creates *Map 1, Map 2*
>>> ----Calls *Method C*
>>>      ----Read *Map 2*
>>>      ----Create *Map 3*
>>> ----*Method C*
>>>      ----Read *Map 3* and
>>>      ----update *Map 1*
>>>
>>> The Maps we use are multi-level maps, and I am thinking of having a
>>> PCollection for each Map and passing them as side inputs to a DoFn
>>> wherever a transformation needs two or more Maps. But there are certain
>>> tasks where I want to make sure I am following the right approach, for
>>> instance updating one of the side-input maps inside a DoFn.
>>>
>>> These are my initial thoughts/questions and I would like to get some
>>> expert advice on how we typically design such an interleaved transformation
>>> in Apache Beam. Appreciate your valuable insights on this.
>>>
>>> --
>>> Thanks,
>>> Praveen K Viswanathan
>>>
>>
>
> --
> Thanks,
> Praveen K Viswanathan
>


-- 
Thanks,
Praveen K Viswanathan
