Re: Designing an existing pipeline in Beam
Thanks Luke, I would like to try the latter approach. Would you be able to share any pseudo-code, or point to any example, of how to call a common method inside a DoFn's, let's say, ProcessElement method?

On Tue, Jun 23, 2020 at 6:35 PM Luke Cwik wrote:
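For reference, the pattern Luke describes — factoring the common code into a method that more than one DoFn invokes — might be sketched roughly as below. All names here (EnrichmentUtil, enrich, EnrichFn) are hypothetical, and the Beam-specific wrapper is shown only in comments; the helper itself is plain Java, which also makes it unit-testable without a runner:

```java
// Sketch (hypothetical names): shared enrichment logic factored into a
// static helper so several DoFns can call it from their @ProcessElement
// methods. The helper must be stateless (or thread-safe), because a
// runner may invoke a DoFn from many threads in parallel.
import java.util.HashMap;
import java.util.Map;

public class EnrichmentUtil {

    // Common logic used by more than one DoFn: copy the record and fill
    // in any keys present in the template but missing from the record.
    public static Map<String, String> enrich(Map<String, String> record,
                                             Map<String, String> template) {
        Map<String, String> out = new HashMap<>(record);
        template.forEach(out::putIfAbsent);
        return out;
    }

    // Inside a DoFn it would be invoked like this (Beam types elided):
    //
    // class EnrichFn extends DoFn<Map<String, String>, Map<String, String>> {
    //     @ProcessElement
    //     public void processElement(ProcessContext c) {
    //         Map<String, String> template = c.sideInput(templateView);
    //         c.output(EnrichmentUtil.enrich(c.element(), template));
    //     }
    // }
}
```

A second DoFn would call the same EnrichmentUtil.enrich, exactly as any two ordinary Java classes share a utility method.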
Re: Designing an existing pipeline in Beam
You can apply the same DoFn / Transform instance multiple times in the graph, or you can follow regular development practices where the common code is factored into a method and two different DoFns invoke it.

On Tue, Jun 23, 2020 at 4:28 PM Praveen K Viswanathan <harish.prav...@gmail.com> wrote:
Re: Designing an existing pipeline in Beam
Hi Luke - Thanks for the explanation. The limitation due to directed-graph processing and the option of external storage clears most of the questions I had with respect to designing this pipeline. I do have one more scenario to clarify on this thread.

If I had a certain piece of logic that I had to use in more than one DoFn, how do we do that? In a normal Java application, we can put it in a separate method and call it wherever we want. Is it possible to replicate something like that in Beam's DoFn?

On Tue, Jun 23, 2020 at 3:47 PM Luke Cwik wrote:

--
Thanks,
Praveen K Viswanathan
Re: Designing an existing pipeline in Beam
Beam is really about parallelizing the processing. Using a single DoFn that does everything is fine as long as the DoFn can process elements in parallel (e.g. the upstream source produces lots of elements). Composing multiple DoFns is great for re-use and testing, but it isn't strictly necessary. Also, Beam doesn't support back edges in the processing graph, so all data flows in one direction and you can't have a cycle. This only allows for map 1 to produce map 2, which then produces map 3, which is then used to update map 1, if all of that logic is within a single DoFn/Transform, or if you create a cycle using an external system, such as writing to Kafka topic X and reading from Kafka topic X within the same pipeline, or updating a database downstream from where it is read. There is a lot of ordering complexity and stale-data issues whenever using an external store to create a cycle, though.

On Mon, Jun 22, 2020 at 6:02 PM Praveen K Viswanathan <harish.prav...@gmail.com> wrote:
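To make the single-DoFn option concrete: because the graph cannot contain a back edge, the map 1 → map 2 → map 3 → update map 1 round trip has to complete inside one unit of work. A plain-Java sketch of that idea follows; the class, method, and map key names are all hypothetical stand-ins for the real Method A/B/C logic, and the Beam DoFn wrapper around it is omitted:

```java
// Sketch (hypothetical names): the whole map 1 -> map 2 -> map 3 ->
// update map 1 flow runs inside one method per element, so the "cycle"
// never appears as an edge in the pipeline graph. A single DoFn's
// @ProcessElement would simply call CyclicMapUpdate.process(element).
import java.util.HashMap;
import java.util.Map;

public class CyclicMapUpdate {

    public static Map<String, String> process(String element) {
        // Method B: create map 1 and map 2 from the element.
        Map<String, String> map1 = new HashMap<>();
        map1.put("element", element);
        Map<String, String> map2 = new HashMap<>();
        map2.put("template", element.toUpperCase());

        // Method C, first part: read map 2, create map 3.
        Map<String, String> map3 = new HashMap<>(map2);
        map3.put("derived", map2.get("template") + "-derived");

        // Method C, second part: read map 3 and update map 1.
        // Because this happens within one call, no back edge is needed.
        map1.put("enriched", map3.get("derived"));
        return map1;
    }
}
```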
Re: Designing an existing pipeline in Beam
Another way to put this question is: how do we write a Beam pipeline for an existing pipeline (in Java) that has a dozen custom objects, where you have to work with multiple HashMaps of those custom objects in order to transform the data? Currently, I am writing a Beam pipeline using the same custom objects, getters and setters, and HashMaps, *but inside a DoFn*. Is this the optimal way, or does Beam offer something else?

On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan <harish.prav...@gmail.com> wrote:

--
Thanks,
Praveen K Viswanathan
Re: Designing an existing pipeline in Beam
Hi Luke,

We can think of Map 2 as a kind of template using which you enrich the data in Map 1. As I mentioned in my previous post, this is a high-level scenario.

All of this logic is spread across several classes (with ~4K lines of code in total). As in any Java application:

1. The code has been modularized with multiple method calls
2. HashMaps are passed around as arguments to each method
3. The attributes of the custom objects are accessed using getters and setters

This is a common pattern in a normal Java application, but I have not seen such an example of code in Beam.

On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik wrote:

--
Thanks,
Praveen K Viswanathan
Re: Designing an existing pipeline in Beam
Who reads map 1?
Can it be stale?

It is unclear what you are trying to do in parallel, and why you wouldn't stick all this logic into a single DoFn / stateful DoFn.

On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan <harish.prav...@gmail.com> wrote:
Designing an existing pipeline in Beam
Hello Everyone,

I am in the process of implementing an existing pipeline (written using Java and Kafka) in Apache Beam. The data from the source stream is contrived and has to go through several steps of enrichment using REST API calls and parsing of JSON data. The key transformation in the existing pipeline is shown below (a super-high-level flow):

*Method A*
    Calls *Method B*
        Creates *Map 1, Map 2*
    Calls *Method C*
        Read *Map 2*
        Create *Map 3*
*Method C*
    Read *Map 3* and
    update *Map 1*

The Maps we use are multi-level maps, and I am thinking of having PCollections for each Map and passing them as side inputs to a DoFn wherever I have transformations that need two or more Maps. But there are certain tasks where I want to make sure that I am following the right approach, for instance updating one of the side-input maps inside a DoFn.

These are my initial thoughts/questions, and I would like to get some expert advice on how we typically design such an interleaved transformation in Apache Beam. Appreciate your valuable insights on this.

--
Thanks,
Praveen K Viswanathan
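For reference, the side-input approach described above might look roughly like the following. This is an untested, illustrative sketch using Beam Java API names (View.asMap, withSideInputs, sideInput); the pipeline variables and the enrichment logic are hypothetical:

```java
// Illustrative sketch only (hypothetical names). A map is materialized
// as a side input with View.asMap and read inside a DoFn via sideInput.
PCollection<KV<String, String>> map2 = /* built earlier in the pipeline */;
final PCollectionView<Map<String, String>> map2View =
    map2.apply(View.asMap());

PCollection<String> enriched = input.apply(
    ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Side inputs are read-only inside the DoFn.
            Map<String, String> m2 = c.sideInput(map2View);
            c.output(c.element() + "|" + m2.getOrDefault(c.element(), ""));
        }
    }).withSideInputs(map2View));
```

Note that a side input cannot be mutated from inside the DoFn: "updating" Map 1 would mean producing a new PCollection downstream (or going through an external store), since, as discussed later in this thread, the Beam graph has no back edges.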