Re: How can I work with multiple pcollections?
Lukasz, It has been a few days since your reply, but I wanted to thank you for pointing me toward the "additional outputs" portion of the documentation. I had already read through that (if not completely thoroughly) although, at the time, I did not quite know the requirements of what I would be doing, so I did not really remember that part. I have some more work to do on my code before I can begin to use Beam (to make it much better!) but I think this should help quite a bit. Thanks again! Steve On Thu, Sep 12, 2019 at 5:31 PM Lukasz Cwik wrote: > Yes you can create multiple output PCollections using a ParDo with > multiple outputs instead of inserting them into Mongo. > > It could be useful to read through the programming guide related to > PCollections[1] and PTransforms with multiple outputs[2] and feel free to > return with more questions. > > 1: https://beam.apache.org/documentation/programming-guide/#pcollections > 2: > https://beam.apache.org/documentation/programming-guide/#additional-outputs > > On Thu, Sep 12, 2019 at 2:24 PM Steve973 wrote: > >> I am new to Beam, and I am pretty excited to get started. I have been >> doing quite a bit of research and playing around with the API. But for my >> use case, unless I am not approaching it correctly, suggests that I will >> need to process multiple PCollections in some parts of my pipeline. >> >> I am working out some of my business logic without a parallelization >> framework to get the solution working. Then I will convert the workflow to >> Beam. What I am doing is reading millions of files from the file system, >> and I am processing parts of the file into three different output types, >> and storing them in MongoDB in three collections. After this initial >> extraction (mapping), I modify some of the data which will result in >> duplicates. So the next step is a reduction step to eliminate the >> duplicates (based on a number of fields) and aggregate the references to >> the other 2 data types, so the reduced object contains the dedupe fields, >> and a list of references to documents in the other 2 collections. I'm not >> touching either of these two collections at this time, but this is where my >> question comes in. If I map this data, can I create three separate >> PCollections instead of inserting them into Mongo? After the >> deduplication, I will need to combine data in two of the streams, and I >> need to store the results of that combination into mongo. Then I need to >> process the third collection, which will go into its own mongo collection. >> >> I hope my description was at least enough to get the conversation >> started. Is my approach reasonable, and can I create multiple PCollections >> and use them at different phases of my pipeline? Or is there another way >> that I should be looking at this? >> >> Thanks in advance! >> Steve >> >
Re: How can I work with multiple pcollections?
Yes you can create multiple output PCollections using a ParDo with multiple outputs instead of inserting them into Mongo. It could be useful to read through the programming guide related to PCollections[1] and PTransforms with multiple outputs[2] and feel free to return with more questions. 1: https://beam.apache.org/documentation/programming-guide/#pcollections 2: https://beam.apache.org/documentation/programming-guide/#additional-outputs On Thu, Sep 12, 2019 at 2:24 PM Steve973 wrote: > I am new to Beam, and I am pretty excited to get started. I have been > doing quite a bit of research and playing around with the API. But for my > use case, unless I am not approaching it correctly, suggests that I will > need to process multiple PCollections in some parts of my pipeline. > > I am working out some of my business logic without a parallelization > framework to get the solution working. Then I will convert the workflow to > Beam. What I am doing is reading millions of files from the file system, > and I am processing parts of the file into three different output types, > and storing them in MongoDB in three collections. After this initial > extraction (mapping), I modify some of the data which will result in > duplicates. So the next step is a reduction step to eliminate the > duplicates (based on a number of fields) and aggregate the references to > the other 2 data types, so the reduced object contains the dedupe fields, > and a list of references to documents in the other 2 collections. I'm not > touching either of these two collections at this time, but this is where my > question comes in. If I map this data, can I create three separate > PCollections instead of inserting them into Mongo? After the > deduplication, I will need to combine data in two of the streams, and I > need to store the results of that combination into mongo. Then I need to > process the third collection, which will go into its own mongo collection. > > I hope my description was at least enough to get the conversation > started. Is my approach reasonable, and can I create multiple PCollections > and use them at different phases of my pipeline? Or is there another way > that I should be looking at this? > > Thanks in advance! > Steve >
How can I work with multiple pcollections?
I am new to Beam, and I am pretty excited to get started. I have been doing quite a bit of research and playing around with the API. But for my use case, unless I am not approaching it correctly, suggests that I will need to process multiple PCollections in some parts of my pipeline. I am working out some of my business logic without a parallelization framework to get the solution working. Then I will convert the workflow to Beam. What I am doing is reading millions of files from the file system, and I am processing parts of the file into three different output types, and storing them in MongoDB in three collections. After this initial extraction (mapping), I modify some of the data which will result in duplicates. So the next step is a reduction step to eliminate the duplicates (based on a number of fields) and aggregate the references to the other 2 data types, so the reduced object contains the dedupe fields, and a list of references to documents in the other 2 collections. I'm not touching either of these two collections at this time, but this is where my question comes in. If I map this data, can I create three separate PCollections instead of inserting them into Mongo? After the deduplication, I will need to combine data in two of the streams, and I need to store the results of that combination into mongo. Then I need to process the third collection, which will go into its own mongo collection. I hope my description was at least enough to get the conversation started. Is my approach reasonable, and can I create multiple PCollections and use them at different phases of my pipeline? Or is there another way that I should be looking at this? Thanks in advance! Steve