I am new to Beam, and I am pretty excited to get started. I have been doing quite a bit of research and playing around with the API. But my use case, unless I am approaching it incorrectly, suggests that I will need to process multiple PCollections in some parts of my pipeline.
I am working out some of my business logic without a parallelization framework first, to get the solution working; then I will convert the workflow to Beam. What I am doing is reading millions of files from the file system, processing parts of each file into three different output types, and storing them in MongoDB in three collections.

After this initial extraction (mapping), I modify some of the data, which results in duplicates. So the next step is a reduction step that eliminates the duplicates (based on a number of fields) and aggregates the references to the other two data types, so that the reduced object contains the dedupe fields and a list of references to documents in the other two collections. I am not touching either of those two collections at this point, but this is where my question comes in: if I map this data, can I create three separate PCollections instead of inserting the records into Mongo?

After the deduplication, I need to combine the data from two of the streams and store the result of that combination in Mongo. Then I need to process the third stream, which will go into its own Mongo collection.

I hope my description was at least enough to get the conversation started. Is my approach reasonable, and can I create multiple PCollections and use them at different phases of my pipeline? Or is there another way that I should be looking at this?
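To make the question concrete, here is a rough sketch of what I have in mind using the Python SDK. The field names (doc, ref_a, field1, join_key, etc.) and the input path are just placeholders for my real schema, and I am guessing that a multi-output ParDo with tagged outputs is the right way to do the three-way split, with a GroupByKey on the dedupe fields for the reduction:

```python
import json

import apache_beam as beam
from apache_beam import pvalue


class SplitIntoThreeTypes(beam.DoFn):
    """Emit the main record on the default output and the two
    reference types on tagged side outputs."""

    def process(self, line):
        record = json.loads(line)
        yield record['doc']  # main record type
        yield pvalue.TaggedOutput('refs_a', record['ref_a'])
        yield pvalue.TaggedOutput('refs_b', record['ref_b'])


def merge_duplicates(kv):
    """Collapse main records sharing a dedupe key into one object
    that keeps the dedupe fields plus a list of references."""
    _, records = kv
    records = list(records)
    merged = dict(records[0])
    merged['refs'] = [r['ref_id'] for r in records]
    return merged


with beam.Pipeline() as p:
    split = (
        p
        | 'ReadFiles' >> beam.io.ReadFromText('/data/extracted/*.jsonl')
        | 'Split' >> beam.ParDo(SplitIntoThreeTypes()).with_outputs(
            'refs_a', 'refs_b', main='docs')
    )
    docs, refs_a, refs_b = split.docs, split.refs_a, split.refs_b

    # Reduction: key the main records by the dedupe fields,
    # group the duplicates together, and merge each group.
    deduped = (
        docs
        | 'KeyByDedupeFields' >> beam.Map(
            lambda d: ((d['field1'], d['field2']), d))
        | 'GroupDupes' >> beam.GroupByKey()
        | 'MergeDupes' >> beam.Map(merge_duplicates)
    )

    # Combining two of the streams later could be a CoGroupByKey join.
    joined = (
        {
            'docs': deduped | 'KeyDocs' >> beam.Map(
                lambda d: (d['join_key'], d)),
            'refs': refs_a | 'KeyRefs' >> beam.Map(
                lambda r: (r['join_key'], r)),
        }
        | 'Join' >> beam.CoGroupByKey()
    )
```

My understanding is that a PCollection is just an immutable handle in the pipeline graph, so holding on to three of them and consuming each at a different phase should be fine, and when it comes time to write I would swap the final steps for something like the MongoDB connector. Does that match how you would structure it?

Thanks in advance!

Steve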