I am new to Beam, and I am pretty excited to get started.  I have been
doing quite a bit of research and playing around with the API.  But my
use case, unless I am approaching it incorrectly, suggests that I will
need to process multiple PCollections in some parts of my pipeline.

I am working out some of my business logic without a parallelization
framework first, just to get the solution working; then I will convert the
workflow to Beam.  What I am doing is reading millions of files from the
file system, processing parts of each file into three different output
types, and storing them in MongoDB in three collections.  After this
initial extraction (mapping), I modify some of the data, which results in
duplicates.  So the next step is a reduction that eliminates the
duplicates (keyed on a number of fields) and aggregates the references to
the other two data types, so each reduced object contains the dedupe
fields and a list of references to documents in the other two collections.

I'm not touching either of those two collections at this point, but this
is where my question comes in.  If I map this data, can I create three
separate PCollections instead of inserting the records into Mongo?  After
the deduplication, I need to combine data from two of the streams and
store the result of that combination in Mongo.  Then I need to process the
third collection, which will go into its own Mongo collection.
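
To make this concrete, here is a rough Python sketch of how I picture the
extraction and dedup steps, with one multi-output ParDo producing the three
PCollections instead of three Mongo inserts.  parse_file, merge_duplicates,
and all of the field names are placeholders I made up; my real logic is
more involved.

```python
import apache_beam as beam
from apache_beam.io import fileio


def parse_file(contents):
    # placeholder for my real parsing logic; returns one record of each type
    doc_id = str(hash(contents))
    a_rec = {'key': contents[:20], 'b_ref': doc_id, 'c_ref': doc_id}
    b_rec = {'_id': doc_id, 'payload': 'b'}
    c_rec = {'_id': doc_id, 'payload': 'c'}
    return a_rec, b_rec, c_rec


def merge_duplicates(records):
    # placeholder: keep the dedupe fields once, aggregate the B/C references
    return {'key': records[0]['key'],
            'b_refs': [r['b_ref'] for r in records],
            'c_refs': [r['c_ref'] for r in records]}


class ExtractThreeTypes(beam.DoFn):
    """Emit the three record types parsed from one file on separate outputs."""

    def process(self, readable_file):
        a_rec, b_rec, c_rec = parse_file(readable_file.read_utf8())
        yield beam.pvalue.TaggedOutput('type_b', b_rec)
        yield beam.pvalue.TaggedOutput('type_c', c_rec)
        yield a_rec  # main output ("type A")


with beam.Pipeline() as p:
    files = (p
             | fileio.MatchFiles('/data/input/**')
             | fileio.ReadMatches())

    extracted = files | beam.ParDo(ExtractThreeTypes()).with_outputs(
        'type_b', 'type_c', main='type_a')

    type_a = extracted.type_a   # three ordinary PCollections from one ParDo
    type_b = extracted.type_b
    type_c = extracted.type_c

    # Dedupe type A keyed on the dedupe fields, merging each group's
    # references to the B and C documents.
    deduped = (type_a
               | beam.Map(lambda r: (r['key'], r))
               | beam.GroupByKey()
               | beam.Map(lambda kv: merge_duplicates(list(kv[1]))))
```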

I hope my description was at least enough to get the conversation started.
Is my approach reasonable, and can I create multiple PCollections and use
them at different phases of my pipeline?  Or is there another way that I
should be looking at this?
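
Continuing that sketch, this is roughly what I imagine for the later
phases: a CoGroupByKey to combine two of the streams, the MongoDB connector
(WriteToMongoDB) to write the combined result, and a separate write for the
third type.  Again, the join key, field names, and connection settings are
just placeholders, and these lines would sit inside the same
"with beam.Pipeline() as p:" block as above.

```python
from apache_beam.io.mongodbio import WriteToMongoDB

# key the deduped A records and the B records by the reference linking them
keyed_a = deduped | 'KeyA' >> beam.Map(lambda r: (r['b_refs'][0], r))
keyed_b = type_b | 'KeyB' >> beam.Map(lambda r: (r['_id'], r))

# CoGroupByKey joins the two streams on that key; the Map flattens the
# grouped result into one document per key for Mongo
combined = ({'a': keyed_a, 'b': keyed_b}
            | beam.CoGroupByKey()
            | beam.Map(lambda kv: {'join_key': kv[0],
                                   'a_side': list(kv[1]['a']),
                                   'b_side': list(kv[1]['b'])}))

combined | 'WriteCombined' >> WriteToMongoDB(
    uri='mongodb://localhost:27017', db='mydb', coll='combined')

# the third type still goes to its own collection
type_c | 'WriteTypeC' >> WriteToMongoDB(
    uri='mongodb://localhost:27017', db='mydb', coll='type_c')
```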

Thanks in advance!
Steve