Re: How can I work with multiple pcollections?

2019-09-16 Thread Steve973
Lukasz,

It has been a few days since your reply, but I wanted to thank you for
pointing me toward the "additional outputs" portion of the documentation.
I had already read through that (if not completely thoroughly), although
at the time I did not quite know the requirements of what I would be
doing, so I did not really remember that part.  I have some more work to
do on my code before I can begin to use Beam (to make it much better!),
but I think this should help quite a bit.

Thanks again!
Steve

On Thu, Sep 12, 2019 at 5:31 PM Lukasz Cwik  wrote:

> Yes, you can create multiple output PCollections by using a ParDo with
> multiple outputs instead of inserting them into Mongo.
>
> It could be useful to read through the sections of the programming
> guide on PCollections [1] and PTransforms with multiple outputs [2],
> and feel free to return with more questions.
>
> 1: https://beam.apache.org/documentation/programming-guide/#pcollections
> 2: https://beam.apache.org/documentation/programming-guide/#additional-outputs
>
> On Thu, Sep 12, 2019 at 2:24 PM Steve973  wrote:
>
>> I am new to Beam, and I am pretty excited to get started.  I have been
>> doing quite a bit of research and playing around with the API.  But my
>> use case, unless I am approaching it incorrectly, suggests that I will
>> need to process multiple PCollections in some parts of my pipeline.
>>
>> I am working out some of my business logic without a parallelization
>> framework to get the solution working.  Then I will convert the
>> workflow to Beam.  What I am doing is reading millions of files from
>> the file system, processing parts of each file into three different
>> output types, and storing them in three MongoDB collections.  After
>> this initial extraction (mapping), I modify some of the data, which
>> results in duplicates.  So the next step is a reduction that
>> eliminates the duplicates (based on a number of fields) and aggregates
>> the references to the other two data types, so that each reduced
>> object contains the dedupe fields and a list of references to
>> documents in the other two collections.  I am not touching either of
>> those two collections at this time, but this is where my question
>> comes in.  If I map this data, can I create three separate
>> PCollections instead of inserting them into Mongo?  After the
>> deduplication, I will need to combine data from two of the streams,
>> and I need to store the results of that combination in Mongo.  Then I
>> need to process the third collection, which will go into its own
>> Mongo collection.
>>
>> I hope my description was at least enough to get the conversation
>> started.  Is my approach reasonable, and can I create multiple PCollections
>> and use them at different phases of my pipeline?  Or is there another way
>> that I should be looking at this?
>>
>> Thanks in advance!
>> Steve
>>
>


Re: How can I work with multiple pcollections?

2019-09-12 Thread Lukasz Cwik
Yes, you can create multiple output PCollections by using a ParDo with
multiple outputs instead of inserting them into Mongo.
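
For example, a minimal sketch with the Java SDK (FileRecord, TypeA,
TypeB, TypeC, and the extract* helpers are placeholders for your own
types, not anything from your actual pipeline):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// One tag per output type; the anonymous subclasses let Beam infer coders.
final TupleTag<TypeA> aTag = new TupleTag<TypeA>() {};
final TupleTag<TypeB> bTag = new TupleTag<TypeB>() {};
final TupleTag<TypeC> cTag = new TupleTag<TypeC>() {};

PCollectionTuple extracted = files.apply("ExtractParts",
    ParDo.of(new DoFn<FileRecord, TypeA>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            FileRecord file = c.element();
            c.output(extractA(file));        // main output (aTag)
            c.output(bTag, extractB(file));  // additional output
            c.output(cTag, extractC(file));  // additional output
          }
        })
        .withOutputTags(aTag, TupleTagList.of(bTag).and(cTag)));

// Three separate PCollections, each usable independently downstream.
PCollection<TypeA> as = extracted.get(aTag);
PCollection<TypeB> bs = extracted.get(bTag);
PCollection<TypeC> cs = extracted.get(cTag);

Each of those can then feed its own downstream transforms, so nothing
has to round-trip through Mongo between steps.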

It could be useful to read through the sections of the programming
guide on PCollections [1] and PTransforms with multiple outputs [2],
and feel free to return with more questions.

1: https://beam.apache.org/documentation/programming-guide/#pcollections
2: https://beam.apache.org/documentation/programming-guide/#additional-outputs

On Thu, Sep 12, 2019 at 2:24 PM Steve973  wrote:

> I am new to Beam, and I am pretty excited to get started.  I have been
> doing quite a bit of research and playing around with the API.  But my
> use case, unless I am approaching it incorrectly, suggests that I will
> need to process multiple PCollections in some parts of my pipeline.
>
> I am working out some of my business logic without a parallelization
> framework to get the solution working.  Then I will convert the
> workflow to Beam.  What I am doing is reading millions of files from
> the file system, processing parts of each file into three different
> output types, and storing them in three MongoDB collections.  After
> this initial extraction (mapping), I modify some of the data, which
> results in duplicates.  So the next step is a reduction that
> eliminates the duplicates (based on a number of fields) and aggregates
> the references to the other two data types, so that each reduced
> object contains the dedupe fields and a list of references to
> documents in the other two collections.  I am not touching either of
> those two collections at this time, but this is where my question
> comes in.  If I map this data, can I create three separate
> PCollections instead of inserting them into Mongo?  After the
> deduplication, I will need to combine data from two of the streams,
> and I need to store the results of that combination in Mongo.  Then I
> need to process the third collection, which will go into its own
> Mongo collection.
>
> I hope my description was at least enough to get the conversation
> started.  Is my approach reasonable, and can I create multiple PCollections
> and use them at different phases of my pipeline?  Or is there another way
> that I should be looking at this?
>
> Thanks in advance!
> Steve
>


How can I work with multiple pcollections?

2019-09-12 Thread Steve973
I am new to Beam, and I am pretty excited to get started.  I have been
doing quite a bit of research and playing around with the API.  But my
use case, unless I am approaching it incorrectly, suggests that I will
need to process multiple PCollections in some parts of my pipeline.

I am working out some of my business logic without a parallelization
framework to get the solution working.  Then I will convert the
workflow to Beam.  What I am doing is reading millions of files from
the file system, processing parts of each file into three different
output types, and storing them in three MongoDB collections.  After
this initial extraction (mapping), I modify some of the data, which
results in duplicates.  So the next step is a reduction that eliminates
the duplicates (based on a number of fields) and aggregates the
references to the other two data types, so that each reduced object
contains the dedupe fields and a list of references to documents in the
other two collections.  I am not touching either of those two
collections at this time, but this is where my question comes in.  If I
map this data, can I create three separate PCollections instead of
inserting them into Mongo?  After the deduplication, I will need to
combine data from two of the streams, and I need to store the results
of that combination in Mongo.  Then I need to process the third
collection, which will go into its own Mongo collection.
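
To make that concrete, here is a rough sketch of how the reduction and
combination might translate to Beam's Java SDK, using CoGroupByKey as
the join (TypeA, TypeB, MergedDoc, dedupeKey, joinKey, and merge are
all illustrative placeholders, not names from my code):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;

// Key each stream by the dedupe fields so duplicates share a key.
PCollection<KV<String, TypeA>> keyedA = as.apply("KeyByDedupeFields",
    WithKeys.of((TypeA a) -> dedupeKey(a))
        .withKeyType(TypeDescriptors.strings()));
PCollection<KV<String, TypeB>> keyedB = bs.apply("KeyByJoinFields",
    WithKeys.of((TypeB b) -> joinKey(b))
        .withKeyType(TypeDescriptors.strings()));

final TupleTag<TypeA> aTag = new TupleTag<TypeA>() {};
final TupleTag<TypeB> bTag = new TupleTag<TypeB>() {};

// Co-group the two streams: each key yields all duplicate TypeA records
// plus the TypeB documents whose references get aggregated onto it.
PCollection<MergedDoc> merged =
    KeyedPCollectionTuple.of(aTag, keyedA)
        .and(bTag, keyedB)
        .apply(CoGroupByKey.create())
        .apply("Merge",
            ParDo.of(new DoFn<KV<String, CoGbkResult>, MergedDoc>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                Iterable<TypeA> dupes = c.element().getValue().getAll(aTag);
                Iterable<TypeB> refs = c.element().getValue().getAll(bTag);
                // Collapse duplicates into one record and attach the refs.
                c.output(merge(c.element().getKey(), dupes, refs));
              }
            }));

For the final store, Beam ships a MongoDB connector (MongoDbIO), so the
merged results could be converted to BSON documents and written out
(the URI, database, collection name, and toDocument helper below are
made up for illustration):

import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.bson.Document;

merged
    .apply("ToDocument",
        MapElements.into(TypeDescriptor.of(Document.class))
            .via((MergedDoc m) -> toDocument(m)))
    .apply(MongoDbIO.write()
        .withUri("mongodb://localhost:27017")
        .withDatabase("mydb")
        .withCollection("merged"));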

I hope my description was at least enough to get the conversation started.
Is my approach reasonable, and can I create multiple PCollections and use
them at different phases of my pipeline?  Or is there another way that I
should be looking at this?

Thanks in advance!
Steve