Re: Can Apache Beam be used for control flow (ETL workflow)

2023-12-24 Thread Steve973
Wouldn't Apache Camel be more appropriate for the orchestration aspect?
Then delegate to Beam for processing?

On Sun, Dec 24, 2023, 8:47 AM data_nerd_666  wrote:

> Thanks Austin & Chad, but my use case is to use Beam to do ETL workflow
> control, which seems different from your cases. I would like to check
> whether anyone has used Beam for this kind of use case and whether Beam is
> a good choice.
>
> On Sat, Dec 23, 2023 at 12:58 AM Chad Dombrova  wrote:
>
>> Hi,
>> I'm the guy who gave the Movie Magic talk.  Since it's possible to write
>> stateful transforms with Beam, it is capable of some very sophisticated
>> flow control.   I've not seen a python framework that combines this with
>> streaming data nearly as well.  That said, there aren't a lot of great
>> working examples out there for transforms that do sophisticated flow
>> control, and I feel like we're always wrestling with differences in
>> behavior between the direct runner and Dataflow.  There was a thread about
>> polling patterns [1] on this list that never really got a satisfying
>> resolution.  Likewise, there was a thread about using an SDF with an
>> unbound source [2] that also didn't get fully resolved.
>>
>> [1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
>> [2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb
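>>
>> To give a flavor of what I mean by stateful flow control, here is a
>> sketch of a DoFn that buffers elements per key and only releases them
>> downstream once a condition is met.  The types, coders, and release
>> threshold are all placeholders, not anything from a real pipeline:
>>
>> // State classes come from org.apache.beam.sdk.state.{BagState, StateSpec,
>> // StateSpecs, ValueState}; coders from org.apache.beam.sdk.coders.
>> static class GateFn extends DoFn<KV<String, String>, String> {
>>   @StateId("buffer")
>>   private final StateSpec<BagState<String>> bufferSpec =
>>       StateSpecs.bag(StringUtf8Coder.of());
>>
>>   @StateId("count")
>>   private final StateSpec<ValueState<Integer>> countSpec =
>>       StateSpecs.value(VarIntCoder.of());
>>
>>   @ProcessElement
>>   public void process(
>>       @Element KV<String, String> element,
>>       @StateId("buffer") BagState<String> buffer,
>>       @StateId("count") ValueState<Integer> count,
>>       OutputReceiver<String> out) {
>>     Integer current = count.read();
>>     int seen = (current == null ? 0 : current) + 1;
>>     buffer.add(element.getValue());
>>     if (seen >= 10) {  // placeholder release condition
>>       for (String buffered : buffer.read()) {
>>         out.output(buffered);
>>       }
>>       buffer.clear();
>>       count.clear();
>>     } else {
>>       count.write(seen);
>>     }
>>   }
>> }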
>>
>>
>>
>> On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett  wrote:
>>
>>> https://beamsummit.org/sessions/event-driven-movie-magic/
>>>
>>> ^^ the question made me think of that use case.  Though it's unclear how
>>> close it is to what you're thinking about.
>>>
>>> Cheers -
>>>
>>> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <
>>> user@beam.apache.org> wrote:
>>>
 As Jan says, theoretically possible? Sure. That particular set of
 operations? Overkill. If you don't already have it set up, I'd say even
 something like Airflow is overkill here. If all you need to do is "launch
 job and wait" when a file arrives... that's a small script and not
 something that particularly requires a distributed data processing system.

 On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský  wrote:

> Hi,
>
> Apache Beam describes itself as "Apache Beam is an open-source,
> unified programming model for batch and streaming data processing
> pipelines, ...". As such, it is possible to use it to express essentially
> arbitrary logic and run it as a streaming pipeline. A streaming pipeline
> processes input data and produces output data and/or actions. Given these
> assumptions, it is technically feasible to use Apache Beam for
> orchestrating other workflows; the problem is that it will very likely
> not be efficient. Apache Beam does a lot of heavy lifting related to the
> fact that it is designed to process large volumes of data in a scalable
> way, which is probably not what one would need for workflow orchestration.
> So my two cents would be that although it _could_ be done, it probably
> _should not_ be done.
>
> Best,
>
>  Jan
> On 12/15/23 13:39, Mikhail Khludnev wrote:
>
> Hello,
> I think this page
> https://beam.apache.org/documentation/ml/orchestration/ might answer
> your question.
> Frankly speaking, the answer there is: GCP Workflows and Apache Airflow.
> But Beam itself is a data-stream/flow or batch processor, not a
> workflow engine (IMHO).
>
> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 
> wrote:
>
>> I know it is technically possible, but my case may be a little
>> special. Say I have 3 steps in my control flow (ETL workflow):
>> Step 1. upstream file watching
>> Step 2. call some external service to run one job, e.g. run a
>> notebook, run a python script
>> Step 3. notify the downstream workflow
>> Can I use Apache Beam to build a DAG with these 3 nodes and run it as
>> either a Flink or Spark job?  It might be a little weird, but I just want
>> to learn from the community whether this is the right way to use Apache
>> Beam, and whether anyone has done this before.  A sketch of what I have
>> in mind is below. Thanks
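>>
>> Just to make this concrete, the pipeline shape I am imagining is roughly
>> the sketch below.  The file pattern is a placeholder, and ExternalService
>> / Downstream are hypothetical stand-ins for whatever clients I would
>> actually call:
>>
>> // Relies on FileIO and Watch from org.apache.beam.sdk.io /
>> // org.apache.beam.sdk.transforms, plus org.joda.time.Duration.
>> Pipeline p = Pipeline.create();
>> p.apply("Step1_WatchUpstreamFiles",
>>         FileIO.match()
>>             .filepattern("/landing/ready/*")  // placeholder pattern
>>             .continuously(Duration.standardMinutes(1), Watch.Growth.never()))
>>  .apply("Step2_RunExternalJob",
>>         ParDo.of(new DoFn<MatchResult.Metadata, String>() {
>>           @ProcessElement
>>           public void process(@Element MatchResult.Metadata file,
>>               OutputReceiver<String> out) {
>>             // hypothetical client: run a notebook / python script remotely
>>             out.output(ExternalService.run(file.resourceId().toString()));
>>           }
>>         }))
>>  .apply("Step3_NotifyDownstream",
>>         ParDo.of(new DoFn<String, Void>() {
>>           @ProcessElement
>>           public void process(@Element String jobId) {
>>             Downstream.notifyDone(jobId);  // hypothetical notifier
>>           }
>>         }));
>> p.run();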
>>
>>
>>
>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>> user@beam.apache.org> wrote:
>>
>>> It’s technically possible, but the closest thing I can think of would
>>> be triggering on things like file watching.
>>>
>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 
>>> wrote:
>>>
 Not using Beam as a time-based scheduler, but just using it to control
 the execution order of an ETL workflow DAG, because Beam's abstraction is
 also a DAG.
 I know it is a little weird; I just want to confirm with the
 community: has anyone used Beam like this before?



 On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský 
 wrote:

> Hi,
>
> can you give an example of what you mean for 

Re: ETL with Beam?

2019-10-11 Thread Steve973
The real benefit of a good ETL framework is being able to externalize your
extraction and transformation mappings.  If I didn't have to write that
part, that would be really cool!

On Fri, Oct 11, 2019 at 1:28 PM Robert Bradshaw  wrote:

> I would like to call out that Beam itself can be directly used for
> ETL, no extra framework required (not to say that both of these
> frameworks don't provide additional value, e.g. GUI-style construction
> of pipelines).
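>
> For instance, a complete extract-transform-load pipeline is just a few
> lines.  A sketch, where the paths and the transform logic are
> placeholders:
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.TextIO;
> import org.apache.beam.sdk.transforms.MapElements;
> import org.apache.beam.sdk.values.TypeDescriptors;
>
> public class MinimalEtl {
>   public static void main(String[] args) {
>     Pipeline p = Pipeline.create();
>     p.apply("Extract", TextIO.read().from("/data/in/*.csv"))   // placeholder
>      .apply("Transform", MapElements
>          .into(TypeDescriptors.strings())
>          .via((String line) -> line.trim().toUpperCase()))     // placeholder
>      .apply("Load", TextIO.write().to("/data/out/result"));
>     p.run().waitUntilFinish();
>   }
> }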
>
>
> On Fri, Oct 11, 2019 at 9:29 AM Ryan Skraba  wrote:
> >
> > Hello!  Talend has a big data ETL product in the cloud called Pipeline
> > Designer, entirely powered by Beam.  There was a talk at Beam Summit
> > 2018 (https://www.youtube.com/watch?v=1AlEGUtiQek), but unfortunately
> > the live demo wasn't captured in the video.  You can find other videos
> > of Pipeline Designer online to see if it might fit your needs, and
> > there is a free trial!  Depending on how your work project is
> > oriented, it may be of interest.
> >
> > Best regards, Ryan
> >
> > On Fri, Oct 11, 2019 at 12:26 PM Steve973  wrote:
> > >
> > > Thank you for your reply.  I will check it out!  I'm in the evaluation
> phase, especially since I have some time before I have to implement all of
> this.
> > >
> > > On Fri, Oct 11, 2019 at 3:25 AM Dan  wrote:
> > >>
> > >> I'm not sure if this will help, but Kettle runs on Beam too.
> > >>
> > >> https://github.com/mattcasters/kettle-beam
> > >>
> > >> https://youtu.be/vgpGrQJnqkM
> > >>
> > >> It depends on your use case, but Kettle rocks for ETL.
> > >>
> > >> Dan
> > >>
> > >> Sent from my phone
> > >>
> > >> On Thu, 10 Oct 2019, 10:12 pm Steve973,  wrote:
> > >>>
> > >>> Hello, all.  I still have not been given the tasking to convert my
> work project to use Beam, but it is still something that I am looking to do
> in the fairly near future.  Our data workflow consists of ingest and
> transformation, and I was hoping that there are ETL frameworks that work
> well with Beam.  Does anyone have some recommendations and maybe some
> > >>> samples that show how people might use an ETL framework with Beam?
> > >>>
> > >>> Thanks in advance and have a great day!
>


Re: ETL with Beam?

2019-10-11 Thread Steve973
Thank you for your reply.  I will check it out!  I'm in the evaluation
phase, especially since I have some time before I have to implement all of
this.

On Fri, Oct 11, 2019 at 3:25 AM Dan  wrote:

> I'm not sure if this will help, but Kettle runs on Beam too.
>
> https://github.com/mattcasters/kettle-beam
>
> https://youtu.be/vgpGrQJnqkM
>
> It depends on your use case, but Kettle rocks for ETL.
>
> Dan
>
> Sent from my phone
>
> On Thu, 10 Oct 2019, 10:12 pm Steve973,  wrote:
>
>> Hello, all.  I still have not been given the tasking to convert my work
>> project to use Beam, but it is still something that I am looking to do in
>> the fairly near future.  Our data workflow consists of ingest and
>> transformation, and I was hoping that there are ETL frameworks that work
>> well with Beam.  Does anyone have some recommendations and maybe some
>> samples that show how people might use an ETL framework with Beam?
>>
>> Thanks in advance and have a great day!
>>
>


ETL with Beam?

2019-10-10 Thread Steve973
Hello, all.  I still have not been given the tasking to convert my work
project to use Beam, but it is still something that I am looking to do in
the fairly near future.  Our data workflow consists of ingest and
transformation, and I was hoping that there are ETL frameworks that work
well with Beam.  Does anyone have some recommendations and maybe some
samples that show how people might use an ETL framework with Beam?

Thanks in advance and have a great day!


A couple questions from someone new to Beam

2019-09-25 Thread Steve973
Hi, all.  I am still ramping up on my learning of how to use Beam, and I
have a couple of questions for the experts.  And, while I have read the
documentation, I have either looked at the wrong parts, or my particular
questions were not specifically answered.  If I have missed something, then
please point me in the right direction.

   1. When using the MongoDB IO for reading and writing from an execution
   node, does it need to take the time, each time an executor runs, to set up
   the connection to Mongo?  Or does Beam cache the connections and reuse them
   to mitigate the performance hit of setting up the connection each time?  If
   so, I am curious how it handles that for multiple nodes, unless Beam is
   "smart" enough to pre-cache connections in a pool on execution nodes in
   advance.  (A sketch of the caching pattern I am imagining is below the
   list.)
   2. When something is executed in parallel (ParDo), do the parallel jobs
   run in one thread on an execution node?  Or, will Beam utilize more
   resources/threads, as available, on a node?  I would like to utilize as
   many threads as possible on available cluster nodes.  My thought is that,
   if a job is stateless, it seems reasonable to be able to utilize multiple
   threads on a node to further parallelize and maximize performance.
   Although, it also occurs to me that this would probably depend on the
   runner implementation.  The other approach that I can see
   is to simply use CompletableFutures in my jobs, which is what I am already
   doing in my code that does not (yet) use Beam. But it would be preferable
   to allow Beam to manage all of the parallelization.
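
Regarding question 1, if Beam does not cache connections automatically, I am
assuming the usual workaround is per-instance caching inside the DoFn,
something like this sketch (MyRecord and the URI are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: open one Mongo connection per DoFn instance in @Setup so it is
// reused across bundles, instead of reconnecting for every element.
static class MongoWriteFn extends DoFn<MyRecord, Void> {
  private transient MongoClient client;

  @Setup
  public void setup() {
    client = MongoClients.create("mongodb://localhost:27017");  // placeholder
  }

  @ProcessElement
  public void process(@Element MyRecord record) {
    // placeholder: upsert the record using the cached client
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}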

I am sure that I will have some more questions as time goes on, but this
would be great info to have for now.

Thanks,
Steve


Re: How can I work with multiple pcollections?

2019-09-16 Thread Steve973
Lukasz,

It has been a few days since your reply, but I wanted to thank you for
pointing me toward the "additional outputs" portion of the documentation.
I had already read through that (if not completely thoroughly), but at the
time I did not quite know the requirements of what I would be doing, so I
did not really remember that part.  I have some more work to do on my
code before I can begin to use Beam (to make it much better!) but I think
this should help quite a bit.

Thanks again!
Steve

On Thu, Sep 12, 2019 at 5:31 PM Lukasz Cwik  wrote:

> Yes you can create multiple output PCollections using a ParDo with
> multiple outputs instead of inserting them into Mongo.
>
> It could be useful to read through the programming guide related to
> PCollections[1] and PTransforms with multiple outputs[2] and feel free to
> return with more questions.
>
> 1: https://beam.apache.org/documentation/programming-guide/#pcollections
> 2:
> https://beam.apache.org/documentation/programming-guide/#additional-outputs
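>
> The shape would be roughly the sketch below, where FileRecord, Document,
> and the extract* helpers are placeholders for your own types and logic:
>
> // One TupleTag per output PCollection.
> final TupleTag<Document> mainType = new TupleTag<Document>() {};
> final TupleTag<Document> refTypeA = new TupleTag<Document>() {};
> final TupleTag<Document> refTypeB = new TupleTag<Document>() {};
>
> // 'files' is the input PCollection<FileRecord> from the earlier read.
> PCollectionTuple results = files.apply(
>     ParDo.of(new DoFn<FileRecord, Document>() {
>       @ProcessElement
>       public void process(@Element FileRecord rec, MultiOutputReceiver out) {
>         out.get(mainType).output(extractMain(rec));  // placeholder logic
>         out.get(refTypeA).output(extractA(rec));
>         out.get(refTypeB).output(extractB(rec));
>       }
>     }).withOutputTags(mainType, TupleTagList.of(refTypeA).and(refTypeB)));
>
> // Each tagged output is its own PCollection for the rest of the pipeline.
> PCollection<Document> mains = results.get(mainType);
> PCollection<Document> refsA = results.get(refTypeA);
> PCollection<Document> refsB = results.get(refTypeB);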
>
> On Thu, Sep 12, 2019 at 2:24 PM Steve973  wrote:
>
>> I am new to Beam, and I am pretty excited to get started.  I have been
>> doing quite a bit of research and playing around with the API.  But my
>> use case, unless I am approaching it incorrectly, suggests that I will
>> need to process multiple PCollections in some parts of my pipeline.
>>
>> I am working out some of my business logic without a parallelization
>> framework to get the solution working.  Then I will convert the workflow to
>> Beam.  What I am doing is reading millions of files from the file system,
>> and I am processing parts of the file into three different output types,
>> and storing them in MongoDB in three collections.  After this initial
>> extraction (mapping), I modify some of the data which will result in
>> duplicates.  So the next step is a reduction step to eliminate the
>> duplicates (based on a number of fields) and aggregate the references to
>> the other 2 data types, so the reduced object contains the dedupe fields,
>> and a list of references to documents in the other 2 collections.  I'm not
>> touching either of these two collections at this time, but this is where my
>> question comes in.  If I map this data, can I create three separate
>> PCollections instead of inserting them into Mongo?  After the
>> deduplication, I will need to combine data in two of the streams, and I
>> need to store the results of that combination into mongo.  Then I need to
>> process the third collection, which will go into its own mongo collection.
>>
>> I hope my description was at least enough to get the conversation
>> started.  Is my approach reasonable, and can I create multiple PCollections
>> and use them at different phases of my pipeline?  Or is there another way
>> that I should be looking at this?
>>
>> Thanks in advance!
>> Steve
>>
>


How can I work with multiple pcollections?

2019-09-12 Thread Steve973
I am new to Beam, and I am pretty excited to get started.  I have been
doing quite a bit of research and playing around with the API.  But my
use case, unless I am approaching it incorrectly, suggests that I will
need to process multiple PCollections in some parts of my pipeline.

I am working out some of my business logic without a parallelization
framework to get the solution working.  Then I will convert the workflow to
Beam.  What I am doing is reading millions of files from the file system,
and I am processing parts of the file into three different output types,
and storing them in MongoDB in three collections.  After this initial
extraction (mapping), I modify some of the data which will result in
duplicates.  So the next step is a reduction step to eliminate the
duplicates (based on a number of fields) and aggregate the references to
the other 2 data types, so the reduced object contains the dedupe fields,
and a list of references to documents in the other 2 collections.  I'm not
touching either of these two collections at this time, but this is where my
question comes in.  If I map this data, can I create three separate
PCollections instead of inserting them into Mongo?  After the
deduplication, I will need to combine data in two of the streams, and I
need to store the results of that combination into mongo.  Then I need to
process the third collection, which will go into its own mongo collection.

I hope my description was at least enough to get the conversation started.
Is my approach reasonable, and can I create multiple PCollections and use
them at different phases of my pipeline?  Or is there another way that I
should be looking at this?

Thanks in advance!
Steve


Using RedisIO for hashes and selecting a database

2019-08-24 Thread Steve973
I am new to Beam and I am familiarizing myself with various aspects of this
cool framework.  I need to be able to get key/value pairs from a Redis
hash, and not just the key/value store in the default database (database
0).  I have only seen the ability to get key/value pairs that have been set
in Redis, but the source code does not make it seem like I can look up
keys/values in a hash, and I have not seen a way to select a database
number.  Have I missed something, or is this feature not (yet)
implemented?  If I want to use this, am I going to have to extend the
RedisIO class?
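
In the meantime, I am assuming I could work around both limitations with a
plain DoFn that uses Jedis (the client RedisIO itself wraps) directly.  A
sketch, where the endpoint, database number, and incoming hash keys are all
placeholders:

import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import redis.clients.jedis.Jedis;

// Sketch: for each incoming hash key, emit that hash's field/value pairs,
// reading from a non-default Redis database.
static class ReadHashFn extends DoFn<String, KV<String, String>> {
  private transient Jedis jedis;

  @Setup
  public void setup() {
    jedis = new Jedis("localhost", 6379);  // placeholder endpoint
    jedis.select(3);                       // placeholder database number
  }

  @ProcessElement
  public void process(@Element String hashKey,
      OutputReceiver<KV<String, String>> out) {
    for (Map.Entry<String, String> field : jedis.hgetAll(hashKey).entrySet()) {
      out.output(KV.of(field.getKey(), field.getValue()));
    }
  }

  @Teardown
  public void teardown() {
    if (jedis != null) {
      jedis.close();
    }
  }
}

Thanks in advance!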