Re: Can apache beam be used for control flow (ETL workflow)
Wouldn't Apache Camel be more appropriate for the orchestration aspect? Then delegate to Beam for processing?

On Sun, Dec 24, 2023, 8:47 AM data_nerd_666 wrote:
> Thanks Austin & Chad, but my use case is to use Beam to do ETL workflow control, which seems different from your case. I would like to check whether anyone has used Beam for this kind of use case and whether Beam is a good choice.
>
> On Sat, Dec 23, 2023 at 12:58 AM Chad Dombrova wrote:
>
>> Hi,
>> I'm the guy who gave the Movie Magic talk. Since it's possible to write stateful transforms with Beam, it is capable of some very sophisticated flow control. I've not seen a Python framework that combines this with streaming data nearly as well. That said, there aren't a lot of great working examples out there for transforms that do sophisticated flow control, and I feel like we're always wrestling with differences in behavior between the direct runner and Dataflow. There was a thread about polling patterns [1] on this list that never really got a satisfying resolution. Likewise, there was a thread about using an SDF with an unbounded source [2] that also didn't get fully resolved.
>>
>> [1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
>> [2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb
>>
>> On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett wrote:
>>
>>> https://beamsummit.org/sessions/event-driven-movie-magic/
>>>
>>> ^^ The question made me think of that use case. Though, it's unclear how close it is to what you're thinking about.
>>>
>>> Cheers -
>>>
>>> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <user@beam.apache.org> wrote:
>>>
>>> As Jan says, theoretically possible? Sure. That particular set of operations? Overkill. If you don't have it already set up, I'd say even something like Airflow is overkill here. If all you need to do is "launch job and wait" when a file arrives, that's a small script and not something that particularly requires a distributed data processing system.

On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský wrote:
> Hi,
>
> Apache Beam describes itself as "Apache Beam is an open-source, unified programming model for batch and streaming data processing pipelines, ...". As such, it is possible to use it to express essentially arbitrary logic and run it as a streaming pipeline. A streaming pipeline processes input data and produces output data and/or actions. Given these assumptions, it is technically feasible to use Apache Beam for orchestrating other workflows; the problem is that it will very likely not be efficient. Apache Beam does a lot of heavy lifting related to the fact that it is designed to process large volumes of data in a scalable way, which is probably not what one would need for workflow orchestration. So my two cents would be that although it _could_ be done, it probably _should not_ be done.
>
> Best,
>
> Jan
>
> On 12/15/23 13:39, Mikhail Khludnev wrote:
>
> Hello,
> I think this page https://beam.apache.org/documentation/ml/orchestration/ might answer your question.
> Frankly speaking: GCP Workflows and Apache Airflow. But Beam itself is a data-stream/flow or batch processor; not a workflow engine (IMHO).
>
> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 wrote:
>
>> I know it is technically possible, but my case may be a little special. Say I have 3 steps for my control flow (ETL workflow):
>> Step 1. Upstream file watching
>> Step 2. Call some external service to run one job, e.g. run a notebook or run a Python script
>> Step 3. Notify the downstream workflow
>> Can I use Apache Beam to build a DAG with 3 nodes and run it as either a Flink or Spark job? It might be a little weird, but I just want to learn from the community whether this is the right way to use Apache Beam, and has anyone done this before?
>> Thanks
>>
>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <user@beam.apache.org> wrote:
>>
>>> It's technically possible, but the closest thing I can think of would be triggering things based on things like file watching.
>>>
>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 wrote:
>>>
>>> Not using Beam as a time-based scheduler, but just using it to control the execution order of an ETL workflow DAG, because Beam's abstraction is also a DAG. I know it is a little weird; I just want to confirm with the community: has anyone used Beam like this before?
>>>
>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský wrote:
>>> > Hi,
>>> >
>>> > can you give an example of what you mean for
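Byron's "small script" suggestion for the three-step flow (watch for an upstream file, run a job, notify downstream) can be sketched with just the Python standard library. This is only an illustration of the pattern, not anything Beam-specific; the polling interval, glob pattern, and the `run_job`/`notify` callbacks are all placeholders:

```python
import time
from pathlib import Path

def watch_and_run(watch_dir, pattern, run_job, notify, poll_seconds=30):
    """Step 1: poll watch_dir until a file matching pattern appears.
    Step 2: run the job and wait for it. Step 3: notify downstream."""
    watch = Path(watch_dir)
    while True:
        matches = sorted(watch.glob(pattern))
        if matches:
            break
        time.sleep(poll_seconds)
    exit_code = run_job(matches[0])   # e.g. subprocess.run([...]).returncode
    notify(matches[0], exit_code)     # e.g. an HTTP POST, or touching a marker file
    return exit_code
```

Here `run_job` might wrap something like `subprocess.run(["python", "job.py", str(path)]).returncode`, and `notify` could post to whatever the downstream workflow listens on; neither needs a distributed runner.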
Re: ETL with Beam?
The real benefit of a good ETL framework is being able to externalize your extraction and transformation mappings. If I didn't have to write that part, that would be really cool!

On Fri, Oct 11, 2019 at 1:28 PM Robert Bradshaw wrote:
> I would like to call out that Beam itself can be directly used for ETL, no extra framework required (not to say that both of these frameworks don't provide additional value, e.g. GUI-style construction of pipelines).
>
> On Fri, Oct 11, 2019 at 9:29 AM Ryan Skraba wrote:
> >
> > Hello! Talend has a big data ETL product in the cloud called Pipeline Designer, entirely powered by Beam. There was a talk at Beam Summit 2018 (https://www.youtube.com/watch?v=1AlEGUtiQek), but unfortunately the live demo wasn't captured in the video. You can find other videos of Pipeline Designer online to see if it might fit your needs, and there is a free trial! Depending on how your work project is oriented, it may be of interest.
> >
> > Best regards, Ryan
> >
> > On Fri, Oct 11, 2019 at 12:26 PM Steve973 wrote:
> > >
> > > Thank you for your reply. I will check it out! I'm in the evaluation phase, especially since I have some time before I have to implement all of this.
> > >
> > > On Fri, Oct 11, 2019 at 3:25 AM Dan wrote:
> > >>
> > >> I'm not sure if this will help but Kettle runs on Beam too.
> > >>
> > >> https://github.com/mattcasters/kettle-beam
> > >>
> > >> https://youtu.be/vgpGrQJnqkM
> > >>
> > >> Depends on your use case, but Kettle rocks for ETL.
> > >>
> > >> Dan
> > >>
> > >> Sent from my phone
> > >>
> > >> On Thu, 10 Oct 2019, 10:12 pm Steve973 wrote:
> > >>>
> > >>> Hello, all. I still have not been given the tasking to convert my work project to use Beam, but it is still something that I am looking to do in the fairly near future. Our data workflow consists of ingest and transformation, and I was hoping that there are ETL frameworks that work well with Beam. Does anyone have some recommendations and maybe some samples that show how people might use an ETL framework with Beam?
> > >>>
> > >>> Thanks in advance and have a great day!
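The "externalized mappings" idea in the first message can be sketched without any ETL product: drive the transform step from a declarative mapping that could be loaded from configuration rather than hard-coded. Everything here (the mapping format, field names, converters) is a hypothetical illustration, not any particular framework's API:

```python
def apply_mapping(record, mapping):
    """Build an output record from a declarative mapping: each output field
    names a source field and an optional converter function."""
    out = {}
    for target, spec in mapping.items():
        value = record.get(spec["source"])
        convert = spec.get("convert")
        out[target] = convert(value) if convert else value
    return out

# Example mapping that could live in external configuration
# instead of in code (hypothetical field names).
MAPPING = {
    "user_id": {"source": "id"},
    "amount_cents": {"source": "amount", "convert": lambda v: int(round(float(v) * 100))},
}
```

In a Beam pipeline this would be the body of a `Map`/`ParDo` step, with only the mapping itself changing between jobs.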
Re: ETL with Beam?
Thank you for your reply. I will check it out! I'm in the evaluation phase, especially since I have some time before I have to implement all of this.

On Fri, Oct 11, 2019 at 3:25 AM Dan wrote:
> I'm not sure if this will help but Kettle runs on Beam too.
>
> https://github.com/mattcasters/kettle-beam
>
> https://youtu.be/vgpGrQJnqkM
>
> Depends on your use case, but Kettle rocks for ETL.
>
> Dan
>
> Sent from my phone
>
> On Thu, 10 Oct 2019, 10:12 pm Steve973 wrote:
>
>> Hello, all. I still have not been given the tasking to convert my work project to use Beam, but it is still something that I am looking to do in the fairly near future. Our data workflow consists of ingest and transformation, and I was hoping that there are ETL frameworks that work well with Beam. Does anyone have some recommendations and maybe some samples that show how people might use an ETL framework with Beam?
>>
>> Thanks in advance and have a great day!
ETL with Beam?
Hello, all. I still have not been given the tasking to convert my work project to use Beam, but it is still something that I am looking to do in the fairly near future. Our data workflow consists of ingest and transformation, and I was hoping that there are ETL frameworks that work well with Beam. Does anyone have some recommendations and maybe some samples that show how people might use an ETL framework with Beam? Thanks in advance and have a great day!
A couple questions from someone new to Beam
Hi, all. I am still ramping up on my learning of how to use Beam, and I have a couple of questions for the experts. While I have read the documentation, I have either looked at the wrong parts, or my particular questions were not specifically answered. If I have missed something, then please point me in the right direction.

1. When using the MongoDB connector for reading and writing from an execution node, does it need to take the time, each time an executor runs, to set up the connection to Mongo? Or does Beam cache the connections and reuse them to mitigate the performance hit of setting up the connection each time? If so, I am curious how it handles that for multiple nodes, unless Beam is "smart" enough to pre-cache connections in a pool on execution nodes in advance.

2. When something is executed in parallel (ParDo), do the parallel jobs run in one thread on an execution node? Or will Beam utilize more resources/threads, as available, on a node? I would like to utilize as many threads as possible on available cluster nodes. My thought is that, if a job is stateless, it seems reasonable to utilize multiple threads on a node to further parallelize and maximize performance. Although, it also occurs to me that this is probably implementation-dependent on the runner. The other approach that I can see is to simply use CompletableFutures in my jobs, which is what I am already doing in my code that does not (yet) use Beam. But it would be preferable to allow Beam to manage all of the parallelization.

I am sure that I will have some more questions as time goes on, but this would be great info to have for now.

Thanks,
Steve
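For question 1, the usual answer is the DoFn lifecycle: the Python SDK's `DoFn.setup()`/`teardown()` (Java's `@Setup`/`@Teardown`) run once per DoFn instance on a worker, so a connection opened in `setup()` is reused across many elements; pooling beyond that is up to the client library, and threading (question 2) is runner-dependent. Below is a pure-Python model of that lifecycle, not runnable Beam code; the client factory stands in for a real Mongo client:

```python
class WriteToMongoFn:
    """Models Beam's DoFn lifecycle: setup() runs once per DoFn instance
    on a worker, process() runs once per element, teardown() when the
    instance is discarded. An expensive connection thus amortizes over
    many elements instead of being rebuilt per element."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = None

    def setup(self):
        self._client = self._client_factory()  # connect once, then reuse

    def process(self, element):
        self._client.insert(element)           # per-element work only

    def teardown(self):
        self._client.close()

def run_bundle(dofn, elements):
    """Simplified runner behavior: one setup, many process calls, one
    teardown. A real runner keeps the instance alive across bundles and
    calls teardown only when discarding it."""
    dofn.setup()
    for element in elements:
        dofn.process(element)
    dofn.teardown()
```

With many DoFn instances across workers you get many connections (one per instance), which is why client libraries' own pooling still matters.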
Re: How can I work with multiple pcollections?
Lukasz,

It has been a few days since your reply, but I wanted to thank you for pointing me toward the "additional outputs" portion of the documentation. I had already read through that (if not completely thoroughly), although at the time I did not quite know the requirements of what I would be doing, so I did not really remember that part. I have some more work to do on my code before I can begin to use Beam (to make it much better!), but I think this should help quite a bit.

Thanks again!
Steve

On Thu, Sep 12, 2019 at 5:31 PM Lukasz Cwik wrote:
> Yes, you can create multiple output PCollections using a ParDo with multiple outputs instead of inserting them into Mongo.
>
> It could be useful to read through the programming guide related to PCollections [1] and PTransforms with multiple outputs [2], and feel free to return with more questions.
>
> 1: https://beam.apache.org/documentation/programming-guide/#pcollections
> 2: https://beam.apache.org/documentation/programming-guide/#additional-outputs
>
> On Thu, Sep 12, 2019 at 2:24 PM Steve973 wrote:
>
>> I am new to Beam, and I am pretty excited to get started. I have been doing quite a bit of research and playing around with the API. But my use case, unless I am approaching it incorrectly, suggests that I will need to process multiple PCollections in some parts of my pipeline.
>>
>> I am working out some of my business logic without a parallelization framework to get the solution working. Then I will convert the workflow to Beam. What I am doing is reading millions of files from the file system, and I am processing parts of each file into three different output types and storing them in MongoDB in three collections. After this initial extraction (mapping), I modify some of the data, which will result in duplicates. So the next step is a reduction step to eliminate the duplicates (based on a number of fields) and aggregate the references to the other 2 data types, so the reduced object contains the dedupe fields and a list of references to documents in the other 2 collections. I'm not touching either of these two collections at this time, but this is where my question comes in. If I map this data, can I create three separate PCollections instead of inserting them into Mongo? After the deduplication, I will need to combine data in two of the streams, and I need to store the results of that combination into Mongo. Then I need to process the third collection, which will go into its own Mongo collection.
>>
>> I hope my description was at least enough to get the conversation started. Is my approach reasonable, and can I create multiple PCollections and use them at different phases of my pipeline? Or is there another way that I should be looking at this?
>>
>> Thanks in advance!
>> Steve
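Lukasz's pointer boils down to one ParDo emitting to several tagged output streams instead of doing three Mongo writes. In the Python SDK this is `with_outputs()` plus `pvalue.TaggedOutput`; the routing idea itself can be shown framework-free (the tags and record shapes here are hypothetical):

```python
def split_by_type(records, classify):
    """Route each record into one of several tagged outputs, the way a
    multi-output ParDo yields several PCollections from one pass over
    the input."""
    outputs = {}
    for record in records:
        tag = classify(record)
        outputs.setdefault(tag, []).append(record)
    return outputs
```

Each tagged list then corresponds to a PCollection that later pipeline stages (deduplication, combining two of the streams) can consume independently.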
How can I work with multiple pcollections?
I am new to Beam, and I am pretty excited to get started. I have been doing quite a bit of research and playing around with the API. But my use case, unless I am approaching it incorrectly, suggests that I will need to process multiple PCollections in some parts of my pipeline.

I am working out some of my business logic without a parallelization framework to get the solution working. Then I will convert the workflow to Beam. What I am doing is reading millions of files from the file system, and I am processing parts of each file into three different output types and storing them in MongoDB in three collections. After this initial extraction (mapping), I modify some of the data, which will result in duplicates. So the next step is a reduction step to eliminate the duplicates (based on a number of fields) and aggregate the references to the other 2 data types, so the reduced object contains the dedupe fields and a list of references to documents in the other 2 collections. I'm not touching either of these two collections at this time, but this is where my question comes in. If I map this data, can I create three separate PCollections instead of inserting them into Mongo? After the deduplication, I will need to combine data in two of the streams, and I need to store the results of that combination into Mongo. Then I need to process the third collection, which will go into its own Mongo collection.

I hope my description was at least enough to get the conversation started. Is my approach reasonable, and can I create multiple PCollections and use them at different phases of my pipeline? Or is there another way that I should be looking at this?

Thanks in advance!
Steve
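The reduction step described above is essentially a group-by-key: the dedupe fields form the key, and each group's references to the other two collections get merged. A minimal framework-free sketch, with hypothetical field names (`refs` standing in for the list of document references):

```python
def dedupe_and_collect(records, key_fields):
    """Group records by the dedupe fields and merge their 'refs' lists,
    mirroring what a GroupByKey/Combine step does in a pipeline."""
    merged = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        entry = merged.setdefault(key, {**{f: rec[f] for f in key_fields}, "refs": []})
        entry["refs"].extend(rec.get("refs", []))
    return list(merged.values())
```

In Beam this would be expressed as mapping each record to `(key, refs)` and applying a GroupByKey or a CombinePerKey, rather than an in-memory dict.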
Using RedisIO for hashes and selecting a database
I am new to Beam and I am familiarizing myself with various aspects of this cool framework. I need to be able to get key/value pairs from a Redis hash, not just from the plain key/value store in the default database (database 0). I have only seen the ability to get key/value pairs that have been set in Redis, but the source code does not make it seem like I can look up fields/values in a hash, and I have not seen a way to select a database number. Have I missed something, or is this feature not (yet) implemented? If I want to use this, am I going to have to extend the RedisIO class?

Thanks in advance!
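If RedisIO turns out not to cover hashes, a common workaround is a custom ParDo that issues HGETALL directly rather than extending RedisIO. A framework-free sketch of the per-element logic: `client` is any object exposing `hgetall()` (redis-py's client does), the key names are hypothetical, and database selection would typically happen when the client is created (e.g. `redis.Redis(db=3)` or a `redis://host:6379/3` URI), since Redis SELECT is per connection:

```python
def read_hashes(client, hash_keys):
    """Yield each Redis hash as a (key, {field: value}) pair -- the kind
    of per-element lookup one would put in a custom DoFn if the built-in
    IO doesn't expose hash reads."""
    for key in hash_keys:
        yield key, client.hgetall(key)
```

Inside a pipeline, the client would be created once per DoFn instance (in a setup hook) so each element only pays for the HGETALL round trip.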