I don't think it's an anti-pattern per se. You can implement arbitrary operations in a DoFn or an SDF to read data.
But if a single resource ID maps to a large amount of data, Beam runners
(including Dataflow) will not be able to parallelize reading, hence your
solution may have suboptimal performance compared to reading from a Beam
source that can be fully parallelized.

Thanks,
Cham

On Wed, Jul 20, 2022 at 11:53 AM Shree Tanna <shree.ta...@gmail.com> wrote:

> Thank you! I will try this out.
>
> One more question on this: is it considered an anti-pattern to do HTTP
> ingestion on GCP Dataflow, for the reasons I mentioned in my original
> message? I ask because I am getting that indication from some of my
> co-workers and also from Google Cloud support. Not sure if this is the
> right place to ask this question; happy to move this conversation
> somewhere else if not.
>
> On Tue, Jul 19, 2022 at 5:18 PM Luke Cwik via user <user@beam.apache.org>
> wrote:
>
>> Even if you don't have the resource ids ahead of time, you can have a
>> pipeline like:
>>
>>   Impulse -> ParDo(GenerateResourceIds) -> Reshuffle ->
>>   ParDo(ReadResourceIds) -> ...
>>
>> You could also compose these as splittable DoFns [1, 2, 3]:
>>
>>   ParDo(SplittableGenerateResourceIds) -> ParDo(SplittableReadResourceIds)
>>
>> The first approach is the simplest: the reshuffle rebalances the reading
>> of each resource id across worker nodes, but generation of the ids is
>> limited to one worker. Making the generation a splittable DoFn means you
>> can increase the parallelism of generation, which matters if there are
>> so many ids that they could crash a worker or fail to have the output
>> committed (these kinds of failures are runner dependent, based on how
>> well the runner handles single bundles with large outputs). Making the
>> reading splittable allows you to handle a large resource (imagine a
>> large file) so that it can be read and processed in parallel (with
>> similar failures if the runner can't handle single bundles with large
>> outputs).
>>
>> You can always start with the first solution and swap either piece to a
>> splittable DoFn depending on your performance requirements and how well
>> the simple solution works.
>>
>> 1: https://beam.apache.org/blog/splittable-do-fn/
>> 2: https://beam.apache.org/blog/splittable-do-fn-is-available/
>> 3: https://beam.apache.org/documentation/programming-guide/#splittable-dofns
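A minimal sketch of that first approach in the Java SDK, assuming the
service exposes a listing call and a fetch call (listResourceIds and
fetchResource below are hypothetical placeholders for those):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.Impulse;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;

    public class HttpIngestPipeline {

      // Stage 1: discover resource ids. Impulse emits a single element,
      // so this DoFn runs once, on one worker.
      static class GenerateResourceIds extends DoFn<byte[], String> {
        @ProcessElement
        public void process(OutputReceiver<String> out) {
          for (String id : listResourceIds()) {
            out.output(id);
          }
        }
      }

      // Stage 2: fetch each resource. After the Reshuffle, the ids are
      // rebalanced so these synchronous fetches run on many workers.
      static class ReadResourceIds extends DoFn<String, String> {
        @ProcessElement
        public void process(@Element String id, OutputReceiver<String> out) {
          out.output(fetchResource(id));
        }
      }

      private static List<String> listResourceIds() {
        return Arrays.asList("id-1", "id-2"); // placeholder for a real listing call
      }

      private static String fetchResource(String id) {
        return "payload-for-" + id; // placeholder for a real sync HTTP GET
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(Impulse.create())
            .apply(ParDo.of(new GenerateResourceIds()))
            .apply(Reshuffle.viaRandomKey())
            .apply(ParDo.of(new ReadResourceIds()));
        p.run().waitUntilFinish();
      }
    }

The Reshuffle.viaRandomKey() step is what lets the fetches fan out:
without it, a runner may fuse generation and reading onto the same worker.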
>> On Tue, Jul 19, 2022 at 10:05 AM Damian Akpan <damianakpan2...@gmail.com>
>> wrote:
>>
>>> Provided you have all the resource ids ahead of fetching, Beam will
>>> spread the fetches across its workers. It will still fetch
>>> synchronously, but within each worker.
>>>
>>> On Tue, Jul 19, 2022 at 5:40 PM Shree Tanna <shree.ta...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm planning to use Apache Beam for the extract and load parts of our
>>>> ETL pipeline and to run the jobs on Dataflow. I will have to do REST
>>>> API ingestion on our platform. I can opt to make sync API calls from a
>>>> DoFn, but then the pipeline will stall while REST requests are made
>>>> over the network.
>>>>
>>>> Is it best practice to run a REST ingestion job on Dataflow? Is there
>>>> any best practice I can follow to accomplish this? Just as a
>>>> reference, I'm adding this StackOverflow thread
>>>> <https://stackoverflow.com/questions/50335521/best-practices-in-http-calls-in-cloud-dataflow-java>
>>>> here too. Also, I notice that the REST I/O transform
>>>> <https://beam.apache.org/documentation/io/built-in/> built-in
>>>> connector is in progress for Java.
>>>>
>>>> Let me know if this is the right group to ask this question. I can
>>>> also ask d...@beam.apache.org if needed.
>>>>
>>>> --
>>>> Thanks,
>>>> Shree
>
> --
> Best,
> Shree
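For the splittable-DoFn variant of the read that Luke describes, a rough
skeleton, assuming the resource can be fetched in numbered pages
(resourcePageCount and fetchPage are hypothetical helpers):

    import org.apache.beam.sdk.io.range.OffsetRange;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

    // Reads one resource in parallel chunks. The restriction is a page
    // range; the runner may split it so that different workers fetch
    // different pages of the same resource.
    class SplittableReadResourceIds extends DoFn<String, String> {

      @GetInitialRestriction
      public OffsetRange initialRestriction(@Element String resourceId) {
        // Assumption: a cheap metadata call reports the page count.
        return new OffsetRange(0, resourcePageCount(resourceId));
      }

      @ProcessElement
      public void process(
          @Element String resourceId,
          RestrictionTracker<OffsetRange, Long> tracker,
          OutputReceiver<String> out) {
        for (long page = tracker.currentRestriction().getFrom();
            tracker.tryClaim(page);
            page++) {
          out.output(fetchPage(resourceId, page)); // paged HTTP GET
        }
      }

      private static long resourcePageCount(String resourceId) {
        return 10; // placeholder for a real metadata request
      }

      private static String fetchPage(String resourceId, long page) {
        return resourceId + "-page-" + page; // placeholder for a real fetch
      }
    }

OffsetRange ships with a default tracker, so no @NewTracker method is
needed here; this class would drop in for ReadResourceIds in the sketch
above once the simple version becomes a bottleneck.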