Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Luke Cwik via user Tue, 19 Jul 2022 14:18:42 -0700

Even if you don't have the resource ids ahead of time, you can have a
pipeline like:
Impulse -> ParDo(GenerateResourceIds) -> Reshuffle ->
ParDo(ReadResourceIds) -> ...

You could also compose these as splittable DoFns [1, 2, 3]:
ParDo(SplittableGenerateResourceIds) -> ParDo(SplittableReadResourceIds)

The first approach is the simplest as the reshuffle will rebalance the
reading of each resource id across worker nodes but is limited in
generating resource ids on one worker. Making the generation a splittable
DoFn will mean that you can increase the parallelism of generation which is
important if there are so many that it could crash a worker or fail to have
the output committed (these kinds of failures are runner dependent on how
well they handle single bundles with large outputs). Making the reading
splittable allows you to handle a large resource (imagine a large file) so
that it can be read and processed in parallel (and will have similar
failures if the runner can't handle single bundles with large outputs).

You can always start with the first solution and swap either piece to be a
splittable DoFn depending on your performance requirements and how well the
simple solution works.

1: https://beam.apache.org/blog/splittable-do-fn/
2: https://beam.apache.org/blog/splittable-do-fn-is-available/
3: https://beam.apache.org/documentation/programming-guide/#splittable-dofns

On Tue, Jul 19, 2022 at 10:05 AM Damian Akpan <damianakpan2...@gmail.com>
wrote:

> Provided you have all the resources ids ahead of fetching, Beam will
> spread the fetches to its workers. It will still fetch synchronously but
> within that worker.
>
> On Tue, Jul 19, 2022 at 5:40 PM Shree Tanna <shree.ta...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm planning to use Apache beam to extract and load part of the ETL
>> pipeline and run the jobs on Dataflow. I will have to do the REST API
>> ingestion on our platform. I can opt to make sync API calls from DoFn. With
>> that pipelines will stall while REST requests are made over the network.
>>
>> Is it best practice to run the REST ingestion job on Dataflow? Is there
>> any best practice I can follow to accomplish this? Just as a reference I'm
>> adding this
>> <https://stackoverflow.com/questions/50335521/best-practices-in-http-calls-in-cloud-dataflow-java>
>> StackOverflow thread here too. Also, I notice that Rest I/O transform
>> <https://beam.apache.org/documentation/io/built-in/> built-in connector
>> is in progress for Java.
>>
>> Let me know if this is the right group to ask this question. I can also
>> ask d...@beam.apache.org if needed.
>> --
>> Thanks,
>> Shree
>>
>

Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Reply via email to