Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Shree Tanna Wed, 20 Jul 2022 11:53:02 -0700

Thank you!
I will try this out.
One more question on this, is it considered anti-pattern to do HTTP
ingestion on GCP Dataflow due to the reasoning I mentioned in my original
message? I ask because I am getting that indication from some of my
co-workers and also from google cloud support. Not sure if this is the
right place to ask this question. Happy to move this conversation to
somewhere else if not.


On Tue, Jul 19, 2022 at 5:18 PM Luke Cwik via user <[email protected]>
wrote:

> Even if you don't have the resource ids ahead of time, you can have a
> pipeline like:
> Impulse -> ParDo(GenerateResourceIds) -> Reshuffle ->
> ParDo(ReadResourceIds) -> ...
>
> You could also compose these as splittable DoFns [1, 2, 3]:
> ParDo(SplittableGenerateResourceIds) -> ParDo(SplittableReadResourceIds)
>
> The first approach is the simplest as the reshuffle will rebalance the
> reading of each resource id across worker nodes but is limited in
> generating resource ids on one worker. Making the generation a splittable
> DoFn will mean that you can increase the parallelism of generation which is
> important if there are so many that it could crash a worker or fail to have
> the output committed (these kinds of failures are runner dependent on how
> well they handle single bundles with large outputs). Making the reading
> splittable allows you to handle a large resource (imagine a large file) so
> that it can be read and processed in parallel (and will have similar
> failures if the runner can't handle single bundles with large outputs).
>
> You can always start with the first solution and swap either piece to be a
> splittable DoFn depending on your performance requirements and how well the
> simple solution works.
>
> 1: https://beam.apache.org/blog/splittable-do-fn/
> 2: https://beam.apache.org/blog/splittable-do-fn-is-available/
> 3:
> https://beam.apache.org/documentation/programming-guide/#splittable-dofns
>
>
> On Tue, Jul 19, 2022 at 10:05 AM Damian Akpan <[email protected]>
> wrote:
>
>> Provided you have all the resources ids ahead of fetching, Beam will
>> spread the fetches to its workers. It will still fetch synchronously but
>> within that worker.
>>
>> On Tue, Jul 19, 2022 at 5:40 PM Shree Tanna <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm planning to use Apache beam to extract and load part of the ETL
>>> pipeline and run the jobs on Dataflow. I will have to do the REST API
>>> ingestion on our platform. I can opt to make sync API calls from DoFn. With
>>> that pipelines will stall while REST requests are made over the network.
>>>
>>> Is it best practice to run the REST ingestion job on Dataflow? Is there
>>> any best practice I can follow to accomplish this? Just as a reference I'm
>>> adding this
>>> <https://stackoverflow.com/questions/50335521/best-practices-in-http-calls-in-cloud-dataflow-java>
>>> StackOverflow thread here too. Also, I notice that Rest I/O transform
>>> <https://beam.apache.org/documentation/io/built-in/> built-in connector
>>> is in progress for Java.
>>>
>>> Let me know if this is the right group to ask this question. I can also
>>> ask [email protected] if needed.
>>> --
>>> Thanks,
>>> Shree
>>>
>>

-- 
Best,
Shree

Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Reply via email to