I don't think it's an antipattern per se. You can implement arbitrary
operations in a DoFn or an SDF to read data.

But if a single resource ID maps to a large amount of data, Beam runners
(including Dataflow) will not be able to parallelize reading within that
single ID, so your solution may have suboptimal performance compared to
reading from a Beam source that can be fully parallelized.

Thanks,
Cham

On Wed, Jul 20, 2022 at 11:53 AM Shree Tanna <shree.ta...@gmail.com> wrote:

> Thank you!
> I will try this out.
> One more question on this: is it considered an anti-pattern to do HTTP
> ingestion on GCP Dataflow, for the reasons I mentioned in my original
> message? I ask because I am getting that indication from some of my
> co-workers and also from Google Cloud support. Not sure if this is the
> right place to ask this question; happy to move this conversation
> somewhere else if not.
>
> On Tue, Jul 19, 2022 at 5:18 PM Luke Cwik via user <user@beam.apache.org>
> wrote:
>
>> Even if you don't have the resource ids ahead of time, you can have a
>> pipeline like:
>> Impulse -> ParDo(GenerateResourceIds) -> Reshuffle ->
>> ParDo(ReadResourceIds) -> ...
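>>
>> As a minimal Java sketch of that shape (listResourceIds and fetchResource
>> are hypothetical stand-ins for your API calls):
>>
>> import java.util.Arrays;
>> import java.util.List;
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.transforms.DoFn;
>> import org.apache.beam.sdk.transforms.Impulse;
>> import org.apache.beam.sdk.transforms.ParDo;
>> import org.apache.beam.sdk.transforms.Reshuffle;
>>
>> public class HttpIngest {
>>   // Hypothetical helpers; replace with calls to your REST API.
>>   static List<String> listResourceIds() { return Arrays.asList("a", "b"); }
>>   static String fetchResource(String id) { return "payload-for-" + id; }
>>
>>   public static void main(String[] args) {
>>     Pipeline p = Pipeline.create();
>>     p.apply(Impulse.create())
>>         // Generation happens on a single worker (see below).
>>         .apply("GenerateResourceIds", ParDo.of(new DoFn<byte[], String>() {
>>           @ProcessElement
>>           public void process(OutputReceiver<String> out) {
>>             listResourceIds().forEach(out::output);
>>           }
>>         }))
>>         // Break fusion so the reads rebalance across workers.
>>         .apply(Reshuffle.viaRandomKey())
>>         .apply("ReadResourceIds", ParDo.of(new DoFn<String, String>() {
>>           @ProcessElement
>>           public void process(@Element String id, OutputReceiver<String> out) {
>>             out.output(fetchResource(id));
>>           }
>>         }));
>>     p.run().waitUntilFinish();
>>   }
>> }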
>>
>> You could also compose these as splittable DoFns [1, 2, 3]:
>> ParDo(SplittableGenerateResourceIds) -> ParDo(SplittableReadResourceIds)
>>
>> The first approach is the simplest: the reshuffle rebalances the reading
>> of each resource id across worker nodes, but it is limited to generating
>> the resource ids on a single worker. Making the generation a splittable
>> DoFn lets you increase the parallelism of generation, which matters if
>> there are so many ids that generating them could crash a worker or fail
>> to have the output committed (whether these failures occur depends on how
>> well the runner handles single bundles with large outputs). Making the
>> reading splittable allows you to handle a large resource (imagine a large
>> file) so that it can be read and processed in parallel (with similar
>> failure modes if the runner can't handle single bundles with large
>> outputs); see the sketch below.
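>>
>> For illustration, a skeleton of a splittable read, assuming each resource
>> exposes a length and supports ranged reads (resourceSize and readChunk are
>> hypothetical):
>>
>> import org.apache.beam.sdk.io.range.OffsetRange;
>> import org.apache.beam.sdk.transforms.DoFn;
>> import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
>>
>> @DoFn.BoundedPerElement
>> class SplittableReadResourceIds extends DoFn<String, String> {
>>   // Hypothetical helpers; replace with your API's size lookup and ranged read.
>>   static long resourceSize(String id) { return 1_000_000L; }
>>   static String readChunk(String id, long offset) { return id + "@" + offset; }
>>
>>   @GetInitialRestriction
>>   public OffsetRange initialRestriction(@Element String id) {
>>     return new OffsetRange(0, resourceSize(id));
>>   }
>>
>>   @SplitRestriction
>>   public void splitRestriction(
>>       @Restriction OffsetRange range, OutputReceiver<OffsetRange> out) {
>>     // Pre-split so the runner can distribute chunks across workers.
>>     range.split(65536, 1).forEach(out::output);
>>   }
>>
>>   @ProcessElement
>>   public void process(
>>       @Element String id,
>>       RestrictionTracker<OffsetRange, Long> tracker,
>>       OutputReceiver<String> out) {
>>     for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); i++) {
>>       out.output(readChunk(id, i));
>>     }
>>   }
>> }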
>>
>> You can always start with the first solution and swap either piece to be
>> a splittable DoFn depending on your performance requirements and how well
>> the simple solution works.
>>
>> 1: https://beam.apache.org/blog/splittable-do-fn/
>> 2: https://beam.apache.org/blog/splittable-do-fn-is-available/
>> 3: https://beam.apache.org/documentation/programming-guide/#splittable-dofns
>>
>>
>> On Tue, Jul 19, 2022 at 10:05 AM Damian Akpan <damianakpan2...@gmail.com>
>> wrote:
>>
>>> Provided you have all the resource ids ahead of time, Beam will
>>> spread the fetches across its workers. Each fetch will still be
>>> synchronous within its worker.
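>>>
>>> For example, the read DoFn can reuse a single HTTP client per worker
>>> instance (a minimal sketch using java.net.http; the endpoint URL is
>>> made up):
>>>
>>> import java.net.URI;
>>> import java.net.http.HttpClient;
>>> import java.net.http.HttpRequest;
>>> import java.net.http.HttpResponse;
>>> import org.apache.beam.sdk.transforms.DoFn;
>>>
>>> class FetchResource extends DoFn<String, String> {
>>>   private transient HttpClient client;
>>>
>>>   @Setup
>>>   public void setup() {
>>>     // One client per DoFn instance, reused across bundles on the worker.
>>>     client = HttpClient.newHttpClient();
>>>   }
>>>
>>>   @ProcessElement
>>>   public void process(@Element String id, OutputReceiver<String> out)
>>>       throws Exception {
>>>     // Synchronous call; the worker thread blocks until the response arrives.
>>>     HttpRequest request = HttpRequest.newBuilder(
>>>         URI.create("https://api.example.com/resources/" + id)).build();
>>>     out.output(client.send(request, HttpResponse.BodyHandlers.ofString()).body());
>>>   }
>>> }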
>>>
>>> On Tue, Jul 19, 2022 at 5:40 PM Shree Tanna <shree.ta...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm planning to use Apache Beam for the extract and load parts of an ETL
>>>> pipeline and to run the jobs on Dataflow. I will have to do REST API
>>>> ingestion on our platform. I can opt to make synchronous API calls from a
>>>> DoFn, but with that the pipeline will stall while REST requests are made
>>>> over the network.
>>>>
>>>> Is it best practice to run a REST ingestion job on Dataflow? Are there
>>>> any best practices I can follow to accomplish this? Just as a reference,
>>>> I'm adding this
>>>> <https://stackoverflow.com/questions/50335521/best-practices-in-http-calls-in-cloud-dataflow-java>
>>>> StackOverflow thread here too. Also, I notice that a Rest I/O transform
>>>> <https://beam.apache.org/documentation/io/built-in/> built-in
>>>> connector is in progress for Java.
>>>>
>>>> Let me know if this is the right group to ask this question. I can also
>>>> ask d...@beam.apache.org if needed.
>>>> --
>>>> Thanks,
>>>> Shree
>>>>
>>>
>
> --
> Best,
> Shree
>
