I may be wrong about that, indeed. Originally, I couldn't even run the regular WordCount example on version 2.0.0; it kept failing with some Beam-specific errors, and my reaction was "okay, this is probably too early, I'll go back to 0.6.0 for now".
Also, when reading the code I sometimes see things like "this is meant only for the DirectRunner" and such, so the degree of 2.0.0 support in Dataflow is a bit unclear to me.

On Thu, Jun 1, 2017 at 2:59 PM, Chamikara Jayalath <[email protected]> wrote:

> On Thu, Jun 1, 2017 at 2:56 PM Dmitry Demeshchuk <[email protected]> wrote:
>
>> Haha, thanks, Sourabh, you beat me to it :)
>>
>> On Thu, Jun 1, 2017 at 2:55 PM, Dmitry Demeshchuk <[email protected]> wrote:
>>
>>> Looks like the expand method should do the trick, similar to how it's done in GroupByKey?
>>>
>>> https://github.com/apache/beam/blob/dc4acfdd1bb30a07a9c48849f88a67f60bc8ff08/sdks/python/apache_beam/transforms/core.py#L1104
>>>
>>> On Thu, Jun 1, 2017 at 2:37 PM, Dmitry Demeshchuk <[email protected]> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I'm currently playing with the Python SDK, primarily 0.6.0, since 2.0.0 is apparently not supported by Dataflow, but I'm trying to understand the 2.0.0 API better too.
>>>>
> I think Dataflow supports the 2.0.0 release. Did you find some documentation that says otherwise?
>
> - Cham
>
>>>> I've been trying to find a way of combining two or more DoFns into a single one, so that one doesn't have to repeat the same pattern over and over again.
>>>>
>>>> Specifically, my use case is getting data out of Redshift via the "UNLOAD" command:
>>>>
>>>> 1. Connect to Redshift via the Postgres protocol and do the unload <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>.
>>>> 2. Connect to S3 and fetch the files that Redshift unloaded there, converting them into a PCollection.
>>>>
>>>> It's worth noting here that Redshift generates multiple files, usually at least 10 or so; the exact number may depend on the number of cores of the Redshift instance, some settings, etc. Reading these files in parallel sounds like a good idea.
>>>>
>>>> So, it feels like this is just a combination of two FlatMaps:
>>>> 1. SQL query -> list of S3 files
>>>> 2. List of S3 files -> rows of data
>>>>
>>>> I could just create two DoFns for that and make people combine them, but that feels like overkill. Instead, one should just call ReadFromRedshift and not really care about what exactly happens under the hood.
>>>>
>>>> Plus, it just feels like the ability to take somewhat complex pieces of the execution graph and encapsulate them into a DoFn would be a nice capability.
>>>>
>>>> Are there any officially recommended ways to do that?
>>>>
>>>> Thank you.
>>>>

--
Best regards,
Dmitry Demeshchuk.
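
For reference, here is a minimal sketch of the composite-transform approach discussed above. The expand() hook on PTransform is the real Beam Python API (it is the same mechanism GroupByKey uses at the link above); run_unload() and read_s3_file() are hypothetical placeholders for the Redshift and S3 plumbing, not real library calls:

import apache_beam as beam


def run_unload(query):
    # Hypothetical helper: connect to Redshift over the Postgres protocol,
    # issue the UNLOAD, and return the list of S3 paths it produced.
    raise NotImplementedError


def read_s3_file(path):
    # Hypothetical helper: fetch one unloaded file from S3 and yield its rows.
    raise NotImplementedError


class ReadFromRedshift(beam.PTransform):
    """Encapsulates the UNLOAD-then-read pattern as a single transform."""

    def expand(self, queries):
        return (queries
                # Step 1: SQL query -> list of S3 files (fans out).
                | 'Unload' >> beam.FlatMap(run_unload)
                # Step 2: S3 file -> rows of data, read in parallel.
                | 'ReadFiles' >> beam.FlatMap(read_s3_file))

A pipeline would then apply it in one step, e.g. p | beam.Create([query]) | ReadFromRedshift(), without the caller having to know about the two FlatMaps inside.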
