I did some digging and finally found it: it turns out my version of the google-cloud-dataflow package was somehow still 0.6.0, and it wasn't being upgraded by just installing apache_beam[gcp]==2.0.0.
Now 2.0.0 works for me on Dataflow. Thanks for bringing my attention to this, Chamikara!

On Thu, Jun 1, 2017 at 3:22 PM, Chamikara Jayalath <[email protected]> wrote:

> Is it possible that you didn't install GCP components when installing Beam?
> You have to do the following to install Beam with support for Dataflow:
>
>     pip install apache-beam[gcp]
>
> Please file a JIRA if you find any issues.
>
> Thanks,
> Cham
>
> On Thu, Jun 1, 2017 at 3:12 PM Dmitry Demeshchuk <[email protected]> wrote:
>
>> I may be wrong on that, indeed.
>>
>> Originally, I couldn't even run the regular WordCount on version 2.0.0;
>> it was failing with some Beam-specific errors, and my reaction was "okay,
>> this is probably too early, I'll go back to 0.6.0 for now".
>>
>> Also, when reading the code I sometimes see things like "this is meant
>> only for DirectRunner" and such, so the degree of support of 2.0.0 by
>> Dataflow is a bit unclear to me.
>>
>> On Thu, Jun 1, 2017 at 2:59 PM, Chamikara Jayalath <[email protected]> wrote:
>>
>>> On Thu, Jun 1, 2017 at 2:56 PM Dmitry Demeshchuk <[email protected]> wrote:
>>>
>>>> Haha, thanks, Sourabh, you beat me to it :)
>>>>
>>>> On Thu, Jun 1, 2017 at 2:55 PM, Dmitry Demeshchuk <[email protected]> wrote:
>>>>
>>>>> Looks like the expand method should do the trick, similar to how it's
>>>>> done in GroupByKey?
>>>>>
>>>>> https://github.com/apache/beam/blob/dc4acfdd1bb30a07a9c48849f88a67f60bc8ff08/sdks/python/apache_beam/transforms/core.py#L1104
>>>>>
>>>>> On Thu, Jun 1, 2017 at 2:37 PM, Dmitry Demeshchuk <[email protected]> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> I'm currently playing with the Python SDK, primarily 0.6.0, since
>>>>>> 2.0.0 is apparently not supported by Dataflow, but I'm trying to
>>>>>> understand the 2.0.0 API better too.
>>>
>>> I think Dataflow supports the 2.0.0 release. Did you find some
>>> documentation that says otherwise?
>>> - Cham
>>>
>>>>>> I've been trying to find a way of combining two or more DoFns into a
>>>>>> single one, so that one doesn't have to repeat the same pattern over
>>>>>> and over again.
>>>>>>
>>>>>> Specifically, my use case is getting data out of Redshift via the
>>>>>> "UNLOAD" command:
>>>>>>
>>>>>> 1. Connect to Redshift via the Postgres protocol and do the unload
>>>>>>    <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>.
>>>>>> 2. Connect to S3 and fetch the files that Redshift unloaded there,
>>>>>>    converting them into a PCollection.
>>>>>>
>>>>>> It's worth noting here that Redshift generates multiple files,
>>>>>> usually at least 10 or so; the exact number may depend on the number
>>>>>> of cores of the Redshift instance, some settings, etc. Reading these
>>>>>> files in parallel sounds like a good idea.
>>>>>>
>>>>>> So, it feels like this is just a combination of two FlatMaps:
>>>>>> 1. SQL query -> list of S3 files
>>>>>> 2. List of S3 files -> rows of data
>>>>>>
>>>>>> I could just create two DoFns for that and make people combine them,
>>>>>> but that feels like overkill. Instead, one should just call
>>>>>> ReadFromRedshift and not really care about what exactly happens under
>>>>>> the hood.
>>>>>>
>>>>>> Plus, it just feels like the ability to take somewhat complex pieces
>>>>>> of the execution graph and encapsulate them into a PTransform would
>>>>>> be a nice capability.
>>>>>>
>>>>>> Are there any officially recommended ways to do that?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Dmitry Demeshchuk.

--
Best regards,
Dmitry Demeshchuk.
