I did some digging and finally found it: it turns out my version of the google-cloud-dataflow package was somehow still 0.6.0, and it wasn't being upgraded by just installing apache_beam[gcp]==2.0.0.
Now 2.0.0 works for me on Dataflow. Thanks for bringing my attention to this, Chamikara!

On Thu, Jun 1, 2017 at 3:22 PM, Chamikara Jayalath <[email protected]> wrote:

> Is it possible that you didn't install GCP components when installing Beam?
> You have to do the following to install Beam with support for Dataflow:
>
>     pip install apache-beam[gcp]
>
> Please file a JIRA if you find any issues.
>
> Thanks,
> Cham
>
> On Thu, Jun 1, 2017 at 3:12 PM Dmitry Demeshchuk <[email protected]> wrote:
>
>> I may be wrong on that, indeed.
>>
>> Originally, I couldn't even run the regular WordCount on version 2.0.0;
>> it was failing with some Beam-specific errors, and my reaction was "okay,
>> this is probably too early, I'll go back to 0.6.0 for now".
>>
>> Also, when reading the code I sometimes see things like "this is meant
>> only for DirectRunner" and such, so the degree of support of 2.0.0 by
>> Dataflow is a bit unclear to me.
>>
>> On Thu, Jun 1, 2017 at 2:59 PM, Chamikara Jayalath <[email protected]> wrote:
>>
>>> On Thu, Jun 1, 2017 at 2:56 PM Dmitry Demeshchuk <[email protected]> wrote:
>>>
>>>> Haha, thanks, Sourabh, you beat me to it :)
>>>>
>>>> On Thu, Jun 1, 2017 at 2:55 PM, Dmitry Demeshchuk <[email protected]> wrote:
>>>>
>>>>> Looks like the expand method should do the trick, similar to how it's
>>>>> done in GroupByKey?
>>>>>
>>>>> https://github.com/apache/beam/blob/dc4acfdd1bb30a07a9c48849f88a67f60bc8ff08/sdks/python/apache_beam/transforms/core.py#L1104
>>>>>
>>>>> On Thu, Jun 1, 2017 at 2:37 PM, Dmitry Demeshchuk <[email protected]> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> I'm currently playing with the Python SDK, primarily 0.6.0, since
>>>>>> 2.0.0 is apparently not supported by Dataflow, but I'm trying to
>>>>>> understand the 2.0.0 API better too.
>>>
>>> I think Dataflow supports the 2.0.0 release. Did you find some
>>> documentation that says otherwise?
>>> - Cham
>>>
>>>>>> I've been trying to find a way of combining two or more DoFns into a
>>>>>> single one, so that one doesn't have to repeat the same pattern over
>>>>>> and over again.
>>>>>>
>>>>>> Specifically, my use case is getting data out of Redshift via the
>>>>>> "UNLOAD" command:
>>>>>>
>>>>>> 1. Connect to Redshift via the Postgres protocol and do the unload
>>>>>>    <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>.
>>>>>> 2. Connect to S3 and fetch the files that Redshift unloaded there,
>>>>>>    converting them into a PCollection.
>>>>>>
>>>>>> It's worth noting here that Redshift generates multiple files,
>>>>>> usually at least 10 or so; the exact number may depend on the number
>>>>>> of cores of the Redshift instance, some settings, etc. Reading these
>>>>>> files in parallel sounds like a good idea.
>>>>>>
>>>>>> So, it feels like this is just a combination of two FlatMaps:
>>>>>> 1. SQL query -> list of S3 files
>>>>>> 2. List of S3 files -> rows of data
>>>>>>
>>>>>> I could just create two DoFns for that and make people combine them,
>>>>>> but that feels like overkill. Instead, one should just call
>>>>>> ReadFromRedshift and not really care about what exactly happens under
>>>>>> the hood.
>>>>>>
>>>>>> Plus, it just feels like the ability to take somewhat complex pieces
>>>>>> of the execution graph and encapsulate them into a PTransform would
>>>>>> be a nice capability.
>>>>>>
>>>>>> Are there any officially recommended ways to do that?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Dmitry Demeshchuk.

--
Best regards,
Dmitry Demeshchuk.
