Hi folks,

I'm currently playing with the Python SDK, primarily 0.6.0, since 2.0.0 is
apparently not supported by Dataflow yet, but I'm trying to understand the
2.0.0 API better too.

I've been trying to find a way of combining two or more DoFns into a
single one, so that one doesn't have to repeat the same pattern over and
over again.

Specifically, my use case is getting data out of Redshift via the "UNLOAD"
command:

1. Connect to Redshift via the Postgres protocol and run the UNLOAD
<http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>.
2. Connect to S3 and fetch the files that Redshift unloaded there,
converting them into a PCollection.

It's worth noting here that Redshift generates multiple files, usually at
least 10 or so; the exact number may depend on the number of cores of the
Redshift instance, some settings, etc. Reading these files in parallel
sounds like a good idea.

So, it feels like this is just a combination of two FlatMaps:
1. SQL query -> list of S3 files
2. List of S3 files -> rows of data
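Ignoring Beam for a moment, the composition I'm after looks roughly like
this in plain Python (unload_to_s3 and read_s3_file are just stand-ins for
the real Redshift and S3 calls, stubbed here for illustration):

```python
from itertools import chain


def unload_to_s3(query):
    # Stage 1 (stubbed): run UNLOAD for the query and return the
    # list of S3 paths Redshift wrote the results to.
    return ["s3://bucket/part-0000", "s3://bucket/part-0001"]


def read_s3_file(path):
    # Stage 2 (stubbed): fetch one unloaded file and yield its rows.
    yield {"path": path, "row": 1}
    yield {"path": path, "row": 2}


def read_from_redshift(query):
    # The combined step: flat-map stage 1 into stage 2, so callers
    # see a single "query -> rows" operation.
    return chain.from_iterable(read_s3_file(p) for p in unload_to_s3(query))
```

In Beam terms, each stage would be a FlatMap, and read_from_redshift is
the single transform I'd like to hand to users.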

I could just create two DoFns for that and make people combine them, but
that feels like overkill. Instead, one should just call ReadFromRedshift
and not really care about what exactly happens under the hood.

Plus, it just feels like the ability to take somewhat complex pieces of
the execution graph and encapsulate them into a single DoFn would be a
nice capability.
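To illustrate the kind of encapsulation I mean, here is a plain-Python
sketch (ReadFromRedshift and both stage methods are hypothetical names;
the Beam-specific wiring is omitted):

```python
class ReadFromRedshift:
    """Single entry point wrapping the two flat-map stages.

    In Beam terms this would be one transform whose expansion chains
    the two DoFns; sketched here as ordinary Python.
    """

    def __init__(self, query):
        self.query = query

    def _unload(self):
        # Stage 1 (stubbed): run UNLOAD, return the S3 paths it wrote.
        return ["s3://bucket/part-0000", "s3://bucket/part-0001"]

    def _read(self, path):
        # Stage 2 (stubbed): yield parsed rows from one unloaded file.
        yield (path, "row-a")
        yield (path, "row-b")

    def rows(self):
        # Callers see one step; the two stages stay an implementation
        # detail that could be reused by other composed transforms.
        for path in self._unload():
            yield from self._read(path)
```

The point is that a user writes ReadFromRedshift(query) once and never
touches the intermediate list of S3 files.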

Are there any officially recommended ways to do that?

Thank you.

-- 
Best regards,
Dmitry Demeshchuk.
