Hi, list,

I was hoping someone could give me a general code review of a Redshift
source I wrote:
https://gist.github.com/doubleyou/d3236180691dc9b146e17bc046ec1fc1. It also
relies on the `s3` and `config` modules from our internal library; I can
share those too if needed. It was just more hassle to open up the entire
repository, since it contains some company-specific code at the moment.

I was also hoping to find out whether you'd want me to file a pull request;
we'd be totally fine with open-sourcing this piece, as well as some other
AWS sources and sinks in the future.

Finally, I have a specific question about cleanup. My impression was that
https://gist.github.com/doubleyou/d3236180691dc9b146e17bc046ec1fc1#file-redshift-py-L153
would help make sure there's no possible data loss after we delete the S3
files. However, in a personal conversation Eugene Kirpichev pointed out
that this approach does not ensure that the PCollection is persisted, and
that Dataflow will simply fuse multiple stages together.

Eugene also pointed out that this cleanup problem has been worked around in
the Java SDK's BigQuery source. To my understanding, it's this one:
https://github.com/apache/beam/blob/70e53e7dc5d58e4d9f88c6d4f1cff036429429c1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java#L100.
However, I don't yet have enough knowledge of the parity between the Java
and Python SDKs to tell whether I can implement a Python source in a
similar fashion (from what I remember, implementing sources is generally
discouraged in favor of writing a DoFn instead).

Any thoughts and suggestions would be highly appreciated.

Thank you.

-- 
Best regards,
Dmitry Demeshchuk.
