Hi list,

I was hoping someone could give me a general code review of a Redshift source I wrote: https://gist.github.com/doubleyou/d3236180691dc9b146e17bc046ec1fc1. It also relies on the `s3` and `config` modules from our internal library; I can add those too if needed, but it was more hassle to open up the entire repository, since it contains some company-specific code at the moment.
My hope was also to find out whether you'd want me to file a pull request; we'd be totally fine with open sourcing this piece, as well as some other AWS sources and sinks in the future.

Finally, I have a specific question about cleanup. My impression was that https://gist.github.com/doubleyou/d3236180691dc9b146e17bc046ec1fc1#file-redshift-py-L153 would help ensure there's no possible data loss once we delete the S3 files. However, in a personal conversation Eugene Kirpichov pointed out that this does not guarantee the PCollection is persisted, and that Dataflow will just fuse the stages together. Eugene also pointed out that this cleanup problem has been worked around in the BigQuery source in the Java SDK. To my understanding, it's this one: https://github.com/apache/beam/blob/70e53e7dc5d58e4d9f88c6d4f1cff036429429c1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java#L100. However, I don't yet know enough about the parity between the Java and Python SDKs to tell whether I can implement a Python source in a similar fashion (from what I remember, implementing sources is generally discouraged in favor of writing a DoFn instead).

Any thoughts and suggestions would be highly appreciated. Thank you.

--
Best regards,
Dmitry Demeshchuk
