Hi Luke,

Thank you for your help. If I have any further questions, I'll let you know.
All the best,
Dennis

On Fri, Jul 17, 2020 at 10:41 PM Luke Cwik <lc...@google.com> wrote:
> There is typically no trivial way to have SQL queries be partitioned to
> parallelize the reading automatically, since you typically can't have
> multiple "readers" connect and use the same query results. There are a
> couple of options that you could use:
> 1) Use a single query, using the database's ability to dump the contents
> to a file, and then perform a parallel read over the file's contents
> 2) Partition your query over a primary key and submit N queries, where N
> is the number of parallel reads you want (alternatively, dump to a temp
> table and partition your query); see the sketch below
> 3) Write your own splittable DoFn that can partition a query arbitrarily
>
> Do you see any meaningful errors in the logs when running your pipeline
> on Dataflow? If the issue is that pyodbc isn't installed on the remote
> worker, you should check out how to manage pipeline dependencies[1].
>
> 3 has the best parallelism potential, followed by 1, while 2 is the
> easiest to get working, followed by 1 (assuming that the dump file
> format is already supported by Beam). On the writing side, you can use
> GroupIntoBatches[2] to have meaningfully large transactions, or try to
> use the JdbcIO write transform when it becomes available in a
> cross-language way.
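>
> Here is a rough, untested sketch of option 2 using pyodbc (the
> connection string, the table name "my_table", and the key column "id"
> are placeholders): compute the key bounds once, split them into N
> ranges, and read each range as its own query:
>
>     import apache_beam as beam
>     import pyodbc
>
>     # Placeholder connection string; fill in your own server/database.
>     CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
>                 "SERVER=host;DATABASE=db;UID=user;PWD=pass")
>     N = 8  # desired number of parallel reads
>
>     def make_ranges(n):
>         # One up-front query for the primary key bounds.
>         # Assumes a non-empty table with an integer key.
>         conn = pyodbc.connect(CONN_STR)
>         lo, hi = conn.cursor().execute(
>             "SELECT MIN(id), MAX(id) FROM my_table").fetchone()
>         conn.close()
>         step = (hi - lo) // n + 1
>         return [(lo + i * step, lo + (i + 1) * step) for i in range(n)]
>
>     def read_range(bounds):
>         # Each worker opens its own connection and reads one key range.
>         start, end = bounds
>         conn = pyodbc.connect(CONN_STR)
>         cursor = conn.cursor().execute(
>             "SELECT id, name FROM my_table WHERE id >= ? AND id < ?",
>             start, end)
>         for row in cursor:
>             yield tuple(row)  # plain tuples pickle cleanly
>         conn.close()
>
>     with beam.Pipeline() as p:
>         rows = (p
>                 | beam.Create(make_ranges(N))
>                 | beam.Reshuffle()  # break fusion so ranges read in parallel
>                 | beam.FlatMap(read_range))
>
> And on the write side, a minimal GroupIntoBatches sketch (same
> placeholder schema, two-column (id, name) rows, arbitrary batch size)
> that issues one transaction per batch:
>
>     def write_batch(keyed_batch):
>         # keyed_batch is (key, [row, ...]) as produced by GroupIntoBatches.
>         _, batch = keyed_batch
>         conn = pyodbc.connect(CONN_STR)
>         conn.cursor().executemany(
>             "INSERT INTO other_table (id, name) VALUES (?, ?)", batch)
>         conn.commit()
>         conn.close()
>
>     # Continuing inside the same `with beam.Pipeline() as p:` block:
>     _ = (rows
>          | beam.Map(lambda row: (row[0] % 10, row))  # arbitrary shard key
>          | beam.GroupIntoBatches(500)
>          | beam.Map(write_batch))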
>
> There is ongoing work[3, 4, 5] to make the existing JdbcIO connector in
> Beam Java available to Python as a cross-language transform.
>
> I don't know of another way to get on the mailing list other than
> sending an e-mail to user-subscr...@beam.apache.org
>
> 1: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
> 2: https://github.com/apache/beam/blob/62118fa66788ad45032a60abc30799cd6f0e4658/sdks/python/apache_beam/transforms/util.py#L744
> 3: https://github.com/apache/beam/pull/12145
> 4: https://issues.apache.org/jira/browse/BEAM-10135
> 5: https://issues.apache.org/jira/browse/BEAM-10136
>
> On Fri, Jul 17, 2020 at 9:21 AM Dennis <denniszvigel...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm writing to inquire about developing a pipeline (using the Python
>> SDK) with multiple PTransforms that can read from, write to, and alter
>> data on an MSSQL server.
>>
>> I've been using beam-nuggets (https://pypi.org/project/beam-nuggets/),
>> a community I/O connector that provides these kinds of PTransforms for
>> a MySQL server, and was looking to see if there's an option to do this
>> for MSSQL.
>>
>> So far, I've been able to run a pipeline with DirectRunner that reads
>> data from MSSQL using pyodbc. While this is a good starting point, it's
>> not running with DataflowRunner (even after configuring Private IP),
>> and it's not parallelized.
>>
>> I tried to look into SQLAlchemy, but it seems that there isn't as much
>> support as there is for MySQL, especially for the insertion method. The
>> expectation is that the default insertion method is upsert. For MySQL,
>> this was implemented using:
>>
>> from sqlalchemy.dialects.mysql import insert as mysql_insert
>>
>> There is no such package available for MSSQL... (see my P.P.S. below
>> for a possible workaround)
>>
>> How would one go about doing this? I've looked at several Stack
>> Overflow posts, but none of the solutions there offered functionality
>> similar to that of beam-nuggets. Perhaps I missed one?
>>
>> I realize that this is a loaded question, so I greatly appreciate any
>> help in advance.
>>
>> Thanks,
>> Dennis
>>
>> P.S. I had trouble adding my work email address,
>> dzvigel...@questrade.com, to the mailing list (even though I went
>> through the same steps to subscribe as with this one). Could you please
>> add it? Thanks.
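>>
>> P.P.S. Would a raw T-SQL MERGE through pyodbc be a reasonable upsert
>> workaround? A rough, untested sketch of what I mean (the connection
>> string and the "my_table"/"id"/"name" schema are placeholders):
>>
>>     import pyodbc
>>
>>     # Placeholder connection string; fill in your own server/database.
>>     CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
>>                 "SERVER=host;DATABASE=db;UID=user;PWD=pass")
>>
>>     # Insert the row if the key is new, otherwise update it in place.
>>     MERGE_SQL = """
>>     MERGE my_table AS target
>>     USING (SELECT ? AS id, ? AS name) AS source
>>     ON target.id = source.id
>>     WHEN MATCHED THEN
>>         UPDATE SET name = source.name
>>     WHEN NOT MATCHED THEN
>>         INSERT (id, name) VALUES (source.id, source.name);
>>     """
>>
>>     def upsert_rows(rows):
>>         # One MERGE per row, committed as a single transaction.
>>         conn = pyodbc.connect(CONN_STR)
>>         cursor = conn.cursor()
>>         for row_id, name in rows:
>>             cursor.execute(MERGE_SQL, row_id, name)
>>         conn.commit()
>>         conn.close()
>>
>>     upsert_rows([(1, "a"), (2, "b")])  # example usage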