Hi Luke,

Thank you for your help. If I have any further questions, I'll let you know.

All the best,
Dennis

On Fri, Jul 17, 2020 at 10:41 PM Luke Cwik <lc...@google.com> wrote:

> There is typically no trivial way to have a SQL query partitioned so that
> the read is parallelized automatically, since multiple "readers" generally
> can't connect to and consume the same query results. There are a few
> options you could use:
> 1) Use a single query, use the database's ability to dump the contents to
> a file, and then perform a parallel read over the file's contents
> 2) Partition your query over a primary key and submit N queries, where N
> is the number of parallel reads you want (alternatively, dump to a temp
> table and partition your query); see the sketch below
> 3) Write your own splittable DoFn that can partition a query arbitrarily
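>
> A rough sketch of option 2, using pyodbc inside a DoFn (this is only a
> sketch, not a drop-in implementation: the connection string, table name,
> key column, and key bounds below are placeholders you would substitute):
>
> import apache_beam as beam
> import pyodbc
>
> CONN_STR = ("Driver={ODBC Driver 17 for SQL Server};"
>             "Server=...;Database=...;UID=...;PWD=...")  # placeholder
>
> def key_ranges(min_id, max_id, num_splits):
>     # Yield (lo, hi) half-open ranges covering [min_id, max_id].
>     step = max(1, (max_id - min_id + 1) // num_splits)
>     lo = min_id
>     while lo <= max_id:
>         yield (lo, min(lo + step, max_id + 1))
>         lo += step
>
> class ReadKeyRange(beam.DoFn):
>     def __init__(self, conn_str):
>         self._conn_str = conn_str
>
>     def setup(self):
>         self._conn = pyodbc.connect(self._conn_str)
>
>     def process(self, key_range):
>         lo, hi = key_range
>         cursor = self._conn.cursor()
>         # "my_table" and "id" are hypothetical; use your own schema.
>         cursor.execute(
>             "SELECT id, name FROM my_table WHERE id >= ? AND id < ?",
>             lo, hi)
>         for row in cursor:
>             yield tuple(row)
>
>     def teardown(self):
>         self._conn.close()
>
> with beam.Pipeline() as p:
>     rows = (
>         p
>         # Key bounds hard-coded here; in practice query MIN/MAX first.
>         | beam.Create(list(key_ranges(0, 1000000, num_splits=16)))
>         | beam.Reshuffle()  # distribute the ranges across workers
>         | beam.ParDo(ReadKeyRange(CONN_STR)))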
>
> Do you see any meaningful errors in the logs when running your pipeline on
> Dataflow? If the issue is that pyodbc isn't installed on the remote
> workers, you should check out how to manage pipeline dependencies[1].
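>
> For instance, if you keep a requirements.txt next to your pipeline that
> lists pyodbc, something like the following should tell Dataflow to install
> it on the workers (the project/bucket names are placeholders; note that
> pyodbc also needs a system-level ODBC driver, which may additionally
> require a setup.py or custom worker image):
>
> from apache_beam.options.pipeline_options import PipelineOptions
>
> options = PipelineOptions(
>     runner="DataflowRunner",
>     project="my-project",                   # placeholder
>     region="us-central1",                   # placeholder
>     temp_location="gs://my-bucket/tmp",     # placeholder
>     requirements_file="requirements.txt",   # lists pyodbc, etc.
> )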
>
> Option 3 has the best parallelism potential, followed by 1, while 2 is the
> easiest to get working, followed by 1 (assuming the dump file format is
> already supported by Beam). On the writing side, you can use
> GroupIntoBatches[2] to get meaningfully large transactions, or try the
> JdbcIO write transform when it becomes available in a cross-language way.
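>
> A rough sketch of the GroupIntoBatches approach, continuing from the read
> sketch above (the shard key, batch size, and INSERT statement are
> placeholders, not a definitive implementation):
>
> class WriteBatch(beam.DoFn):
>     def __init__(self, conn_str):
>         self._conn_str = conn_str
>
>     def setup(self):
>         self._conn = pyodbc.connect(self._conn_str)
>
>     def process(self, keyed_batch):
>         _, batch = keyed_batch
>         cursor = self._conn.cursor()
>         # One transaction per batch instead of one per row.
>         cursor.executemany(
>             "INSERT INTO my_table (id, name) VALUES (?, ?)", list(batch))
>         self._conn.commit()
>
>     def teardown(self):
>         self._conn.close()
>
> _ = (
>     rows
>     | beam.Map(lambda row: (row[0] % 10, row))  # key by shard for batching
>     | beam.GroupIntoBatches(500)
>     | beam.ParDo(WriteBatch(CONN_STR)))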
>
> There is ongoing work[3, 4, 5] to make the existing JdbcIO connector in
> Beam Java available to Python as a cross-language transform.
>
> I don't know of another way to get on the mailing list than sending an
> e-mail to user-subscr...@beam.apache.org
>
> 1:
> https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
> 2:
> https://github.com/apache/beam/blob/62118fa66788ad45032a60abc30799cd6f0e4658/sdks/python/apache_beam/transforms/util.py#L744
> 3: https://github.com/apache/beam/pull/12145
> 4: https://issues.apache.org/jira/browse/BEAM-10135
> 5: https://issues.apache.org/jira/browse/BEAM-10136
>
> On Fri, Jul 17, 2020 at 9:21 AM Dennis <denniszvigel...@gmail.com> wrote:
>
>> Hello,
>>
>>
>>
>> I'm writing to inquire about developing a pipeline (using the Python
>> SDK) with multiple PTransforms that can read from, write to, and alter
>> data in an MSSQL server.
>>
>>
>>
>> I've been using beam-nuggets (https://pypi.org/project/beam-nuggets/), a
>> community I/O connector that provides these kinds of PTransforms for a
>> MySQL server, and was looking to see if there's an option to do the same
>> for MSSQL.
>>
>>
>>
>> So far, I've been able to run a pipeline with DirectRunner that reads
>> data from MSSQL using pyodbc. While this is a good starting point, it
>> doesn't run with DataflowRunner (even after configuring Private IP), and
>> it's not parallelized.
>>
>>
>>
>> I tried looking into SQLAlchemy, but it seems there isn't as much support
>> for MSSQL as there is for MySQL, especially for the insertion method. The
>> default insertion method is expected to be upsert. For MySQL, this was
>> implemented using:
>>
>>
>>
>> from sqlalchemy.dialects.mysql import insert as mysql_insert
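>>
>> which, roughly speaking, gets used along these lines (a sketch of my
>> understanding, not beam-nuggets' actual code; the connection URL, table,
>> and columns are made up):
>>
>> from sqlalchemy import MetaData, Table, create_engine
>> from sqlalchemy.dialects.mysql import insert as mysql_insert
>>
>> engine = create_engine("mysql+pymysql://user:pass@host/db")  # placeholder
>> users = Table("users", MetaData(), autoload_with=engine)
>>
>> # INSERT ... ON DUPLICATE KEY UPDATE, i.e. an upsert on the primary key.
>> stmt = mysql_insert(users).values(id=1, name="Dennis")
>> stmt = stmt.on_duplicate_key_update(name=stmt.inserted.name)
>> with engine.begin() as conn:
>>     conn.execute(stmt)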
>>
>>
>>
>> There is no such module available for MSSQL...
>>
>>
>>
>> How would one go about doing this? I've looked at several Stack Overflow
>> posts, but none of the solutions there had functionality similar to that
>> of beam-nuggets. Perhaps I missed one?
>>
>>
>>
>> I realize that this is a loaded question, so I greatly appreciate any
>> help in advance.
>>
>>
>>
>> Thanks,
>>
>> Dennis
>>
>>
>>
>> P.S. I had trouble adding my work email address, dzvigel...@questrade.com,
>> to the mailing list (even though I went through the same subscription
>> steps as with this one). Could you please add it? Thanks.
>>
>
