GitHub user ilganeli commented on the issue:

https://github.com/apache/spark/pull/16685

@xwu0226 Thanks for the comments. I've reviewed your submission and commented here: https://github.com/apache/spark/pull/16692. Specifically, in response to your comments:

1) We did not find the join to be a limiting factor in our tests. Granted, this is very dataset-specific, but conceptually Spark can do distributed joins very effectively, and extracting the data from the database is an O(n) operation. The main cost of this approach is the additional copy of data out of the database and then back in as an INSERT + UPDATE. However, an UPSERT operation is equivalent to a DELETE and INSERT operation. I think there may be a slight horse race between copy-out-of-DB/INSERT/UPDATE and UPSERT, but I'm not convinced there's a dramatic performance cost in this step, particularly considering the dramatic cost of enforcing the uniqueness constraint for UPSERT.

2) This is indeed a valid concern. This approach requires the Spark programmer, rather than the database, to enforce and maintain the uniqueness constraints on the table. This is a conceptual shift from how things are usually implemented (where the DB admin is king), but in our case the choice was justified by massive performance improvements.

3) I agree that using PreparedStatement would be better. I tried PreparedStatement initially and ran into issues with certain datatypes (particularly timestamps). I haven't yet tried it with wildcards, the way the insert statement in JdbcUtils is currently implemented, but I think it's definitely doable that way. This might also help to boost performance.

4) I like the approach you took to expand JDBCDialect in https://github.com/apache/spark/pull/16692. It's a well-modularized approach. I agree that something similar could be done here.
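To make points 1 and 3 concrete, here is a minimal sketch of the copy-out, split, then INSERT + UPDATE flow, using `?` placeholders for every bound value. It is not the code from either pull request: it uses plain Python with the standard-library `sqlite3` module as a stand-in for Spark and JDBC, and the table `events`, its columns, and the helper names `split_rows`, `write_split`, and `demo` are all hypothetical. It assumes the incoming batch has no duplicate keys, since in this approach the writer, not the database, enforces uniqueness.

```python
import sqlite3


def split_rows(existing_keys, incoming):
    """Partition incoming (id, ts, value) rows by whether the id already exists."""
    to_insert = [row for row in incoming if row[0] not in existing_keys]
    to_update = [row for row in incoming if row[0] in existing_keys]
    return to_insert, to_update


def write_split(conn, incoming):
    """Write a batch as separate INSERT and UPDATE statements instead of an UPSERT."""
    # Copy the keys out of the database: a single O(n) scan.
    existing_keys = {key for (key,) in conn.execute("SELECT id FROM events")}
    to_insert, to_update = split_rows(existing_keys, incoming)

    # Placeholders leave value binding to the driver for every column.
    conn.executemany(
        "INSERT INTO events (id, ts, value) VALUES (?, ?, ?)", to_insert
    )
    conn.executemany(
        "UPDATE events SET ts = ?, value = ? WHERE id = ?",
        [(ts, value, key) for (key, ts, value) in to_update],
    )
    conn.commit()
    return len(to_insert), len(to_update)


def demo():
    conn = sqlite3.connect(":memory:")
    # No UNIQUE constraint on purpose: uniqueness is the writer's responsibility.
    conn.execute("CREATE TABLE events (id INTEGER, ts TEXT, value TEXT)")
    conn.executemany(
        "INSERT INTO events (id, ts, value) VALUES (?, ?, ?)",
        [(1, "2017-01-01 00:00:00", "a"), (2, "2017-01-01 00:00:00", "b")],
    )
    counts = write_split(
        conn,
        [(2, "2017-01-25 12:00:00", "b2"), (3, "2017-01-25 12:00:00", "c")],
    )
    rows = conn.execute("SELECT id, ts, value FROM events ORDER BY id").fetchall()
    conn.close()
    return counts, rows
```

In Spark the split would be a distributed join between the incoming DataFrame and the keys read from the table rather than a set lookup in driver memory, but the shape of the two writes is the same.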