Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @xwu0226 Thanks for the comments. I've reviewed your submission and 
commented at https://github.com/apache/spark/pull/16692. 
    
    Specifically in response to your comments:
    1) We did not find the join to be a limiting factor in our tests. Granted, 
this is very dataset-specific, but conceptually Spark can do distributed joins 
very effectively, and extracting the data from the database is an O(n) 
operation. The main cost of this approach is the additional copy of data out of 
the database and then back in as an INSERT + UPDATE. However, an UPSERT 
operation is itself equivalent to a DELETE plus an INSERT. I think there may be 
a slight horse race between CopyOutOfDb/INSERT/UPDATE and UPSERT, but I'm not 
convinced there's a dramatic performance cost in this step, particularly 
considering the dramatic cost of enforcing the uniqueness constraint for UPSERT.
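
    To make the INSERT + UPDATE split in 1) concrete, here is a minimal sketch 
in plain Java rather than Spark code. It assumes the keys already present in 
the table have been copied out of the database into a set; in the PR that role 
is played by a distributed join. The names UpsertSplit, Split, and split are 
hypothetical and do not come from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only: it is not part of Spark or the PR.
final class UpsertSplit {

    /** Holds the rows bound for INSERT and the rows bound for UPDATE. */
    static final class Split {
        final List<Map<String, Object>> inserts = new ArrayList<>();
        final List<Map<String, Object>> updates = new ArrayList<>();
    }

    /**
     * Routes each incoming row by whether its key already exists in the table.
     * existingKeys stands in for the keys copied out of the database.
     */
    static Split split(List<Map<String, Object>> incoming,
                       Set<Object> existingKeys,
                       String keyColumn) {
        Split result = new Split();
        for (Map<String, Object> row : incoming) {
            if (existingKeys.contains(row.get(keyColumn))) {
                result.updates.add(row);   // key present: issue an UPDATE
            } else {
                result.inserts.add(row);   // key absent: issue an INSERT
            }
        }
        return result;
    }
}
```

    Note that nothing in this sketch checks uniqueness against the database at 
write time, which is the trade-off discussed in 2).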
    
    2) This is indeed a valid concern. This approach requires the Spark 
programmer, rather than the database, to enforce and maintain the uniqueness 
constraints on the table. This is a conceptual shift from how things 
are usually implemented (where the DB Admin is king), but in our case this 
choice was justified by massive performance improvements.
    
    3) I agree that using a PreparedStatement would be better. I tried 
initially with a PreparedStatement and ran into issues with certain datatypes 
(particularly timestamps). I haven't yet tried it with wildcards, the way the 
insert statement in JdbcUtils is currently implemented, but I think it's 
definitely doable that way. This might also help to boost performance.
    
    4) I like the approach that you guys took to expand JDBCDialect in 
https://github.com/apache/spark/pull/16692. It's a well-modularized approach. 
I agree that something similar could be done here. 
    



