Hi all,

I'm writing a data source that shares similarities with Spark's own JDBC 
implementation, and I'd like to ask a question about how Spark handles failure 
scenarios involving JDBC and SQL databases. To my understanding, if an executor 
dies while it's running a task, Spark will revive the executor and try to 
re-run that task. However, how does this play out in the context of data 
integrity and Spark's JDBC data source API (e.g. 
df.write.format("jdbc").option(...).save())?
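
For concreteness, here is roughly the shape of that write call (assuming df is
an existing DataFrame; the URL, table, and credentials below are placeholders
rather than anything real):

    // Plain JDBC write through the DataFrameWriter API. All connection
    // details here are placeholder values.
    df.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "target_table")
      .option("user", "spark_user")
      .option("password", "****")
      .mode("append")
      .save()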

In the savePartition function of JdbcUtils.scala
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala),
we see Spark calling the commit and rollback functions of the Java connection
object generated from the database URL/credentials provided by the user (a
simplified sketch of that code is included below). Can someone provide some
guidance on what exactly happens
under certain failure scenarios? For example, if an executor dies right after 
commit() finishes or before rollback() is called, does Spark try to re-run the 
task and write the same data partition again, essentially creating duplicate 
committed rows in the database? What happens if the executor dies in the middle 
of calling commit() or rollback()?
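
To make the failure windows concrete, here is a heavily simplified,
hypothetical sketch of the transaction handling I'm referring to in
savePartition. It is paraphrased from the linked file rather than copied: the
real method has a different signature, batching by batchSize, dialect handling,
and more careful cleanup, and savePartitionSketch/bindRows are just stand-in
names.

    import java.sql.{Connection, PreparedStatement}

    object JdbcWriteSketch {
      // Hypothetical simplification of JdbcUtils.savePartition: one JDBC
      // transaction per Spark partition, committed at the end.
      def savePartitionSketch(
          getConnection: () => Connection,
          insertSql: String,
          bindRows: Iterator[PreparedStatement => Unit]): Unit = {
        val conn = getConnection()
        var committed = false
        try {
          conn.setAutoCommit(false)      // one transaction per partition
          val stmt = conn.prepareStatement(insertSql)
          try {
            bindRows.foreach { bind =>
              bind(stmt)                 // set this row's column values
              stmt.addBatch()
            }
            stmt.executeBatch()
          } finally {
            stmt.close()
          }
          conn.commit()                  // the failure windows I'm asking
          committed = true               // about sit right around this call
        } finally {
          if (!committed) {
            conn.rollback()              // never reached if the JVM dies first
          }
          conn.close()
        }
      }
    }

The windows I'm asking about are the ones right before, during, and after the
conn.commit() call in this structure.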

Thanks for your help!
