I dug up some of my old work using Spark as an ETL tool.
Regarding the question
"Any reason why Spark's SaveMode doesn't have a mode that ignores Primary
Key/Unique constraint violations?"
There is no way Spark can determine whether a primary key constraint has been
violated until it receives such a message from Oracle through JDBC.
A JDBC read from an Oracle table requires the Oracle JDBC driver ojdbc6.jar or
higher. ojdbc6.jar works for Oracle 11g and 12c, added as --jars /ojdbc6.jar
An example with a parallel read (4 connections) to Oracle, with ID being your PK
in the Oracle table:

var _ORACLEserver = "jdbc:oracle:thin:@rhes564:1521:mydb12"
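A minimal sketch of such a parallel read, assuming the table name, credentials and ID bounds below are hypothetical placeholders; the Spark call itself is shown in a comment since it needs a live SparkSession:

```scala
// Connection string from the example above; everything else is a placeholder.
val _ORACLEserver = "jdbc:oracle:thin:@rhes564:1521:mydb12"

// Options the Spark JDBC source uses to split the read into 4 parallel
// connections, partitioned on the primary-key column ID.
val readOptions = Map(
  "url"             -> _ORACLEserver,
  "dbtable"         -> "scratchpad.mytable", // hypothetical table
  "user"            -> "scratchpad",         // hypothetical credentials
  "password"        -> "xxxx",
  "driver"          -> "oracle.jdbc.OracleDriver",
  "partitionColumn" -> "ID",
  "lowerBound"      -> "1",                  // min(ID), placeholder
  "upperBound"      -> "1000000",            // max(ID), placeholder
  "numPartitions"   -> "4"
)

// With a SparkSession in scope this becomes:
//   val df = spark.read.format("jdbc").options(readOptions).load()
```

Each of the 4 partitions issues its own SELECT with a WHERE clause over a slice of the ID range, which is why ID should be indexed (here it is the PK).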
This behaviour is governed by the underlying RDBMS for bulk inserts: the
transaction either commits or rolls back as a whole.
You can insert new rows into a staging table in Oracle (which is common in
ETL) and then insert/select into the Oracle target table in a shell routine.
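A minimal sketch of that staging-table route, with hypothetical staging/target table names; Spark appends into the staging table, then a single statement (run from the shell routine, e.g. via sqlplus) copies only the rows whose keys are not already in the target:

```scala
// Hypothetical staging and target table names.
val staging = "scratchpad.mytable_staging"
val target  = "scratchpad.mytable"

// Step 1 (Spark): append the incoming rows into the staging table.
// With a DataFrame df and JDBC url/properties in scope:
//   df.write.mode("append").jdbc(_ORACLEserver, staging, connProps)

// Step 2 (shell routine): copy only rows whose PK is not in the target.
val insertSelect =
  s"""INSERT INTO $target
     |SELECT s.*
     |FROM $staging s
     |WHERE NOT EXISTS (SELECT 1 FROM $target t WHERE t.ID = s.ID)""".stripMargin
```

Because the duplicate check happens inside Oracle, the Spark job itself never hits a unique-constraint violation.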
The other way is to use JDBC in Spark to read the existing keys from the
Oracle table first, so duplicates can be filtered out before the write.
This is not an issue of Spark but of the underlying database. The primary key
constraint has a purpose, and ignoring it would defeat that purpose.
Then, to handle your use case, you would need to make multiple decisions that
may imply you don't want to simply insert if the row doesn't exist. Maybe you
want to update the existing row instead of skipping it.
Any reason why Spark's SaveMode doesn't have a mode that ignores Primary
Key/Unique constraint violations?
Let's say I'm using Spark to migrate some data from Cassandra to Oracle. I
want the insert operation to "ignore if the primary key exists" instead of
failing the whole batch.
Thanks,