Re: Implementing SaveMode.Overwrite in my v2 DataSource

Mich Talebzadeh Mon, 09 Mar 2026 02:59:12 -0700

Hi,

A brief answer for now.


Use the DataSource V2 API. V1 is considered legacy for new connectors. In
V2 you typically implement *TableProvider → Table → SupportsWrite /
BatchWrite*, and Spark handles table existence checks and creation through
the catalog interfaces (TableCatalog). Your database-specific bulk load
logic (e.g., MonetDB's COPY BINARY INTO) would live inside the BatchWrite
implementation and its per-partition writers.
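
To make the shape of that write path a bit more concrete, here is a minimal, dependency-free Java model of it. It deliberately does not use Spark's classes: PartitionWriter and JobWrite are hypothetical stand-ins for the roles that Spark's DataWriter (one per partition) and BatchWrite (one per job) play, and the "table" is just an in-memory list. Passing the rows back in the commit message is a simplification for the sketch; a real connector would push them to the database's bulk loader from the executors and only report what was staged.

```java
import java.util.ArrayList;
import java.util.List;

// Dependency-free model of a V2-style batch write. None of these types are
// Spark classes; they are hypothetical stand-ins for illustration only.
final class WritePathSketch {

    /** Per-partition writer: the role Spark's DataWriter plays on an executor. */
    static final class PartitionWriter {
        private final List<String> staged = new ArrayList<>();

        void write(String row) {
            staged.add(row); // a real connector would feed the bulk loader here
        }

        /** Hands the staged rows back as this task's "commit message". */
        List<String> commit() {
            return new ArrayList<>(staged);
        }
    }

    /** Job-level write: the role Spark's BatchWrite plays on the driver. */
    static final class JobWrite {
        private final List<String> table;
        private final boolean overwrite;

        JobWrite(List<String> existingRows, boolean overwrite) {
            this.table = new ArrayList<>(existingRows);
            this.overwrite = overwrite;
        }

        /** Called once, after every partition writer has committed. */
        List<String> commit(List<List<String>> messages) {
            if (overwrite) {
                table.clear(); // Overwrite: discard the old rows before loading
            }
            for (List<String> staged : messages) {
                table.addAll(staged);
            }
            return table;
        }
    }

    /** Drives one job: one writer per partition, then a single job-level commit. */
    static List<String> run(List<String> existing, boolean overwrite,
                            List<List<String>> partitions) {
        JobWrite job = new JobWrite(existing, overwrite);
        List<List<String>> messages = new ArrayList<>();
        for (List<String> partition : partitions) {
            PartitionWriter writer = new PartitionWriter();
            for (String row : partition) {
                writer.write(row);
            }
            messages.add(writer.commit());
        }
        return job.commit(messages);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(List.of("a", "b"), List.of("c"));
        System.out.println(run(List.of("old"), true, partitions));
        System.out.println(run(List.of("old"), false, partitions));
    }
}
```

The point of the split is that the destructive step (clearing the table) sits in the single job-level commit, so it only happens once all partitions have succeeded.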

In practice most database connectors still include some database-specific
table detection/creation logic, since Spark cannot abstract SQL dialects or
bulk loaders completely.
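
As an illustration of what such connector-side logic tends to look like, here is a small sketch in plain Java that only plans the SQL steps for a save. Everything in it is an assumption for the sake of the example: the class and method names are hypothetical, the truncate flag is modelled on the choice the JDBC source offers through its truncate option, the SQL is generic rather than checked against any particular dialect, and the final step is a placeholder comment rather than a real bulk load command.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner for the statements a connector might run before a
// bulk load. Names and SQL are illustrative, not taken from Spark or MonetDB.
final class OverwritePlanSketch {

    enum Mode { APPEND, OVERWRITE }

    /**
     * Returns the ordered steps for one save.
     *
     * @param tableExists result of the connector's own existence check
     * @param truncate    on overwrite, keep the table definition and only empty it
     * @param columnsDdl  column definitions derived from the DataFrame schema
     */
    static List<String> plan(Mode mode, boolean tableExists, boolean truncate,
                             String table, String columnsDdl) {
        List<String> steps = new ArrayList<>();
        String create = "CREATE TABLE " + table + " (" + columnsDdl + ")";
        if (!tableExists) {
            steps.add(create); // both modes create a missing table
        } else if (mode == Mode.OVERWRITE) {
            if (truncate) {
                steps.add("TRUNCATE TABLE " + table);
            } else {
                steps.add("DROP TABLE " + table);
                steps.add(create);
            }
        }
        // Placeholder for the database-specific bulk load step.
        steps.add("-- bulk load into " + table);
        return steps;
    }

    public static void main(String[] args) {
        for (String step : plan(Mode.OVERWRITE, true, false, "foo", "id INT, name VARCHAR(32)")) {
            System.out.println(step);
        }
    }
}
```

Keeping the planning separate from the execution makes the Overwrite behaviour easy to test without a database connection.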

HTH

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics &
Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
Analytics

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Mon, 9 Mar 2026 at 08:52, Joeri van Ruth via dev <[email protected]>
wrote:

> Hi Mich!
>
> First, thanks for the clear explanation!
>
> However, it doesn't really answer my question. I understand why the JDBC
> Data Source works the way it does. I also understand and accept that moving
> data OUT of Spark is never going to get as much attention from the Spark
> devs as moving INTO Spark gets :)
>
> My question is, if I do need to upload lots of data, the JDBC source just
> doesn't cut it. Right now my users export to CSV and use external tooling
> to import that in our database, but I would prefer them to just be able to
> use dataframe.write().
>
> So if I write a new Data Source specific to my database, how do I best go
> about it? Do I use the v2 API and if so, how do I integrate into its table
> detection/creation framework? Or do I still use the v1 API even though v2 is
> available?
>
> I'm currently just adding my own table detection and creation code but I
> imagine that is not really 'the Spark way'. But I can't see the forest for
> the trees!
>
> Joeri
>
>
> On 3/6/26 11:06 PM, Mich Talebzadeh wrote:
>
> Hi,
>
> Spark uses the JDBC data source / connector to read data. Internally the
> database scans the table and streams the rows out, i.e., a sequential
> table scan with a streaming result set.
>
> Writing through JDBC is different. Each row must be processed as a
> transactional operation.
>
> Spark row
>    │
>    ▼
> INSERT statement
>    │
>    ▼
> database engine (MonetDB, Oracle etc.)
>    ├─ constraint checks
>    ├─ index updates
>    ├─ transaction logging
>    └─ storage update
>
> So the database does a lot more work for each row. This creates a
> row-by-row workflow, which looks serial and inefficient, as you observed:
>
> INSERT row 1
> INSERT row 2
>
> Spark’s generic JDBC writer deliberately avoids database-specific features
> and therefore falls back to the safest universally supported mechanism,
> i.e., standard SQL INSERT statements.
> JDBC is designed to work with all relational databases, such as Oracle,
> MonetDB etc. Because of that, Spark must use the lowest common denominator
> that every database supports.
>
> That lowest common denominator is:
>
> INSERT INTO table VALUES (...)
>
> In short, because the JDBC interface is database-agnostic, Spark uses the
> safest universally supported operation (standard SQL INSERT). Since Spark
> cannot assume the availability of database-specific bulk loaders, the
> generic implementation often inserts rows individually.
>
> HTH,
>
> Dr Mich Talebzadeh,
> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
> Analytics
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>
>
> On Fri, 6 Mar 2026 at 12:43, Joeri van Ruth via dev <[email protected]>
> wrote:
>
>> Hi!
>>
>> I'm a developer working on MonetDB, a column-oriented SQL database.  See
>> https://www.monetdb.org.
>>
>> I've created a JdbcDialect for MonetDB, and it seems to work fine. The
>> source code is at https://github.com/MonetDB/monetdb-spark.
>>
>> Unfortunately it turns out the JDBC Data Source is good at downloading
>> data from the database but really slow when uploading. The reason it's
>> so slow is that it uses a separate INSERT statement for each row.
>>
>> To work around this, I implemented a custom data source that uses
>> MonetDB's COPY BINARY INTO feature to more efficiently upload data.
>> This is orders of magnitude faster, but it currently only supports
>> Append mode. I would like to also support Overwrite mode. This
>> turned out to be harder than expected.
>>
>> It seems the table existence checks and creation functionality is part
>> of org.apache.spark.sql.catalog.Catalog. Do I have to hook into that
>> somehow? And if so, how does my
>>
>>     dataframe
>>         .write()
>>         .format("org.monetdb.spark")
>>         .mode(SaveMode.Overwrite)
>>         .option("url", url)
>>         .option("dbtable", "foo")
>>         .save()
>>
>> find my catalog? The Catalog interface also contains lots of methods
>> that I don't really understand, do I have to implement all of these?
>>
>> Can someone give me an overview of the big picture?
>>
>>
>> Note: another approach would be to not try to implement a v2 DataSource
>> but more or less "subclass" the v1 JDBC Data Source like the now abandoned
>> SQL Server dialect seems to do:
>>
>>     https://github.com/microsoft/sql-spark-connector
>>
>> Would that still be the way to go?
>>
>>
>> Best regards,
>>
>> Joeri van Ruth
>>
