Re: Implementing SaveMode.Overwrite in my v2 DataSource

Mich Talebzadeh Tue, 10 Mar 2026 05:23:14 -0700

Hi Joeri,

For SaveMode.Overwrite you typically advertise
TableCapability.OVERWRITE_BY_FILTER and implement SupportsOverwriteV2. If
you only want to support full-table overwrite, it is reasonable to reject
predicates in canOverwrite() and implement overwrite internally using
TRUNCATE or DROP + CREATE.


OverwriteDynamic is only relevant for partitioned tables, so if MonetDB
tables are not partitioned you can safely ignore it.

*Regarding catalogs: when users call*

df.write.format("org.monetdb.spark").mode("overwrite").save()

Spark does not go through a CatalogPlugin. In this case Spark only resolves
your TableProvider and then proceeds via the Table → WriteBuilder →
BatchWrite flow. Because of that there isn’t a catalog you can hook into or
reuse here. This also means that implementing table detection and creation
inside the connector is actually the expected pattern. The built-in JDBC
data source does essentially the same thing: Spark determines the write
semantics (append vs overwrite), but the connector itself handles the
physical implementation such as TRUNCATE, DROP, or CREATE.

You only need to implement a CatalogPlugin if you wanted Spark SQL to
manage tables through a catalog, for example:

CREATE TABLE monet.<Table> USING monetdb
INSERT INTO monet.<TABLE> SELECT ...

For the format(...).save() path, handling table lifecycle inside the
connector is normal and aligns with how other database connectors behave.

HTH

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics &
Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
Analytics

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Tue, 10 Mar 2026 at 09:39, Joeri van Ruth via dev <[email protected]>
wrote:

> Hi Mich! Thank you!
>
> I've implemented the TableProvider -> SupportsWrite -> etc route and it
> works for SaveMode.Append.
>
> For example, my users currently write (in pyspark):
>
>
> df.write.format('org.monetdb.spark').mode('append').options(driver=DRIVER,
> url=DBURL, dbtable='foo').save()
>
> My MonetTable advertises TableCapability.BATCH_WRITE and implements
> SupportsWrite so this works.
>
> I assume that for Overwrite mode my MonetTable also needs to advertise
> TableCapability.OverwriteByFilter and/or TableCapability.OverwriteDynamic.
>
> For OverwriteByFilter, if I just make
> SupportsOverwriteV2#canOverwrite(Predicate[] predicates) return false if
> there are predicates, I can implement overwriting using DROP + CREATE or
> TRUNCATE.
>
> OverwriteDynamic I'm not sure if it's relevant. The documentation talks
> about partitions, which I don't have. I guess I'll first try without.
>
>
> But if I do it this way I'm basically implementing the table (re-)creation
> by hand. How do I get from df.write.save() to the right Catalog that can do
> the table creation etc? And do I have to fully implement that Catalog
> myself or can I somehow piggybag on the JDBC Catalog?
>
> Thanks for you patience, if you have any left :)
>
> Joeri
>
>
> On 3/9/26 10:58 AM, Mich Talebzadeh wrote:
>
> Hi,
>
> A brief answer for now
>
> Use the DataSource V2 API. V1 is considered legacy for new connectors. In
> V2 you typically implement *TableProvider → Table → SupportsWrite /
> BatchWrite*, and Spark handles table existence checks and creation
> through the Catalog/Table interfaces. Your database-specific bulk load
> logic (e.g., copy) would live inside the BatchWrite implementation.
>
> In practice most database connectors still include some database-specific
> table detection/creation logic, since Spark cannot abstract SQL dialects or
> bulk loaders completely.
>
> HTH
>
> Dr Mich Talebzadeh,
> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
> Analytics
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>
>
> On Mon, 9 Mar 2026 at 08:52, Joeri van Ruth via dev <[email protected]>
> wrote:
>
>> Hi Mich!
>>
>> First, thanks for the clear explanation!
>>
>> However, it doesn't really answer my question. I understand why the JDBC
>> Data Source works the way it does. I also understand and accept that moving
>> data OUT of Spark is never going to get as much attention from the Spark
>> devs as moving INTO Spark gets :)
>>
>> My question is, if I do need to upload lots of data, the JDBC source just
>> doesn't cut it. Right now my users export to CSV and use external tooling
>> to import that in our database, but I would prefer them to just be able to
>> use dataframe.write().
>>
>> So if I write a new Data Source specific to my database, how do I best go
>> about it? Do I use the v2 API and if so, how do I integrate into its table
>> detection/creation framework? Or do still use the v1 API even though v2 is
>> available?
>>
>> I'm currently just adding my own table detection and creation code but I
>> imagine that is not really 'the Spark way'. But I can't see the forest for
>> the trees!
>>
>> Joeri
>>
>>
>> On 3/6/26 11:06 PM, Mich Talebzadeh wrote:
>>
>> Hi,
>>
>> Spark uses the JDBC data source / connector. to read data. Internally the
>> database performs a scan of the table and streams the rows out, i.e
>> sequential table scan or streaming result set
>>
>> Writing through JDBC is different. Each row must be processed as a
>> transaction operation.
>>
>> Spark row
>>    │
>>    ▼
>> INSERT statement
>>    │
>>    ▼
>> database engine (monetDB, Oracle etc)
>>    ├─ constraint checks
>>    ├─ index updates
>>    ├─ transaction logging
>>    └─ storage update
>>
>> So the database does a lot more work for each row. This creates a
>> row-by-row workflow, which looks serial and inefficient as you observed
>>
>> INSERT row 1
>> INSERT row 2
>>
>> Spark’s generic JDBC writer deliberately avoids database-specific
>> features and therefore falls back to the safest universally supported
>> mechanism i.e standard SQL INSERT statements.
>> JDBC is designed to work with all relational databases, such as Oracle,
>> MonetDB etc. Because of that, Spark must use the lowest common denominator
>> that every database supports.
>>
>> That lowest common denominator is
>>
>> INSERT INTO table VALUES (...)
>>
>> In short, because the JDBC interface is database-agnostic, Spark uses the
>> safest universally supported operation (standard SQL INSERT). Since Spark
>> cannot assume the availability of database-specific bulk loaders, the
>> generic implementation often inserts rows individually.
>>
>> HTH,
>>
>> Dr Mich Talebzadeh,
>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>> Analytics
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Fri, 6 Mar 2026 at 12:43, Joeri van Ruth via dev <[email protected]>
>> wrote:
>>
>>> Hi!
>>>
>>> I'm a developer working on MonetDB, a column-oriented SQL database.  See
>>> https://www.monetdb.org.
>>>
>>> I've created a JdbcDialect for MonetDB, it seems to work fine. The
>>> source code is at https://github.com/MonetDB/monetdb-spark.
>>>
>>> Unfortunately it turns out the JDBC Data Source is good at downloading
>>> data from the database but really slow when uploading. The reason it's
>>> so slow is that it uses a separate INSERT statement for each row.
>>>
>>> To work around this, I implemented a custom data source that uses
>>> MonetDB's COPY BINARY INTO feature to more efficiently upload data.
>>> This is orders of magnitude faster, but it currently only supports
>>> Append mode. I would like to also support Overwrite mode. This
>>> turned out to be harder than expected.
>>>
>>> It seems the table existence checks and creation functionality is part
>>> of org.apache.spark.sql.catalog.Catalog. Do I have to hook into that
>>> somehow? And if so, how does my
>>>
>>>     dataframe
>>>         .write()
>>>         .source("org.monetdb.spark")
>>>         .mode(SaveMode.Overwrite)
>>>         .option("url", url)
>>>         .option("dbtable", "foo")
>>>         .save()
>>>
>>> find my catalog? The Catalog interface also contains lots of methods
>>> that I don't really understand, do I have to implement all of these?
>>>
>>> Can someone give me an overview of the big picture?
>>>
>>>
>>> Note: another approach would be to not try to implement a v2 DataSource
>>> but
>>> more or less "subclass" the v1 JDBC Data Source like the now abandoned
>>> SQL Server dialect seems to do:
>>>
>>>     https://github.com/microsoft/sql-spark-connector.
>>>
>>> Would that still be the way to go?
>>>
>>>
>>> Best regards,
>>>
>>> Joeri van Ruth
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: [email protected]
>>>
>>>

Re: Implementing SaveMode.Overwrite in my v2 DataSource

Reply via email to