Re: Implementing SaveMode.Overwrite in my v2 DataSource

Joeri van Ruth via dev Wed, 11 Mar 2026 01:38:11 -0700

Hi Mich!

Thanks for the perfect answer!


Joeri

On 3/10/26 1:20 PM, Mich Talebzadeh wrote:


Hi Joeri,

ForSaveMode.Overwrite you typically advertiseTableCapability.OVERWRITE_BY_FILTER and implement SupportsOverwriteV2.If you only want to support full-table overwrite, it is reasonable toreject predicates in canOverwrite() and implement overwrite internallyusing TRUNCATE or DROP + CREATE.

OverwriteDynamic is only relevant for partitioned tables, so ifMonetDB tables are not partitioned you can safely ignore it.


/Regarding catalogs: when users call/

df.write.format("org.monetdb.spark").mode("overwrite").save()

Spark does not go through a CatalogPlugin. In this case Spark onlyresolves your TableProvider and then proceeds via theTable →WriteBuilder → BatchWrite flow. Because of that there isn’t a catalogyou can hook into or reuse here. This also means that implementingtable detection and creation inside the connector is actually theexpected pattern. The built-in JDBC data source does essentially thesame thing: Spark determines the write semantics (append vsoverwrite), but the connector itself handles the physicalimplementation such as TRUNCATE, DROP, or CREATE.

You only need to implement a CatalogPlugin if you wanted Spark SQL tomanage tables through a catalog, for example:


CREATE TABLE monet.<Table> USING monetdb
INSERT INTO monet.<TABLE> SELECT ...

For theformat(...).save() path, handling table lifecycle inside theconnector is normal and aligns with how other database connectors behave.


HTH

Dr Mich Talebzadeh,

Data Scientist | Distributed Systems (Spark) | Financial Forensics &Metadata Analytics | Transaction Reconstruction | Audit &Evidence-Based Analytics

**view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Tue, 10 Mar 2026 at 09:39, Joeri van Ruth via dev<[email protected]> wrote:


    Hi Mich! Thank you!

    I've implemented the TableProvider -> SupportsWrite -> etc route
    and it works for SaveMode.Append.

    For example, my users currently write (in pyspark):

    df.write.format('org.monetdb.spark').mode('append').options(driver=DRIVER,
    url=DBURL, dbtable='foo').save()

    My MonetTable advertises TableCapability.BATCH_WRITE and
    implements SupportsWrite so this works.

    I assume that for Overwrite mode my MonetTable also needs to
    advertise TableCapability.OverwriteByFilter and/or
    TableCapability.OverwriteDynamic.

    For OverwriteByFilter, if I just make
    SupportsOverwriteV2#canOverwrite(Predicate[] predicates) return
    false if there are predicates, I can implement overwriting using
    DROP + CREATE or TRUNCATE.

    OverwriteDynamic I'm not sure if it's relevant. The documentation
    talks about partitions, which I don't have. I guess I'll first try
    without.


    But if I do it this way I'm basically implementing the table
    (re-)creation by hand. How do I get from df.write.save() to the
    right Catalog that can do the table creation etc? And do I have to
    fully implement that Catalog myself or can I somehow piggybag on
    the JDBC Catalog?

    Thanks for you patience, if you have any left :)

    Joeri


    On 3/9/26 10:58 AM, Mich Talebzadeh wrote:

    Hi,

    A brief answer for now

    Use the DataSource V2 API. V1 is considered legacy for new
    connectors. In V2 you typically implement /TableProvider → Table
    → SupportsWrite / BatchWrite/, and Spark handles table existence
    checks and creation through the Catalog/Table interfaces. Your
    database-specific bulk load logic (e.g., copy) would live inside
    the BatchWrite implementation.

    In practice most database connectors still include some
    database-specific table detection/creation logic, since Spark
    cannot abstract SQL dialects or bulk loaders completely.

    HTH

    Dr Mich Talebzadeh,
    Data Scientist | Distributed Systems (Spark) | Financial
    Forensics & Metadata Analytics | Transaction Reconstruction |
    Audit & Evidence-Based Analytics

    **view my Linkedin profile
    <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





    On Mon, 9 Mar 2026 at 08:52, Joeri van Ruth via dev
    <[email protected]> wrote:

        Hi Mich!

        First, thanks for the clear explanation!

        However, it doesn't really answer my question. I understand
        why the JDBC Data Source works the way it does. I also
        understand and accept that moving data OUT of Spark is never
        going to get as much attention from the Spark devs as moving
        INTO Spark gets :)

        My question is, if I do need to upload lots of data, the JDBC
        source just doesn't cut it. Right now my users export to CSV
        and use external tooling to import that in our database, but
        I would prefer them to just be able to use dataframe.write().

        So if I write a new Data Source specific to my database, how
        do I best go about it? Do I use the v2 API and if so, how do
        I integrate into its table detection/creation framework? Or
        do still use the v1 API even though v2 is available?

        I'm currently just adding my own table detection and creation
        code but I imagine that is not really 'the Spark way'. But I
        can't see the forest for the trees!

        Joeri


        On 3/6/26 11:06 PM, Mich Talebzadeh wrote:

        Hi,

        Spark uses the JDBC data source / connector. to read data.
        Internally the database performs a scan of the table and
        streams the rows out, i.e   sequential table scan or
        streaming result set

        Writing through JDBC is different. Each row must be
        processed as a transaction operation.

        Spark row
           │
           ▼
        INSERT statement
           │
           ▼
        database engine (monetDB, Oracle etc)
           ├─ constraint checks
           ├─ index updates
           ├─ transaction logging
           └─ storage update

        So the database does a lot more work for each row. This
        creates a row-by-row workflow, which looks serial and
        inefficient as you observed

        INSERT row 1
        INSERT row 2

        Spark’s generic JDBC writer deliberately avoids
        database-specific features and therefore falls back to the
        safest universally supported mechanism i.e standard SQL
        INSERT statements.
        JDBC is designed to work with all relational databases, such
        as Oracle, MonetDB etc. Because of that, Spark must use the
        lowest common denominator that every database supports.

        That lowest common denominator is

        INSERT INTO table VALUES (...)

        In short, because the JDBC interface is database-agnostic,
        Spark uses the safest universally supported operation
        (standard SQL INSERT). Since Spark cannot assume the
        availability of database-specific bulk loaders, the generic
        implementation often inserts rows individually.

        HTH,

        Dr Mich Talebzadeh,
        Data Scientist | Distributed Systems (Spark) | Financial
        Forensics & Metadata Analytics | Transaction Reconstruction
        | Audit & Evidence-Based Analytics

        **view my Linkedin profile
        <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





        On Fri, 6 Mar 2026 at 12:43, Joeri van Ruth via dev
        <[email protected]> wrote:

            Hi!

            I'm a developer working on MonetDB, a column-oriented
            SQL database.  See
            https://www.monetdb.org.

            I've created a JdbcDialect for MonetDB, it seems to work
            fine. The
            source code is at https://github.com/MonetDB/monetdb-spark.

            Unfortunately it turns out the JDBC Data Source is good
            at downloading
            data from the database but really slow when uploading.
            The reason it's
            so slow is that it uses a separate INSERT statement for
            each row.

            To work around this, I implemented a custom data source
            that uses
            MonetDB's COPY BINARY INTO feature to more efficiently
            upload data.
            This is orders of magnitude faster, but it currently
            only supports
            Append mode. I would like to also support Overwrite
            mode. This
            turned out to be harder than expected.

            It seems the table existence checks and creation
            functionality is part
            of org.apache.spark.sql.catalog.Catalog. Do I have to
            hook into that
            somehow? And if so, how does my

                dataframe
                    .write()
                    .source("org.monetdb.spark")
                    .mode(SaveMode.Overwrite)
                    .option("url", url)
                    .option("dbtable", "foo")
                    .save()

            find my catalog? The Catalog interface also contains
            lots of methods
            that I don't really understand, do I have to implement
            all of these?

            Can someone give me an overview of the big picture?


            Note: another approach would be to not try to implement
            a v2 DataSource but
            more or less "subclass" the v1 JDBC Data Source like the
            now abandoned
            SQL Server dialect seems to do:

            https://github.com/microsoft/sql-spark-connector.

            Would that still be the way to go?


            Best regards,

            Joeri van Ruth

            
---------------------------------------------------------------------
            To unsubscribe e-mail: [email protected]

Re: Implementing SaveMode.Overwrite in my v2 DataSource

Reply via email to