Re: Implementing SaveMode.Overwrite in my v2 DataSource

Joeri van Ruth via dev Tue, 10 Mar 2026 02:38:49 -0700

Hi Mich! Thank you!

I've implemented the TableProvider -> SupportsWrite -> etc route and itworks for SaveMode.Append.


For example, my users currently write (in pyspark):

df.write.format('org.monetdb.spark').mode('append').options(driver=DRIVER,url=DBURL, dbtable='foo').save()

My MonetTable advertises TableCapability.BATCH_WRITE and implementsSupportsWrite so this works.

I assume that for Overwrite mode my MonetTable also needs to advertiseTableCapability.OverwriteByFilter and/or TableCapability.OverwriteDynamic.

For OverwriteByFilter, if I just makeSupportsOverwriteV2#canOverwrite(Predicate[] predicates) return false ifthere are predicates, I can implement overwriting using DROP + CREATE orTRUNCATE.

OverwriteDynamic I'm not sure if it's relevant. The documentation talksabout partitions, which I don't have. I guess I'll first try without.

But if I do it this way I'm basically implementing the table(re-)creation by hand. How do I get from df.write.save() to the rightCatalog that can do the table creation etc? And do I have to fullyimplement that Catalog myself or can I somehow piggybag on the JDBC Catalog?


Thanks for you patience, if you have any left :)

Joeri


On 3/9/26 10:58 AM, Mich Talebzadeh wrote:

Hi,

A brief answer for now

Use the DataSource V2 API. V1 is considered legacy for new connectors.In V2 you typically implement /TableProvider → Table → SupportsWrite /BatchWrite/, and Spark handles table existence checks and creationthrough the Catalog/Table interfaces. Your database-specific bulk loadlogic (e.g., copy) would live inside the BatchWrite implementation.

In practice most database connectors still include somedatabase-specific table detection/creation logic, since Spark cannotabstract SQL dialects or bulk loaders completely.


HTH

Dr Mich Talebzadeh,

Data Scientist | Distributed Systems (Spark) | Financial Forensics &Metadata Analytics | Transaction Reconstruction | Audit &Evidence-Based Analytics

**view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Mon, 9 Mar 2026 at 08:52, Joeri van Ruth via dev<[email protected]> wrote:


    Hi Mich!

    First, thanks for the clear explanation!

    However, it doesn't really answer my question. I understand why
    the JDBC Data Source works the way it does. I also understand and
    accept that moving data OUT of Spark is never going to get as much
    attention from the Spark devs as moving INTO Spark gets :)

    My question is, if I do need to upload lots of data, the JDBC
    source just doesn't cut it. Right now my users export to CSV and
    use external tooling to import that in our database, but I would
    prefer them to just be able to use dataframe.write().

    So if I write a new Data Source specific to my database, how do I
    best go about it? Do I use the v2 API and if so, how do I
    integrate into its table detection/creation framework? Or do still
    use the v1 API even though v2 is available?

    I'm currently just adding my own table detection and creation code
    but I imagine that is not really 'the Spark way'. But I can't see
    the forest for the trees!

    Joeri


    On 3/6/26 11:06 PM, Mich Talebzadeh wrote:

    Hi,

    Spark uses the JDBC data source / connector. to read data.
    Internally the database performs a scan of the table and streams
    the rows out, i.e   sequential table scan or streaming result set

    Writing through JDBC is different. Each row must be processed as
    a transaction operation.

    Spark row
       │
       ▼
    INSERT statement
       │
       ▼
    database engine (monetDB, Oracle etc)
       ├─ constraint checks
       ├─ index updates
       ├─ transaction logging
       └─ storage update

    So the database does a lot more work for each row. This creates a
    row-by-row workflow, which looks serial and inefficient as you
    observed

    INSERT row 1
    INSERT row 2

    Spark’s generic JDBC writer deliberately avoids database-specific
    features and therefore falls back to the safest universally
    supported mechanism i.e standard SQL INSERT statements.
    JDBC is designed to work with all relational databases, such
    as Oracle, MonetDB etc. Because of that, Spark must use the
    lowest common denominator that every database supports.

    That lowest common denominator is

    INSERT INTO table VALUES (...)

    In short, because the JDBC interface is database-agnostic, Spark
    uses the safest universally supported operation (standard SQL
    INSERT). Since Spark cannot assume the availability of
    database-specific bulk loaders, the generic implementation often
    inserts rows individually.

    HTH,

    Dr Mich Talebzadeh,
    Data Scientist | Distributed Systems (Spark) | Financial
    Forensics & Metadata Analytics | Transaction Reconstruction |
    Audit & Evidence-Based Analytics

    **view my Linkedin profile
    <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





    On Fri, 6 Mar 2026 at 12:43, Joeri van Ruth via dev
    <[email protected]> wrote:

        Hi!

        I'm a developer working on MonetDB, a column-oriented SQL
        database.  See
        https://www.monetdb.org.

        I've created a JdbcDialect for MonetDB, it seems to work
        fine. The
        source code is at https://github.com/MonetDB/monetdb-spark.

        Unfortunately it turns out the JDBC Data Source is good at
        downloading
        data from the database but really slow when uploading. The
        reason it's
        so slow is that it uses a separate INSERT statement for each row.

        To work around this, I implemented a custom data source that uses
        MonetDB's COPY BINARY INTO feature to more efficiently upload
        data.
        This is orders of magnitude faster, but it currently only
        supports
        Append mode. I would like to also support Overwrite mode. This
        turned out to be harder than expected.

        It seems the table existence checks and creation
        functionality is part
        of org.apache.spark.sql.catalog.Catalog. Do I have to hook
        into that
        somehow? And if so, how does my

            dataframe
                .write()
                .source("org.monetdb.spark")
                .mode(SaveMode.Overwrite)
                .option("url", url)
                .option("dbtable", "foo")
                .save()

        find my catalog? The Catalog interface also contains lots of
        methods
        that I don't really understand, do I have to implement all of
        these?

        Can someone give me an overview of the big picture?


        Note: another approach would be to not try to implement a v2
        DataSource but
        more or less "subclass" the v1 JDBC Data Source like the now
        abandoned
        SQL Server dialect seems to do:

        https://github.com/microsoft/sql-spark-connector.

        Would that still be the way to go?


        Best regards,

        Joeri van Ruth

        ---------------------------------------------------------------------
        To unsubscribe e-mail: [email protected]

Re: Implementing SaveMode.Overwrite in my v2 DataSource

Reply via email to