Hi,

A brief answer for now.
Use the DataSource V2 API; V1 is considered legacy for new connectors. In V2
you typically implement TableProvider → Table → SupportsWrite / BatchWrite,
and Spark handles table existence checks and creation through the
Catalog/Table interfaces. Your database-specific bulk-load logic (e.g. COPY)
would live inside the BatchWrite implementation. In practice, most database
connectors still include some database-specific table detection/creation
logic, since Spark cannot fully abstract SQL dialects or bulk loaders.

HTH

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics &
Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
Analytics

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


On Mon, 9 Mar 2026 at 08:52, Joeri van Ruth via dev <[email protected]> wrote:

> Hi Mich!
>
> First, thanks for the clear explanation!
>
> However, it doesn't really answer my question. I understand why the JDBC
> Data Source works the way it does. I also understand and accept that moving
> data OUT of Spark is never going to get as much attention from the Spark
> devs as moving data INTO Spark gets :)
>
> My question is, if I do need to upload lots of data, the JDBC source just
> doesn't cut it. Right now my users export to CSV and use external tooling
> to import that into our database, but I would prefer them to just be able
> to use dataframe.write().
>
> So if I write a new Data Source specific to my database, how do I best go
> about it? Do I use the v2 API, and if so, how do I integrate into its table
> detection/creation framework? Or do I still use the v1 API even though v2
> is available?
>
> I'm currently just adding my own table detection and creation code, but I
> imagine that is not really 'the Spark way'. I can't see the forest for
> the trees!
>
> Joeri
>
>
> On 3/6/26 11:06 PM, Mich Talebzadeh wrote:
>
> Hi,
>
> Spark uses the JDBC data source/connector to read data.
> Internally, the database performs a scan of the table and streams the rows
> out, i.e. a sequential table scan or a streaming result set.
>
> Writing through JDBC is different. Each row must be processed as a
> transactional operation:
>
>     Spark row
>        │
>        ▼
>     INSERT statement
>        │
>        ▼
>     database engine (MonetDB, Oracle etc.)
>        ├─ constraint checks
>        ├─ index updates
>        ├─ transaction logging
>        └─ storage update
>
> So the database does a lot more work for each row. This creates a
> row-by-row workflow, which looks serial and inefficient, as you observed:
>
>     INSERT row 1
>     INSERT row 2
>
> Spark's generic JDBC writer deliberately avoids database-specific features
> and therefore falls back to the safest universally supported mechanism,
> i.e. standard SQL INSERT statements.
>
> JDBC is designed to work with all relational databases, such as Oracle,
> MonetDB etc. Because of that, Spark must use the lowest common denominator
> that every database supports. That lowest common denominator is
>
>     INSERT INTO table VALUES (...)
>
> In short, because the JDBC interface is database-agnostic, Spark uses the
> safest universally supported operation (standard SQL INSERT). Since Spark
> cannot assume the availability of database-specific bulk loaders, the
> generic implementation often inserts rows individually.
>
> HTH,
>
> Dr Mich Talebzadeh,
> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
> Analytics
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
> On Fri, 6 Mar 2026 at 12:43, Joeri van Ruth via dev <[email protected]>
> wrote:
>
>> Hi!
>>
>> I'm a developer working on MonetDB, a column-oriented SQL database. See
>> https://www.monetdb.org.
>>
>> I've created a JdbcDialect for MonetDB, and it seems to work fine. The
>> source code is at https://github.com/MonetDB/monetdb-spark.
>>
>> Unfortunately it turns out the JDBC Data Source is good at downloading
>> data from the database but really slow when uploading. The reason it's
>> so slow is that it uses a separate INSERT statement for each row.
>>
>> To work around this, I implemented a custom data source that uses
>> MonetDB's COPY BINARY INTO feature to upload data more efficiently.
>> This is orders of magnitude faster, but it currently only supports
>> Append mode. I would like to also support Overwrite mode. This
>> turned out to be harder than expected.
>>
>> It seems the table existence checks and creation functionality are part
>> of org.apache.spark.sql.catalog.Catalog. Do I have to hook into that
>> somehow? And if so, how does my
>>
>>     dataframe
>>         .write()
>>         .format("org.monetdb.spark")
>>         .mode(SaveMode.Overwrite)
>>         .option("url", url)
>>         .option("dbtable", "foo")
>>         .save()
>>
>> find my catalog? The Catalog interface also contains lots of methods
>> that I don't really understand; do I have to implement all of these?
>>
>> Can someone give me an overview of the big picture?
>>
>>
>> Note: another approach would be not to implement a v2 DataSource but to
>> more or less "subclass" the v1 JDBC Data Source, like the now-abandoned
>> SQL Server connector seems to do:
>>
>> https://github.com/microsoft/sql-spark-connector
>>
>> Would that still be the way to go?
>>
>>
>> Best regards,
>>
>> Joeri van Ruth
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: [email protected]
>>
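P.S. To make the V2 shape concrete, here is a rough sketch of how the write
path could fit together. All MonetDb* class names are hypothetical; only the
org.apache.spark.sql.connector interfaces are real (Spark 3.x), details vary
between minor versions, and the database-specific bodies are left out, so
treat this as an outline rather than a working connector:

```java
// Hypothetical DataSource V2 write-path skeleton for a MonetDB-style
// connector. Requires the Spark SQL jars on the classpath.
import java.util.Map;
import java.util.Set;

import org.apache.spark.sql.connector.catalog.SupportsWrite;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.connector.write.BatchWrite;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.LogicalWriteInfo;
import org.apache.spark.sql.connector.write.PhysicalWriteInfo;
import org.apache.spark.sql.connector.write.SupportsTruncate;
import org.apache.spark.sql.connector.write.Write;
import org.apache.spark.sql.connector.write.WriteBuilder;
import org.apache.spark.sql.connector.write.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class MonetDbProvider implements TableProvider {
  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    // Look up options.get("dbtable") in the database and build a StructType
    // from its columns; this is where "does the table exist?" is answered.
    throw new UnsupportedOperationException("left out of the sketch");
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning,
                        Map<String, String> properties) {
    return new MonetDbTable(schema, properties);
  }
}

class MonetDbTable implements SupportsWrite {
  private final StructType schema;
  private final Map<String, String> properties;

  MonetDbTable(StructType schema, Map<String, String> properties) {
    this.schema = schema;
    this.properties = properties;
  }

  @Override public String name() { return properties.get("dbtable"); }
  @Override public StructType schema() { return schema; }

  @Override
  public Set<TableCapability> capabilities() {
    // Advertising TRUNCATE is what lets overwrite-style writes reach
    // this connector instead of being rejected by the analyzer.
    return Set.of(TableCapability.BATCH_WRITE, TableCapability.TRUNCATE);
  }

  @Override
  public WriteBuilder newWriteBuilder(LogicalWriteInfo info) {
    return new MonetDbWriteBuilder();
  }
}

class MonetDbWriteBuilder implements WriteBuilder, SupportsTruncate {
  private boolean truncate = false;

  @Override
  public WriteBuilder truncate() {  // called by Spark for overwrite
    truncate = true;
    return this;
  }

  @Override
  public Write build() {
    final boolean doTruncate = truncate;
    return new Write() {
      @Override
      public BatchWrite toBatch() { return new MonetDbBatchWrite(doTruncate); }
    };
  }
}

class MonetDbBatchWrite implements BatchWrite {
  private final boolean truncate;

  MonetDbBatchWrite(boolean truncate) { this.truncate = truncate; }

  @Override
  public DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) {
    // Runs on the driver: truncate the target table here if requested, then
    // return a serializable factory whose DataWriters each open their own
    // COPY BINARY INTO stream on the executors.
    throw new UnsupportedOperationException("left out of the sketch");
  }

  @Override public void commit(WriterCommitMessage[] messages) { /* finalize */ }
  @Override public void abort(WriterCommitMessage[] messages) { /* roll back */ }
}
```

With this shape, the truncate flag (rather than the v1 Catalog interface)
carries the overwrite intent down to your bulk loader; table detection stays
in inferSchema.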
