Thanks Khalid!
That was indeed very helpful because it made me more aware of the Spark
SQL side of things than I've been so far.
Your example uses spark.sql.catalog.* settings to register a JDBC data
source in the Catalog and then accesses it from Spark SQL and writes to
it using dataframe.writeTo(). I wasn't aware of the latter so far, my
users tend to dataframe.write.format().mode().options().save() in Pyspark.
I'm going to test how well my MonetDialect works when used like that,
I'm probably going to have to implement a few more functions..
Thanks!
Joeri
On 3/9/26 10:27 AM, Khalid Mammadov wrote:
Please see below my old post that is relevant to your questions.
Perhaps you may find it useful. It's bit old but uses V2 API and
defines catalog.
https://khalidmammadov.github.io/spark/dataframewriter_v2.html
On Mon, 9 Mar 2026, 08:50 Joeri van Ruth via dev,
<[email protected]> wrote:
Hi Mich!
First, thanks for the clear explanation!
However, it doesn't really answer my question. I understand why
the JDBC Data Source works the way it does. I also understand and
accept that moving data OUT of Spark is never going to get as much
attention from the Spark devs as moving INTO Spark gets :)
My question is, if I do need to upload lots of data, the JDBC
source just doesn't cut it. Right now my users export to CSV and
use external tooling to import that in our database, but I would
prefer them to just be able to use dataframe.write().
So if I write a new Data Source specific to my database, how do I
best go about it? Do I use the v2 API and if so, how do I
integrate into its table detection/creation framework? Or do still
use the v1 API even though v2 is available?
I'm currently just adding my own table detection and creation code
but I imagine that is not really 'the Spark way'. But I can't see
the forest for the trees!
Joeri
On 3/6/26 11:06 PM, Mich Talebzadeh wrote:
Hi,
Spark uses the JDBC data source / connector. to read data.
Internally the database performs a scan of the table and streams
the rows out, i.e sequential table scan or streaming result set
Writing through JDBC is different. Each row must be processed as
a transaction operation.
Spark row
│
▼
INSERT statement
│
▼
database engine (monetDB, Oracle etc)
├─ constraint checks
├─ index updates
├─ transaction logging
└─ storage update
So the database does a lot more work for each row. This creates a
row-by-row workflow, which looks serial and inefficient as you
observed
INSERT row 1
INSERT row 2
Spark’s generic JDBC writer deliberately avoids database-specific
features and therefore falls back to the safest universally
supported mechanism i.e standard SQL INSERT statements.
JDBC is designed to work with all relational databases, such
as Oracle, MonetDB etc. Because of that, Spark must use the
lowest common denominator that every database supports.
That lowest common denominator is
INSERT INTO table VALUES (...)
In short, because the JDBC interface is database-agnostic, Spark
uses the safest universally supported operation (standard SQL
INSERT). Since Spark cannot assume the availability of
database-specific bulk loaders, the generic implementation often
inserts rows individually.
HTH,
Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial
Forensics & Metadata Analytics | Transaction Reconstruction |
Audit & Evidence-Based Analytics
**view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
On Fri, 6 Mar 2026 at 12:43, Joeri van Ruth via dev
<[email protected]> wrote:
Hi!
I'm a developer working on MonetDB, a column-oriented SQL
database. See
https://www.monetdb.org.
I've created a JdbcDialect for MonetDB, it seems to work
fine. The
source code is at https://github.com/MonetDB/monetdb-spark.
Unfortunately it turns out the JDBC Data Source is good at
downloading
data from the database but really slow when uploading. The
reason it's
so slow is that it uses a separate INSERT statement for each row.
To work around this, I implemented a custom data source that uses
MonetDB's COPY BINARY INTO feature to more efficiently upload
data.
This is orders of magnitude faster, but it currently only
supports
Append mode. I would like to also support Overwrite mode. This
turned out to be harder than expected.
It seems the table existence checks and creation
functionality is part
of org.apache.spark.sql.catalog.Catalog. Do I have to hook
into that
somehow? And if so, how does my
dataframe
.write()
.source("org.monetdb.spark")
.mode(SaveMode.Overwrite)
.option("url", url)
.option("dbtable", "foo")
.save()
find my catalog? The Catalog interface also contains lots of
methods
that I don't really understand, do I have to implement all of
these?
Can someone give me an overview of the big picture?
Note: another approach would be to not try to implement a v2
DataSource but
more or less "subclass" the v1 JDBC Data Source like the now
abandoned
SQL Server dialect seems to do:
https://github.com/microsoft/sql-spark-connector.
Would that still be the way to go?
Best regards,
Joeri van Ruth
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]