Hi all,

I've not got much experience with Spark, but have been reading the Catalyst
and Datasources V2 code/tests to try to get a basic understanding.

I'm interested in trying Catalyst's query planner + optimizer for queries
spanning one-or-more JDBC sources.

Somewhat unusually, I'd like to do this with as little latency as possible,
to see what the experience is like for standard line-of-business apps
(~90/10 read/write ratio).
The reads would return few rows (on the order of 1 to 1,000).

My question is: what configuration settings would you use for something
like this?

I imagine that doing codegen/JIT compilation of the query plan might not be
worth the cost, so maybe you'd want to disable that and do interpretation?
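For concreteness, the sort of settings I had in mind are something like the
following (config names are from the Spark SQL configuration docs as best I
can tell; `spark.sql.codegen.factoryMode` in particular looks like an
internal/testing flag, so please treat this as a sketch rather than a known
good recipe):

```
# Disable whole-stage codegen so small queries run through the
# interpreted path instead of being JIT-compiled per query (assumption:
# compilation cost dominates for tiny, short-lived queries)
spark.sql.codegen.wholeStage    false
# Force interpreted expression evaluation rather than codegen
# (NO_CODEGEN appears to be an internal/testing mode)
spark.sql.codegen.factoryMode   NO_CODEGEN
```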

And possibly you'd want to use query plan config/rules that reduce the time
spent in planning, trading efficiency for latency?
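Again purely as a sketch, I imagined something along these lines in
spark-defaults.conf (whether any of this actually reduces planning time is
exactly what I'd want to measure):

```
# Skip adaptive re-planning and cost-based optimization, trading plan
# quality for lower planning latency (assumption, not measured)
spark.sql.adaptive.enabled          false
spark.sql.cbo.enabled               false
# Optionally exclude individual optimizer rules by class name:
# spark.sql.optimizer.excludedRules  <comma-separated optimizer rule class names>
```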

Does anyone know how you'd configure Spark to test something like this?

Would greatly appreciate any input (even if it's "This is a bad idea and
will never work well").

Thank you =)
