Following up on this in case anyone runs across it in the archives in the future.

From reading through the config docs and trying various combinations, I've discovered that:
- You don't want to disable codegen. In basic testing, disabling it roughly
  doubled the time to run simple, few-column/few-row queries. You can test this
  yourself by setting an internal property; it's only picked up after setting
  "spark.testing" to "true" in the system properties:

    System.setProperty("spark.testing", "true")
    val spark = SparkSession.builder()
      .config("spark.sql.codegen.wholeStage", "false")
      .config("spark.sql.codegen.factoryMode", "NO_CODEGEN")
      .getOrCreate()

- The following gave the best performance (there's a short end-to-end example
  at the bottom of this message). I don't know whether enabling CBO did much:

    val spark = SparkSession.builder()
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.kryo.unsafe", "true")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.cbo.joinReorder.dp.star.filter", "true")
      .config("spark.sql.cbo.joinReorder.enabled", "true")
      .config("spark.sql.cbo.planStats.enabled", "true")
      .config("spark.sql.cbo.starSchemaDetection", "true")
      .getOrCreate()

If you're running on a more recent JDK, you'll also need to pass "--add-opens"
flags for a few packages for "spark.kryo.unsafe" to work (example flags at the
end of this message).

On Mon, May 16, 2022 at 12:55 PM Gavin Ray <ray.gavi...@gmail.com> wrote:

> Hi all,
>
> I've not got much experience with Spark, but have been reading the
> Catalyst and Datasources V2 code/tests to try to get a basic understanding.
>
> I'm interested in trying Catalyst's query planner + optimizer for queries
> spanning one or more JDBC sources.
>
> Somewhat unusually, I'd like to do this with as low latency as possible, to
> see what the experience for standard line-of-business apps is like (~90/10
> read/write ratio). Few rows would be returned in the reads (something on
> the order of 1 to 1,000).
>
> My question is: what configuration settings would you want to use for
> something like this?
>
> I imagine that doing codegen/JIT compilation of the query plan might not
> be worth the cost, so maybe you'd want to disable that and do
> interpretation?
>
> And possibly you'd want to use query-planning config/rules that reduce the
> time spent in planning, trading efficiency for latency?
>
> Does anyone know how you'd configure Spark to test something like this?
>
> Would greatly appreciate any input (even if it's "This is a bad idea and
> will never work well").
>
> Thank you =)
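For reference, the "--add-opens" flags I mentioned looked roughly like the set
below. I based this on the list that Spark's own launch scripts pass on newer
JDKs; the exact subset needed for Kryo unsafe may be smaller, so treat it as a
starting point rather than the definitive list:

    --add-opens=java.base/java.lang=ALL-UNNAMED
    --add-opens=java.base/java.lang.invoke=ALL-UNNAMED
    --add-opens=java.base/java.io=ALL-UNNAMED
    --add-opens=java.base/java.nio=ALL-UNNAMED
    --add-opens=java.base/java.util=ALL-UNNAMED
    --add-opens=java.base/sun.nio.ch=ALL-UNNAMED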
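And to tie it back to the original question quoted above, here's a minimal
sketch of the kind of small JDBC read the thread is about, using the session
config from my reply. The connection URL, table, and column names are
placeholders, not from a real setup:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.kryo.unsafe", "true")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.cbo.enabled", "true")
      .getOrCreate()

    // Register a JDBC source as a temp view (URL and table are placeholders).
    spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/app")
      .option("dbtable", "orders")
      .load()
      .createOrReplaceTempView("orders")

    // A typical few-column/few-row read for a line-of-business app.
    spark.sql("SELECT id, total FROM orders WHERE customer_id = 42").show()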