[ https://issues.apache.org/jira/browse/SPARK-25802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-25802:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                           3.1.0

> Use JDBC Oracle Binds from Spark SQL
> ------------------------------------
>
>                 Key: SPARK-25802
>                 URL: https://issues.apache.org/jira/browse/SPARK-25802
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Nathan Loyer
>            Priority: Major
>
> In case those reading aren't aware: any time a query is run against Oracle,
> the database creates an execution plan and caches it. Before planning a
> query, Oracle first checks whether it can reuse a plan from the cache. When
> a query uses literals, Oracle has to create a new plan even if the cache
> holds one that matches everything except that literal value, which is
> exactly the case for the Spark-generated queries. Using binds/parameters
> instead lets the database reuse previous plans and reduces its load.
> My team uses Spark SQL with JDBC to query large amounts of data from
> production Oracle databases. The queries built by the JDBCRDD class today
> make our databases do more work than they really need to, which increases
> load on our databases and affects our users. For this reason we've been
> investigating whether it is possible to use Spark SQL with query
> binds/parameters. From what I can tell from reviewing the documentation and
> diving into the Spark source code, this does not appear to be possible
> today.
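To make the plan-cache point above concrete, here is a minimal sketch (the table and column names are illustrative, taken from the examples below, not from any real schema): Oracle keys its cursor cache on the SQL text, so a query that inlines a literal produces a different text for every value, while a bound query keeps one constant text.

```java
// Minimal sketch (assumed table/column names) of why binds help Oracle's
// plan cache: the SQL text is the cache key, so an inlined literal yields a
// distinct text (and a hard parse) per value, while a bound query keeps one
// constant text that maps to one cached plan.
public class BindVsLiteral {

    // Inlined literal: a new SQL text for every value.
    static String literalQuery(String timestamp) {
        return "SELECT c1, c2, c3 FROM tableName WHERE c4 > TO_DATE('"
                + timestamp + "', 'YYYY-MM-DD HH24:MI:SS')";
    }

    // Bind placeholder: the same SQL text for every value; the value is
    // supplied separately, e.g. via PreparedStatement.setString(1, value).
    static final String BOUND_QUERY =
            "SELECT c1, c2, c3 FROM tableName WHERE c4 > TO_DATE(?, 'YYYY-MM-DD HH24:MI:SS')";

    public static void main(String[] args) {
        String a = literalQuery("2018-01-01 00:00:00");
        String b = literalQuery("2018-02-01 00:00:00");
        System.out.println(a.equals(b));  // two different texts, two plans
        System.out.println(BOUND_QUERY);  // one text regardless of the value
    }
}
```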
> Our spark usage looks like this:
> {code:java}
> spark.read()
>     .format("jdbc")
>     .option("url", connectionUrl)
>     .option("dbtable", "( select c1, c2, c3 from tableName where c4 > TO_DATE('2018-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') )")
>     .option("driver", "oracle.jdbc.OracleDriver")
>     .option("fetchSize", fetchSize)
>     .option("lowerBound", minId)
>     .option("upperBound", maxId)
>     .option("partitionColumn", "ID")
>     .option("numPartitions", numPartitions)
>     .load();
> {code}
> So one way to alter the call to get what I am looking for would be:
> {code:java}
> spark.read()
>     .format("jdbc")
>     .option("url", connectionUrl)
>     .option("dbtable", "( select c1, c2, c3 from tableName where c4 > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS') )")
>     .option("driver", "oracle.jdbc.OracleDriver")
>     .option("fetchSize", fetchSize)
>     .option("lowerBound", minId)
>     .option("upperBound", maxId)
>     .option("partitionColumn", "ID")
>     .option("numPartitions", numPartitions)
>     .option("binds", ImmutableMap.of("timestamp", "2018-01-01 00:00:00"))
>     .load();
> {code}
> The queries that Spark generates from this should look something like:
> {code:sql}
> SELECT c1, c2, c3
> FROM
> (
>     SELECT c1, c2, c3
>     FROM tableName
>     WHERE c4 > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS')
> ) AS __SPARK_GEN_JDBC_SUBQUERY_NAME_1
> WHERE
>     ID >= :partitionLowerBound
>     AND ID < :partitionUpperBound
> {code}
> I am not certain whether this parameterized query syntax is supported by
> all other JDBC drivers, or whether it would improve performance on those
> databases.
> I'm also not sure if I picked the correct component or versions. Feel free
> to correct them.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
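As an editorial aside on the proposal above: standard JDBC only exposes positional `?` placeholders, so a `binds` option like the one sketched would likely need to rewrite `:name` markers before preparing the statement. The helper below is purely hypothetical (`NamedBinds` and `toPositional` are illustrative names, not Spark or Oracle API) and shows one way that rewrite could work; note it must skip quoted literals so a format string like `'YYYY-MM-DD HH24:MI:SS'` is not mistaken for binds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of any Spark API): rewrites named binds such
// as :timestamp into JDBC's positional '?' placeholders and collects the
// bind values in placeholder order, ready to be passed to
// PreparedStatement.setObject(i + 1, values.get(i)).
public class NamedBinds {

    // Matches either a single-quoted SQL literal or a :name placeholder;
    // quoted literals are matched first so ':MI' inside a date format string
    // is never treated as a bind.
    private static final Pattern TOKEN =
            Pattern.compile("'[^']*'|:([A-Za-z_][A-Za-z0-9_]*)");

    static String toPositional(String sql, Map<String, Object> binds, List<Object> outValues) {
        Matcher m = TOKEN.matcher(sql);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            if (name == null) {
                // A quoted literal: copy it through unchanged.
                m.appendReplacement(sb, Matcher.quoteReplacement(m.group()));
            } else if (binds.containsKey(name)) {
                outValues.add(binds.get(name));
                m.appendReplacement(sb, "?");
            } else {
                throw new IllegalArgumentException("No value supplied for bind :" + name);
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> binds = new LinkedHashMap<>();
        binds.put("timestamp", "2018-01-01 00:00:00");
        List<Object> values = new ArrayList<>();
        String sql = toPositional(
                "select c1, c2, c3 from tableName where c4 > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS')",
                binds, values);
        System.out.println(sql);     // placeholder form, safe for PreparedStatement
        System.out.println(values);  // bind values in placeholder order
    }
}
```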