John Muller created SPARK-20259: ----------------------------------- Summary: Support push down join optimizations in DataFrameReader when loading from JDBC Key: SPARK-20259 URL: https://issues.apache.org/jira/browse/SPARK-20259 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0, 1.6.2 Reporter: John Muller Priority: Minor
Given two dataframes loaded from the same JDBC connection: {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} val ordersDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "northwind.orders") .option("user", "username") .option("password", "password") .load().toDS val productDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "northwind.product") .option("user", "username") .option("password", "password") .load().toDS ordersDF.createOrReplaceTempView("orders") productDF.createOrReplaceTempView("product") // Followed by a join between them: val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") {code} Catalyst should optimize the query to be: SELECT northwind.product.name, SUM(northwind.orders.qty) FROM northwind.orders INNER JOIN northwind.product ON northwind.orders.product_id = northwind.product.product_id GROUP BY p.name -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org