John Muller created SPARK-20259:
-----------------------------------

             Summary: Support push down join optimizations in DataFrameReader 
when loading from JDBC
                 Key: SPARK-20259
                 URL: https://issues.apache.org/jira/browse/SPARK-20259
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0, 1.6.2
            Reporter: John Muller
            Priority: Minor


Given two dataframes loaded from the same JDBC connection:

{code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid}
val ordersDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "northwind.orders")
  .option("user", "username")
  .option("password", "password")
  .load().toDS
  
val productDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "northwind.product")
  .option("user", "username")
  .option("password", "password")
  .load().toDS
  
ordersDF.createOrReplaceTempView("orders")
productDF.createOrReplaceTempView("product")

// Followed by a join between them:
val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o 
INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name")
{code}

Catalyst should optimize the query to be:
SELECT northwind.product.name, SUM(northwind.orders.qty)
FROM northwind.orders
INNER JOIN northwind.product ON
  northwind.orders.product_id = northwind.product.product_id
GROUP BY p.name



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to