spark git commit: [SPARK-11246] [SQL] Table cache for Parquet broken in 1.5

yhuai Thu, 29 Oct 2015 07:58:15 -0700

Repository: spark
Updated Branches:
  refs/heads/branch-1.5 9e3197aaa -> 76d742386



[SPARK-11246] [SQL] Table cache for Parquet broken in 1.5

The root cause is that, with spark.sql.hive.convertMetastoreParquet=true (the 
default), the cached InMemoryRelation of the ParquetRelation cannot be looked 
up in the cachedData of CacheManager: the key comparison fails even though 
both sides are the same LogicalPlan, a Subquery that wraps the 
ParquetRelation.

The solution in this PR is to override LogicalPlan.sameResult in the Subquery 
case class so that the subquery node is eliminated before the child 
(ParquetRelation) is compared directly, which lets the lookup find the key of 
the cached InMemoryRelation.
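
For illustration only, here is a minimal, Spark-free sketch of why a cache 
keyed by sameResult can miss when per-resolution expression IDs take part in 
the comparison, and why restricting the compared arguments to the relation 
makes the lookup succeed. All names below (Relation, Attribute, 
NaiveLogicalRelation, FixedLogicalRelation, Cache) are hypothetical stand-ins 
and not Spark's actual classes; the real CacheManager and LogicalPlan are more 
involved.

```scala
// Toy model of a plan cache keyed by sameResult. Not Spark code:
// every type here is a hypothetical stand-in used only for illustration.
object SameResultSketch {
  // Stand-in for a data source relation such as a Parquet relation.
  final case class Relation(path: String)

  // Stand-in for an output attribute. The exprId is assumed to change
  // every time the same table is resolved again.
  final case class Attribute(name: String, exprId: Long)

  sealed trait Plan {
    // Arguments that take part in the "same result" comparison.
    def cleanArgs: Seq[Any]

    def sameResult(other: Plan): Boolean =
      this.getClass == other.getClass && this.cleanArgs == other.cleanArgs
  }

  // Compares the relation AND the expected output attributes,
  // so two resolutions of the same table never match.
  final case class NaiveLogicalRelation(
      relation: Relation,
      expectedOutput: Option[Seq[Attribute]]) extends Plan {
    def cleanArgs: Seq[Any] = Seq(relation, expectedOutput)
  }

  // Compares only the relation, mirroring the idea of the cleanArgs override.
  final case class FixedLogicalRelation(
      relation: Relation,
      expectedOutput: Option[Seq[Attribute]]) extends Plan {
    def cleanArgs: Seq[Any] = Seq(relation)
  }

  // Simplified cache: a linear scan that uses sameResult as the key comparison.
  final class Cache {
    private var entries = Vector.empty[(Plan, String)]

    def put(plan: Plan, data: String): Unit =
      entries = entries :+ (plan -> data)

    def lookup(plan: Plan): Option[String] =
      entries.collectFirst { case (cached, data) if cached.sameResult(plan) => data }
  }

  def main(args: Array[String]): Unit = {
    val relation = Relation("/tmp/cachedTable")
    val firstResolution = Some(Seq(Attribute("c", exprId = 1L)))
    val secondResolution = Some(Seq(Attribute("c", exprId = 2L)))

    val naiveCache = new Cache
    naiveCache.put(NaiveLogicalRelation(relation, firstResolution), "cached columns")
    // Miss: the exprIds differ although the relation is identical.
    println(naiveCache.lookup(NaiveLogicalRelation(relation, secondResolution)))

    val fixedCache = new Cache
    fixedCache.put(FixedLogicalRelation(relation, firstResolution), "cached columns")
    // Hit: only the relation is compared.
    println(fixedCache.lookup(FixedLogicalRelation(relation, secondResolution)))
  }
}
```

In this sketch the first lookup returns None and the second returns 
Some("cached columns"), which is the difference between re-reading the 
Parquet files and scanning the in-memory cache.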

Author: xin Wu <xi...@us.ibm.com>

Closes #9326 from xwu0226/spark-11246-commit.

(cherry picked from commit f7a51deebad1b4c3b970a051f25d286110b94438)
Signed-off-by: Yin Huai <yh...@databricks.com>

Conflicts:
        sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76d74238
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76d74238
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76d74238

Branch: refs/heads/branch-1.5
Commit: 76d742386cd045526969a4f8b4c2cf30d54fd30c
Parents: 9e3197a
Author: xin Wu <xi...@us.ibm.com>
Authored: Thu Oct 29 07:42:46 2015 -0700
Committer: Yin Huai <yh...@databricks.com>
Committed: Thu Oct 29 07:57:10 2015 -0700

----------------------------------------------------------------------
 .../spark/sql/execution/datasources/LogicalRelation.scala |  5 +++++
 .../org/apache/spark/sql/hive/CachedTableSuite.scala      | 10 ++++++++++
 2 files changed, 15 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/76d74238/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
index 4069179..c9cc7d5 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
@@ -62,6 +62,11 @@ private[sql] case class LogicalRelation(
     case _ => false
   }
 
+  // When comparing two LogicalRelations from within LogicalPlan.sameResult, we only need
+  // LogicalRelation.cleanArgs to return Seq(relation), since expectedOutputAttribute's
+  // expId can be different but the relation is still the same.
+  override lazy val cleanArgs: Seq[Any] = Seq(relation)
+
   @transient override lazy val statistics: Statistics = Statistics(
     sizeInBytes = BigInt(relation.sizeInBytes)
   )

http://git-wip-us.apache.org/repos/asf/spark/blob/76d74238/sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala
----------------------------------------------------------------------
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala
index 39d315a..7f7b079 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala
@@ -203,4 +203,14 @@ class CachedTableSuite extends QueryTest {
     sql("DROP TABLE refreshTable")
     Utils.deleteRecursively(tempPath)
   }
+
+  test("SPARK-11246 cache parquet table") {
+    sql("CREATE TABLE cachedTable STORED AS PARQUET AS SELECT 1")
+
+    cacheTable("cachedTable")
+    val sparkPlan = sql("SELECT * FROM cachedTable").queryExecution.sparkPlan
+    assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 1)
+
+    sql("DROP TABLE cachedTable")
+  }
 }

