spark git commit: [SPARK-22355][SQL] Dataset.collect is not threadsafe

lixiao Thu, 26 Oct 2017 17:53:32 -0700

Repository: spark
Updated Branches:
  refs/heads/branch-2.2 a607ddc52 -> 2839280ad



[SPARK-22355][SQL] Dataset.collect is not threadsafe

It's possible that users create a `Dataset`, and call `collect` of this 
`Dataset` in many threads at the same time. Currently `Dataset#collect` just 
call `encoder.fromRow` to convert spark rows to objects of type T, and this 
encoder is per-dataset. This means `Dataset#collect` is not thread-safe, 
because the encoder uses a projection to output the object to a re-usable row.

This PR fixes this problem, by creating a new projection when calling 
`Dataset#collect`, so that we have the re-usable row for each method call, 
instead of each Dataset.

N/A

Author: Wenchen Fan <wenc...@databricks.com>

Closes #19577 from cloud-fan/encoder.

(cherry picked from commit 5c3a1f3fad695317c2fff1243cdb9b3ceb25c317)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2839280a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2839280a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2839280a

Branch: refs/heads/branch-2.2
Commit: 2839280adc930593c64a74892fec79dcc666d468
Parents: a607ddc
Author: Wenchen Fan <wenc...@databricks.com>
Authored: Thu Oct 26 17:51:16 2017 -0700
Committer: gatorsmile <gatorsm...@gmail.com>
Committed: Thu Oct 26 17:52:26 2017 -0700

----------------------------------------------------------------------
 .../scala/org/apache/spark/sql/Dataset.scala    | 33 +++++++++++++-------
 1 file changed, 22 insertions(+), 11 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/2839280a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index a775fb8..1acbad9 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -39,6 +39,7 @@ import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
 import org.apache.spark.sql.catalyst.encoders._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection
 import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JSONOptions}
 import org.apache.spark.sql.catalyst.optimizer.CombineUnions
 import org.apache.spark.sql.catalyst.parser.ParseException
@@ -195,15 +196,10 @@ class Dataset[T] private[sql](
    */
   private[sql] implicit val exprEnc: ExpressionEncoder[T] = encoderFor(encoder)
 
-  /**
-   * Encoder is used mostly as a container of serde expressions in Dataset.  
We build logical
-   * plans by these serde expressions and execute it within the query 
framework.  However, for
-   * performance reasons we may want to use encoder as a function to 
deserialize internal rows to
-   * custom objects, e.g. collect.  Here we resolve and bind the encoder so 
that we can call its
-   * `fromRow` method later.
-   */
-  private val boundEnc =
-    exprEnc.resolveAndBind(logicalPlan.output, 
sparkSession.sessionState.analyzer)
+  // The deserializer expression which can be used to build a projection and 
turn rows to objects
+  // of type T, after collecting rows to the driver side.
+  private val deserializer =
+    exprEnc.resolveAndBind(logicalPlan.output, 
sparkSession.sessionState.analyzer).deserializer
 
   private implicit def classTag = exprEnc.clsTag
 
@@ -2418,7 +2414,15 @@ class Dataset[T] private[sql](
    */
   def toLocalIterator(): java.util.Iterator[T] = {
     withAction("toLocalIterator", queryExecution) { plan =>
-      plan.executeToIterator().map(boundEnc.fromRow).asJava
+      // This projection writes output to a `InternalRow`, which means 
applying this projection is
+      // not thread-safe. Here we create the projection inside this method to 
make `Dataset`
+      // thread-safe.
+      val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
+      plan.executeToIterator().map { row =>
+        // The row returned by SafeProjection is `SpecificInternalRow`, which 
ignore the data type
+        // parameter of its `get` method, so it's safe to use null here.
+        objProj(row).get(0, null).asInstanceOf[T]
+      }.asJava
     }
   }
 
@@ -2851,7 +2855,14 @@ class Dataset[T] private[sql](
    * Collect all elements from a spark plan.
    */
   private def collectFromPlan(plan: SparkPlan): Array[T] = {
-    plan.executeCollect().map(boundEnc.fromRow)
+    // This projection writes output to a `InternalRow`, which means applying 
this projection is not
+    // thread-safe. Here we create the projection inside this method to make 
`Dataset` thread-safe.
+    val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
+    plan.executeCollect().map { row =>
+      // The row returned by SafeProjection is `SpecificInternalRow`, which 
ignore the data type
+      // parameter of its `get` method, so it's safe to use null here.
+      objProj(row).get(0, null).asInstanceOf[T]
+    }
   }
 
   private def sortInternal(global: Boolean, sortExprs: Seq[Column]): 
Dataset[T] = {


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Reply via email to