Josh Rosen created SPARK-29419:
----------------------------------

             Summary: Seq.toDS / spark.createDataset(Seq) is not thread-safe
                 Key: SPARK-29419
                 URL: https://issues.apache.org/jira/browse/SPARK-29419
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0, 3.0.0
            Reporter: Josh Rosen
            Assignee: Josh Rosen


The {{Seq.toDS}} and {{spark.createDataset(Seq)}} code is not thread-safe: if 
the caller-supplied {{Encoder}} is used in multiple threads then 
{{createDataset}}'s usage of the encoder may lead to incorrect answers because 
the {{Encoder}}'s internal mutable state will be updated concurrently from multiple threads.

Here is an example demonstrating the problem:
{code:java}
import org.apache.spark.sql._

val enc = implicitly[Encoder[(Int, Int)]]

val datasets = (1 to 100).par.map { _ =>
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}

datasets.reduce(_ union _).collect().foreach {
  pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
{code}
Due to the thread-safety issue, the above example results in the creation of 
corrupted records where different input records' fields are co-mingled.

This bug is similar to SPARK-22355, a related problem in {{Dataset.collect()}} 
(fixed in Spark 2.2.1+).

Fortunately, this has a simple one-line fix (copy the encoder); I'll submit a 
patch for this shortly.
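
To illustrate the "copy the encoder" idea without depending on Spark internals, here is a minimal sketch. {{PairEncoder}}, {{toRow}}, {{copy}} and {{encodeAll}} are hypothetical stand-ins invented for this example; they only model an object with a reused internal buffer and are not Spark's actual encoder API or the actual patch.

```scala
object EncoderCopySketch {
  // Hypothetical stand-in for an encoder with internal mutable state.
  // This is not Spark's encoder class; it only models a reused buffer.
  final class PairEncoder {
    private val buffer = new Array[Int](2) // reused on every call

    // Writes both fields into the instance's buffer, then reads them back.
    // If two threads call this on the same instance, the reads can observe
    // the other thread's writes, mixing fields from different records.
    def toRow(pair: (Int, Int)): (Int, Int) = {
      buffer(0) = pair._1
      buffer(1) = pair._2
      (buffer(0), buffer(1))
    }

    // A copy gets its own buffer, so it shares no state with the original.
    def copy(): PairEncoder = new PairEncoder
  }

  // The "copy the encoder" idea: work on a private copy, never on the
  // caller-supplied instance, so concurrent callers cannot interfere.
  def encodeAll(shared: PairEncoder, pairs: Seq[(Int, Int)]): Seq[(Int, Int)] = {
    val local = shared.copy()
    pairs.map(local.toRow)
  }
}
```

Under these assumptions, many threads can pass the same {{PairEncoder}} to {{encodeAll}} and each call still produces rows whose fields come from a single input record, because the shared instance's buffer is never written.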



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

