Josh Rosen created SPARK-29419: ---------------------------------- Summary: Seq.toDS / spark.createDataset(Seq) is not thread-safe Key: SPARK-29419 URL: https://issues.apache.org/jira/browse/SPARK-29419 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: Josh Rosen Assignee: Josh Rosen
The {{Seq.toDS}} and {{spark.createDataset(Seq)}} code is not thread-safe: if the caller-supplied {{Encoder}} is used in multiple threads then {{createDataset}}'s usage of the encoder may lead to incorrect answers because the Encoder's internal mutable state will be updated by from multiple threads. Here is an example demonstrating the problem: {code:java} import org.apache.spark.sql._ val enc = implicitly[Encoder[(Int, Int)]] val datasets = (1 to 100).par.map { _ => val pairs = (1 to 100).map(x => (x, x)) spark.createDataset(pairs)(enc) } datasets.reduce(_ union _).collect().foreach { pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair") }{code} Due to the thread-safety issue, the above example results in the creation of corrupted records where different input records' fields are co-mingled. This bug is similar to SPARK-22355, a related problem in {{Dataset.collect()}} (fixed in Spark 2.2.1+). Fortunately, this has a simple one-line fix (copy the encoder); I'll submit a patch for this shortly. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org