Eric Wasserman created SPARK-15070:
--------------------------------------

             Summary: Data corruption when using Dataset.groupBy[K : Encoder](func: T => K) when data loaded from JSON file.
                 Key: SPARK-15070
                 URL: https://issues.apache.org/jira/browse/SPARK-15070
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, SQL
    Affects Versions: 1.6.1
         Environment: produced on Mac OS X 10.11.4 in local mode
            Reporter: Eric Wasserman

Full running case at: https://github.com/ewasserman/spark-bug.git

Bug.scala
==========
package bug

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}

case class BugRecord(m: String, elapsed_time: java.lang.Double)

object Bug {
  def main(args: Array[String]): Unit = {
    val c = new SparkConf().setMaster("local[2]").setAppName("BugTest")
    val sc = new SparkContext(c)
    val sqlc = new SQLContext(sc)
    import sqlc.implicits._

    val logs = sqlc.read.json("bug-data.json").as[BugRecord]
    // Constant key: every row should land in the single group "FOO".
    logs.groupBy(r => "FOO").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)

    sc.stop()
  }
}

bug-data.json
==========
{"m":"POST","elapsed_time":0.123456789012345678,"source_time":"abcdefghijk"}

-----------------

Expected Output:

+-----------+-------------------+
|_1         |_2                 |
+-----------+-------------------+
|FOO        |0.12345678901234568|
+-----------+-------------------+

Observed Output:

+-----------+-------------------+
|_1         |_2                 |
+-----------+-------------------+
|POSTabc    |0.12345726584950388|
+-----------+-------------------+

The grouping key has been corrupted (it is *not* the product of the groupBy function) and is a combination of bytes from the actual key column and an extra attribute in the JSON that is not present in the case class. The aggregated value is also corrupted.

NOTE: The problem does not manifest when using an alternate form of groupBy:

logs.groupBy($"m").agg(avg($"elapsed_time").as[Double])

The corrupted key problem does not manifest when there is no additional field in the JSON, i.e. if the data file is this:

{"m":"POST","elapsed_time":0.123456789012345678}
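
Since the corrupted key only shows up when the JSON carries a field that BugRecord lacks, one further check is to read the original bug-data.json with an explicit two-field schema, so that source_time never reaches the Dataset. The sketch below is a diagnostic variant of Bug.scala, not a confirmed workaround: the package name bugcheck and the object name BugWithExplicitSchema are made up for illustration, and whether this avoids the corruption on 1.6.1 has not been verified here.

```scala
package bugcheck // hypothetical package, kept separate from the original `bug` package

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

// Same record shape as in the report.
case class BugRecord(m: String, elapsed_time: java.lang.Double)

object BugWithExplicitSchema { // hypothetical name
  def main(args: Array[String]): Unit = {
    val c = new SparkConf().setMaster("local[2]").setAppName("BugTestExplicitSchema")
    val sc = new SparkContext(c)
    val sqlc = new SQLContext(sc)
    import sqlc.implicits._

    // Only the fields that BugRecord declares; source_time is deliberately left out.
    val schema = StructType(Seq(
      StructField("m", StringType, nullable = true),
      StructField("elapsed_time", DoubleType, nullable = true)
    ))

    // Supplying a schema skips inference, so the extra JSON attribute is not read.
    val logs = sqlc.read.schema(schema).json("bug-data.json").as[BugRecord]

    // Same function-based groupBy that misbehaves in the report.
    logs.groupBy(r => "FOO").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)

    sc.stop()
  }
}
```

If this variant prints the expected FOO row while the original Bug.scala does not, that would point at the unused source_time column as the trigger, consistent with the reduced data file behaving correctly.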