[ https://issues.apache.org/jira/browse/SPARK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268500#comment-15268500 ]
Pete Robbins commented on SPARK-15070:
--------------------------------------

Could this be related to https://issues.apache.org/jira/browse/SPARK-12555 ?

> Data corruption when using Dataset.groupBy[K : Encoder](func: T => K) when
> data loaded from JSON file.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15070
>                 URL: https://issues.apache.org/jira/browse/SPARK-15070
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 1.6.1
>        Environment: produced on Mac OS X 10.11.4 in local mode
>            Reporter: Eric Wasserman
>
> A full running test case is at: https://github.com/ewasserman/spark-bug.git
>
> Bug.scala
> ==========
> package bug
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
>
> case class BugRecord(m: String, elapsed_time: java.lang.Double)
>
> object Bug {
>   def main(args: Array[String]): Unit = {
>     val c = new SparkConf().setMaster("local[2]").setAppName("BugTest")
>     val sc = new SparkContext(c)
>     val sqlc = new SQLContext(sc)
>     import sqlc.implicits._
>
>     val logs = sqlc.read.json("bug-data.json").as[BugRecord]
>     logs.groupBy(r => "FOO").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)
>
>     sc.stop()
>   }
> }
>
> bug-data.json
> ==========
> {"m":"POST","elapsed_time":0.123456789012345678,"source_time":"abcdefghijk"}
>
> -----------------
> Expected Output:
> +-----------+-------------------+
> |_1         |_2                 |
> +-----------+-------------------+
> |FOO        |0.12345678901234568|
> +-----------+-------------------+
>
> Observed Output:
> +-----------+-------------------+
> |_1         |_2                 |
> +-----------+-------------------+
> |POSTabc    |0.12345726584950388|
> +-----------+-------------------+
>
> The grouping key has been corrupted: it is *not* the product of the groupBy
> function, but a combination of bytes from the actual key column and an extra
> attribute in the JSON that is not present in the case class. The aggregated
> value is also corrupted.
>
> NOTE:
> The problem does not manifest when using the alternate, column-based form of
> groupBy (a standalone sketch of this form follows this message):
> logs.groupBy($"m").agg(avg($"elapsed_time").as[Double])
> The corrupted-key problem also does not manifest when there is no additional
> field in the JSON, i.e. if the data file is:
> {"m":"POST","elapsed_time":0.123456789012345678}
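Below is a minimal, self-contained sketch of the unaffected column-based groupBy form mentioned in the NOTE, assuming the same bug-data.json input and the Spark 1.6.x SQLContext API used in Bug.scala; the object name Workaround and the app name are illustrative and not part of the original report.

Workaround.scala
==========
package bug

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg
import org.apache.spark.{SparkConf, SparkContext}

object Workaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WorkaroundTest")
    val sc = new SparkContext(conf)
    val sqlc = new SQLContext(sc)
    import sqlc.implicits._

    // Read the same JSON as the repro but stay with the untyped DataFrame:
    // the extra "source_time" field remains in the schema and is simply
    // ignored by the aggregation rather than being mapped through an encoder.
    val logs = sqlc.read.json("bug-data.json")

    // Column-based groupBy: per the NOTE in the report, this form produces
    // the correct key ("POST") and the correct average.
    logs.groupBy($"m").agg(avg($"elapsed_time")).show(20, truncate = false)

    sc.stop()
  }
}

The report only confirms that this form yields correct output; whether that is because it bypasses the encoder path implicated in SPARK-12555 is speculation.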