Eric Wasserman created SPARK-15070:
--------------------------------------

             Summary: Data corruption when using Dataset.groupBy[K : Encoder](func: T => K) when data loaded from JSON file.
                 Key: SPARK-15070
                 URL: https://issues.apache.org/jira/browse/SPARK-15070
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, SQL
    Affects Versions: 1.6.1
         Environment: produced on Mac OS X 10.11.4 in local mode
            Reporter: Eric Wasserman


Full running case at: https://github.com/ewasserman/spark-bug.git

Bug.scala
==========
package bug

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}

case class BugRecord(m: String, elapsed_time: java.lang.Double)

object Bug {
  def main(args: Array[String]): Unit = {
    val c = new SparkConf().setMaster("local[2]").setAppName("BugTest")
    val sc = new SparkContext(c)
    val sqlc = new SQLContext(sc)

    import sqlc.implicits._
    val logs = sqlc.read.json("bug-data.json").as[BugRecord]
    logs.groupBy(r => "FOO").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)
    
    sc.stop()
  }
}


bug-data.json
==========
{"m":"POST","elapsed_time":0.123456789012345678,"source_time":"abcdefghijk"}

-----------------
Expected Output:
+-----------+-------------------+
|_1         |_2                 |
+-----------+-------------------+
|FOO        |0.12345678901234568|
+-----------+-------------------+

Observed Output:
+-----------+-------------------+
|_1         |_2                 |
+-----------+-------------------+
|POSTabc    |0.12345726584950388|
+-----------+-------------------+

The grouping key has been corrupted (it is *not* the product of the groupBy 
function): it is a combination of bytes from the actual key column and bytes from 
an extra attribute in the JSON that is not present in the case class. The 
aggregated value is also corrupted.

NOTE:
The problem does not manifest when using an alternate form of groupBy:
logs.groupBy($"m").agg(avg($"elapsed_time").as[Double])

The corrupted-key problem does not manifest when there is no additional 
field in the JSON, i.e. if the data file is this:

{"m":"POST","elapsed_time":0.123456789012345678}


