[ https://issues.apache.org/jira/browse/SPARK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268500#comment-15268500 ]
Pete Robbins commented on SPARK-15070:
--------------------------------------

Could this be related to https://issues.apache.org/jira/browse/SPARK-12555 ?

> Data corruption when using Dataset.groupBy[K : Encoder](func: T => K) when
> data loaded from JSON file.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15070
>                 URL: https://issues.apache.org/jira/browse/SPARK-15070
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 1.6.1
>        Environment: produced on Mac OS X 10.11.4 in local mode
>            Reporter: Eric Wasserman
>
> A full running test case is at: https://github.com/ewasserman/spark-bug.git
>
> Bug.scala
> ==========
> package bug
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
>
> case class BugRecord(m: String, elapsed_time: java.lang.Double)
>
> object Bug {
>   def main(args: Array[String]): Unit = {
>     val c = new SparkConf().setMaster("local[2]").setAppName("BugTest")
>     val sc = new SparkContext(c)
>     val sqlc = new SQLContext(sc)
>     import sqlc.implicits._
>
>     val logs = sqlc.read.json("bug-data.json").as[BugRecord]
>     logs.groupBy(r => "FOO").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)
>
>     sc.stop()
>   }
> }
>
> bug-data.json
> ==========
> {"m":"POST","elapsed_time":0.123456789012345678,"source_time":"abcdefghijk"}
>
> -----------------
> Expected Output:
> +-----------+-------------------+
> |_1         |_2                 |
> +-----------+-------------------+
> |FOO        |0.12345678901234568|
> +-----------+-------------------+
>
> Observed Output:
> +-----------+-------------------+
> |_1         |_2                 |
> +-----------+-------------------+
> |POSTabc    |0.12345726584950388|
> +-----------+-------------------+
>
> The grouping key has been corrupted: it is *not* the product of the groupBy
> function, but a combination of bytes from the actual key column and an extra
> attribute in the JSON that is not present in the case class. The aggregated
> value is also corrupted.
>
> NOTE:
> The problem does not manifest when using the alternate, column-based form of
> groupBy (a standalone sketch of this form follows this message):
> logs.groupBy($"m").agg(avg($"elapsed_time").as[Double])
> The corrupted-key problem also does not manifest when there is no additional
> field in the JSON, i.e. if the data file is:
> {"m":"POST","elapsed_time":0.123456789012345678}
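Below is a minimal, self-contained sketch of the unaffected column-based groupBy form mentioned in the NOTE, assuming the same bug-data.json input and the Spark 1.6.x SQLContext API used in Bug.scala; the object name Workaround and the app name are illustrative and not part of the original report.

Workaround.scala
==========
package bug

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg
import org.apache.spark.{SparkConf, SparkContext}

object Workaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WorkaroundTest")
    val sc = new SparkContext(conf)
    val sqlc = new SQLContext(sc)
    import sqlc.implicits._

    // Read the same JSON as the repro but stay with the untyped DataFrame:
    // the extra "source_time" field remains in the schema and is simply
    // ignored by the aggregation rather than being mapped through an encoder.
    val logs = sqlc.read.json("bug-data.json")

    // Column-based groupBy: per the NOTE in the report, this form produces
    // the correct key ("POST") and the correct average.
    logs.groupBy($"m").agg(avg($"elapsed_time")).show(20, truncate = false)

    sc.stop()
  }
}

The report only confirms that this form yields correct output; whether that is because it bypasses the encoder path implicated in SPARK-12555 is speculation.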