[jira] [Commented] (SPARK-15070) Data corruption when using Dataset.groupBy[K : Encoder](func: T => K) when data loaded from JSON file.

2016-05-03 Thread Eric Wasserman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269468#comment-15269468 ]

Eric Wasserman commented on SPARK-15070:
----------------------------------------

I tried the same example (slightly modified to accommodate the 2.0 API change
from groupBy to groupByKey) against the current 2.0 branch, and the bug does
not reproduce. The fix for SPARK-12555 was put into the 2.0.0 branch, so it
seems this bug may share its origin with SPARK-12555.
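
For reference, a minimal sketch of what the 2.0-style modification looks like (a hypothetical reconstruction mirroring Bug.scala below, not necessarily the exact code that was run; SparkSession replaces SQLContext and groupByKey replaces the typed groupBy in the 2.0 API; the object name Bug2 is illustrative):

package bug

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class BugRecord(m: String, elapsed_time: java.lang.Double)

object Bug2 {
  def main(args: Array[String]): Unit = {
    // SparkSession is the 2.0 entry point that subsumes SQLContext.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("BugTest")
      .getOrCreate()
    import spark.implicits._

    val logs = spark.read.json("bug-data.json").as[BugRecord]
    // groupByKey is the 2.0 rename of the typed groupBy[K : Encoder](func: T => K).
    logs.groupByKey(r => "FOO")
      .agg(avg($"elapsed_time").as[Double])
      .show(20, truncate = false)

    spark.stop()
  }
}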

> Data corruption when using Dataset.groupBy[K : Encoder](func: T => K) when 
> data loaded from JSON file.
> --------------------------------------------------------------------------
>
> Key: SPARK-15070
> URL: https://issues.apache.org/jira/browse/SPARK-15070
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 1.6.1
> Environment: produced on Mac OS X 10.11.4 in local mode
> Reporter: Eric Wasserman
>
> full running case at: https://github.com/ewasserman/spark-bug.git
> Bug.scala
> =========
> package bug
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
>
> case class BugRecord(m: String, elapsed_time: java.lang.Double)
>
> object Bug {
>   def main(args: Array[String]): Unit = {
>     val c = new SparkConf().setMaster("local[2]").setAppName("BugTest")
>     val sc = new SparkContext(c)
>     val sqlc = new SQLContext(sc)
>     import sqlc.implicits._
>
>     val logs = sqlc.read.json("bug-data.json").as[BugRecord]
>     logs.groupBy(r => "FOO").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)
>
>     sc.stop()
>   }
> }
> bug-data.json
> =============
> {"m":"POST","elapsed_time":0.123456789012345678,"source_time":"abcdefghijk"}
> -----------------------------------------------------------------------------
> Expected Output:
> +---+-------------------+
> |_1 |_2                 |
> +---+-------------------+
> |FOO|0.12345678901234568|
> +---+-------------------+
> Observed Output:
> +-------+-------------------+
> |_1     |_2                 |
> +-------+-------------------+
> |POSTabc|0.12345726584950388|
> +-------+-------------------+
> The grouping key has been corrupted (it is *not* the product of the groupBy 
> function) and is a combination of bytes from the actual key column and an 
> extra attribute in the JSON not present in the case class. The aggregated 
> value is also corrupted.
> NOTE:
> The problem does not manifest when using an alternate form of groupBy:
> logs.groupBy($"m").agg(avg($"elapsed_time").as[Double])
> The corrupted-key problem also does not manifest when there is no additional
> field in the JSON, i.e. if the data file is this:
> {"m":"POST","elapsed_time":0.123456789012345678}
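
For contrast, a minimal side-by-side sketch of the two grouping forms (an illustrative snippet assuming the same sqlc, implicits, and logs as in Bug.scala above):

// Typed grouping through a Scala closure: the form that exhibits the corruption.
logs.groupBy(r => "FOO").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)

// Untyped, Column-based grouping (the alternate form from the NOTE): the key is
// taken directly from the "m" column and the corruption does not manifest.
logs.groupBy($"m").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)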



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15070) Data corruption when using Dataset.groupBy[K : Encoder](func: T => K) when data loaded from JSON file.

2016-05-03 Thread Pete Robbins (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268500#comment-15268500 ]

Pete Robbins commented on SPARK-15070:
--------------------------------------

Could this be related to https://issues.apache.org/jira/browse/SPARK-12555?
