Derrick Burns created SPARK-5917:
------------------------------------

             Summary: Distinct is broken
                 Key: SPARK-5917
                 URL: https://issues.apache.org/jira/browse/SPARK-5917
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.1.1
         Environment: Spark 1.1.1 running on YARN 2.4 via Amazon EMR.
            Reporter: Derrick Burns
            Priority: Critical

I hate to file bugs that are hard to reproduce (by other people), but after spending a full week trying to debug my code, I constructed a scenario where the following assertion FAILS.

    val x : RDD[T] = ....
    val y = x.distinct()
    assert( y.count() <= x.count() )

I am at a complete loss as to how this can occur under ANY definition of equality/order unless the RDD underlying x is mutable.

Since none of my RDD transforms mutate any existing RDD data and I am reading from immutable sources (data on S3), I conclude that there must be a bug in Spark or I am mutating my data unknowingly.
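
One way to test the hypothesis above, that x is not stable between the two actions, is to persist x and count it more than once before calling distinct(). The sketch below is illustrative only. DistinctCheck and check are hypothetical names, the local master and the randomly filtered input are assumptions chosen to make recomputation visible, and nothing here is confirmed as the cause of this report. It assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object DistinctCheck {
  // Hypothetical helper: persists x, then returns
  // (first count, second count, distinct count) computed over it.
  def check[T](x: RDD[T]): (Long, Long, Long) = {
    x.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      val first = x.count()               // materializes and stores the partitions
      val second = x.count()              // should be served from the stored partitions
      val distinct = x.distinct().count() // distinct() reads the same stored partitions
      (first, second, distinct)
    } finally {
      x.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("distinct-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Deliberately non-deterministic lineage: without persist(), every
      // action would re-run this filter and could see a different data set.
      val x = sc.parallelize(1 to 100000).filter(_ => scala.util.Random.nextBoolean())
      val (first, second, distinct) = check(x)
      println(s"first=$first second=$second distinct=$distinct")
      assert(first == second, "x changed between two actions")
      assert(distinct <= first, "distinct produced more rows than its input")
    } finally {
      sc.stop()
    }
  }
}
```

If the two plain counts disagree, or the assertion only fails when persist() is removed, then x is being recomputed with different results on each action, and the lineage of x is the place to look rather than distinct() itself. Persisting is best-effort: partitions that are lost are recomputed from lineage, so a passing run narrows the search but does not prove the lineage is deterministic.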