[ https://issues.apache.org/jira/browse/SPARK-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-5917:
-----------------------------
    Component/s:     (was: MLlib)
                 Spark Core
       Priority: Major  (was: Critical)

I can't reproduce this locally or with Hadoop / HDFS / YARN on a few sample data sets. I tried caching and no caching. Downgrading for now.

> Distinct is broken
> ------------------
>
>                 Key: SPARK-5917
>                 URL: https://issues.apache.org/jira/browse/SPARK-5917
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.1
>         Environment: Spark 1.1.1 running on YARN 2.4 via Amazon EMR.
>            Reporter: Derrick Burns
>
> I hate to file bugs that are hard to reproduce (by other people), but after
> spending a full week trying to debug my code, I constructed a scenario where
> the following assertion FAILS.
> val x : RDD[T] = ....
> val y = x.distinct()
> assert( y.count() <= x.count() )
> I am at a complete loss as to how this can occur under ANY definition of
> equality/order unless the RDD underlying x is mutable. Since none of my RDD
> transforms mutate any existing RDD data and I am reading from immutable
> sources (data on S3), I conclude that there must be a bug in Spark or I am
> mutating my data unknowingly.
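
For illustration only, here is one hypothetical way the quoted assertion could fail without any data being mutated in place: if x is not persisted and its lineage is not deterministic, y.count() and x.count() each re-evaluate x and may see different records. Nothing in this report confirms that this is what happened. The sketch below is plain Scala with no Spark dependency, and UncachedSource and evaluate() are made-up names standing in for an uncached RDD and an action on it.

```scala
object DistinctSketch {
  // Hypothetical stand-in for an uncached dataset with a nondeterministic
  // lineage: every evaluation re-runs the computation and, in this toy
  // model, returns one record fewer than the previous evaluation.
  final class UncachedSource(initialSize: Int) {
    private var remaining = initialSize

    def evaluate(): Vector[Int] = {
      val data = Vector.range(0, remaining)
      remaining = math.max(0, remaining - 1)
      data
    }
  }

  def main(args: Array[String]): Unit = {
    val x = new UncachedSource(5)

    // Mirrors assert(y.count() <= x.count()): the left side evaluates the
    // lineage once, the right side evaluates it a second time.
    val yCount = x.evaluate().distinct.size
    val xCount = x.evaluate().size
    println(s"re-evaluated: y=$yCount x=$xCount holds=${yCount <= xCount}")

    // Materializing the data once makes both counts see the same records,
    // so the distinct count cannot exceed the total count.
    val materialized = new UncachedSource(5).evaluate()
    val holds = materialized.distinct.size <= materialized.size
    println(s"materialized once: holds=$holds")
  }
}
```

In real Spark code the analogous experiment would be to persist x (for example with x.cache()) before calling distinct() and count(), and to check whether the assertion still fails; the comment above notes that both caching and no caching were tried without reproducing the problem.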