[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737296#comment-14737296 ]
Glenn Strycker commented on SPARK-10493:
----------------------------------------

[~srowen], the code I attached did run correctly. However, I have similar code that I run on YARN via spark-submit that is NOT returning 1 record per key. When I run the spark-submit code that generates temp5, I get the following set:

{noformat}
((cluster021,cluster023),(cluster021,1,2,1,3))
((cluster031,cluster033),(cluster031,1,2,1,3))
((cluster041,cluster043),(cluster041,5,2,1,3))
((cluster041,cluster043),(cluster041,1,2,1,3))
((cluster041,cluster044),(cluster041,3,2,1,3))
((cluster041,cluster044),(cluster041,4,2,1,3))
((cluster051,cluster052),(cluster051,6,2,1,3))
((cluster051,cluster053),(cluster051,1,2,1,3))
((cluster051,cluster054),(cluster051,1,2,1,3))
((cluster051,cluster055),(cluster051,1,2,1,3))
((cluster051,cluster056),(cluster051,1,2,1,3))
((cluster052,cluster053),(cluster051,1,1,1,2))
((cluster052,cluster054),(cluster051,8,1,1,2))
((cluster053,cluster054),(cluster051,7,1,1,2))
((cluster055,cluster056),(cluster051,9,1,1,2))
{noformat}

Note that the keys (cluster041,cluster043) and (cluster041,cluster044) each have 2 records in the results, which should NEVER happen!
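[Editor's note] The duplicate keys above mean the merge function never combined all records sharing a key. For reference, applying the same merge function from the attached example code to the two (cluster041,cluster043) records locally does produce a single combined record. A minimal sketch in plain Scala (no Spark; groupBy-then-reduce emulates reduceByKey on local data, and the `ReduceCheck` object name is only illustrative):

```scala
object ReduceCheck {
  type V = (String, Int, Int, Int, Int)

  // Same merge function as in the attached example code:
  // min of names, sum of counts, max of the remaining stats
  def merge(a: V, b: V): V =
    (math.Ordering.String.min(a._1, b._1), a._2 + b._2,
     math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))

  def main(args: Array[String]): Unit = {
    // The two duplicate records observed for key (cluster041,cluster043)
    val records = Seq(
      (("cluster041", "cluster043"), ("cluster041", 5, 2, 1, 3)),
      (("cluster041", "cluster043"), ("cluster041", 1, 2, 1, 3)))

    // Emulate reduceByKey locally: group by key, reduce each group's values
    val reduced = records.groupBy(_._1).map { case (k, vs) =>
      (k, vs.map(_._2).reduce(merge))
    }
    reduced.foreach(println)
    // → ((cluster041,cluster043),(cluster041,6,2,1,3)) — one record per key
  }
}
```

This matches the single (cluster041,6,2,1,3) record in the expected output below, so the merge function itself behaves as intended on local collections.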
Here is what I expected (and what I see from the example code I attached to this ticket, which ran successfully in spark-shell):

{noformat}
((cluster021,cluster023),(cluster021,1,2,1,3))
((cluster031,cluster033),(cluster031,1,2,1,3))
((cluster041,cluster043),(cluster041,6,2,1,3))
((cluster041,cluster044),(cluster041,7,2,1,3))
((cluster051,cluster052),(cluster051,6,2,1,3))
((cluster051,cluster053),(cluster051,1,2,1,3))
((cluster051,cluster054),(cluster051,1,2,1,3))
((cluster051,cluster055),(cluster051,1,2,1,3))
((cluster051,cluster056),(cluster051,1,2,1,3))
((cluster052,cluster053),(cluster051,1,1,1,2))
((cluster052,cluster054),(cluster051,8,1,1,2))
((cluster053,cluster054),(cluster051,7,1,1,2))
((cluster055,cluster056),(cluster051,9,1,1,2))
{noformat}

You are right that in my example, distinct is not really the issue, since records with the same key do have different values. The issue is with reduceByKey, which is NOT correctly reducing my RDDs down to 1 record per key. Does reduceByKey not support (String, String) keys?

> reduceByKey not returning distinct results
> ------------------------------------------
>
>                 Key: SPARK-10493
>                 URL: https://issues.apache.org/jira/browse/SPARK-10493
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Glenn Strycker
>         Attachments: reduceByKey_example_001.scala
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs (using zipPartitions), partitioning by a hash partitioner, and then applying a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, V2), (K, V3), I expect the results after reduceByKey to be just (K, f(V1,V2,V3)), where the function f is appropriately associative, commutative, etc. Therefore, the results after reduceByKey ought to be distinct, correct?
> I am running counts of my RDD and finding that adding an additional .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
>
> {noformat}
> val rdd3 = tempRDD1.
>   zipPartitions(tempRDD2, true)((iter, iter2) => iter ++ iter2).
>   partitionBy(new HashPartitioner(numPartitions)).
>   reduceByKey((a, b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> val rdd4 = rdd3.distinct
> println(rdd4.count)
> {noformat}
>
> I am using persistence, checkpointing, and other things in my actual code that I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except that I am not using case classes, to my knowledge.
> See also http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
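[Editor's note] On the question of whether reduceByKey supports (String, String) keys: Scala tuples use structural equals and hashCode, and String hashing is specified by the JVM, so (String, String) is a well-behaved key type whose hash is stable across executors. The pitfall referenced in SPARK-2620 involves key classes whose equals/hashCode are inconsistent, which does not apply here. A small check in plain Scala:

```scala
object KeyCheck {
  def main(args: Array[String]): Unit = {
    // Two separately constructed tuple keys with the same contents
    val k1 = ("cluster041", "cluster043")
    val k2 = ("cluster041", "cluster043")

    // Structural equality and consistent hashing: required properties
    // for any type used as a reduceByKey / partitionBy key
    assert(k1 == k2)
    assert(k1.hashCode == k2.hashCode)
    println("(String, String) keys compare and hash consistently")
  }
}
```

So the key type is not the problem; duplicate keys surviving a reduceByKey point elsewhere (e.g. at how the input RDD is produced or recomputed), not at tuple keys themselves.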