[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737296#comment-14737296 ]

Glenn Strycker commented on SPARK-10493:
----------------------------------------

[~srowen], the code I attached did run correctly.  However, I have similar code 
that I run on YARN via spark-submit that is NOT returning one record per key.

What I mean is that when I run, via spark-submit, the code that generates temp5, 
I get a set as follows:

{noformat}
((cluster021,cluster023),(cluster021,1,2,1,3))
((cluster031,cluster033),(cluster031,1,2,1,3))
((cluster041,cluster043),(cluster041,5,2,1,3))
((cluster041,cluster043),(cluster041,1,2,1,3))
((cluster041,cluster044),(cluster041,3,2,1,3))
((cluster041,cluster044),(cluster041,4,2,1,3))
((cluster051,cluster052),(cluster051,6,2,1,3))
((cluster051,cluster053),(cluster051,1,2,1,3))
((cluster051,cluster054),(cluster051,1,2,1,3))
((cluster051,cluster055),(cluster051,1,2,1,3))
((cluster051,cluster056),(cluster051,1,2,1,3))
((cluster052,cluster053),(cluster051,1,1,1,2))
((cluster052,cluster054),(cluster051,8,1,1,2))
((cluster053,cluster054),(cluster051,7,1,1,2))
((cluster055,cluster056),(cluster051,9,1,1,2))
{noformat}

Note that the keys (cluster041,cluster043) and (cluster041,cluster044) each have 
2 records in the results, which should NEVER happen after a reduceByKey!
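
As a quick sanity check (a hypothetical snippet, assuming temp5 is the reduced 
RDD shown above), re-counting keys after the reduce makes the duplication easy 
to spot:

{noformat}
// Every key should survive reduceByKey exactly once, so dupKeys should be empty.
val dupKeys = temp5.map { case (k, _) => (k, 1) }
                   .reduceByKey(_ + _)
                   .filter { case (_, n) => n > 1 }
println(dupKeys.count)  // expected 0, but the set above would yield 2
{noformat}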

Here is what I expected (and what I actually see from the example code attached 
to this ticket, which ran successfully in spark-shell):

{noformat}
((cluster021,cluster023),(cluster021,1,2,1,3))
((cluster031,cluster033),(cluster031,1,2,1,3))
((cluster041,cluster043),(cluster041,6,2,1,3))
((cluster041,cluster044),(cluster041,7,2,1,3))
((cluster051,cluster052),(cluster051,6,2,1,3))
((cluster051,cluster053),(cluster051,1,2,1,3))
((cluster051,cluster054),(cluster051,1,2,1,3))
((cluster051,cluster055),(cluster051,1,2,1,3))
((cluster051,cluster056),(cluster051,1,2,1,3))
((cluster052,cluster053),(cluster051,1,1,1,2))
((cluster052,cluster054),(cluster051,8,1,1,2))
((cluster053,cluster054),(cluster051,7,1,1,2))
((cluster055,cluster056),(cluster051,9,1,1,2))
{noformat}

You are right that in my example, distinct is not really the issue, since the 
records with the same keys do have different values.  The issue is with 
reduceByKey, which is NOT correctly reducing my RDDs down to one record per 
key.  Does reduceByKey not support (String, String) keys?
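
For comparison, here is a minimal, self-contained sketch (assuming a local 
master and made-up data mirroring the (cluster041,cluster043) case; the reduce 
function is the one from the attached example) that uses (String, String) keys 
and does produce one record per key:

{noformat}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("SPARK-10493-repro").setMaster("local[2]"))

// Two records sharing the same (String, String) key, as in the set above.
val pairs = sc.parallelize(Seq(
  (("cluster041", "cluster043"), ("cluster041", 5, 2, 1, 3)),
  (("cluster041", "cluster043"), ("cluster041", 1, 2, 1, 3))
))

val reduced = pairs.
  partitionBy(new HashPartitioner(4)).
  reduceByKey((a, b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2,
    math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))

// Prints a single record: ((cluster041,cluster043),(cluster041,6,2,1,3))
reduced.collect.foreach(println)
{noformat}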

> reduceByKey not returning distinct results
> ------------------------------------------
>
>                 Key: SPARK-10493
>                 URL: https://issues.apache.org/jira/browse/SPARK-10493
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Glenn Strycker
>         Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey changes the final count!
> Here is some example code:
> val rdd3 = tempRDD1.
>    zipPartitions(tempRDD2, true)((iter, iter2) => iter ++ iter2).
>    partitionBy(new HashPartitioner(numPartitions)).
>    reduceByKey((a, b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
>      math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> val rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results


