[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102145#comment-14102145 ]

Sean Owen commented on SPARK-3098:
----------------------------------

Yes, I get the same result with Spark 1.0.0 plus patches, including the fix for
SPARK-2043, in standalone mode:
{code}
Array[(Int, (Long, Long))] = Array((9272040,(13,14)), (9985320,(14,13)), 
(32797680,(24,26)))
{code}
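
For context, here is a minimal sketch of the kind of code that produces an
array like that. The names, ranges, and partition count are my own
reconstruction, not the exact code from the ticket:
{code}
// Reconstruction (assumed, not the original repro): two overlapping ranges,
// unioned and deduplicated, then indexed twice. Joining the two indexings and
// keeping the disagreements surfaces keys whose index differs between runs.
val a = sc.parallelize(0 until 40000000, 280)
val b = sc.parallelize(20000000 until 60000000, 280)
val c = (a ++ b).distinct()
val z1 = c.zipWithIndex()
val z2 = c.zipWithIndex()
val mismatched = z1.join(z2).filter { case (_, (i, j)) => i != j }.collect()
// mismatched: Array[(Int, (Long, Long))], the shape shown above
{code}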

If I change the code above so that the ranges do not overlap to begin with, and
remove distinct(), I don't see the issue. It also goes away if the RDD c is
cached.

I would assume distinct() is deterministic, even if it doesn't guarantee an
ordering, and the same for zipWithIndex(). Either those assumptions are wrong,
or the issue could be in either place.
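
One way to separate the two: a hash-partitioned shuffle like distinct() sends
each value to a stable partition, but it does not guarantee the order of values
within a partition, and zipWithIndex() numbers elements by position. A quick
probe for that (a sketch, reusing the c from the repro above; shrink the ranges
first, since this collects to the driver):
{code}
// Each call to layout recomputes c; compare (partition, position) across runs.
def layout = c.mapPartitionsWithIndex { (p, it) =>
  it.zipWithIndex.map { case (k, pos) => (k, (p, pos)) }
}.collectAsMap()
val (first, second) = (layout, layout)
val reordered = first.count { case (k, where) => second(k) != where }
println(s"$reordered values changed position within their partition")
{code}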

A quick check says most keys are correct (no mismatch), and where there is a
mismatch it is generally small. That makes me wonder whether there is some kind
of race condition in handing out the numbers. I'll look at the code too.
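
For reference, zipWithIndex() assigns indices in two steps: a first job counts
the elements in each partition, then each partition gets a start offset and
numbers its elements in order. Roughly (a simplification of what
ZippedWithIndexRDD does, not the actual source):
{code}
// Sketch of the offset bookkeeping: a counting job sizes each partition,
// start offsets are the running totals, and partition p then emits indices
// startOffsets(p), startOffsets(p) + 1, and so on.
val counts: Array[Long] = c.mapPartitions(it => Iterator(it.size.toLong)).collect()
val startOffsets: Array[Long] = counts.scanLeft(0L)(_ + _)
{code}
If the parent's per-partition contents or order can change between the counting
job and the job that actually consumes the indexed RDD, as an uncached shuffle
output recomputed twice might, the indices would drift by small amounts. That
would line up with the small mismatches here, and with caching c making the
problem disappear.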

>  In some cases, the zipWithIndex operation returns wrong results
> -----------------------------------------------------------------
>
>                 Key: SPARK-3098
>                 URL: https://issues.apache.org/jira/browse/SPARK-3098
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Guoqiang Li
>            Priority: Critical
>
> I do not know how to reproduce the bug reliably.
> Here is the case: when I processed about 10 billion records with groupByKey,
> the results were wrong:
> {noformat}
> (4696501, 370568)
> (4696501, 376672)
> (4696501, 374880)
> .....
> (4696502, 350264)
> (4696502, 358458)
> (4696502, 398502)
> ......
> {noformat} 
> => 
> {noformat}
> (4696501,ArrayBuffer(350264, 358458, 398502 ........)), 
> (4696502,ArrayBuffer(376621, ......))
> {noformat}
> Code:
> {code}
> val dealOuts = clickPreferences(sc, dealOutPath, periodTime)
> val dealOrders = orderPreferences(sc, dealOrderPath, periodTime)
> val favorites = favoritePreferences(sc, favoritePath, periodTime)
> val allBehaviors = dealOrders ++ favorites ++ dealOuts
> val preferences = allBehaviors.groupByKey().map { ... }
> {code}
> spark-defaults.conf:
> {code}
> spark.default.parallelism    280
> {code}


