[
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102044#comment-14102044
]
Sean Owen commented on SPARK-3098:
----------------------------------
It would be helpful if you would explain what you are trying to reproduce here;
this is just code, and there's no continuous value here, for example. It
looks like you're producing overlapping sequences of numbers like 1..10000,
6001..16000, ..., then flattening and removing duplicates to get the range
1..47404000. That's zipped with its index to get these as (n, n-1) pairs.
Then you map the second element to a String, do the same to a subset of the
data, join them, and check for mismatches, because there shouldn't be any:
all the keys and values are unique. But why would this demonstrate something
about zipWithIndex more directly than a test of the RDD c?
More importantly, I ran this locally and got an empty Array, as expected.
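For reference, here is a minimal sketch of the kind of check described above.
The exact ranges, variable names, and the explicit sort step are assumptions,
since the original snippet is not quoted in this comment:
{code}
// Hypothetical reconstruction (assumed ranges and names, for illustration).
// Overlapping ranges 1..10000, 6001..16000, ..., deduplicated and sorted,
// give the range 1..47404000.
val nums = sc.parallelize(0 until 7900, 280)
  .flatMap(k => (k * 6000 + 1) to (k * 6000 + 10000))
  .distinct()
  .sortBy(identity)

// Zip each value n with its index; once sorted, the index is n - 1, and the
// second element is mapped to a String.
val indexed = nums.zipWithIndex().map { case (n, i) => (n, i.toString) }

// Do the same for a subset, join on the keys, and look for mismatches.
val subset = indexed.filter { case (n, _) => n % 1000 == 0 }
val mismatches = indexed.join(subset).filter { case (_, (a, b)) => a != b }
mismatches.collect() // expected: an empty Array
{code}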
> In some cases, the zipWithIndex operation returns wrong results
> ----------------------------------------------------------------
>
> Key: SPARK-3098
> URL: https://issues.apache.org/jira/browse/SPARK-3098
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.1
> Reporter: Guoqiang Li
> Priority: Critical
>
> I do not know how to reproduce the bug.
> Here is the case: when I ran groupByKey over roughly 10 billion records, the
> results were wrong:
> {noformat}
> (4696501, 370568)
> (4696501, 376672)
> (4696501, 374880)
> .....
> (4696502, 350264)
> (4696502, 358458)
> (4696502, 398502)
> ......
> {noformat}
> which groupByKey turned into the following (note that 4696501's group
> contains values from 4696502's input pairs):
> {noformat}
> (4696501,ArrayBuffer(350264, 358458, 398502 ........)),
> (4696502,ArrayBuffer(376621, ......))
> {noformat}
> Code:
> {code}
> val dealOuts = clickPreferences(sc, dealOutPath, periodTime)
> val dealOrders = orderPreferences(sc, dealOrderPath, periodTime)
> val favorites = favoritePreferences(sc, favoritePath, periodTime)
> val allBehaviors = (dealOrders ++ favorites ++ dealOuts)
> val preferences = allBehaviors.groupByKey().map { ... }
> {code}
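> For contrast, here is a minimal, self-contained groupByKey sketch on toy data
> modeled on the pairs above (an illustration, not the actual pipeline), showing
> the expected behavior where each key keeps only its own values:
> {code}
> // Toy pairs copied from the report above (assumed, for illustration only).
> val pairs = sc.parallelize(Seq(
>   (4696501, 370568), (4696501, 376672), (4696501, 374880),
>   (4696502, 350264), (4696502, 358458), (4696502, 398502)))
>
> // groupByKey should collect each key's own values together.
> pairs.groupByKey().collect().foreach { case (k, vs) =>
>   println(s"($k,${vs.mkString("ArrayBuffer(", ", ", ")")})")
> }
> // Expected:
> // (4696501,ArrayBuffer(370568, 376672, 374880))
> // (4696502,ArrayBuffer(350264, 358458, 398502))
> {code}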
> spark-defaults.conf:
> {code}
> spark.default.parallelism 280
> {code}
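> For context, spark.default.parallelism sets the number of partitions that
> shuffle operations such as groupByKey use when no partitioner or partition
> count is given explicitly; a small sketch:
> {code}
> // With spark.default.parallelism = 280 and no explicit argument, the
> // shuffle behind groupByKey produces 280 partitions.
> val grouped = allBehaviors.groupByKey()
> // An explicit partition count overrides the default.
> val grouped64 = allBehaviors.groupByKey(64)
> {code}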