Hi everyone,

I have run into a problem recently, and I don't know whether it is a bug in Spark. When I use groupByKey on our SequenceFile data, I find that different partition numbers lead to different results, and the same happens with reduceByKey. I think the problem happens in the shuffle stage. I read the source code, but I still can't find the answer.
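One thing I have been wondering about, in case it is relevant: Hadoop SequenceFile readers reuse the same Writable instance for every record, so any code that keeps references to the values without copying them only ever sees the last value written into that shared buffer. Here is a minimal pure-Scala sketch of that reuse effect (no Spark needed; the Tags class is a hypothetical stand-in for a mutable Writable, not our real TagsWritable):

```scala
// Stand-in for a mutable Hadoop Writable (hypothetical, for illustration).
class Tags(var s: String)

val buffer = new Tags("") // one shared record object, mutated in place

// Reader-style source that mutates and returns the SAME object each time,
// the way a Hadoop RecordReader reuses its key/value instances.
def records = Iterator("a", "b", "c").map { v => buffer.s = v; buffer }

// Storing references without copying: every stored entry ends up holding
// the last value written into the shared buffer.
val noCopy = records.toList.map(_.s)     // List(c, c, c)

// Copying each record before storing it preserves the distinct values.
val withCopy = records.map(t => new Tags(t.s)).toList.map(_.s)  // List(a, b, c)
```

If something like this is the cause, copying the values before the shuffle (for example with org.apache.hadoop.io.WritableUtils.clone, or a copy constructor on TagsWritable if one exists) might make the result independent of the partition number. It could also explain why the text file case came out right, since the String values produced there are immutable.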
This is the main code:

    val rdd = sc.sequenceFile[UserWritable, TagsWritable](input, classOf[UserWritable], classOf[TagsWritable])
    val combinedRdd = rdd.map(s => (s._1.getuserid(), s._2)).groupByKey(num).filter(_._1 == uid)

Here num is the number of partitions, and uid is a filter id for comparing results. TagsWritable implements WritableComparable<TagsWritable> and Serializable. When I used groupByKey on a text file, the result was correct.

Thanks,
Devin Huang

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Different-partition-number-of-GroupByKey-leads-different-result-tp24989.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.