Another guess, since you say the key is a String (which rules out the
hashCode/equals theory): you are not cloning the TagsWritable value.
Hadoop reuses the same object under the hood when reading, mutating it
in place, so you can't save references to the objects you get from
reading a SequenceFile.
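A minimal sketch of the fix, assuming TagsWritable is a standard Hadoop Writable (input, num, and uid are the names from the original post). It clones each value with Hadoop's WritableUtils.clone before groupByKey buffers them, so every buffered reference points at its own copy rather than at the single reused object:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.WritableUtils

val rdd = sc.sequenceFile[UserWritable, TagsWritable](input,
  classOf[UserWritable], classOf[TagsWritable])

val combinedRdd = rdd
  .map { case (user, tags) =>
    // Deep-copy the value via serialization; without this, every record
    // in a partition shares the one TagsWritable instance Hadoop reuses.
    (user.getuserid(), WritableUtils.clone(tags, new Configuration()))
  }
  .groupByKey(num)
  .filter(_._1 == uid)
```

The key here is a plain String, so only the value needs cloning; if a Writable key were kept as well, it would need the same treatment. (Not testable standalone; requires a running SparkContext and the SequenceFile input.)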

On Fri, Oct 9, 2015 at 10:22 AM, Sean Owen <so...@cloudera.com> wrote:
> First guess: your key class does not implement hashCode/equals
>
> On Fri, Oct 9, 2015 at 10:05 AM, Devin Huang <hos...@163.com> wrote:
>> Hi everyone,
>>
>>      I ran into a problem these days, and I don't know whether it is a bug in
>> Spark. When I use groupByKey on our SequenceFile data, I find that different
>> partition numbers lead to different results, and the same happens with
>> reduceByKey. I think the problem happens in the shuffle stage. I read the
>> source code, but still can't find the answer.
>>
>>
>> this is the main code:
>>
>> val rdd = sc.sequenceFile[UserWritable, TagsWritable](input,
>> classOf[UserWritable], classOf[TagsWritable])
>> val combinedRdd = rdd.map(s => (s._1.getuserid(),
>> s._2)).groupByKey(num).filter(_._1 == uid)
>>
>> num is the number of partitions and uid is a filter id for result
>> comparison.
>> TagsWritable implements WritableComparable<TagsWritable> and Serializable.
>>
>> When I used groupByKey on a text file, the result was correct.
>>
>> Thanks,
>> Devin Huang
>>
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Different-partition-number-of-GroupByKey-leads-different-result-tp24989.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
