That is a little scary. So do you mean that, in general, we shouldn't use Hadoop Writables as keys in an RDD?
Zheng zheng

On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen <so...@cloudera.com> wrote:

> Guess: it has something to do with the Text object being reused by Hadoop?
> You can't in general keep references to them, since their contents change.
> So you may end up with a bunch of references to one object that collapse
> into just one value per partition.
>
> On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxin...@gmail.com> wrote:
>
>> I load a list of ids from a text file as NLineInputFormat, and when I
>> do distinct(), it returns an incorrect number.
>>
>>     JavaRDD<Text> idListData = jvc
>>             .hadoopFile(idList, NLineInputFormat.class,
>>                     LongWritable.class, Text.class)
>>             .values()
>>             .distinct();
>>
>> I should have 7000K distinct values; however, it only returns 7000
>> values, which is the same as the number of tasks. The type I am using is
>> org.apache.hadoop.io.Text.
>>
>> However, if I switch to using String instead of Text, it works correctly.
>>
>> I thought the Text class should have correct implementations of equals()
>> and hashCode(), since it is a Hadoop class.
>>
>> Does anyone have a clue what is going on?
>>
>> I am using Spark 1.2.
>>
>> Zheng zheng
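The reuse pitfall Sean describes can be reproduced without Spark or Hadoop at all: Hadoop record readers hand back the same mutable Writable for every record, so if you buffer the references themselves (as a shuffle or distinct() does per partition) instead of copying the contents out, every buffered reference ends up holding the last record read. Below is a minimal sketch in plain Java; MutableText here is a hypothetical stand-in for org.apache.hadoop.io.Text, not the real class, and dedupSizes is just an illustrative helper:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReusePitfall {
    // Stand-in for org.apache.hadoop.io.Text: a mutable holder whose
    // equals()/hashCode() reflect its *current* contents.
    static final class MutableText {
        private String value = "";
        void set(String v) { value = v; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableText && ((MutableText) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
        @Override public String toString() { return value; }
    }

    // Returns {size when buffering the shared ref, size when copying}.
    static int[] dedupSizes(String[] records) {
        MutableText reused = new MutableText();   // one object, reused per record
        List<MutableText> refBuffer = new ArrayList<>();
        List<String> copyBuffer = new ArrayList<>();
        for (String r : records) {
            reused.set(r);                        // reader overwrites in place
            refBuffer.add(reused);                // bug: buffers the shared ref
            copyBuffer.add(reused.toString());    // fix: copies the contents out
        }
        // By the time we dedup, every buffered ref holds the last record,
        // so the "distinct" set collapses to a single element.
        Set<MutableText> wrong = new HashSet<>(refBuffer);
        Set<String> right = new HashSet<>(copyBuffer);
        return new int[] { wrong.size(), right.size() };
    }

    public static void main(String[] args) {
        int[] sizes = dedupSizes(new String[] { "id-1", "id-2", "id-3" });
        System.out.println("buffering refs:  " + sizes[0]); // 1
        System.out.println("copying values:  " + sizes[1]); // 3
    }
}
```

This matches the symptom in the thread: roughly one surviving value per partition buffer. The usual workaround is to copy before anything buffers the records, e.g. mapping to String with `.map(Text::toString)` right after loading, or copying the Writable with `new Text(t)`, and only then calling distinct().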