Yep, you need to use a transformation of the raw value; toString, for example.
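To see why the copy matters: Hadoop's record readers reuse a single Text instance and overwrite it for each record, so holding raw references means every element ends up aliasing one mutated object. A minimal sketch of the pitfall and the fix, using a hypothetical `ReusableText` class as a stand-in for `org.apache.hadoop.io.Text` (Hadoop itself isn't needed to show the effect):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for org.apache.hadoop.io.Text:
// a mutable buffer with value-based equals()/hashCode().
class ReusableText {
    private String value = "";
    void set(String v) { value = v; }
    @Override public String toString() { return value; }
    @Override public boolean equals(Object o) {
        return o instanceof ReusableText && ((ReusableText) o).value.equals(value);
    }
    @Override public int hashCode() { return value.hashCode(); }
}

public class ReuseDemo {
    public static void main(String[] args) {
        ReusableText buffer = new ReusableText();   // one object, reused per record
        List<ReusableText> rawRefs = new ArrayList<>();
        List<String> copies = new ArrayList<>();

        for (String id : new String[] {"a", "b", "c"}) {
            buffer.set(id);                // the reader overwrites the same instance
            rawRefs.add(buffer);           // BUG: every element aliases one buffer
            copies.add(buffer.toString()); // OK: materialize an immutable copy
        }

        Set<ReusableText> distinctRaw = new HashSet<>(rawRefs);
        Set<String> distinctCopies = new HashSet<>(copies);

        System.out.println(distinctRaw.size());    // all refs collapsed to the last value
        System.out.println(distinctCopies.size()); // correct distinct count
    }
}
```

In the Spark job from the thread, the analogous fix would be mapping the values to strings before calling distinct(), e.g. `.map(t -> t.toString())` on the `JavaRDD<Text>` (assuming Spark's Java lambda API), so each partition holds independent immutable keys.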
On Thu, Jun 11, 2015, 8:54 PM Crystal Xing <crystalxin...@gmail.com> wrote:

> That is a little scary.
> So you mean that, in general, we shouldn't use Hadoop's Writable as a key in an RDD?
>
> Zheng zheng
>
> On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Guess: it has something to do with the Text object being reused by
>> Hadoop? You can't in general keep around refs to them, since they change.
>> So you may have a bunch of copies of one object at the end that become
>> just one in each partition.
>>
>> On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxin...@gmail.com> wrote:
>>
>>> I load a list of IDs from a text file with NLineInputFormat, and when I
>>> call distinct(), it returns an incorrect count.
>>>
>>> JavaRDD<Text> idListData = jvc
>>>         .hadoopFile(idList, NLineInputFormat.class,
>>>                 LongWritable.class, Text.class)
>>>         .values()
>>>         .distinct();
>>>
>>> I should have 7000K distinct values; however, it returns only 7000
>>> values, which is the same as the number of tasks. The type I am using is
>>> org.apache.hadoop.io.Text.
>>>
>>> However, if I switch to String instead of Text, it works correctly.
>>>
>>> I would expect the Text class to have correct implementations of
>>> equals() and hashCode(), since it is a Hadoop class.
>>>
>>> Does anyone have a clue what is going on?
>>>
>>> I am using Spark 1.2.
>>>
>>> Zheng zheng