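My guess: this has to do with Hadoop reusing the same Text object for every
record it reads. In general you can't keep references to these objects around,
because their contents change as the reader moves on. So you can end up with
many references to one object in each partition, and distinct() collapses them
to a single value per partition. Converting to String copies the contents,
which would explain why that version works.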
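
To illustrate what I mean, here is a small sketch in plain Java, with no Spark
or Hadoop involved. MutableRecord and ReuseDemo are made-up names, and
MutableRecord is only a hypothetical stand-in for a reused Text object, so
treat this as a model of the suspected behaviour rather than what Spark does
internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical stand-in for a mutable record object that a reader reuses,
// similar in spirit to org.apache.hadoop.io.Text. Not a Hadoop class.
final class MutableRecord {
    private String value = "";

    void set(String v) { value = v; }

    @Override public boolean equals(Object o) {
        return o instanceof MutableRecord && ((MutableRecord) o).value.equals(value);
    }

    @Override public int hashCode() { return value.hashCode(); }

    @Override public String toString() { return value; }
}

class ReuseDemo {
    // Buffers references to the one reused object. After the loop every
    // buffered entry shows the last value read, so only one distinct value is left.
    static int distinctKeepingReferences(List<String> ids) {
        MutableRecord reused = new MutableRecord();
        List<MutableRecord> buffered = new ArrayList<>();
        for (String id : ids) {
            reused.set(id);        // the "reader" overwrites the same object
            buffered.add(reused);  // we store a reference, not a copy
        }
        return new HashSet<>(buffered).size();
    }

    // Copies the contents into an immutable String before buffering.
    static int distinctAfterCopy(List<String> ids) {
        MutableRecord reused = new MutableRecord();
        List<String> buffered = new ArrayList<>();
        for (String id : ids) {
            reused.set(id);
            buffered.add(reused.toString());  // snapshot of the current contents
        }
        return new HashSet<>(buffered).size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c", "a");
        System.out.println(distinctKeepingReferences(ids)); // prints 1
        System.out.println(distinctAfterCopy(ids));         // prints 3
    }
}
```

If this is what is happening in your job, the equivalent fix would be to map
each Text to a String with toString() before calling distinct(), which matches
what you saw when you switched to String.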

On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxin...@gmail.com> wrote:

> I load a list of ids from a text file using NLineInputFormat, and when I call
> distinct(), it returns an incorrect number.
>  JavaRDD<Text> idListData = jvc
>                 .hadoopFile(idList, NLineInputFormat.class,
>                         LongWritable.class, Text.class).values().distinct();
>
>
> I should have 7000K distinct values; however, it only returns 7000 values,
> which is the same as the number of tasks.  The type I am using is
> import org.apache.hadoop.io.Text;
>
>
> However, if I switch to String instead of Text, it works correctly.
>
> I think the Text class should have a correct implementation of equals() and
> hashCode(), since it is a Hadoop class.
>
> Does anyone have a clue what is going on?
>
> I am using Spark 1.2.
>
> Zheng zheng
>
>
>
