Spark distinct() returns incorrect results for some types?

Crystal Xing Thu, 11 Jun 2015 11:36:48 -0700

I load a list of IDs from a text file using NLineInputFormat, and when I call
distinct() on it, it returns an incorrect count.
 JavaRDD<Text> idListData = jvc
                .hadoopFile(idList, NLineInputFormat.class,
                        LongWritable.class, Text.class).values().distinct();



I should have 7000K distinct values; however, it returns only 7000 values,
which is the same as the number of tasks.  The type I am using is
import org.apache.hadoop.io.Text;


However, if I switch to String instead of Text, it works correctly.
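
One possible explanation for the Text versus String difference, stated here as
an assumption and not verified against this job: the Spark API documentation
for hadoopFile notes that Hadoop's RecordReader reuses the same Writable object
for each record, so a shuffle operation such as distinct() may end up holding
many references to one object per task. The plain-Java sketch below imitates
that effect without Spark or Hadoop; MutableId and both method names are
hypothetical stand-ins, not part of either library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

class WritableReuseSketch {

    // Hypothetical stand-in for a mutable Writable such as Text.
    // equals() and hashCode() are content-based and correct.
    static final class MutableId {
        private String value = "";

        void set(String v) { value = v; }

        @Override
        public boolean equals(Object o) {
            return o instanceof MutableId && ((MutableId) o).value.equals(value);
        }

        @Override
        public int hashCode() { return value.hashCode(); }

        @Override
        public String toString() { return value; }
    }

    // Buffers the reused object itself, as a reader that recycles one
    // instance would hand it out, then de-duplicates the buffer.
    static int distinctWithoutCopy(List<String> lines) {
        MutableId reused = new MutableId();
        List<MutableId> buffered = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);      // overwrite the single shared instance
            buffered.add(reused);  // every entry points at that instance
        }
        return new HashSet<>(buffered).size();
    }

    // Copies each record to an immutable String before buffering.
    static int distinctWithCopy(List<String> lines) {
        MutableId reused = new MutableId();
        List<String> buffered = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);
            buffered.add(reused.toString()); // snapshot of the current content
        }
        return new HashSet<>(buffered).size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c", "a");
        System.out.println(distinctWithoutCopy(ids)); // prints 1
        System.out.println(distinctWithCopy(ids));    // prints 3
    }
}
```

If this is the cause, copying each record before the shuffle, for example
mapping every Text to its toString() value before calling distinct(), should
restore the expected count, and one surviving value per task would be
consistent with the 7000 results seen here.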

I would expect the Text class to implement equals() and hashCode() correctly,
since it is a Hadoop class.

Does anyone have a clue what is going on?

I am using Spark 1.2.

Zheng zheng
