I load a list of ids from a text file using NLineInputFormat, and when I call distinct() on the values, it returns an incorrect number of results.

    JavaRDD<Text> idListData = jvc
        .hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class)
        .values()
        .distinct();

I should have 7000K distinct values; however, it only returns 7000 values, which is the same as the number of tasks. The type I am using is org.apache.hadoop.io.Text. If I switch to String instead of Text, it works correctly. I would expect the Text class to have correct implementations of equals() and hashCode(), since it is a Hadoop class.

Does anyone have a clue what is going on? I am using Spark 1.2.

Zheng zheng
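
One possible explanation, sketched below in plain Java with no Spark or Hadoop dependency: Hadoop record readers typically reuse a single mutable Text instance for every record they return, so holding on to the references without copying them leaves each task with many pointers to one object. That would fit the count matching the number of tasks. The sketch is a simplification under stated assumptions, not the actual Spark code path: `ReusedValue` is a hypothetical stand-in for Text, and each method simulates a single task that buffers its records before deduplicating them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReuseDemo {

    // Hypothetical stand-in for a mutable Hadoop Text (not a Hadoop class).
    // equals() and hashCode() compare content, as one would expect of Text.
    static final class ReusedValue {
        private String content = "";

        void set(String s) { content = s; }

        @Override
        public boolean equals(Object o) {
            return o instanceof ReusedValue
                    && ((ReusedValue) o).content.equals(content);
        }

        @Override
        public int hashCode() { return content.hashCode(); }

        @Override
        public String toString() { return content; }
    }

    // Simulates one task that keeps the reader's reused instance as-is.
    // Every buffered element is the same object, so only one value survives.
    static int distinctWithoutCopy(List<String> lines) {
        ReusedValue shared = new ReusedValue();
        List<ReusedValue> buffered = new ArrayList<>();
        for (String line : lines) {
            shared.set(line);      // the reader overwrites the same instance
            buffered.add(shared);  // a reference is stored, not a copy
        }
        return new HashSet<>(buffered).size();
    }

    // Simulates one task that copies each record to an immutable String first.
    static int distinctWithCopy(List<String> lines) {
        ReusedValue shared = new ReusedValue();
        Set<String> seen = new HashSet<>();
        for (String line : lines) {
            shared.set(line);
            seen.add(shared.toString());  // snapshot of the current content
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("id-1", "id-2", "id-3", "id-1");
        System.out.println("without copy: " + distinctWithoutCopy(ids));
        System.out.println("with copy:    " + distinctWithCopy(ids));
    }
}
```

If this is what happens in the job, the corresponding step in Spark would be to map each Text to a String with toString() before calling distinct(), which matches the observation that the String version returns the expected count.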