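My guess: this is caused by Hadoop reusing the Text object. The record reader hands back the same mutable instance for every record, so in general you can't keep references to it around, because its contents change as each new record is read. You can end up holding many references to a single object, which then collapse to just one value in each partition.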
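To illustrate the effect without a cluster, here is a minimal, dependency-free sketch. MutableText is a hypothetical stand-in for a reusable holder like Text, and ReuseDemo and its method names are made up for this example; it is not how Spark or Hadoop implement anything, only a model of keeping references to one reused object versus copying its contents first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

class ReuseDemo {

    /** Hypothetical stand-in for a mutable, reusable record holder such as Text. */
    static final class MutableText {
        private String value = "";

        void set(String v) { value = v; }

        // Content-based equality, as a well-behaved value holder would have.
        @Override public boolean equals(Object o) {
            return o instanceof MutableText && ((MutableText) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
        @Override public String toString() { return value; }
    }

    /** Buffers references to the one reused holder, then de-duplicates. */
    static int distinctKeepingReferences(List<String> lines) {
        MutableText reused = new MutableText();   // a single instance for all records
        List<MutableText> buffered = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);                     // overwrites the previous record
            buffered.add(reused);                 // stores the same reference again
        }
        // Every element is the same object holding the last record.
        return new HashSet<>(buffered).size();
    }

    /** Copies each record into an immutable String before buffering. */
    static int distinctAfterCopying(List<String> lines) {
        MutableText reused = new MutableText();
        List<String> buffered = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);
            buffered.add(reused.toString());      // snapshot of the current contents
        }
        return new HashSet<>(buffered).size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("id-1", "id-2", "id-3");
        System.out.println(distinctKeepingReferences(ids)); // prints 1
        System.out.println(distinctAfterCopying(ids));      // prints 3
    }
}
```

In the Spark job, the equivalent of the second method would be to map each Text to a String (for example via toString()) before calling distinct(), which matches your observation that it works with String.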
On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxin...@gmail.com> wrote:
> I load a list of ids from a text file as NLineInputFormat, and when I do
> distinct(), it returns an incorrect number.
>
> JavaRDD<Text> idListData = jvc
>     .hadoopFile(idList, NLineInputFormat.class,
>         LongWritable.class, Text.class).values().distinct()
>
> I should have 7000K distinct values; however, it only returns 7000 values,
> which is the same as the number of tasks. The type I am using is
>
> import org.apache.hadoop.io.Text;
>
> However, if I switch to using String instead of Text, it works correctly.
>
> I think the Text class should have a correct implementation of equals() and
> hashCode() since it is the Hadoop class.
>
> Does anyone have a clue what is going on?
>
> I am using Spark 1.2.
>
> Zheng zheng