That is a little scary. So do you mean that, in general, we shouldn't use Hadoop Writables as keys in an RDD?
Zheng zheng

On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen <so...@cloudera.com> wrote:

> Guess: it has something to do with the Text object being reused by Hadoop?
> You can't in general keep references to them, since their contents change.
> So you may end up with a bunch of references to one object that collapse
> into just one value per partition.
>
> On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxin...@gmail.com> wrote:
>
>> I load a list of ids from a text file as NLineInputFormat, and when I
>> do distinct(), it returns an incorrect number.
>>
>>     JavaRDD<Text> idListData = jvc
>>             .hadoopFile(idList, NLineInputFormat.class,
>>                     LongWritable.class, Text.class)
>>             .values()
>>             .distinct();
>>
>> I should have 7000K distinct values; however, it only returns 7000
>> values, which is the same as the number of tasks. The type I am using is
>> org.apache.hadoop.io.Text.
>>
>> However, if I switch to using String instead of Text, it works correctly.
>>
>> I thought the Text class should have correct implementations of equals()
>> and hashCode(), since it is a Hadoop class.
>>
>> Does anyone have a clue what is going on?
>>
>> I am using Spark 1.2.
>>
>> Zheng zheng
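The reuse pitfall Sean describes can be reproduced without Spark or Hadoop at all: Hadoop record readers hand back the same mutable Writable for every record, so if you buffer the references themselves (as a shuffle or distinct() does per partition) instead of copying the contents out, every buffered reference ends up holding the last record read. Below is a minimal sketch in plain Java; MutableText here is a hypothetical stand-in for org.apache.hadoop.io.Text, not the real class, and dedupSizes is just an illustrative helper:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReusePitfall {
    // Stand-in for org.apache.hadoop.io.Text: a mutable holder whose
    // equals()/hashCode() reflect its *current* contents.
    static final class MutableText {
        private String value = "";
        void set(String v) { value = v; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableText && ((MutableText) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
        @Override public String toString() { return value; }
    }

    // Returns {size when buffering the shared ref, size when copying}.
    static int[] dedupSizes(String[] records) {
        MutableText reused = new MutableText();   // one object, reused per record
        List<MutableText> refBuffer = new ArrayList<>();
        List<String> copyBuffer = new ArrayList<>();
        for (String r : records) {
            reused.set(r);                        // reader overwrites in place
            refBuffer.add(reused);                // bug: buffers the shared ref
            copyBuffer.add(reused.toString());    // fix: copies the contents out
        }
        // By the time we dedup, every buffered ref holds the last record,
        // so the "distinct" set collapses to a single element.
        Set<MutableText> wrong = new HashSet<>(refBuffer);
        Set<String> right = new HashSet<>(copyBuffer);
        return new int[] { wrong.size(), right.size() };
    }

    public static void main(String[] args) {
        int[] sizes = dedupSizes(new String[] { "id-1", "id-2", "id-3" });
        System.out.println("buffering refs:  " + sizes[0]); // 1
        System.out.println("copying values:  " + sizes[1]); // 3
    }
}
```

This matches the symptom in the thread: roughly one surviving value per partition buffer. The usual workaround is to copy before anything buffers the records, e.g. mapping to String with `.map(Text::toString)` right after loading, or copying the Writable with `new Text(t)`, and only then calling distinct().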