Yep, you need to use a transformation of the raw value; toString, for example.
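To see why the copy matters: Hadoop's record readers reuse a single Text instance and overwrite it for each record, so holding raw references means every element ends up aliasing one mutated object. A minimal sketch of the pitfall and the fix, using a hypothetical `ReusableText` class as a stand-in for `org.apache.hadoop.io.Text` (Hadoop itself isn't needed to show the effect):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for org.apache.hadoop.io.Text:
// a mutable buffer with value-based equals()/hashCode().
class ReusableText {
    private String value = "";
    void set(String v) { value = v; }
    @Override public String toString() { return value; }
    @Override public boolean equals(Object o) {
        return o instanceof ReusableText && ((ReusableText) o).value.equals(value);
    }
    @Override public int hashCode() { return value.hashCode(); }
}

public class ReuseDemo {
    public static void main(String[] args) {
        ReusableText buffer = new ReusableText();   // one object, reused per record
        List<ReusableText> rawRefs = new ArrayList<>();
        List<String> copies = new ArrayList<>();

        for (String id : new String[] {"a", "b", "c"}) {
            buffer.set(id);                // the reader overwrites the same instance
            rawRefs.add(buffer);           // BUG: every element aliases one buffer
            copies.add(buffer.toString()); // OK: materialize an immutable copy
        }

        Set<ReusableText> distinctRaw = new HashSet<>(rawRefs);
        Set<String> distinctCopies = new HashSet<>(copies);

        System.out.println(distinctRaw.size());    // all refs collapsed to the last value
        System.out.println(distinctCopies.size()); // correct distinct count
    }
}
```

In the Spark job from the thread, the analogous fix would be mapping the values to strings before calling distinct(), e.g. `.map(t -> t.toString())` on the `JavaRDD<Text>` (assuming Spark's Java lambda API), so each partition holds independent immutable keys.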
On Thu, Jun 11, 2015, 8:54 PM Crystal Xing <crystalxin...@gmail.com> wrote:

> That is a little scary.
> So you mean that, in general, we shouldn't use Hadoop's Writable as a key in an RDD?
>
> Zheng zheng
>
> On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Guess: it has something to do with the Text object being reused by
>> Hadoop? You can't in general keep around refs to them, since they change.
>> So you may have a bunch of copies of one object at the end that become
>> just one in each partition.
>>
>> On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxin...@gmail.com> wrote:
>>
>>> I load a list of IDs from a text file with NLineInputFormat, and when I
>>> call distinct(), it returns an incorrect count.
>>>
>>> JavaRDD<Text> idListData = jvc
>>>         .hadoopFile(idList, NLineInputFormat.class,
>>>                 LongWritable.class, Text.class)
>>>         .values()
>>>         .distinct();
>>>
>>> I should have 7000K distinct values; however, it returns only 7000
>>> values, which is the same as the number of tasks. The type I am using is
>>> org.apache.hadoop.io.Text.
>>>
>>> However, if I switch to String instead of Text, it works correctly.
>>>
>>> I would expect the Text class to have correct implementations of
>>> equals() and hashCode(), since it is a Hadoop class.
>>>
>>> Does anyone have a clue what is going on?
>>>
>>> I am using Spark 1.2.
>>>
>>> Zheng zheng