Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
I see. It makes a lot of sense now. It is not unique to Spark, but it would be great if it were mentioned in the Spark documentation. I have been using Hadoop for a while and I was not aware of it! Zheng zheng On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs wrbri...@gmail.com wrote: To be fair, this is

Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
I load a list of ids from a text file as NLineInputFormat, and when I do distinct(), it returns an incorrect count. JavaRDD<Text> idListData = jvc.hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class).values().distinct() I should have
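
For context, a minimal self-contained version of the pipeline described in this message; the class name, SparkConf setup, and args[0] path are illustrative, while jvc, idList, and the hadoopFile call are taken from the post:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DistinctIds {
      public static void main(String[] args) {
        JavaSparkContext jvc =
            new JavaSparkContext(new SparkConf().setAppName("distinct-ids"));
        String idList = args[0];  // path to the text file of ids, one per line

        // Keys are byte offsets into the file; values are the lines themselves.
        JavaRDD<Text> idListData = jvc
            .hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class)
            .values()
            .distinct();

        // Reported symptom: this count comes back smaller than the true
        // number of distinct ids in the file.
        System.out.println(idListData.count());
      }
    }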

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Sean Owen
Guess: it has something to do with the Text object being reused by Hadoop? You can't in general keep references to them around, since they change. So you may end up with a bunch of copies of one object that become just one in each partition. On Thu, Jun 11, 2015, 8:36 PM Crystal Xing
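
A standalone illustration (not from the thread) of the reuse behavior Sean is describing, using Hadoop's Text directly:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.io.Text;

    public class ReusePitfall {
      public static void main(String[] args) {
        // Hadoop record readers typically mutate and re-emit one value object.
        Text shared = new Text();
        List<Text> seen = new ArrayList<>();
        for (String line : Arrays.asList("a", "b", "c")) {
          shared.set(line);  // the reader overwrites the same instance...
          seen.add(shared);  // ...so keeping the reference keeps no copy
        }
        // All three entries alias one object that now holds "c".
        System.out.println(seen);  // prints [c, c, c]
      }
    }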

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
That is a little scary. So you mean that, in general, we shouldn't use Hadoop's Writables as keys in an RDD? Zheng zheng On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote: Guess: it has something to do with the Text object being reused by Hadoop? You can't in general keep around refs

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Sean Owen
Yep, you need to use a transformation of the raw value; use toString, for example. On Thu, Jun 11, 2015, 8:54 PM Crystal Xing crystalxin...@gmail.com wrote: That is a little scary. So you mean in general, we shouldn't use hadoop's writable as Key in RDD? Zheng zheng On Thu, Jun 11, 2015 at
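
A sketch of that fix applied to the pipeline from the first message: copy each reused Text into an immutable String before calling distinct(). The class and method names here are illustrative:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DistinctIdsFixed {
      // Each reused Text is copied to an immutable String before distinct(),
      // so no two records can alias the same mutable buffer.
      static JavaRDD<String> distinctIds(JavaSparkContext jvc, String idList) {
        return jvc
            .hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class)
            .values()
            .map(Text::toString)
            .distinct();
      }
    }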

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Will Briggs
To be fair, this is a long-standing issue due to optimizations for object reuse in the Hadoop API, and isn't necessarily a failing in Spark - see this blog post (https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/) from 2011 documenting
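
For reference, the classic form of the pitfall the linked post documents looks like this in a Hadoop reducer (illustrative code, not from the thread):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CacheValuesReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<Text> cached = new ArrayList<>();
        for (Text v : values) {
          cached.add(v);              // wrong: every entry aliases one buffer
          // cached.add(new Text(v)); // right: deep-copy before keeping it
        }
        // cached now appears to hold many copies of the last value only.
      }
    }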