I see. It makes a lot of sense now. It is not unique to Spark, but it would
be great if it were mentioned in the Spark documentation.
I have been using Hadoop for a while and I was not aware of it!
Zheng zheng
On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs wrbri...@gmail.com wrote:
To be fair, this is a long-standing issue due to optimizations for object
reuse in the Hadoop API, and isn't necessarily a failing in Spark ...
I load a list of IDs from a text file as NLineInputFormat, and when I do
distinct(), it returns an incorrect count.
JavaRDD<Text> idListData = jvc
    .hadoopFile(idList, NLineInputFormat.class,
        LongWritable.class, Text.class).values().distinct();
I should have ...
Guess: it has something to do with the Text object being reused by Hadoop?
You can't, in general, keep references to them around, since they change
under you. So you may end up holding a bunch of references to one object at
the end, which become just one value in each partition.
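For illustration, here is a minimal, self-contained Java sketch (not from the
thread) that mimics what a Hadoop record reader does: one Text instance is
reused for every record, so holding references to it leaves you with N
pointers to a single object:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;

public class ReusePitfall {
    public static void main(String[] args) {
        // One Text instance, reused for every "record", the way Hadoop
        // record readers do internally.
        Text reused = new Text();
        List<Text> refs = new ArrayList<>();      // buggy: keeps references
        List<String> copies = new ArrayList<>();  // safe: materializes copies

        for (String id : new String[] {"a", "b", "c"}) {
            reused.set(id);                 // mutate the shared object in place
            refs.add(reused);               // every element aliases the same Text
            copies.add(reused.toString());  // toString() makes an independent String
        }

        System.out.println(refs);    // prints [c, c, c] -- all the last value
        System.out.println(copies);  // prints [a, b, c] -- values preserved
    }
}

Running distinct() over the aliased references collapses them to one, which
is the miscount described above.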
On Thu, Jun 11, 2015, 8:36 PM Crystal Xing crystalxin...@gmail.com wrote:
That is a little scary.
So you mean that, in general, we shouldn't use Hadoop's Writables as keys in
an RDD?
Zheng zheng
On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote:
Guess: it has something to do with the Text object being reused by Hadoop?
You can't, in general, keep references to them around ...
Yep, you need to use a transformation of the raw value; use toString(), for
example.
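A minimal sketch of that fix applied to the snippet from the original
question (assuming Java 8; jvc and idList are the JavaSparkContext and input
path from that snippet, and the surrounding class is illustrative):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.NLineInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistinctIds {
    public static void main(String[] args) {
        JavaSparkContext jvc =
            new JavaSparkContext(new SparkConf().setAppName("distinct-ids"));
        String idList = args[0];  // path to the id list

        JavaRDD<String> ids = jvc
            .hadoopFile(idList, NLineInputFormat.class,
                LongWritable.class, Text.class)
            .values()
            .map(Text::toString)  // copy each value out of the reused Text
            .distinct();          // now compares independent Strings

        System.out.println("distinct ids: " + ids.count());
        jvc.stop();
    }
}

The key line is the map before distinct(): once each value is an immutable
String, nothing downstream is affected by Hadoop mutating the shared Text.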
On Thu, Jun 11, 2015, 8:54 PM Crystal Xing crystalxin...@gmail.com wrote:
That is a little scary.
So you mean that, in general, we shouldn't use Hadoop's Writables as keys in
an RDD?
Zheng zheng
To be fair, this is a long-standing issue due to optimizations for object reuse
in the Hadoop API, and isn't necessarily a failing in Spark - see this blog
post
(https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/)
from 2011 documenting the same pitfall ("all my reducer values are the same").
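For context, the pitfall in that post's title looks like this in a plain
Hadoop job. The sketch below is a hypothetical reducer using the old mapred
API (matching the NLineInputFormat used above): buffering the values iterator
keeps N references to one reused Text, so every buffered element ends up
equal to the last value seen. Copying before buffering is the usual fix:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer illustrating the object-reuse pitfall.
public class BufferingReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        List<Text> buffered = new ArrayList<>();
        while (values.hasNext()) {
            // buffered.add(values.next());        // buggy: N refs to one object
            buffered.add(new Text(values.next())); // fix: copy before keeping
        }
        for (Text v : buffered) {
            output.collect(key, v);
        }
    }
}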