You're not seeing the issue because you perform one additional "map":
map{case (k,v) => (k.get(), v.toString)}
Instead of being able to use the Text that was read, you had to create a tuple
out of the string of the text.
That is exactly why I asked this question. Why do we have to do this additional
processing? What is the rationale behind it?
Are there other ways of reading a Hadoop file (or any other file) that would not
incur this additional step? Thanks
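A rough sketch of two possible alternatives, assuming a placeholder path and the same IntWritable/Text record types as in the code below; neither comes from the thread itself:

import org.apache.hadoop.io.{IntWritable, Text}

// Sketch 1: let Spark's typed sequenceFile overload convert the Writables to
// Scala types; the conversion still happens, just not in user code.
val converted = sc.sequenceFile[Int, String]("/path/to/seq")  // placeholder path

// Sketch 2: skip the copy when every record is consumed immediately and no
// reference to the reused Writables is retained (no cache/collect/shuffle).
val raw = sc.sequenceFile("/path/to/seq", classOf[IntWritable], classOf[Text])
val totalValueBytes = raw.map { case (_, v) => v.getLength.toLong }.sum()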
Date: Thu, 19 Nov 2015 13:26:31 +0800
Subject: Re: FW: SequenceFile and object reuse
From: [email protected]
To: [email protected]
CC: [email protected]
Could this be an issue with the raw data? I use the following simple code and
don't hit the issue you mentioned. Otherwise it would be better if you could share your code.
val rdd = sc.sequenceFile("/Users/hadoop/Temp/Seq", classOf[IntWritable],
classOf[Text])
rdd.map { case (k, v) => (k.get(), v.toString) }.collect() foreach println
On Thu, Nov 19, 2015 at 12:04 PM, jeff saremi <[email protected]> wrote:
I sent this to the user forum. I got no responses. Could someone here please
help? thanks
jeff
From: [email protected]
To: [email protected]
Subject: SequenceFile and object reuse
Date: Fri, 13 Nov 2015 13:29:58 -0500
So we tried reading a SequenceFile in Spark and realized that all our records
ended up being identical.
Then one of us found this:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for
each record, directly caching the returned RDD or directly passing it to an
aggregation or shuffle operation will create many references to the same
object. If you plan to directly cache, sort, or aggregate Hadoop writable
objects, you should first copy them using a map function.
Is there anyone who can shed some light on this bizarre behavior and the
decisions behind it?
I would also like to know if anyone has been able to read a binary file without
incurring the additional map() suggested above. What format did you use?
thanks
Jeff
--
Best Regards
Jeff Zhang
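
For reference, a minimal sketch of the copy-before-cache pattern that the quoted documentation note describes; the path and key/value types are placeholders rather than taken from the thread:

import org.apache.hadoop.io.{IntWritable, Text}

// Hadoop's RecordReader reuses a single IntWritable and a single Text instance,
// so caching or collecting the raw pairs yields many references to the same two
// objects, and every record appears identical.
val raw = sc.sequenceFile("/path/to/seq", classOf[IntWritable], classOf[Text])

// Copy each record out of the reused Writables before caching, sorting,
// or aggregating, as the note recommends.
val records = raw.map { case (k, v) => (k.get(), v.toString) }
records.cache()
records.collect().foreach(println)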