I sent this to the user forum but got no responses. Could someone here please
help? thanks
Jeff

From: jeffsar...@hotmail.com
To: u...@spark.apache.org
Subject: SequenceFile and object reuse
Date: Fri, 13 Nov 2015 13:29:58 -0500


So we tried reading a SequenceFile in Spark and realized that all our records
ended up being the same.
Then one of us found this:

Note: Because Hadoop's RecordReader class re-uses the same Writable object for 
each record, directly caching the returned RDD or directly passing it to an 
aggregation or shuffle operation will create many references to the same 
object. If you plan to directly cache, sort, or aggregate Hadoop writable 
objects, you should first copy them using a map function.
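
If I understand the note correctly, the workaround would look something like
this; the path and the LongWritable/Text key/value types below are just
placeholders for illustration, not our actual job:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileCopy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SequenceFileCopy"))
    val path = "hdfs:///data/example.seq"  // hypothetical input path

    // Problematic: the RecordReader reuses one Writable pair per partition,
    // so caching this RDD stores many references to the same mutated objects.
    val reused = sc.sequenceFile(path, classOf[LongWritable], classOf[Text])

    // The workaround from the note: copy each record into fresh immutable
    // values (here a plain Long and String) before caching or shuffling.
    val safe = reused
      .map { case (k, v) => (k.get, v.toString) }
      .cache()

    safe.take(5).foreach(println)
    sc.stop()
  }
}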

Can anyone shed some light on this bizarre behavior and the decisions behind
it?
I would also like to know whether anyone has been able to read a binary file
without incurring the additional map() suggested above. What format did you
use?

thanks
Jeff
