Zhan's reply on Stack Overflow is correct.
Please refer to the comments on sequenceFile:

/**
 * Get an RDD for a Hadoop SequenceFile with given key and value types.
 *
 * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
 * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
 * operation will create many references to the same object.
 * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
 * copy them using a map function.
 */

On Wed, Mar 23, 2016 at 11:58 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> I think I got the root cause: you can use Text.toString() to solve this
> issue. Because the Text object is shared, the last record is displayed
> multiple times.
>
> On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Looks like a Spark bug. I can reproduce it for a sequence file, but it
>> works for a text file.
>>
>> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com>
>> wrote:
>>
>>> Hi Spark experts,
>>>
>>> I am facing issues with cached RDDs. I noticed that a few entries
>>> get duplicated n times when the RDD is cached.
>>>
>>> I asked a question on Stack Overflow with my code snippet to reproduce it.
>>>
>>> I would really appreciate it if you could visit
>>> http://stackoverflow.com/q/36168827/1506477
>>> and answer my question / give your comments.
>>>
>>> Or at the least confirm that it is a bug.
>>>
>>> Thanks in advance for your help!
>>>
>>> --
>>> Thamme

--
Best Regards

Jeff Zhang
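The reuse pitfall described in that comment can be sketched without Spark or Hadoop at all. Below is a minimal Python illustration of the same mechanism: a "reader" that reuses one mutable object per record, like Hadoop's RecordReader reuses a single Text instance. The `Record` class and `read_all` function here are hypothetical stand-ins, not Spark or Hadoop APIs.

```python
class Record:
    """Hypothetical stand-in for a Hadoop Writable such as Text."""
    def __init__(self):
        self.value = ""

def read_all(data):
    # One instance, mutated and re-yielded for every record,
    # mimicking RecordReader's object reuse.
    shared = Record()
    for s in data:
        shared.value = s
        yield shared

data = ["a", "b", "c"]

# WRONG: materializing the reused objects stores three references to the
# same instance, so every "cached" entry shows the last record's value.
cached_refs = list(read_all(data))
print([r.value for r in cached_refs])   # ['c', 'c', 'c']

# RIGHT: copy the contents out of each record as it is read -- the
# analogue of calling Text.toString() in a map() before cache().
cached_copies = [r.value for r in read_all(data)]
print(cached_copies)                    # ['a', 'b', 'c']
```

This is why the entries appear duplicated only after caching: without the cache, each record is consumed before the shared object is overwritten.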