Zhan's reply on Stack Overflow is correct.
Please refer to the comments on sequenceFile:

/**
 * Get an RDD for a Hadoop SequenceFile with given key and value types.
 *
 * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
 * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
 * operation will create many references to the same object.
 * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
 * copy them using a map function.
 */

On Wed, Mar 23, 2016 at 11:58 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> I think I got the root cause: you can use Text.toString() to solve this
> issue. Because the Text object is shared, the last record is displayed
> multiple times.
>
> On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Looks like a Spark bug. I can reproduce it for a sequence file, but it
>> works for a text file.
>>
>> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com>
>> wrote:
>>
>>> Hi Spark experts,
>>>
>>> I am facing issues with cached RDDs. I noticed that a few entries
>>> get duplicated n times when the RDD is cached.
>>>
>>> I asked a question on Stack Overflow with my code snippet to reproduce it.
>>>
>>> I would really appreciate it if you could visit
>>> http://stackoverflow.com/q/36168827/1506477
>>> and answer my question / give your comments.
>>>
>>> Or at the least confirm that it is a bug.
>>>
>>> Thanks in advance for your help!
>>>
>>> --
>>> Thamme

--
Best Regards

Jeff Zhang
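The reuse pitfall described in that comment can be sketched without Spark or Hadoop at all. Below is a minimal Python illustration of the same mechanism: a "reader" that reuses one mutable object per record, like Hadoop's RecordReader reuses a single Text instance. The `Record` class and `read_all` function here are hypothetical stand-ins, not Spark or Hadoop APIs.

```python
class Record:
    """Hypothetical stand-in for a Hadoop Writable such as Text."""
    def __init__(self):
        self.value = ""

def read_all(data):
    # One instance, mutated and re-yielded for every record,
    # mimicking RecordReader's object reuse.
    shared = Record()
    for s in data:
        shared.value = s
        yield shared

data = ["a", "b", "c"]

# WRONG: materializing the reused objects stores three references to the
# same instance, so every "cached" entry shows the last record's value.
cached_refs = list(read_all(data))
print([r.value for r in cached_refs])   # ['c', 'c', 'c']

# RIGHT: copy the contents out of each record as it is read -- the
# analogue of calling Text.toString() in a map() before cache().
cached_copies = [r.value for r in read_all(data)]
print(cached_copies)                    # ['a', 'b', 'c']
```

This is why the entries appear duplicated only after caching: without the cache, each record is consumed before the shared object is overwritten.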