Spark uses the Twitter Chill library, which registers a number of core Scala
and Java classes with Kryo by default.  I'm assuming that java.util.Date is
automatically registered that way but Joda's DateTime is not.  We could
always take a look through the Chill source to confirm.
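
If it turns out DateTime isn't covered, registering it explicitly is cheap.
Here's a rough sketch along the lines of the tuning guide (the registrator
class name is made up, and Joda types sometimes also want a dedicated
serializer, so treat this as an illustration rather than a drop-in fix):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator
    import org.joda.time.DateTime

    // Register DateTime so Kryo writes a small numeric id instead of the
    // fully-qualified class name for each instance it serializes.
    class JodaRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[DateTime])
      }
    }

    // On the driver, before the SparkContext is created; point the config
    // at the registrator's fully-qualified class name.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "JodaRegistrator")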

As for the class name, my understanding was that Kryo writes the class name
at the start of every serialized object, not just once per stream.  I did
some tests at one point to confirm that, but my memory is a little fuzzy,
so I won't say definitively that the class name is repeated.  Can you look
at the Kryo-serialized version of the classes at some point to see what
actually happens?
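
For what it's worth, a quick standalone check could look something like the
sketch below.  This is plain Kryo outside of Spark, so it won't match what
Spark's Chill-configured instance produces byte for byte, but the gap
between the registered and unregistered runs should show whether the class
name really is paid for on every record (the sizes and record count here
are purely illustrative):

    import java.io.ByteArrayOutputStream
    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.io.Output
    import org.joda.time.DateTime

    // Serialize n records the way a serialization stream does: one
    // writeClassAndObject call per record.
    def serializedSize(kryo: Kryo, n: Int): Int = {
      val baos = new ByteArrayOutputStream()
      val out = new Output(baos)
      (1 to n).foreach(_ => kryo.writeClassAndObject(out, new DateTime()))
      out.close()
      baos.size()
    }

    val unregistered = new Kryo()
    val registered = new Kryo()
    registered.register(classOf[DateTime])

    // If the name is repeated, the per-record difference should be roughly
    // the length of "org.joda.time.DateTime".
    println("unregistered: " + serializedSize(unregistered, 1000) + " bytes")
    println("registered:   " + serializedSize(registered, 1000) + " bytes")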


On Thu, May 22, 2014 at 5:02 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:

> Andrew,
> I did not register anything explicitly, based on the belief that the class
> name is written out in full only once. I also wondered why that problem
> would be specific to Joda-Time and not show up with java.util.Date... I
> guess it is possible given the internals of Joda-Time.
> If I remove DateTime from my RDD, the problem goes away.
> I will try explicit registration (and add DateTime back to my RDD) and see
> if that makes things better.
>
> Mohit.
>
>
>
>
> On Wed, May 21, 2014 at 8:36 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Hi Mohit,
>>
>> The log line about the ExternalAppendOnlyMap is more a symptom of
>> slowness than a cause of it.  The ExternalAppendOnlyMap is used when a
>> shuffle would otherwise hold too much data in memory.  Rather than
>> OOMing, Spark writes the data out to disk in sorted order and reads it
>> back later when it's needed.  That's the job of the
>> ExternalAppendOnlyMap.
>>
>> I wouldn't normally expect a conversion from Date to a Joda DateTime to
>> take significantly more memory.  But since you're using Kryo, and classes
>> should be registered with it, you may have forgotten to register DateTime
>> with Kryo.  If you don't register a class, Kryo writes the class name at
>> the beginning of every serialized instance; for DateTime objects, whose
>> payload is roughly one long, that's a ton of extra space and very
>> inefficient.
>>
>> Can you confirm that DateTime is registered with Kryo?
>>
>> http://spark.apache.org/docs/latest/tuning.html#data-serialization
>>
>>
>> On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I changed my application to use Joda-Time instead of java.util.Date, and
>>> I started getting this:
>>>
>>> WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1
>>> time so far)
>>>
>>> What does this mean? How can I fix this? Because of this, a small job
>>> takes forever.
>>>
>>> Mohit.
>>>
>>>
>>> P.S.: I am using Kryo serialization and have played around with several
>>> values of sparkRddMemFraction.
>>>
>>
>>
>
