Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Mohit Jaggi
Andrew,
I did not register anything explicitly, based on the belief that the class
name is written out in full only once. I also wondered why the problem
would be specific to Joda-Time and not show up with java.util.Date... I
guess it is possible given Joda-Time's internals.
If I remove DateTime from my RDD, the problem goes away.
I will try explicit registration (and add DateTime back to my RDD) and see
if that makes things better.
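
Roughly what I plan to try, going by the tuning guide Andrew linked (just a
sketch; the registrator class name is my own):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
import org.joda.time.DateTime

// Register DateTime so Kryo writes a small integer id per instance
// instead of the fully qualified class name.
class DateTimeRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[DateTime])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Use the fully qualified name if the registrator lives in a package.
  .set("spark.kryo.registrator", "DateTimeRegistrator")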

Mohit.




On Wed, May 21, 2014 at 8:36 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Mohit,

 The log line about the ExternalAppendOnlyMap is more a symptom of
 slowness than a cause of it.  The ExternalAppendOnlyMap is used when a
 shuffle would otherwise hold too much data in memory.  Rather than
 OOMing, Spark writes the data out to disk in sorted order and reads it
 back later when it's needed.  That's the job of the
 ExternalAppendOnlyMap.

 I wouldn't normally expect a conversion from Date to a Joda DateTime to
 take significantly more memory.  But since you're using Kryo, and classes
 should be registered with it, you may have forgotten to register
 DateTime.  If you don't register a class, Kryo writes the full class name
 at the beginning of every serialized instance; for DateTime objects,
 whose payload is roughly one long, that's a ton of extra space and very
 inefficient.

 Can you confirm that DateTime is registered with Kryo?

 http://spark.apache.org/docs/latest/tuning.html#data-serialization


 On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 Hi,

 I changed my application to use Joda-Time instead of java.util.Date, and
 I started getting this:

 WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1
 time so far)

 What does this mean, and how can I fix it? Because of this, a small job
 takes forever.

 Mohit.


 P.S.: I am using Kryo serialization and have played around with several
 values of sparkRddMemFraction.





Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Andrew Ash
Spark uses the Twitter Chill library, which registers a bunch of core Scala
and Java classes by default.  I'm assuming java.util.Date is registered
automatically that way but Joda's DateTime is not.  We could always look
through the Chill source to confirm.

As far as the class name goes, my understanding is that Kryo writes it at
the start of every serialized object, not just once.  I did some tests at
one point to confirm that, but my memory is a little fuzzy, so I won't say
definitively that the class name is repeated.  Can you look at the
Kryo-serialized bytes at some point to see what actually happens?
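
Here's the kind of standalone check I mean, against the Kryo API directly
(a sketch: exact byte counts depend on the Kryo version, and nested classes
like DateTime's chronology stay unregistered here, so the gap may be
smaller than with full registration):

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
import org.joda.time.DateTime

object KryoSizeCheck {
  // writeClassAndObject puts the class info into the stream, which is
  // where the unregistered case pays for the full class name.
  def sizeOf(kryo: Kryo, obj: AnyRef): Int = {
    val out = new Output(1 << 16)
    kryo.writeClassAndObject(out, obj)
    out.toBytes.length
  }

  def main(args: Array[String]) {
    val dt = new DateTime
    val unregistered = new Kryo  // full class name written per object
    val registered = new Kryo
    registered.register(classOf[DateTime])  // small integer id instead

    println("unregistered: " + sizeOf(unregistered, dt) + " bytes")
    println("registered:   " + sizeOf(registered, dt) + " bytes")
  }
}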


On Thu, May 22, 2014 at 5:02 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 Andrew,
 I did not register anything explicitly, based on the belief that the class
 name is written out in full only once. I also wondered why the problem
 would be specific to Joda-Time and not show up with java.util.Date... I
 guess it is possible given Joda-Time's internals.
 If I remove DateTime from my RDD, the problem goes away.
 I will try explicit registration (and add DateTime back to my RDD) and see
 if that makes things better.

 Mohit.




 On Wed, May 21, 2014 at 8:36 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Mohit,

 The log line about the ExternalAppendOnlyMap is more a symptom of
 slowness than a cause of it.  The ExternalAppendOnlyMap is used when a
 shuffle would otherwise hold too much data in memory.  Rather than
 OOMing, Spark writes the data out to disk in sorted order and reads it
 back later when it's needed.  That's the job of the
 ExternalAppendOnlyMap.

 I wouldn't normally expect a conversion from Date to a Joda DateTime to
 take significantly more memory.  But since you're using Kryo, and classes
 should be registered with it, you may have forgotten to register
 DateTime.  If you don't register a class, Kryo writes the full class name
 at the beginning of every serialized instance; for DateTime objects,
 whose payload is roughly one long, that's a ton of extra space and very
 inefficient.

 Can you confirm that DateTime is registered with Kryo?

 http://spark.apache.org/docs/latest/tuning.html#data-serialization


 On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 Hi,

 I changed my application to use Joda-Time instead of java.util.Date, and
 I started getting this:

 WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1
 time so far)

 What does this mean, and how can I fix it? Because of this, a small job
 takes forever.

 Mohit.


 P.S.: I am using Kryo serialization and have played around with several
 values of sparkRddMemFraction.






Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-21 Thread Andrew Ash
Hi Mohit,

The log line about the ExternalAppendOnlyMap is more a symptom of slowness
than a cause of it.  The ExternalAppendOnlyMap is used when a shuffle
would otherwise hold too much data in memory.  Rather than OOMing, Spark
writes the data out to disk in sorted order and reads it back later when
it's needed.  That's the job of the ExternalAppendOnlyMap.
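
If you want to give the in-memory map more room before it spills, the knob
in Spark 1.x is spark.shuffle.memoryFraction (I suspect that's what the
sparkRddMemFraction in your P.S. was reaching for; it isn't a Spark
property name).  A sketch:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Heap fraction the shuffle's in-memory maps may use before spilling
  // (default 0.2 in Spark 1.0).
  .set("spark.shuffle.memoryFraction", "0.3")
  // Heap fraction reserved for cached RDDs (default 0.6).
  .set("spark.storage.memoryFraction", "0.5")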

I wouldn't normally expect a conversion from Date to a Joda DateTime to
take significantly more memory.  But since you're using Kryo, and classes
should be registered with it, you may have forgotten to register DateTime.
If you don't register a class, Kryo writes the full class name at the
beginning of every serialized instance; for DateTime objects, whose
payload is roughly one long, that's a ton of extra space and very
inefficient.

Can you confirm that DateTime is registered with Kryo?

http://spark.apache.org/docs/latest/tuning.html#data-serialization


On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 Hi,

 I changed my application to use Joda-Time instead of java.util.Date, and
 I started getting this:

 WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1
 time so far)

 What does this mean, and how can I fix it? Because of this, a small job
 takes forever.

 Mohit.


 P.S.: I am using Kryo serialization and have played around with several
 values of sparkRddMemFraction.