Andrew, I did not register anything explicitly, based on the belief that the class name is written out in full only once. I also wondered why that problem would be specific to Joda-Time and not show up with java.util.Date... I guess it is possible given the internals of Joda-Time. If I remove DateTime from my RDD, the problem goes away. I will try explicit registration (and add DateTime back to my RDD) and see if that makes things better.
Mohit

On Wed, May 21, 2014 at 8:36 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Mohit,
>
> The log line about the ExternalAppendOnlyMap is more a symptom of
> slowness than a cause of slowness itself. The ExternalAppendOnlyMap is used
> when a shuffle would cause too much data to be held in memory. Rather than
> OOM'ing, Spark writes the data out to disk in sorted order and reads it
> back from disk later when it's needed. That's the job of the
> ExternalAppendOnlyMap.
>
> I wouldn't normally expect a conversion from Date to a Joda DateTime to
> take significantly more memory. But since you're using Kryo, and classes
> should be registered with it, you may have forgotten to register DateTime
> with Kryo. If you don't register a class, Kryo writes the class name at the
> beginning of every serialized instance; for DateTime objects of size
> roughly one long, that's a ton of extra space and very inefficient.
>
> Can you confirm that DateTime is registered with Kryo?
>
> http://spark.apache.org/docs/latest/tuning.html#data-serialization
>
>
> On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>
>> Hi,
>>
>> I changed my application to use Joda time instead of java.util.Date and I
>> started getting this:
>>
>> WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1
>> time so far)
>>
>> What does this mean? How can I fix this? Due to this a small job takes
>> forever.
>>
>> Mohit
>>
>>
>> P.S.: I am using Kryo serialization, and have played around with several
>> values of sparkRddMemFraction
>>
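For the archives, the explicit registration Andrew describes can be sketched roughly as below, following the Spark tuning guide linked above. This is a minimal sketch, not a tested fix; `MyRegistrator` and `mypackage` are hypothetical names, and depending on the Kryo version, Joda-Time types may additionally need a custom serializer beyond plain registration:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
import org.joda.time.DateTime

// Registering DateTime means Kryo writes a small integer ID instead of
// the full class name at the start of every serialized instance.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[DateTime])
  }
}

// Point the Spark configuration at the registrator (hypothetical package name).
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "mypackage.MyRegistrator")
```

If spills still occur after registration, the shuffle memory settings in the same tuning guide are the next thing to look at.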