Does this happen repeatedly if you keep running the computation, or just the 
first time? The first time you run queries, the JVM may need to promote these 
cached Java objects to the old generation, which can trigger a GC pause that 
also slows down the small queries.
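
One way to check that hypothesis (a standard HotSpot flag, not specific to 
Spark) is to log object ages at each minor GC:

  -XX:+PrintTenuringDistribution   # prints survivor-space age histograms,
                                   # so you can watch objects being promoted
                                   # to the old generation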

If you can run with -XX:+PrintGCDetails in your Java options, it would also be 
good to see what percentage of each GC generation is in use.
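
For example (a rough sketch, assuming a standalone deployment that reads 
SPARK_JAVA_OPTS from conf/spark-env.sh -- adjust for however you pass JVM 
options to your workers):

  # conf/spark-env.sh
  export SPARK_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

Each collection then logs the occupancy of the young and old generations 
before and after the GC, which tells you how full each generation is getting.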

The concurrent mark-and-sweep collector (-XX:+UseConcMarkSweepGC) or the G1 
collector in Java 7 (-XX:+UseG1GC) might also avoid these pauses by collecting 
concurrently with your application threads.
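
To try one of them (same sketch as above; pick a single collector, since they 
cannot be combined):

  # conf/spark-env.sh -- concurrent mark-and-sweep:
  export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails"
  # or, on Java 7u4 or later, G1:
  # export SPARK_JAVA_OPTS="-XX:+UseG1GC -XX:+PrintGCDetails"

Both still have short stop-the-world phases, but they do most of their marking 
while the application runs, so the long full-GC pauses should shrink.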

Matei

On Mar 10, 2014, at 3:18 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Hello all,
> I am observing a strange result. I have a computation that I run on a cached 
> RDD in Spark standalone. It typically takes about 4 seconds. 
> 
> But when other RDDs that are not relevant to the computation at hand are 
> cached in memory (in the same Spark context), the same computation takes 40 
> seconds or more.
> 
> The problem seems to be GC time, which goes from milliseconds to tens of 
> seconds.
> 
> Note that my issue is not that memory is full: I have cached about 14 GB in 
> RDDs, with 66 GB available across the workers for the application, and my 
> computation did not evict any cached RDD from memory.
> 
> Any ideas?
