Which storage scheme are you using? I am guessing it is MEMORY_ONLY. For
large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER usually works better.

You can call unpersist() on an RDD to remove it from the cache, though.
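
A minimal sketch of both suggestions (assuming an existing SparkContext named
sc; the input path and transformations are placeholders):

  import org.apache.spark.storage.StorageLevel

  // Cache the intermediate result in memory, spilling partitions that do
  // not fit to disk instead of recomputing them.
  val intermediate = sc.textFile("hdfs:///data/input.txt")  // placeholder path
    .map(_.split("\t"))
    .persist(StorageLevel.MEMORY_AND_DISK)

  // ... run the operations that reuse `intermediate` ...

  // Explicitly free the cached blocks once they are no longer needed.
  intermediate.unpersist()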


On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna <ansaiprasa...@gmail.com> wrote:

> No, I am running on 0.8.1.
> Yes, I am caching a lot. I am benchmarking a simple piece of code in Spark
> in which 512 MB, 1 GB and 2 GB text files are taken, some basic intermediate
> operations are done, and the intermediate results that will be used in
> subsequent operations are cached.
>
> I thought we would not need to unpersist manually: if I cache something and
> the cache is full, space will be created automatically by evicting the
> earlier entries. Do I need to unpersist?
>
> Moreover, if I run several times, will the previously cached RDDs still
> remain in the cache? If so, can I flush them out manually before the next
> run? [something like a complete cache flush]
>
>
> On Thu, Mar 27, 2014 at 11:16 PM, Andrew Or <and...@databricks.com> wrote:
>
>> Are you caching a lot of RDDs? If so, maybe you should unpersist() the
>> ones that you're not using. Also, if you're on 0.9, make sure
>> spark.shuffle.spill is enabled (which it is by default). This allows your
>> application to spill in-memory content to disk if necessary.
>>
>> How much memory are you giving to your executors? The default for
>> spark.executor.memory is 512m, which is quite low. Consider raising it.
>> Checking the web UI is a good way to figure out your runtime memory usage.
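>>
>> On 0.9 these can be set through SparkConf before the SparkContext is
>> created, roughly like this (a sketch; the 4g value, app name and master
>> URL are just placeholders):
>>
>>   import org.apache.spark.{SparkConf, SparkContext}
>>
>>   val conf = new SparkConf()
>>     .setMaster("spark://master:7077")    // placeholder master URL
>>     .setAppName("benchmark")             // placeholder app name
>>     .set("spark.executor.memory", "4g")  // raise from the 512m default
>>     .set("spark.shuffle.spill", "true")  // already true by default
>>   val sc = new SparkContext(conf)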
>>
>>
>> On Thu, Mar 27, 2014 at 9:22 AM, Ognen Duzlevski
>> <og...@plainvanillagames.com> wrote:
>>
>>> Look at the tuning guide on Spark's webpage for strategies to cope with
>>> this.
>>> I have run into quite a few memory issues like these; some are resolved
>>> by changing the StorageLevel strategy and employing things like Kryo, and
>>> some are solved by specifying the number of tasks to break a given
>>> operation into, etc.
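>>>
>>> For example, on 0.8.x Kryo and a higher task count can be wired up
>>> roughly like this (a sketch; the master URL, input path and the 200
>>> reduce tasks are illustrative values only):
>>>
>>>   import org.apache.spark.SparkContext
>>>   import org.apache.spark.SparkContext._  // pair-RDD operations like reduceByKey
>>>
>>>   // 0.8-style configuration: set properties before creating the context.
>>>   System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>>   val sc = new SparkContext("spark://master:7077", "kryo-example")
>>>
>>>   // Break the shuffle into more, smaller tasks so each task holds
>>>   // less data in memory at a time (200 reduce tasks here).
>>>   val counts = sc.textFile("hdfs:///data/input.txt")
>>>     .flatMap(_.split(" "))
>>>     .map(w => (w, 1))
>>>     .reduceByKey(_ + _, 200)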
>>>
>>> Ognen
>>>
>>>
>>> On 3/27/14, 10:21 AM, Sai Prasanna wrote:
>>>
>>> "java.lang.OutOfMemoryError: GC overhead limit exceeded"
>>>
>>> What is the problem? With the same code, one run finishes in 8 seconds,
>>> and the next run takes a really long time, say 300-500 seconds...
>>> In the logs I see a lot of "GC overhead limit exceeded" errors. What
>>> should be done?
>>>
>>> Can someone please throw some light on it?
>>>
>>>
>>>
>>> --
>>> Sai Prasanna. AN
>>> II M.Tech (CS), SSSIHL
>>>
>>>
>>> Entire water in the ocean can never sink a ship, unless it gets
>>> inside. All the pressures of life can never hurt you, unless you let them
>>> in.
>>>
>>>
>>>
>>
>
>
> --
> Sai Prasanna. AN
> II M.Tech (CS), SSSIHL
>
>
> Entire water in the ocean can never sink a ship, unless it gets inside.
> All the pressures of life can never hurt you, unless you let them in.
>
