[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

tdas Wed, 12 Mar 2014 17:25:42 -0700

Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/126#issuecomment-37485786
  
    @yaoshengzhe I agree that using a finalizer is not ideal. However, the 
problem we are dealing with here is that there is no clean and safe way to 
detect whether an RDD or a shuffle has gone out of scope other than using the 
garbage collection mechanisms. Mechanisms like RDD.unpersist() already exist 
to clean up persisted RDDs, but they work only as long as the developer 
diligently keeps track of all RDDs and makes sure to unpersist them while 
keeping track of dependencies. That's a pain, just like malloc and free. 
Similarly, for the shuffle data (map outputs), it's hard to figure out when 
all the RDDs that depend on the shuffle data have gone out of scope, so it's 
hard to figure out when it is safe to clean up the shuffle data. Furthermore, 
if you consider RDD checkpointing, which transparently modifies the RDD DAG 
structure behind the scenes, it gets even harder to keep track of RDDs and 
clean them up. So the only safe way is to use the Java garbage collection 
mechanism. 
    
    One can argue that this functionality could be implemented without 
finalize() by using weak references and reference queues (reference queues 
keep track of which objects got garbage collected). However, that requires 
all RDDs, etc. to be wrapped in WeakReference objects, which is a much more 
complicated and error-prone solution. Hence, I have used finalize() for now. 
As @andrewor14 has already pointed out, I have taken care to make the 
finalize() function as cheap as possible (just an insert into a queue). 
And regarding what the article says about object initialization taking longer 
if a finalize() function is defined, I think it is an acceptable overhead (a 
few ms), as RDDs are not created at the rate of 1000s per second.
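
    For comparison, here is a minimal sketch of what the weak reference and 
reference queue alternative could look like. It is not code from this pull 
request: ScopeTracker, TrackedRef, register(), drainCollected() and resourceId 
are hypothetical names used only for illustration, and the sketch assumes each 
tracked object can be identified by a numeric id.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Spark: tracks objects through weak
// references and reports the ids of those that have been enqueued.
class ScopeTracker {

    // A weak reference that remembers which resource it stands for.
    static final class TrackedRef extends WeakReference<Object> {
        final long resourceId;

        TrackedRef(Object referent, long resourceId, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.resourceId = resourceId;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();

    // The reference objects themselves must stay strongly reachable,
    // otherwise they could be collected before they are ever enqueued.
    private final Set<TrackedRef> pending =
        Collections.synchronizedSet(new HashSet<TrackedRef>());

    // Start tracking `owner`; only a weak reference to it is kept.
    TrackedRef register(Object owner, long resourceId) {
        TrackedRef ref = new TrackedRef(owner, resourceId, queue);
        pending.add(ref);
        return ref;
    }

    // Non-blocking: returns the ids of all tracked objects enqueued so far.
    List<Long> drainCollected() {
        List<Long> ids = new ArrayList<Long>();
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            TrackedRef tracked = (TrackedRef) ref;
            pending.remove(tracked);
            ids.add(tracked.resourceId);
        }
        return ids;
    }

    // Number of registered objects not yet reported by drainCollected().
    int pendingCount() {
        return pending.size();
    }
}
```

    In such a design a cleanup thread would call drainCollected() 
periodically and release whatever belongs to the returned ids. The garbage 
collector enqueues a TrackedRef only at some unspecified time after its 
referent becomes unreachable, so cleanup is eventual rather than immediate, 
just as with finalize().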




