GitHub user tdas commented on the pull request:
https://github.com/apache/spark/pull/126#issuecomment-37485786
@yaoshengzhe I agree that using a finalizer is not the most ideal thing in the
world. However, the problem we are dealing with here is that there is no
clean and safe way to detect whether an RDD or a shuffle has gone out of scope
other than using the garbage collection mechanisms. There already exist
mechanisms like RDD.unpersist() to clean up persisted RDDs. That works as long
as the developer diligently keeps track of all RDDs and makes sure to unpersist
them while keeping track of dependencies. That's a pain, just like malloc and
free. Similarly, for the shuffle data (map outputs), it's hard to figure out
when all the RDDs that depend on the shuffle data have gone out of scope, so
it's hard to figure out when it is safe to clean up the shuffle data.
Furthermore, if you consider RDD checkpointing, which transparently modifies
the RDD DAG structure behind the scenes, it gets even harder to keep track of
RDDs and clean them up. So the only safe way is to use the Java garbage
collection mechanism.
One can argue that this functionality could be implemented without
finalize() by using weak references and reference queues (a reference
queue keeps track of which objects have been garbage collected). However, that
requires all RDDs, etc. to be wrapped in WeakReference objects. That's a much
more complicated and error-prone solution. Hence, I have used finalize() for
now.
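For reference, here is a minimal sketch of what the weak reference plus
reference queue alternative could look like. It is not code from this pull
request: the names WeakRefCleanerSketch, CleanupRef, register and drain are
hypothetical, and a plain Object with an integer id stands in for an RDD or
shuffle.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: tracks objects through weak references instead of finalize().
class WeakRefCleanerSketch {

    /** Weak reference that remembers the id of the resource to clean up. */
    static final class CleanupRef extends WeakReference<Object> {
        final int resourceId;

        CleanupRef(Object referent, int resourceId, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.resourceId = resourceId;
        }
    }

    // The JVM appends a CleanupRef here after its referent has been collected.
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // Strong references to the CleanupRef objects themselves; without these the
    // reference objects could be collected before they are ever enqueued.
    private final Set<CleanupRef> pending = ConcurrentHashMap.newKeySet();

    /** Starts tracking owner; resourceId identifies what to clean up later. */
    public CleanupRef register(Object owner, int resourceId) {
        CleanupRef ref = new CleanupRef(owner, resourceId, queue);
        pending.add(ref);
        return ref;
    }

    /** Returns the ids whose owners have been collected since the last call. */
    public List<Integer> drain() {
        List<Integer> ids = new ArrayList<>();
        Reference<?> r;
        while ((r = queue.poll()) != null) {
            CleanupRef ref = (CleanupRef) r;
            pending.remove(ref);
            ids.add(ref.resourceId);
        }
        return ids;
    }

    public static void main(String[] args) throws InterruptedException {
        WeakRefCleanerSketch cleaner = new WeakRefCleanerSketch();
        Object owner = new Object();
        cleaner.register(owner, 1);
        owner = null;        // drop the only strong reference
        System.gc();         // only a hint; collection timing is not guaranteed
        Thread.sleep(100);
        System.out.println("Ready for cleanup: " + cleaner.drain());
    }
}
```

In this sketch the tracked object is not changed at all, but every creation
site has to call register() and something has to call drain() periodically,
which is the extra bookkeeping referred to above.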
As @andrewor14 has already pointed out, I have taken care to make sure that
the finalize() function is as cheap as possible (just an insert into a queue).
And regarding what the article says about object creation taking longer if
a finalize() function is defined, I think it is an acceptable overhead (a few
ms), as RDDs are not created at the rate of 1000s per second.
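To illustrate what "just an insert into a queue" means, here is a minimal
sketch, not the actual code in this pull request. TrackedResource,
CLEANUP_QUEUE and enqueueForCleanup are hypothetical names, and an integer id
stands in for an RDD or shuffle id. Note that finalize() has been deprecated
since Java 9, hence the suppressed warnings.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: the finalizer only records an id; the real cleanup is
// expected to happen on some other thread that drains the queue.
class FinalizerQueueSketch {

    // Non-blocking queue of ids whose owning objects have been finalized.
    static final ConcurrentLinkedQueue<Integer> CLEANUP_QUEUE =
            new ConcurrentLinkedQueue<>();

    static class TrackedResource {
        private final int id;

        TrackedResource(int id) {
            this.id = id;
        }

        // The only work done at finalization time: a single non-blocking offer.
        void enqueueForCleanup() {
            CLEANUP_QUEUE.offer(id);
        }

        @Override
        @SuppressWarnings({"deprecation", "removal"})
        protected void finalize() throws Throwable {
            try {
                enqueueForCleanup();
            } finally {
                super.finalize();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new TrackedResource(1);  // immediately unreachable
        System.gc();             // only a hint; finalization timing is not guaranteed
        Thread.sleep(100);
        System.out.println("Queued for cleanup: " + CLEANUP_QUEUE);
    }
}
```

Keeping the finalizer body this small means the JVM's finalizer thread never
blocks on I/O or locks; anything expensive is deferred to whoever polls the
queue.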