Github user tdas commented on the pull request:
https://github.com/apache/spark/pull/126#issuecomment-37498313
@yaoshengzhe
This is only a safe, best-effort attempt to clean metadata, so no guarantee
is being provided here. All we are trying to do is ensure that for long-running
Spark computations (say, a Spark Streaming program that runs 24/7) there is
_something_ that cleans up in a safe way.
I am taking care to make sure the call to finalize() is cheap: just an
insert into a queue, which does not block (inserts into a LinkedBlockingQueue
without a capacity constraint do not block for all practical purposes).
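
For illustration, here is a minimal Scala sketch of that pattern. It is not
the actual code in this PR; the names MetadataCleaner, CleanupTask, CleanRDD,
and enqueue are all hypothetical:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical names for illustration only; not the classes in this PR.
sealed trait CleanupTask
case class CleanRDD(rddId: Int) extends CleanupTask

class MetadataCleaner {
  // Unbounded queue: offer() never blocks, so the enqueue done from an
  // RDD's finalize() stays cheap for the finalizer thread.
  private val queue = new LinkedBlockingQueue[CleanupTask]()

  def enqueue(task: CleanupTask): Unit = queue.offer(task)

  // A daemon thread drains the queue and does the actual (potentially
  // expensive) cleanup work outside of finalize().
  private val cleanerThread = new Thread("metadata-cleaner") {
    override def run(): Unit = {
      while (true) {
        queue.take() match {  // blocks until a task arrives
          case CleanRDD(id) =>
            println(s"cleaning metadata for RDD $id")  // placeholder work
        }
      }
    }
  }
  cleanerThread.setDaemon(true)
  cleanerThread.start()
}
```

An RDD would then override finalize() to do nothing but enqueue a task,
e.g. cleaner.enqueue(CleanRDD(id)), so the finalizer thread is never held up
by the cleanup work itself.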
Regarding phantom references: from what I understand, they do not provide
any stronger guarantee on when garbage collection happens than the current
method; a phantom reference is only enqueued after finalize() has run on the
object. The main source of uncertainty comes directly from the garbage
collection step itself, which no method can avoid. Moreover, using any sort of
weak or phantom reference queue requires _every_ RDD to be wrapped in a
WeakReference or PhantomReference. That seems to me to be unnecessary
complexity with little added benefit.
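
For contrast, here is a minimal sketch of the reference-queue alternative
under discussion, again with hypothetical names (CleanupRef, RefBasedCleaner,
register, drainOnce). Note how every RDD has to be registered up front:

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

// Hypothetical sketch of the alternative: every RDD must be wrapped in a
// reference object at creation time so the JVM can enqueue the reference
// once the RDD becomes unreachable.
class CleanupRef(rdd: AnyRef, val rddId: Int, q: ReferenceQueue[AnyRef])
  extends WeakReference[AnyRef](rdd, q)

object RefBasedCleaner {
  private val refQueue = new ReferenceQueue[AnyRef]()

  // Must be called for _every_ RDD; this is the extra bookkeeping the
  // comment objects to. Note: the returned CleanupRef must itself be kept
  // strongly reachable (e.g. in a set), or it may be collected before it
  // is ever enqueued.
  def register(rdd: AnyRef, rddId: Int): CleanupRef =
    new CleanupRef(rdd, rddId, refQueue)

  // References show up here only after GC has collected the RDD, so this
  // offers no tighter timing guarantee than finalize() does.
  def drainOnce(): Unit = refQueue.poll() match {
    case ref: CleanupRef => println(s"cleaning metadata for RDD ${ref.rddId}")
    case _               => // queue empty (poll() returned null)
  }
}
```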