GitHub user willb opened a pull request:
https://github.com/apache/spark/pull/143
SPARK-897: preemptively serialize closures
These commits cause `ClosureCleaner.clean` to attempt to serialize the
cleaned closure with the default closure serializer and throw a
`SparkException` if doing so fails. This behavior is enabled by default but
can be disabled at individual callsites of `SparkContext.clean`.
Commit 98e01ae8 fixes some no-op assertions in `GraphSuite` that this work
exposed; I'm happy to put that in a separate PR if that would be more
appropriate.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/willb/spark spark-897
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/143.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #143
----
commit bcab2f0414a956ffa89c5dd0fee16de1b33320a2
Author: William Benton <[email protected]>
Date: 2014-03-13T02:56:32Z
Test case for SPARK-897.
Tests to make sure that passing an unserializable closure
to a transformation fails fast.
commit f2ef54e4ec92d8f0ee3e91af4f507bcabd29a7c0
Author: William Benton <[email protected]>
Date: 2014-03-13T19:21:45Z
Generalized proactive closure serialization test.
commit 6cb921874c02f3f03dd66db697c6995dc9565a0f
Author: William Benton <[email protected]>
Date: 2014-03-13T19:40:42Z
Adds proactive closure-serializability checking.
ClosureCleaner.clean now checks to ensure that its closure argument
is serializable by default and throws a SparkException with the
underlying NotSerializableException in the detail message otherwise.
As a result, transformation invocations with unserializable closures
will fail at their call sites rather than when they actually execute.
ClosureCleaner.clean now takes a second boolean argument; pass false
to disable serializability-checking behavior at call sites where this
behavior isn't desired.
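The check described above can be sketched without Spark. This is a minimal illustration, not the code in this PR: `SketchCleaner`, `ClosureCheckException` (a stand-in for `SparkException`), and the parameter name `checkSerializable` are hypothetical, and plain Java serialization stands in for the closure serializer. The real `ClosureCleaner.clean` also cleans the closure before checking it.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkException; not Spark's actual class.
final class ClosureCheckException(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

// Hypothetical sketch of a cleaner that checks serializability up front.
object SketchCleaner {
  // The second argument mirrors the boolean described in the commit message;
  // its name here is an assumption.
  def clean[F <: AnyRef](closure: F, checkSerializable: Boolean = true): F = {
    if (checkSerializable) ensureSerializable(closure)
    closure
  }

  // Try to serialize the closure and surface the underlying failure in the message.
  private def ensureSerializable(closure: AnyRef): Unit = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(closure)
    catch {
      case e: NotSerializableException =>
        throw new ClosureCheckException("closure not serializable: " + e, e)
    } finally out.close()
  }
}

object SketchCleanerDemo {
  def main(args: Array[String]): Unit = {
    val lock = new Object // java.lang.Object is not Serializable
    val good: Int => Int = _ + 1
    val bad: Int => Int = x => { lock.hashCode(); x } // captures `lock`

    SketchCleaner.clean(good)                           // passes the check
    SketchCleaner.clean(bad, checkSerializable = false) // check skipped
    try {
      SketchCleaner.clean(bad)                          // fails at the call site
    } catch {
      case e: ClosureCheckException => println("failed fast: " + e.getMessage)
    }
  }
}
```

In this sketch the failure surfaces where `clean` is called, which is the fail-fast behavior the commit message describes.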
commit 98e01ae854dd3fce03d753d5f25a6022ae6f58d6
Author: William Benton <[email protected]>
Date: 2014-03-14T16:40:56Z
Ensure assertions in Graph.apply are asserted.
The Graph.apply test in GraphSuite had some assertions in a closure in
a graph transformation. This caused two problems:
1. because assert() was called, test classes were reachable from the
closures, which made them not serializable, and
2. (more importantly) these assertions never actually executed, since
they occurred within a lazy map()
This commit simply changes the Graph.apply test to collect the graph
triplets so that it can assert on each triplet from a map method.
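The second problem can be reproduced without Spark. In the sketch below, a lazy `Iterator.map` is used as a stand-in for a lazy RDD `map()`, and `toList` as a stand-in for collecting the triplets. The object and method names are hypothetical, and the integers are placeholders for graph triplets.

```scala
object LazyAssertionDemo {
  // Returns how many times the closure body ran when the mapped
  // iterator is built but never consumed.
  def checksWithoutConsuming(triplets: Seq[Int]): Int = {
    var checked = 0
    // Iterator.map is lazy: the closure (and its assert) does not run here.
    val mapped = triplets.iterator.map { t => checked += 1; assert(t > 0); t }
    checked
  }

  // Returns how many times the assertion ran after materializing the data first.
  def checksAfterCollecting(triplets: Seq[Int]): Int = {
    var checked = 0
    val collected = triplets.toList // stand-in for collecting the triplets
    collected.foreach { t => checked += 1; assert(t > 0) }
    checked
  }

  def main(args: Array[String]): Unit = {
    val triplets = Seq(1, 2, 3) // placeholder data
    println(checksWithoutConsuming(triplets)) // prints 0
    println(checksAfterCollecting(triplets))  // prints 3
  }
}
```

A test written in the first style passes even when the asserted condition is false, because the assertion is never evaluated.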
commit 70a449d87018e7bfa8dbf7249948a7f48a891719
Author: William Benton <[email protected]>
Date: 2014-03-14T17:33:33Z
Make proactive serializability checking optional.
SparkContext.clean uses ClosureCleaner's proactive serializability
checking by default. This commit adds an overloaded clean method
to SparkContext that allows clients to specify that serializability
checking should not occur as part of closure cleaning.
commit 9eb301387644d5c14a03a0bbb96c6b007f228f3d
Author: William Benton <[email protected]>
Date: 2014-03-14T17:34:42Z
Don't check serializability of DStream transforms.
Since the DStream is reachable from within these closures, they aren't
checkable by the straightforward technique of passing them to the
closure serializer.
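As a rough illustration of why reachability matters, the sketch below shows a closure that fails a serialization check only because it reaches its non-serializable owner. `Owner` is a hypothetical stand-in for an object such as a stream or a test suite, not Spark's DStream, and `isSerializable` uses plain Java serialization rather than the closure serializer. The behavior shown assumes Scala 2.12 or later, where a function literal captures only what its body references.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ReachabilityDemo {
  // Hypothetical non-serializable owner of the closures below.
  class Owner(val factor: Int) {
    // Reads the field through `this`, so the whole Owner is captured.
    def capturesOwner: Int => Int = x => x * factor

    // Copies the field to a local first, so only an Int is captured.
    def capturesValue: Int => Int = { val f = factor; x => x * f }
  }

  // True if Java serialization accepts the object and everything it reaches.
  def isSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val owner = new Owner(2)
    println(isSerializable(owner.capturesOwner)) // prints false
    println(isSerializable(owner.capturesValue)) // prints true
  }
}
```

Both closures compute the same result; only the first one drags its owner into the object graph handed to the serializer.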
----