[
https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317279#comment-15317279
]
ASF GitHub Bot commented on TINKERPOP-1321:
-------------------------------------------
Github user dalaro commented on the issue:
https://github.com/apache/incubator-tinkerpop/pull/325
These questions involved a lot of back-and-forth that would have been slow
to resolve in GitHub comments, so Marko and I discussed the failures and the
code on a hangout earlier today.
Summary:
* Marko created
https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321 on the TP
repo. This branch is my PR plus additional tweaks Marko and I worked out on
top. His branch supersedes this PR branch now. This PR branch **should not be
merged because Marko's branch subsumes/extends it**.
* I have a rough idea of what changed as we discussed code on the hangout,
but I will go through his branch's new commits and review.
* The new serialization code is lower-priority than the old code. It has
to be turned on through explicit user action, as in this test provider
code:
```
System.setProperty(SHIM_CLASS_SYSTEM_PROPERTY, UnshadedKryoShimService.class.getCanonicalName());
config.put("spark.serializer", KryoSerializer.class.getCanonicalName());
config.put("spark.kryo.registrator", GryoRegistrator.class.getCanonicalName());
```
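For readers unfamiliar with the shim approach, here is a minimal sketch of the idea behind a service like `UnshadedKryoShimService`. The interface and class names below are illustrative assumptions, not TinkerPop's actual API: serializers are written against a tiny shim interface, and either shaded or unshaded Kryo can be plugged in underneath.

```java
// Hypothetical shim interfaces (names are assumptions, not the PR's API).
// Serialization logic depends only on these, never on a concrete Kryo package.
interface OutputShim {
    void writeString(String s);
}

interface SerializerShim<T> {
    void write(OutputShim out, T object);
}

// Trivial in-memory OutputShim standing in for a real Kryo Output,
// just to show that implementations are swappable behind the interface.
class BufferOutputShim implements OutputShim {
    final StringBuilder buf = new StringBuilder();
    public void writeString(String s) { buf.append(s); }
}
```

The point is that a `SerializerShim` compiled once can run against shaded or unshaded Kryo, because neither appears in its signature.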
* Marko set up SparkHadoopGraphProvider to randomly choose either the old
serialization code or the new code in
https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-afffc6796745845d46e6f60ea3a992f9R91.
IMHO that should be overrideable to make it deterministic, but it's not a
huge deal since it's limited to test code.
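A deterministic override could be as simple as checking a system property before falling back to the random pick. This is only a sketch of the idea; the property name and class below are hypothetical, not anything in the branch:

```java
import java.util.Random;

// Hypothetical sketch: an override property (name is an assumption) makes
// the provider's serializer choice deterministic, e.g. to reproduce a CI
// failure; when unset, the existing random pick is kept.
class SerializerPicker {
    static final String OVERRIDE_PROPERTY = "tinkerpop.test.useUnshadedShim"; // hypothetical

    static boolean useUnshadedShim() {
        String forced = System.getProperty(OVERRIDE_PROPERTY);
        if (forced != null) {
            return Boolean.parseBoolean(forced); // deterministic when set
        }
        return new Random().nextBoolean(); // current behavior: random per run
    }
}
```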
* We replaced `GryoClassResolver` with a series of class registrations in
https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-2e881d0a6ab7ed1ca7d1f593bceadfd2R207.
The idea here is to explicitly configure which types need a detaching
serializer rather than relying on `GryoClassResolver` to automagically do it
regardless of explicit class registrations. Three factors influence this
choice:
* Spark's KryoSerializer does not support a user-provided ClassResolver,
and working around that limitation entails either getting a change through
Spark upstream or copying and pasting a bunch of KryoSerializer's
private/hardcoded class registrations, some of which could change in future
Spark versions. The latter is ugly and would make the code brittle across
Spark upgrades.
* `GryoClassResolver` is convenient but somewhat mysterious to third
parties, who are likelier to have seen a custom class registration mechanism
(like TP's `IoRegistry` or Spark's `spark.kryo.registrator`) than a custom
resolver. It also removes vendor control in that its detachment behavior is
not configurable, so modifying it requires subclassing or reimplementation. For
instance, if a vendor had a lightweight Vertex implementation with a custom
serializer and wanted to bypass TP's resolver-driven detachment, then I don't
think it would be possible without modifying TP's resolver.
* `GryoClassResolver` is written against a bunch of shaded types and
relies on `readVarInt`/`writeVarInt`, which were added in Kryo 2.22, one
version too late for compatibility with Spark's Kryo 2.21. This is the least
important concern of the three, more an implementation detail.
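To make the explicit-registration idea concrete, here is an illustrative-only sketch. The class and method names are assumptions for illustration, not the actual `GryoRegistrator` API: each type that needs a detaching serializer is registered up front, instead of a custom `ClassResolver` deciding at lookup time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry sketch: explicit type-to-serializer registrations
// replace resolver-driven "automagic" detachment. Unregistered types fall
// back to Kryo's default field-by-field handling (FieldSerializer).
class ExplicitRegistry {
    private final Map<Class<?>, String> serializers = new LinkedHashMap<>();

    void register(Class<?> type, String serializerName) {
        serializers.put(type, serializerName);
    }

    String serializerFor(Class<?> type) {
        return serializers.getOrDefault(type, "FieldSerializer");
    }
}
```

Because the registrations are explicit data rather than resolver logic, a vendor can replace or omit a detaching serializer for its own types without subclassing anything.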
* Marko is running the full integration test suite now. He ran the spark
ITs just before pushing and reported success on those specifically.
> Loosen coupling between TinkerPop serialization logic and shaded Kryo
> ---------------------------------------------------------------------
>
> Key: TINKERPOP-1321
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1321
> Project: TinkerPop
> Issue Type: Improvement
> Components: io
> Affects Versions: 3.2.0-incubating
> Reporter: Dan LaRocque
> Fix For: 3.2.1
>
>
> When running jobs on Spark, TinkerPop currently recommends setting
> spark.serializer=GryoSerializer. This makes GryoSerializer responsible for
> serializing not just TinkerPop types but also scala runtime types and Spark
> internals. GryoSerializer doesn't extend either of the two serializers
> provided by Spark. It effectively assumes responsibility for reimplementing
> them.
> This is problematic. It is not totally trivial to replicate the
> functionality of Spark's standard serializers. It is also not easy to
> empirically test all meaningful cases. For instance, there is a conditional
> within Spark that selects between two concrete Map implementations depending
> on whether the current RDD partition count exceeds 2k
> (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53).
> The implementation used below this threshold serializes fine on
> GryoSerializer. The implementation used above the threshold does not. Above
> the partition threshold, I've started getting
> {{org.apache.spark.SparkException: Job aborted due to stage failure:
> Exception while getting task result:
> org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed
> to find one of the right cookies.}} Google leads to
> https://github.com/RoaringBitmap/RoaringBitmap/issues/64. However, just
> switching to Spark's {{KryoSerializer}} without changing anything somehow
> fixes the problem in my environment, implying that Spark has done something
> to address this problem that may not be fully replicated in TinkerPop.
> However, "just switching to Spark's KryoSerializer" is not a great approach.
> For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph
> serializer, and Spark traversals can produce a lot of little ego-StarGraphs.
> These still serialize, but KryoSerializer uses its default behavior
> (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's
> StarGraphSerializer. TinkerPop's reliance on its own internal shaded Kryo
> means that its serializers cannot be registered with Spark's unshaded Kryo.
> More concerning, it's impossible to completely switch to KryoSerializer just
> by tweaking the configuration. Besides spark.serializer, there is also a
> setting spark.closure.serializer for which the only supported value is
> JavaSerializer. Key TP classes that make it into the object reference graphs
> of Spark closures implement Serializable by resorting to TinkerPop's shaded
> Kryo via HadoopPools (looking at Object/VertexWritable). This leads to
> surprises with custom property data types. It doesn't matter if those types
> implement Serializable, and it doesn't matter if Spark's KryoSerializer is
> configured to accept those types. If those types are reachable from
> Object/VertexWritable, then they must be registered with TinkerPop's internal
> shaded Kryo, or else it will choke on them (unless it was explicitly
> configured to allow unregistered classes).
> I suggest the following change to give users more flexibility in their choice
> of spark.serializer, and to allow them to reuse TinkerPop's serializers if
> they choose not to use GryoSerializer: introduce lightweight interfaces that
> decouple TinkerPop's serialization logic from the exact Kryo shaded/unshaded
> package doing the work. TinkerPop's serialization logic would be written
> against interfaces that replicate a minimal subset of Kryo, and then TP's
> shaded Kryo or Spark's unshaded Kryo could be plugged in underneath without
> having to touch the source, recompile any TinkerPop code, or munge bytecode
> at runtime.
> This would not resolve all of the potential problems/complexity around
> TinkerPop serialization, but it would make it possible for users to apply the
> spark.serializer of their choice, switching off GryoSerializer if they so
> choose. Users could also continue to use GryoSerializer as they have until
> now.
> I've already run this through a few iterations locally and have an
> abstraction that allows me to run with spark.serializer=KryoSerializer,
> register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's
> StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot. I
> just rebased it on current master. I'll submit a PR for discussion/review.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)