[
https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15314655#comment-15314655
]
ASF GitHub Bot commented on TINKERPOP-1321:
-------------------------------------------
Github user dalaro commented on the issue:
https://github.com/apache/incubator-tinkerpop/pull/325
After discussion on Slack, I think @okram and I tentatively agreed to
proceed with this PR after I do additional work to save users who have custom
serializers the effort of maintaining separate `IoRegistry` and
`spark.kryo.registrator` implementations with near-identical contents.
I may discover other complications during the process, but I think this
involves two changes:
1. I will attempt to subclass KryoSerializer so that I can access the
SparkConf passed to its constructor and check for
`GryoPool.CONFIG_IO_REGISTRY` (similar to what GryoSerializer does now), then
apply any registrations found therein so long as each registration either:
* specifies no explicit serializer (using Kryo's internal default) or
* specifies a shim serializer
In particular, if the registry contains an old-style TP shaded Kryo
serializer that hasn't been ported to the shim, then the KryoSerializer
subclass will throw an exception.
This change is necessary to support custom-serialized,
`IoRegistry`-listed datatypes that appear in most Spark data outside of
closures (say, in the RDD itself).
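The acceptance rule in (1) can be sketched roughly as follows; `RegistrationCheck`, `Registration`, and `ShimSerializer` here are illustrative stand-ins, not the actual classes from this PR:

```java
import java.util.List;

// Hypothetical sketch of the registration acceptance check described in (1).
// The real logic would live in a KryoSerializer subclass that reads
// GryoPool.CONFIG_IO_REGISTRY from the SparkConf passed to its constructor.
public class RegistrationCheck {

    /** Marker for serializers written against the proposed shim interfaces. */
    interface ShimSerializer {}

    /** A registered type plus its (possibly absent) explicit serializer. */
    record Registration(Class<?> type, Object serializer) {}

    /**
     * Accept a registration only if it names no explicit serializer (so
     * Kryo's internal default applies) or a shim serializer; reject
     * old-style shaded-Kryo serializers not yet ported to the shim.
     */
    static void validate(List<Registration> registrations) {
        for (Registration r : registrations) {
            if (r.serializer() != null && !(r.serializer() instanceof ShimSerializer)) {
                throw new IllegalStateException(
                    r.type() + " is bound to a non-shim serializer: "
                        + r.serializer().getClass());
            }
        }
    }
}
```

The real subclass would build this list from the `IoRegistry` instances named under `GryoPool.CONFIG_IO_REGISTRY` before applying the surviving registrations to Spark's unshaded Kryo.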
2. I will replace current callsites of `HadoopPools.initialize(conf)` with
some other static method that calls `HadoopPools.initialize(conf)` and then
calls some roughly equivalent `initialize(conf)` method for the static KryoPool
backing `KryoPoolShimService`.
This change is necessary to support custom-serialized,
`IoRegistry`-listed datatypes that appear in closures.
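A minimal sketch of the single entry point in (2); the stand-in methods below are placeholders for `HadoopPools.initialize(conf)` and the proposed initializer for the pool backing `KryoPoolShimService`:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Self-contained sketch of the combined initializer in (2). The two
// initialize* methods are stand-ins for the real pool initializers; the
// counters exist only so the sketch is observable.
public class PoolBootstrap {
    static final AtomicInteger hadoopInits = new AtomicInteger();
    static final AtomicInteger shimInits = new AtomicInteger();

    // Stand-in for HadoopPools.initialize(conf) (shaded-Kryo pool).
    static void initializeHadoopPools(Map<String, String> conf) {
        hadoopInits.incrementAndGet();
    }

    // Stand-in for initializing the static KryoPool behind KryoPoolShimService.
    static void initializeShimPool(Map<String, String> conf) {
        shimInits.incrementAndGet();
    }

    /** One static entry point so callers cannot set up one pool and forget the other. */
    static void initializeAll(Map<String, String> conf) {
        initializeHadoopPools(conf);
        initializeShimPool(conf);
    }
}
```

Callsites that currently invoke `HadoopPools.initialize(conf)` directly would be redirected to the combined method, so both pools see the same configuration.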
> Loosen coupling between TinkerPop serialization logic and shaded Kryo
> ---------------------------------------------------------------------
>
> Key: TINKERPOP-1321
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1321
> Project: TinkerPop
> Issue Type: Improvement
> Components: io
> Affects Versions: 3.2.0-incubating
> Reporter: Dan LaRocque
> Fix For: 3.2.1
>
>
> When running jobs on Spark, TinkerPop currently recommends setting
> spark.serializer=GryoSerializer. This makes GryoSerializer responsible for
> serializing not just TinkerPop types but also scala runtime types and Spark
> internals. GryoSerializer doesn't extend either of the two serializers
> provided by Spark. It effectively assumes responsibility for reimplementing
> them.
> This is problematic. It is not totally trivial to replicate the
> functionality of Spark's standard serializers. It is also not easy to
> empirically test all meaningful cases. For instance, there is a conditional
> within Spark that selects between two concrete Map implementations depending
> on whether the current RDD partition count exceeds 2k
> (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53).
> The implementation used below this threshold serializes fine on
> GryoSerializer. The implementation used above the threshold does not. Above
> the partition threshold, I've started getting
> {{org.apache.spark.SparkException: Job aborted due to stage failure:
> Exception while getting task result:
> org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed
> to find one of the right cookies.}} Google leads to
> https://github.com/RoaringBitmap/RoaringBitmap/issues/64. However, just
> switching to Spark's {{KryoSerializer}} without changing anything somehow
> fixes the problem in my environment, implying that Spark has done something
> to address this problem that may not be fully replicated in TinkerPop.
> However, "just switching to Spark's KryoSerializer" is not a great approach.
> For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph
> serializer, and Spark traversals can produce a lot of little ego-StarGraphs.
> These still serialize, but KryoSerializer uses its default behavior
> (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's
> StarGraphSerializer. TinkerPop's reliance on its own internal shaded Kryo
> means that its serializers cannot be registered with Spark's unshaded Kryo.
> More concerning, it's impossible to completely switch to KryoSerializer just
> by tweaking the configuration. Besides spark.serializer, there is also a
> setting spark.closure.serializer for which the only supported value is
> JavaSerializer. Key TP classes that make it into the object reference graphs
> of Spark closures implement Serializable by resorting to TinkerPop's shaded
> Kryo via HadoopPools (looking at Object/VertexWritable). This leads to
> surprises with custom property data types. It doesn't matter if those types
> implement Serializable, and it doesn't matter if Spark's KryoSerializer is
> configured to accept those types. If those types are reachable from
> Object/VertexWritable, then they must be registered with TinkerPop's internal
> shaded Kryo, or else it will choke on them (unless it was explicitly
> configured to allow unregistered classes).
> I suggest the following change to give users more flexibility in their choice
> of spark.serializer, and to allow them to reuse TinkerPop's serializers if
> they choose not to use GryoSerializer: introduce lightweight interfaces that
> decouple TinkerPop's serialization logic from the exact Kryo shaded/unshaded
> package doing the work. TinkerPop's serialization logic would be written
> against interfaces that replicate a minimal subset of Kryo, and then TP's
> shaded Kryo or Spark's unshaded Kryo could be plugged in underneath without
> having to touch the source, recompile any TinkerPop code, or munge bytecode
> at runtime.
> This would not resolve all of the potential problems/complexity around
> TinkerPop serialization, but it would make it possible for users to apply the
> spark.serializer of their choice, switching off GryoSerializer if they so
> choose. Users could also continue to use GryoSerializer as they have until
> now.
> I've already run this through a few iterations locally and have an
> abstraction that allows me to run with spark.serializer=KryoSerializer,
> register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's
> StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot. I
> just rebased it on current master. I'll submit a PR for discussion/review.
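The "lightweight interfaces" proposed in the quoted description might look roughly like this; the interface and method names are illustrative only, not the PR's actual API:

```java
// Hypothetical sketch of a minimal shim over the subset of Kryo that
// TinkerPop serializers need. A serializer written once against these
// interfaces could be backed by either TP's shaded Kryo or Spark's
// unshaded Kryo without recompiling TinkerPop code.
public final class ShimSketch {

    /** Mirrors the Kryo output operations a TinkerPop serializer uses. */
    interface OutputShim {
        void writeString(String s);
        void writeLong(long l);
    }

    /** Mirrors the Kryo input operations a TinkerPop serializer uses. */
    interface InputShim {
        String readString();
        long readLong();
    }

    /** A serializer written against the shim, pluggable into either Kryo. */
    interface SerializerShim<T> {
        void write(OutputShim out, T object);
        T read(InputShim in);
    }
}
```

Each concrete Kryo (shaded or unshaded) would supply thin adapter implementations of `OutputShim`/`InputShim` wrapping its own `Output`/`Input` classes.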
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)