[
https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311952#comment-15311952
]
Dan LaRocque commented on TINKERPOP-1321:
-----------------------------------------
I opened a PR: https://github.com/apache/incubator-tinkerpop/pull/325
> Loosen coupling between TinkerPop serialization logic and shaded Kryo
> ---------------------------------------------------------------------
>
> Key: TINKERPOP-1321
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1321
> Project: TinkerPop
> Issue Type: Improvement
> Reporter: Dan LaRocque
>
> When running jobs on Spark, TinkerPop currently recommends setting
> spark.serializer=GryoSerializer. This makes GryoSerializer responsible for
> serializing not just TinkerPop types but also scala runtime types and Spark
> internals. GryoSerializer doesn't extend either of the two serializers
> provided by Spark. It effectively assumes responsibility for reimplementing
> them.
> This is problematic. It is not totally trivial to replicate the
> functionality of Spark's standard serializers. It is also not easy to
> empirically test all meaningful cases. For instance, there is a conditional
> within Spark that selects between two concrete Map implementations depending
> on whether the current RDD partition count exceeds 2k
> (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53).
> The implementation used below this threshold serializes fine on
> GryoSerializer. The implementation used above the threshold does not. Above
> the partition threshold, I've started getting
> {{org.apache.spark.SparkException: Job aborted due to stage failure:
> Exception while getting task result:
> org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed
> to find one of the right cookies.}} Google leads to
> https://github.com/RoaringBitmap/RoaringBitmap/issues/64. However, just
> switching to Spark's {{KryoSerializer}} without changing anything somehow
> fixes the problem in my environment, implying that Spark has done something
> to address this problem that may not be fully replicated in TinkerPop.
> However, "just switching to Spark's KryoSerializer" is not a great approach.
> For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph
> serializer, and Spark traversals can produce a lot of little ego-StarGraphs.
> These still serialize, but KryoSerializer uses its default behavior
> (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's
> StarGraphSerializer. TinkerPop's reliance on its own internal shaded Kryo
> means that its serializers cannot be registered with Spark's unshaded Kryo.
> More concerning, it's impossible to completely switch to KryoSerializer just
> by tweaking the configuration. Besides spark.serializer, there is also a
> setting spark.closure.serializer for which the only supported value is
> JavaSerializer. Key TP classes that make it into the object reference graphs
> of Spark closures implement Serializable by resorting to TinkerPop's shaded
> Kryo via HadoopPools (looking at Object/VertexWritable). This leads to
> surprises with custom property data types. It doesn't matter if those types
> implement Serializable, and it doesn't matter if Spark's KryoSerializer is
> configured to accept those types. If those types are reachable from
> Object/VertexWritable, then they must be registered with TinkerPop's internal
> shaded Kryo, or else it will choke on them (unless it was explicitly
> configured to allow unregistered classes).
> I suggest the following change to give users more flexibility in their choice
> of spark.serializer, and to allow them to reuse TinkerPop's serializers if
> they choose not to use GryoSerializer: introduce lightweight interfaces that
> decouple TinkerPop's serialization logic from the exact Kryo shaded/unshaded
> package doing the work. TinkerPop's serialization logic would be written
> against interfaces that replicate a minimal subset of Kryo, and then TP's
> shaded Kryo or Spark's unshaded Kryo could be plugged in underneath without
> having to touch the source, recompile any TinkerPop code, or munge bytecode
> at runtime.
> This would not resolve all of the potential problems/complexity around
> TinkerPop serialization, but it would make it possible for users to apply the
> spark.serializer of their choice, switching off GryoSerializer if they so
> choose. Users could also continue to use GyroSerializer as they have until
> now.
> I've already run this through a few iterations locally and have an
> abstraction that allows me to run with spark.serializer=KryoSerializer,
> register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's
> StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot. I
> just rebased it on current master. I'll submit a PR for discussion/review.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)