[ https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317589#comment-15317589 ]
ASF GitHub Bot commented on TINKERPOP-1321: ------------------------------------------- Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Just a reminder: we are voting on branch https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321. That subsumes the PR and adds a combination of fixes from Marko and me. > Loosen coupling between TinkerPop serialization logic and shaded Kryo > --------------------------------------------------------------------- > > Key: TINKERPOP-1321 > URL: https://issues.apache.org/jira/browse/TINKERPOP-1321 > Project: TinkerPop > Issue Type: Improvement > Components: io > Affects Versions: 3.2.0-incubating > Reporter: Dan LaRocque > Fix For: 3.2.1 > > > When running jobs on Spark, TinkerPop currently recommends setting > spark.serializer=GryoSerializer. This makes GryoSerializer responsible for > serializing not just TinkerPop types but also scala runtime types and Spark > internals. GryoSerializer doesn't extend either of the two serializers > provided by Spark. It effectively assumes responsibility for reimplementing > them. > This is problematic. It is not totally trivial to replicate the > functionality of Spark's standard serializers. It is also not easy to > empirically test all meaningful cases. For instance, there is a conditional > within Spark that selects between two concrete Map implementations depending > on whether the current RDD partition count exceeds 2k > (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53). > The implementation used below this threshold serializes fine on > GryoSerializer. The implementation used above the threshold does not. Above > the partition threshold, I've started getting > {{org.apache.spark.SparkException: Job aborted due to stage failure: > Exception while getting task result: > org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed > to find one of the right cookies.}} Google leads to > https://github.com/RoaringBitmap/RoaringBitmap/issues/64. However, just > switching to Spark's {{KryoSerializer}} without changing anything somehow > fixes the problem in my environment, implying that Spark has done something > to address this problem that may not be fully replicated in TinkerPop. > However, "just switching to Spark's KryoSerializer" is not a great approach. > For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph > serializer, and Spark traversals can produce a lot of little ego-StarGraphs. > These still serialize, but KryoSerializer uses its default behavior > (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's > StarGraphSerializer. TinkerPop's reliance on its own internal shaded Kryo > means that its serializers cannot be registered with Spark's unshaded Kryo. > More concerning, it's impossible to completely switch to KryoSerializer just > by tweaking the configuration. Besides spark.serializer, there is also a > setting spark.closure.serializer for which the only supported value is > JavaSerializer. Key TP classes that make it into the object reference graphs > of Spark closures implement Serializable by resorting to TinkerPop's shaded > Kryo via HadoopPools (looking at Object/VertexWritable). This leads to > surprises with custom property data types. It doesn't matter if those types > implement Serializable, and it doesn't matter if Spark's KryoSerializer is > configured to accept those types. If those types are reachable from > Object/VertexWritable, then they must be registered with TinkerPop's internal > shaded Kryo, or else it will choke on them (unless it was explicitly > configured to allow unregistered classes). > I suggest the following change to give users more flexibility in their choice > of spark.serializer, and to allow them to reuse TinkerPop's serializers if > they choose not to use GryoSerializer: introduce lightweight interfaces that > decouple TinkerPop's serialization logic from the exact Kryo shaded/unshaded > package doing the work. TinkerPop's serialization logic would be written > against interfaces that replicate a minimal subset of Kryo, and then TP's > shaded Kryo or Spark's unshaded Kryo could be plugged in underneath without > having to touch the source, recompile any TinkerPop code, or munge bytecode > at runtime. > This would not resolve all of the potential problems/complexity around > TinkerPop serialization, but it would make it possible for users to apply the > spark.serializer of their choice, switching off GryoSerializer if they so > choose. Users could also continue to use GyroSerializer as they have until > now. > I've already run this through a few iterations locally and have an > abstraction that allows me to run with spark.serializer=KryoSerializer, > register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's > StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot. I > just rebased it on current master. I'll submit a PR for discussion/review. -- This message was sent by Atlassian JIRA (v6.3.4#6332)