[jira] [Commented] (TINKERPOP-1321) Loosen coupling between TinkerPop serialization logic and shaded Kryo

Dan LaRocque (JIRA) Thu, 02 Jun 2016 01:38:05 -0700

    [ 
https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311952#comment-15311952
 ]


Dan LaRocque commented on TINKERPOP-1321:
-----------------------------------------

I opened a PR: https://github.com/apache/incubator-tinkerpop/pull/325

> Loosen coupling between TinkerPop serialization logic and shaded Kryo
> ---------------------------------------------------------------------
>
>                 Key: TINKERPOP-1321
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1321
>             Project: TinkerPop
>          Issue Type: Improvement
>            Reporter: Dan LaRocque
>
> When running jobs on Spark, TinkerPop currently recommends setting 
> spark.serializer=GryoSerializer.  This makes GryoSerializer responsible for 
> serializing not just TinkerPop types but also scala runtime types and Spark 
> internals.  GryoSerializer doesn't extend either of the two serializers 
> provided by Spark.  It effectively assumes responsibility for reimplementing 
> them.
> This is problematic.  It is not totally trivial to replicate the 
> functionality of Spark's standard serializers.  It is also not easy to 
> empirically test all meaningful cases.  For instance, there is a conditional 
> within Spark that selects between two concrete Map implementations depending 
> on whether the current RDD partition count exceeds 2k 
> (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53).
>   The implementation used below this threshold serializes fine on 
> GryoSerializer.  The implementation used above the threshold does not.  Above 
> the partition threshold, I've started getting 
> {{org.apache.spark.SparkException: Job aborted due to stage failure: 
> Exception while getting task result: 
> org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed 
> to find one of the right cookies.}}  Google leads to 
> https://github.com/RoaringBitmap/RoaringBitmap/issues/64.  However, just 
> switching to Spark's {{KryoSerializer}} without changing anything somehow 
> fixes the problem in my environment, implying that Spark has done something 
> to address this problem that may not be fully replicated in TinkerPop.
> However, "just switching to Spark's KryoSerializer" is not a great approach.  
> For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph 
> serializer, and Spark traversals can produce a lot of little ego-StarGraphs.  
> These still serialize, but KryoSerializer uses its default behavior 
> (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's 
> StarGraphSerializer.  TinkerPop's reliance on its own internal shaded Kryo 
> means that its serializers cannot be registered with Spark's unshaded Kryo.
> More concerning, it's impossible to completely switch to KryoSerializer just 
> by tweaking the configuration.  Besides spark.serializer, there is also a 
> setting spark.closure.serializer for which the only supported value is 
> JavaSerializer.  Key TP classes that make it into the object reference graphs 
> of Spark closures implement Serializable by resorting to TinkerPop's shaded 
> Kryo via HadoopPools (looking at Object/VertexWritable).  This leads to 
> surprises with custom property data types.  It doesn't matter if those types 
> implement Serializable, and it doesn't matter if Spark's KryoSerializer is 
> configured to accept those types.  If  those types are reachable from 
> Object/VertexWritable, then they must be registered with TinkerPop's internal 
> shaded Kryo, or else it will choke on them (unless it was explicitly 
> configured to allow unregistered classes).
> I suggest the following change to give users more flexibility in their choice 
> of spark.serializer, and to allow them to reuse TinkerPop's serializers if 
> they choose not to use GryoSerializer: introduce lightweight interfaces that 
> decouple TinkerPop's serialization logic from the exact Kryo shaded/unshaded 
> package doing the work.  TinkerPop's serialization logic would be written 
> against interfaces that replicate a minimal subset of Kryo, and then TP's 
> shaded Kryo or Spark's unshaded Kryo could be plugged in underneath without 
> having to touch the source, recompile any TinkerPop code, or munge bytecode 
> at runtime.
> This would not resolve all of the potential problems/complexity around 
> TinkerPop serialization, but it would make it possible for users to apply the 
> spark.serializer of their choice, switching off GryoSerializer if they so 
> choose.  Users could also continue to use GyroSerializer as they have until 
> now.
> I've already run this through a few iterations locally and have an 
> abstraction that allows me to run with spark.serializer=KryoSerializer, 
> register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's 
> StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot.  I 
> just rebased it on current master.  I'll submit a PR for discussion/review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TINKERPOP-1321) Loosen coupling between TinkerPop serialization logic and shaded Kryo

Reply via email to