[
https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15314655#comment-15314655
]
ASF GitHub Bot commented on TINKERPOP-1321:
-------------------------------------------
Github user dalaro commented on the issue:
https://github.com/apache/incubator-tinkerpop/pull/325
After discussion on Slack, I think @okram and I tentatively agreed to
proceed with this PR after I do additional work to save users who have custom
serializers the effort of maintaining separate `IoRegistry` and
`spark.kryo.registrator` implementations with near-identical contents.
I may discover other complications during the process, but I think this
involves two changes:
1. I will attempt to subclass KryoSerializer so that I can access the
SparkConf passed to its constructor and check for
`GryoPool.CONFIG_IO_REGISTRY` (similar to what GryoSerializer does now), then
apply any registrations found therein so long as each registration either:
* specifies no explicit serializer (using Kryo's internal default) or
* specifies a shim serializer
In particular, if the registry contains an old-style TP shaded Kryo
serializer that hasn't been ported to the shim, then the KryoSerializer
subclass will throw an exception.
This change is necessary to support custom-serialized,
`IoRegistry`-listed datatypes that appear in most Spark data outside of
closures (say, in the RDD itself).
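The acceptance rule in (1) can be sketched roughly as follows; `RegistrationCheck`, `Registration`, and `ShimSerializer` here are illustrative stand-ins, not the actual classes from this PR:

```java
import java.util.List;

// Hypothetical sketch of the registration acceptance check described in (1).
// The real logic would live in a KryoSerializer subclass that reads
// GryoPool.CONFIG_IO_REGISTRY from the SparkConf passed to its constructor.
public class RegistrationCheck {

    /** Marker for serializers written against the proposed shim interfaces. */
    interface ShimSerializer {}

    /** A registered type plus its (possibly absent) explicit serializer. */
    record Registration(Class<?> type, Object serializer) {}

    /**
     * Accept a registration only if it names no explicit serializer (so
     * Kryo's internal default applies) or a shim serializer; reject
     * old-style shaded-Kryo serializers not yet ported to the shim.
     */
    static void validate(List<Registration> registrations) {
        for (Registration r : registrations) {
            if (r.serializer() != null && !(r.serializer() instanceof ShimSerializer)) {
                throw new IllegalStateException(
                    r.type() + " is bound to a non-shim serializer: "
                        + r.serializer().getClass());
            }
        }
    }
}
```

The real subclass would build this list from the `IoRegistry` instances named under `GryoPool.CONFIG_IO_REGISTRY` before applying the surviving registrations to Spark's unshaded Kryo.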
2. I will replace current callsites of `HadoopPools.initialize(conf)` with
some other static method that calls `HadoopPools.initialize(conf)` and then
calls some roughly equivalent `initialize(conf)` method for the static KryoPool
backing `KryoPoolShimService`.
This change is necessary to support custom-serialized,
`IoRegistry`-listed datatypes that appear in closures.
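A minimal sketch of the single entry point in (2); the stand-in methods below are placeholders for `HadoopPools.initialize(conf)` and the proposed initializer for the pool backing `KryoPoolShimService`:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Self-contained sketch of the combined initializer in (2). The two
// initialize* methods are stand-ins for the real pool initializers; the
// counters exist only so the sketch is observable.
public class PoolBootstrap {
    static final AtomicInteger hadoopInits = new AtomicInteger();
    static final AtomicInteger shimInits = new AtomicInteger();

    // Stand-in for HadoopPools.initialize(conf) (shaded-Kryo pool).
    static void initializeHadoopPools(Map<String, String> conf) {
        hadoopInits.incrementAndGet();
    }

    // Stand-in for initializing the static KryoPool behind KryoPoolShimService.
    static void initializeShimPool(Map<String, String> conf) {
        shimInits.incrementAndGet();
    }

    /** One static entry point so callers cannot set up one pool and forget the other. */
    static void initializeAll(Map<String, String> conf) {
        initializeHadoopPools(conf);
        initializeShimPool(conf);
    }
}
```

Callsites that currently invoke `HadoopPools.initialize(conf)` directly would be redirected to the combined method, so both pools see the same configuration.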
> Loosen coupling between TinkerPop serialization logic and shaded Kryo
> ---------------------------------------------------------------------
>
> Key: TINKERPOP-1321
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1321
> Project: TinkerPop
> Issue Type: Improvement
> Components: io
> Affects Versions: 3.2.0-incubating
> Reporter: Dan LaRocque
> Fix For: 3.2.1
>
>
> When running jobs on Spark, TinkerPop currently recommends setting
> spark.serializer=GryoSerializer. This makes GryoSerializer responsible for
> serializing not just TinkerPop types but also scala runtime types and Spark
> internals. GryoSerializer doesn't extend either of the two serializers
> provided by Spark. It effectively assumes responsibility for reimplementing
> them.
> This is problematic. It is not totally trivial to replicate the
> functionality of Spark's standard serializers. It is also not easy to
> empirically test all meaningful cases. For instance, there is a conditional
> within Spark that selects between two concrete Map implementations depending
> on whether the current RDD partition count exceeds 2k
> (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53).
> The implementation used below this threshold serializes fine on
> GryoSerializer. The implementation used above the threshold does not. Above
> the partition threshold, I've started getting
> {{org.apache.spark.SparkException: Job aborted due to stage failure:
> Exception while getting task result:
> org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed
> to find one of the right cookies.}} Google leads to
> https://github.com/RoaringBitmap/RoaringBitmap/issues/64. However, just
> switching to Spark's {{KryoSerializer}} without changing anything somehow
> fixes the problem in my environment, implying that Spark has done something
> to address this problem that may not be fully replicated in TinkerPop.
> However, "just switching to Spark's KryoSerializer" is not a great approach.
> For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph
> serializer, and Spark traversals can produce a lot of little ego-StarGraphs.
> These still serialize, but KryoSerializer uses its default behavior
> (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's
> StarGraphSerializer. TinkerPop's reliance on its own internal shaded Kryo
> means that its serializers cannot be registered with Spark's unshaded Kryo.
> More concerning, it's impossible to completely switch to KryoSerializer just
> by tweaking the configuration. Besides spark.serializer, there is also a
> setting spark.closure.serializer for which the only supported value is
> JavaSerializer. Key TP classes that make it into the object reference graphs
> of Spark closures implement Serializable by resorting to TinkerPop's shaded
> Kryo via HadoopPools (looking at Object/VertexWritable). This leads to
> surprises with custom property data types. It doesn't matter if those types
> implement Serializable, and it doesn't matter if Spark's KryoSerializer is
> configured to accept those types. If those types are reachable from
> Object/VertexWritable, then they must be registered with TinkerPop's internal
> shaded Kryo, or else it will choke on them (unless it was explicitly
> configured to allow unregistered classes).
> I suggest the following change to give users more flexibility in their choice
> of spark.serializer, and to allow them to reuse TinkerPop's serializers if
> they choose not to use GryoSerializer: introduce lightweight interfaces that
> decouple TinkerPop's serialization logic from the exact Kryo shaded/unshaded
> package doing the work. TinkerPop's serialization logic would be written
> against interfaces that replicate a minimal subset of Kryo, and then TP's
> shaded Kryo or Spark's unshaded Kryo could be plugged in underneath without
> having to touch the source, recompile any TinkerPop code, or munge bytecode
> at runtime.
> This would not resolve all of the potential problems/complexity around
> TinkerPop serialization, but it would make it possible for users to apply the
> spark.serializer of their choice, switching off GryoSerializer if they so
> choose. Users could also continue to use GryoSerializer as they have until
> now.
> I've already run this through a few iterations locally and have an
> abstraction that allows me to run with spark.serializer=KryoSerializer,
> register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's
> StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot. I
> just rebased it on current master. I'll submit a PR for discussion/review.
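The "lightweight interfaces" proposed in the quoted description might look roughly like this; the interface and method names are illustrative only, not the PR's actual API:

```java
// Hypothetical sketch of a minimal shim over the subset of Kryo that
// TinkerPop serializers need. A serializer written once against these
// interfaces could be backed by either TP's shaded Kryo or Spark's
// unshaded Kryo without recompiling TinkerPop code.
public final class ShimSketch {

    /** Mirrors the Kryo output operations a TinkerPop serializer uses. */
    interface OutputShim {
        void writeString(String s);
        void writeLong(long l);
    }

    /** Mirrors the Kryo input operations a TinkerPop serializer uses. */
    interface InputShim {
        String readString();
        long readLong();
    }

    /** A serializer written against the shim, pluggable into either Kryo. */
    interface SerializerShim<T> {
        void write(OutputShim out, T object);
        T read(InputShim in);
    }
}
```

Each concrete Kryo (shaded or unshaded) would supply thin adapter implementations of `OutputShim`/`InputShim` wrapping its own `Output`/`Input` classes.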
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)