[ https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317279#comment-15317279 ]

ASF GitHub Bot commented on TINKERPOP-1321:
-------------------------------------------

Github user dalaro commented on the issue:

    https://github.com/apache/incubator-tinkerpop/pull/325
  
    These questions involved a lot of back-and-forth that would have been slow 
to resolve in GitHub comments, so Marko and I discussed the failures and the 
code on a hangout earlier today.
    
    Summary:
    
    * Marko created 
https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321 on the TP 
repo.  This branch is my PR plus additional tweaks Marko and I worked out on 
top.  His branch supersedes this PR branch now.  This PR branch **should not be 
merged because Marko's branch subsumes/extends it**.
    * I have a rough idea of what changed as we discussed code on the hangout, 
but I will go through his branch's new commits and review.
    * The new serialization code is lower-priority than the old code.  It has 
to be turned on through explicit user action, like this in the test provider 
code:
    ```java
    System.setProperty(SHIM_CLASS_SYSTEM_PROPERTY,
            UnshadedKryoShimService.class.getCanonicalName());
    config.put("spark.serializer", KryoSerializer.class.getCanonicalName());
    config.put("spark.kryo.registrator", GryoRegistrator.class.getCanonicalName());
    ```
    * Marko set up SparkHadoopGraphProvider to randomly choose either the old 
serialization code or the new code in 
https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-afffc6796745845d46e6f60ea3a992f9R91.
  IMHO that should be overridable to make it deterministic, but it's not a 
huge deal since it's limited to test code.
    * We replaced `GryoClassResolver` with a series of class registrations in 
https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-2e881d0a6ab7ed1ca7d1f593bceadfd2R207.
  The idea here is to explicitly configure which types need a detaching 
serializer rather than relying on `GryoClassResolver` to automagically do it 
regardless of explicit class registrations.  Three factors influence this 
choice:
       * Spark's KryoSerializer does not support a user-provided ClassResolver, 
and working around that limitation entails either getting a change through 
Spark upstream or copying and pasting a bunch of KryoSerializer's 
private/hardcoded class registrations, some of which could change in future 
Spark versions.  The latter is ugly and would make the code brittle across 
Spark upgrades.
       * `GryoClassResolver` is convenient but somewhat mysterious to third 
parties, who are likelier to have seen a custom class registration mechanism 
(like TP's `IoRegistry` or Spark's `spark.kryo.registrator`) than a custom 
resolver.  It also removes vendor control in that its detachment behavior is 
not configurable, so modifying it requires subclassing or reimplementation. For 
instance, if a vendor had a lightweight Vertex implementation with a custom 
serializer and wanted to bypass TP's resolver-driven detachment, then I don't 
think it would be possible without modifying TP's resolver.
       * `GryoClassResolver` is written against a bunch of shaded types and 
relies on `readVarInt`/`writeVarInt`, which were added in Kryo 2.22, one 
version too late for compatibility with Spark's Kryo 2.21.  This is the least 
important concern of the three, more an implementation detail.
    * Marko is running the full integration test suite now.  He ran the spark 
ITs just before pushing and reported success on those specifically.
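
    For a user (rather than the test provider) enabling the new path, the 
snippet above corresponds to roughly the following Spark configuration.  This 
is just a sketch: the property keys come from the snippet, but the 
fully-qualified package names are my assumption and should be checked against 
the branch.

    ```
    # spark-defaults.conf sketch -- keys taken from the test provider snippet;
    # package names are assumed, not confirmed against the merged branch
    spark.serializer        org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator  org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
    ```

    The shim service system property would additionally have to be set on the 
driver and executor JVMs, as in the snippet.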


> Loosen coupling between TinkerPop serialization logic and shaded Kryo
> ---------------------------------------------------------------------
>
>                 Key: TINKERPOP-1321
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1321
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 3.2.0-incubating
>            Reporter: Dan LaRocque
>             Fix For: 3.2.1
>
>
> When running jobs on Spark, TinkerPop currently recommends setting 
> spark.serializer=GryoSerializer.  This makes GryoSerializer responsible for 
> serializing not just TinkerPop types but also scala runtime types and Spark 
> internals.  GryoSerializer doesn't extend either of the two serializers 
> provided by Spark.  It effectively assumes responsibility for reimplementing 
> them.
> This is problematic.  It is not totally trivial to replicate the 
> functionality of Spark's standard serializers.  It is also not easy to 
> empirically test all meaningful cases.  For instance, there is a conditional 
> within Spark that selects between two concrete Map implementations depending 
> on whether the current RDD partition count exceeds 2k 
> (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53).
>   The implementation used below this threshold serializes fine on 
> GryoSerializer.  The implementation used above the threshold does not.  Above 
> the partition threshold, I've started getting 
> {{org.apache.spark.SparkException: Job aborted due to stage failure: 
> Exception while getting task result: 
> org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed 
> to find one of the right cookies.}}  Google leads to 
> https://github.com/RoaringBitmap/RoaringBitmap/issues/64.  However, just 
> switching to Spark's {{KryoSerializer}} without changing anything somehow 
> fixes the problem in my environment, implying that Spark has done something 
> to address this problem that may not be fully replicated in TinkerPop.
> However, "just switching to Spark's KryoSerializer" is not a great approach.  
> For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph 
> serializer, and Spark traversals can produce a lot of little ego-StarGraphs.  
> These still serialize, but KryoSerializer uses its default behavior 
> (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's 
> StarGraphSerializer.  TinkerPop's reliance on its own internal shaded Kryo 
> means that its serializers cannot be registered with Spark's unshaded Kryo.
> More concerning, it's impossible to completely switch to KryoSerializer just 
> by tweaking the configuration.  Besides spark.serializer, there is also a 
> setting spark.closure.serializer for which the only supported value is 
> JavaSerializer.  Key TP classes that make it into the object reference graphs 
> of Spark closures implement Serializable by resorting to TinkerPop's shaded 
> Kryo via HadoopPools (looking at Object/VertexWritable).  This leads to 
> surprises with custom property data types.  It doesn't matter if those types 
> implement Serializable, and it doesn't matter if Spark's KryoSerializer is 
> configured to accept those types.  If those types are reachable from 
> Object/VertexWritable, then they must be registered with TinkerPop's internal 
> shaded Kryo, or else it will choke on them (unless it was explicitly 
> configured to allow unregistered classes).
> I suggest the following change to give users more flexibility in their choice 
> of spark.serializer, and to allow them to reuse TinkerPop's serializers if 
> they choose not to use GryoSerializer: introduce lightweight interfaces that 
> decouple TinkerPop's serialization logic from the exact Kryo shaded/unshaded 
> package doing the work.  TinkerPop's serialization logic would be written 
> against interfaces that replicate a minimal subset of Kryo, and then TP's 
> shaded Kryo or Spark's unshaded Kryo could be plugged in underneath without 
> having to touch the source, recompile any TinkerPop code, or munge bytecode 
> at runtime.
> This would not resolve all of the potential problems/complexity around 
> TinkerPop serialization, but it would make it possible for users to apply the 
> spark.serializer of their choice, switching off GryoSerializer if they so 
> choose.  Users could also continue to use GryoSerializer as they have until 
> now.
> I've already run this through a few iterations locally and have an 
> abstraction that allows me to run with spark.serializer=KryoSerializer, 
> register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's 
> StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot.  I 
> just rebased it on current master.  I'll submit a PR for discussion/review.
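> The decoupling described above can be illustrated with a toy, self-contained
> sketch.  All names here are hypothetical, not TinkerPop's actual shim API:
> the point is only that serializer logic is written against a minimal
> interface, while a concrete backend (shaded or unshaded Kryo in the real
> implementation) is plugged in underneath without recompiling the logic.

```java
public class ShimSketch {
    // Minimal shim: serializer logic sees only this interface, never a
    // concrete (shaded or unshaded) Kryo Output class.  Hypothetical name.
    interface OutputShim {
        void writeString(String s);
    }

    // One pluggable backend.  In TinkerPop this role would be played by a
    // wrapper around shaded Kryo's Output or Spark's unshaded Kryo Output.
    static class BufferOutputShim implements OutputShim {
        final StringBuilder buffer = new StringBuilder();
        public void writeString(String s) { buffer.append(s); }
    }

    // Serializer written purely against the shim interface; it compiles and
    // runs unchanged no matter which backend is supplied.
    static class StarGraphSketchSerializer {
        void write(OutputShim out, String vertexLabel) {
            out.writeString("vertex:" + vertexLabel);
        }
    }

    public static void main(String[] args) {
        BufferOutputShim out = new BufferOutputShim();
        new StarGraphSketchSerializer().write(out, "person");
        System.out.println(out.buffer);  // prints "vertex:person"
    }
}
```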



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
