[ https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313532#comment-15313532 ]

ASF GitHub Bot commented on TINKERPOP-1321:
-------------------------------------------

Github user dalaro commented on the issue:

    https://github.com/apache/incubator-tinkerpop/pull/325
  
    Thanks for the comments and for running the tests.
    
    @okram:
    
    1. I also dislike HadoopPools and would prefer to remove it, but I don't 
know how to do it correctly, even after considering the problem for a while.  
Here are my open questions and observations.
    
       HadoopPools apparently exists to help Vertex/ObjectWritable implement 
Serializable.  They must implement Serializable because they keep showing up 
in `spark.closure.serializer`'s workload, and only JavaSerializer can be used 
as `spark.closure.serializer`.  (A minimal sketch of this pattern appears at 
the end of this item.)  A natural question here: is it reasonable and normal 
for graph types to appear in these closures in the first place?
    
       Spark has this nifty `spark.kryo.registrator` mechanism to support 
user-registered types with Kryo, and its docs talk at length about setting 
`spark.serializer`.  The docs read as though that were all a user would need 
to customize to deal with custom data.  `spark.closure.serializer` barely 
seems to be mentioned; I've only seen it in the manual's exhaustive table of 
config options (http://spark.apache.org/docs/latest/configuration.html), and 
even there the docs merely note that JavaSerializer is its only supported 
value.  There's some old list traffic saying KryoSerializer is unsupported as 
a closure serializer because, back when it was originally tested, it blew up 
on pretty much any Java closure, though that might have changed since.
    
       I wonder if this indicates some kind of usage antipattern.  Could 
there be an alternate approach to traversal/vertex-program evaluation in 
SparkGraphComputer/SparkExecutor that avoids putting graph data into Spark 
closures?  That would make the use case for HadoopPools disappear.  Then 
again, it seems so easy to get into this situation with 
`rdd.map(x -> whatever(x))`, as the toy example below shows.  I am genuinely 
uncertain on this point; I don't have deep enough Spark knowledge to judge 
whether what we're doing now is reasonable.
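    
       A toy illustration (hypothetical stand-in types, not from the PR) of 
how an innocent-looking lambda drags captured objects through 
`spark.closure.serializer`:
    
    ```java
    import java.io.Serializable;
    
    import org.apache.spark.api.java.JavaRDD;
    
    public class ClosureCaptureExample {
        // Hypothetical stand-ins for TinkerPop types reachable from a closure.
        interface MyVertex extends Serializable { String label(); }
        interface MyFilter extends Serializable { boolean test(MyVertex v); }
    
        static JavaRDD<String> labels(final JavaRDD<MyVertex> vertices,
                                      final MyFilter filter) {
            // The lambda captures `filter`, so Spark ships it to executors via
            // spark.closure.serializer (JavaSerializer is the only supported
            // value).  Everything reachable from `filter` must therefore be
            // java.io.Serializable, no matter what spark.serializer is set to.
            return vertices.filter(v -> filter.test(v))
                           .map(MyVertex::label);
        }
    }
    ```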
    
       As long as Object/VertexWritable keep popping up in closures created 
by SparkGraphComputer & friends, we need some way to serialize them and 
everything they refer to, which can apparently involve a lot of graph 
entities.  We could theoretically implement Serializable "manually" (without 
Kryo/Gryo), but that seems like a massive amount of work: potentially 
inefficient, definitely redundant, and a huge new bug surface.  I'm against 
that approach, but that's about the only thing I know for certain here.
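    
       For concreteness, the Serializable-delegates-to-Kryo pattern that 
HadoopPools exists to support looks roughly like this.  This is a simplified 
sketch with hypothetical names; the real code borrows Kryo instances from a 
pool because Kryo is not thread-safe.
    
    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    
    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    
    public class KryoBackedWritable implements Serializable {
    
        private transient Object payload;  // e.g. a star graph or traverser
    
        // Stand-in for borrowing a pooled, preconfigured Kryo (HadoopPools' job).
        private static Kryo kryo() { return new Kryo(); }
    
        private void writeObject(final ObjectOutputStream out) throws IOException {
            // Serialize the payload with Kryo into a byte array, then embed
            // that array in the regular Java serialization stream.
            final ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (Output output = new Output(baos)) {
                kryo().writeClassAndObject(output, payload);
            }
            final byte[] bytes = baos.toByteArray();
            out.writeInt(bytes.length);
            out.write(bytes);
        }
    
        private void readObject(final ObjectInputStream in) throws IOException {
            final byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            try (Input input = new Input(bytes)) {
                payload = kryo().readClassAndObject(input);
            }
        }
    }
    ```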
    
    2. In the long term, I think users should set
    
       ```
       spark.serializer=KryoSerializer
       spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator
       ```
    
       and TinkerPop should stop registering/caring about Spark and 
scala-runtime types in its serialization layer.  This PR tries to keep 
GryoSerializer untouched, though; that migration doesn't have to be forced.
    
      I'm already using the `spark.kryo.registrator` setting in my app, where 
I have an impl that registers all TinkerPop types plus all of my app's types. 
I can refactor that by extracting a TinkerPopRegistrator that covers the TP 
types (but not my app's); I probably should have done that at the start.  As 
it stands, this PR is sort of a "build-your-own-alternative-to-GryoSerializer" 
kit where the very last assembly step, a `spark.kryo.registrator` impl, is 
left as an exercise to the user.  I'll try to introduce a TinkerPopRegistrator 
to address that (rough sketch at the end of this item), though I may 
reconsider if supporting custom type registration turns out to be far more 
complicated than I'm imagining.
    
      I think Gryo is useful more generally and can stay around indefinitely 
(say, for reading/writing data on the filesystem), but GryoSerializer should 
eventually stop being the recommended spark.serializer.  If serving as 
spark.serializer is all GryoSerializer exists to do, then it could eventually 
be removed, with any TinkerPop-specific class registrations it performs moved 
to TinkerPopRegistrator.
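    
      Such a registrator would plug into Spark's real `spark.kryo.registrator` 
extension point, roughly like this (sketch only: the class doesn't exist yet, 
and a real version would mirror GryoMapper's full registration list):
    
    ```java
    import com.esotericsoftware.kryo.Kryo;
    
    import org.apache.spark.serializer.KryoRegistrator;
    import org.apache.tinkerpop.gremlin.structure.util.star.StarGraph;
    
    public class TinkerPopRegistrator implements KryoRegistrator {
        @Override
        public void registerClasses(final Kryo kryo) {
            // Register TinkerPop types with Spark's unshaded Kryo.  A real
            // implementation would walk GryoMapper's registration list and
            // adapt serializers such as StarGraphSerializer through the
            // kryoshim interfaces so they run against unshaded Kryo.
            kryo.register(StarGraph.class);
            // ... many more registrations elided ...
        }
    }
    ```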
    
    3.  This PR should maintain full format compatibility.  I deliberately 
tried to keep all of GryoMapper and GryoSerializer's class registrations, class 
IDs, and associated serializers (including StarGraphSerializer) functionally 
unchanged.  If you see somewhere that I unintentionally changed the format, 
that's a bug.
    
    @spmallette:
    
    re. package organization: definitely makes sense; I'll move kryoshim under 
io.gryo.  Please let me know if I bungled the package hierarchy anywhere else.
    
    re. path forward: I think 1 and 2 above touch on both HadoopPools and 
GryoSerializer.


> Loosen coupling between TinkerPop serialization logic and shaded Kryo
> ---------------------------------------------------------------------
>
>                 Key: TINKERPOP-1321
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1321
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 3.2.0-incubating
>            Reporter: Dan LaRocque
>             Fix For: 3.2.1
>
>
> When running jobs on Spark, TinkerPop currently recommends setting 
> spark.serializer=GryoSerializer.  This makes GryoSerializer responsible for 
> serializing not just TinkerPop types but also scala runtime types and Spark 
> internals.  GryoSerializer doesn't extend either of the two serializers 
> provided by Spark.  It effectively assumes responsibility for reimplementing 
> them.
> This is problematic.  It is not totally trivial to replicate the 
> functionality of Spark's standard serializers.  It is also not easy to 
> empirically test all meaningful cases.  For instance, there is a conditional 
> within Spark that selects between two concrete Map implementations depending 
> on whether the current RDD partition count exceeds 2k 
> (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53).
>   The implementation used below this threshold serializes fine on 
> GryoSerializer.  The implementation used above the threshold does not.  Above 
> the partition threshold, I've started getting 
> {{org.apache.spark.SparkException: Job aborted due to stage failure: 
> Exception while getting task result: 
> org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed 
> to find one of the right cookies.}}  Google leads to 
> https://github.com/RoaringBitmap/RoaringBitmap/issues/64.  However, just 
> switching to Spark's {{KryoSerializer}} without changing anything somehow 
> fixes the problem in my environment, implying that Spark has done something 
> to address this problem that may not be fully replicated in TinkerPop.
> However, "just switching to Spark's KryoSerializer" is not a great approach.  
> For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph 
> serializer, and Spark traversals can produce a lot of little ego-StarGraphs.  
> These still serialize, but KryoSerializer uses its default behavior 
> (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's 
> StarGraphSerializer.  TinkerPop's reliance on its own internal shaded Kryo 
> means that its serializers cannot be registered with Spark's unshaded Kryo.
> More concerning, it's impossible to completely switch to KryoSerializer just 
> by tweaking the configuration.  Besides spark.serializer, there is also a 
> setting spark.closure.serializer for which the only supported value is 
> JavaSerializer.  Key TP classes that make it into the object reference graphs 
> of Spark closures implement Serializable by resorting to TinkerPop's shaded 
> Kryo via HadoopPools (looking at Object/VertexWritable).  This leads to 
> surprises with custom property data types.  It doesn't matter if those types 
> implement Serializable, and it doesn't matter if Spark's KryoSerializer is 
> configured to accept those types.  If those types are reachable from 
> Object/VertexWritable, then they must be registered with TinkerPop's internal 
> shaded Kryo, or else it will choke on them (unless it was explicitly 
> configured to allow unregistered classes).
> I suggest the following change to give users more flexibility in their choice 
> of spark.serializer, and to allow them to reuse TinkerPop's serializers if 
> they choose not to use GryoSerializer: introduce lightweight interfaces that 
> decouple TinkerPop's serialization logic from the exact Kryo package (shaded 
> or unshaded) doing the work.  TinkerPop's serialization logic would be 
> written against interfaces that replicate a minimal subset of Kryo, and then 
> TP's shaded Kryo or Spark's unshaded Kryo could be plugged in underneath 
> without having to touch the source, recompile any TinkerPop code, or munge 
> bytecode at runtime.  (A rough sketch of such interfaces follows this quoted 
> description.)
> This would not resolve all of the potential problems/complexity around 
> TinkerPop serialization, but it would make it possible for users to apply the 
> spark.serializer of their choice, switching off GryoSerializer if they so 
> choose.  Users could also continue to use GryoSerializer as they have until 
> now.
> I've already run this through a few iterations locally and have an 
> abstraction that allows me to run with spark.serializer=KryoSerializer, 
> register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's 
> StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot.  I 
> just rebased it on current master.  I'll submit a PR for discussion/review.
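
For concreteness, the lightweight shim interfaces proposed in the description 
above might look roughly like the following sketch (hypothetical names, a 
tiny subset of Kryo's surface; each interface would live in its own file):

```java
// Sketch of minimal shim interfaces that decouple serializer logic from any
// particular Kryo package (shaded or unshaded).  Names are illustrative.

interface OutputShim {
    void writeByte(byte b);
    void writeLong(long l);
    void writeString(String s);
    void flush();
}

interface InputShim {
    byte readByte();
    long readLong();
    String readString();
}

interface KryoShim<I extends InputShim, O extends OutputShim> {
    <T> T readObject(I input, Class<T> type);
    void writeObject(O output, Object object);
}

// Serializer logic written once against the shims; a shaded-Kryo or
// unshaded-Kryo adapter is plugged in underneath at runtime.
interface SerializerShim<T> {
    <O extends OutputShim> void write(KryoShim<?, O> kryo, O output, T object);
    <I extends InputShim> T read(KryoShim<I, ?> kryo, I input, Class<T> clazz);
}
```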



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
