[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user spmallette commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 I closed manually from our end. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user spmallette commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 All tests pass with `docker/build.sh -t -n -i` VOTE +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 This has been merged. @dalaro Can you please close this PR as the merge happened via the Apache TinkerPop branch sourced from your original PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dkuppitz commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 VOTE: +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 ``` [INFO] Reactor Summary: [INFO] [INFO] Apache TinkerPop ... SUCCESS [ 9.619 s] [INFO] Apache TinkerPop :: Gremlin Shaded . SUCCESS [ 1.894 s] [INFO] Apache TinkerPop :: Gremlin Core ... SUCCESS [ 28.393 s] [INFO] Apache TinkerPop :: Gremlin Test ... SUCCESS [ 9.757 s] [INFO] Apache TinkerPop :: Gremlin Groovy . SUCCESS [ 34.087 s] [INFO] Apache TinkerPop :: Gremlin Groovy Test SUCCESS [ 4.713 s] [INFO] Apache TinkerPop :: TinkerGraph Gremlin SUCCESS [02:35 min] [INFO] Apache TinkerPop :: Gremlin Benchmark .. SUCCESS [ 3.353 s] [INFO] Apache TinkerPop :: Hadoop Gremlin . SUCCESS [04:08 min] [INFO] Apache TinkerPop :: Spark Gremlin .. SUCCESS [12:36 min] [INFO] Apache TinkerPop :: Giraph Gremlin . SUCCESS [ 02:58 h] ``` Neo4j failed cause it couldn't download jars (my internet connection must have died over night). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Just a reminder: we are voting on branch https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321. That subsumes the PR and adds a combination of fixes from Marko and me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 VOTE +1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user spmallette commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Yeah - i know - it's there, but it's proliferating and becoming more and more OK which i think is "bad". i'll bring it up as a separate issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 We use `Random` in numerous places to test various "versions" of a particular code block. It is possible to replace all possible choices with new `GraphProviders` but you are talking about the test suite taking on the order of 20+ hours to complete then. If you have a better idea for testing these facets, then please provide a PR. Note that this PR's use of `Random` is not something "novel" as we use `Random` already in `SparkHadoopGraphProvider` to select `InputFormats`, `OutputFormats`, and traversal sources (deprecated vs. non-deprecated representations). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user spmallette commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 > Marko setup SparkHadoopGraphProvider to randomly choose either the old serialization code or the new stuff in e700363#diff-afffc6796745845d46e6f60ea3a992f9R91. IMHO that should be overrideable to make it deterministic, but it's not a huge deal since it's limited to test code. +1 - i'd prefer to not have "random" in our test cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 These questions involved a lot of back-and-forth that would have been slow to resolve on github comments, so and I discussed the failures and the code on a hangout earlier today. Summary: * Marko created https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321 on the TP repo. This branch is my PR plus additional tweaks Marko and I worked out on top. His branch supersedes this PR branch now. This PR branch **should not be merged because Marko's branch subsumes/extends it**. * I have a rough idea of what changed as we discussed code on the hangout, but I will go through his branch's new commits and review. * The new serialization code is lower-priority than the old code. It has to be turned on through explicit user action, like this is the test provider code: ``` System.setProperty(SHIM_CLASS_SYSTEM_PROPERTY, UnshadedKryoShimService.class.getCanonicalName()); config.put("spark.serializer", KryoSerializer.class.getCanonicalName()); config.put("spark.kryo.registrator", GryoRegistrator.class.getCanonicalName()); ``` * Marko setup SparkHadoopGraphProvider to randomly choose either the old serialization code or the new stuff in https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-afffc6796745845d46e6f60ea3a992f9R91. IMHO that should be overrideable to make it deterministic, but it's not a huge deal since it's limited to test code. * We replaced `GryoClassResolver` with a series of class registrations in https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-2e881d0a6ab7ed1ca7d1f593bceadfd2R207. The idea here is to explicitly configure which types need a detaching serializer rather than relying on `GryoClassResolver` to automagically do it regardless of explicit class registrations. Three factors influence this choice: * Spark's KryoSerializer does not support a user-provided ClassResolver, and working around that limitation entails either getting a change through Spark upstream or copying and pasting a bunch of KryoSerializer's private/hardcoded class registrations, some of which could change in future Spark versions. The latter is ugly and would make the code brittle across Spark upgrades. * `GryoClassResolver` is convenient but somewhat mysterious to third parties, who are likelier to have seen a custom class registration mechanism (like TP's `IoRegistry` or Spark's `spark.kryo.registrator`) than a custom resolver. It also removes vendor control in that its detachment behavior is not configurable, so modifying it requires subclassing or reimplementation. For instance, if a vendor had a lightweight Vertex implementation with a custom serializer and wanted to bypass TP's resolver-driven detachment, then I don't think it would be possible without modifying TP's resolver. * `GryoClassResolver` is written against a bunch of shaded types and relies on `readVarInt`/`writeVarInt`, which were added in Kryo 2.22, one version too late for compatibility with Spark's Kryo 2.21. This is the least important concern of the three, more an implementation detail. * Marko is running the full integration test suite now. He ran the spark ITs just before pushing and reported success on those specifically. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 So, `mvn clean install` is happy w/o the `TravesralInterruptionComputerTest` (commented out). Running full integration tests now... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Okay. So 1 and 2 above are done, however, I had to `OPT_OUT` of `TraversalInterruptionComputerTest` as for some reason, the `GryoPoolKryoShimService` is not respecting interruption exceptions. On the plus side, serialization is completely handled by the shim service which ensures consistency between computer and `ComputerResult` data. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 I think some things need to be fixed up. 1. `TinkerPopKryoRegistrator` should be called `GryoRegistrator`. 2. I think we need a `GryoPoolKryoShimService` in core that non-Spark/Hadoop projects can leverage. ** We need to get `VertexProgramHelper.serializer/deserializer()` to use `KryoShimService`. I'm working on 2 right now and things are moving forward. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Testing this morning, I get two test failures. ``` Tests in error: GroovyProgramTest$Traversals>ProgramTest.g_V_outXcreatedX_aggregateXxX_byXlangX_groupCount_programXTestProgramX_asXaX_selectXa_xX:117 » IllegalState ProgramTest$Traversals>ProgramTest.g_V_outXcreatedX_aggregateXxX_byXlangX_groupCount_programXTestProgramX_asXaX_selectXa_xX:117 » IllegalState ``` Both these tests make use of a user defined `VertexProgram` that stores vertices as `HALTED_TRAVERSERS`. The problem is that the vertices are not being detached. ``` Caused by: java.io.NotSerializableException: org.apache.tinkerpop.gremlin.process.computer.util.ComputerGraph$ComputerVertex Serialization stack: - object not serializable (class: org.apache.tinkerpop.gremlin.process.computer.util.ComputerGraph$ComputerVertex, value: v[3]) - writeObject data (class: java.util.HashMap) - object (class java.util.HashMap, {v[3]=3, v[5]=1}) - field (class: org.apache.tinkerpop.gremlin.process.traversal.traverser.util.AbstractTraverser, name: t, type: class java.lang.Object) - object (class org.apache.tinkerpop.gremlin.process.traversal.traverser.B_LP_O_S_SE_SL_Traverser, {v[3]=3, v[5]=1}) ``` In the past, this would be automatically detached, but now it doesn't seem to be. I'll fiddle a bit to sew whats going on, but just heads up in case its obvious to use @dalaro ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 I saw the `shouldHandleMonthDay` test failure on CI and pushed a fix. I carelessly broke the read leg of that serializer while doing some tedious, boilerplate-y method signature changes to port a batch a serializers to the shim to support IoRegistry unification. The one-liner commit that I just pushed fixed the problem locally. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 I just pushed some changes that I hacked together this weekend. The key additions are: * `TinkerPopKryoRegistrator`, which I extracted from my app, and which acts as a `spark.kryo.registrator` impl that knows about TinkerPop types * `IoRegistryAwareKryoSerializer`, which is a Spark `Serializer` that looks for `GryoPool.CONFIG_IO_REGISTRY` and applies it if present * `KryoShimLoaderService.applyConfiguration(cfg)`, which replaces direct calls to `HadoopPools.initialize(cfg)` and adds equivalent functionality for initializing the unshaded Kryo serializer pool The user would theoretically just set ``` spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.IoRegistryAwareKryoSerializer spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.TinkerPopKryoRegistrator # Optional, only needed for custom types gremlin.io.registry=whatever.user.IoRegistryImpl ``` In practice, when I have a custom gremlin.io.registry, I have always had to take the additional step (long before this PR) of forcibly initializing `HadoopPools` before touching SparkGraphComputer in my app, or else some part of Spark -- I think the closure serializer -- would attempt to use HadooPools via ObjectWritable/VertexWritable before initialization and produce garbage on my custom classes. **This problem predates my PR**. I'm not trying to solve it here, in part because I still don't know if it's a pathology specific to my app or because TinkerPop is missing a crucial `HadoopPools.initialize` (now, equivalently, `KryoShimLoaderService.applyConfiguration`) call somewhere, and in part because HadoopPools is such a hideous architectural wart that the ultimate solution probably involves destroying it. In the past, I've worked around this by defining a custom spark.serializer that delegates newKryo() to a GryoSerializer/IoRegistryAwareSerializer, but which has a constructor that invokes `HadoopPools.initialize`/`KryoShimLoaderService.applyConfiguration` (relying on that method's idempotence). Again, this initialization step just be specific to my app and unnecessary for the average TinkerPop user. It's possible that the config I pasted above will work for others. FWIW, this passes, so the overrides bug should be fixed along with all this refactoring stuff: ``` mvn clean install -DskipTests=true && mvn verify -pl gremlin-server -DskipIntegrationTests=false -Dtest.single=GremlinResultSetIntegrateTest ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Additional notes from Marko: > okay, so as long as there is never a need for MyRegistrar, then can you please update your PR accordingly? > Finally, call it GryoRegistrar (not TinkerPopRegistrar) as this has to do with Gryo IORegistry classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 After discussion on Slack, I think @okram and I tentatively agreed to proceed with this PR after I do additional work to save users who have custom serializers the effort of maintaining separate `IoRegistry` and `spark.kryo.registrator` implementations with near-identical contents. I may discover other complications during the process, but I think this involves two changes: 1. I will attempt to subclass KryoSerializer so that I can access the SparkConf passed to its constructior and check for `GryoPool.CONFIG_IO_REGISTRY` (similar to what GryoSerializer does now), then apply any registrations found therein so long as each registration either: * specifies no explicit serializer (using Kryo's internal default) or * specifies a shim serializer In particular, if the registry contains an old-style TP shaded Kryo serializer that hasn't been ported to the shim, then the KryoSerializer subclass will throw an exception. This change is necessary to support custom-serialized, `IoRegistry`-listed datatypes that appear in most Spark data outside of closures (say, in the RDD itself). 2. I will replace current callsites of `HadoopPools.initialize(conf)` with some other static method that calls `HadoopPools.initialize(conf)` and then calls some roughly equivalent `initialize(conf)` method for the static KryoPool backing `KryoPoolShimService`. This change is necessary to support custom-serialized, `IoRegistry`-listed datatypes that appear in closures. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 @dalaro -- what about it doesn't work? Meaning, is your problem just a "unregistered class" issue? If so, we just register it. If I have a "bad" serializer for one of the registered classes, then we can fix it? See the classes that I've registered in `GryoSerializer` (The last 7 are TinkerPop specific.): ```java builder.addCustom(Tuple2.class, new Tuple2Serializer()) .addCustom(Tuple2[].class) .addCustom(Tuple3.class, new Tuple3Serializer()) .addCustom(Tuple3[].class) .addCustom(CompactBuffer.class, new CompactBufferSerializer()) .addCustom(CompactBuffer[].class) .addCustom(CompressedMapStatus.class) .addCustom(BlockManagerId.class) .addCustom(HighlyCompressedMapStatus.class, new ExternalizableSerializer()) .addCustom(HttpBroadcast.class) .addCustom(PythonBroadcast.class) .addCustom(BoxedUnit.class) .addCustom(Class.forName("scala.reflect.ClassTag$$anon$1"), new JavaSerializer()) .addCustom(Class.forName("scala.reflect.ManifestFactory$$anon$1"), new JavaSerializer()) .addCustom(WrappedArray.ofRef.class, new WrappedArraySerializer()) .addCustom(MessagePayload.class) .addCustom(ViewIncomingPayload.class) .addCustom(ViewOutgoingPayload.class) .addCustom(ViewPayload.class) .addCustom(SerializableConfiguration.class, new JavaSerializer()) .addCustom(VertexWritable.class, new VertexWritableSerializer()) .addCustom(ObjectWritable.class, new ObjectWritableSerializer()) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 The problem is that GryoSerializer doesn't actually work. This is not the first time it has shown behavior on Spark or scala runtime classes which diverges from the behavior of Spark's KryoSerializer/JavaSerializer. I realize that asking users to write custom serializers against the shim (to be compatible with everything) in the long term or to duplicate registrations in the short term (if they want to use Gryo in one place and Spark graph computation in another, using the same custom types) is not ideal, but I think the status quo is even worse. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 So yea... sorta feels like we are where we were before in a way. And yea, all that Input, Output, Kryo stuff is what causes all the problems. So now I'm wondering, why not just register the requisite classes with `GryoSerializer`? Yes, we miss one every now and then, but we add it. Sooner or later they will all be there. Moreover, if a person needs it now, they can just register it with `GryoMapper`. Did you try just registering your "high density map" w/ `GryoSerializer`? I know that last paragraph sucks given all the work you did, but having two ways of doing something and one way being Spark-specific doesn't make me happy and this is why `GryoSerializer` came into being.. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 BTW, in case you're wondering why I think it's hard to create proxies, recall that o.a.t.shaded.kryo.{Kryo,io.Input,io.Output} are all classes, not interfaces. There are a lot of approaches that would make this work, but the ones I can think of suck for various reasons. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 To do that cleanly, users would have to write serializers against the shim interface. Of course, the shim is only relevant to users writing new serializers or willing to refactor their old ones. I see no clean way to make this work with existing serializers that users have hypothetically already compiled against TP's shaded Kryo. To make a user serializer written for TP's shaded Kryo link against Spark's unshaded Kryo is technically possible with some degree of skulduggery, it's just that all of the ways to do it that I'm aware of are unspeakably ugly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 @dalaro > The problem with custom registrations is that GryoMapper allows the user to provide a serializer written against shaded Kryo. This is not compatible with Spark's KryoSerializer. I don't see a way to make it compatible without putting fake org.apache.tinkerpop.shaded.kryo.* classes on the classpath, which could get pretty ugly. Yea, thats a problem and the reason why `GryoSerializer` exists. We don't want to make `SparkGraphComputer` "special." Users register their serializers with `Graph` and all OLTP and OLAP systems take it from there. If, for `SparkGraphComputer`, they ALSO have to register them with Spark, that is no bueno. Is there a way to create "proxies" that map between shaded and unshaded Kryo so that registered `GryoMapper` serializers can be used as unshaded Kryo serializers by Spark? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 @spmallette You're right about overrides being the problem. The last thing I did before opening this PR was to rebase on latest master, when I encountered conflicts solely involving the GryoMapper registration override feature added in d7684757734732f39970265a425b921a48d8f0fb. I attempted to resolve those conflicts too hastily and got it wrong. I went back to Stephen's original commit reworked my code around registration overrides, and I now think it conforms to the old behavior precedent. Specifically, when handling overrides of GryoMapper type registrations, I departed from the precedent set by Stephen's commit in two ways, both of which I've fixed locally: * Override registrations allocated a new registration ID. Your commit used the old ID from the registration being overridden. * As a side effect of the preceding point, I incremented the shared `currentSerializationId` for all registration calls. Your commit incremented this shared counter only for non-override registration calls. The test passes locally for me now. I'm still doing some registrator refactoring and haven't pushed yet though. @okram The problem with custom registrations is that GryoMapper allows the user to provide a serializer written against shaded Kryo. This is not compatible with Spark's KryoSerializer. I don't see a way to make it compatible without putting fake `org.apache.tinkerpop.shaded.kryo.*` classes on the classpath, which could get pretty ugly. Instead, I'm adding a registrator that includes: * all default mappings from GryoMapper (I am changing the shaded-Kryo-serializers to be shim serializers so that they are reusable) * the TinkerPop classes registered in GryoSerializer's constructors (but not the scala and Spark classes registered in there) If the user wants to register custom types while using KryoSerializer, they can extend the registrator in a subclass, then set `spark.kryo.registrator` to the subclass. The interface in question is `KryoRegistrator` and it has only one method, `registerClasses(Kryo)`. It's about as simple -- maybe even less so -- than implementing TP's `IoRegistry` interface, which has two methods. If the user decides to continue using GryoSerializer, they can also continue using `GryoPool.CONFIG_IO_REGISTRY`, which should still affect GryoSerializer/HadoopPools as it always has. WDYT? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 @dalaro -- you mention: ``` spark.serializer=KryoSerializer spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator ``` I think we need: ``` spark.serializer=KryoSerializer spark.kryo.registrator=org.apache.tinkerpop.gremlin.structure.io.gryo.GryoRegistrar ``` `GryoRegistrar` would have all the `GryoMapper` classes registered which, when people do custom registrations (e.g. `Point`, `MyStupidStringImplementation`), it will be in `GryoMapper`. Can you provide something like that in this PR? Hard? @spmallette knows more about how to interact with `GryoMapper` to get our the serializers... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user spmallette commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 I've confirmed that this test: ```text GremlinResultSetIntegrateTest.shouldHandleVertexResultWithLiteSerialization:174 ? Execution ``` is only failing after these changes and not on master. I sense it has something to do with this: ```java private void addOrOverrideRegistration(TypeRegistration newRegistration) ``` not preserving the registration identifier, but i could be wrong. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Thanks for the comments and for running the tests. @okram: 1. I also dislike HadoopPools and would prefer to remove it, but I don't know how to do it correctly, even after considering the problem for a while. Here are my open questions and observations. HadoopPools apparently exists to help Vertex/ObjectWritable implement Serializable. They must implement Serializable because they keep showing up in `spark.closure.serializer`'s workload, and only JavaSerializer can be used as `spark.closure.serializer`. A natural question here: is it reasonable and normal for graph types to appear in these closures in the first place? Spark has this nifty `spark.kryo.registrator` mechanism to support user-registered types with Kryo, and its docs talk at length about setting `spark.serializer`. The docs read as though that was all the user would need to customize to deal with custom data. `spark.closure.serializer` seems to barely be mentioned. I've only seen it in the manual's exhaustive table of config options (http://spark.apache.org/docs/latest/configuration.html). Even then, the docs merely comment that JavaSerializer is its only supported value. There's some old list traffic that says KryoSerializer is unsupported because, back when it was originally tested, it blew up on pretty much any Java closure, though that might have changed since. I wonder if this indicates some kind of usage antipattern. Could there be an alternate approach to traversal/vertexprogram evaluation in SparkGraphComputer/SparkExecutor that avoids putting graph data into Spark closures? That would make the use-case for HadoopPools disappear. Then again, it seems so easy to get into this situation with `rdd.map(x -> whatever(x))`. I am full of uncertainty on this point. I don't have deep enough Spark knowledge to judge whether what we're doing now is reasonable. As long as Object/VertexWritable keep popping up in closures created by SparkGraphComputer & friends, we need some way to serialize them and everything they refer to, which can apparently involve a lot of graph entities. We could theoretically implement Serializable "manually" (without Kryo/Gryo), but that seems like a massive amount of work, potentially inefficient, definitely redundant, and adds a huge new bug surface. I'm against that approach, but that's about the only thing I know for certain here. 2. In the long term, I think users should set ``` spark.serializer=KryoSerializer spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator ``` and TinkerPop should stop registering/caring about Spark and scala-runtime types in its serialization layer. This PR tries to keep GryoSerializer untouched though. That migration doesn't have to be forced. I'm already using the `spark.kryo.registrator` setting in my app, where I have an impl that registers all TinkerPop types plus all of my app's types. I can refactor that by extracting a TinkerPopRegistrator that covers the TP types (but not my app's). I probably should have done that at the start. As it stands, this PR is sort of a "build-your-own-alternative-to-GryoSerializer" kit, but the very last assembly steps involving a `spark.kryo.registrator` impl are left as an exercise to the user. I'll try to introduce a TinkerPopRegistrator to address that, though I may reconsider if supporting custom type registration turns out way more complicated than I'm imagining. I think Gryo more generally is useful and can stay around indefinitely (say, for reading/writing data to the filesystem), but GryoSerializer should eventually stop being the recommended spark.serializer. If that is all GryoSerializer per se exists to do, then it could be removed and any TinkerPop-specific class registrations that it performs moved to TinkerPopRegistrator. 3. This PR should maintain full format compatibility. I deliberately tried to keep all of GryoMapper and GryoSerializer's class registrations, class IDs, and associated serializers (including StarGraphSerializer) functionally unchanged. If you see somewhere that I unintentionally changed the format, that's a bug. @spmallette: re. package organization: definitely makes sense, I'll move kryoshim under io.gryo. Please let me know if I bungled the package hierarchy anywhere else. re. path forward: I think 1 and 2 above touch on both HadoopPools and GryoSerializer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with IN
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user spmallette commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Ran the integration tests and hit a failure after merging this branch onto the latest from master: ```text Results : Tests in error: GremlinResultSetIntegrateTest.shouldHandleVertexResultWithLiteSerialization:174 ? Execution Tests run: 156, Failures: 0, Errors: 1, Skipped: 2 ``` that would come from Gremlin Server i think...you could test with: ```text mvn verify -pl gremlin-server -DskipIntegrationTests=false ``` I don't think that test fails on master, but I suppose that is possible. I will set up my system now to try it out and will see the result in the morning. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user spmallette commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 @dalaro you had some really nice vision on this one. Just an organizational point - I feel like the "kryoshim" package should live inside of the "gryo" package as a child and not be a sibling. The `org.apache.tinkerpop.gremlin.structure.io` package should contain actual IO implementations (like gryo, graphson, graphml) and it seems to me that "kryoshim" only relates to "gryo". Make sense? You didn't go so far as to deprecate anything and perhaps this work isn't quite at the point where we're ready to do that, but it would be nice to know what you see as the path to deprecating and removing some code (say `HadoopPools` and `GryoSerializers`) so that we could keep that knowledge around while it is fresh in your head. Perhaps we could polish all that up for the pending 3.2.1 release. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user okram commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 @dalaro -- your explanation of the problem in JIRA is impeccable. I now I remember why "just extending `SparkSerializer`" was not an option. TinkerPop's Kryo is shaded. Your concept of a shim to abstract both shaded Kryo to be used by unshaded Kryo systems is smart. Looking at the code specifically, unfortunately, its hard to review and understand the full ramifications of what you are providing. I can run the integration test suite and all that, but perhaps you could also answer some questions as well. **Note that I haven't looked at this code in over 6 months so my questions may be a little odd sounding.** 1. I never liked `HadoopPools`. I have a sneaking suspicion that they may not work with custom class registration and I we really should have a test case in there to prove yay or nay. Does your work either make `HadoopPools` "better" or no longer needed? (Also think about `GiraphGraphComputer` which uses `HadoopPools`). 2. Moving forward, would people still use `GryoSerializer` or is it recommended that they use `SparkSerializer`? How would your work effect our docs and does `GryoSerializer` become deprecated? 3. I noticed you moved around the serialization code for `StarGraph`. That is fine, but I'm wondering, did you change the serialization of `StarGraph`? That is, did the `byte[]` format change? If so, thats bad for backwards compatibility. Finally, your PR is beautiful. You JavaDoc like a king and you prove your worth as emperor with `package-info.java`. Thank you for your efforts in a part of the codebase that I wrote with a looming lack of confidence. You have brought a level of Java skill/technique to a long standing problem that is inspiring. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---