[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-07 Thread spmallette
Github user spmallette commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
I closed manually from our end.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-07 Thread spmallette
Github user spmallette commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
All tests pass with `docker/build.sh -t -n -i`

VOTE +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-07 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
This has been merged. @dalaro Can you please close this PR as the merge 
happened via the Apache TinkerPop branch sourced from your original PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-07 Thread dkuppitz
Github user dkuppitz commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
VOTE: +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-07 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
```
[INFO] Reactor Summary:
[INFO]
[INFO] Apache TinkerPop ... SUCCESS [  
9.619 s]
[INFO] Apache TinkerPop :: Gremlin Shaded . SUCCESS [  
1.894 s]
[INFO] Apache TinkerPop :: Gremlin Core ... SUCCESS [ 
28.393 s]
[INFO] Apache TinkerPop :: Gremlin Test ... SUCCESS [  
9.757 s]
[INFO] Apache TinkerPop :: Gremlin Groovy . SUCCESS [ 
34.087 s]
[INFO] Apache TinkerPop :: Gremlin Groovy Test  SUCCESS [  
4.713 s]
[INFO] Apache TinkerPop :: TinkerGraph Gremlin  SUCCESS [02:35 
min]
[INFO] Apache TinkerPop :: Gremlin Benchmark .. SUCCESS [  
3.353 s]
[INFO] Apache TinkerPop :: Hadoop Gremlin . SUCCESS [04:08 
min]
[INFO] Apache TinkerPop :: Spark Gremlin .. SUCCESS [12:36 
min]
[INFO] Apache TinkerPop :: Giraph Gremlin . SUCCESS [  
02:58 h]
```

Neo4j failed cause it couldn't download jars (my internet connection must 
have died over night).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Just a reminder: we are voting on branch 
https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321.  That 
subsumes the PR and adds a combination of fixes from Marko and me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
VOTE +1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread spmallette
Github user spmallette commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Yeah - i know - it's there, but it's proliferating and becoming more and 
more OK which i think is "bad".  i'll bring it up as a separate issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
We use `Random` in numerous places to test various "versions" of a 
particular code block. It is possible to replace all possible choices with new 
`GraphProviders` but you are talking about the test suite taking on the order 
of 20+ hours to complete then. If you have a better idea for testing these 
facets, then please provide a PR. Note that this PR's use of `Random` is not 
something "novel" as we use `Random` already in `SparkHadoopGraphProvider` to 
select `InputFormats`, `OutputFormats`, and traversal sources (deprecated vs. 
non-deprecated representations).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread spmallette
Github user spmallette commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
> Marko setup SparkHadoopGraphProvider to randomly choose either the old 
serialization code or the new stuff in 
e700363#diff-afffc6796745845d46e6f60ea3a992f9R91. IMHO that should be 
overrideable to make it deterministic, but it's not a huge deal since it's 
limited to test code.

+1 - i'd prefer to not have "random" in our test cases. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
These questions involved a lot of back-and-forth that would have been slow 
to resolve on github comments, so and I discussed the failures and the code on 
a hangout earlier today.

Summary:

* Marko created 
https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321 on the TP 
repo.  This branch is my PR plus additional tweaks Marko and I worked out on 
top.  His branch supersedes this PR branch now.  This PR branch **should not be 
merged because Marko's branch subsumes/extends it**.
* I have a rough idea of what changed as we discussed code on the hangout, 
but I will go through his branch's new commits and review.
* The new serialization code is lower-priority than the old code.  It has 
to be turned on through explicit user action, like this is the test provider 
code:
```
System.setProperty(SHIM_CLASS_SYSTEM_PROPERTY, 
UnshadedKryoShimService.class.getCanonicalName());
config.put("spark.serializer", 
KryoSerializer.class.getCanonicalName());
config.put("spark.kryo.registrator", 
GryoRegistrator.class.getCanonicalName());
```
* Marko setup SparkHadoopGraphProvider to randomly choose either the old 
serialization code or the new stuff in 
https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-afffc6796745845d46e6f60ea3a992f9R91.
  IMHO that should be overrideable to make it deterministic, but it's not a 
huge deal since it's limited to test code.
* We replaced `GryoClassResolver` with a series of class registrations in 
https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-2e881d0a6ab7ed1ca7d1f593bceadfd2R207.
  The idea here is to explicitly configure which types need a detaching 
serializer rather than relying on `GryoClassResolver` to automagically do it 
regardless of explicit class registrations.  Three factors influence this 
choice:
   * Spark's KryoSerializer does not support a user-provided ClassResolver, 
and working around that limitation entails either getting a change through 
Spark upstream or copying and pasting a bunch of KryoSerializer's 
private/hardcoded class registrations, some of which could change in future 
Spark versions.  The latter is ugly and would make the code brittle across 
Spark upgrades.
   * `GryoClassResolver` is convenient but somewhat mysterious to third 
parties, who are likelier to have seen a custom class registration mechanism 
(like TP's `IoRegistry` or Spark's `spark.kryo.registrator`) than a custom 
resolver.  It also removes vendor control in that its detachment behavior is 
not configurable, so modifying it requires subclassing or reimplementation. For 
instance, if a vendor had a lightweight Vertex implementation with a custom 
serializer and wanted to bypass TP's resolver-driven detachment, then I don't 
think it would be possible without modifying TP's resolver.
   * `GryoClassResolver` is written against a bunch of shaded types and 
relies on `readVarInt`/`writeVarInt`, which were added in Kryo 2.22, one 
version too late for compatibility with Spark's Kryo 2.21.  This is the least 
important concern of the three, more an implementation detail.
* Marko is running the full integration test suite now.  He ran the spark 
ITs just before pushing and reported success on those specifically.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
So, `mvn clean install` is happy w/o the 
`TravesralInterruptionComputerTest` (commented out). Running full integration 
tests now...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Okay. So 1 and 2 above are done, however, I had to `OPT_OUT` of 
`TraversalInterruptionComputerTest` as for some reason, the 
`GryoPoolKryoShimService` is not respecting interruption exceptions. On the 
plus side, serialization is completely handled by the shim service which 
ensures consistency between computer and `ComputerResult` data.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
I think some things need to be fixed up.

1. `TinkerPopKryoRegistrator` should be called `GryoRegistrator`.
2. I think we need a `GryoPoolKryoShimService` in core that 
non-Spark/Hadoop projects can leverage.
** We need to get `VertexProgramHelper.serializer/deserializer()` to use 
`KryoShimService`.

I'm working on 2 right now and things are moving forward.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Testing this morning, I get two test failures.

```
Tests in error:
  
GroovyProgramTest$Traversals>ProgramTest.g_V_outXcreatedX_aggregateXxX_byXlangX_groupCount_programXTestProgramX_asXaX_selectXa_xX:117
 » IllegalState
  
ProgramTest$Traversals>ProgramTest.g_V_outXcreatedX_aggregateXxX_byXlangX_groupCount_programXTestProgramX_asXaX_selectXa_xX:117
 » IllegalState
```

Both these tests make use of a user defined `VertexProgram` that stores 
vertices as `HALTED_TRAVERSERS`. The problem is that the vertices are not being 
detached.

```
Caused by: java.io.NotSerializableException: 
org.apache.tinkerpop.gremlin.process.computer.util.ComputerGraph$ComputerVertex
Serialization stack:
- object not serializable (class: 
org.apache.tinkerpop.gremlin.process.computer.util.ComputerGraph$ComputerVertex,
 value: v[3])
- writeObject data (class: java.util.HashMap)
- object (class java.util.HashMap, {v[3]=3, v[5]=1})
- field (class: 
org.apache.tinkerpop.gremlin.process.traversal.traverser.util.AbstractTraverser,
 name: t, type: class java.lang.Object)
- object (class 
org.apache.tinkerpop.gremlin.process.traversal.traverser.B_LP_O_S_SE_SL_Traverser,
 {v[3]=3, v[5]=1})
```

In the past, this would be automatically detached, but now it doesn't seem 
to be. I'll fiddle a bit to sew whats going on, but just heads up in case its 
obvious to use @dalaro ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
I saw the `shouldHandleMonthDay` test failure on CI and pushed a fix.  I 
carelessly broke the read leg of that serializer while doing some tedious, 
boilerplate-y method signature changes to port a batch a serializers to the 
shim to support IoRegistry unification.  The one-liner commit that I just 
pushed fixed the problem locally.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
I just pushed some changes that I hacked together this weekend.  The key 
additions are:

* `TinkerPopKryoRegistrator`, which I extracted from my app, and which acts 
as a `spark.kryo.registrator` impl that knows about TinkerPop types
* `IoRegistryAwareKryoSerializer`, which is a Spark `Serializer` that looks 
for `GryoPool.CONFIG_IO_REGISTRY` and applies it if present
* `KryoShimLoaderService.applyConfiguration(cfg)`, which replaces direct 
calls to `HadoopPools.initialize(cfg)` and adds equivalent functionality for 
initializing the unshaded Kryo serializer pool

The user would theoretically just set

```

spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.IoRegistryAwareKryoSerializer

spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.TinkerPopKryoRegistrator
# Optional, only needed for custom types
gremlin.io.registry=whatever.user.IoRegistryImpl
```

In practice, when I have a custom gremlin.io.registry, I have always had to 
take the additional step (long before this PR) of forcibly initializing 
`HadoopPools` before touching SparkGraphComputer in my app, or else some part 
of Spark -- I think the closure serializer -- would attempt to use HadooPools 
via ObjectWritable/VertexWritable before initialization and produce garbage on 
my custom classes.  **This problem predates my PR**. I'm not trying to solve it 
here, in part because I still don't know if it's a pathology specific to my app 
or because TinkerPop is missing a crucial `HadoopPools.initialize` (now, 
equivalently, `KryoShimLoaderService.applyConfiguration`) call somewhere, and 
in part because HadoopPools is such a hideous architectural wart that the 
ultimate solution probably involves destroying it.

In the past, I've worked around this by defining a custom spark.serializer 
that delegates newKryo() to a GryoSerializer/IoRegistryAwareSerializer, but 
which has a constructor that invokes 
`HadoopPools.initialize`/`KryoShimLoaderService.applyConfiguration` (relying on 
that method's idempotence).

Again, this initialization step just be specific to my app and unnecessary 
for the average TinkerPop user.  It's possible that the config I pasted above 
will work for others.

FWIW, this passes, so the overrides bug should be fixed along with all this 
refactoring stuff:

```
mvn clean install -DskipTests=true && mvn verify -pl gremlin-server 
-DskipIntegrationTests=false -Dtest.single=GremlinResultSetIntegrateTest
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Additional notes from Marko:

> okay, so as long as there is never a need for MyRegistrar, then can you 
please update your PR accordingly?
> Finally, call it GryoRegistrar (not TinkerPopRegistrar) as this has to do 
with Gryo IORegistry classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
After discussion on Slack, I think @okram and I tentatively agreed to 
proceed with this PR after I do additional work to save users who have custom 
serializers the effort of maintaining separate `IoRegistry` and 
`spark.kryo.registrator` implementations with near-identical contents.

I may discover other complications during the process, but I think this 
involves two changes:

1. I will attempt to subclass KryoSerializer so that I can access the 
SparkConf passed to its constructior and check for 
`GryoPool.CONFIG_IO_REGISTRY` (similar to what GryoSerializer does now), then 
apply any registrations found therein so long as each registration either:

   * specifies no explicit serializer (using Kryo's internal default) or
   * specifies a shim serializer

   In particular, if the registry contains an old-style TP shaded Kryo 
serializer that hasn't been ported to the shim, then the KryoSerializer 
subclass will throw an exception.

   This change is necessary to support custom-serialized, 
`IoRegistry`-listed datatypes that appear in most Spark data outside of 
closures (say, in the RDD itself).

2. I will replace current callsites of `HadoopPools.initialize(conf)` with 
some other static method that calls `HadoopPools.initialize(conf)` and then 
calls some roughly equivalent `initialize(conf)` method for the static KryoPool 
backing `KryoPoolShimService`.

   This change is necessary to support custom-serialized, 
`IoRegistry`-listed datatypes that appear in closures.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
@dalaro -- what about it doesn't work? Meaning, is your problem just a 
"unregistered class" issue? If so, we just register it. If I have a "bad" 
serializer for one of the registered classes, then we can fix it?

See the classes that I've registered in `GryoSerializer` (The last 7 are 
TinkerPop specific.):

```java
builder.addCustom(Tuple2.class, new 
Tuple2Serializer())
.addCustom(Tuple2[].class)
.addCustom(Tuple3.class, new 
Tuple3Serializer())
.addCustom(Tuple3[].class)
.addCustom(CompactBuffer.class, new 
CompactBufferSerializer())
.addCustom(CompactBuffer[].class)
.addCustom(CompressedMapStatus.class)
.addCustom(BlockManagerId.class)
.addCustom(HighlyCompressedMapStatus.class, 
new ExternalizableSerializer()) 
.addCustom(HttpBroadcast.class)
.addCustom(PythonBroadcast.class)
.addCustom(BoxedUnit.class)

.addCustom(Class.forName("scala.reflect.ClassTag$$anon$1"), new 
JavaSerializer())

.addCustom(Class.forName("scala.reflect.ManifestFactory$$anon$1"), new 
JavaSerializer())
.addCustom(WrappedArray.ofRef.class, new 
WrappedArraySerializer())
.addCustom(MessagePayload.class)
.addCustom(ViewIncomingPayload.class)
.addCustom(ViewOutgoingPayload.class)
.addCustom(ViewPayload.class)
.addCustom(SerializableConfiguration.class, 
new JavaSerializer())
.addCustom(VertexWritable.class, new 
VertexWritableSerializer())
.addCustom(ObjectWritable.class, new 
ObjectWritableSerializer())
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
The problem is that GryoSerializer doesn't actually work.  This is not the 
first time it has shown behavior on Spark or scala runtime classes which 
diverges from the behavior of Spark's KryoSerializer/JavaSerializer.

I realize that asking users to write custom serializers against the shim 
(to be compatible with everything) in the long term or to duplicate 
registrations in the short term (if they want to use Gryo in one place and 
Spark graph computation in another, using the same custom types) is not ideal, 
but I think the status quo is even worse.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
So yea... sorta feels like we are where we were before in a way. And yea, 
all that Input, Output, Kryo stuff is what causes all the problems.

So now I'm wondering, why not just register the requisite classes with 
`GryoSerializer`? Yes, we miss one every now and then, but we add it. Sooner or 
later they will all be there. Moreover, if a person needs it now, they can just 
register it with `GryoMapper`. Did you try just registering your "high density 
map" w/ `GryoSerializer`?

I know that last paragraph sucks given all the work you did, but having two 
ways of doing something and one way being Spark-specific doesn't make me happy 
and this is why `GryoSerializer` came into being.. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
BTW, in case you're wondering why I think it's hard to create proxies, 
recall that o.a.t.shaded.kryo.{Kryo,io.Input,io.Output} are all classes, not 
interfaces.  There are a lot of approaches that would make this work, but the 
ones I can think of suck for various reasons.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
To do that cleanly, users would have to write serializers against the shim 
interface.

Of course, the shim is only relevant to users writing new serializers or 
willing to refactor their old ones.

I see no clean way to make this work with existing serializers that users 
have hypothetically already compiled against TP's shaded Kryo.  To make a user 
serializer written for TP's shaded Kryo link against Spark's unshaded Kryo is 
technically possible with some degree of skulduggery, it's just that all of the 
ways to do it that I'm aware of are unspeakably ugly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
@dalaro

> The problem with custom registrations is that GryoMapper allows the user 
to provide a serializer written against shaded Kryo. This is not compatible 
with Spark's KryoSerializer. I don't see a way to make it compatible without 
putting fake org.apache.tinkerpop.shaded.kryo.* classes on the classpath, which 
could get pretty ugly.

Yea, thats a problem and the reason why `GryoSerializer` exists. We don't 
want to make `SparkGraphComputer` "special." Users register their serializers 
with `Graph` and all OLTP and OLAP systems take it from there. If, for 
`SparkGraphComputer`, they ALSO have to register them with Spark, that is no 
bueno. Is there a way to create "proxies" that map between shaded and unshaded 
Kryo so that registered `GryoMapper` serializers can be used as unshaded Kryo 
serializers by Spark?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
@spmallette You're right about overrides being the problem.

The last thing I did before opening this PR was to rebase on latest master, 
when I encountered conflicts solely involving the GryoMapper registration 
override feature added in d7684757734732f39970265a425b921a48d8f0fb.  I 
attempted to resolve those conflicts too hastily and got it wrong.

I went back to Stephen's original commit reworked my code around 
registration overrides, and I now think it conforms to the old behavior 
precedent.  Specifically, when handling overrides of GryoMapper type 
registrations, I departed from the precedent set by Stephen's commit in two 
ways, both of which I've fixed locally:

* Override registrations allocated a new registration ID.  Your commit used 
the old ID from the registration being overridden.
* As a side effect of the preceding point, I incremented the shared 
`currentSerializationId` for all registration calls.  Your commit incremented 
this shared counter only for non-override registration calls.

The test passes locally for me now.  I'm still doing some registrator 
refactoring and haven't pushed yet though.

@okram The problem with custom registrations is that GryoMapper allows the 
user to provide a serializer written against shaded Kryo.  This is not 
compatible with Spark's KryoSerializer.  I don't see a way to make it 
compatible without putting fake `org.apache.tinkerpop.shaded.kryo.*` classes on 
the classpath, which could get pretty ugly.

Instead, I'm adding a registrator that includes:

* all default mappings from GryoMapper (I am changing the 
shaded-Kryo-serializers to be shim serializers so that they are reusable)
* the TinkerPop classes registered in GryoSerializer's constructors (but 
not the scala and Spark classes registered in there)

If the user wants to register custom types while using KryoSerializer, they 
can extend the registrator in a subclass, then set `spark.kryo.registrator` to 
the subclass.  The interface in question is `KryoRegistrator` and it has only 
one method, `registerClasses(Kryo)`.  It's about as simple -- maybe even less 
so -- than implementing TP's `IoRegistry` interface, which has two methods.

If the user decides to continue using GryoSerializer, they can also 
continue using `GryoPool.CONFIG_IO_REGISTRY`, which should still affect 
GryoSerializer/HadoopPools as it always has.

WDYT?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
@dalaro -- you mention:

```
spark.serializer=KryoSerializer
spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator
```

I think we need:

```
spark.serializer=KryoSerializer

spark.kryo.registrator=org.apache.tinkerpop.gremlin.structure.io.gryo.GryoRegistrar
```

`GryoRegistrar` would have all the `GryoMapper` classes registered which, 
when people do custom registrations (e.g. `Point`, 
`MyStupidStringImplementation`), it will be in `GryoMapper`. Can you provide 
something like that in this PR? Hard? @spmallette knows more about how to 
interact with `GryoMapper` to get our the serializers...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread spmallette
Github user spmallette commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
I've confirmed that this test:

```text

GremlinResultSetIntegrateTest.shouldHandleVertexResultWithLiteSerialization:174 
? Execution
```

is only failing after these changes and not on master. I sense it has 
something to do with this:

```java
private  void addOrOverrideRegistration(TypeRegistration 
newRegistration) 
```

not preserving the registration identifier, but i could be wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-02 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Thanks for the comments and for running the tests.

@okram:

1. I also dislike HadoopPools and would prefer to remove it, but I don't 
know how to do it correctly, even after considering the problem for a while.  
Here are my open questions and observations.

   HadoopPools apparently exists to help Vertex/ObjectWritable implement 
Serializable.  They must implement Serializable because they keep showing up in 
`spark.closure.serializer`'s workload, and only JavaSerializer can be used as 
`spark.closure.serializer`.  A natural question here: is it reasonable and 
normal for graph types to appear in these closures in the first place?

   Spark has this nifty `spark.kryo.registrator` mechanism to support 
user-registered types with Kryo, and its docs talk at length about setting 
`spark.serializer`.   The docs read as though that was all the user would need 
to customize to deal with custom data.  `spark.closure.serializer` seems to 
barely be mentioned.  I've only seen it in the manual's exhaustive table of 
config options (http://spark.apache.org/docs/latest/configuration.html).  Even 
then, the docs merely comment that JavaSerializer is its only supported value.  
There's some old list traffic that says KryoSerializer is unsupported because, 
back when it was originally tested, it blew up on pretty much any Java closure, 
though that might have changed since.

   I wonder if this indicates some kind of usage antipattern.  Could there 
be an alternate approach to traversal/vertexprogram evaluation in 
SparkGraphComputer/SparkExecutor that avoids putting graph data into Spark 
closures?  That would make the use-case for HadoopPools disappear.  Then again, 
it seems so easy to get into this situation with `rdd.map(x -> whatever(x))`.  
I am full of uncertainty on this point.  I don't have deep enough Spark 
knowledge to judge whether what we're doing now is reasonable.

   As long as Object/VertexWritable keep popping up in closures created by 
SparkGraphComputer & friends, we need some way to serialize them and everything 
they refer to, which can apparently involve a lot of graph entities.  We could 
theoretically implement Serializable "manually" (without Kryo/Gryo), but that 
seems like a massive amount of work, potentially inefficient, definitely 
redundant, and adds a huge new bug surface.  I'm against that approach, but 
that's about the only thing I know for certain here.

2. In the long term, I think users should set

   ```
spark.serializer=KryoSerializer
spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator
```
   and TinkerPop should stop registering/caring about Spark and 
scala-runtime types in its serialization layer.  This PR tries to keep 
GryoSerializer untouched though.  That migration doesn't have to be forced.

  I'm already using the `spark.kryo.registrator` setting in my app, where I 
have an impl that registers all TinkerPop types plus all of my app's types.  I 
can refactor that by extracting a TinkerPopRegistrator that covers the TP types 
(but not my app's).  I probably should have done that at the start.  As it 
stands, this PR is sort of a "build-your-own-alternative-to-GryoSerializer" 
kit, but the very last assembly steps involving a `spark.kryo.registrator` impl 
are left as an exercise to the user.  I'll try to introduce a 
TinkerPopRegistrator to address that, though I may reconsider if supporting 
custom type registration turns out way more complicated than I'm imagining.

  I think Gryo more generally is useful and can stay around indefinitely 
(say, for reading/writing data to the filesystem), but GryoSerializer should 
eventually stop being the recommended spark.serializer.  If that is all 
GryoSerializer per se exists to do, then it could be removed and any 
TinkerPop-specific class registrations that it performs moved to 
TinkerPopRegistrator.

3.  This PR should maintain full format compatibility.  I deliberately 
tried to keep all of GryoMapper and GryoSerializer's class registrations, class 
IDs, and associated serializers (including StarGraphSerializer) functionally 
unchanged.  If you see somewhere that I unintentionally changed the format, 
that's a bug.

@spmallette:

re. package organization: definitely makes sense, I'll move kryoshim under 
io.gryo.  Please let me know if I bungled the package hierarchy anywhere else.

re. path forward: I think 1 and 2 above touch on both HadoopPools and 
GryoSerializer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with IN

[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-02 Thread spmallette
Github user spmallette commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Ran the integration tests and hit a failure after merging this branch onto 
the latest from master:

```text
Results :

Tests in error: 
  
GremlinResultSetIntegrateTest.shouldHandleVertexResultWithLiteSerialization:174 
? Execution

Tests run: 156, Failures: 0, Errors: 1, Skipped: 2
```

that would come from Gremlin Server i think...you could test with:

```text
mvn verify -pl gremlin-server -DskipIntegrationTests=false
```

I don't think that test fails on master, but I suppose that is possible. I 
will set up my system now to try it out and will see the result in the morning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-02 Thread spmallette
Github user spmallette commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
@dalaro you had some really nice vision on this one. 

Just an organizational point - I feel like the "kryoshim" package should 
live inside of the "gryo" package as a child and not be a sibling. The 
`org.apache.tinkerpop.gremlin.structure.io` package should contain actual IO 
implementations (like gryo, graphson, graphml) and it seems to me that 
"kryoshim" only relates to "gryo". Make sense?

You didn't go so far as to deprecate anything and perhaps this work isn't 
quite at the point where we're ready to do that, but it would be nice to know 
what you see as the path to deprecating and removing some code (say 
`HadoopPools` and `GryoSerializers`) so that we could keep that knowledge 
around while it is fresh in your head. Perhaps we could polish all that up for 
the pending 3.2.1 release.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-02 Thread okram
Github user okram commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
@dalaro -- your explanation of the problem in JIRA is impeccable. I now I 
remember why "just extending `SparkSerializer`" was not an option. TinkerPop's 
Kryo is shaded. Your concept of a shim to abstract both shaded Kryo to be used 
by unshaded Kryo systems is smart. Looking at the code specifically, 
unfortunately, its hard to review and understand the full ramifications of what 
you are providing. I can run the integration test suite and all that, but 
perhaps you could also answer some questions as well. 

**Note that I haven't looked at this code in over 6 months so my questions 
may be a little odd sounding.**

1. I never liked `HadoopPools`. I have a sneaking suspicion that they may 
not work with custom class registration and I we really should have a test case 
in there to prove yay or nay. Does your work either make `HadoopPools` "better" 
or no longer needed? (Also think about `GiraphGraphComputer` which uses 
`HadoopPools`).

2. Moving forward, would people still use `GryoSerializer` or is it 
recommended that they use `SparkSerializer`? How would your work effect our 
docs and does `GryoSerializer` become deprecated?

3. I noticed you moved around the serialization code for `StarGraph`. That 
is fine, but I'm wondering, did you change the serialization of `StarGraph`? 
That is, did the `byte[]` format change? If so, thats bad for backwards 
compatibility.

Finally, your PR is beautiful. You JavaDoc like a king and you prove your 
worth as emperor with `package-info.java`. Thank you for your efforts in a part 
of the codebase that I wrote with a looming lack of confidence. You have 
brought a level of Java skill/technique to a long standing problem that is 
inspiring.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---