Re: [DISCUSS] New IO format for GLVs/Gremlin Server

2016-07-15 Thread Robert Dale
Responding to Marko and Kevin...

Marko wrote:
> SIDENOTE: This serves as a foundation for when we move to GraphSON 2.0. In 
> terms of numbers, I think, unfortunately, we have to stick with int32, int64, 
> float, double, etc. given graph database providers and their type systems. 
> It's not about the Gremlin traversal API, it's more about provider schemas. 
> has("someNumber",12L) vs. has("someNumber",12).

I call the above behavior a bug or a peculiarity of Titan; it clings
to a Java object idiom. On the other hand, DSE Graph exhibits the expected
behavior (as do IBM Graph and Neo4j).  I know of no other query
language that behaves like this (not SQL, CassandraQL, JPQL, or JOOQ,
the gremlin of sql).  Typically the underlying driver/provider does
the "right" thing (or doesn't).  Again, take UUID in gremlin: I can
pass a string.  The underlying driver seems to convert it to UUID; I
don't have to provide a UUID object.  This seems inconsistent.
Either it's doing strong typing or not.  Which is it??
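
To make the has("someNumber",12L) vs. has("someNumber",12) difference concrete, here is a minimal sketch in Python. It only models the two behaviors; the Typed class and the "int32"/"int64" labels are assumptions made for illustration, not any provider's actual implementation. The strict variant mirrors Java object equality, where Long.valueOf(12).equals(Integer.valueOf(12)) is false.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Typed:
    """Hypothetical stand-in for a value that carries its Java type."""
    type_name: str   # e.g. "int32" or "int64" (labels assumed for illustration)
    value: object


def strict_match(stored: Typed, queried: Typed) -> bool:
    """Java-object-style equality: type and value must both agree."""
    return stored == queried


def lenient_match(stored: Typed, queried: Typed) -> bool:
    """Value-only comparison, as a provider that coerces numbers might do."""
    return stored.value == queried.value


stored = Typed("int64", 12)    # property written as a long
as_long = Typed("int64", 12)   # has("someNumber", 12L)
as_int = Typed("int32", 12)    # has("someNumber", 12)

print(strict_match(stored, as_long), strict_match(stored, as_int))
print(lenient_match(stored, as_int))
```

Under the strict model the second query silently finds nothing, which is the behavior being called a bug above.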

IMO, the query language should be abstracted from the storage schema.
And I think this is where we have the impedance mismatch in this
thread.  In addition to being a query language, gremlin is really
acting like an Object Graph Mapper (like an ORM).  It's playing two
roles. So I'm also arguing that it should have a single
responsibility. Yes, I've said this before. But maybe it changes
things too drastically.  Maybe there are aspects of gremlin that
actually require strong typing. I don't know. I haven't run into them.
On to the next item...

Kevin wrote:
>> Correct, these types weren't relevant... I only wanted to show you the 
>> format...
> However, I can't quite understand the structure behind the format you 
> suggest, and I can't form a clear, explicit picture of it from the 
> example you provided in the TP-1274 PR. Could you 
> please give an example of how you would imagine the serialized JSON of :
> - an example list of typed values, like List
> - an example list of typed and untyped values, like a list with UUIDs and 
> booleans
> - an example map of typed and untyped values
>
> How would you define that format in a general way ? Like what I did when 
> saying
> "- untyped : value
> - typed : {"@type" : "typeName", "value" : value}"
>
> Just trying to understand your point better.
> Also what are the downsides you see with the format suggested above ?

The original format was in a list. I must have missed where you
accepted this format. In any case, like I originally stated, if you
want strong-typing, then _everything_ must be an _object_.

Here's an example of non-typed:
https://gist.github.com/robertdale/02931f5633be55a59c13bca3b0e58655
- native json only

Here's strongly typed:
https://gist.github.com/robertdale/6c074b165a72efee701e26f851f8b68a
- set (as an object), list (as an object), mixed-type lists, etc
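
The gists are not reproduced here. As a rough sketch of what "everything must be an object" could look like, the Python below wraps every value, containers included, in a type-tagged object. The "@type"/"value" field names and the type labels ("int64", "set", "list", ...) are assumptions for illustration and may differ from the gist.

```python
import json
import uuid


def to_typed(obj):
    """Wrap every value, including containers, in a type-tagged object."""
    if isinstance(obj, bool):   # check bool before int: bool is an int subclass
        return {"@type": "boolean", "value": obj}
    if isinstance(obj, int):
        return {"@type": "int64", "value": obj}
    if isinstance(obj, float):
        return {"@type": "double", "value": obj}
    if isinstance(obj, str):
        return {"@type": "string", "value": obj}
    if isinstance(obj, uuid.UUID):
        return {"@type": "uuid", "value": str(obj)}
    if isinstance(obj, (set, frozenset)):
        # Sort by JSON text so the output is deterministic.
        items = sorted((to_typed(o) for o in obj), key=json.dumps)
        return {"@type": "set", "value": items}
    if isinstance(obj, list):
        return {"@type": "list", "value": [to_typed(o) for o in obj]}
    if isinstance(obj, dict):
        return {"@type": "map", "value": {k: to_typed(v) for k, v in obj.items()}}
    raise TypeError(f"unsupported type: {type(obj).__name__}")


# A mixed-type list: a UUID, a boolean, and an integer.
mixed = [uuid.UUID(int=1), True, 3]
print(json.dumps(to_typed(mixed), indent=2))
```

The cost is verbosity: even a plain boolean becomes a two-field object, which is the trade-off being argued in this thread.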

Let me add that while there's no strict definition of schemaless, it
was not necessarily intended to include having mixed data types for a
single field. This is a really bad idea. Experts warn against this.
Most NoSQL databases don't even support this. You will probably die if
you use it. The default behavior for DSE graph, IBM graph, and even
Titan is to create the schema based on the first type inserted.  It
will complain if any subsequent type is different.
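
A minimal sketch of the "schema from first insert" behavior described above, in Python. PropertySchema is a hypothetical class written for this illustration; it does not reflect how DSE graph, IBM graph, or Titan actually implement it.

```python
class PropertySchema:
    """Remembers the type first seen for each property key."""

    def __init__(self):
        self._types = {}

    def check(self, key, value):
        # The first insert fixes the type; later inserts must match it.
        expected = self._types.setdefault(key, type(value))
        if type(value) is not expected:
            raise TypeError(
                f"property '{key}' is {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        return value


schema = PropertySchema()
schema.check("age", 29)      # defines age as int
schema.check("age", 30)      # accepted: same type
try:
    schema.check("age", "thirty")
except TypeError as err:
    print(err)
```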

Also, schemaless doesn't mean without any schema. While not having to
define a schema up-front during a quickstart or early development
makes life easier, no one doing any serious work or going to
production goes without a schema.  Again, see DSE graph, IBM graph,
Titan, etc.

Let's take a look at the DSE graph types [1]. They are a subset of the
Cassandra data types. What's really interesting is that
they are all represented in some simple form - string or integer
literals (and bool) - except for Geo, and even that can be expressed
as some form of array. So blob, inet, uuid, even timestamp are all
queried as strings!
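
As a sketch of how a backing schema can settle types when the query only carries simple literals, the Python below converts values according to a per-property schema. The SCHEMA contents, the property names, and the converter table are hypothetical; they illustrate the idea and are not the DSE driver's API.

```python
import ipaddress
import uuid
from datetime import datetime

# Converters from a simple literal to a richer in-memory type.
CONVERTERS = {
    "uuid": uuid.UUID,
    "inet": ipaddress.ip_address,
    "timestamp": datetime.fromisoformat,
    "text": str,
    "int": int,
}

# Hypothetical schema: property key -> declared type name.
SCHEMA = {"id": "uuid", "host": "inet", "created": "timestamp", "age": "int"}


def coerce(key, literal):
    """Convert a literal to the type the schema declares for this key."""
    return CONVERTERS[SCHEMA[key]](literal)


print(repr(coerce("id", "123e4567-e89b-12d3-a456-426614174000")))
print(repr(coerce("host", "10.0.0.1")))
print(repr(coerce("created", "2016-07-15T14:44:00")))
```

The caller never names a type; the schema does the work, which is the "southbound" resolution argued for in this message.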

Also look at other APIs and you'll see the use of JSON without
strong-typing for non-domain and/or scalar types in IBM graph,
Elasticsearch, Solr, and just about every other REST API out there.
Any typing beyond JSON's weak typing is settled by the backing
schema (southbound) or by the OGM (northbound).  Additionally,
VertexProperty returns only Object. I still have to know what the
underlying type is. What difference does it make if I cast
(strong typing) or convert (weak typing)? I still have to do something in
order for it to be usable in Java.  Maybe I'm just missing
something...

But at the end of the day, I care more about consistency than about
whether the typing is strong or weak.  :-)

Finally, I would still consider promoting spatial shapes to a
first-class entity in gremlin and including GeoJSON for serialization.
This may be a separate effort.

1. https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/reference/refDSEGraphDataTypes.html

-- 
Robert Dale


Re: [DISCUSS] New IO format for GLVs/Gremlin Server

2016-07-15 Thread gallardo.kev...@gmail.com


On 2016-07-09 16:48 (+0100), Stephen Mallette  wrote: 
> With all the work on GLVs and the recent work on GraphSON 2.0, I think it's
> important that we have a solid, efficient, programming language neutral,
> lossless serialization format. Right now that format is GraphSON and it
> works for that purpose (even more so with 2.0). Given some discussion on
> the GraphSON 2.0 PR driven a bit by Robert Dale:
> 
> https://github.com/apache/tinkerpop/pull/351#issuecomment-231157389
> 
> I wonder if we shouldn't consider another IO format that has Gremlin
> Server/GLVs in mind. At this point I'm not suggesting anything specific -
> I'm just hanging the idea out for further discussion and brainstorming.
> Thoughts?
> 

Hey, so I'm trying to gather all the information we have here in order to 
prepare to move forward with the implementation of GraphSON 2.0. Here's what 
I came up with : 

Things we have : 
- Type format.
- The structure in Jackson to implement our own type format.
- All non-native Graph types are typed (except the domain-specific types).

New things we need : 
- Types for domain specific objects.
- Types for all numeric values.
- Don't serialize empty fields (outV and stuff).
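
A possible way to handle the "don't serialize empty fields" item is to prune empty members before writing JSON. The sketch below is an assumption about how that could be done in Python; prune_empty and the example edge are hypothetical and not part of GraphSON.

```python
import json


def prune_empty(obj):
    """Recursively drop dict entries whose value is None, {} or []."""
    if isinstance(obj, dict):
        pruned = {k: prune_empty(v) for k, v in obj.items()}
        return {k: v for k, v in pruned.items() if v not in (None, {}, [])}
    if isinstance(obj, list):
        return [prune_empty(v) for v in obj]
    return obj


# Hypothetical reference-style edge with nothing attached to it.
edge = {"id": 7, "label": "knows", "outV": None, "properties": {}}
print(json.dumps(prune_empty(edge)))
```

Values such as 0, False, and the empty string are kept on purpose, since they can be legitimate property values.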

Things we consider changing :
- Type IDs convention. Before : Java simple class names. Now : starts with a 
"domain" like "gremlin" followed by the "type name", which is a lowercased type 
name (like "uuid", or "float", or "vertex"). Example : "gremlin:uuid".
- Type format ?

Am I missing something ?
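
To illustrate the proposed type ID convention ("domain" followed by a lowercased type name, e.g. "gremlin:uuid"), here is a small Python sketch. The helper names type_id and split_type_id are hypothetical, and the ":" separator is taken from the "gremlin:uuid" example above.

```python
def type_id(simple_class_name, domain="gremlin"):
    """Build a type ID from a Java simple class name, e.g. UUID -> gremlin:uuid."""
    return f"{domain}:{simple_class_name.lower()}"


def split_type_id(tid):
    """Split a type ID back into (domain, type name)."""
    domain, _, name = tid.partition(":")
    if not domain or not name:
        raise ValueError(f"malformed type id: {tid!r}")
    return domain, name


for java_name in ("UUID", "Float", "Vertex"):
    print(java_name, "->", type_id(java_name))
```

The domain prefix leaves room for provider-specific types to live alongside the "gremlin" ones without name clashes.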


Re: [DISCUSS] New IO format for GLVs/Gremlin Server

2016-07-15 Thread Marko Rodriguez
Hello,

> How would you define that format in a general way ? Like what I did when 
> saying 
> "- untyped : value
> - typed : {"@type" : "typeName", "value" : value}"
> 
> Just trying to understand your point better. 
> Also what are the downsides you see with the format suggested above ?

This makes sense to me.

Thus, Vertex becomes {@type=vertex, …}.

If you want to use native JSON types, don’t wrap them in {@type=}; otherwise, 
you can do {@type=int32}
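
A minimal decoding sketch for the "untyped : value / typed : {@type, value}" format discussed above, in Python. The DECODERS table and the type names in it are assumptions for illustration; the example input is the kind of mixed list (UUIDs and booleans) asked about earlier in the thread.

```python
import json
import uuid

# Hypothetical registry: type name -> constructor for the decoded value.
DECODERS = {"uuid": uuid.UUID, "int32": int, "int64": int}


def decode(node):
    """Unwrap typed objects; leave native JSON values untouched."""
    if isinstance(node, dict):
        if "@type" in node:
            return DECODERS[node["@type"]](decode(node["value"]))
        return {k: decode(v) for k, v in node.items()}
    if isinstance(node, list):
        return [decode(v) for v in node]
    return node


doc = json.loads(
    '[{"@type": "uuid", "value": "00000000-0000-0000-0000-000000000001"},'
    ' true, {"@type": "int32", "value": 7}]'
)
print(decode(doc))
```

One design caveat: a plain user map that happens to contain an "@type" key would be mistaken for a typed value, so a real format would need a rule for that case.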

Marko.

Re: [DISCUSS] New IO format for GLVs/Gremlin Server

2016-07-15 Thread gallardo.kev...@gmail.com


On 2016-07-15 15:52 (+0100), 
"gallardo.kev...@gmail.com" wrote: 
> 
> 
> On 2016-07-15 14:44 (+0100), Robert Dale  wrote: 
> > It looks to me like a self-inflicted problem because the things that
> > are typed are already native to json so it's redundant.  And to go a
> > step further, I wouldn't consider the types to be 'correct' because
> > everything that is a HashMap is really a Vertex, Edge, or Property.
> > 
> > On Thu, Jul 14, 2016 at 10:03 AM, gallardo.kev...@gmail.com
> >  wrote:
> > >
> > >
> > > On 2016-07-13 13:17 (+0100), Robert Dale  wrote:
> > >> Marko, I agree that empty object properties should not be represented.
> > >> I think if you saw that in an example then it was probably for
> > >> demonstration purposes.
> > >>
> > >> Kevin, can you expand on this comment:
> > >>
> > >> > the format you suggest would lead to the same inconsistencies as in 
> > >> > GraphSON 1.0.
> > >> > Since the type is at the same level as the data itself, whether the 
> > >> > container is an Array or an Object
> > >> > https://github.com/apache/tinkerpop/pull/351#issuecomment-231351653
> > >>
> > >> What exactly are the inconsistencies?  What is the problem in
> > >> determining an array or object?
> > >> This is a natural JSON array (or list): []
> > >> This is a natural JSON object: {}
> > >>
> > >> Type at the object level is a common pattern and supported feature of
> > >> Jackson.  Also, GeoJSON would be a natural fit as it also stores
> > >> 'type' at the object level. Titan supports GeoJSON currently.  I
> > >> wonder if it would make sense to promote geometry to gremlin.
> > >>
> > >
> > > I probably wasn't clear enough. In my first email explaining my motivation 
> > > to improve GraphSON 1.0, one of the things I noticed was that according 
> > > to the enclosing element (either an Array or a Map), a type will either 
> > > be described as (respectively) an element of the Array, or a key/value 
> > > pair in a Map, you can see that in the "embedded types" example of the 
> > > Tinkerpop docs : 
> > > http://tinkerpop.apache.org/docs/current/reference/#graphson-reader-writer
> > >  .
> > >
> > > There you can see that the type "java.util.ArrayList" is a simple element 
> > > of the enclosing array, but the "java.util.HashMap" type is a field of 
> > > the enclosing Map as {"@class" : "java.util.HashMap", ...}. This does not 
> > > seem consistent to me and even though I know that Jackson handles it 
> > > well, it seems that we'd better provide a consistent enclosing format 
> > > that we know is fixed whatever the enclosed data is, to make the 
> > > automatic type detection for other parsers in other libraries/languages 
> > > easier. Does that make sense ?
> > >
> > >> We should probably start documenting a table of supported types. (If
> > >> there is one, please provide link)
> > >>
> > >> I wonder if it even makes sense to type numbers according to their
> > >> memory model. As objects, Byte, Short, and Integer occupy the same
> > >> space. Long isn't much more.  So in Java we're not saving much space.
> > >> Jackson will attempt to parse in order: int, long, BigInt, BigDecimal.
> > >> The JSON JSR uses only BigDecimal. Some non-jvm languages don't even
> > >> have this concept.  Does anything in gremlin actually require this?
> > >> I'm thinking that this is only going to be relevant at the domain
> > >> model level. This way json native numbers can be used and not need
> > >> typing.
> > >>
> > >> Additionally, I think that all things that will be typed should always
> > >> be typed. For the use cases of ingesting a saved graph from a file, it
> > >> can probably be assumed that the top-level objects are vertices since
> > >> the graph is vertex-centric and everything else follows naturally.
> > >> I'm not entirely sure what is required for submitting traversals to
> > >> gremlin server from GLV.  However, if this is used for the results
> > >> from gremlin server then the results could start with any one of path,
> > >> vertex, edge, property, vertex property, etc. So you'll need that type
> > >> data there.
> > >>
> > >> --
> > >> Robert Dale
> > >>
> > >> On Tue, Jul 12, 2016 at 8:35 AM, Marko Rodriguez  
> > >> wrote:
> > >> > Hi,
> > >> >
> > >> > I’m not following this PR too closely so what I might be saying 
> > >> > is already known/argued against/etc.
> > >> >
> > >> > 1. I think we should go with Robert Dale’s proposal of 
> > >> > int32, int64, Vertex, uuid, etc. instead of Java class names.
> > >> > 2. In Java we then have a Map for typecasting 
> > >> > accordingly.
> > >> > 3. This would make GraphSON 2.0 perfect for Bytecode 
> > >> > serialization in TINKERPOP-1278.
> > >> > 4. I think that if a Vertex, Edge, etc. doesn’t have 
> > >> > properties, outV, etc. then don’t even have those fields 

Re: [DISCUSS] New IO format for GLVs/Gremlin Server

2016-07-15 Thread gallardo.kev...@gmail.com


On 2016-07-15 16:07 (+0100), 
"gallardo.kev...@gmail.com" wrote: 
> 
> 
> On 2016-07-15 15:52 (+0100), 
> "gallardo.kev...@gmail.com" wrote: 
> > 
> > 
> > On 2016-07-15 14:44 (+0100), Robert Dale  wrote: 
> > > It looks to me like a self-inflicted problem because the things that
> > > are typed are already native to json so it's redundant.  And to go a
> > > step further, I wouldn't consider the types to be 'correct' because
> > > everything that is a HashMap is really a Vertex, Edge, or Property.
> > > 
> > > On Thu, Jul 14, 2016 at 10:03 AM, gallardo.kev...@gmail.com
> > >  wrote:
> > > >
> > > >
> > > > On 2016-07-13 13:17 (+0100), Robert Dale  wrote:
> > > >> Marko, I agree that empty object properties should not be represented.
> > > >> I think if you saw that in an example then it was probably for
> > > >> demonstration purposes.
> > > >>
> > > >> Kevin, can you expand on this comment:
> > > >>
> > > >> > the format you suggest would lead to the same inconsistencies as in 
> > > >> > GraphSON 1.0.
> > > >> > Since the type is at the same level as the data itself, whether 
> > > >> > the container is an Array or an Object
> > > >> > https://github.com/apache/tinkerpop/pull/351#issuecomment-231351653
> > > >>
> > > >> What exactly are the inconsistencies?  What is the problem in
> > > >> determining an array or object?
> > > >> This is a natural JSON array (or list): []
> > > >> This is a natural JSON object: {}
> > > >>
> > > >> Type at the object level is a common pattern and supported feature of
> > > >> Jackson.  Also, GeoJSON would be a natural fit as it also stores
> > > >> 'type' at the object level. Titan supports GeoJSON currently.  I
> > > >> wonder if it would make sense to promote geometry to gremlin.
> > > >>
> > > >
> > > > I probably wasn't clear enough. In my first email explaining my 
> > > > motivation to improve GraphSON 1.0, one of the things I noticed was 
> > > > that according to the enclosing element (either an Array or a Map), a 
> > > > type will either be described as (respectively) an element of the 
> > > > Array, or a key/value pair in a Map, you can see that in the "embedded 
> > > > types" example of the Tinkerpop docs : 
> > > > http://tinkerpop.apache.org/docs/current/reference/#graphson-reader-writer
> > > >  .
> > > >
> > > > There you can see that the type "java.util.ArrayList" is a simple 
> > > > element of the enclosing array, but the "java.util.HashMap" type is a 
> > > > field of the enclosing Map as {"@class" : "java.util.HashMap", ...}. 
> > > > This does not seem consistent to me and even though I know that Jackson 
> > > > handles it well, it seems that we'd better provide a consistent 
> > > > enclosing format that we know is fixed whatever the enclosed data is, 
> > > > to make the automatic type detection for other parsers in other 
> > > > libraries/languages easier. Does that make sense ?
> > > >
> > > >> We should probably start documenting a table of supported types. (If
> > > >> there is one, please provide link)
> > > >>
> > > >> I wonder if it even makes sense to type numbers according to their
> > > >> memory model. As objects, Byte, Short, and Integer occupy the same
> > > >> space. Long isn't much more.  So in Java we're not saving much space.
> > > >> Jackson will attempt to parse in order: int, long, BigInt, BigDecimal.
> > > >> The JSON JSR uses only BigDecimal. Some non-jvm languages don't even
> > > >> have this concept.  Does anything in gremlin actually require this?
> > > >> I'm thinking that this is only going to be relevant at the domain
> > > >> model level. This way json native numbers can be used and not need
> > > >> typing.
> > > >>
> > > >> Additionally, I think that all things that will be typed should always
> > > >> be typed. For the use cases of ingesting a saved graph from a file, it
> > > >> can probably be assumed that the top-level objects are vertices since
> > > >> the graph is vertex-centric and everything else follows naturally.
> > > >> I'm not entirely sure what is required for submitting traversals to
> > > >> gremlin server from GLV.  However, if this is used for the results
> > > >> from gremlin server then the results could start with any one of path,
> > > >> vertex, edge, property, vertex property, etc. So you'll need that type
> > > >> data there.
> > > >>
> > > >> --
> > > >> Robert Dale
> > > >>
> > > >> On Tue, Jul 12, 2016 at 8:35 AM, Marko Rodriguez 
> > > >>  wrote:
> > > >> > Hi,
> > > >> >
> > > >> > I’m not following this PR too closely so what I might be saying 
> > > >> > is already known/argued against/etc.
> > > >> >
> > > >> > 1. I think we should go with Robert Dale’s proposal of 
> > > >> > int32, int64, Vertex, uuid, etc. instead of Java class names.
> > > >> > 2. In Java we then have a Map for typecasting 

Re: [DISCUSS] New IO format for GLVs/Gremlin Server

2016-07-15 Thread Robert Dale
It looks to me like a self-inflicted problem because the things that
are typed are already native to json so it's redundant.  And to go a
step further, I wouldn't consider the types to be 'correct' because
everything that is a HashMap is really a Vertex, Edge, or Property.

On Thu, Jul 14, 2016 at 10:03 AM, gallardo.kev...@gmail.com
 wrote:
>
>
> On 2016-07-13 13:17 (+0100), Robert Dale  wrote:
>> Marko, I agree that empty object properties should not be represented.
>> I think if you saw that in an example then it was probably for
>> demonstration purposes.
>>
>> Kevin, can you expand on this comment:
>>
>> > the format you suggest would lead to the same inconsistencies as in 
>> > GraphSON 1.0.
>> > Since the type is at the same level as the data itself, whether the 
>> > container is an Array or an Object
>> > https://github.com/apache/tinkerpop/pull/351#issuecomment-231351653
>>
>> What exactly are the inconsistencies?  What is the problem in
>> determining an array or object?
>> This is a natural JSON array (or list): []
>> This is a natural JSON object: {}
>>
>> Type at the object level is a common pattern and supported feature of
>> Jackson.  Also, GeoJSON would be a natural fit as it also stores
>> 'type' at the object level. Titan supports GeoJSON currently.  I
>> wonder if it would make sense to promote geometry to gremlin.
>>
>
> I probably wasn't clear enough. In my first email explaining my motivation to 
> improve GraphSON 1.0, one of the things I noticed was that according to the 
> enclosing element (either an Array or a Map), a type will either be described 
> as (respectively) an element of the Array, or a key/value pair in a Map, you 
> can see that in the "embedded types" example of the Tinkerpop docs : 
> http://tinkerpop.apache.org/docs/current/reference/#graphson-reader-writer .
>
> There you can see that the type "java.util.ArrayList" is a simple element of 
> the enclosing array, but the "java.util.HashMap" type is a field of the 
> enclosing Map as {"@class" : "java.util.HashMap", ...}. This does not seem 
> consistent to me and even though I know that Jackson handles it well, it 
> seems that we'd better provide a consistent enclosing format that we know is 
> fixed whatever the enclosed data is, to make the automatic type detection for 
> other parsers in other libraries/languages easier. Does that make sense ?
>
>> We should probably start documenting a table of supported types. (If
>> there is one, please provide link)
>>
>> I wonder if it even makes sense to type numbers according to their
>> memory model. As objects, Byte, Short, and Integer occupy the same
>> space. Long isn't much more.  So in Java we're not saving much space.
>> Jackson will attempt to parse in order: int, long, BigInt, BigDecimal.
>> The JSON JSR uses only BigDecimal. Some non-jvm languages don't even
>> have this concept.  Does anything in gremlin actually require this?
>> I'm thinking that this is only going to be relevant at the domain
>> model level. This way json native numbers can be used and not need
>> typing.
>>
>> Additionally, I think that all things that will be typed should always
>> be typed. For the use cases of ingesting a saved graph from a file, it
>> can probably be assumed that the top-level objects are vertices since
>> the graph is vertex-centric and everything else follows naturally.
>> I'm not entirely sure what is required for submitting traversals to
>> gremlin server from GLV.  However, if this is used for the results
>> from gremlin server then the results could start with any one of path,
>> vertex, edge, property, vertex property, etc. So you'll need that type
>> data there.
>>
>> --
>> Robert Dale
>>
>> On Tue, Jul 12, 2016 at 8:35 AM, Marko Rodriguez  
>> wrote:
>> > Hi,
>> >
>> > I’m not following this PR too closely so what I might be saying is 
>> > already known/argued against/etc.
>> >
>> > 1. I think we should go with Robert Dale’s proposal of int32, 
>> > int64, Vertex, uuid, etc. instead of Java class names.
>> > 2. In Java we then have a Map for typecasting 
>> > accordingly.
>> > 3. This would make GraphSON 2.0 perfect for Bytecode serialization 
>> > in TINKERPOP-1278.
>> > 4. I think that if a Vertex, Edge, etc. doesn’t have properties, 
>> > outV, etc. then don’t even have those fields in the representation.
>> > 5. Most of the serialization back and forth will be ReferenceXXX 
>> > elements and thus, don’t create more Maps/lists for no reason. — less 
>> > chars.
>> >
>> > For me, my interest in this work is all about a language agnostic way 
>> > of sending Gremlin traversal bytecode between different languages. This 
>> > work is exactly what I am looking for.
>> >
>> > Thanks,
>> > Marko.
>> >
>> > http://markorodriguez.com
>> >
>> >
>> >
>> >> On Jul 9, 2016, at 9:48 AM, Stephen Mallette  wrote:
>> >>
>> >>