[ 
https://issues.apache.org/jira/browse/CALCITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingyuan cheng updated CALCITE-6171:
------------------------------------
    Description: 
Motivation
Hi, community. More and more users are adopting graph databases such as JanusGraph and HugeGraph to represent social networks of people, for example to count the activity of each user in the network. Most graph databases implement both the graph-building and graph-traversal stages using Gremlin syntax. However, this is very unfriendly to users who are not familiar with Gremlin. Calcite, as a query framework, provides an adapter interface for adapting to different database dialects, covering parsing, relational algebra conversion, and query plan binding.
Our company has solved the problem of adapting various graph databases. This is my repository: [https://github.com/kaori-seasons/calcite-gremlin-sql]

Background

Calcite supports extending database languages through its adapter interface, which lets users keep a grammar they understand while the adapter handles the dialect. For example, SqlNode can be extended to implement syntax analysis, and RelNode can be extended to implement logical plan mapping.

TinkerPop is an adaptation framework for various graph databases, and it is where Gremlin syntax was first introduced; it has since become a universal query layer for graph databases. While extending query statements through Calcite's adapter interface, we also use TinkerPop's universal graph database API to provide dialect compatibility across different graph databases.

A simple example:
{code:java}
SELECT "key" FROM "inttype"
{code}
maps to:
{code:java}
g.V().hasLabel("inttype").group().unfold().select(Column.values)
  .order().by(__.unfold().id())
  .project("inttype")
  .by(__.project("key")
    .by(__.unfold().choose(__.has("key"), __.values("key"), __.constant("$%#NULL#%$"))))
{code}

The design is divided into three layers: the parsing syntax layer, the relational algebra transformation layer, and the logical plan binding layer.

Parsing syntax layer: while parsing the query statement, fields and equality conditions are split out and converted into vertices and edges.

Relational algebra layer: the resulting collection of vertices and edges is converted into a TableScan at the aggregation/sorting/query stage that Calcite abstracts, which makes it convenient to generate query plans from the condition and field information. Join scans, single-table filters, and similar operations can then traverse the graph from any edge or any starting vertex.

Logical plan binding layer: behaviors such as join scans, single-table filters, and projections are bound to Calcite's planner to generate query plans.
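As a rough illustration of how the three layers compose, here is a minimal sketch. The interfaces and names below are hypothetical, chosen for illustration only, and do not mirror the classes in the linked repository:

```java
// Hypothetical sketch of the three-layer pipeline; illustrative only.
public class ThreeLayerSketch {
    /** Parsing syntax layer: split fields/conditions into vertex-edge form. */
    interface SyntaxLayer { String parse(String sql); }
    /** Relational algebra layer: map the parsed form to a TableScan-like node. */
    interface AlgebraLayer { String toRelNode(String parsed); }
    /** Logical plan binding layer: bind scan/filter/project into a plan. */
    interface BindingLayer { String bind(String relNode); }

    /** Runs a query through all three layers in order. */
    static String plan(String sql) {
        SyntaxLayer syntax = s -> "parsed(" + s + ")";
        AlgebraLayer algebra = p -> "tableScan(" + p + ")";
        BindingLayer binding = r -> "plan(" + r + ")";
        return binding.bind(algebra.toRelNode(syntax.parse(sql)));
    }

    public static void main(String[] args) {
        System.out.println(plan("SELECT name FROM person"));
        // prints plan(tableScan(parsed(SELECT name FROM person)))
    }
}
```

The point of the sketch is only the ordering: parsing output feeds the relational algebra layer, whose output is what the planner binds.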

The process of generating a Gremlin logical plan from a SELECT statement:

1. All graph databases build the graph starting from a source vertex. We use the GraphTraversalSource provided by TinkerPop as that origin and extract the vertex and edge information from the incoming syntax. This step is implemented in SqlSchemaGrabber.
2. For select/where/having/order by/group by, our plan in the parsing phase is as follows:
- group by: a vertex in the graph has out-degree and in-degree attributes. From the perspective of the data table, this is equivalent to converting the table data into the two dimensions IN and OUT and aggregating over them. This also corresponds to the table-traversal operation on the graph: during traversal, fold/unfold tags are generated, representing the direction of graph traversal.
- select: scanning the entire table can be regarded as projecting all columns onto vertex attributes; the value changes of each column are mapped to the Gremlin operation of adding vertices.
- where: in graph-computing semantics, the filter can be regarded as the edges connected to the filtered vertex through its out-degree and in-degree, so it does not involve relational algebra conversion; instead it is pushed directly down to the logical plan.
- order by: as mentioned, fold/unfold tags are generated along the traversal path to represent the forward/backward direction. If a field has no sortable value, we fall back to the GraphTraversalSource origin and end the sorting operation. If there are sortable values, they are unified in SqlTraversalEngine: in-degree and out-degree are counted separately for aggregation and then combined with group by according to the label (IN/OUT).
- having: same meaning as group by, but with a different label (in addition to the IN/OUT columns, specific vertex fields must be specified).
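To make the select mapping above concrete, here is a minimal, self-contained sketch that builds a Gremlin-style projection string for a simple single-table SELECT. The class and method names are hypothetical; the real translation in calcite-gremlin-sql is more involved (e.g. it also threads group()/unfold() steps through the traversal, as in the inttype example earlier):

```java
// Hypothetical sketch: mapping a simple projected SELECT onto a Gremlin
// traversal string. Class and method names are illustrative and do not
// mirror the real calcite-gremlin-sql translation.
import java.util.List;
import java.util.stream.Collectors;

public class SelectToGremlinSketch {
    /** Builds g.V().hasLabel(table).project(c1, ...).by(__.values(ci))... */
    static String translate(String table, List<String> columns) {
        String projected = columns.stream()
                .map(c -> "\"" + c + "\"")
                .collect(Collectors.joining(", "));
        StringBuilder out = new StringBuilder();
        out.append("g.V().hasLabel(\"").append(table).append("\")")
           .append(".project(").append(projected).append(")");
        for (String c : columns) {
            out.append(".by(__.values(\"").append(c).append("\"))");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("person", List.of("name", "age")));
        // prints g.V().hasLabel("person").project("name", "age")
        //            .by(__.values("name")).by(__.values("age"))
    }
}
```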
 

Samples

Below is a simple example taken from the unit tests in my project. Suppose there is a graph data set with multiple vertex sets and one edge set. Since connecting to the company's graph database requires further testing, I will complete that in the near future. // TODO

The vertex sets represent countries, companies, groups of people, and spatial locations.

 
{code:java}
{
  "tables": [
    {
      "name": "company",
      "columns": [
        {"name": "name", "type": "string"}
      ]
    },
    {
      "name": "country",
      "columns": [
        {"name": "name", "type": "string"}
      ]
    },
    {
      "name": "planet",
      "columns": [
        {"name": "name", "type": "string"}
      ]
    },
    {
      "name": "person",
      "columns": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "integer"}
      ]
    },
    {
      "name": "spaceship",
      "columns": [
        {"name": "name", "type": "string"},
        {"name": "model", "type": "string"}
      ]
    },
    {
      "name": "satellite",
      "columns": [
        {"name": "name", "type": "string"}
      ]
    },
    {
      "name": "sensor",
      "columns": [
        {"name": "name", "type": "string"},
        {"name": "type", "type": "string"}
      ]
    },
    {
      "name": "sensorReading",
      "columns": [
        {"name": "tstamp", "type": "long_timestamp", "propertyName": "timestamp"},
        {"name": "dt", "type": "long_date", "propertyName": "date"},
        {"name": "value", "type": "double"}
      ]
    },
    {
      "name": "fliesTo",
      "columns": [
        {"name": "trips", "type": "integer"}
      ]
    },
    {
      "name": "orbits",
      "columns": [
        {"name": "launched", "type": "integer"}
      ]
    }
  ],
  "relationships": [
    {"outTable": "company", "inTable": "country", "edgeLabel": "baseIn"},
    {"outTable": "person", "inTable": "company", "edgeLabel": "worksFor"},
    {"outTable": "person", "inTable": "planets", "edgeLabel": "travelledTo"},
    {"outTable": "company", "inTable": "spaceship", "edgeLabel": "owns"},
    {"outTable": "person", "inTable": "spaceship", "edgeLabel": "pilots"},
    {"outTable": "sensor", "inTable": "sensorReading", "edgeLabel": "hasReading", "fkTable": "sensorReading"},
    {"outTable": "person", "inTable": "planet", "edgeLabel": "fliesTo"},
    {"outTable": "satellite", "inTable": "planet", "edgeLabel": "orbits"},
    {"outTable": "person", "inTable": "person", "edgeLabel": "friendsWith"}
  ]
}
{code}
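Given the schema above, a simple predicate such as SELECT * FROM "person" WHERE "age" > 30 would, under this design, be pushed down to a Gremlin has()-step. The sketch below illustrates that pushdown; the class and method names are hypothetical, and the emitted string is illustrative rather than the adapter's verbatim output:

```java
// Hypothetical sketch: mapping a simple WHERE comparison onto a Gremlin
// has()-step string; names are illustrative, not the adapter's real API.
public class WhereToGremlinSketch {
    /** Builds g.V().hasLabel(table).has(column, P.<op>(value)). */
    static String filter(String table, String column, String op, int value) {
        return "g.V().hasLabel(\"" + table + "\")"
             + ".has(\"" + column + "\", P." + op + "(" + value + "))";
    }

    public static void main(String[] args) {
        // e.g. SELECT * FROM "person" WHERE "age" > 30
        System.out.println(filter("person", "age", "gt", 30));
        // prints g.V().hasLabel("person").has("age", P.gt(30))
    }
}
```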
Scope of application

Graph databases compatible with TinkerPop's Gremlin syntax.



> Support gremlin adapter
> -----------------------
>
>                 Key: CALCITE-6171
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6171
>             Project: Calcite
>          Issue Type: New Feature
>    Affects Versions: 1.36.0
>            Reporter: xingyuan cheng
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
