[ https://issues.apache.org/jira/browse/CALCITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated CALCITE-6171:
------------------------------------
    Labels: pull-request-available  (was: )

> Support gremlin adapter
> -----------------------
>
>                 Key: CALCITE-6171
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6171
>             Project: Calcite
>          Issue Type: New Feature
>    Affects Versions: 1.36.0
>            Reporter: xingyuan cheng
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Motivation
> Hi, community. More and more users are adopting graph databases such as
> JanusGraph and HugeGraph to represent personnel social networks, for
> example to measure the activity of each user in the network. Most graph
> databases implement both the graph-building and the graph-traversal stages
> with Gremlin syntax. However, this is very unfriendly to users who are not
> familiar with Gremlin. Calcite, meanwhile, is a query framework that
> provides an adapter interface for adapting to different database dialects,
> covering parsing, relational-algebra conversion, and query-plan binding.
> Our company has solved the problem of adapting various graph databases.
> This is my repository: [https://github.com/kaori-seasons/calcite-gremlin-sql]
>
> h2. Background
>
> Calcite supports extending the database language through adapters, which
> lets users keep writing SQL they understand while the adapter handles the
> dialect: for example, SqlNode can be extended to implement syntax analysis,
> and RelNode can be extended to implement logical-plan mapping.
> TinkerPop is an adaptation framework for various graph databases; Gremlin
> syntax originated in this framework and has since become a universal query
> layer for graph databases. While extending query statements through
> Calcite's adapter interface, we also use TinkerPop's universal
> graph-database API to provide dialect compatibility across different graph
> databases.
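To make the adapter's job concrete before the example below, here is a deliberately tiny, dependency-free sketch of the string-level shape of the translation for a single-column table scan. This is not the adapter's real code path (which goes through Calcite's SqlNode/RelNode pipeline); the class and method names are invented for illustration:

```java
// Toy illustration only: shows the *shape* of the SQL-to-Gremlin mapping
// for SELECT "col" FROM table. The real adapter builds this through
// Calcite's parser and planner, not by string concatenation.
class ToySqlToGremlin {

    /** Maps a single-column scan of a table onto a Gremlin-style traversal string. */
    static String translate(String column, String table) {
        // A table scan becomes a vertex-label scan; the projected column
        // becomes a property access on the matched vertices.
        return "g.V().hasLabel(\"" + table + "\").values(\"" + column + "\")";
    }

    public static void main(String[] args) {
        System.out.println(translate("key", "inttype"));
        // prints: g.V().hasLabel("inttype").values("key")
    }
}
```

The adapter's actual output (shown in the example below) is more elaborate because it must preserve row semantics, ordering, and NULL handling, but the table-to-label and column-to-property correspondence is the same.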
> Give a simple example:
> From
> {code:java}
> SELECT "key" FROM inttype{code}
> maps to
> {code:java}
> g.V().hasLabel("inttype").group().unfold().select(Column.values)
>   .order().by(__.unfold().id())
>   .project("inttype")
>     .by(__.project("key")
>       .by(__.unfold().choose(__.has("key"), __.values("key"), __.constant("$%#NULL#%$")))){code}
>
> The design is divided into three layers: a parsing layer, a
> relational-algebra layer, and a logical-plan binding layer.
> Parsing layer: while parsing the query statement, fields and equality
> conditions are split out and converted into vertices and edges.
> Relational-algebra layer: the vertex and edge sets are converted into a
> TableScan at the aggregation/sorting/query stages that Calcite abstracts,
> which makes it convenient to generate query plans from the conditions and
> field information. Join scans, single-table filters, and similar operations
> can then traverse the graph from any edge or any starting vertex.
> Logical-plan binding layer: behaviors such as join scanning, single-table
> filtering, and projection are bound to Calcite's planner to generate query
> plans.
> The process of generating a Gremlin logical plan from a SELECT statement:
> 1. All graph databases build the graph starting from a source. We use the
> GraphTraversalSource provided by TinkerPop as that origin and extract the
> vertex and edge information from the incoming syntax. This step is
> implemented in SqlSchemaGrabber.
> 2. For select/where/having/order by/group by, the plan in the parsing phase
> is as follows:
> {code:java}
> - group by: a vertex has out-degree and in-degree attributes in the graph.
>   From the data-table perspective, this is equivalent to aggregating the
>   table data along the two dimensions IN and OUT. This behavior also
>   corresponds to the table-traverses-graph operation. During graph
>   traversal, fold/unfold tags are generated, representing the direction of
>   the traversal.
> - select: a full table scan can be regarded as projecting all columns onto
>   vertex attributes; the value changes of each column are mapped to the
>   Gremlin operation of adding vertices.
> - where: in graph-computing semantics, the filter applies to the edges
>   connected by the out-degree and in-degree of the filtered vertex, so it
>   does not involve relational-algebra conversion; instead, it is pushed
>   directly into the logical plan.
> - order by: as mentioned above, fold/unfold is generated along the
>   traversal path to represent the forward/backward direction. If a field
>   has no sortable value, we fall back to the GraphTraversalSource origin
>   and end the sorting operation. If there are sortable values, they are
>   unified in SqlTraversalEngine: in-degree and out-degree are counted
>   separately for aggregation and then combined with group by on the label
>   (IN/OUT).
> - having: same meaning as group by, but with a different label (in addition
>   to the IN/OUT columns, specific vertex fields must be specified).
> {code}
>
> h2. samples
> Below is a simple example based on a unit test in my project.
> Suppose there is a graph data set with multiple vertex sets and an edge set.
> Since the company's environment is connected to the graph database, further
> testing is needed; I will complete it in the near future. //todo
> The vertex sets represent countries, companies, people, and spatial
> locations.
>
> {code:java}
> {
>   "tables": [
>     { "name": "company",   "columns": [ {"name": "name", "type": "string"} ] },
>     { "name": "country",   "columns": [ {"name": "name", "type": "string"} ] },
>     { "name": "planet",    "columns": [ {"name": "name", "type": "string"} ] },
>     { "name": "person",    "columns": [
>       {"name": "name", "type": "string"},
>       {"name": "age",  "type": "integer"}
>     ] },
>     { "name": "spaceship", "columns": [
>       {"name": "name",  "type": "string"},
>       {"name": "model", "type": "string"}
>     ] },
>     { "name": "satellite", "columns": [ {"name": "name", "type": "string"} ] },
>     { "name": "sensor",    "columns": [
>       {"name": "name", "type": "string"},
>       {"name": "type", "type": "string"}
>     ] },
>     { "name": "sensorReading", "columns": [
>       {"name": "tstamp", "type": "long_timestamp", "propertyName": "timestamp"},
>       {"name": "dt",     "type": "long_date",      "propertyName": "date"},
>       {"name": "value",  "type": "double"}
>     ] },
>     { "name": "fliesTo", "columns": [ {"name": "trips",    "type": "integer"} ] },
>     { "name": "orbits",  "columns": [ {"name": "launched", "type": "integer"} ] }
>   ],
>   "relationships": [
>     {"outTable": "company",   "inTable": "country",       "edgeLabel": "baseIn"},
>     {"outTable": "person",    "inTable": "company",       "edgeLabel": "worksFor"},
>     {"outTable": "person",    "inTable": "planet",        "edgeLabel": "travelledTo"},
>     {"outTable": "company",   "inTable": "spaceship",     "edgeLabel": "owns"},
>     {"outTable": "person",    "inTable": "spaceship",     "edgeLabel": "pilots"},
>     {"outTable": "sensor",    "inTable": "sensorReading", "edgeLabel": "hasReading", "fkTable": "sensorReading"},
>     {"outTable": "person",    "inTable": "planet",        "edgeLabel": "fliesTo"},
>     {"outTable": "satellite", "inTable": "planet",        "edgeLabel": "orbits"},
>     {"outTable": "person",    "inTable": "person",        "edgeLabel": "friendsWith"}
>   ]
> } {code}
> Scope of application:
> Graph databases compatible with TinkerPop's Gremlin syntax.

--
This message
was sent by Atlassian Jira (v8.20.10#820010)
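
As a closing illustration, the "relationships" section of the schema above can be viewed as an edge-label catalog keyed by the out-vertex table — the structure a schema-grabbing step would need when deciding which edges a traversal may follow from a given table. A minimal, dependency-free Java sketch (the class and method names here are invented for illustration; they are not the adapter's actual API):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: a plain-Java view of the "relationships" array from the
// schema JSON above. Names are made up for this sketch.
class GraphCatalog {

    /** One entry of the "relationships" array: outTable --edgeLabel--> inTable. */
    record Relationship(String outTable, String inTable, String edgeLabel) {}

    private final List<Relationship> relationships;

    GraphCatalog(List<Relationship> relationships) {
        this.relationships = relationships;
    }

    /** Edge labels leaving a given vertex table, in declaration order. */
    List<String> outgoingEdges(String table) {
        return relationships.stream()
                .filter(r -> r.outTable().equals(table))
                .map(Relationship::edgeLabel)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        GraphCatalog catalog = new GraphCatalog(List.of(
                new Relationship("company", "country", "baseIn"),
                new Relationship("person", "company", "worksFor"),
                new Relationship("person", "spaceship", "pilots")));
        System.out.println(catalog.outgoingEdges("person"));
        // prints: [worksFor, pilots]
    }
}
```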