[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212525#comment-15212525 ] Kevin Watters edited comment on SOLR-8176 at 3/29/16 2:19 PM: -- Here's a patch with a basic implementation of a Kafka based frontier query broker to support distributed graph query traversal in Solr. The unit test is commented out because it requires a Kafka broker to be running. Also, there's some config parameters / properties that are hard coded. Either way, this shows how to use the GraphQuery in a distributed graph traversal mode. Disclaimer: this patch isn't intended to be merged, it's really only an example of how to do it. there's a lot of cleanup that still needs to happen to make it ready for primetime. was (Author: kwatters): Here's a patch with a basic implementation of a Kafka based frontier query broker to support distributed graph query traversal in Solr. The unit test is commented out because it requires a Kafka broker to be running. Also, there's some config parameters / properties that are hard coded. Either way, this shows how to use the GraphQuery in a distributed graph traversal mode. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > Attachments: SOLR-8176.patch > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208699#comment-15208699 ] Joel Bernstein edited comment on SOLR-8176 at 3/23/16 4:48 PM: --- In regards to the tinkerpop discussions on this ticket. I think it makes sense to first add some Streaming Expressions that model some basic graph traversal use cases. This will iron out some of the mechanics involved in doing graph traversals. This will start to build the foundation for supporting the Gremlin query language. This was the same approach taken when the parallel relational algebra expressions came first and became the foundation of the SQL interface. This approach achieves a number of things: 1) It doesn't put the cart before the horse. In order to properly support Gremlin we need a scalable distributed graph traversal capability. Once we have that adding Gremlin or SparkQL support will be much easier then trying to bootstrap graph capabilities while working on Gremlin at the same time. 2) We can then use Gremlin as a driver for building out the full range of graph traversal streaming expressions. The SQL interface plays this role for the parallel relational algebra Streaming Expressions. 3) Building out all Graph traversals as streaming expression means that we can than directly plug in the graph expressions with the existing expression library. 4) Each Graph Streaming Expression can be released on it's own providing useful functionality incrementally. was (Author: joel.bernstein): In regards to the tinkerpop discussions on this ticket. I think it makes sense to first add some Streaming Expressions that model some basic graph traversal use cases. This will iron out some of the mechanics involved in doing graph traversals. This will start to build the foundation for supporting the Gremlin query language. This was the same approach taken when the parallel relational algebra expressions came first and became the foundation of the SQL interface. This approach achieves a number of things: 1) It doesn't put the cart before the horse. In order to properly support Gremlin we need a scalable distributed graph traversal capability. Once we have that adding Gremlin or SparkQL support will be much easier then trying to bootstrap graph capabilities while working on Gremlin at the same time. 2) We can then use Gremlin as a driver for building out the full range of graph traversal streaming expressions. The SQL interface plays this role for the parallel relational algebra Streaming Expressions. 3) Building out all Graph traversals as streaming expression means that we can than directly plug in the graph expressions with the existing expression library. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208699#comment-15208699 ] Joel Bernstein edited comment on SOLR-8176 at 3/23/16 4:27 PM: --- In regards to the tinkerpop discussions on this ticket. I think it makes sense to first add some Streaming Expressions that model some basic graph traversal use cases. This will iron out some of the mechanics involved in doing graph traversals. This will start to build the foundation for supporting the Gremlin query language. This was the same approach taken when the parallel relational algebra came first and became the foundation of the SQL interface. This approach achieves a number of things: 1) It doesn't put the cart before the horse. In order to properly support Gremlin we need a scalable distributed graph traversal capability. Once we have that adding Gremlin or SparkQL support will be much easier then trying to bootstrap graph capabilities while working on Gremlin at the same time. 2) We can then use Gremlin as a driver for building out the full range of graph traversal streaming expressions. The SQL interface plays this role for the parallel relational algebra Streaming Expressions. 3) Building out all Graph traversals as streaming expression means that we can than directly plug in the graph expressions with the existing expression library. was (Author: joel.bernstein): In regards to the tinkerpop discussions on this ticket. I'm planning on first adding some Streaming Expressions that model some basic graph traversal use cases. This will iron out some of the mechanics involved in doing graph traversals. This will start to build the foundation for supporting the Gremlin query language. This was the same approach taken when the parallel relational algebra came first and became the foundation of the SQL interface. This approach achieves a number of things: 1) It doesn't put the cart before the horse. In order to properly support Gremlin we need a scalable distributed graph traversal capability. Once we have that adding Gremlin or SparkQL support will be much easier then trying to bootstrap graph capabilities while working on Gremlin at the same time. 2) We can then use Gremlin as a driver for building out the full range of graph traversal streaming expressions. The SQL interface plays this role for the parallel relational algebra Streaming Expressions. 3) Building out all Graph traversals as streaming expression means that we can than directly plug in the graph expressions with the existing expression library. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208699#comment-15208699 ] Joel Bernstein edited comment on SOLR-8176 at 3/23/16 4:28 PM: --- In regards to the tinkerpop discussions on this ticket. I think it makes sense to first add some Streaming Expressions that model some basic graph traversal use cases. This will iron out some of the mechanics involved in doing graph traversals. This will start to build the foundation for supporting the Gremlin query language. This was the same approach taken when the parallel relational algebra expressions came first and became the foundation of the SQL interface. This approach achieves a number of things: 1) It doesn't put the cart before the horse. In order to properly support Gremlin we need a scalable distributed graph traversal capability. Once we have that adding Gremlin or SparkQL support will be much easier then trying to bootstrap graph capabilities while working on Gremlin at the same time. 2) We can then use Gremlin as a driver for building out the full range of graph traversal streaming expressions. The SQL interface plays this role for the parallel relational algebra Streaming Expressions. 3) Building out all Graph traversals as streaming expression means that we can than directly plug in the graph expressions with the existing expression library. was (Author: joel.bernstein): In regards to the tinkerpop discussions on this ticket. I think it makes sense to first add some Streaming Expressions that model some basic graph traversal use cases. This will iron out some of the mechanics involved in doing graph traversals. This will start to build the foundation for supporting the Gremlin query language. This was the same approach taken when the parallel relational algebra came first and became the foundation of the SQL interface. This approach achieves a number of things: 1) It doesn't put the cart before the horse. In order to properly support Gremlin we need a scalable distributed graph traversal capability. Once we have that adding Gremlin or SparkQL support will be much easier then trying to bootstrap graph capabilities while working on Gremlin at the same time. 2) We can then use Gremlin as a driver for building out the full range of graph traversal streaming expressions. The SQL interface plays this role for the parallel relational algebra Streaming Expressions. 3) Building out all Graph traversals as streaming expression means that we can than directly plug in the graph expressions with the existing expression library. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096707#comment-15096707 ] Joel Bernstein edited comment on SOLR-8176 at 1/13/16 6:15 PM: --- Thinking some about the mechanics of what you're describing. One possible approach to this is to implement each iteration as a parallel join. The initial join would shuffle results from the initial query to worker nodes. Then the worker nodes would persist the current working set locally. Then the next iteration starts from the worker nodes that persisted the working set. Each step in the traversal could be done like this. The effect would be that the graph traversal would *hop* from one set of workers to another set of workers with each iteration. was (Author: joel.bernstein): Thinking some about the machanics of what you're describing. One possible approach to this is to shuffle results from an initial query to worker nodes. Then the worker nodes persist the current working set locally. Then the next iteration starts from the worker nodes that persisted the working set. Each step in the traversal could be done like this. The effect would be that the graph traversal would *hop* from one set of workers to another set of workers with each iteration. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: Trunk >Reporter: Joel Bernstein > Labels: Graph > Fix For: Trunk > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073843#comment-15073843 ] Dennis Gove edited comment on SOLR-8176 at 12/29/15 12:10 PM: -- I've been thinking about this a little bit and one thing I keep coming back to is that there are different kinds of graph traversals and I think our model should take that into account. There are lots of types but I think the two major categories are node traversing graphs and edge traversing graphs. h3. Node Traversing Graphs These are graphs where you have some set of root nodes and you want to find connected nodes with some set of criteria. For example, given a collection of geographic locations (city, county, state, country) with fields "id", "type", "parentId", "name" find all cities in NY. As a hiccup the data is not completely normalized and some cities have their county listed as their parent while some have their state listed as their parent. Ie, you do not know how many nodes are between any given city and any given state. {code} graph( geography, root(q="type=state AND name:ny", fl="id"), leaf(q="type=city", fl="id,parentId,name"), edge("id=parentId") ) {code} In this example you're starting with a set of nodes in the geography collection, all which have some relationship to each other. You select your starting (root) nodes as all states named "ny" (there could be more than one). You then define what constitutes an ending (leaf) node as all cities. And finally, you say that all edges where nodeA.id == nodeB.parentId should be followed. This traversal can be implemented as a relatively simple iterative search following the form {code} frontier := search for all root nodes leaves := empty list while frontier is not empty frontierIds := list of ids of all nodes in frontier list leaves :append: search for all nodes whose parentId is in frontierIds and matches the leaf filter frontier := search for all nodes whose parentId is in frontierIds and does not match the leaf filter {code} In each iteration the leaves list can grow and the frontier list is replaced with the next set of nodes to consider. In the end you have a list of all leaf nodes which in some way connect to the original root nodes following the defined edge. Note that for simplicity I've left a couple of things out, including checking for already traversed nodes to avoid loops. Also, the leaf nodes are not added to the frontier but they can be. This would be useful in a situation where leaves are connected to leaves. h3. Edge Traversal Graphs These are graphs where you have some set of edges but the nodes themselves are relatively unimportant for traversal. For example, finding the shortest path between two nodes, or finding the minimum spanning tree for some set of nodes, or finding loops. was (Author: dpgove): I've been thinking about this a little bit and one thing I keep coming back to is that there are different kinds of graph traversals and I think our model should take that into account. There are lots of types but I think the two major categories are node traversing graphs and edge traversing graphs. h3. Node Traversing Graphs These are graphs where you have some set of root nodes and you want to find connected nodes with some set of criteria. For example, given a collection of geographic locations (city, county, state, country) with fields "id", "type", "parentId", "name" find all cities in NY. As a hiccup the data is not completely normalized and some cities have their county listed as their parent while some have their state listed as their parent. Ie, you do not know how many nodes are between any given city and any given state. {code} graph( geography, root(q="type=state AND name:ny", fl="id"), leaf(q="type=city", fl="id,parentId,name"), edge("id=parentId") ) {code} In this example you're starting with a set of nodes in the geography collection, all which have some relationship to each other. You select your starting (root) nodes as all states named "ny" (there could be more than one). You then define what constitutes an ending (leaf) node as all cities. And finally, you say that all edges where nodeA.id == nodeB.parentId should be followed. This traversal can be implemented as a relatively simple iterative search following the form {code} frontier := search for all root nodes leaves := empty list while frontier is not empty frontierIds := list of ids of all nodes in frontier list leaves :append: search for all nodes whose parentId is in frontierIds and matches the leaf filter frontier := search for all nodes whose parentId is in frontierIds and does not match the leaf filter {code} In each iteration the leaves list can grow and the frontier list is replaced with the next set of nodes to consider. In the end you have a list of all leaf nodes which in some way
[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004412#comment-15004412 ] Joel Bernstein edited comment on SOLR-8176 at 11/13/15 6:09 PM: I need to dig into the TinkerPop API. I think implementing Gremlin would be the desired end game. I see a distributed Gremlin implementation as another Parallel Computing problem, like the Parallel SQL interface. This is where the Streaming API comes in. If we can model graph traversals with the Streaming API then we can have a Gremlin parser that compiles to Streaming API objects. This was the approach taken with the SQL interface. So this ticket is really about laying the Parallel Computing framework for supporting graph traversals. I do agree that looking at TinkerPop will be very useful in understanding what to model. was (Author: joel.bernstein): I need to dig into the TinkerPop API. I think implementing Gremlin would be the desired end game. I see distributed Gremlin implementation as another Parallel Computing problem, like the Parallel SQL interface. This is where the Streaming API comes in. If we model graph traversals with the Streaming API then we can have a Gremlin parser that compiles to Streaming API objects. This was the approach taken with the SQL interface. So this ticket is really about laying the Parallel Computing framework for supporting graph traversals. Although I do agree that looking at TinkerPop will be very useful in understanding what to model. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: Trunk >Reporter: Joel Bernstein >Assignee: Joel Bernstein > Labels: Graph > Fix For: Trunk > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org