[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-03-29 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212525#comment-15212525
 ] 

Kevin Watters edited comment on SOLR-8176 at 3/29/16 2:19 PM:
--

Here's a patch with a basic implementation of a Kafka based frontier query 
broker to support distributed graph query traversal in Solr.  The unit test is 
commented out because it requires a Kafka broker to be running.  Also, there's 
some config parameters / properties that are hard coded.  Either way, this 
shows how to use the GraphQuery in a distributed graph traversal mode. 

Disclaimer:  this patch isn't intended to be merged, it's really only an 
example of how to do it.  there's a lot of cleanup that still needs to happen 
to make it ready for primetime.


was (Author: kwatters):
Here's a patch with a basic implementation of a Kafka based frontier query 
broker to support distributed graph query traversal in Solr.  The unit test is 
commented out because it requires a Kafka broker to be running.  Also, there's 
some config parameters / properties that are hard coded.  Either way, this 
shows how to use the GraphQuery in a distributed graph traversal mode. 

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: master
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: master
>
> Attachments: SOLR-8176.patch
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-03-23 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208699#comment-15208699
 ] 

Joel Bernstein edited comment on SOLR-8176 at 3/23/16 4:48 PM:
---

In regards to the tinkerpop discussions on this ticket. I think it makes sense 
to first add some Streaming Expressions that model some basic graph traversal 
use cases. This will iron out some of the mechanics involved in doing graph 
traversals. This will start to build the foundation for supporting the Gremlin 
query language. This was the same approach taken when the parallel relational 
algebra expressions came first and became the foundation of the SQL interface. 
This approach achieves a number of things:

1) It doesn't put the cart before the horse. In order to properly support 
Gremlin we need a scalable distributed graph traversal capability. Once we have 
that adding Gremlin or SparkQL support will be much easier then trying to 
bootstrap graph capabilities while working on Gremlin at the same time.

2) We can then use Gremlin as a driver for building out the full range of graph 
traversal streaming expressions. The SQL interface plays this role for the 
parallel relational algebra Streaming Expressions. 

3)  Building out all Graph traversals as streaming expression means that we can 
than directly plug in the graph expressions with the existing expression 
library. 

4) Each Graph Streaming Expression can be released on it's own providing useful 
functionality incrementally. 


was (Author: joel.bernstein):
In regards to the tinkerpop discussions on this ticket. I think it makes sense 
to first add some Streaming Expressions that model some basic graph traversal 
use cases. This will iron out some of the mechanics involved in doing graph 
traversals. This will start to build the foundation for supporting the Gremlin 
query language. This was the same approach taken when the parallel relational 
algebra expressions came first and became the foundation of the SQL interface. 
This approach achieves a number of things:

1) It doesn't put the cart before the horse. In order to properly support 
Gremlin we need a scalable distributed graph traversal capability. Once we have 
that adding Gremlin or SparkQL support will be much easier then trying to 
bootstrap graph capabilities while working on Gremlin at the same time.

2) We can then use Gremlin as a driver for building out the full range of graph 
traversal streaming expressions. The SQL interface plays this role for the 
parallel relational algebra Streaming Expressions. 

3)  Building out all Graph traversals as streaming expression means that we can 
than directly plug in the graph expressions with the existing expression 
library. 

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: master
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: master
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-03-23 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208699#comment-15208699
 ] 

Joel Bernstein edited comment on SOLR-8176 at 3/23/16 4:27 PM:
---

In regards to the tinkerpop discussions on this ticket. I think it makes sense 
to first add some Streaming Expressions that model some basic graph traversal 
use cases. This will iron out some of the mechanics involved in doing graph 
traversals. This will start to build the foundation for supporting the Gremlin 
query language. This was the same approach taken when the parallel relational 
algebra came first and became the foundation of the SQL interface. This 
approach achieves a number of things:

1) It doesn't put the cart before the horse. In order to properly support 
Gremlin we need a scalable distributed graph traversal capability. Once we have 
that adding Gremlin or SparkQL support will be much easier then trying to 
bootstrap graph capabilities while working on Gremlin at the same time.

2) We can then use Gremlin as a driver for building out the full range of graph 
traversal streaming expressions. The SQL interface plays this role for the 
parallel relational algebra Streaming Expressions. 

3)  Building out all Graph traversals as streaming expression means that we can 
than directly plug in the graph expressions with the existing expression 
library. 


was (Author: joel.bernstein):
In regards to the tinkerpop discussions on this ticket. I'm planning on first 
adding some Streaming Expressions that model some basic graph traversal use 
cases. This will iron out some of the mechanics involved in doing graph 
traversals. This will start to build the foundation for supporting the Gremlin 
query language. This was the same approach taken when the parallel relational 
algebra came first and became the foundation of the SQL interface. This 
approach achieves a number of things:

1) It doesn't put the cart before the horse. In order to properly support 
Gremlin we need a scalable distributed graph traversal capability. Once we have 
that adding Gremlin or SparkQL support will be much easier then trying to 
bootstrap graph capabilities while working on Gremlin at the same time.

2) We can then use Gremlin as a driver for building out the full range of graph 
traversal streaming expressions. The SQL interface plays this role for the 
parallel relational algebra Streaming Expressions. 

3)  Building out all Graph traversals as streaming expression means that we can 
than directly plug in the graph expressions with the existing expression 
library. 

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: master
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: master
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-03-23 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208699#comment-15208699
 ] 

Joel Bernstein edited comment on SOLR-8176 at 3/23/16 4:28 PM:
---

In regards to the tinkerpop discussions on this ticket. I think it makes sense 
to first add some Streaming Expressions that model some basic graph traversal 
use cases. This will iron out some of the mechanics involved in doing graph 
traversals. This will start to build the foundation for supporting the Gremlin 
query language. This was the same approach taken when the parallel relational 
algebra expressions came first and became the foundation of the SQL interface. 
This approach achieves a number of things:

1) It doesn't put the cart before the horse. In order to properly support 
Gremlin we need a scalable distributed graph traversal capability. Once we have 
that adding Gremlin or SparkQL support will be much easier then trying to 
bootstrap graph capabilities while working on Gremlin at the same time.

2) We can then use Gremlin as a driver for building out the full range of graph 
traversal streaming expressions. The SQL interface plays this role for the 
parallel relational algebra Streaming Expressions. 

3)  Building out all Graph traversals as streaming expression means that we can 
than directly plug in the graph expressions with the existing expression 
library. 


was (Author: joel.bernstein):
In regards to the tinkerpop discussions on this ticket. I think it makes sense 
to first add some Streaming Expressions that model some basic graph traversal 
use cases. This will iron out some of the mechanics involved in doing graph 
traversals. This will start to build the foundation for supporting the Gremlin 
query language. This was the same approach taken when the parallel relational 
algebra came first and became the foundation of the SQL interface. This 
approach achieves a number of things:

1) It doesn't put the cart before the horse. In order to properly support 
Gremlin we need a scalable distributed graph traversal capability. Once we have 
that adding Gremlin or SparkQL support will be much easier then trying to 
bootstrap graph capabilities while working on Gremlin at the same time.

2) We can then use Gremlin as a driver for building out the full range of graph 
traversal streaming expressions. The SQL interface plays this role for the 
parallel relational algebra Streaming Expressions. 

3)  Building out all Graph traversals as streaming expression means that we can 
than directly plug in the graph expressions with the existing expression 
library. 

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: master
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: master
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-01-13 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096707#comment-15096707
 ] 

Joel Bernstein edited comment on SOLR-8176 at 1/13/16 6:15 PM:
---

Thinking some about the mechanics of what you're describing.

One possible approach to this is to implement each iteration as a parallel 
join. The initial join would shuffle results from the initial query to worker 
nodes. Then the worker nodes would persist the current working set locally. 
Then the next iteration starts from the worker nodes that persisted the working 
set. Each step in the traversal could be done like this. The effect would be 
that the graph traversal would *hop* from one set of workers to another set of 
workers with each iteration.



was (Author: joel.bernstein):
Thinking some about the machanics of what you're describing.

One possible approach to this is to shuffle results from an initial query to 
worker nodes. Then the worker nodes persist the current working set locally. 
Then the next iteration starts from the worker nodes that persisted the working 
set. Each step in the traversal could be done like this. The effect would be 
that the graph traversal would *hop* from one set of workers to another set of 
workers with each iteration.

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: Trunk
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: Trunk
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2015-12-29 Thread Dennis Gove (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073843#comment-15073843
 ] 

Dennis Gove edited comment on SOLR-8176 at 12/29/15 12:10 PM:
--

I've been thinking about this a little bit and one thing I keep coming back to 
is that there are different kinds of graph traversals and I think our model 
should take that into account. There are lots of types but I think the two 
major categories are node traversing graphs and edge traversing graphs. 

h3. Node Traversing Graphs
These are graphs where you have some set of root nodes and you want to find 
connected nodes with some set of criteria. For example, given a collection of 
geographic locations (city, county, state, country) with fields "id", "type", 
"parentId", "name" find all cities in NY. As a hiccup the data is not 
completely normalized and some cities have their county listed as their parent 
while some have their state listed as their parent. Ie, you do not know how 
many nodes are between any given city and any given state.
{code}
graph(
  geography,
  root(q="type=state AND name:ny", fl="id"),
  leaf(q="type=city", fl="id,parentId,name"),
  edge("id=parentId")
)
{code}
In this example you're starting with a set of nodes in the geography 
collection, all which have some relationship to each other. You select your 
starting (root) nodes as all states named "ny" (there could be more than one). 
You then define what constitutes an ending (leaf) node as all cities. And 
finally, you say that all edges where nodeA.id == nodeB.parentId should be 
followed.

This traversal can be implemented as a relatively simple iterative search 
following the form
{code}
frontier := search for all root nodes
leaves := empty list

while frontier is not empty
  frontierIds := list of ids of all nodes in frontier list
  leaves :append: search for all nodes whose parentId is in frontierIds and 
matches the leaf filter
  frontier := search for all nodes whose parentId is in frontierIds and does 
not match the leaf filter

{code}
In each iteration the leaves list can grow and the frontier list is replaced 
with the next set of nodes to consider. In the end you have a list of all leaf 
nodes which in some way connect to the original root nodes following the 
defined edge. Note that for simplicity I've left a couple of things out, 
including checking for already traversed nodes to avoid loops. Also, the leaf 
nodes are not added to the frontier but they can be. This would be useful in a 
situation where leaves are connected to leaves.

h3. Edge Traversal Graphs
These are graphs where you have some set of edges but the nodes themselves are 
relatively unimportant for traversal. For example, finding the shortest path 
between two nodes, or finding the minimum spanning tree for some set of nodes, 
or finding loops.


was (Author: dpgove):
I've been thinking about this a little bit and one thing I keep coming back to 
is that there are different kinds of graph traversals and I think our model 
should take that into account. There are lots of types but I think the two 
major categories are node traversing graphs and edge traversing graphs. 

h3. Node Traversing Graphs
These are graphs where you have some set of root nodes and you want to find 
connected nodes with some set of criteria. For example, given a collection of 
geographic locations (city, county, state, country) with fields "id", "type", 
"parentId", "name" find all cities in NY. As a hiccup the data is not 
completely normalized and some cities have their county listed as their parent 
while some have their state listed as their parent. Ie, you do not know how 
many nodes are between any given city and any given state.
{code}
graph(
  geography,
  root(q="type=state AND name:ny", fl="id"),
  leaf(q="type=city", fl="id,parentId,name"),
  edge("id=parentId")
)
{code}
In this example you're starting with a set of nodes in the geography 
collection, all which have some relationship to each other. You select your 
starting (root) nodes as all states named "ny" (there could be more than one). 
You then define what constitutes an ending (leaf) node as all cities. And 
finally, you say that all edges where nodeA.id == nodeB.parentId should be 
followed.

This traversal can be implemented as a relatively simple iterative search 
following the form
{code}
frontier := search for all root nodes
leaves := empty list

while frontier is not empty
  frontierIds := list of ids of all nodes in frontier list
  leaves :append: search for all nodes whose parentId is in frontierIds and 
matches the leaf filter
  frontier := search for all nodes whose parentId is in frontierIds and does 
not match the leaf filter

{code}
In each iteration the leaves list can grow and the frontier list is replaced 
with the next set of nodes to consider. In the end you have a list of all leaf 
nodes which in some way 

[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2015-11-13 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004412#comment-15004412
 ] 

Joel Bernstein edited comment on SOLR-8176 at 11/13/15 6:09 PM:


I need to dig into the TinkerPop API. I think implementing Gremlin would be the 
desired end game. 

I see a distributed Gremlin implementation as another Parallel Computing 
problem, like the Parallel SQL interface. This is where the Streaming API comes 
in. If we can model graph traversals with the Streaming API then we can have a 
Gremlin parser that compiles to Streaming API objects. This was the approach 
taken with the SQL interface.

So this ticket is really about laying the Parallel Computing framework for 
supporting graph traversals. 

I do agree that looking at TinkerPop will be very useful in understanding what 
to model.


was (Author: joel.bernstein):
I need to dig into the TinkerPop API. I think implementing Gremlin would be the 
desired end game. 

I see distributed Gremlin implementation as another Parallel Computing problem, 
like the Parallel SQL interface. This is where the Streaming API comes in. If 
we model graph traversals with the Streaming API then we can have a Gremlin 
parser that compiles to Streaming API objects. This was the approach taken with 
the SQL interface.

So this ticket is really about laying the Parallel Computing framework for 
supporting graph traversals. 

Although I do agree that looking at TinkerPop will be very useful in 
understanding what to model.

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: Trunk
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>  Labels: Graph
> Fix For: Trunk
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org