[jira] [Commented] (SOLR-11384) add support for distributed graph query

2019-08-27 Thread Komal (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916955#comment-16916955
 ] 

Komal commented on SOLR-11384:
--

[~kwatters] Thanks for the reply. I am eager to use this functionality in my 
current work. Given that we don't need caching at the moment, I would love the 
opportunity to chat and see how we might take this forward. Is your work 
already open source, or can you provide a patch here? That would help me 
understand it and do my homework before discussing the details further. 

> add support for distributed graph query
> ---
>
> Key: SOLR-11384
> URL: https://issues.apache.org/jira/browse/SOLR-11384
> Project: Solr
>  Issue Type: Improvement
>Reporter: Kevin Watters
>Priority: Minor
>
> Creating this ticket to track the work that I've done on distributed graph 
> traversal support in Solr.
>
> The current GraphQuery only works on a single core, which limits where it can 
> be used and complicates scaling it. I believe there's a strong desire for a 
> fully distributed way of running the Graph Query. I'm working on a patch; it's 
> not complete yet, but if anyone would like to look at the approach and the 
> implementation, I welcome feedback.
>
> The flow for the distributed graph query is almost exactly the same as the 
> normal graph query. The only difference is how it discovers the "frontier 
> query" at each level of the traversal.
>
> When a distributed graph query request comes in, each shard begins by running 
> the root query to find where to start on its shard. Each participating shard 
> then discovers its edges for the next hop. Those edges are broadcast to all 
> other participating shards. Each shard then receives all the parts of the 
> frontier query, assembles it, and executes it. This process continues on each 
> shard until there are no new edges left or the maxDepth of the traversal has 
> been reached.
>
> The approach is to introduce a FrontierBroker that resides as a singleton on 
> each of the Solr nodes in the cluster. When a graph query is created, it can 
> call getInstance() on the broker so it can listen for the frontier parts 
> coming in. Initially, I used an external Kafka broker to handle this, and it 
> worked pretty well. The new approach is to migrate the FrontierBroker to a 
> request handler in Solr, and potentially to use the SolrJ client to publish 
> the edges to each node in the cluster.
>
> There are a few outstanding design questions. First, how do we know which 
> shards are participating in the current query request? Is that information 
> easy to get at?
>
> Second, we currently serialize a query object between the shards. Perhaps we 
> should consider a slightly different abstraction and serialize lists of 
> "edge" objects between the nodes instead. The point would be to batch the 
> exploration/traversal of the current frontier and avoid large bursts of 
> memory being required.
>
> Third, what sort of caching strategy should be introduced for the frontier 
> queries, if any? And if we do cache there, how and when should the entries be 
> expired and auto-warmed?
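
For illustration only, here is a tiny, self-contained Java simulation of the 
frontier exchange described above. Each "shard" is just a map holding its slice 
of the edge list; it expands the current frontier locally and "broadcasts" the 
newly discovered node ids so the next frontier query can be assembled. All the 
names here are made up for the sketch and are not taken from the actual patch.

    import java.util.*;

    public class FrontierSimulation {
        public static void main(String[] args) {
            // Edge lists partitioned across two hypothetical shards:
            // key = node id, value = the node ids it points to.
            Map<String, List<String>> shard1 = Map.of("a", List.of("b"), "c", List.of("d"));
            Map<String, List<String>> shard2 = Map.of("b", List.of("c"), "d", List.of());
            List<Map<String, List<String>>> shards = List.of(shard1, shard2);

            Set<String> visited = new HashSet<>(List.of("a"));  // result of the root query
            Set<String> frontier = new HashSet<>(visited);      // current "frontier"
            int maxDepth = 10;

            for (int depth = 0; depth < maxDepth; depth++) {
                Set<String> broadcast = new HashSet<>();
                // Each shard expands the frontier against its local edges.
                for (Map<String, List<String>> shard : shards) {
                    for (String node : frontier) {
                        broadcast.addAll(shard.getOrDefault(node, List.of()));
                    }
                }
                broadcast.removeAll(visited);   // assemble the next frontier from all parts
                if (broadcast.isEmpty()) {
                    break;                      // no new edges anywhere -> traversal is done
                }
                visited.addAll(broadcast);
                frontier = broadcast;
            }
            System.out.println(new TreeSet<>(visited));  // prints [a, b, c, d]
        }
    }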






[jira] [Commented] (SOLR-11384) add support for distributed graph query

2019-08-27 Thread Kevin Watters (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916717#comment-16916717
 ] 

Kevin Watters commented on SOLR-11384:
--

[~erickerickson] Streaming expressions are fundamentally different in their 
semantics from the graph query. If there is renewed interest in this 
functionality, we can revisit it.

At the moment, we're in the process of building a new cross-collection join 
operator (XCJF, a cross-collection join filter). That work is a stepping stone 
toward a fully distributed graph traversal.

[~komal_vmware] if you have a use case, let's chat about it. I do have a 
version of the distributed graph query working locally, but I don't consider 
it ready for prime time due to a few pesky items related to caching.




[jira] [Commented] (SOLR-11384) add support for distributed graph query

2019-08-27 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916626#comment-16916626
 ] 

Erick Erickson commented on SOLR-11384:
---

[~kwatters] Echoing Jeroen's question. Should we close this and let streaming 
handle it?




[jira] [Commented] (SOLR-11384) add support for distributed graph query

2019-08-27 Thread Komal (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916581#comment-16916581
 ] 

Komal commented on SOLR-11384:
--

Any updates on this ticket? Do we finally have distributed graph traversal 
support in Solr? 




[jira] [Commented] (SOLR-11384) add support for distributed graph query

2018-05-01 Thread Gus Heck (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459678#comment-16459678
 ] 

Gus Heck commented on SOLR-11384:
-

This would make the existing 
[https://lucene.apache.org/solr/guide/7_3/other-parsers.html#graph-query-parser]
 work across multiple cores. That feature is useful for things like complex 
hierarchy-based security expressed as (cacheable) filter queries. Last I 
looked, streaming expressions can't be used as a filter on regular queries 
(though it's been some time since I looked), so they would need to be 
recalculated every time. 
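
As a sketch of the kind of filter Gus describes (the field and group names 
below are made up for illustration): if group documents carry an id field and a 
parent field pointing at the parent group, the existing graph parser can walk 
the hierarchy upward from a user's direct groups, and because it runs as an 
ordinary fq it is eligible for the filter cache:

    fq={!graph from=parent to=id maxDepth=10}id:(groupA OR groupB)

The root query matches the user's direct groups; each hop follows the parent 
value of the current frontier to the matching id, so the single cached filter 
ends up covering every ancestor group.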




[jira] [Commented] (SOLR-11384) add support for distributed graph query

2018-05-01 Thread Jeroen Steggink (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459653#comment-16459653
 ] 

Jeroen Steggink commented on SOLR-11384:


Is this still relevant? We have a streaming expression for graph traversal.
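
The expression in question is nodes(). A minimal example in the style of the 
ref guide, with an illustrative collection and field names, typically sent to 
the /stream handler:

    nodes(emails,
          walk="johndoe@apache.org->from",
          gather="to",
          scatter="branches,leaves")

This starts from the emails whose from field is johndoe@apache.org and gathers 
the addresses found in their to field; nesting nodes() expressions walks 
additional hops. As noted above, though, the result is a stream rather than a 
query, so it can't simply be dropped into an fq the way a graph query can.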
