Kevin Watters created SOLR-11384:
------------------------------------

             Summary: add support for distributed graph query
                 Key: SOLR-11384
                 URL: https://issues.apache.org/jira/browse/SOLR-11384
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Kevin Watters


Creating this ticket to track the work that I've done on the distributed graph 
traversal support in Solr.

The current GraphQuery only works on a single core, which limits where it can 
be used and adds complexity if you want to scale it.  I believe there's a 
strong desire to support a fully distributed method of doing the graph query.  
I'm working on a patch; it's not complete yet, but if anyone would like to have 
a look at the approach and implementation, I welcome feedback.

The flow for the distributed graph query is almost exactly the same as the 
normal graph query.  The only difference is how it discovers the "frontier 
query" at each level of the traversal.  

When a distributed graph query request comes in, each shard begins by running 
the root query to know where to start on its shard.  Each participating shard 
then discovers its edges for the next hop.  Those edges are then broadcast to 
all other participating shards.  Each shard then receives all the parts of the 
frontier query, assembles it, and executes it.

This process continues on each shard until there are no new edges left, or the 
maxDepth of the traversal has been reached.
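The per-level flow above can be sketched as a toy, single-process simulation.  
The shard data, node names, and maxDepth handling here are illustrative 
assumptions, not the actual patch; the point is only the loop structure: run 
the frontier against every shard, merge the discovered edges, repeat.

```java
import java.util.*;

// Toy simulation of the distributed traversal flow described above.
// The shard contents and node names are made-up illustrative data.
public class FrontierSimulation {

    // Each "shard" holds a slice of the graph: node -> outgoing edges.
    static final List<Map<String, List<String>>> SHARDS = List.of(
            Map.of("a", List.of("b"), "c", List.of("d")),
            Map.of("b", List.of("c"), "d", List.of()));

    public static Set<String> traverse(Set<String> rootNodes, int maxDepth) {
        Set<String> visited = new HashSet<>(rootNodes);
        Set<String> frontier = rootNodes;
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Set<String> nextFrontier = new HashSet<>();
            // 1. Each shard discovers the edges of the current frontier...
            for (Map<String, List<String>> shard : SHARDS) {
                for (String node : frontier) {
                    for (String edge : shard.getOrDefault(node, List.of())) {
                        // 2. ...and "broadcasts" them; here we merge locally.
                        if (visited.add(edge)) {
                            nextFrontier.add(edge);
                        }
                    }
                }
            }
            // 3. The merged edges become the next frontier query.
            frontier = nextFrontier;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(traverse(Set.of("a"), 10));
    }
}
```

In the real distributed case, step 2 is the broadcast/receive exchange between 
nodes; the termination condition (empty frontier or maxDepth) is the same.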

The approach is to introduce a FrontierBroker that resides as a singleton on 
each of the Solr nodes in the cluster.  When a graph query is created, it can 
call getInstance() on the broker to listen for the frontier parts coming in.
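A minimal sketch of that singleton, assuming the internals are a queue of 
frontier parts keyed by query id (the class and getInstance() come from the 
description above; publish/drain and the queue-per-query layout are my 
assumptions, not the patch):

```java
import java.util.*;
import java.util.concurrent.*;

// Hedged sketch of the FrontierBroker idea: one singleton per Solr node
// that graph queries look up via getInstance() to receive frontier parts.
public class FrontierBroker {

    private static final FrontierBroker INSTANCE = new FrontierBroker();

    // One queue of incoming frontier parts per in-flight graph query.
    private final ConcurrentMap<String, BlockingQueue<String>> queues =
            new ConcurrentHashMap<>();

    private FrontierBroker() {}

    public static FrontierBroker getInstance() {
        return INSTANCE;
    }

    // Called when another shard broadcasts a piece of the frontier query.
    public void publish(String queryId, String frontierPart) {
        queues.computeIfAbsent(queryId, id -> new LinkedBlockingQueue<>())
              .add(frontierPart);
    }

    // Called by the local graph query to collect the parts for one hop.
    public List<String> drain(String queryId) {
        List<String> parts = new ArrayList<>();
        BlockingQueue<String> q = queues.remove(queryId);
        if (q != null) q.drainTo(parts);
        return parts;
    }
}
```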

Initially, I was using an external Kafka broker to handle this, and it did work 
pretty well.  The new approach is migrating the FrontierBroker to be a request 
handler in Solr, and potentially to use the SolrJ client to publish the edges 
to each node in the cluster.

There are a few outstanding design questions.  First, how do we determine the 
list of shards participating in the current query request?  Is that 
information easy to get at?

Second, currently we are serializing a query object between the shards; 
perhaps we should consider a slightly different abstraction and serialize 
lists of "edge" objects between the nodes.  The point of this would be to 
batch the exploration/traversal of the current frontier to help avoid large 
bursts of memory being required.
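A sketch of that edge-list alternative: chunk the discovered edges into 
fixed-size batches so no single message (or the buffer needed to build it) 
grows with the full frontier.  The Edge record and batch size here are 
illustrative assumptions.

```java
import java.util.*;

// Sketch of serializing lists of "edge" objects in bounded batches
// instead of one monolithic frontier query object.
public class EdgeBatcher {

    // Hypothetical edge representation; the real shape is an open question.
    public record Edge(String from, String to) {}

    public static List<List<Edge>> batch(List<Edge> edges, int batchSize) {
        List<List<Edge>> batches = new ArrayList<>();
        for (int i = 0; i < edges.size(); i += batchSize) {
            batches.add(edges.subList(i, Math.min(i + batchSize, edges.size())));
        }
        return batches;
    }
}
```

Each batch could then be shipped (and turned into a partial frontier query) 
independently, keeping peak memory proportional to the batch size rather than 
the frontier size.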

Third, what sort of caching strategy should be introduced for the frontier 
queries, if any?  And if we do cache there, how and when should the entries be 
expired and auto-warmed?
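One simple option for that cache, offered only as a starting point: a bounded 
LRU map of assembled frontier queries, evicting the least recently used entry 
once a size limit is hit.  The key/value types and capacity are assumptions 
for illustration; expiry and auto-warming would still need a policy on top.

```java
import java.util.*;

// Minimal LRU cache sketch for assembled frontier queries, built on
// LinkedHashMap's access-order mode and removeEldestEntry hook.
public class FrontierQueryCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public FrontierQueryCache(int maxEntries) {
        super(16, 0.75f, true); // access-order = true gives LRU iteration
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once we exceed the bound.
        return size() > maxEntries;
    }
}
```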

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
