[jira] [Updated] (SOLR-8925) Add gatherNodes Streaming Expression to support breadth first traversals

Joel Bernstein (JIRA) Mon, 11 Apr 2016 13:05:10 -0700

     [ https://issues.apache.org/jira/browse/SOLR-8925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joel Bernstein updated SOLR-8925:
---------------------------------
    Description: 
The gatherNodes Streaming Expression is a flexible, general-purpose 
breadth-first graph traversal. It uses the same parallel join under the covers 
as (SOLR-8888) but is much more generalized and can be used for a wide range of 
use cases.

Sample syntax:

{code}
gatherNodes(friends,
            gatherNodes(friends,
                        search(articles, q="body:(queryA)", fl="author"),
                        walk="author->user",
                        gather="friend"),
            walk="friend->user",
            gather="friend",
            scatter="roots, branches, leaves")
{code}


The expression above is evaluated as follows:

1) The inner search() expression is evaluated on the *articles* collection, 
emitting a stream of Tuples with the author field populated.
2) The inner gatherNodes() expression reads the Tuples from the search() stream 
and traverses to the *friends* collection by performing a distributed join 
between the articles.author and friends.user fields. It gathers the values of 
the *friend* field during the join.
3) The inner gatherNodes() expression then emits the *friend* Tuples. By 
default, gatherNodes emits only the leaves, which in this case are the 
*friend* Tuples.
4) The outer gatherNodes() expression reads the *friend* Tuples and traverses 
the *friends* collection again, this time joining the *friend* values emitted 
in step 3 to the friends.user field. This collects the friends of friends.
5) The outer gatherNodes() expression emits the entire graph that was 
collected. This is controlled by the *scatter* parameter. In the example, the 
*root* nodes are the authors, the *branches* are the authors' friends, and the 
*leaves* are the friends of friends.

This traversal is fully distributed and works across collections.
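
The evaluation steps above can be sketched in a few lines of Python. This is only an illustration of the two-hop breadth-first logic, not the Solr implementation: the in-memory lists, the sample names and the gather_nodes helper are all hypothetical, and the sketch leaves out distribution across shards and any handling of cycles.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the *articles* and *friends* collections.
articles = [
    {"author": "alice", "body": "queryA"},
    {"author": "bob", "body": "queryA"},
    {"author": "carol", "body": "other"},
]
friends = [
    {"user": "alice", "friend": "dave"},
    {"user": "bob", "friend": "dave"},
    {"user": "bob", "friend": "erin"},
    {"user": "dave", "friend": "frank"},
    {"user": "erin", "friend": "grace"},
]


def gather_nodes(edges, frontier, walk_to, gather):
    """One breadth-first hop: join the frontier values to edges[walk_to]
    and collect the values of edges[gather] from the matching rows."""
    index = defaultdict(list)
    for edge in edges:
        index[edge[walk_to]].append(edge[gather])
    gathered = set()
    for node in frontier:
        gathered.update(index.get(node, ()))
    return gathered


# Step 1: the search() over articles yields the root nodes (authors).
roots = {a["author"] for a in articles if "queryA" in a["body"]}
# Steps 2-3: walk author->user and gather friend (the branches).
branches = gather_nodes(friends, roots, walk_to="user", gather="friend")
# Step 4: walk friend->user and gather friend again (the leaves).
leaves = gather_nodes(friends, branches, walk_to="user", gather="friend")
# Step 5: scatter="roots, branches, leaves" returns every level.
graph = {"roots": roots, "branches": branches, "leaves": leaves}
```

Each call to gather_nodes corresponds to one level of the traversal, so nesting two calls mirrors the nested gatherNodes() expressions in the sample syntax.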

*Aggregations* are also supported during the traversal. This can be useful for 
making recommendations based on co-occurrence counts.

Sample syntax:

{code}
top(
    gatherNodes(baskets,
                search(baskets, q="prodid:X", fl="basketid", rows="500", sort="random_7897987 asc"),
                walk="basketid->basketid",
                gather="prodid",
                fl="prodid, price",
                count(*),
                avg(price)),
    n=4,
    sort="count(*) desc, avg(price) asc")
{code}

In the expression above, the inner search() function searches the *baskets* 
collection for 500 random basketids that contain prodid X.

gatherNodes then traverses the *baskets* collection and gathers all the prodids 
for the selected basketids. It aggregates the count and average price for each 
prodid collected. The count reflects the co-occurrence count of each product 
with prodid X. The outer *top* expression selects the top 4 prodids emitted 
from gatherNodes, ordered by descending count and then by ascending average 
price.
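
As a rough illustration of the aggregation described above, the following Python sketch computes co-occurrence counts and average prices over a small in-memory list. The rows, field values and the co_occurrence function are hypothetical. The sketch skips the random sampling of 500 baskets, and it also counts the seed product itself, which may or may not match what the expression emits.

```python
from collections import defaultdict

# Hypothetical rows standing in for documents in the *baskets* collection.
baskets = [
    {"basketid": "b1", "prodid": "X", "price": 5.0},
    {"basketid": "b1", "prodid": "Y", "price": 2.0},
    {"basketid": "b1", "prodid": "Z", "price": 9.0},
    {"basketid": "b2", "prodid": "X", "price": 5.0},
    {"basketid": "b2", "prodid": "Y", "price": 4.0},
    {"basketid": "b3", "prodid": "Z", "price": 9.0},
]


def co_occurrence(rows, seed_prodid, n):
    # search(): find the baskets that contain the seed product.
    seed_baskets = {r["basketid"] for r in rows if r["prodid"] == seed_prodid}

    # gatherNodes(): walk basketid->basketid, gather prodid, and keep a
    # running count and price sum per gathered prodid.
    counts = defaultdict(int)
    price_sums = defaultdict(float)
    for r in rows:
        if r["basketid"] in seed_baskets:
            counts[r["prodid"]] += 1
            price_sums[r["prodid"]] += r["price"]

    stats = [
        {"prodid": p, "count": c, "avg_price": price_sums[p] / c}
        for p, c in counts.items()
    ]
    # top(): sort by count descending, then average price ascending.
    stats.sort(key=lambda s: (-s["count"], s["avg_price"]))
    return stats[:n]


recommendations = co_occurrence(baskets, "X", n=4)
```

With this sample data, Y and X both occur in two of the selected baskets, and Y sorts first because its average price is lower.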

Like all streaming expressions, the gatherNodes expression can be combined with 
other streaming expressions. For example, the following expression uses a 
hash join to intersect the networks of friends rooted at authors found with 
different queries:

{code}
hashInnerJoin(
    gatherNodes(friends,
                gatherNodes(friends,
                            search(articles, q="body:(queryA)", fl="author"),
                            walk="author->user",
                            gather="friend"),
                walk="friend->user",
                gather="friend",
                scatter="branches, leaves"),
    gatherNodes(friends,
                gatherNodes(friends,
                            search(articles, q="body:(queryB)", fl="author"),
                            walk="author->user",
                            gather="friend"),
                walk="friend->user",
                gather="friend",
                scatter="branches, leaves"),
    on="friend")
{code}
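
The intersection performed by the join can be pictured with a small Python sketch. It assumes each traversal has already produced a list of Tuples carrying a friend field. The hash_inner_join function and the sample Tuples are hypothetical and only show the general hash-join technique, not how Solr implements it.

```python
def hash_inner_join(left, right, on):
    """Inner join two lists of Tuples (dicts) on a shared field by
    building a hash table over the right side and probing it with the left."""
    hashed = {}
    for row in right:
        hashed.setdefault(row[on], []).append(row)

    joined = []
    for row in left:
        for match in hashed.get(row[on], []):
            # Merge the two Tuples; fields from the left side win on conflict.
            joined.append({**match, **row})
    return joined


# Hypothetical outputs of the two nested gatherNodes() traversals.
network_a = [{"friend": "dave"}, {"friend": "erin"}, {"friend": "frank"}]
network_b = [{"friend": "erin"}, {"friend": "frank"}, {"friend": "heidi"}]

common = hash_inner_join(network_a, network_b, on="friend")
```

Only the friends present in both networks survive the join, which is the intersection the expression is after.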






  


> Add gatherNodes Streaming Expression to support breadth first traversals
> ------------------------------------------------------------------------
>
>                 Key: SOLR-8925
>                 URL: https://issues.apache.org/jira/browse/SOLR-8925
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>             Fix For: 6.1
>
>         Attachments: SOLR-8925.patch



