[jira] [Commented] (SOLR-9480) Graph Traversal for Significantly Related Terms (Semantic Knowledge Graph)

Hoss Man (JIRA) Mon, 07 May 2018 17:24:07 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466664#comment-16466664
 ]


Hoss Man commented on SOLR-9480:
--------------------------------

I've been playing with this "SKG" code off and on for a bit, and talking with 
Trey about it offline, and I think I've come up with a nice strategy for 
integrating it directly into JSON Faceting as a handful of small improvements, 
so that it can leverage most of the existing JSON Facet code (including 
distributed refinement) w/o needing to be "tacked on" to the outside as much as 
the code in the older patch/github-repo does.

The attached patch includes the 3 most "important" of these improvements in 
order for this to work, along with lots of new tests...
 # Refactoring the {{SlotAcc}} so that every time it's asked to {{collect(...)}} 
a slot, it has the ability to ask for a {{Query}} that identifies this slot 
_independent of the current context_
 ** This allows the new SKG code (described below) to have enough information 
about the current bucket that it can compute the full "foreground" that it 
wants, regardless of the bucket type or how the current bucket may be nested 
under other facets ... so SKG graphs can be built by nesting any types of 
facets (including range & query facets) regardless of how the top level 
{{q + fq}} params relate to the {{foreground}} query.
 ** this happens via an {{IntFunction<Query>}} callback function – so there 
shouldn't be any overhead to existing FacetProcessor/SlotAcc usages that don't 
care about this extra info about the bucket – there's no extra {{TermQuery}} 
(or {{RangeQuery}} etc...) overhead when the accumulators only care about the 
final filtered set of documents for the current bucket/slot.

 # Add an {{skg(...)}} AggValueSource function that can be nested under any 
facet
 ** this function takes in the "foreground" and "background" queries to use, 
which just like any existing (aggregate) function can be {{$variables}} 
pointing to existing request params
 ** this means that, unlike the original SKG code linked from this issue, you 
could compute the SKG relatedness info only at certain points in the facet 
hierarchy, or use different foreground/background queries in different places
 ** this function actually produces JSON "objects" as the function result, 
containing the foreground/background popularities as well as the "relatedness" 
score – which is what's used if you sort on this function
 ** I originally experimented with implementing "SKG" as a new type of _facet_ 
that could be nested under any (other) facet, but implementing it as a function 
means that we can leverage the existing code for sorting (parent) facet buckets 
by the (child) function's results – which is very powerful for SKG results (and 
it's not currently possible to "sort" on the results of a sub-facet, and doing 
so would be a lot of work given how sub-facet refinement is currently handled 
... i looked into it briefly)
 ** but sorting on the {{skg()}} function is optional, and not strictly 
necessary when clients care more about performance than accuracy – as with 
the existing SKG code Trey contributed, the (default) sort on facet count could 
still be used, which means the existing JSON faceting code would only compute 
the (semi-expensive) {{skg()}} function on the final buckets to be returned, 
and the client could then post-process to re-sort them by the {{skg()}} values.

 # Add support for an "explicit query domain" via syntax like {{domain : 
\{query:'foo:bar'\}}}  (or any other JSON query syntax supported by the 
{{filters}} option) that lets you arbitrarily pick any set of queries you want 
to use as a "domain" for a facet, regardless of its parent facets/bucket.
 ** this provides an optional way to improve the "top n" accuracy of sub-facets 
in a deep SKG request, by letting you ignore the "ancestor facet bucket 
filtering" typically done in faceting, and instead request that *all* buckets 
under some arbitrary query – like the original background query – be considered.
 ** SKG users that care more about speed & approximations can ignore this 
feature, and just sort the regular facet terms by the {{skg()}} function to get 
a good approximation of the top terms ... or as I mentioned before: trust the 
(default) sort on facet counts (w/ or w/o using the {{$background_q}} as an 
explicit domain) to approximate the top N terms.
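
To make the first item above more concrete, here is a rough sketch of the "slot query callback" idea, written in Python rather than the Java of the actual patch. Everything in it is hypothetical ({{term_slot_query}}, {{CountAcc}}, {{ForegroundAcc}}, and queries represented as plain strings); it only illustrates that an accumulator which ignores the callback never pays for building the per-slot query, while one that needs the bucket's identity can ask for it.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a "query" is just a string here; the real code deals in Lucene Query objects.
SlotQuery = Callable[[int], str]


def term_slot_query(field: str, terms: Sequence[str], built: List[str]) -> SlotQuery:
    """Return a callback that builds the query for a slot only when invoked."""
    def build(slot: int) -> str:
        query = f"{field}:{terms[slot]}"
        built.append(query)  # record every query that actually gets constructed
        return query
    return build


class CountAcc:
    """Only cares about the filtered docs for the slot; never calls the callback."""
    def __init__(self) -> None:
        self.counts: Dict[int, int] = {}

    def collect(self, docs: Sequence[int], slot: int, slot_query: SlotQuery) -> None:
        self.counts[slot] = len(docs)


class ForegroundAcc:
    """Needs the bucket's identity, independent of context, to combine with a foreground."""
    def __init__(self, foreground: str) -> None:
        self.foreground = foreground
        self.filters: Dict[int, str] = {}

    def collect(self, docs: Sequence[int], slot: int, slot_query: SlotQuery) -> None:
        self.filters[slot] = f"+({self.foreground}) +({slot_query(slot)})"


built: List[str] = []
callback = term_slot_query("tags", ["wizards", "owls"], built)

count_acc = CountAcc()
count_acc.collect([1, 2, 3], 0, callback)
queries_built_by_count_acc = len(built)  # stays 0: the callback was never invoked

fg_acc = ForegroundAcc('body:"harry potter"')
fg_acc.collect([1, 2, 3], 0, callback)
```

The design point is simply that the cost of identifying a bucket is paid on demand, by the accumulators that want it.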

An example of what all these features together can look like right now...
{noformat}
rows=0&
q=type:QUESTION&
fore=body:%22harry+potter%22&
back=*:*&
json.facet='{
  tags : {
    type : terms,
    field : tags,
    limit : 5,
    sort : { skg: desc },
    facet : {
      skg : "skg($fore,$back)",
      body : {
        type : terms,
        field : body,
        limit : 5,
        domain : { query:{param:back} },
        sort : { skg: desc },
        facet : {
          skg : "skg($fore,$back)"
        }
      }
    }
  }
}'
{noformat}
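
For clients that assemble requests programmatically, the same request could be built roughly like this. It is only a sketch using the Python standard library: the helper name {{skg_terms_facet}} is made up for illustration, and the facet is emitted as strict JSON (quoted keys) rather than the relaxed syntax shown above.

```python
import json
from typing import Optional
from urllib.parse import parse_qs, urlencode


def skg_terms_facet(field: str, limit: int,
                    domain_param: Optional[str] = None,
                    children: Optional[dict] = None) -> dict:
    """Terms facet sorted by a nested skg($fore,$back) function (hypothetical helper)."""
    facet = {"skg": "skg($fore,$back)"}
    facet.update(children or {})
    node = {
        "type": "terms",
        "field": field,
        "limit": limit,
        "sort": {"skg": "desc"},
        "facet": facet,
    }
    if domain_param is not None:
        # explicit query domain, pointing at a request param by name
        node["domain"] = {"query": {"param": domain_param}}
    return node


body_facet = skg_terms_facet("body", 5, domain_param="back")
json_facet = {"tags": skg_terms_facet("tags", 5, children={"body": body_facet})}

params = {
    "rows": "0",
    "q": "type:QUESTION",
    "fore": 'body:"harry potter"',
    "back": "*:*",
    "json.facet": json.dumps(json_facet),
}
query_string = urlencode(params)

# Round-trip check: decoding the query string gives back the same facet tree.
decoded = parse_qs(query_string)
decoded_facet = json.loads(decoded["json.facet"][0])
```
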
There are still lots of things not included in the patch that could be added 
later to make all of this better and/or easier to use – and in most cases would 
be general improvements to JSON Faceting...
 * As noted in some {{TODO}} comments, I would love to enhance the syntax of 
the {{skg()}} function in a couple of ways...
 ** making the queries optional, and inheriting them from "ancestor" function 
instances higher up the tree...
{noformat}
{
  tags : {
    type : terms,
    field : tags,
    facet : {
      skg : "skg($fore,$back)",
      body : {
        type : terms,
        field : body,
        facet : {
          // inherits the $fore/$back queries from the 'skg' function of the parent facet
          skg : "skg()"
        }
      }
    }
  }
}
{noformat}

 ** I'd also like to improve the way JSON Facet functions are parsed – along 
the lines of what's described in SOLR-11709 – in order to support more 
"optional" args that could be used by {{skg()}} to override some of its 
default behavior...
 *** this would be implemented under the covers by passing the extra map keys 
as the "localParams" for the ValueSourceParser
 *** Example: telling {{skg()}} that its effective "sort" value should be 
based on the "foreground_pop" instead of the (default) "relatedness"...
{noformat}
tags : {
  type : terms,
  field : tags,
  sort : "skg desc",
  facet : {
    skg : { type : func,
            func : "skg($fore,$back)",
            sort_value : foreground_pop }
  }
}
{noformat}

 *** this could also be used to implement a {{min_pop}} type value, that could 
be used to configure the {{skg()}} function to return a relatedness of 
{{-Infinity}} for any bucket that didn't have foreground/background popularity 
ratios at least as high as some user specified value.
 * Similar to how the {{rerank}} request param allows people to collect & score 
documents using a "cheap" query, and then re-score the top N using a more 
expensive query, I think it would be handy if JSON Facets supported a 
{{resort}} option that could be used on any {{FacetRequestSorted}} instance 
right alongside the {{sort}} param, using the same JSON syntax, so that 
clients could have Solr internally sort all the facet buckets by something 
simple (like count) and then "Re-Sort" the top {{N=limit}} (or maybe 
{{N=limit+overrequest}}?) using a more expensive function like {{skg()}}

...however, I think most of this would be best left to other (future) Jiras, 
and they are only marked {{TODO}} in the current patch (if mentioned at all)
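
The {{resort}} idea above can be sketched in a few lines of Python. This is not code from the patch: the function name {{sort_then_resort}}, the bucket layout, and the numbers are invented for illustration. It shows the trade-off being proposed: the expensive function is only evaluated for the top {{limit+overrequest}} buckets by the cheap sort, so a bucket with a low count but high relatedness can be missed.

```python
from typing import Callable, Dict, List

Bucket = Dict[str, object]


def sort_then_resort(buckets: List[Bucket], limit: int, overrequest: int,
                     cheap_key: Callable[[Bucket], float],
                     expensive_key: Callable[[Bucket], float]) -> List[Bucket]:
    """Sort by a cheap key, then re-sort only the top candidates by an expensive key."""
    candidates = sorted(buckets, key=cheap_key, reverse=True)[: limit + overrequest]
    return sorted(candidates, key=expensive_key, reverse=True)[:limit]


# Made-up buckets; "relatedness" stands in for an expensive skg() computation.
buckets: List[Bucket] = [
    {"val": "a", "count": 100, "relatedness": 0.1},
    {"val": "b", "count": 90, "relatedness": 0.8},
    {"val": "c", "count": 80, "relatedness": 0.5},
    {"val": "d", "count": 10, "relatedness": 0.9},
]

evaluated: List[str] = []  # which buckets the expensive function was computed for


def expensive_relatedness(bucket: Bucket) -> float:
    evaluated.append(str(bucket["val"]))
    return float(bucket["relatedness"])


top = sort_then_resort(buckets, limit=2, overrequest=1,
                       cheap_key=lambda b: float(b["count"]),
                       expensive_key=expensive_relatedness)
top_vals = [b["val"] for b in top]
```

Here bucket "d" is never evaluated even though it has the highest relatedness, which is the approximation a client would be accepting in exchange for speed.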
----
My current focus is on resolving the outstanding {{nocommits}} which tend to 
fall into these main categories (in order of importance) ...
 * resolving randomized test failures
 ** I used {{TestCloudJSONFacetJoinDomain}} as inspiration for a new 
{{TestCloudJSONFacetSKG}} that similarly tries to generate random indexes & 
requests and then "prove" that the results of those requests are accurate via 
verification queries
 ** i initially thought using {{refine:true}} + {{mincount:0}} + 
{{processEmpty:true}} would allow me to "prove" that the SKG results were 
accurate by executing the equivalent foreground/background queries for each 
bucket – but even with those options, i'm seeing some popularity ratios that 
are missing the denominator (size) from some shards when the numerator (count) 
is 0 ... making me think there is either some flaw in my reasoning about the 
provability, or some bug where the existing refinement logic isn't picking up 
the function contributions of some shards when the doc count is 0
 ** even if this test approach proves flawed, the functionality itself can 
still be useful since it's largely about computing statistical approximations – 
but i want to be 100% sure i understand *why* the test is failing before 
writing it off
 * refactoring some similar code
 ** the SKG distributed merging data structure is currently completely 
independent from the single-shard "SlotVal" objects ... this should be 
refactored to share code
 * can the distributed results be more efficient?
 ** right now the redundant fore & back "size" values (which are the same for 
every slot/bucket) are returned for every bucket ... i'd like to try and figure 
out if i can put that data in the facet "context" to reduce the shard response 
size.
 * figuring out what/how/where to put info in the facetDebug output
 ** it seems like it could be handy for people to be able to access the raw 
fore & back / count & size values for each bucket when debugging facets – i 
just have to figure out how to do that
 * javadocs
 * naming
 ** "Semantic Knowledge Graph" seems like a good name for the _concept_ of how 
these features can be used/combined, but the current _function_ {{skg(...)}} 
seems like it should probably have a name more specific to the underlying 
relatedness formula ... but i still don't really understand where exactly that 
formula comes from, so i'm not really clear yet on what a better name might be.
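
To illustrate the randomized-test concern above (a shard's "size" going missing when its "count" is 0), here is a small Python sketch of merging per-shard contributions for one bucket. It assumes a popularity is simply the summed count divided by the summed size, which is how the numerator/denominator are described above; the {{ShardSlot}} type, the {{background_pop}} key, and the numbers are all made up, and this is not the merging code from the patch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class ShardSlot:
    """One shard's contribution for a bucket (hypothetical structure)."""
    fg_count: int
    fg_size: Optional[int]  # None models a shard that failed to report its size
    bg_count: int
    bg_size: Optional[int]


def merge_popularities(shards: Iterable[ShardSlot]) -> Dict[str, float]:
    """Sum counts and sizes across shards, then compute count/size ratios."""
    shards = list(shards)
    fg_count = sum(s.fg_count for s in shards)
    fg_size = sum(s.fg_size or 0 for s in shards)
    bg_count = sum(s.bg_count for s in shards)
    bg_size = sum(s.bg_size or 0 for s in shards)
    return {
        "foreground_pop": fg_count / fg_size if fg_size else 0.0,
        "background_pop": bg_count / bg_size if bg_size else 0.0,
    }


shard_with_hits = ShardSlot(fg_count=3, fg_size=10, bg_count=5, bg_size=100)

# Expected: the second shard has no matches, but its sizes still count.
complete = merge_popularities(
    [shard_with_hits, ShardSlot(fg_count=0, fg_size=10, bg_count=0, bg_size=100)])

# Suspected failure mode: the zero-count shard's sizes are missing from the merge.
lossy = merge_popularities(
    [shard_with_hits, ShardSlot(fg_count=0, fg_size=None, bg_count=0, bg_size=None)])
```

With these numbers the complete merge yields popularities of 3/20 and 5/200, while the lossy merge yields 3/10 and 5/100, so a verification query run against the whole collection would disagree with the merged result.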

----

Any feedback/comments/concerns about this approach would be appreciated.

> Graph Traversal for Significantly Related Terms (Semantic Knowledge Graph)
> --------------------------------------------------------------------------
>
>                 Key: SOLR-9480
>                 URL: https://issues.apache.org/jira/browse/SOLR-9480
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Trey Grainger
>            Priority: Major
>         Attachments: SOLR-9480.patch, SOLR-9480.patch
>
>
> This issue is to track the contribution of the Semantic Knowledge Graph Solr 
> Plugin (request handler), which exposes a graph-like interface for 
> discovering and traversing significant relationships between entities within 
> an inverted index.
> This data model has been described in the following research paper: [The 
> Semantic Knowledge Graph: A compact, auto-generated model for real-time 
> traversal and ranking of any relationship within a 
> domain|https://arxiv.org/abs/1609.00464], as well as in presentations I gave 
> in October 2015 at [Lucene/Solr 
> Revolution|http://www.slideshare.net/treygrainger/leveraging-lucenesolr-as-a-knowledge-graph-and-intent-engine]
>  and November 2015 at the [Bay Area Search 
> Meetup|http://www.treygrainger.com/posts/presentations/searching-on-intent-knowledge-graphs-personalization-and-contextual-disambiguation/].
> The source code for this project is currently available at 
> [https://github.com/careerbuilder/semantic-knowledge-graph], and the folks at 
> CareerBuilder (where this was built) have given me the go-ahead to now 
> contribute this back to the Apache Solr Project, as well.
> Check out the Github repository, research paper, or presentations for a more 
> detailed description of this contribution. Initial patch coming soon.


