[
https://issues.apache.org/jira/browse/SOLR-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466664#comment-16466664
]
Hoss Man commented on SOLR-9480:
--------------------------------
I've been playing with this "SKG" code off and on for a bit, and talking with
Trey about it offline, and I think I've come up with a nice strategy for
integrating it directly into JSON Faceting as a handful of small improvements, to
be able to leverage most of the existing JSON Facet code (including distributed
refinement) w/o needing to be "tacked on" to the outside as much as the code in
the older patch/github-repo does.
The attached patch includes the 3 most "important" of these improvements in
order for this to work, along with lots of new tests...
# Refactoring the {{SlotAcc}} so that every time it's asked to {{collect(...)}}
a slot, it has the ability to ask for a {{Query}} that identifies this slot
_independent of the current context_
** This allows the new SKG code (described below) to have enough information
about the current bucket that it can compute the full "foreground" that it
wants, regardless of the bucket type or how the current bucket may be nested
under other facets ... so SKG graphs can be built by nesting any types of
facets (including range & query facets) regardless of how the top level {{q +
fq}} params relate to the {{foreground}} query.
** this happens via an {{IntFunction<Query>}} callback function – so there
shouldn't be any overhead to existing FacetProcessor/SlotAcc usages that don't
care about this extra info about the bucket – there's no extra {{TermQuery}}
(or {{RangeQuery}} etc...) overhead when the accumulators only care about the
final filtered set of documents for the current bucket/slot.
# Add an {{skg(...)}} AggValueSource function that can be nested under any
facet
** this function takes in the "foreground" and "background" queries to use,
which just like any existing (aggregate) function can be {{$variables}}
pointing to existing request params
** this means that, unlike the original SKG code linked from this issue, you
could compute the SKG relatedness info only at certain points in the facet
hierarchy, or use different foreground/background queries in different places
** this function actually produces JSON "objects" as the function result,
containing the foreground/background popularities as well as the "relatedness"
score – which is what's used if you sort on this function
** I originally experimented with implementing "SKG" as a new type of _facet_
that could be nested under any (other) facet, but implementing as a function
means that we can leverage the existing code for sorting (parent) facet buckets
by the (child) function's results – which is very powerful for SKG results (and
it's not currently possible to "sort" on the results of a sub-facet, and doing
so would be a lot of work given how sub-facet refinement is currently handled
... i looked into it briefly)
** but sorting on the {{skg()}} function is optional, and not strictly
necessary when the clients care more about performance than accuracy – as with
the existing SKG code Trey contributed, the (default) sort on facet count could
still be used, which means the existing JSON faceting code would only compute
the (semi-expensive) {{skg()}} function on the final buckets to be returned,
and the client could then post-process to re-sort them by the {{skg()}} values.
# Add support for an "explicit query domain" via syntax like {{domain :
\{query:'foo:bar'\}}} (or any other JSON query syntax supported by the
{{filters}} option) that lets you arbitrarily pick any set of queries you want
to use as a "domain" for a facet, regardless of its parent facets/bucket.
** this provides an optional way to improve the "top n" accuracy of sub-facets
in a deep SKG request, by letting you ignore the "ancestor facet bucket
filtering" typically done in faceting, and instead request that *all* buckets
under some arbitrary query – like the original background query – be considered.
** SKG users that care more about speed & approximations can ignore this
feature, and just sort the regular facet terms by the {{skg()}} function to get
a good approximation of the top terms ... or as I mentioned before: trust the
(default) sort on facet counts (w/or w/o using the {{$background_q}} as an
explicit domain) to approximate the top N terms
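To illustrate the first point, here is a minimal Python analog of the callback idea (the patch itself is Java and uses {{IntFunction<Query>}}). Every name below ({{CountAcc}}, {{ForegroundAcc}}, {{slot_query}}) is hypothetical, and plain strings stand in for Lucene {{Query}} objects. It is only a sketch of why accumulators that ignore the callback should pay nothing for it:

```python
from typing import Callable, List, Optional

# Hypothetical stand-in: a plain string plays the role of a Lucene Query.
SlotQuery = Callable[[int], str]


class CountAcc:
    """Only needs the docs already filtered into the slot; never asks for the query."""

    def __init__(self, num_slots: int) -> None:
        self.counts: List[int] = [0] * num_slots

    def collect(self, slot: int, num_docs: int, slot_query: SlotQuery) -> None:
        self.counts[slot] += num_docs  # slot_query is deliberately not invoked


class ForegroundAcc:
    """Needs a context-independent description of the slot to build its own foreground."""

    def __init__(self, num_slots: int, foreground: str) -> None:
        self.foreground = foreground
        self.queries: List[Optional[str]] = [None] * num_slots

    def collect(self, slot: int, num_docs: int, slot_query: SlotQuery) -> None:
        if self.queries[slot] is None:
            # Invoke the callback only the first time this slot is seen.
            self.queries[slot] = f"+({slot_query(slot)}) +({self.foreground})"


bucket_terms = ["java", "python"]
built: List[int] = []  # records every slot whose query was actually constructed


def slot_query(slot: int) -> str:
    built.append(slot)
    return f"tags:{bucket_terms[slot]}"


count_acc = CountAcc(len(bucket_terms))
count_acc.collect(0, 7, slot_query)
count_acc.collect(1, 3, slot_query)
built_after_counting = list(built)  # still empty: no query was ever created

fg_acc = ForegroundAcc(len(bucket_terms), 'body:"harry potter"')
fg_acc.collect(1, 3, slot_query)
fg_acc.collect(1, 2, slot_query)  # second collect reuses the cached query
```

In this sketch the counting accumulator never triggers query construction, while the foreground-aware one builds exactly one query for the one slot it touched.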
An example of what all these features together can look like right now...
{noformat}
rows=0&
q=type:QUESTION&
fore=body:%22harry+potter%22&
back=*:*&
json.facet='{
  tags : {
    type : terms,
    field : tags,
    limit : 5,
    sort : { skg: desc },
    facet : {
      skg : "skg($fore,$back)",
      body : {
        type : terms,
        field : body,
        limit : 5,
        domain : { query:{param:back} },
        sort : { skg: desc },
        facet : {
          skg : "skg($fore,$back)"
        }
      }
    }
  }
}'
{noformat}
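For anyone who wants to try a request like this from a client, the following is one possible way to assemble the same parameters in Python using only the standard library. The helper name {{skg_terms_facet}} is made up for illustration, and the facet body is emitted as strict JSON rather than the relaxed syntax shown above; treat it as a sketch, not as part of the patch:

```python
import json
from typing import Any, Dict, Optional
from urllib.parse import parse_qs, urlencode


def skg_terms_facet(field: str, limit: int,
                    sub_facets: Optional[Dict[str, Any]] = None,
                    domain: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: a terms facet sorted by a nested skg() function."""
    facet: Dict[str, Any] = {"skg": "skg($fore,$back)"}
    if sub_facets:
        facet.update(sub_facets)
    spec: Dict[str, Any] = {
        "type": "terms",
        "field": field,
        "limit": limit,
        "sort": {"skg": "desc"},
        "facet": facet,
    }
    if domain is not None:
        spec["domain"] = domain
    return spec


# Inner facet uses the background query as an explicit domain.
body = skg_terms_facet("body", 5, domain={"query": {"param": "back"}})
tags = skg_terms_facet("tags", 5, sub_facets={"body": body})

params = {
    "rows": 0,
    "q": "type:QUESTION",
    "fore": 'body:"harry potter"',
    "back": "*:*",
    "json.facet": json.dumps({"tags": tags}),
}

# URL-encoded query string, ready to append to a select handler URL.
query_string = urlencode(params)

# Round-trip check: decode the query string back into its parts.
decoded = parse_qs(query_string)
decoded_facet = json.loads(decoded["json.facet"][0])
```

The {{fore}} and {{back}} params are referenced from inside the facet via {{$fore}} / {{$back}} and {{param:back}}, so the queries only need to be written once per request.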
There are still lots of things not included in the patch that could be added
later to make all of this better and/or easier to use – and in most cases would
be general improvements to JSON Faceting...
* As noted in some {{TODO}} comments, I would love to enhance the syntax of
the {{skg()}} function in a couple of ways...
** making the queries optional, and inheriting them from "ancestor" function
instances higher up the tree...
{noformat}
{
  tags : {
    type : terms,
    field : tags,
    facet : {
      skg : "skg($fore,$back)",
      body : {
        type : terms,
        field : body,
        facet : {
          // inherits the $fore/$back queries from the 'skg' function of the parent facet
          skg : "skg()"
        }
      }
    }
  }
}
{noformat}
** I'd also like to improve the way JSON Facet functions are parsed – along
the lines of what's described in SOLR-11709 – in order to support more
"optional" args that could be used by {{skg()}} to override some of its
default behavior...
*** this would be implemented under the covers by passing the extra map keys
as the "localParams" for the ValueSourceParser
*** Example: telling {{skg()}} that its effective "sort" value should be
based on the "foreground_pop" instead of the (default) "relatedness"...
{noformat}
tags : {
  type : terms,
  field : tags,
  sort : "skg desc",
  facet : {
    skg : { type : func,
            func : "skg($fore,$back)",
            sort_value : foreground_pop }
  }
}
{noformat}
*** this could also be used to implement a {{min_pop}} type value, that could
be used to configure the {{skg()}} function to return a relatedness of
{{-Infinity}} for any bucket that didn't have foreground/background popularity
ratios at least as high as some user specified value.
* Similar to how the {{rerank}} query parser (used via the {{rq}} request param)
allows people to collect & score documents using a "cheap" query, and then
re-score the top N using a more expensive query, I think it would be handy if
JSON Facets supported a {{resort}} option that could be used on any
{{FacetRequestSorted}} instance right alongside the {{sort}} param, using the
same JSON syntax, so that clients could have Solr internally sort all the facet
buckets by something simple (like count) and then "Re-Sort" the top {{N=limit}}
(or maybe {{N=limit+overrequest}} ?) using a more expensive function like {{skg()}}
...however, I think most of this would be best left to other (future) Jiras,
and they are only marked {{TODO}} in the current patch (if mentioned at all)
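The "Re-Sort" idea (and the equivalent client-side post-processing mentioned earlier) can be sketched in a few lines of Python. Nothing here is existing Solr API: the bucket shape, the {{resort_top_buckets}} name, and the numbers are all made up for illustration.

```python
from typing import Any, Callable, Dict, List

Bucket = Dict[str, Any]  # hypothetical shape: {"val": ..., "count": ..., "relatedness": ...}


def resort_top_buckets(buckets: List[Bucket], limit: int,
                       expensive_score: Callable[[Bucket], float]) -> List[Bucket]:
    """Cheap sort by count first, then re-sort only the top `limit` buckets."""
    by_count = sorted(buckets, key=lambda b: b["count"], reverse=True)
    top = by_count[:limit]
    # The expensive function is evaluated once per surviving bucket only.
    return sorted(top, key=expensive_score, reverse=True)


calls: List[str] = []  # tracks which buckets paid for the expensive function


def relatedness(bucket: Bucket) -> float:
    calls.append(bucket["val"])
    return bucket["relatedness"]


# Illustrative data only.
buckets = [
    {"val": "a", "count": 10, "relatedness": 0.10},
    {"val": "b", "count": 8, "relatedness": 0.90},
    {"val": "c", "count": 2, "relatedness": 0.99},
    {"val": "d", "count": 5, "relatedness": 0.50},
]

resorted = [b["val"] for b in resort_top_buckets(buckets, 3, relatedness)]
```

Note that bucket "c" has the highest relatedness but never makes the cut because its count is too low. That is the speed vs. accuracy trade-off: the result is an approximation of the true top N by relatedness.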
----
My current focus is on resolving the outstanding {{nocommits}} which tend to
fall into these main categories (in order of importance) ...
* resolving randomized test failures
** I used {{TestCloudJSONFacetJoinDomain}} as inspiration for a new
{{TestCloudJSONFacetSKG}} that similarly tries to generate random indexes &
requests and then "prove" that the results of those requests are accurate via
verification queries
** i initially thought using {{refine:true}} + {{mincount:0}} +
{{processEmpty:true}} would allow me to "prove" that the SKG results were
accurate by executing the equivalent foreground/background queries for each
bucket – but even with those options, i'm seeing some popularity ratios that
are missing the denominator (size) from some shards when the numerator (count)
is 0 ... making me think there is either some flaw in my reasoning about the
provability, or some bug where the existing refinement logic isn't picking up
the function contributions of some shards when the doc count is 0
** even if this test approach proves flawed, the functionality itself can
still be useful since it's largely about computing statistical approximations –
but i want to be 100% sure i understand *why* the test is failing before
writing it off
* refactoring some similar code
** the SKG distributed merging data structure is currently completely
independent from the single-shard "SlotVal" objects ... this should be
refactored to share code
* can the distributed results be more efficient?
** right now the redundant fore & back "size" values (which are the same for
every slot/bucket) are returned for every bucket ... i'd like to try and figure
out if i can put that data in the facet "context" to reduce the shard response
size.
* figuring out what/how/where to put info in the facetDebug output
** it seems like it could be handy for people to be able to access the raw
fore & back / count & size values for each bucket when debugging facets – i
just have to figure out how to do that
* javadocs
* naming
** "Semantic Knowledge Graph" seems like a good name for the _concept_ of how
these features can be used/combined, but the current _function_ {{skg(...)}}
seems like it should probably have a name more specific to the underlying
relatedness formula ... but i still don't really understand where exactly that
formula comes from, so i'm not really clear yet on what a better name might be.
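To make the randomized test concern above concrete, here is a small Python sketch of how a popularity ratio would be merged across shards, assuming each shard reports a (count, size) pair per bucket. The data shape, the function name, and the numbers are illustrative assumptions, not the patch's actual merging structure.

```python
from typing import Dict, List

ShardStats = Dict[str, int]  # hypothetical shape: {"count": ..., "size": ...}


def merged_popularity(shards: List[ShardStats]) -> float:
    """Sum numerators and denominators across shards before dividing."""
    count = sum(s["count"] for s in shards)
    size = sum(s["size"] for s in shards)
    return count / size if size else 0.0


# Illustrative numbers only: shard_b has 300 docs in the query's set (size),
# but none of them fall into this bucket (count 0).
shard_a = {"count": 3, "size": 100}
shard_b = {"count": 0, "size": 300}

expected = merged_popularity([shard_a, shard_b])  # 3 / 400
skewed = merged_popularity([shard_a])             # 3 / 100: shard_b's size is lost
```

If a shard's contribution is dropped whenever its count is 0, its size never reaches the denominator and the merged popularity comes out too high, which matches the symptom described in the test failures.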
----
Any feedback/comments/concerns about this approach would be appreciated.
> Graph Traversal for Significantly Related Terms (Semantic Knowledge Graph)
> --------------------------------------------------------------------------
>
> Key: SOLR-9480
> URL: https://issues.apache.org/jira/browse/SOLR-9480
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Trey Grainger
> Priority: Major
> Attachments: SOLR-9480.patch, SOLR-9480.patch
>
>
> This issue is to track the contribution of the Semantic Knowledge Graph Solr
> Plugin (request handler), which exposes a graph-like interface for
> discovering and traversing significant relationships between entities within
> an inverted index.
> This data model has been described in the following research paper: [The
> Semantic Knowledge Graph: A compact, auto-generated model for real-time
> traversal and ranking of any relationship within a
> domain|https://arxiv.org/abs/1609.00464], as well as in presentations I gave
> in October 2015 at [Lucene/Solr
> Revolution|http://www.slideshare.net/treygrainger/leveraging-lucenesolr-as-a-knowledge-graph-and-intent-engine]
> and November 2015 at the [Bay Area Search
> Meetup|http://www.treygrainger.com/posts/presentations/searching-on-intent-knowledge-graphs-personalization-and-contextual-disambiguation/].
> The source code for this project is currently available at
> [https://github.com/careerbuilder/semantic-knowledge-graph], and the folks at
> CareerBuilder (where this was built) have given me the go-ahead to now
> contribute this back to the Apache Solr Project, as well.
> Check out the Github repository, research paper, or presentations for a more
> detailed description of this contribution. Initial patch coming soon.