[ https://issues.apache.org/jira/browse/SOLR-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man reassigned SOLR-9418:
------------------------------

      Assignee: Hoss Man
    Attachment: SOLR-9418.patch

Trey showed me the CareerBuilder repo a while back, and I've been working on 
reviewing/refactoring/generalizing/adapting the concepts in it into a 
general-purpose, reusable SearchComponent (but I've held off on discussing it 
publicly until he had a chance to finalize the official CareerBuilder 
contribution patch today).

My updated patch doesn't directly re-use any of the code from the CareerBuilder 
patch, but it is heavily inspired by the ideas in it. Rather than implementing a 
RequestHandler, it adds this functionality as a (distributed) SearchComponent 
that can be used by itself in a SearchHandler, or in conjunction with 
QueryComponent (similar to the way SuggestComponent can be used), to parse the 
"q" param (or the alternative "phrases.q" param), try to identify phrases in 
the input, and return metadata about those phrases.

As in the CareerBuilder patch, the root data this component uses to score 
candidate phrases is driven entirely by the use of ShingleFilter in an 
index/query analyzer. But unlike the original CareerBuilder code, there are no 
hard-coded assumptions about field names, shingle sizes, or how to tokenize the 
input string:
 * Multiple weighted fields can be used (as long as they use the same/compatible 
analyzers).
 * The larger the index-time shingles used for these fields, the more accurately 
the code can score phrases based on the term stats of those indexed shingles 
(using a Bayesian model).
 * The query-time analyzer is used for parsing the input, and can dictate larger 
shingle sizes to identify the largest possible candidate phrases that will be 
considered; the code will estimate how likely those longer phrases are based on 
the stats of the overlapping indexed shingles they are composed of.
 * The "total score" for each candidate phrase comes from user-supplied weights 
applied to each field's score, but the per-field scores are returned as well, so 
that clients can make informed choices based on where a phrase is more common 
(e.g., if "Isaac Asimov" is a very common phrase in an "author" field but less 
common in a "title" field, a client app may want to suggest it as a filter/facet 
against the "author" field in a subsequent query). A rough sketch of this 
combination step follows this list.
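
The exact scoring internals are in the patch (and commented there), but here is 
a rough, hypothetical sketch of just the final combination step, based on my 
reading of the example output later in this comment (not necessarily the 
literal implementation): the "total score" appears to be a weight-normalized 
average of the per-field scores...
{code:java}
// Hypothetical sketch only (not the code in the patch): combining the per-field
// phrase scores into a single "total" score using the field weights from the
// "phrases.fields" param (e.g. "multigrams_body multigrams_title^2").
import java.util.Map;

public class PhraseScoreSketch {
  public static double totalScore(Map<String,Double> fieldScores,
                                  Map<String,Double> fieldWeights) {
    double weighted = 0.0, weightSum = 0.0;
    for (Map.Entry<String,Double> e : fieldScores.entrySet()) {
      // assume fields listed without a ^boost carry a weight of 1
      double w = fieldWeights.getOrDefault(e.getKey(), 1.0);
      weighted += w * e.getValue();
      weightSum += w;
    }
    return (weightSum == 0.0) ? 0.0 : weighted / weightSum;
  }
}
{code}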

Consider the following configuration...
{noformat:title=solrconfig.xml}
  <searchComponent class="solr.PhrasesIdentificationComponent" name="phrases" />
  <requestHandler name="/phrases" class="solr.SearchHandler">
    <arr name="components">
      <str>phrases</str>
    </arr>
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="indent">true</str>
      <bool name="phrases">true</bool>
      <str name="phrases.fields">multigrams_body multigrams_title^2</str>
    </lst>
  </requestHandler>
{noformat}
{noformat:title=schema.xml}
  <field name="multigrams_title" type="multigrams_3_7" indexed="true" 
stored="false" />
  <field name="multigrams_body"  type="multigrams_3_7" indexed="true" 
stored="false" />
  
  <fieldType name="multigrams_3_7" class="solr.TextField" 
positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" minShingleSize="2" 
maxShingleSize="3" outputUnigrams="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" minShingleSize="2" 
maxShingleSize="7" outputUnigramsIfNoShingles="true" outputUnigrams="true"/>
    </analyzer>
  </fieldType>
{noformat}
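
For a concrete sense of what that query-time analyzer feeds the component: with 
minShingleSize=2, maxShingleSize=7, and outputUnigrams=true, standard 
ShingleFilter behavior turns a short input like "star wars phantom menace" (a 
made-up example, not taken from the patch) into unigrams plus every contiguous 
word n-gram of up to 7 terms, roughly...
{noformat}
star      star wars        star wars phantom      star wars phantom menace
wars      wars phantom     wars phantom menace
phantom   phantom menace
menace
{noformat}
Each of those shingles is a candidate phrase; the ones longer than the 
index-time maximum (3 terms here) can't be looked up directly, so their 
likelihood is estimated from the stats of the overlapping 2-3 term shingles 
that actually exist in the index.
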
Then a request like this...
 
[http://localhost:8983/solr/phrase-demo/phrases?q=is+anakin+skywalker+the+kid+in+star+wars+phantom+menace]

...might return results like the example below (assuming we've indexed Q&A 
documents about science fiction)...
{noformat}
{
  "responseHeader":{
    "status":0,
    "QTime":21,
    "params":{
      "q":"is anakin skywalker the kid in star wars phantom menace"}},
  "phrases":{
    "input":"is anakin skywalker the kid in star wars phantom menace",
    "summary":"is {anakin skywalker} the {kid in} {star wars} {phantom menace}",
    "details":[{
        "text":"phantom menace",
        "offset_start":41,
        "offset_end":55,
        "score":0.07991913047151764,
        "field_scores":{
          "multigrams_body":0.16328680317925878,
          "multigrams_title":0.03823529411764706}},
      {
        "text":"anakin skywalker",
        "offset_start":3,
        "offset_end":19,
        "score":0.06451071736995491,
        "field_scores":{
          "multigrams_body":0.06449989404534859,
          "multigrams_title":0.06451612903225806}},
      {
        "text":"star wars",
        "offset_start":31,
        "offset_end":40,
        "score":0.05346181438317064,
        "field_scores":{
          "multigrams_body":0.08329261539254754,
          "multigrams_title":0.03854641387848219}},
      {
        "text":"kid in",
        "offset_start":24,
        "offset_end":30,
        "score":0.016470309145309624,
        "field_scores":{
          "multigrams_body":0.01567598767689273,
          "multigrams_title":0.016867469879518072}}]},
{noformat}
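As a quick sanity check on the weighted-average sketch earlier in this comment: 
plugging the phrases.fields weights from the solrconfig.xml above 
(multigrams_body with an implicit weight of 1, multigrams_title^2) into the 
returned field_scores reproduces the reported totals, e.g....
{noformat}
"phantom menace":   (1 * 0.16328680 + 2 * 0.03823529) / (1 + 2) ~= 0.07991913
"anakin skywalker": (1 * 0.06449989 + 2 * 0.06451613) / (1 + 2) ~= 0.06451072
{noformat}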
...if this component is configured as a {{last-component}} (in combination with 
the normal default SearchComponents), the same "phrases" results would be 
returned in addition to the normal search/facet/highlight/etc. results.
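For example (a minimal sketch, assuming the same "phrases" component 
registration and the phrases.fields setup shown above), hooking it into the 
default /select handler might look like...
{noformat:title=solrconfig.xml}
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="phrases.fields">multigrams_body multigrams_title^2</str>
    </lst>
    <arr name="last-components">
      <str>phrases</str>
    </arr>
  </requestHandler>
{noformat}
...at which point adding phrases=true to an ordinary query (or supplying 
phrases.q to analyze a string other than the main q) would include the 
"phrases" section alongside the usual response sections.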
----
The scoring model in this patch is far from perfect: the "recommended" scoring 
approach in the CareerBuilder patch was heavily dependent on some magic 
constants that made several assumptions about index size and term distribution. 
I tried to avoid all of that with a more general Bayesian model, but I'm certain 
there are still a lot of improvements that could be made (see the comments in 
the code).

That said: I think what's here is a really good start that is usable as-is. 
Unless anyone has major concerns about the current API (which I've tried to keep 
very limited so we don't overcommit to specifics of the scoring), I'd like to 
include this as an experimental feature in 7.5, and encourage people to try it 
out and give feedback so we can work on improving/refining the scoring and 
adding more tuning knobs in future releases.

> Statistical Phrase Identifier
> -----------------------------
>
>                 Key: SOLR-9418
>                 URL: https://issues.apache.org/jira/browse/SOLR-9418
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Akash Mehta
>            Assignee: Hoss Man
>            Priority: Major
>         Attachments: SOLR-9418.patch, SOLR-9418.patch, SOLR-9418.zip
>
>
> h2. *Summary:*
> The Statistical Phrase Identifier is a Solr contribution that takes in a 
> string of text and then leverages a language model (an Apache Lucene/Solr 
> inverted index) to predict how the input text should be divided into 
> phrases. The intended purpose of this tool is to parse short-text queries 
> into phrases prior to executing a keyword search (as opposed to parsing out 
> each keyword as a single term).
> It is being generously donated to the Solr project by CareerBuilder, with the 
> original source code and a quickly demoable version located here: 
> [https://github.com/careerbuilder/statistical-phrase-identifier]
> h2. *Purpose:*
> Assume you're building a job search engine, and one of your users searches 
> for the following:
>  _machine learning research and development Portland, OR software engineer 
> AND hadoop, java_
> Most search engines will natively parse this query into the following boolean 
> representation:
>  _(machine AND learning AND research AND development AND Portland) OR 
> (software AND engineer AND hadoop AND java)_
> While this query may still yield relevant results, it is clear that the 
> intent of the user wasn't understood very well at all. By leveraging the 
> Statistical Phrase Identifier on this string prior to query parsing, you can 
> instead expect the following parsing:
> _{machine learning} {and} {research and development} {Portland, OR} 
> {software engineer} {AND} {hadoop,} {java}_
> It is then possible to wrap all the multi-word phrases in quotes prior to 
> executing the search:
>  _"machine learning" and "research and development" "Portland, OR" "software 
> engineer" AND hadoop, java_
> Of course, you could do your own query parsing to specifically handle the 
> boolean syntax, but the following would eventually be interpreted correctly 
> by Apache Solr and most other search engines:
>  _"machine learning" AND "research and development" AND "Portland, OR" AND 
> "software engineer" AND hadoop AND java_ 
> h2. *History:*
> This project was originally implemented by the search team at CareerBuilder 
> in the summer of 2015 for use as part of their semantic search system. In the 
> summer of 2016, Akash Mehta implemented a much simpler version as a proof of 
> concept based upon publicly available information about the CareerBuilder 
> implementation (the first attached patch). In July of 2018, CareerBuilder 
> open sourced their original version 
> ([https://github.com/careerbuilder/statistical-phrase-identifier]) and agreed 
> to also donate the code to the Apache Software Foundation as a Solr 
> contribution. A Solr patch with the CareerBuilder version was added to 
> this issue on September 5th, 2018, and community feedback and contributions 
> are encouraged.
> This issue was originally titled the "Probabilistic Query Parser", but the 
> name has now been updated to "Statistical Phrase Identifier" to avoid 
> ambiguity with Solr's query parsers (per some of the feedback on this issue), 
> as the implementation is actually just a mechanism for identifying phrases 
> statistically from a string and is NOT a Solr query parser. 
> h2. *Example usage:*
> h3. (See contrib readme or configuration files in the patch for full 
> configuration details)
> h3. *{{Request:}}*
> {code:java}
> http://localhost:8983/solr/spi/parse?q=darth vader obi wan kenobi anakin skywalker toad x men magneto professor xavier{code}
> h3. *{{Response:}}* 
> {code:java}
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":25},
>   "top_parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} 
> {toad} {x men} {magneto} {professor xavier}",
>   "top_parsed_phrases":[
>     "darth vader",
>     "obi wan kenobi",
>     "anakin skywalker",
>     "toad",
>     "x-men",
>     "magneto",
>     "professor xavier"],
>   "potential_parsings":[{
>       "parsed_phrases":[
>         "darth vader",
>         "obi wan kenobi",
>         "anakin skywalker",
>         "toad",
>         "x-men",
>         "magneto",
>         "professor xavier"],
>       "parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} 
> {toad} {x-men} {magneto} {professor xavier}",
>       "score":0.0}]}{code}
>  
>  


