[jira] [Updated] (SOLR-9418) Statistical Phrase Identifier

Trey Grainger (JIRA) Wed, 05 Sep 2018 07:36:10 -0700


     [ 
https://issues.apache.org/jira/browse/SOLR-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Trey Grainger updated SOLR-9418:
--------------------------------
    Description: 
h2. *Summary:*

The Statistical Phrase Identifier is a Solr contribution that takes in a string 
of text and then leverages a language model (an Apache Lucene/Solr inverted 
index) to predict how the inputted text should be divided into phrases. The 
intended purpose of this tool is to parse short-text queries into phrases prior 
to executing a keyword search (as opposed parsing out each keyword as a single 
term).

It is being generously donated to the Solr project by CareerBuilder, with the 
original source code and a quickly demo-able version located here:  
[https://github.com/careerbuilder/statistical-phrase-identifier|https://github.com/careerbuilder/statistical-phrase-identifier,]
h2. *Purpose:*

Assume you're building a job search engine, and one of your users searches for 
the following:
 _machine learning research and development Portland, OR software engineer AND 
hadoop, java_

Most search engines will natively parse this query into the following boolean 
representation:
 _(machine AND learning AND research AND development AND Portland) OR (software 
AND engineer AND hadoop AND java)_

While this query may still yield relevant results, it is clear that the intent 
of the user wasn't understood very well at all. By leveraging the Statistical 
Phrase Identifier on this string prior to query parsing, you can instead expect 
the following parsing:



_{machine learning} \{and} \{research and development} \{Portland, OR} 
\{software engineer} \{AND} \{hadoop,} \{java}_

It is then possile to modify all the multi-word phrases prior to executing the 
search:
 _"machine learning" and "research and development" "Portland, OR" "software 
engineer" AND hadoop, java_

Of course, you could do your own query parsing to specifically handle the 
boolean syntax, but the following would eventually be interpreted correctly by 
Apache Solr and most other search engines:
 _"machine learning" AND "research and development" AND "Portland, OR" AND 
"software engineer" AND hadoop AND java_ 
h2. *History:*

This project was originally implemented by the search team at CareerBuilder in 
the summer of 2015 for use as part of their semantic search system. In the 
summer of 2016, Akash Mehta, implemented a much simpler version as a proof of 
concept based upon publicly available information about the CareerBuilder 
implementation (the first attached patch).  In July of 2018, CareerBuilder open 
sourced their original version 
([https://github.com/careerbuilder/statistical-phrase-identifier),|https://github.com/careerbuilder/statistical-phrase-identifier,]
 and agreed to also donate the code to the Apache Software foundation as a Solr 
contribution. An Solr patch with the CareerBuilder version was added to this 
issue on September 5th, 2018, and community feedback and contributions are 
encouraged.

This issue was originally titled the "Probabilistic Query Parser", but the name 
has now been updated to "Statistical Phrase Identifier" to avoid ambiguity with 
Solr's query parsers (per some of the feedback on this issue), as the 
implementation is actually just a mechanism for identifying phrases 
statistically from a string and is NOT a Solr query parser. 
h2. *Example usage:*
h3. (See contrib readme or configuration files in the patch for full 
configuration details)
h3. *{{Request:}}*
{code:java}
http://localhost:8983/solr/spi/parse?q=darth vader obi wan kenobi anakin 
skywalker toad x men magneto professor xavier{code}
h3. *{{Response:}}* 
{code:java}
{
  "responseHeader":{
    "status":0,
    "QTime":25},
    "top_parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} 
{toad} {x men} {magneto} {professor xavier}",
    "top_parsed_phrases":[
      "darth vader",
      "obi wan kenobi",
      "anakin skywalker",
      "toad",
      "x-men",
      "magneto",
      "professor xavier"],
      "potential_parsings":[{
      "parsed_phrases":["darth vader",
      "obi wan kenobi",
      "anakin skywalker",
      "toad",
      "x-men",
      "magneto",
      "professor xavier"],
      "parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} {toad} 
{x-men} {magneto} {professor xavier}",
    "score":0.0}]}{code}
 

 

  was:
h2. *Summary:*

The Statistical Phrase Identifier is a Solr contribution that takes in a string 
of text and then leverages a language model (an Apache Lucene/Solr inverted 
index) to predict how the inputted text should be divided into phrases. The 
intended purpose of this tool is to parse short-text queries into phrases prior 
to executing a keyword search (as opposed parsing out each keyword as a single 
term).

It is being generously donated to the Solr project by CareerBuilder, with the 
original source code and a quickly demo-able version located here:  
[https://github.com/careerbuilder/statistical-phrase-identifier|https://github.com/careerbuilder/statistical-phrase-identifier,]
h2. *Purpose:*

Assume you're building a job search engine, and one of your users searches for 
the following:
_machine learning research and development Portland, OR software engineer AND 
hadoop, java_

Most search engines will natively parse this query into the following boolean 
representation:
_(machine AND learning AND research AND development AND Portland) OR (software 
AND engineer AND hadoop AND java)_

While this query may still yield relevant results, it is clear that the intent 
of the user wasn't understood very well at all. By leveraging the Statistical 
Phrase Identifier on this string prior to query parsing, you can instead expect 
the following parsing:
_{machine learning} \{and} \{research and development} \{Portland, OR} 
\{software engineer} \{AND} \{hadoop,} \{java}_

It is then possile to modify all the multi-word phrases prior to executing the 
search:
_"machine learning" and "research and development" "Portland, OR" "software 
engineer" AND hadoop, java_

Of course, you could do your own query parsing to specifically handle the 
boolean syntax, but the following would eventually be interpreted correctly by 
Apache Solr and most other search engines:
_"machine learning" AND "research and development" AND "Portland, OR" AND 
"software engineer" AND hadoop AND java_

 
h2. *History:*

This project was originally implemented by the search team at CareerBuilder in 
the summer of 2015 for use as part of their semantic search system. In the 
summer of 2016, Akash Mehta, implemented a much simpler version as a proof of 
concept based upon publicly available information about the CareerBuilder 
implementation (the first attached patch).  In July of 2018, CareerBuilder open 
sourced their original version 
([https://github.com/careerbuilder/statistical-phrase-identifier),|https://github.com/careerbuilder/statistical-phrase-identifier,]
 and agreed to also donate the code to the Apache Software foundation as a Solr 
contribution. An Solr patch with the CareerBuilder version was added to this 
issue on September 5th, 2018, and community feedback and contributions are 
encouraged.

This issue was originally titled the "Probabilistic Query Parser", but the name 
has now been updated to "Statistical Phrase Identifier" to avoid ambiguity with 
Solr's query parsers (per some of the feedback on this issue), as the 
implementation is actually just a mechanism for identifying phrases 
statistically from a string and is NOT a Solr query parser.

 
h2. *Example usage:*
h3. (See contrib readme or configuration files in the patch for full 
configuration details)
h3. *{{Request:}}*
{code:java}
http://localhost:8983/solr/spi/parse?q=darth vader obi wan kenobi anakin 
skywalker toad x men magneto professor xavier{code}
 
h3. *{{Response:}}*

 
{code:java}
{
  "responseHeader":{
    "status":0,
    "QTime":25},
    "top_parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} 
{toad} {x men} {magneto} {professor xavier}",
    "top_parsed_phrases":[
      "darth vader",
      "obi wan kenobi",
      "anakin skywalker",
      "toad",
      "x-men",
      "magneto",
      "professor xavier"],
      "potential_parsings":[{
      "parsed_phrases":["darth vader",
      "obi wan kenobi",
      "anakin skywalker",
      "toad",
      "x-men",
      "magneto",
      "professor xavier"],
      "parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} {toad} 
{x-men} {magneto} {professor xavier}",
    "score":0.0}]}{code}
 

 


> Statistical Phrase Identifier
> -----------------------------
>
>                 Key: SOLR-9418
>                 URL: https://issues.apache.org/jira/browse/SOLR-9418
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Akash Mehta
>            Priority: Major
>         Attachments: SOLR-9418.patch, SOLR-9418.zip
>
>
> h2. *Summary:*
> The Statistical Phrase Identifier is a Solr contribution that takes in a 
> string of text and then leverages a language model (an Apache Lucene/Solr 
> inverted index) to predict how the inputted text should be divided into 
> phrases. The intended purpose of this tool is to parse short-text queries 
> into phrases prior to executing a keyword search (as opposed parsing out each 
> keyword as a single term).
> It is being generously donated to the Solr project by CareerBuilder, with the 
> original source code and a quickly demo-able version located here:  
> [https://github.com/careerbuilder/statistical-phrase-identifier|https://github.com/careerbuilder/statistical-phrase-identifier,]
> h2. *Purpose:*
> Assume you're building a job search engine, and one of your users searches 
> for the following:
>  _machine learning research and development Portland, OR software engineer 
> AND hadoop, java_
> Most search engines will natively parse this query into the following boolean 
> representation:
>  _(machine AND learning AND research AND development AND Portland) OR 
> (software AND engineer AND hadoop AND java)_
> While this query may still yield relevant results, it is clear that the 
> intent of the user wasn't understood very well at all. By leveraging the 
> Statistical Phrase Identifier on this string prior to query parsing, you can 
> instead expect the following parsing:
> _{machine learning} \{and} \{research and development} \{Portland, OR} 
> \{software engineer} \{AND} \{hadoop,} \{java}_
> It is then possile to modify all the multi-word phrases prior to executing 
> the search:
>  _"machine learning" and "research and development" "Portland, OR" "software 
> engineer" AND hadoop, java_
> Of course, you could do your own query parsing to specifically handle the 
> boolean syntax, but the following would eventually be interpreted correctly 
> by Apache Solr and most other search engines:
>  _"machine learning" AND "research and development" AND "Portland, OR" AND 
> "software engineer" AND hadoop AND java_ 
> h2. *History:*
> This project was originally implemented by the search team at CareerBuilder 
> in the summer of 2015 for use as part of their semantic search system. In the 
> summer of 2016, Akash Mehta, implemented a much simpler version as a proof of 
> concept based upon publicly available information about the CareerBuilder 
> implementation (the first attached patch).  In July of 2018, CareerBuilder 
> open sourced their original version 
> ([https://github.com/careerbuilder/statistical-phrase-identifier),|https://github.com/careerbuilder/statistical-phrase-identifier,]
>  and agreed to also donate the code to the Apache Software foundation as a 
> Solr contribution. An Solr patch with the CareerBuilder version was added to 
> this issue on September 5th, 2018, and community feedback and contributions 
> are encouraged.
> This issue was originally titled the "Probabilistic Query Parser", but the 
> name has now been updated to "Statistical Phrase Identifier" to avoid 
> ambiguity with Solr's query parsers (per some of the feedback on this issue), 
> as the implementation is actually just a mechanism for identifying phrases 
> statistically from a string and is NOT a Solr query parser. 
> h2. *Example usage:*
> h3. (See contrib readme or configuration files in the patch for full 
> configuration details)
> h3. *{{Request:}}*
> {code:java}
> http://localhost:8983/solr/spi/parse?q=darth vader obi wan kenobi anakin 
> skywalker toad x men magneto professor xavier{code}
> h3. *{{Response:}}* 
> {code:java}
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":25},
>     "top_parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} 
> {toad} {x men} {magneto} {professor xavier}",
>     "top_parsed_phrases":[
>       "darth vader",
>       "obi wan kenobi",
>       "anakin skywalker",
>       "toad",
>       "x-men",
>       "magneto",
>       "professor xavier"],
>       "potential_parsings":[{
>       "parsed_phrases":["darth vader",
>       "obi wan kenobi",
>       "anakin skywalker",
>       "toad",
>       "x-men",
>       "magneto",
>       "professor xavier"],
>       "parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} 
> {toad} {x-men} {magneto} {professor xavier}",
>     "score":0.0}]}{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-9418) Statistical Phrase Identifier

Reply via email to