[jira] [Updated] (SOLR-9418) Statistical Phrase Identifier

Trey Grainger (JIRA) Wed, 05 Sep 2018 05:47:17 -0700


     [ 
https://issues.apache.org/jira/browse/SOLR-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Trey Grainger updated SOLR-9418:
--------------------------------
    Description: 
The Statistical Phrase Identifier is a Solr contribution that takes in a string 
of text and then leverages a language model (an Apache Lucene/Solr inverted 
index) to predict how the inputted text should be divided into phrases. The 
intended purpose of this tool is to parse short-text queries into phrases prior 
to executing a keyword search (as opposed parsing out each keyword as a single 
term).

History

This project was originally implemented at CareerBuilder in the summer of 2015 
for use as part of their semantic search system. In 2018

 

The main aim of this requestHandler is to get the best parsing for a given 
query. This basically means recognizing different phrases within the query. We 
need some kind of training data to generate these phrases. The way this project 
works is:
 1.)Generate all possible parsings for the given query
 2.)For each possible parsing, a naive-bayes like score is calculated.
 3.)The main scoring is done by going through all the documents in the training 
set and finding the probability of bunch of words occurring together as a 
phrase as compared to them occurring randomly in the same document. Then the 
score is normalized. Some higher importance is given to the title field as 
compared to content field which is configurable.
 4.)Finally after scoring each of the possible parsing, the one with the 
highest score is returned.

  was:
The main aim of this requestHandler is to get the best parsing for a given 
query. This basically means recognizing different phrases within the query. We 
need some kind of training data to generate these phrases. The way this project 
works is:
1.)Generate all possible parsings for the given query
2.)For each possible parsing, a naive-bayes like score is calculated.
3.)The main scoring is done by going through all the documents in the training 
set and finding the probability of bunch of words occurring together as a 
phrase as compared to them occurring randomly in the same document. Then the 
score is normalized. Some higher importance is given to the title field as 
compared to content field which is configurable.
4.)Finally after scoring each of the possible parsing, the one with the highest 
score is returned.


> Statistical Phrase Identifier
> -----------------------------
>
>                 Key: SOLR-9418
>                 URL: https://issues.apache.org/jira/browse/SOLR-9418
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Akash Mehta
>            Priority: Major
>         Attachments: SOLR-9418.zip
>
>
> The Statistical Phrase Identifier is a Solr contribution that takes in a 
> string of text and then leverages a language model (an Apache Lucene/Solr 
> inverted index) to predict how the inputted text should be divided into 
> phrases. The intended purpose of this tool is to parse short-text queries 
> into phrases prior to executing a keyword search (as opposed parsing out each 
> keyword as a single term).
> History
> This project was originally implemented at CareerBuilder in the summer of 
> 2015 for use as part of their semantic search system. In 2018
>  
> The main aim of this requestHandler is to get the best parsing for a given 
> query. This basically means recognizing different phrases within the query. 
> We need some kind of training data to generate these phrases. The way this 
> project works is:
>  1.)Generate all possible parsings for the given query
>  2.)For each possible parsing, a naive-bayes like score is calculated.
>  3.)The main scoring is done by going through all the documents in the 
> training set and finding the probability of bunch of words occurring together 
> as a phrase as compared to them occurring randomly in the same document. Then 
> the score is normalized. Some higher importance is given to the title field 
> as compared to content field which is configurable.
>  4.)Finally after scoring each of the possible parsing, the one with the 
> highest score is returned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-9418) Statistical Phrase Identifier

Reply via email to