Has anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory)?

shamik Thu, 11 Dec 2014 09:49:28 -0800

Hi, 

  I'm trying to use AutoPhrasingTokenFilterFactory, which looks like a 
great solution to our phrase query issues, but it doesn't seem to work as 
described in the blog post:


https://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/

The filter works as expected at index time, where it preserves the phrases 
listed in the text file as single tokens. Here's my field definition:

<fieldType name="text_autophrase" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
            phrases="autophrases.txt" includeTokens="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
</fieldType>

On analyzing, I can see the phrase "seat cushions" (defined in 
autophrases.txt) is being indexed as "seat", "seat cushions" and "cushion". 
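
To make the index-time behaviour concrete, here is a toy Python model of 
what I understand auto-phrasing with includeTokens="true" to do. It is only 
a sketch, not the Lucidworks implementation: the function name and 
parameters are made up for illustration, and it ignores token positions, 
offsets, stemming and overlapping phrases.

```python
def auto_phrase(tokens, phrases, include_tokens=True):
    """Toy model: emit known multi-word phrases as single tokens.

    Hypothetical helper for illustration only; it does not reproduce
    the real filter's position or offset handling.
    """
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out = []
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest known phrase starting at position i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrase_set:
                match = candidate
                break
        if match is None:
            out.append(tokens[i])
            i += 1
            continue
        if include_tokens:
            out.extend(match)          # keep the individual words
        out.append(" ".join(match))    # plus the whole phrase as one token
        i += len(match)
    return out


tokens = "red seat cushions".lower().split()
print(auto_phrase(tokens, ["seat cushions"]))
# ['red', 'seat', 'cushions', 'seat cushions']
```

In the real analysis chain the KStem filter runs afterwards, which is why 
the single word shows up as "cushion" rather than "cushions".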

The problem is at query time. According to the blog post, the request 
handler needs to use a custom query parser to get the right result. Here's 
my entry in solrconfig.xml:

<requestHandler name="/autophrase" class="solr.SearchHandler">
  <lst name="defaults">

    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <str name="title">Solritas</str>

    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="defType">autophrasingParser</str>
  </lst>
</requestHandler>

<queryParser name="autophrasingParser"
             class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
  <str name="phrases">autophrases.txt</str>
</queryParser>

But if I query "seat cushions" using this request handler, it seems to 
treat the query as two separate terms and returns all results matching 
"seat" and "cushion". I'm not sure what I'm missing here. I'm using 
Solr 4.10.
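
To check how the query is actually parsed, the request can be sent with 
debugQuery=true, which makes Solr include the parsed query in the debug 
section of the response. A minimal sketch that only builds such a URL for 
the /autophrase handler; the host and core name (localhost:8983, 
collection1) are assumptions and need to be adjusted:

```python
from urllib.parse import urlencode


def autophrase_url(query, base="http://localhost:8983/solr/collection1"):
    """Build a debug request URL for the /autophrase handler.

    `base` is a hypothetical host and core name, not taken from my setup.
    """
    params = {
        "q": query,
        "wt": "json",          # override the velocity default for readability
        "debugQuery": "true",  # ask Solr to report the parsed query
    }
    return f"{base}/autophrase?{urlencode(params)}"


print(autophrase_url("seat cushions"))
# http://localhost:8983/solr/collection1/autophrase?q=seat+cushions&wt=json&debugQuery=true
```

The "parsedquery" entry in the debug output then shows whether the phrase 
reached the index as one term or as two separate terms.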

My other question is whether 
"com.lucidworks.analysis.AutoPhrasingQParserPlugin" supports the edismax 
features, since edismax is my default parser. 

I'd appreciate any feedback. 

-Thanks 
Shamik



