Hi Alexey,
Lucene's QueryParser, and at least some of Solr's query parsers (I'm not
familiar with all of them), have the problem you mention: the analyzer is fed
the query one whitespace-separated word at a time, instead of the whole string
between operators. There is a JIRA issue for fixing this, but no work has been
done on it yet:
<https://issues.apache.org/jira/browse/LUCENE-2605>.
Separately, do you know about the "raw" query parser[2]? I'm not sure if it
would help, but you may be able to use it in an alternate solution.
One small simplification I can think of for your current setup:
ShingleFilterFactory[1] takes an option called "tokenSeparator" - if you set
this to the empty string (""), you can eliminate your whitespace-stripping
filter.
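A sketch of the index analyzer with that change, assuming the rest of the
chain stays exactly as in your message. One caveat: your "\W+" pattern also
strips other non-word characters (punctuation, for example) from single
tokens, so dropping that filter is only equivalent if removing the shingle
separator was its sole purpose.

```xml
<analyzer type="index">
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="'+" replacement=""/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <!-- tokenSeparator="" joins shingled words with nothing between them,
       so "walmart bakery" yields "walmart", "walmartbakery", "bakery" -->
  <filter class="solr.ShingleFilterFactory" minShingleSize="2"
          maxShingleSize="3" outputUnigrams="true" tokenSeparator=""/>
  <!-- the PatternReplaceFilterFactory step from the original chain is
       omitted here; keep it if you still need punctuation stripped -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```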
Steve
[1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
[2] http://wiki.apache.org/solr/SolrQuerySyntax#Other_built-in_useful_query_parsers
> -----Original Message-----
> From: Alexey Verkhovsky [mailto:[email protected]]
> Sent: Monday, February 27, 2012 1:26 PM
> To: [email protected]
> Subject: Combining ShingleFilter and DisMaxParser, with a twist
>
> Say there is an index of business names (fairly short text snippets),
> containing: Walmart, Walmart Bakery and Mini Mart. And say we need a query
> for 'wal mart' to match all three, with an appropriate ranking order. We
> also need 'walmart', 'walmart bakery' and 'bakery' to find the right things
> in the right order.
>
> Here is the solution we came up with:
>
> 1. edismax query parser (we don't need it for this, but do for a number of
> other requirements)
>
> 2. On the index, apply ShingleFilter, then remove word separators in the
> shingles, so that "walmart bakery" is indexed as "walmart", "bakery",
> "walmartbakery".
> The schema for this index looks like this:
> <analyzer type="index">
>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>               pattern="'+" replacement=""/>
>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   <filter class="solr.ASCIIFoldingFilterFactory"/>
>   <filter class="solr.ShingleFilterFactory" minShingleSize="2"
>           maxShingleSize="3" outputUnigrams="true"/>
>   <filter class="solr.PatternReplaceFilterFactory" pattern="\W+"
>           replacement=""/>
>   <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
>
> 3. Before sending the original query to Solr, modify it by appending a
> whitespace-stripped version of it. Thus, 'wal mart' becomes 'wal mart
> walmart' and 'walmart bakery' becomes 'walmart bakery walmartbakery'.
> Don't modify the query if it has only one word in it, or contains any
> edismax syntax (double quotes; pluses and minuses at the beginning of the
> query or after whitespace).
>
> 4. ... profit.
>
> The reason we have to shingle the query before it reaches Solr is that the
> edismax parser treats 'wal mart' as two queries - 'wal' OR 'mart', so
> applying the ShingleFilter in the query analyzer doesn't do anything.
>
> This works, but feels a little dirty. Is there a more elegant way to solve
> this problem?
>
> --
> Alex Verkhovsky