Re: Handling ampersands in searches.

Erick Erickson Wed, 16 Nov 2016 12:41:30 -0800

Why do you think that the porter stemmer is involved here? A stemmer
takes tokens and tries to reduce them to their base form through
a set of rules; it doesn't remove tokens. My guess is that the & just
falls outside all the rules, so it is passed through unchanged.


This is where the admin/analysis page is invaluable. If you look at
your field type you'll notice that you have different options on
WordDelimiterFilterFactory for query and index time; in particular,
"preserveOriginal" is 0 at index time and 1 at query time.

So you get the tokens
light
fit

in your index and

light
&
fit

at query time.

Then, since the query requires all three terms and & was never
indexed, it fails. I'm also guessing you have mm=100% or the default
op set to AND in your edismax configuration.
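
One way to make the two chains agree, sketched here and untested against
your setup, is to set preserveOriginal="0" in the query analyzer as well,
so the lone & is discarded on both sides. Everything except that one
attribute is copied from the field type you posted:

```xml
<!-- Query-time analyzer only; a sketch that assumes the rest of the chain stays as posted -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- preserveOriginal now matches the index-time setting (it was "1") -->
  <filter class="solr.WordDelimiterFilterFactory"
          catenateWords="1"
          preserveOriginal="0"
          splitOnNumerics="0"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"/>
</analyzer>
```

Since this only touches query-time analysis, it should not need a reindex,
but do confirm on the admin/analysis page that "light & fit" now produces
the same tokens on both sides.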

Anyway, this all kind of starts with choosing WhitespaceTokenizerFactory
as your tokenizer. StandardTokenizerFactory will (I think) remove the
ampersand in both cases. You can also use a CharFilterFactory to apply
some filtering to characters before anything starts going through the
analysis chain (NOTE: this is a CharFilter, not a Filter! See something
like PatternReplaceCharFilterFactory at
https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory)
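
For the CharFilter route, a minimal sketch (untested; the pattern is my
assumption about what a "standalone" ampersand should mean) would put a
charFilter ahead of the tokenizer in both the index and query analyzers.
Note that the & has to be written as &amp; inside the XML attribute:

```xml
<!-- Shown for the query analyzer; the same charFilter would go in the type="index" analyzer -->
<analyzer type="query">
  <!-- Replace an ampersand that stands alone (bounded by whitespace or by the
       start/end of the input) with a space before tokenizing.
       The pattern is an illustrative assumption, not a tested recommendation. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(^|\s)&amp;(\s|$)"
              replacement=" "/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- ... remaining filters unchanged ... -->
</analyzer>
```

The start/end anchors are there because the query parser may hand the
analyzer one whitespace-separated term at a time, in which case the input
is just "&". The pattern deliberately leaves something like AT&T alone so
WordDelimiterFilterFactory still handles it. Again, check the result on
the admin/analysis page.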

Best,
Erick

On Wed, Nov 16, 2016 at 9:34 AM, Callum Lamb <cl...@mintel.com> wrote:
> I'm having an issue where searches that contain ampersands aren't being
> handled correctly. I need them to be dropped at index time *AND* query
> time. When documents come in and are indexed the ampersands are
> successfully dropped when they go into my stemmed field (When I facet on
> the stemmed field they aren't in the list), but when I actually search with
> a term containing an ampersand, I get no results.
>
> E.g. if I search for the string "light fit" or "light and fit" then I get
> results, but when I search for "light & fit" I get none, even though the
> SnowballPorterFilterFactory should be dropping it at query time like it
> does for the "and", and all 3 queries *should* be equivalent.
>
> I've tried adding a synonym, which shows up in
> my _schema_analysis_synonyms_default.json (I only have one default file)
> like this (I've tried both this form and its inverse):
>
> "and":[
>
>       "&",
>       "and"],
>
>
> I've also tried adding the StopWord filter to my fieldtype with & in the
> stopwords (though this shouldn't be necessary because the SnowBallPorter
> should be dropping it anyway) and it still doesn't work.
>
> Is there some kind of special handling I need for ampersands? I'm thinking
> that Solr must be interpreting it as some kind of operator and I need to
> tell Solr that it's actually literal text so the SnowBallPorter knows to
> drop it. Using backslashes or url encoding instead doesn't work though.
> Does anyone have any ideas?
>
> I can obviously just remove any ampersands from the q before I submit the
> query to Solr and get the correct results, so this is not a game-breaking
> problem, but I'm more curious as to *why* this is happening and how to fix
> it correctly.
>
> Cheers,
>
> Callum.
>
> Extra info:
>
> I'm using Solr 5.5.2 in cloud mode.
>
> The q in the queries is specified like this and is parsed in the following
> way:
>
> "rawquerystring":"stemmed_description:light & fit",
> "querystring":"stemmed_description:light & fit",
> "parsedquery":"(+(+stemmed_description:light +DisjunctionMaxQuery((stemmed_description:&)) +DisjunctionMaxQuery((stemmed_description:fit))))/no_coord",
> "parsedquery_toString":"+(+stemmed_description:light +(stemmed_description:&) +(stemmed_description:fit))",
>
> I have a stemmed field in my schema (schema version 1.5) defined
> like this:
>
> <field name="stemmed_description" type="stemmed_text" indexed="true" stored="false" required="false" multiValued="true"/>
>
> with a field type defined like this:
>
>     <!-- Stemmed text type -->
>     <fieldType name="stemmed_text" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StandardFilterFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 catenateWords="1"
>                 preserveOriginal="0"
>                 splitOnNumerics="0"/>
>         <filter class="solr.ASCIIFoldingFilterFactory"/>
>         <filter class="solr.ManagedSynonymFilterFactory" managed="default"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StandardFilterFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 catenateWords="1"
>                 preserveOriginal="1"
>                 splitOnNumerics="0"/>
>         <filter class="solr.ASCIIFoldingFilterFactory"/>
>
>         <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>       </analyzer>
>     </fieldType>
>
