Re: searching camel cased terms with phrase queries

Jack Krupansky Thu, 08 Nov 2012 07:05:23 -0800

I forgot to mention DictionaryCompoundWordTokenFilterFactory. It does require you to create a dictionary of terms, as opposed to using the terms that have been encountered in the index.
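
As a rough illustration of the dictionary-based idea, here is a simplified Python sketch. It is not Lucene's implementation: the function name, the min_subword_size parameter and the two-word dictionary are assumptions made up for this example.

```python
def decompound(token, dictionary, min_subword_size=2):
    """Return the original token followed by every dictionary word found inside it.

    Toy model only: it scans all substrings of the lowercased token and keeps
    those that appear in the user-supplied dictionary.
    """
    lowered = token.lower()
    found = []
    for start in range(len(lowered)):
        for end in range(start + min_subword_size, len(lowered) + 1):
            candidate = lowered[start:end]
            if candidate in dictionary:
                found.append(candidate)
    return [token] + found


# Hypothetical dictionary; a real one would have to be curated by hand.
print(decompound("smarttv", {"smart", "tv"}))  # ['smarttv', 'smart', 'tv']
```

The point is only that the sub-words come from a list you maintain yourself, not from the terms already in the index.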


-- Jack Krupansky

-----Original Message-----
From: Jack Krupansky
Sent: Wednesday, November 07, 2012 8:14 AM
To: solr-user@lucene.apache.org
Subject: Re: searching camel cased terms with phrase queries

This is one of those areas of Solr where you can refine and make
improvements, as you have done, but never actually reach 100% satisfaction.
And, in some cases, as here, you have a choice of settings and no single
combination covers all cases.

In this case, you really need compound-term recognition - detecting that two
or more terms have been juxtaposed with no lexical boundary. Google has it,
and I'm sure some Solr users have implemented it on their own, but it isn't
in Solr proper, yet.

WDF provides a partial approximation, by generating extra, compound terms at
index time. That works well when ALL of the terms are written together, but
not when only a subset are written together without lexical boundaries, as
in your final example.

So, you COULD go the full Google route with a lot of additional effort, or
accept that you offer only a reasonable approximation. Your choice.

So, pick the approximation which seems "best" and accept that it doesn't
handle the other cases.

BTW, the proper name is "PricewaterhouseCoopers".

-- Jack Krupansky

-----Original Message-----
From: Dmitry Kan
Sent: Wednesday, November 07, 2012 1:58 AM
To: solr-user@lucene.apache.org
Subject: searching camel cased terms with phrase queries

Hello list,

There have apparently been a number of threads about handling camel-cased
words in the past (
http://search-lucene.com/?q=camel+case&fc_project=Lucene&fc_project=Solr).
Our case is somewhat different from them.

===================
Configuration & example
===================

To illustrate the issue, let me give you a real example from our data.
Suppose there is a term in the original text: SmartTV.

Whether a user types "SmartTV" or "smart tv", we want the query to hit the
original term SmartTV. To achieve this, the following filter is
used in our Solr 3.4 schema:

index side:

             <filter class="solr.WordDelimiterFilterFactory"
               generateWordParts="1"
               generateNumberParts="0"
               catenateWords="0"
               catenateNumbers="0"
               catenateAll="0"
               preserveOriginal="1"
               spiltOnCaseChange="1"
             />

query side:

             <filter class="solr.WordDelimiterFilterFactory"
               generateWordParts="1"
               generateNumberParts="0"
               catenateWords="0"
               catenateNumbers="0"
               catenateAll="0"
               preserveOriginal="1"
               spiltOnCaseChange="1"
             />

(no differences)

Copying from the analysis page, the index will contain the following terms
and their positions:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text  startOffset  endOffset  type
1         SmartTV    0            7          <ALPHANUM>
1         Smart      0            5          <ALPHANUM>
2         TV         5            7          <ALPHANUM>

(The tokenizer StandardTokenizerFactory and the StandardFilterFactory
precede this filter, but as they had no effect in this case, their
output is omitted.)

On the query side, the query "smart tv" gets processed like this:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text  startOffset  endOffset  type
1         smart      0            5          <ALPHANUM>
2         tv         6            8          <ALPHANUM>

So there is a match (of course, the LowerCaseFilterFactory is configured to
follow the WordDelimiterFilterFactory to unify the case for matching), and the
user can happily run the queries 'smart tv', 'smarttv' and 'SmartTV'.
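
To make the positions above easier to follow, here is a toy Python model of the split on case change with preserveOriginal. It is only a sketch that reproduces the positions shown on the analysis page for these inputs, not the actual WordDelimiterFilter code; the function name and the whitespace tokenization are assumptions.

```python
import re


def split_on_case_change(text, preserve_original=True):
    """Return (position, term) pairs for a whitespace-tokenized text.

    Each word is split at lower-to-upper case transitions. When a word is
    actually split, the original word is kept at the position of its first part.
    """
    tokens = []
    position = 1
    for word in text.split():
        parts = re.split(r"(?<=[a-z])(?=[A-Z])", word)
        if preserve_original and len(parts) > 1:
            tokens.append((position, word))
        for offset, part in enumerate(parts):
            tokens.append((position + offset, part))
        position += len(parts)
    return tokens


print(split_on_case_change("SmartTV"))   # [(1, 'SmartTV'), (1, 'Smart'), (2, 'TV')]
print(split_on_case_change("smart tv"))  # [(1, 'smart'), (2, 'tv')]
```

After lowercasing, 'smart' at position 1 and 'tv' at position 2 line up on both sides, which is why the phrase "smart tv" matches.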

===================================================
More complex example that doesn't work with the above configuration
===================================================

Problems start to occur if a user types "smarttv for me" against the text
"SmartTV for me". Here are the index and query analysis excerpts:

index:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text  startOffset  endOffset  type
1         SmartTV    0            7          <ALPHANUM>
1         Smart      0            5          <ALPHANUM>
2         TV         5            7          <ALPHANUM>
3         for        8            11         <ALPHANUM>
4         me         12           14         <ALPHANUM>

query:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text  startOffset  endOffset  type
1         smarttv    0            7          <ALPHANUM>
2         for        8            11         <ALPHANUM>
3         me         12           14         <ALPHANUM>

Since the user typed smarttv in lower case, no split on case change is
triggered, and we believe there is no match because the term positions
differ: 'for' is at the 3rd position in the index but at the 2nd position
in the query, so 'smarttv' and 'for' are not adjacent in the index, as the
phrase query requires.

=========================
Config change to fix the problem
=========================

But here catenateWords=1 on the indexing side comes to the rescue, which
changes things to:

index:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text  startOffset  endOffset  type
1         SmartTV    0            7          <ALPHANUM>
1         Smart      0            5          <ALPHANUM>
2         TV         5            7          <ALPHANUM>
2         SmartTV    0            7          <ALPHANUM>
3         for        8            11         <ALPHANUM>
4         me         12           14         <ALPHANUM>

query (copying again for comparison purposes):

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text  startOffset  endOffset  type
1         smarttv    0            7          <ALPHANUM>
2         for        8            11         <ALPHANUM>
3         me         12           14         <ALPHANUM>

Now there should be a match, because the terms 'smarttv', 'for' and 'me' are
adjacent in the index (ignoring the case differences, as
LowerCaseFilterFactory unifies them for us), and that is what's required by
the phrase query "smarttv for me".
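
The position argument can be checked with a small Python sketch. It is only a model of exact phrase matching (no slop) over the token positions listed above, not Lucene's implementation; phrase_matches and the variable names are made up for illustration, and lowercasing stands in for LowerCaseFilterFactory.

```python
from collections import defaultdict


def phrase_matches(index_tokens, query_tokens):
    """Return True if the query terms occur at consecutive index positions.

    Both arguments are lists of (position, term) pairs. Terms sharing a
    position are treated as alternatives for that position.
    """
    index_terms = defaultdict(set)
    for position, term in index_tokens:
        index_terms[position].add(term.lower())
    query_terms = defaultdict(set)
    for position, term in query_tokens:
        query_terms[position].add(term.lower())

    first = min(query_terms)
    for start in sorted(index_terms):
        # Try to anchor the first query position at this index position.
        if all(
            index_terms.get(start + (position - first), set()) & terms
            for position, terms in query_terms.items()
        ):
            return True
    return False


# Token streams copied from the analysis output for "SmartTV for me".
index_plain = [(1, "SmartTV"), (1, "Smart"), (2, "TV"), (3, "for"), (4, "me")]
index_catenated = index_plain + [(2, "SmartTV")]  # extra term from catenateWords=1
query = [(1, "smarttv"), (2, "for"), (3, "me")]

print(phrase_matches(index_plain, query))      # False
print(phrase_matches(index_catenated, query))  # True
```

With catenateWords=1 the phrase can start at index position 2, where the catenated 'SmartTV' sits directly before 'for'.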

====================
Problem we couldn't solve
====================

As we saw above, catenateWords merges the maximal run of compound term parts
into one term and aligns the resulting concatenated term with the last term
part. Illustration with an artificial camel casing:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text               startOffset  endOffset  type
1         PriceWaterHouseCoopers  0            22         <ALPHANUM>
1         Price                   0            5          <ALPHANUM>
2         Water                   5            10         <ALPHANUM>
3         House                   10           15         <ALPHANUM>
4         Coopers                 15           22         <ALPHANUM>
4         PriceWaterHouseCoopers  0            22         <ALPHANUM>

The following text and query will not match each other: text='product for
PriceWaterHouseCoopers company', query="product for PricewaterHouseCoopers
company":

index:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text               startOffset  endOffset  type
1         product                 0            7          <ALPHANUM>
2         for                     8            11         <ALPHANUM>
3         PriceWaterHouseCoopers  12           34         <ALPHANUM>
3         Price                   12           17         <ALPHANUM>
4         Water                   17           22         <ALPHANUM>
5         House                   22           27         <ALPHANUM>
6         Coopers                 27           34         <ALPHANUM>
6         PriceWaterHouseCoopers  12           34         <ALPHANUM>
7         company                 35           42         <ALPHANUM>

query:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position  term text               startOffset  endOffset  type
1         product                 1            8          <ALPHANUM>
2         for                     9            12         <ALPHANUM>
3         PricewaterHouseCoopers  13           35         <ALPHANUM>
3         Pricewater              13           23         <ALPHANUM>
4         House                   23           28         <ALPHANUM>
5         Coopers                 28           35         <ALPHANUM>
6         company                 36           43         <ALPHANUM>

Is there any way to make them match?
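
For reference, lining the two analyses above up position by position shows where the phrase breaks. The following Python sketch is only an illustration: first_mismatch is a made-up helper, lowercasing stands in for LowerCaseFilterFactory, and aligning the streams one-to-one is an assumption that holds here because 'product' occurs only at position 1 on both sides.

```python
from collections import defaultdict


def terms_by_position(tokens):
    """Group lowercased terms by their position."""
    grouped = defaultdict(set)
    for position, term in tokens:
        grouped[position].add(term.lower())
    return grouped


def first_mismatch(index_tokens, query_tokens):
    """Return (position, query terms, index terms) for the first query
    position that shares no term with the same index position, else None."""
    index_terms = terms_by_position(index_tokens)
    query_terms = terms_by_position(query_tokens)
    for position in sorted(query_terms):
        at_index = index_terms.get(position, set())
        if not (query_terms[position] & at_index):
            return position, sorted(query_terms[position]), sorted(at_index)
    return None


# Token streams copied from the index and query analysis output above.
index_tokens = [
    (1, "product"), (2, "for"),
    (3, "PriceWaterHouseCoopers"), (3, "Price"),
    (4, "Water"), (5, "House"),
    (6, "Coopers"), (6, "PriceWaterHouseCoopers"),
    (7, "company"),
]
query_tokens = [
    (1, "product"), (2, "for"),
    (3, "PricewaterHouseCoopers"), (3, "Pricewater"),
    (4, "House"), (5, "Coopers"), (6, "company"),
]

print(first_mismatch(index_tokens, query_tokens))  # (4, ['house'], ['water'])
```

Position 3 still agrees through the preserved original, but from position 4 on the query is one position ahead of the index, because 'Pricewater' is a single part in the query while the index has 'Price' and 'Water'.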

Thanks for reading this far.

-dmitry
