Re: Finding single word index data based on multiple word query

Erick Erickson Wed, 11 Jan 2012 05:53:11 -0800

Dave:

That's actually an interesting way to use WordDelimiterFilterFactory. I think
you're being bitten by the difference between analysis and query parsing. The
analysis page bypasses query parsing and throws the input against the exact
field you specified, without any, well, parsing.


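But when submitting a query, what you put in goes through the query parser
first, which can lead to surprising results. The parser splits unquoted,
unescaped input on whitespace before the field's analyzer sees it, so each
word gets analyzed on its own.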

Anyway, try escaping the space, as NAME:DIRECT\ BUY. When I did something
similar (with &debugQuery=on), q=eoe:hotel\ california parsed to:
eoe:hotel eoe:california eoe:hotelcalifornia
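
If the escaping ends up in client code, it can be very small. A minimal
Python sketch, assuming the client builds the q parameter itself;
escape_spaces is a made-up helper name, and it only handles whitespace, not
the other characters the query parser treats as special:

```python
def escape_spaces(field, text):
    """Build a field query whose spaces are backslash-escaped.

    Hypothetical helper: only whitespace is escaped, so the query parser
    hands the whole string to the field's analyzer as one chunk.
    """
    words = text.split()
    return f"{field}:" + "\\ ".join(words)


print(escape_spaces("NAME", "DIRECT BUY"))  # prints NAME:DIRECT\ BUY
```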

I had to set autoGeneratePhraseQueries="false" on the field type, otherwise
the multiple tokens were turned into a phrase query, which made things
interesting.

But be careful, this doesn't really work for more than 2 words. Submitting
q=eoe:hotel\ california\ los\ angeles gives this parsed form:
eoe:hotel eoe:california eoe:los eoe:angeles eoe:hotelcalifornialosangeles

So you'd have to do something like pre-process the terms to emit adjacent
pairs when there are more than 2 terms, something like
eoe:hotel\ california eoe:california\ los eoe:los\ angeles
which gives:
(eoe:hotel eoe:california eoe:hotelcalifornia) (eoe:california eoe:los
eoe:californialos) (eoe:los eoe:angeles eoe:losangeles)
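
The pre-processing could look roughly like this. It's a Python sketch only,
assuming the client builds the q string; pairwise_query is a hypothetical
name, and nothing but whitespace is escaped:

```python
def pairwise_query(field, text):
    """Emit escaped adjacent word pairs for a multi-word query.

    Hypothetical client-side helper. With one or two words the input is
    sent as a single escaped chunk; with more, each adjacent pair becomes
    its own clause so the analyzer can catenate two words at a time.
    """
    words = text.split()
    if not words:
        return ""
    if len(words) <= 2:
        return f"{field}:" + "\\ ".join(words)
    # zip(words, words[1:]) yields (w1, w2), (w2, w3), ...
    return " ".join(f"{field}:{a}\\ {b}" for a, b in zip(words, words[1:]))


print(pairwise_query("eoe", "hotel california los angeles"))
# prints eoe:hotel\ california eoe:california\ los eoe:los\ angeles
```

Note that the number of clauses grows with the number of words, and this
still only catches names that were split into exactly two pieces.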

The "catenate all" is probably not something you really want to do, although
it's also probably redundant with catenateWords="1"....

Synonyms would also work here, but you're back to creating a list of all the
possibilities, something you already said you don't want to do....

By the way, customizing Solr code is easier than you think: you can create a
plugin that becomes the only thing you have to compile, rather than
maintaining your own build of all of Solr. But that's another story.

Best
Erick

On Mon, Jan 9, 2012 at 8:41 PM, Giannone, David
<david.gianno...@onstar.com> wrote:
> Hi,
>
>
>
> I'm relatively new to Solr and am trying to solve the following problem:
> we have very structured data that includes the business name for 13
> million points of interest.    There are many names that are actually
> one word, but a user will commonly think it is 2 words and enter it that
> way (e.g., DIRECTBUY entered as DIRECT BUY, CALIFORNIAKIDS entered as
> CALIFORNIA KIDS, LASMARGARITAS entered as LAS MARGARITAS, etc.).    With
> our current search engine we have the client code concatenate the words
> and send the individual and concatenated words in the request.   With
> Solr I was hoping to get rid of that custom code for query and replace
> it with index and query analyzer configuration.   I've tried using a
> WordDelimiterFilterFactory with catenateWords and catenateAll.  The
> schema.xml text field def for this field is below.   I used the
> KeywordTokenizerFactory instead of WhitespaceTokenizerFactory in the
> query analyzer because this configuration resulted in a match in the
> admin analysis form, while the WhitespaceTokenizerFactory did not.
> However, even though admin analysis showed a match, actual queries
> against the index would not pull back NAME = DIRECTBUY when entering
> NAME:"DIRECT BUY" as the query.
>
>
>
> I also tried using the DictionaryCompoundWordTokenFilterFactory with a
> dictionary list that included "DIRECT" and "BUY" as words.   This
> worked, but there are too many instances of the compound names to go
> through all of them and enter them into a file.
>
>
>
> Is there a way to do this with configuration or is manipulation of the
> input query the only way?   Not looking to customize any Solr code at
> this point.
>
>
>
>    <fieldType name="text_en_splitting_tight" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>
>      <analyzer type="index">
>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_en.txt"/>
>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>
>        <filter class="solr.LowerCaseFilterFactory"/>
>
>        <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>
>        <filter class="solr.EnglishMinimalStemFilterFactory"/>
>
>        <!-- this filter can remove any duplicate tokens that appear at
> the same position - sometimes possible with WordDelimiterFilter in
> conjunction with stemming. -->
>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>
>      </analyzer>
>
>      <analyzer type="query">
>
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>
>        <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_en.txt"/>
>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="0" catenateWords="1"
> catenateNumbers="1" catenateAll="1"/>
>
>        <filter class="solr.LowerCaseFilterFactory"/>
>
>        <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>
>        <filter class="solr.EnglishMinimalStemFilterFactory"/>
>
>        <!-- this filter can remove any duplicate tokens that appear at
> the same position - sometimes possible with WordDelimiterFilter in
> conjunction with stemming. -->
>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>
>      </analyzer>
>
>    </fieldType>
>
>
>
> Thanks,
>
>
>
> Dave Giannone
>
>
>
