Thanks Erick,

I didn't get confused with multiple tokens vs multiValued  :)

Before I go ahead and re-index 4m docs (and believe me, I'm using the
analysis page like a madman!), what do I need to configure to have the
following indexed both with and without the dots...

.net
sales manager.
£12.50

Currently...

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        types="wdftypes.txt"
/>

with nothing specific in wdftypes.txt for full-stops.
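Would adding preserveOriginal="1" do it? From what I can tell from the
docs it keeps the original token alongside the generated parts, so
".net" should get indexed as both ".net" and "net". Something like this
(just a guess on my part - I'd verify it on the analysis page first):

```xml
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        preserveOriginal="1"
        types="wdftypes.txt"
/>
```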

Should there also be any difference when quoting my searches?

The analysis page seems to just drop the quotes, but surely the actual
search requests don't do this?
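To make the question concrete, these are the two kinds of call I mean
(host, port and handler path are made up for the example):

```
http://localhost:8983/solr/select?q=sales+manager
http://localhost:8983/solr/select?q="sales manager"
```

Does the second one actually behave as a phrase query, or do the quotes
get dropped the same way the analysis page drops them?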



---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
<erickerick...@gmail.com> wrote:
> Yes, WDF creates multiple tokens. But that has
> nothing to do with the multiValued suggestion.
> 
> You can get exactly what you want by
> 1> setting multiValued="true" in your schema file and re-indexing. Say
> positionIncrementGap is set to 100
> 2> When you index, add the field for each sentence, so your doc
>       looks something like:
>      <doc>
>         <field name="sentences">i am a sales-manager in here</field>
>        <field name="sentences">using asp.net and .net daily</field>
>          .....
>       </doc>
> 3> search like "sales manager"~100
> 
> Best
> Erick
> 
> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown <r...@intelcompute.com> wrote:
>> Apologies if things were a little vague.
>>
>> Given the example snippet to index (numbered to show searches needed to
>> match)...
>>
>> 1: i am a sales-manager in here
>> 2: using asp.net and .net daily
>> 3: working in design.
>> 4: using something called sage 200. and i'm fluent
>> 5: german sausages.
>> 6: busy A&E dept earning £10,000 annually
>>
>>
>> ... all with newlines in place.
>>
>> able to match...
>>
>> 1. sales
>> 1. "sales manager"
>> 1. sales-manager
>> 1. "sales-manager"
>> 2. .net
>> 2. asp.net
>> 3. design
>> 4. sage 200
>> 6. A&E
>> 6. £10,000
>>
>> But do NOT match "fluent german" from 4 + 5 since there's a newline
>> between them when indexed, but not when searched.
>>
>>
>> Do the filters (WDF in this case) not create multiple tokens, so that
>> splitting on the period in "asp.net" would create tokens for all of
>> "asp", "asp.", "asp.net", ".net", and "net"?
>>
>>
>> Cheers,
>> Rob
>>
>> --
>>
>> IntelCompute
>> Web Design and Online Marketing
>>
>> http://www.intelcompute.com
>>
>>
>> -----Original Message-----
>> From: Chris Hostetter <hossman_luc...@fucit.org>
>> Reply-to: solr-user@lucene.apache.org
>> To: solr-user@lucene.apache.org
>> Subject: Re: Which Tokeniser (and/or filter)
>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>
>> : This all seems a bit too much work for such a real-world scenario?
>>
>> You haven't really told us what your scenario is.
>>
>> You said you want to split tokens on whitespace, full-stop (aka:
>> period) and comma only, but then in response to some suggestions you
>> added comments about other things that you never mentioned previously...
>>
>> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
>> 2) evidently you not only want token splits on newlines, but also
>> position gaps to prevent phrases matching across newlines.
>>
>> ...these are kind of important details that affect suggestions people
>> might give you.
>>
>> can you please provide some concrete examples of the types of data you
>> have, the types of queries you want them to match, and the types of
>> queries you *don't* want to match?
>>
>>
>> -Hoss
>>
