Attempting to reproduce the legacy behaviour (I know!) of simple SQL substring searching, with and without phrases.
I feel simply NGram'ing 4m CVs may be pushing it?

---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 11:27:24 -0500, Erick Erickson <erickerick...@gmail.com> wrote:

> You'll probably have to index them in separate fields to get what you
> want. The question is always whether it's worth it: is the use-case
> really well served by having a variant that keeps dots and the like?
> But that's always more a question for your product manager...
>
> Best
> Erick
>
> On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown <r...@intelcompute.com> wrote:
>> Thanks Erick,
>>
>> I didn't get confused with multiple tokens vs multiValued :)
>>
>> Before I go ahead and re-index 4m docs (and believe me, I'm using the
>> analysis page like a madman!), what do I need to configure to have the
>> following indexed both with and without the dots?
>>
>> .net
>> sales manager.
>> £12.50
>>
>> Currently...
>>
>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> <filter class="solr.WordDelimiterFilterFactory"
>>         generateWordParts="1"
>>         generateNumberParts="1"
>>         catenateWords="1"
>>         catenateNumbers="1"
>>         catenateAll="1"
>>         splitOnCaseChange="1"
>>         splitOnNumerics="1"
>>         types="wdftypes.txt"
>> />
>>
>> ...with nothing specific in wdftypes.txt for full stops.
>>
>> Should there also be any difference when quoting my searches?
>>
>> The analysis page seems to just drop the quotes, but surely actual
>> calls don't do this?
>>
>> ---
>>
>> IntelCompute
>> Web Design & Local Online Marketing
>>
>> http://www.intelcompute.com
>>
>>
>> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>>> Yes, WDFF creates multiple tokens. But that has
>>> nothing to do with the multiValued suggestion.
>>>
>>> You can get exactly what you want by:
>>> 1> setting multiValued="true" in your schema file and re-indexing.
>>> Say positionIncrementGap is set to 100.
>>> 2> When you index, add the field for each sentence, so your doc
>>> looks something like:
>>> <doc>
>>>   <field name="sentences">i am a sales-manager in here</field>
>>>   <field name="sentences">using asp.net and .net daily</field>
>>>   .....
>>> </doc>
>>> 3> search like "sales manager"~100
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown <r...@intelcompute.com> wrote:
>>>> Apologies if things were a little vague.
>>>>
>>>> Given the example snippet to index (numbered to show the searches
>>>> that need to match)...
>>>>
>>>> 1: i am a sales-manager in here
>>>> 2: using asp.net and .net daily
>>>> 3: working in design.
>>>> 4: using something called sage 200. and i'm fluent
>>>> 5: german sausages.
>>>> 6: busy A&E dept earning £10,000 annually
>>>>
>>>> ...all with newlines in place.
>>>>
>>>> It should be able to match...
>>>>
>>>> 1. sales
>>>> 1. "sales manager"
>>>> 1. sales-manager
>>>> 1. "sales-manager"
>>>> 2. .net
>>>> 2. asp.net
>>>> 3. design
>>>> 4. sage 200
>>>> 6. A&E
>>>> 6. £10,000
>>>>
>>>> But it should NOT match "fluent german" from 4 + 5, since there's a
>>>> newline between them when indexed, but not when searched.
>>>>
>>>> Don't the filters (WDF in this case) create multiple tokens? So
>>>> splitting on the period in "asp.net" would create tokens for all of
>>>> "asp", "asp.", "asp.net", ".net", "net".
>>>>
>>>> Cheers,
>>>> Rob
>>>>
>>>> --
>>>>
>>>> IntelCompute
>>>> Web Design and Online Marketing
>>>>
>>>> http://www.intelcompute.com
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Chris Hostetter <hossman_luc...@fucit.org>
>>>> Reply-to: solr-user@lucene.apache.org
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Which Tokeniser (and/or filter)
>>>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>>>
>>>> : This all seems a bit too much work for such a real-world scenario?
>>>>
>>>> You haven't really told us what your scenario is.
>>>>
>>>> You said you want to split tokens on whitespace, full stop (aka
>>>> period) and comma only, but then in response to some suggestions you
>>>> added comments about other things that you never mentioned
>>>> previously...
>>>>
>>>> 1) Evidently you don't want the "." in foo.net to cause a split in
>>>> tokens?
>>>> 2) Evidently you not only want token splits on newlines, but also
>>>> position gaps to prevent phrases matching across newlines.
>>>>
>>>> ...These are kind of important details that affect the suggestions
>>>> people might give you.
>>>>
>>>> Can you please provide some concrete examples of the types of data
>>>> you have, the types of queries you want them to match, and the types
>>>> of queries you *don't* want to match?
>>>>
>>>> -Hoss
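
[Editor's sketch of the two suggestions made in this thread. The field
and type names below are illustrative only, not from the thread: a
WordDelimiterFilterFactory chain with preserveOriginal="1" keeps the
unmodified token (".net", "£12.50") alongside the split parts, and a
multiValued field with a large positionIncrementGap keeps phrases from
matching across the per-sentence values.]

```xml
<!-- Sketch only: names are made up for illustration. -->

<!-- Keeps ".net", "sales-manager." etc. intact as well as the split
     parts: preserveOriginal="1" emits the original token too. -->
<fieldType name="text_wdf_orig" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"
            types="wdftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- One value per sentence/line; the 100-position gap between values
     stops ordinary phrase queries matching across newlines. -->
<field name="sentences" type="text_wdf_orig"
       indexed="true" stored="true" multiValued="true"/>
```

With a schema along these lines, a plain phrase query like
sentences:"fluent german" would not match across two values, while the
sloppy form "sales manager"~100 from Erick's reply matches within one.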