Re: Which Tokeniser (and/or filter)

Erick Erickson Wed, 08 Feb 2012 04:39:13 -0800

Yes, WDDF creates multiple tokens. But that has
nothing to do with the multiValued suggestion.


You can get exactly what you want by
1> setting multiValued="true" on the field in your schema file and
re-indexing. Say the field type's positionIncrementGap is set to 100.
2> When you index, add the field once for each sentence, so your doc
looks something like:
      <doc>
        <field name="sentences">i am a sales-manager in here</field>
        <field name="sentences">using asp.net and .net daily</field>
        .....
      </doc>
3> search like "sales manager"~100
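
As a rough illustration of step 2, here is a minimal sketch of turning
newline-separated text into a doc with one "sentences" value per line.
It assumes the field is called "sentences" as in the example above and
that each line of the input should become its own value. The function
name build_solr_doc is made up for this sketch, and a real doc would
also need whatever other fields your schema requires.

```python
import xml.etree.ElementTree as ET


def build_solr_doc(text, field_name="sentences"):
    """Return a <doc> element with one field value per non-empty line.

    build_solr_doc is a hypothetical helper; field_name is assumed to be
    a multiValued field in the schema.
    """
    doc = ET.Element("doc")
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines so no empty values are indexed
            field = ET.SubElement(doc, "field", name=field_name)
            field.text = line
    return doc


if __name__ == "__main__":
    snippet = (
        "i am a sales-manager in here\n"
        "using asp.net and .net daily\n"
        "working in design.\n"
    )
    # Wrap the doc in <add> so it has the shape of an XML update message.
    add = ET.Element("add")
    add.append(build_solr_doc(snippet))
    print(ET.tostring(add, encoding="unicode"))
```

ElementTree escapes characters such as "&" in the values, so a line
like "busy A&E dept" still produces well-formed XML.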

Best
Erick

On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown <r...@intelcompute.com> wrote:
> Apologies if things were a little vague.
>
> Given the example snippet to index (numbered to show searches needed to
> match)...
>
> 1: i am a sales-manager in here
> 2: using asp.net and .net daily
> 3: working in design.
> 4: using something called sage 200. and i'm fluent
> 5: german sausages.
> 6: busy A&E dept earning £10,000 annually
>
>
> ... all with newlines in place.
>
> able to match...
>
> 1. sales
> 1. "sales manager"
> 1. sales-manager
> 1. "sales-manager"
> 2. .net
> 2. asp.net
> 3. design
> 4. sage 200
> 6. A&E
> 6. £10,000
>
> But do NOT match "fluent german" from 4 + 5 since there's a newline
> between them when indexed, but not when searched.
>
>
> Do the filters (wdf in this case) not create multiple tokens, so that
> splitting on the period in "asp.net" would create tokens for all of "asp",
> "asp.", "asp.net", ".net", "net"?
>
>
> Cheers,
> Rob
>
> --
>
> IntelCompute
> Web Design and Online Marketing
>
> http://www.intelcompute.com
>
>
> -----Original Message-----
> From: Chris Hostetter <hossman_luc...@fucit.org>
> Reply-to: solr-user@lucene.apache.org
> To: solr-user@lucene.apache.org
> Subject: Re: Which Tokeniser (and/or filter)
> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>
> : This all seems a bit too much work for such a real-world scenario?
>
> You haven't really told us what your scenario is.
>
> You said you want to split tokens on whitespace, full-stop (aka:
> period) and comma only, but then in response to some suggestions you added
> comments about other things that you never mentioned previously...
>
> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
> 2) evidently you not only want token splits on newlines, but also
> position gaps to prevent phrases matching across newlines.
>
> ...these are kind of important details that affect suggestions people
> might give you.
>
> can you please provide some concrete examples of the types of data you
> have, the types of queries you want them to match, and the types of
> queries you *don't* want to match?
>
>
> -Hoss
>
