Attempting to reproduce the legacy behaviour (I know!) of simple SQL substring searching, with and without phrases.
I feel simply NGram'ing 4m CVs may be pushing it?

---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 11:27:24 -0500, Erick Erickson <erickerick...@gmail.com> wrote:

> You'll probably have to index them in separate fields to get what you
> want. The question is always whether it's worth it: is the use-case
> really well served by having a variant that keeps dots and the like?
> But that's always more a question for your product manager...
>
> Best
> Erick
>
> On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown <r...@intelcompute.com> wrote:
>> Thanks Erick,
>>
>> I didn't get confused with multiple tokens vs multiValued :)
>>
>> Before I go ahead and re-index 4m docs (and believe me, I'm using the
>> analysis page like a madman!), what do I need to configure to have the
>> following indexed both with and without the dots?
>>
>> .net
>> sales manager.
>> £12.50
>>
>> Currently...
>>
>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> <filter class="solr.WordDelimiterFilterFactory"
>>         generateWordParts="1"
>>         generateNumberParts="1"
>>         catenateWords="1"
>>         catenateNumbers="1"
>>         catenateAll="1"
>>         splitOnCaseChange="1"
>>         splitOnNumerics="1"
>>         types="wdftypes.txt"
>> />
>>
>> ...with nothing specific in wdftypes.txt for full stops.
>>
>> Should there also be any difference when quoting my searches?
>>
>> The analysis page seems to just drop the quotes, but surely actual
>> calls don't do this?
>>
>> ---
>>
>> IntelCompute
>> Web Design & Local Online Marketing
>>
>> http://www.intelcompute.com
>>
>>
>> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>>> Yes, WDFF creates multiple tokens. But that has
>>> nothing to do with the multiValued suggestion.
>>>
>>> You can get exactly what you want by:
>>> 1> setting multiValued="true" in your schema file and re-indexing.
>>> Say positionIncrementGap is set to 100.
>>> 2> When you index, add the field for each sentence, so your doc
>>> looks something like:
>>> <doc>
>>>   <field name="sentences">i am a sales-manager in here</field>
>>>   <field name="sentences">using asp.net and .net daily</field>
>>>   .....
>>> </doc>
>>> 3> search like "sales manager"~100
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown <r...@intelcompute.com> wrote:
>>>> Apologies if things were a little vague.
>>>>
>>>> Given the example snippet to index (numbered to show the searches
>>>> that need to match)...
>>>>
>>>> 1: i am a sales-manager in here
>>>> 2: using asp.net and .net daily
>>>> 3: working in design.
>>>> 4: using something called sage 200. and i'm fluent
>>>> 5: german sausages.
>>>> 6: busy A&E dept earning £10,000 annually
>>>>
>>>> ...all with newlines in place.
>>>>
>>>> It should be able to match...
>>>>
>>>> 1. sales
>>>> 1. "sales manager"
>>>> 1. sales-manager
>>>> 1. "sales-manager"
>>>> 2. .net
>>>> 2. asp.net
>>>> 3. design
>>>> 4. sage 200
>>>> 6. A&E
>>>> 6. £10,000
>>>>
>>>> But it should NOT match "fluent german" from 4 + 5, since there's a
>>>> newline between them when indexed, but not when searched.
>>>>
>>>> Don't the filters (WDF in this case) create multiple tokens? So
>>>> splitting on the period in "asp.net" would create tokens for all of
>>>> "asp", "asp.", "asp.net", ".net", "net".
>>>>
>>>> Cheers,
>>>> Rob
>>>>
>>>> --
>>>>
>>>> IntelCompute
>>>> Web Design and Online Marketing
>>>>
>>>> http://www.intelcompute.com
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Chris Hostetter <hossman_luc...@fucit.org>
>>>> Reply-to: solr-user@lucene.apache.org
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Which Tokeniser (and/or filter)
>>>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>>>
>>>> : This all seems a bit too much work for such a real-world scenario?
>>>>
>>>> You haven't really told us what your scenario is.
>>>>
>>>> You said you want to split tokens on whitespace, full stop (aka
>>>> period) and comma only, but then in response to some suggestions you
>>>> added comments about other things that you never mentioned
>>>> previously...
>>>>
>>>> 1) Evidently you don't want the "." in foo.net to cause a split in
>>>> tokens?
>>>> 2) Evidently you not only want token splits on newlines, but also
>>>> position gaps to prevent phrases matching across newlines.
>>>>
>>>> ...These are kind of important details that affect the suggestions
>>>> people might give you.
>>>>
>>>> Can you please provide some concrete examples of the types of data
>>>> you have, the types of queries you want them to match, and the types
>>>> of queries you *don't* want to match?
>>>>
>>>> -Hoss
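
[Editor's sketch of the two suggestions made in this thread. The field
and type names below are illustrative only, not from the thread: a
WordDelimiterFilterFactory chain with preserveOriginal="1" keeps the
unmodified token (".net", "£12.50") alongside the split parts, and a
multiValued field with a large positionIncrementGap keeps phrases from
matching across the per-sentence values.]

```xml
<!-- Sketch only: names are made up for illustration. -->

<!-- Keeps ".net", "sales-manager." etc. intact as well as the split
     parts: preserveOriginal="1" emits the original token too. -->
<fieldType name="text_wdf_orig" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"
            types="wdftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- One value per sentence/line; the 100-position gap between values
     stops ordinary phrase queries matching across newlines. -->
<field name="sentences" type="text_wdf_orig"
       indexed="true" stored="true" multiValued="true"/>
```

With a schema along these lines, a plain phrase query like
sentences:"fluent german" would not match across two values, while the
sloppy form "sales manager"~100 from Erick's reply matches within one.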