Re: The Riddle of the Underscore and the Dollar Sign

Lance Norskog Wed, 03 Feb 2010 21:46:32 -0800

Please reframe how you give the various fields and tests - it's hard
to follow in this email.


On Wed, Feb 3, 2010 at 12:50 PM, Christopher Ball
<christopher.b...@metaheuristica.com> wrote:
> I am perplexed by the behavior of the Solr Analyzer and Filters with regard
> to underscores.
>
> 1) I am trying to get rid of them when shingling, but seem unable to do so
> with a Stopwords Filter. And yet the WordDelimiter Filter removes them even
> when I am not trying to.
>
> 2) Conversely, I would like to retain '$' symbols when they are adjacent to
> numbers, but seem unable to without having to accept all forms of other
> syntax.
>
> My simple example configuration, test data, and results are below.
>
> Most grateful for any guidance,
>
> Christopher
>
> Test Data:
>
> <doc>
> <field name="id">StopWordTestData</field>
> <field name="conSubSec-text_dc">PreShingled ThisIsNotAStopWord
> ThisIsAStopWord ThisIsAlsoAStopWord beforeaperiod. beforeacomma,
> beforeacollan: under_Score don't Peter's s $1.00 $1 $1,000 $200 $3,000,000
> $3m - # -#- --#-- Yes X No _ __ ___ a and also about</field>
> </doc>
>
> Field 1 - Delimited_text:
>
> Index Analyzer: org.apache.solr.analysis.TokenizerChain
> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
> Filters:
>
> 1. org.apache.solr.analysis.WordDelimiterFilterFactory
>    args:{splitOnCaseChange: 1 generateNumberParts: 0 catenateWords: 1
>    generateWordParts: 0 catenateAll: 1 catenateNumbers: 1 }
> 2. org.apache.solr.analysis.LowerCaseFilterFactory args:{}
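One way to make the Field 1 chain easier to follow is to show it as it might appear in schema.xml. The fragment below is only a sketch reconstructed from the class names and args listed above. The fieldType name `delimited_text` is borrowed from the heading, while the `solr.TextField` class and the `solr.` shorthand for `org.apache.solr.analysis.` are assumptions, since the original schema was not posted.

```xml
<!-- Sketch only: the name and class attributes are assumed, not taken from the original schema -->
<fieldType name="delimited_text" class="solr.TextField">
  <!-- Only the index analyzer was described; the query analyzer is unknown -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```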
>
>
>
>
>
> Field 1 - Resulting Index Terms:
>
> Term                   #
> 100                    2
> 1000                   2
> 200                    2
> 3                      2
> 3000000                2
> 3m                     2
> a                      2
> about                  2
> also                   2
> and                    2
> beforeacollan          2
> beforeacomma           2
> beforeaperiod          2
> dont                   2
> m                      2
> no                     2
> peter                  2
> preshingled            2
> s                      2
> thisisalsoastopword    2
> thisisastopword        2
> thisisnotastopword     2
> underscore             2
> x                      2
> yes                    2
> 1                      2
>
> Field 2 - Shingled_Text:
>
> Index Analyzer: org.apache.solr.analysis.TokenizerChain
> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
> Filters:
>
> 1. org.apache.solr.analysis.WordDelimiterFilterFactory
>    args:{splitOnCaseChange: 1 generateNumberParts: 0 catenateWords: 1
>    stemEnglishPossessive: 0 generateWordParts: 0 catenateAll: 0
>    catenateNumbers: 1 }
> 2. org.apache.solr.analysis.StopFilterFactory args:{words:
>    StopWords-PreShingled.txt ignoreCase: true enablePositionIncrements: true }
> 3. org.apache.solr.analysis.LowerCaseFilterFactory args:{}
> 4. org.apache.solr.analysis.ShingleFilterFactory
>    args:{outputUnigrams: false maxShingleSize: 5 }
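For readability, the Field 2 chain could be written out as a schema.xml fragment along the following lines. This is a sketch reconstructed from the filter list above, not the poster's actual schema: the fieldType name `shingled_text` comes from the heading, and the `solr.TextField` class and the `solr.` shorthand for `org.apache.solr.analysis.` are assumptions.

```xml
<!-- Sketch only: the name and class attributes are assumed, not taken from the original schema -->
<fieldType name="shingled_text" class="solr.TextField">
  <!-- Only the index analyzer was described; the query analyzer is unknown -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            stemEnglishPossessive="0"/>
    <!-- The stop filter runs before lowercasing, as in the listed order -->
    <filter class="solr.StopFilterFactory" words="StopWords-PreShingled.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            outputUnigrams="false" maxShingleSize="5"/>
  </analyzer>
</fieldType>
```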
>
>
>
>
>
> File: StopWords-PreShingled.txt
>
> s
> _
> PreShingled
> __
> ThisIsAStopWord
> ThisIsAlsoAStopWord
>
>
>
>
> Field 2 - Resulting Index Terms (Sample):
>
> Term                                          #
> _ 100                                         1
> _ 100 1 1000                                  1
> _ _                                           1
> _ _ beforeaperiod beforeacomma                1
> _ beforeaperiod                               1
> _ beforeaperiod beforeacomma beforeacollan    1
> _ thisisnotastopword                          1
> _ thisisnotastopword _ _                      1
>



-- 
Lance Norskog
goks...@gmail.com
