Please reframe how you give the various fields and tests - it's hard to follow in this email.
On Wed, Feb 3, 2010 at 12:50 PM, Christopher Ball <christopher.b...@metaheuristica.com> wrote:
> I am perplexed by the behavior I am seeing of the Solr Analyzer and Filters
> with regard to Underscores.
>
> 1) I am trying to get rid of them when shingling, but seem unable to do so
> with a Stopwords Filter.
>
> And yet they are being removed when I am not even trying to by the
> WordDelimiter Filter.
>
> 2) Conversely, I would like to retain '$' symbols when they adjacent to
> numbers, but seem unable to without having to accept all forms of other
> syntax.
>
> My simple example configuration and test data and results are below.
>
> Most grateful for any guidance,
>
> Christopher
>
>
> Test Data:
>
> <doc>
> <field name="id">StopWordTestData</field>
> <field name="conSubSec-text_dc">PreShingled ThisIsNotAStopWord
> ThisIsAStopWord ThisIsAlsoAStopWord beforeaperiod. beforeacomma,
> beforeacollan: under_Score don't Peter's s $1.00 $1 $1,000 $200 $3,000,000
> $3m - # -#- --#-- Yes X No _ __ ___ a and also about</field>
> </doc>
>
>
> Field 1 - Delimited_text:
>
> Index Analyzer: org.apache.solr.analysis.TokenizerChain
> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
> Filters:
> 1. org.apache.solr.analysis.WordDelimiterFilterFactory
>    args:{splitOnCaseChange: 1 generateNumberParts: 0 catenateWords: 1
>    generateWordParts: 0 catenateAll: 1 catenateNumbers: 1 }
> 2. org.apache.solr.analysis.LowerCaseFilterFactory args:{}
>
>
> Field 1 - Resulting Index Terms:
>
> Term                   #
> 100                    2
> 1000                   2
> 200                    2
> 3                      2
> 3000000                2
> 3m                     2
> a                      2
> about                  2
> also                   2
> and                    2
> beforeacollan          2
> beforeacomma           2
> beforeaperiod          2
> dont                   2
> m                      2
> no                     2
> peter                  2
> preshingled            2
> s                      2
> thisisalsoastopword    2
> thisisastopword        2
> thisisnotastopword     2
> underscore             2
> x                      2
> yes                    2
> 1                      2
>
>
> Field2 - Shingled_Text:
>
> Index Analyzer: org.apache.solr.analysis.TokenizerChain
> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
> Filters:
> 1. org.apache.solr.analysis.WordDelimiterFilterFactory
>    args:{splitOnCaseChange: 1 generateNumberParts: 0 catenateWords: 1
>    stemEnglishPossessive: 0 generateWordParts: 0 catenateAll: 0
>    catenateNumbers: 1 }
> 2. org.apache.solr.analysis.StopFilterFactory args:{words:
>    StopWords-PreShingled.txt ignoreCase: true enablePositionIncrements: true }
> 3. org.apache.solr.analysis.LowerCaseFilterFactory args:{}
> 4. org.apache.solr.analysis.ShingleFilterFactory
>    args:{outputUnigrams: false maxShingleSize: 5 }
>
>
> File: StopWords-PreShingled.txt
>
> s
> _
> PreShingled
> __
> ThisIsAStopWord
> ThisIsAlsoAStopWord
>
>
> Field2 - Resulting Index Terms (Sample):
>
> Term                                          #
> _ 100                                         1
> _ 100 1 1000                                  1
> _ _                                           1
> _ _ beforeaperiod beforeacomma                1
> _ beforeaperiod                               1
> _ beforeaperiod beforeacomma beforeacollan    1
> _ thisisnotastopword                          1
> _ thisisnotastopword _ _                      1

-- 
Lance Norskog
goks...@gmail.com