Hello,

> (1) I rewrote StandardAnalyzer as StrictAnalyzer for the project I am
> working on. StandardAnalyzer does not filter enough words for my liking.
> Basically all I did was add to the STOP_WORDS array. The stop words I
> added are based on the default values in SQL Server 2000's text indexing.
> (Source code below)

The change seems simple and looks fine to me. If nobody objects by
tonight, I'll commit it. I'd recommend using explicit imports (not
import ....*;) in the future.

> (2) I would also like to propose a change to StandardTokenizer which
> supports strings with a trailing and/or leading comma(s) such as
> "therefore," and ",ice,". Currently StandardTokenizer is not returning
> any results for some of my most basic searches because of commas
> adjacent to words.
>
> Comments, suggestions, questions?

Hm, shouldn't that be filtered by one of the analyzers at both indexing
and searching time? Are you using StopAnalyzer?
Please also see http://www.jguru.com/faq/view.jsp?EID=538308

Otis

> import org.apache.lucene.analysis.*;
> import java.io.Reader;
> import java.util.Hashtable;
>
> /** Filters {@link StandardTokenizer} with {@link StandardFilter}, {@link
>  * LowerCaseFilter} and {@link StopFilter}. */
> public final class StrictAnalyzer extends Analyzer {
>   private Hashtable stopTable;
>
>   /** An array containing some common English words that are not usually
>    * useful for searching. */
>   public static final String[] STOP_WORDS = {
>     "0","1","2","3","4","5","6","7","8","9",
>     "$",
>     "about", "after", "all", "also", "an", "and",
>     "another", "any", "are", "as", "at", "be", "because",
>     "been", "before", "being", "between", "both", "but",
>     "by","came","can","come","could","did","do","does",
>     "each","else","for","from","get","got","has","had",
>     "he","have","her","here","him","himself","his","how",
>     "if","in","into","is","it","its","just","like","make",
>     "many","me","might","more","most","much","must","my",
>     "never","now","of","on","only","or","other","our","out",
>     "over","re","said","same","see","should","since","so",
>     "some","still","such","take","than","that","the","their",
>     "them","then","there","these","they","this","those","through",
>     "to","too","under","up","use","very","want","was","way","we",
>     "well","were","what","when","where","which","while","who","will",
>     "with","would","you","your",
>
>     "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s",
>     "t","u","v","w","x","y","z"
>   };
>
>   /** Builds an analyzer. */
>   public StrictAnalyzer() {
>     this(STOP_WORDS);
>   }
>
>   /** Builds an analyzer with the given stop words. */
>   public StrictAnalyzer(String[] stopWords) {
>     stopTable = StopFilter.makeStopTable(stopWords);
>   }
>
>   /** Constructs a {@link StandardTokenizer} filtered by a {@link
>    * StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
>   public final TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new StandardTokenizer(reader);
>     result = new StandardFilter(result);
>     result = new LowerCaseFilter(result);
>     result = new StopFilter(result, stopTable);
>     return result;
>   }
> }
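
To make the comma question concrete, here is a minimal, self-contained sketch of the behaviour being asked for: strip leading/trailing commas, lowercase, then drop stop words. It deliberately does not use Lucene. The class name StopWordSketch, its methods, and the short stop list are all hypothetical and chosen for illustration only; this is not how StandardTokenizer or StopFilter work internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Small illustrative subset of a stop list (assumption, not the full array above).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "the", "of", "to", "is"));

    /** Removes any run of commas at the start and end of a token. */
    public static String stripCommas(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && token.charAt(start) == ',') {
            start++;
        }
        while (end > start && token.charAt(end - 1) == ',') {
            end--;
        }
        return token.substring(start, end);
    }

    /** Splits on whitespace, strips commas, lowercases, and drops stop words. */
    public static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = stripCommas(raw).toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Therefore, the ,ice, is melting"));
    }
}
```

Whatever normalization is chosen, the same steps would need to run at both indexing and search time, otherwise a query term like "ice" will not match a token stored as ",ice,".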
