---- Phil Whelan <phil...@gmail.com> wrote:
> On Thu, Jul 30, 2009 at 7:12 PM, <oh...@cox.net> wrote:
> > I was wondering if there is a list of special characters for the
> > standard analyzer?
> >
> > What I mean by "special" is characters that the analyzer considers
> > break characters. For example, if I have something like
> > "foo=something", apparently the analyzer considers this as two
> > terms, "foo" and "something".
>
> Hi Jim,
>
> This is what I could find in the docs...
>
> StandardAnalyzer uses StandardTokenizer
>
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
> * Splits words at punctuation characters, removing punctuation.
>   However, a dot that's not followed by whitespace is considered
>   part of a token.
> * Splits words at hyphens, unless there's a number in the token, in
>   which case the whole token is interpreted as a product number and
>   is not split.
> * Recognizes email addresses and internet hostnames as one token.
>
> Also, these are the tokens that will be removed..
>
> public static final String[] ENGLISH_STOP_WORDS = {
>     "a", "an", "and", "are", "as", "at", "be", "but", "by",
>     "for", "if", "in", "into", "is", "it",
>     "no", "not", "of", "on", "or", "such",
>     "that", "the", "their", "then", "there", "these",
>     "they", "this", "to", "was", "will", "with"
> };
>
> Thanks,
> Phil
>
Hi Phil,

I guess that the obvious question is "Which characters are considered
'punctuation characters'?".

In particular, does the analyzer consider "=" (equal) and ":" (colon) to
be punctuation characters?

Thanks,
Jim
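
P.S. To make the question concrete, here is a rough sketch of how I read
the first rule quoted above. It is NOT Lucene code and does not call
Lucene at all: BreakCharSketch and tokenize() are names I made up, and
the assumption that every character other than a letter or digit counts
as "punctuation" is only my guess. It also ignores the hyphen, email and
hostname rules and the stop word list. Whether the real
StandardTokenizer agrees with this guess for "=" and ":" is exactly what
I am asking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, library-free approximation of the quoted rule
// "splits words at punctuation characters". Not Lucene's implementation.
final class BreakCharSketch {

    // Assumption: any character that is not a letter or digit is a break
    // character, except a dot that sits inside a token and is directly
    // followed by a letter or digit (i.e. not by whitespace or end of input).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean nextIsWordChar = i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (c == '.' && current.length() > 0 && nextIsWordChar) {
                // Dot inside a token: keep it as part of the token.
                current.append(c);
            } else if (current.length() > 0) {
                // Break character: close the token in progress and drop c.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo=something"));      // prints [foo, something]
        System.out.println(tokenize("key:value a.b end.")); // prints [key, value, a.b, end]
    }
}
```

Under that guess, both "=" and ":" split a term in two, which matches
what I see for "foo=something". The sketch cannot confirm anything about
the real tokenizer, though; that would need the same strings run through
StandardAnalyzer itself.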