Re: Is there a list of "special" characters for standard analyzer?

ohaya Thu, 30 Jul 2009 23:03:24 -0700

---- Phil Whelan <phil...@gmail.com> wrote: 
> On Thu, Jul 30, 2009 at 7:12 PM, <oh...@cox.net> wrote:
> > I was wondering if there is a list of special characters for the standard 
> > analyzer?
> >
> > What I mean by "special" is characters that the analyzer treats as break
> > characters. For example, if I have something like "foo=something", the
> > analyzer apparently treats this as two terms, "foo" and "something".
> 
> Hi Jim,
> 
> This is what I could find in the docs...
> 
> StandardAnalyzer uses StandardTokenizer
> 
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
> * Splits words at punctuation characters, removing punctuation.
> However, a dot that's not followed by whitespace is considered part of
> a token.
> * Splits words at hyphens, unless there's a number in the token, in
> which case the whole token is interpreted as a product number and is
> not split.
> * Recognizes email addresses and internet hostnames as one token.
> 
> Also, these are the stop words that will be removed:
> 
>   public static final String[] ENGLISH_STOP_WORDS = {
>     "a", "an", "and", "are", "as", "at", "be", "but", "by",
>     "for", "if", "in", "into", "is", "it",
>     "no", "not", "of", "on", "or", "such",
>     "that", "the", "their", "then", "there", "these",
>     "they", "this", "to", "was", "will", "with"
>   };
> 
> Thanks,
> Phil
>



Hi Phil,

I guess that the obvious question is "Which characters are considered 
'punctuation characters'?".

In particular, does the analyzer treat "=" (equals sign) and ":" (colon) as 
punctuation characters?
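
To make the question concrete, here is a small sketch of my working guess, 
written in plain Java. It does not call Lucene at all. The class name 
BreakCharSketch and its methods are made up for illustration, and the rule it 
encodes (any character that is not a letter or digit is a break, except a dot 
followed by non-whitespace) is only my reading of the first bullet in the 
javadoc quoted above. It ignores the hyphen, email and hostname rules, and it 
may well differ from what StandardTokenizer really does.

```java
import java.util.ArrayList;
import java.util.List;

// Toy approximation only: this is NOT Lucene's StandardTokenizer grammar.
final class BreakCharSketch {

    // Assumption: every character that is not a letter or digit is a break
    // character, except a dot that is followed by a non-whitespace character.
    static boolean isBreak(String text, int i) {
        char c = text.charAt(i);
        if (Character.isLetterOrDigit(c)) {
            return false;
        }
        if (c == '.' && i + 1 < text.length()
                && !Character.isWhitespace(text.charAt(i + 1))) {
            return false; // dot inside a token, e.g. in a hostname
        }
        return true;
    }

    // Splits the text into terms at every break character.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (isBreak(text, i)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(text.charAt(i));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo=something"));
        System.out.println(tokenize("foo:something"));
        System.out.println(tokenize("lucene.apache.org rocks."));
    }
}
```

Under that guess, both "foo=something" and "foo:something" come out as the 
two terms "foo" and "something", which matches what I am seeing for "=". 
What I would like to confirm is whether the real analyzer follows the same 
rule for ":" and other characters.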

Thanks,
Jim

