> I guess that the obvious question is "Which characters are
> considered 'punctuation characters'?".
 
Punctuation = ("_"|"-"|"/"|"."|",")

> In particular, does the analyzer consider "=" (equal) and
> ":" (colon) to be punctuation characters?

":" is a special character in QueryParser (if you are using it). If you want to 
search for it, you need to escape it first. At index time this character is 
ignored, just like the punctuation characters: the string ahmet:arslan produces 
two tokens, ahmet and arslan. The analyzer also breaks words at the "=" 
character, at both query time and index time.
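
To make the escaping step concrete, here is a minimal plain-Java sketch. The 
class ColonEscapeSketch and the method escapeColon are hypothetical names made 
up for illustration; they are not part of Lucene. It only shows the idea of 
putting a backslash in front of ":" before handing the string to the query 
parser. In real code you would more likely call the static 
QueryParser.escape(String) method, which covers the other special characters 
of the query syntax as well.

```java
public class ColonEscapeSketch {

    // Hypothetical helper (not a Lucene API): puts a backslash in front of
    // ':' and of '\' itself, so the query syntax treats them as literals.
    public static String escapeColon(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == ':' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the escaped form of the example string from this thread.
        System.out.println(escapeColon("ahmet:arslan"));
    }
}
```

Escaping only stops the query parser from reading ahmet as a field name. The 
escaped text still goes through the analyzer, so it is split at ":" in the 
same way as at index time.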

If you want to understand the behavior of StandardTokenizer, you need to look 
at the file StandardTokenizerImpl.jflex. It recognizes each of the following as 
a single token: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL}, {HOST}, 
{NUM}, {CJ}, and {ACRONYM_DEP}, and ignores everything else. The file defines 
these token types with patterns similar to regular expressions. You can change 
the behavior of StandardTokenizer by editing this file and regenerating 
StandardTokenizerImpl.java from it. There is also another jflex file named 
WikipediaTokenizerImpl.jflex; by looking at it you can see how new token 
types can be added.

Ahmet


      
