On Fri, Jul 31, 2009 at 5:00 PM, <oh...@cox.net> wrote:
> Hi Ahmet,
>
> Thanks for the clarification and information! That was exactly what I was
> looking for.
>
> Jim
>
>
> ---- AHMET ARSLAN <iori...@yahoo.com> wrote:
>>
>> > I guess that the obvious question is "Which characters are
>> > considered 'punctuation characters'?".
>>
>> Punctuation = ("_"|"-"|"/"|"."|",")

Those punctuation characters are used only inside tokens such as
floating-point numbers, IP addresses, etc. StandardTokenizer does not have
a punctuation set defined explicitly. You can assume that it will drop, and
split on, almost all punctuation that appears in the input string.
Have a look at StandardTokenizerImpl.jflex; the grammar is quite easy to
understand and gives you a better idea of what this tokenizer does.

simon

>>
>> > In particular, does the analyzer consider "=" (equal) and
>> > ":" (colon) to be punctuation characters?
>>
>> ":" is a special character in QueryParser (if you are using it). If you
>> want to search for it, you need to escape it first. At index time this
>> character is ignored, like the punctuation characters. The string
>> ahmet:arslan will produce two tokens, ahmet and arslan. The tokenizer also
>> breaks words at the "=" character at both query and index time.
>>
>> If you want to understand the behavior of StandardTokenizer, you need to
>> look at the file StandardTokenizerImpl.jflex. It recognizes the following
>> as single tokens: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL},
>> {HOST}, {NUM}, {CJ}, {ACRONYM_DEP}, and ignores the rest. The file contains
>> definitions of these token types, written in a form similar to regular
>> expressions. You can change the behavior of StandardTokenizer by editing
>> this file and generating StandardTokenizerImpl.java from it. There is also
>> another jflex file named WikipediaTokenizerImpl.jflex. By looking at it you
>> can see how new token types can be added.
>>
>> Ahmet
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
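To make the two cases discussed in this thread concrete (ahmet:arslan and a
word containing "="), here is a rough sketch in plain Java. It is not Lucene
code and it is not the JFlex grammar: the class name TokenSplitSketch and the
method roughTokens are made up for illustration, and the rule "a token is a
run of letters or digits" is an assumption that only approximates the
{ALPHANUM} case. The real tokenizer also keeps things like {EMAIL}, {HOST}
and {NUM} together as single tokens, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSplitSketch {
    // Assumption for illustration only: a token is a maximal run of Unicode
    // letters or digits, and every other character acts as a separator.
    private static final Pattern RUN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Hypothetical helper, not part of Lucene.
    public static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(roughTokens("ahmet:arslan")); // [ahmet, arslan]
        System.out.println(roughTokens("color=red"));    // [color, red]
    }
}
```

To see what StandardTokenizer really emits for a given input, run the text
through the analyzer itself; the sketch above only mirrors the splitting on
":" and "=" described in the replies.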