On Fri, Jul 31, 2009 at 5:00 PM, <oh...@cox.net> wrote:
> Hi Ahmet,
>
> Thanks for the clarification and information!  That was exactly what I was 
> looking for.
>
> Jim
>
>
> ---- AHMET ARSLAN <iori...@yahoo.com> wrote:
>>
>> > I guess that the obvious question is "Which characters are
>> > considered 'punctuation characters'?".
>>
>> Punctuation = ("_"|"-"|"/"|"."|",")
That punctuation set is only used for floating-point numbers, IP addresses, etc.
StandardTokenizer does not have a punctuation set defined explicitly. You can
assume that it will drop, and split on, almost all punctuation that appears
in the input string.
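
As a rough illustration of that rule of thumb, here is a minimal sketch in
plain Java. It does not use Lucene at all: the class PunctuationSplitSketch
and its split() helper are hypothetical names made up for this example, and
the regular expression is only an approximation. It ignores the special token
types (e-mail addresses, host names, numbers, acronyms and so on) that the
real grammar keeps together, so treat it as a mental model, not as the
tokenizer's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationSplitSketch {

    // Hypothetical helper, not part of Lucene: splits on every run of
    // characters that are neither letters nor digits. This only mimics the
    // "drop and split on punctuation" rule of thumb and does not reproduce
    // the token types defined in StandardTokenizerImpl.jflex.
    public static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.split("[^\\p{L}\\p{N}]+")) {
            // A leading delimiter yields an empty first element; skip it.
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("ahmet:arslan")); // prints [ahmet, arslan]
        System.out.println(split("key=value"));    // prints [key, value]
    }
}
```

To see what the real tokenizer emits for a given input, run the text through
the analyzer itself; this sketch will disagree with it on inputs such as
e-mail addresses or decimal numbers.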

Have a look at StandardTokenizerImpl.jflex; the grammar is quite easy to
understand and gives you a better idea of what this tokenizer does.

simon
>>
>> > In particular, does the analyzer consider "=" (equal) and
>> > ":" (colon) to be punctuation characters?
>>
>> ":" is a special character in QueryParser (if you are using it). If you want
>> to search for it, you need to escape it first. At index time this character
>> is ignored, like other punctuation. The string ahmet:arslan will produce two
>> tokens, ahmet and arslan. The tokenizer also breaks words at the "="
>> character at both query and index time.
>>
>> If you want to understand the behavior of StandardTokenizer, you need to
>> look at the file StandardTokenizerImpl.jflex. It recognizes each of the
>> following as a single token: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY},
>> {EMAIL}, {HOST}, {NUM}, {CJ}, {ACRONYM_DEP}, and ignores the rest. The file
>> contains definitions of these token types, written in a notation similar to
>> regular expressions. You can change the behavior of StandardTokenizer by
>> editing this file and generating StandardTokenizerImpl.java from it. There
>> is also another jflex file named WikipediaTokenizerImpl.jflex. By looking at
>> it you can understand how new token types can be added.
>>
>> Ahmet
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
