Re: Splitting of words

Endre Stølsvik Thu, 22 Sep 2005 01:37:01 -0700

| The StandardTokenizer is the most sophisticated one built into Lucene.  You
| can see the types of tokens it emits by looking at the javadoc here:
|    <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
| 
| It recognizes e-mail addresses, interior apostrophe words (like o'clock),
| hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.


It would be great if it also split "UpperCamelCase" and "lowerCamelCase"
words, emitting both the individual words and the original as one long
word. A run of several uppercase letters followed by lowercase would most
probably be best handled like HTTPUnit -> http unit.
  This is of course, for my part, due to Java influence. But I believe it
is customary in many programming languages to use lowerCamelCase for e.g.
variables. Filenames too.
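
A minimal sketch of the splitting described above, in plain Java rather
than as a Lucene TokenFilter. The class name, method name, and regex are
assumptions made for illustration and are not part of Lucene; the output
order (parts first, then the whole word) is also just one possible choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class.
final class CamelCaseSplitter {

    // Two zero-width split points:
    //  1. lowercase letter or digit followed by uppercase:  camel|Case
    //  2. inside an uppercase run, before the last capital
    //     when it starts a lowercase word:                  HTTP|Unit
    private static final String BOUNDARY =
        "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Returns the lowercased parts, followed by the lowercased whole
    // word when the input was actually split. ASCII letters only.
    static List<String> tokens(String word) {
        List<String> out = new ArrayList<>();
        String[] parts = word.split(BOUNDARY);
        for (String part : parts) {
            out.add(part.toLowerCase(Locale.ROOT));
        }
        if (parts.length > 1) {
            out.add(word.toLowerCase(Locale.ROOT));
        }
        return out;
    }
}
```

With this sketch, tokens("HTTPUnit") yields http, unit, httpunit, and
tokens("lowerCamelCase") yields lower, camel, case, lowercamelcase. A word
without case changes comes back as a single lowercased token.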

Regards,
Endre.

