Re: Splitting of words

Endre Stølsvik Tue, 27 Sep 2005 03:30:18 -0700

On Thu, 22 Sep 2005, Erik Hatcher wrote:

| 
| On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:
| 
| > 
| > | The StandardTokenizer is the most sophisticated one built into Lucene.
| > | You can see the types of tokens it emits by looking at the javadoc here:
| > |
| > | <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
| > |
| > | It recognizes e-mail addresses, interior apostrophe words (like o'clock),
| > | hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK
| > | characters.
| > 
| > It would be great if it also separated "UpperCamelCase" and
| > "lowerCamelCase" words into both the different words, and one long word.
| > Several uppercase, followed by lowercase, would most probably be best done
| > like HTTPUnit -> http unit.
| >  This is of course due to, for my part, java language influence. But I
| > believe it is custom in many programming languages to use lowerCamelCase
| > for e.g. variables. Filenames too.
| 
| I strongly disagree.  It would not be good at all for StandardTokenizer to do
| this.


...

|
| It is important to design filters and tokenizers in the most single-purpose
| way to allow them to be combined for various scenarios.

Okay, but why? I'm just wondering what the reasoning behind this is. What is
the logic behind the StandardTokenizer as it stands? (Note: There are strong
reasons to believe that I'm just not quite up to speed on how this all
fits together!)

| It would be easy to write a CamelCaseSplitFilter that could be used in 
| conjunction with any tokenizer.

Thanks for the tip!
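
For the archives, here is a minimal sketch of the splitting rule I described
above: break at lower-to-upper boundaries, treat a run of capitals followed by
a capitalized word as an acronym (HTTPUnit -> http unit), and also keep the
one long word. It is plain Java and not wired into Lucene. The class name
CamelCaseSplitter and its split() method are made up for illustration; a real
CamelCaseSplitFilter would presumably wrap this logic in a TokenFilter
subclass, which I have not attempted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Lucene: shows only the splitting rule.
final class CamelCaseSplitter {

    // Zero-width boundaries:
    //  1. after a lowercase letter or digit, before an uppercase letter
    //     (lower|Camel, Camel|Case)
    //  2. after an uppercase letter, before an uppercase letter that is
    //     followed by a lowercase one (HTTP|Unit)
    private static final Pattern BOUNDARY =
        Pattern.compile("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Returns the lowercased parts, followed by the lowercased whole word
    // when the token was actually split.
    static List<String> split(String token) {
        List<String> out = new ArrayList<>();
        if (token.isEmpty()) {
            return out;
        }
        String[] parts = BOUNDARY.split(token);
        for (String part : parts) {
            out.add(part.toLowerCase(Locale.ROOT));
        }
        if (parts.length > 1) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("HTTPUnit"));       // [http, unit, httpunit]
        System.out.println(split("lowerCamelCase")); // [lower, camel, case, lowercamelcase]
        System.out.println(split("plain"));          // [plain]
    }
}
```

The pattern only knows ASCII letters and digits, so it would need
Character-based checks to handle other alphabets.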

Regards,
Endre

