Re: Splitting of words

Erik Hatcher Tue, 13 Sep 2005 06:01:23 -0700


On Sep 13, 2005, at 7:24 AM, Madhu Satyanarayana Panitini wrote:

Hi Paul,

I agree with you: "Analyzer is the magic word".
Let's look at it in depth. I would consider three parts in the
analyzer:

1. Tokenization (splitting of words)
2. Stopword removal (depends on the language)
3. Stemming of the words (depends on the language)

To start the analysis we have to split the text. For example, I like to split
the text wherever I find one of the following non-alphabetic characters:
"\s+|;|:|<|>|\^|~|=|--+|\+|\?|!|&|\$|@|\#|\'|`|"|_|\%|\*|,|\."
That means I would like to split the text wherever I find
a space, :, ;, ", ', <, >, ?, etc.

Then we remove the stopwords, and after that the stemming is applied.

To make my question clear: how does Lucene split the text? Only when
it encounters a space between words, or does it treat non-alphabetic
characters as delimiters as well?

What is the full grammar that StandardAnalyzer uses to split words?
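
For illustration, the first two steps described above (split on the delimiter list, then drop stopwords) can be sketched in plain Java with java.util.regex. This is not Lucene code and not how any Lucene analyzer is implemented. The class name SplitSketch, its method names, and the tiny stop list are assumptions made up for this example, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Plain-Java sketch of the split-then-filter steps; not Lucene code.
class SplitSketch {
    // The delimiter list from the message, escaped for java.util.regex.
    static final Pattern DELIMITERS = Pattern.compile(
        "\\s+|;|:|<|>|\\^|~|=|--+|\\+|\\?|!|&|\\$|@|#|'|`|\"|_|%|\\*|,|\\.");

    // Hypothetical stop list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "and", "be", "the", "this", "will"));

    // Step 1: split on the delimiters and drop the empty strings that
    // appear when two delimiters are adjacent (for example ", ").
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : DELIMITERS.split(text)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Step 2: lowercase each token and keep it only if it is not a stop word.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("This string, will be analyzed.");
        System.out.println(tokens);                  // [This, string, will, be, analyzed]
        System.out.println(removeStopWords(tokens)); // [string, analyzed]
    }
}
```

Note that because the apostrophe is in the delimiter list, a word like o'clock comes out as two tokens, o and clock.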

Madhu - you'd do well to try out the AnalyzerDemo that comes with the "Lucene in Action" code. You can download it from http://www.lucenebook.com - here's an example run:


$ ant AnalyzerDemo

    ...

AnalyzerDemo:
     [echo]
     [echo]       Demonstrates analysis of sample text.
     [echo]
     [echo]       Refer to the "Analysis" chapter for much more on this
     [echo]       extremely crucial topic.
     [echo]
    [input] Press return to continue...

    [input] String to analyze: [This string will be analyzed.]

     [echo] Running lia.analysis.AnalyzerDemo...
     [java] Analyzing "This string will be analyzed."
     [java]   WhitespaceAnalyzer:
     [java]     [This] [string] [will] [be] [analyzed.]

     [java]   SimpleAnalyzer:
     [java]     [this] [string] [will] [be] [analyzed]

     [java]   StopAnalyzer:
     [java]     [string] [analyzed]

     [java]   StandardAnalyzer:
     [java]     [this] [string] [will] [be] [analyzed]

     [java]   SnowballAnalyzer:
     [java]     [this] [string] [will] [be] [analyz]

     [java]   SnowballAnalyzer:
     [java]     [this] [string] [wil] [be] [analyzed]

     [java]   SnowballAnalyzer:
     [java]     [thi] [string] [will] [be] [analyz]


BUILD SUCCESSFUL
Total time: 13 seconds
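
The difference between the first two lines of that output comes down to where the text is cut: WhitespaceAnalyzer splits only on whitespace, while SimpleAnalyzer splits at every non-letter and lowercases. A rough plain-Java imitation of those two behaviours, assuming nothing beyond the JDK, might look like the following. The class and method names are hypothetical and this is not the Lucene source.

```java
import java.util.ArrayList;
import java.util.List;

// Rough imitation of two splitting strategies; not the Lucene implementation.
class AnalyzerSketch {
    // Split on runs of whitespace only; punctuation stays attached to words.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Keep runs of letters, lowercased; any non-letter ends the current token.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "This string will be analyzed.";
        System.out.println(whitespaceTokens(text)); // [This, string, will, be, analyzed.]
        System.out.println(letterTokens(text));     // [this, string, will, be, analyzed]
    }
}
```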

The StandardTokenizer is the most sophisticated one built into Lucene. You can see the types of tokens it emits by looking at the javadoc here: <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>

It recognizes e-mail addresses, interior apostrophe words (like o'clock), hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.


    Erik

