Sorry, I wasn't really concerned with email addresses - I was just
using that as an example. How would I tell the StandardAnalyzer that
I want a certain phrase to be tokenized as a single token? Surround it
with quotes or ..? Also, how would you recommend manipulating the Reader
object? You said something about manipulating it as a String. Do
you mean like:
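String text = "";
try {
    StringBuilder sb = new StringBuilder();
    String line;
    BufferedReader in = new BufferedReader(reader);
    // read the Reader line by line into one String
    while ((line = in.readLine()) != null) {
        sb.append(line).append(" ");
    }
    text = sb.toString();
} catch (IOException e) {
    System.out.println(e.getMessage());
}
// make text changes
// wrap back into a Reader, e.g. new StringReader(text)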
Thanks for the response.
-Ryan
On Oct 16, 2006, at 5:32 PM, Otis Gospodnetic wrote:
Hi Ryan,
StandardAnalyzer should already be smart about keeping email
addresses as a single token:
// email addresses
| <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
(("."|"-") <ALPHANUM>)+ >
(this is from StandardTokenizer.jj, the grammar behind StandardAnalyzer)
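To get a feel for what that rule accepts, here is a rough sketch of it as a plain Java regular expression. It is only an approximation: it assumes <ALPHANUM> means ASCII letters and digits, whereas the real grammar also covers other letter ranges, and the class and method names (EmailTokenSketch, findEmails) are made up for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
class EmailTokenSketch {
    // Assumption: ASCII-only stand-in for the grammar's <ALPHANUM>.
    private static final String ALPHANUM = "[A-Za-z0-9]+";

    // <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM> (("."|"-") <ALPHANUM>)+
    static final Pattern EMAIL = Pattern.compile(
        ALPHANUM + "(?:[._-]" + ALPHANUM + ")*"
        + "@" + ALPHANUM + "(?:[.-]" + ALPHANUM + ")+");

    // Returns every substring of text that the approximated rule matches.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findEmails("mail first.last@example.org today"));
    }
}
```

Note that the trailing "+" in the rule means the part after the "@" needs at least one "." or "-" separated piece, so something like user@localhost would not match this pattern.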
As for changing the text you feed to Lucene, that's all up to you.
Changing the String seems like the simplest approach. If you want
to wrap that in StringReader, you can, but you can also just work
with Strings.
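A minimal sketch of that String round trip, using only java.io: read the Reader into a String, change the String, and wrap the result in a StringReader. The names ReaderTextSketch, readAll, applyRules and rewrite are hypothetical, and the replacement inside applyRules is just a placeholder for whatever data-specific rule you need; it says nothing about how StandardAnalyzer will tokenize the result.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of Lucene.
class ReaderTextSketch {
    // Reads the whole Reader into one String, joining lines with a space.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            boolean first = true;
            while ((line = in.readLine()) != null) {
                if (!first) {
                    sb.append(' ');
                }
                sb.append(line);
                first = false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Placeholder rule (an assumption): glue one phrase together.
    static String applyRules(String text) {
        return text.replace("custom analyzer", "custom_analyzer");
    }

    // Reader in, modified Reader out.
    static Reader rewrite(Reader original) {
        return new StringReader(applyRules(readAll(original)));
    }

    public static void main(String[] args) {
        Reader r = rewrite(new StringReader("a custom analyzer\nfor Lucene"));
        System.out.println(readAll(r));
    }
}
```

The Reader returned by rewrite() could then be handed to whatever tokenizer you use in place of the original Reader. Keep in mind that this buffers the entire text in memory, which may matter for very large fields.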
Otis
----- Original Message ----
From: Ryan O'Hara <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, October 16, 2006 4:28:35 PM
Subject: Help with Custom Analyzer
I have a few questions regarding writing a custom analyzer.
My situation is that I would like to use the StandardAnalyzer but
with some data-specific rules. I was wondering if there was a way of
telling the StandardAnalyzer to treat a string of text, that would
normally be tokenized into more than one token, as only one token
(maybe by inserting quotes around the text). For example, say the
StandardAnalyzer normally splits the string of text
[EMAIL PROTECTED] into 4 tokens, but I want it to split the
string into only 1 token. Could I accomplish this by surrounding the
string with quotes or by using some other type of flag?
Another question I have is how do I modify the text being analyzed?
From how I interpreted what I have read (which could easily be off),
it looks like in order to accomplish what I have previously
described, I am going to have to add some code to my custom
analyzer's tokenStream method. I see that tokenStream() takes a field
name (a String) and a Reader as parameters. Would the way I go about
adding rules be
to edit the reader text? If so, would manipulation of the text be
easier if I were to convert the reader into a string?
Any help is greatly appreciated. Thanks.
-Ryan
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]