RE: Tokenizing text custom way

2003-11-26 Thread MOYSE Gilles (Cetelem)
Do you want to define expressions, i.e. a set of terms that must be
interpreted as a whole?
For instance, when the Analyzer catches "time" followed by "out", it returns
"time_out"?


-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 26 November 2003 12:12
To: Lucene Users List
Subject: Re: Tokenizing text custom way


> You will need to write a custom analyzer.  Don't worry, though it's
> quite straightforward.  You will also need to write a Tokenizer, but
> Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" as "time_out",
using StandardAnalyzer, and later if I search for "time out" (inside quotes)
I should get the proper result, but if I search for "time" I shouldn't get
a result. Is this right?






Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 06:12 AM, Dragan Jotanovic wrote:
> > You will need to write a custom analyzer.  Don't worry, though it's
> > quite straightforward.  You will also need to write a Tokenizer, but
> > Lucene helps you a lot here.
>
> Wouldn't I achieve the same result if I index "time out" as "time_out",
> using StandardAnalyzer, and later if I search for "time out" (inside quotes)
> I should get the proper result, but if I search for "time" I shouldn't get
> a result. Is this right?

I'm confused about what you are planning to do.  Are you going to replace
all spaces with an underscore before handing it to the analyzer?
StandardAnalyzer will still split at the underscores, though.

If you have special tokenization needs, why try to hack it somehow 
rather than address it cleanly in the way Lucene was designed to work?

	Erik



Re: Tokenizing text custom way

2003-11-26 Thread Dragan Jotanovic
> You will need to write a custom analyzer.  Don't worry, though it's
> quite straightforward.  You will also need to write a Tokenizer, but
> Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" as "time_out",
using StandardAnalyzer, and later if I search for "time out" (inside quotes)
I should get the proper result, but if I search for "time" I shouldn't get
a result. Is this right?







Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
On Tuesday, November 25, 2003, at 06:41 AM, Dragan Jotanovic wrote:
> Hi. I need to tokenize text while indexing, but I don't want space to
> be the delimiter. The delimiter should be my custom character (for example
> a comma). I understand that I would probably need to implement my own
> analyzer, but could someone help me with where to start? Is there any
> other way to do this without writing a custom analyzer?

You will need to write a custom analyzer.  Don't worry, though it's 
quite straightforward.  You will also need to write a Tokenizer, but 
Lucene helps you a lot here.  Lucene's LetterTokenizer is simply this:

public class LetterTokenizer extends CharTokenizer {
  /** Construct a new LetterTokenizer. */
  public LetterTokenizer(Reader in) {
    super(in);
  }

  /** Collects only characters which satisfy
   * {@link Character#isLetter(char)}. */
  protected boolean isTokenChar(char c) {
    return Character.isLetter(c);
  }
}
You could change the isTokenChar method in your custom CommaTokenizer 
to only return true if the character is not a ','.  And you might want 
to implement the normalize method to lowercase (look at 
LowerCaseTokenizer).
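
For illustration, a minimal sketch of such a CommaTokenizer (not from the
original mail, and untested) might look like this. Note that since only the
comma ends a token, input like "man, people, time out, sun" yields tokens
that still carry their surrounding spaces, which you would want to trim
somewhere (a point raised elsewhere in this thread):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

public class CommaTokenizer extends CharTokenizer {
  public CommaTokenizer(Reader in) {
    super(in);
  }

  /** Every character except the comma is part of a token,
   * so "time out" survives as a single token. */
  protected boolean isTokenChar(char c) {
    return c != ',';
  }

  /** Lowercase each character, as LowerCaseTokenizer does. */
  protected char normalize(char c) {
    return Character.toLowerCase(c);
  }
}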

My advice is for you to check out Lucene's source code in the 
TokenStream hierarchy (ctrl-H in IntelliJ is quite nice! :).  
CharTokenizer seems a good starting point for you.  Then have a look at 
SimpleAnalyzer:

public final class SimpleAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseTokenizer(reader);
  }
}
Just create your own CommaAnalyzer that uses your CommaTokenizer 
similar to this.  Have a look at my java.net article and try the sample 
code provided there to observe the analysis process in greater detail 
so you can check that you get what you expect.
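
A matching CommaAnalyzer, along the lines of SimpleAnalyzer (again only a
sketch, assuming the hypothetical CommaTokenizer above), would then be:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public final class CommaAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Wires the comma-based tokenizer into an Analyzer, mirroring SimpleAnalyzer.
    return new CommaTokenizer(reader);
  }
}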

> and if I enter 'time' as a search word, I don't want to get "time out"
> in the results. I need exact keyword matching. I would achieve this if I
> tokenize "time out" as one token while indexing.

It will be a little trickier on the query part if you're using
QueryParser - you will need to double-quote "time out" for it to work,
I believe - but don't worry about this until you get the analysis phase
worked out, and then we can revisit the QueryParser issue.
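
As a rough illustration only (not from Erik's mail, assuming the hypothetical
CommaAnalyzer sketched above and the static QueryParser.parse method of
Lucene 1.x): with the quotes, the whole phrase is handed to the analyzer,
which returns it as the single token that was indexed.

import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QuerySketch {
  public static void main(String[] args) throws Exception {
    // The quoted phrase is analyzed as a whole by CommaAnalyzer, so the
    // parser builds a query on the single term "time out".
    Query q = QueryParser.parse("\"time out\"", "contents", new CommaAnalyzer());
    System.out.println(q.toString("contents"));
  }
}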

	Erik



Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
Woah, that seems like an awfully complex answer to the question of how
to tokenize at a comma rather than a space!  %-)

On Tuesday, November 25, 2003, at 11:48 AM, MOYSE Gilles (Cetelem) wrote:

Hi.

You should define expressions.
To define expressions, you first have to define an expression file.
An expression file contains one expression per line.
For instance:
	time_out
	expert_system
	...
You can use any character to specify the "expression link". Here, I use
the underscore (_).

Then, you have to build an expression loader. You can store expressions
in recursive HashMaps.
Such a HashMap must be built so that HashMap.get("word1") = HashMap, and
(HashMap.get("word1")).get("word2") = null, if you want to encode the
expression "word1_word2".
In other words, 'HashMap.get("a_word")' returns a HashMap containing all
the successors of the word 'a_word'.

So, if your expression file looks like this:
time_out
expert_system
expert_in_information
you'll have to build a loader which returns a HashMap H so that:
H.keySet() = {"time", "expert"}
((HashMap)H.get("time")).keySet() = {"out"}
((HashMap)H.get("time")).get("out") = null // null indicates the end of the expression
((HashMap)H.get("expert")).keySet() = {"system", "in"}
((HashMap)H.get("expert")).get("system") = null
((HashMap)((HashMap)H.get("expert")).get("in")).keySet() = {"information"}
((HashMap)((HashMap)H.get("expert")).get("in")).get("information") = null
These recursive HashMaps encode the following tree:
time --- out --- null
expert --- system --- null
       |-- in --- information --- null
Such an expression loader may be designed this way:

public static HashMap getExpressionMap(File wordfile) {
    HashMap result = new HashMap();

    try {
        String line = null;
        LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
        HashMap hashToAdd = null;

        while ((line = in.readLine()) != null) {
            if (line.startsWith(FILE_COMMENT_CHARACTER))
                continue;
            if (line.trim().length() == 0)
                continue;

            StringTokenizer stok = new StringTokenizer(line, " \t_");
            String curTok = "";
            HashMap currentHash = result;

            // Check whether the expression contains at least 2 words
            if (stok.countTokens() < 2) {
                System.err.println("Warning: '" + line + "' in file '"
                    + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
                    + " is not an expression.\n\tA valid expression contains at least 2 words.");
                continue;
            }

            while (stok.hasMoreTokens()) {
                curTok = stok.nextToken();
                // A comment at the end of the line ends the expression
                if (curTok.startsWith(FILE_COMMENT_CHARACTER))
                    break;

                if (stok.hasMoreTokens())
                    hashToAdd = new HashMap(6);
                else
                    hashToAdd = (HashMap) null; // null marks the end of the expression

                if (!currentHash.containsKey(curTok))
                    currentHash.put(curTok, hashToAdd);

                currentHash = (HashMap) currentHash.get(curTok);
            }
        }
        return result;
    }
    // On error, use an empty table
    catch (Exception e) {
        System.err.println("While processing '" + wordfile.getAbsolutePath()
            + "' : " + e.getMessage());
        e.printStackTrace();
        return new HashMap();
    }
}

Then, you must build a filter with 2 FIFO stacks: one is the expression
stack, the other is the default stack.
Then, you define a 'curMap' variable, initially pointing to the HashMap
returned by the ExpressionFileLoader.

When you receive a token, you check whether it is null or not:
	If it is, you check whether the default stack is empty or not.
		If it is not, you pop a token from the default stack and return it.
		If it is, you return null.
	If it is not (the token is not null), you check whether it is
	contained in the HashMap or not (curMap.containsKey(token)).
		If it is not contained and you were building an expression,
		you pop all the terms from the expression stack and push them onto
		the default stack (so as not to lose information).
		If it is not contained and the default stack is empty, you
		return the token.
		If it is not contained and the default stack is not empty,
		you return the popped token from the default stack and you push
		the current token.
	If the token is contained in curMap, then the token MAY be the
	first element of an expression.
		You push the token onto the expression stack, and you dive
		into the next level of your expression tree (curMap = curMap.get(token)).
		If the next level (now curMap) is null, then you have
		completed your expression. You can pop all the tokens from the
		expression stack, concatenate them, separated by underscores, and push

RE: Tokenizing text custom way

2003-11-25 Thread MOYSE Gilles (Cetelem)
Hi.

You should define expressions.
To define expressions, you first have to define an expression file.
An expression file contains one expression per line.
For instance:
time_out
expert_system
...
You can use any character to specify the "expression link". Here, I use the
underscore (_).

Then, you have to build an expression loader. You can store expressions in
recursive HashMaps.
Such a HashMap must be built so that HashMap.get("word1") = HashMap, and
(HashMap.get("word1")).get("word2") = null, if you want to encode the
expression "word1_word2".
In other words, 'HashMap.get("a_word")' returns a HashMap containing all the
successors of the word 'a_word'.

So, if your expression file looks like this:
time_out
expert_system
expert_in_information

you'll have to build a loader which returns a HashMap H so that:
H.keySet() = {"time", "expert"}
((HashMap)H.get("time")).keySet() = {"out"}
((HashMap)H.get("time")).get("out") = null // null indicates the end of the expression
((HashMap)H.get("expert")).keySet() = {"system", "in"}
((HashMap)H.get("expert")).get("system") = null
((HashMap)((HashMap)H.get("expert")).get("in")).keySet() = {"information"}
((HashMap)((HashMap)H.get("expert")).get("in")).get("information") = null

These recursive HashMaps encode the following tree:
time --- out --- null
expert --- system --- null
       |-- in --- information --- null

Such an expression loader may be designed this way:

public static HashMap getExpressionMap(File wordfile) {
    HashMap result = new HashMap();

    try {
        String line = null;
        LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
        HashMap hashToAdd = null;

        while ((line = in.readLine()) != null) {
            if (line.startsWith(FILE_COMMENT_CHARACTER))
                continue;
            if (line.trim().length() == 0)
                continue;

            StringTokenizer stok = new StringTokenizer(line, " \t_");
            String curTok = "";
            HashMap currentHash = result;

            // Check whether the expression contains at least 2 words
            if (stok.countTokens() < 2) {
                System.err.println("Warning: '" + line + "' in file '"
                    + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
                    + " is not an expression.\n\tA valid expression contains at least 2 words.");
                continue;
            }

            while (stok.hasMoreTokens()) {
                curTok = stok.nextToken();
                // A comment at the end of the line ends the expression
                if (curTok.startsWith(FILE_COMMENT_CHARACTER))
                    break;

                if (stok.hasMoreTokens())
                    hashToAdd = new HashMap(6);
                else
                    hashToAdd = (HashMap) null; // null marks the end of the expression

                if (!currentHash.containsKey(curTok))
                    currentHash.put(curTok, hashToAdd);

                currentHash = (HashMap) currentHash.get(curTok);
            }
        }
        return result;
    }
    // On error, use an empty table
    catch (Exception e) {
        System.err.println("While processing '" + wordfile.getAbsolutePath()
            + "' : " + e.getMessage());
        e.printStackTrace();
        return new HashMap();
    }
}


Then, you must build a filter with 2 FIFO stacks: one is the expression
stack, the other is the default stack.
Then, you define a 'curMap' variable, initially pointing to the HashMap
returned by the ExpressionFileLoader.

When you receive a token, you check whether it is null or not:
If it is, you check whether the default stack is empty or not.
If it is not, you pop a token from the default stack and return it.
If it is, you return null.
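
The rest of this message is cut off in the archive. As a much-simplified,
untested sketch of the kind of expression-joining TokenFilter being described
(the class name ExpressionFilter and the greedy, single-buffer matching are
illustrative only, not from this mail; it assumes the Lucene 1.x TokenStream
API and the nested HashMap built by getExpressionMap above):

import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ExpressionFilter extends TokenFilter {
  private HashMap expressionMap;                  // root of the expression tree
  private LinkedList pending = new LinkedList();  // tokens waiting to be emitted

  public ExpressionFilter(TokenStream in, HashMap expressionMap) {
    input = in;                                   // 'input' is TokenFilter's protected upstream stream
    this.expressionMap = expressionMap;
  }

  public Token next() throws IOException {
    if (!pending.isEmpty())
      return (Token) pending.removeFirst();

    Token tok = input.next();
    if (tok == null)
      return null;

    // Not the start of any known expression: pass the token through.
    if (!expressionMap.containsKey(tok.termText()))
      return tok;

    // Possible expression start: buffer tokens while they follow a path
    // in the expression tree.
    LinkedList matched = new LinkedList();
    matched.add(tok);
    HashMap curMap = (HashMap) expressionMap.get(tok.termText());

    while (curMap != null) {
      Token following = input.next();
      if (following == null || !curMap.containsKey(following.termText())) {
        // Dead end: emit the buffered tokens one by one, then the
        // non-matching token, so no information is lost.
        pending.addAll(matched.subList(1, matched.size()));
        if (following != null)
          pending.add(following);
        return (Token) matched.getFirst();
      }
      matched.add(following);
      curMap = (HashMap) curMap.get(following.termText());
    }

    // curMap == null marks a complete expression: join the words with '_'.
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < matched.size(); i++) {
      if (i > 0)
        sb.append('_');
      sb.append(((Token) matched.get(i)).termText());
    }
    Token first = (Token) matched.getFirst();
    Token last = (Token) matched.getLast();
    return new Token(sb.toString(), first.startOffset(), last.endOffset());
  }
}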

RE: Tokenizing text custom way

2003-11-25 Thread Pleasant, Tracy
Not exactly an answer to the question, but I haven't yet used the Token
classes/functionality that came with Lucene. Can someone give me an idea of
how and why one may use them?

 

-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 6:42 AM
To: Lucene Users List
Subject: Tokenizing text custom way


Hi. I need to tokenize text while indexing, but I don't want space to be the delimiter.
The delimiter should be my custom character (for example a comma). I understand that I
would probably need to implement my own analyzer, but could someone help me with where to
start? Is there any other way to do this without writing a custom analyzer?

This is what I want to achieve.
If I have some text that will be indexed like the following:

man, people, time out, sun

and if I enter 'time' as a search word, I don't want to get "time out" in the results. I
need exact keyword matching. I would achieve this if I tokenize "time out" as one token
while indexing.

Maybe someone has had a similar problem? If someone knows how to handle this, please help me.

Dragan Jotanovic





Re: Tokenizing text custom way

2003-11-25 Thread Hackl, Rene
> Your solution isn't doing tokenizing, right?

You're absolutely right, I misunderstood.

Now, instead of return true, I'd maybe put something like

return !Character.toString(c).equals(",");

and then cut off surrounding spaces like "man, people, time out,..."
--> "man" " people" " time out"
--> "man" "people" "time out"

I haven't tested this though. Keep us posted when you find 
something that works. :-)
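
An untested sketch of that trimming step (not from René's mail; TrimFilter is
a made-up name here, and it assumes the Lucene 1.x Token/TokenFilter API):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class TrimFilter extends TokenFilter {
  public TrimFilter(TokenStream in) {
    input = in;  // 'input' is TokenFilter's protected upstream stream
  }

  public Token next() throws IOException {
    Token tok = input.next();
    if (tok == null)
      return null;
    // Strip the leading/trailing spaces left over when splitting on commas
    // only, e.g. " time out" becomes "time out".
    return new Token(tok.termText().trim(), tok.startOffset(), tok.endOffset());
  }
}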

Best regards,

René




Re: Tokenizing text custom way

2003-11-25 Thread Dragan Jotanovic
Hi Rene,

> I've had the same problem. On some fields, I do
> employ a "NonTokenizer" now,
> which looks similar to the other tokenizers except for:

> protected boolean isTokenChar(char c)
>  {
>return true;
>  }

> So "time out" would be one token.

This is an OK solution in case I have only "time out" in a field, but I
will have dozens of words in one field of a document. Like I said in my
previous mail, I would have "man, people, time out, sun", and all those
words would be in one field and all should be "searchable" (I need to
tokenize them as "man" "people" "time out" "sun").

Your solution isn't doing tokenizing, right?

Dragan Jotanovic








Re: Tokenizing text custom way

2003-11-25 Thread Hackl, Rene
Hi Dragan,

> and if I enter 'time' as a search word, I don't want to get "time out" in
> results. I need exact keyword matching. I would achieve this if I tokenize
> "time out" as one token while idexing.

> Maybe someone had similar problem? If someone knows how to handle this,
> please help me.

I've had the same problem. On some fields, I do employ a "NonTokenizer" now,
which looks similar to the other tokenizers except for:

protected boolean isTokenChar(char c) {
  return true;
}

So "time out" would be one token.

HTH

René
