RE: Tokenizing text custom way
Do you want to define expressions, i.e. a set of terms that must be interpreted as a whole? For instance, when the Analyzer catches "time" followed by "out", it returns "time_out"?

-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 12:12
To: Lucene Users List
Subject: Re: Tokenizing text custom way

> You will need to write a custom analyzer. Don't worry, though it's
> quite straightforward. You will also need to write a Tokenizer, but
> Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" like "time_out", using StandardAnalyzer and later if I search for "time out" (inside quotes) I should get proper result, but if I search for "time" I shouldn't get result. Is this right?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Tokenizing text custom way
On Wednesday, November 26, 2003, at 06:12 AM, Dragan Jotanovic wrote:
>> You will need to write a custom analyzer. Don't worry, though it's
>> quite straightforward. You will also need to write a Tokenizer, but
>> Lucene helps you a lot here.
>
> Wouldn't I achieve the same result if I index "time out" like "time_out", using StandardAnalyzer and later if I search for "time out" (inside quotes) I should get proper result, but if I search for "time" I shouldn't get result. Is this right?

I'm confused about what you are planning to do. Are you going to replace all spaces with an underscore before handing the text to the analyzer? StandardAnalyzer will still split at the underscores, though.

If you have special tokenization needs, why try to hack it somehow rather than address it cleanly in the way Lucene was designed to work?

Erik
Re: Tokenizing text custom way
> You will need to write a custom analyzer. Don't worry, though it's
> quite straightforward. You will also need to write a Tokenizer, but
> Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" like "time_out", using StandardAnalyzer and later if I search for "time out" (inside quotes) I should get proper result, but if I search for "time" I shouldn't get result. Is this right?
Re: Tokenizing text custom way
On Tuesday, November 25, 2003, at 06:41 AM, Dragan Jotanovic wrote:
> Hi. I need to tokenize text while indexing but I don't want space to be delimiter. Delimiter should be my custom character (for example comma). I understand that I would probably need to implement my own analyzer, but could someone help me where to start. Is there any other way to do this without writing custom analyzer?

You will need to write a custom analyzer. Don't worry, though it's quite straightforward. You will also need to write a Tokenizer, but Lucene helps you a lot here. Lucene's LetterTokenizer is simply this:

    public class LetterTokenizer extends CharTokenizer {
      /** Construct a new LetterTokenizer. */
      public LetterTokenizer(Reader in) {
        super(in);
      }

      /** Collects only characters which satisfy
       * {@link Character#isLetter(char)}.*/
      protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
      }
    }

You could change the isTokenChar method in your custom CommaTokenizer to only return true if the character is not a ','. And you might want to implement the normalize method to lowercase (look at LowerCaseTokenizer).

My advice is for you to check out Lucene's source code in the TokenStream hierarchy (ctrl-H in IntelliJ is quite nice! :). CharTokenizer seems a good starting point for you.

Then have a look at SimpleAnalyzer:

    public final class SimpleAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseTokenizer(reader);
      }
    }

Just create your own CommaAnalyzer that uses your CommaTokenizer similar to this. Have a look at my java.net article and try the sample code provided there to observe the analysis process in greater detail so you can check that you get what you expect.

> and if I enter 'time' as a search word, I don't want to get "time out" in results. I need exact keyword matching. I would achieve this if I tokenize "time out" as one token while indexing.

It will be a little trickier on the query part if you're using QueryParser - you will need to double-quote "time out" for it to work, I believe - but don't worry about this until you get the analysis phase worked out; then we can revisit the QueryParser issue.

Erik
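As a rough illustration of the comma-splitting idea described above, here is a standalone sketch in plain Java, so it runs without Lucene on the classpath. It is not the CharTokenizer API itself: the class name CommaTokenizerSketch and the tokenize method are hypothetical, and only isTokenChar and normalize mirror the hooks named in the message. Trimming the spaces around each token is an extra assumption, since "man, people" would otherwise yield " people" with a leading space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CharTokenizer subclass; not part of Lucene.
final class CommaTokenizerSketch {

    // A character belongs to the current token unless it is the delimiter.
    static boolean isTokenChar(char c) {
        return c != ',';
    }

    // Lowercase each character as it is collected.
    static char normalize(char c) {
        return Character.toLowerCase(c);
    }

    // Collect runs of token characters, then trim surrounding spaces (assumption).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = (i == text.length());
            if (!atEnd && isTokenChar(text.charAt(i))) {
                current.append(normalize(text.charAt(i)));
            } else {
                // Delimiter or end of input: flush the token collected so far.
                String token = current.toString().trim();
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
                current.setLength(0);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [man, people, time out, sun]
        System.out.println(tokenize("man, people, time out, sun"));
    }
}
```

With tokens produced this way, "time out" is a single token, so a term search for "time" alone would not be expected to match it.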
Re: Tokenizing text custom way
woah that seems like an awfully complex answer to the question of how to tokenize at a comma rather than a space! %-)

On Tuesday, November 25, 2003, at 11:48 AM, MOYSE Gilles (Cetelem) wrote:
> Hi.
>
> You should define expressions. To define expressions, you first have to define an expression file. An expression file contains one expression per line. For instance:
>
>     time_out
>     expert_system
>     ...
>
> You can use any character to specify the "expression link". Here, I use the underscore (_).
>
> Then, you have to build an expression loader. You can store expressions in recursive HashMaps. Such a HashMap must be built so that HashMap.get("word1") = HashMap, and (HashMap.get("word1")).get("word2") = null, if you want to code the expression "word1_word2". In other words, 'HashMap.get("a_word")' returns a HashMap containing all the successors of the word 'a_word'.
>
> So, if your expression file looks like this:
>
>     time_out
>     expert_system
>     expert_in_information
>
> you'll have to build a loader which returns a HashMap H so that:
>
>     H.keySet() = {"time", "expert"}
>     ((HashMap)H.get("time")).keySet() = {"out"}
>     ((HashMap)H.get("time")).get("out") = null // null indicates the end of the expression
>     ((HashMap)H.get("expert")).keySet() = {"system", "in"}
>     ((HashMap)H.get("expert")).get("system") = null
>     ((HashMap)((HashMap)H.get("expert")).get("in")).keySet() = {"information"}
>     ((HashMap)((HashMap)H.get("expert")).get("in")).get("information") = null
>
> These recursive HashMaps code the following tree:
>
>     time --- out --- null
>     expert -+- system --- null
>             +- in --- information --- null
>
> Such an expression loader may be designed this way:
>
>     public static HashMap getExpressionMap( File wordfile )
>     {
>         HashMap result = new HashMap();
>
>         try
>         {
>             String line = null;
>             LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
>             HashMap hashToAdd = null;
>
>             while ((line = in.readLine()) != null)
>             {
>                 if (line.startsWith(FILE_COMMENT_CHARACTER))
>                     continue;
>                 if (line.trim().length() == 0)
>                     continue;
>
>                 StringTokenizer stok = new StringTokenizer(line, " \t_");
>                 String curTok = "";
>                 HashMap currentHash = result;
>
>                 // Test whether the expression contains at least 2 words or not
>                 if (stok.countTokens() < 2)
>                 {
>                     System.err.println("Warning : '" + line + "' in file '"
>                         + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
>                         + " is not an expression.\n\tA valid expression contains at least 2 words.");
>                     continue;
>                 }
>
>                 while (stok.hasMoreTokens())
>                 {
>                     curTok = stok.nextToken();
>                     // if comment at the end of the line, break
>                     if (curTok.startsWith(FILE_COMMENT_CHARACTER))
>                         break;
>                     if (stok.hasMoreTokens())
>                         hashToAdd = new HashMap(6);
>                     else
>                         hashToAdd = (HashMap)null;
>                     if (!(currentHash.containsKey(curTok)))
>                         currentHash.put(curTok, hashToAdd);
>                     currentHash = (HashMap)currentHash.get(curTok);
>                 }
>             }
>             return result;
>         }
>         // On error, use an empty table
>         catch ( Exception e )
>         {
>             System.err.println("While processing '" + wordfile.getAbsolutePath() + "' : " + e.getMessage());
>             e.printStackTrace();
>             return new HashMap();
>         }
>     }
>
> Then, you must build a filter with 2 FIFO stacks: one is the expression stack, the other is the default stack. Then, you define a 'curMap' variable, initially pointing to the HashMap returned by the ExpressionFileLoader.
>
> When you receive a token, you check whether it is null or not.
>
> If it is, you check whether the default stack is empty or not. If it is not, you pop a token from the default stack and you return it. If it is, you return null.
>
> If it is not (the token is not null), you check whether it is contained in the HashMap or not (curMap.containsKey(token)).
>
> If it is not contained and you were building an expression, you pop all the terms in the expression stack to push them in the default stack (so as not to lose information).
>
> If it is not contained and the default stack is empty, you return the token.
>
> If it is not contained and the default stack is not empty, you return the popped token from the default stack and you push the current token.
>
> If the token is contained in the curMap, then the token MAY be the first element of an expression. You push the token in the expression stack, and you dive into the next level in your expression tree (curMap = curMap.get("token")).
>
> If the next level (now, curMap) is null, then you have completed your expression. You can pop all the tokens from the expression stack to concatenate them, separated by underscores, and push
RE: Tokenizing text custom way
Hi.

You should define expressions. To define expressions, you first have to define an expression file. An expression file contains one expression per line. For instance:

    time_out
    expert_system
    ...

You can use any character to specify the "expression link". Here, I use the underscore (_).

Then, you have to build an expression loader. You can store expressions in recursive HashMaps. Such a HashMap must be built so that HashMap.get("word1") = HashMap, and (HashMap.get("word1")).get("word2") = null, if you want to code the expression "word1_word2". In other words, 'HashMap.get("a_word")' returns a HashMap containing all the successors of the word 'a_word'.

So, if your expression file looks like this:

    time_out
    expert_system
    expert_in_information

you'll have to build a loader which returns a HashMap H so that:

    H.keySet() = {"time", "expert"}
    ((HashMap)H.get("time")).keySet() = {"out"}
    ((HashMap)H.get("time")).get("out") = null // null indicates the end of the expression
    ((HashMap)H.get("expert")).keySet() = {"system", "in"}
    ((HashMap)H.get("expert")).get("system") = null
    ((HashMap)((HashMap)H.get("expert")).get("in")).keySet() = {"information"}
    ((HashMap)((HashMap)H.get("expert")).get("in")).get("information") = null

These recursive HashMaps code the following tree:

    time --- out --- null
    expert -+- system --- null
            +- in --- information --- null

Such an expression loader may be designed this way:

    public static HashMap getExpressionMap( File wordfile )
    {
        HashMap result = new HashMap();

        try
        {
            String line = null;
            LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
            HashMap hashToAdd = null;

            while ((line = in.readLine()) != null)
            {
                if (line.startsWith(FILE_COMMENT_CHARACTER))
                    continue;
                if (line.trim().length() == 0)
                    continue;

                StringTokenizer stok = new StringTokenizer(line, " \t_");
                String curTok = "";
                HashMap currentHash = result;

                // Test whether the expression contains at least 2 words or not
                if (stok.countTokens() < 2)
                {
                    System.err.println("Warning : '" + line + "' in file '"
                        + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
                        + " is not an expression.\n\tA valid expression contains at least 2 words.");
                    continue;
                }

                while (stok.hasMoreTokens())
                {
                    curTok = stok.nextToken();
                    // if comment at the end of the line, break
                    if (curTok.startsWith(FILE_COMMENT_CHARACTER))
                        break;
                    if (stok.hasMoreTokens())
                        hashToAdd = new HashMap(6);
                    else
                        hashToAdd = (HashMap)null;
                    if (!(currentHash.containsKey(curTok)))
                        currentHash.put(curTok, hashToAdd);
                    currentHash = (HashMap)currentHash.get(curTok);
                }
            }
            return result;
        }
        // On error, use an empty table
        catch ( Exception e )
        {
            System.err.println("While processing '" + wordfile.getAbsolutePath() + "' : " + e.getMessage());
            e.printStackTrace();
            return new HashMap();
        }
    }

Then, you must build a filter with 2 FIFO stacks: one is the expression stack, the other is the default stack. Then, you define a 'curMap' variable, initially pointing to the HashMap returned by the ExpressionFileLoader.

When you receive a token, you check whether it is null or not.

If it is, you check whether the default stack is empty or not. If it is not, you pop a token from the default stack and you return it. If it is, you return null
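The expression tree and the merging step described in this message can be sketched roughly as follows. This is a simplified variation, not the author's filter: it works on an in-memory list of tokens rather than a Lucene TokenStream, it replaces the two FIFO stacks with a look-ahead loop that falls back to emitting single tokens, and it marks the end of an expression with a flag instead of a null map (an assumption made so that one expression can be a prefix of another). All names (ExpressionSketch, Node, buildTree, merge) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from Lucene or the message above.
final class ExpressionSketch {

    // One level of the expression tree: successor words plus an end-of-expression flag.
    static final class Node {
        final Map<String, Node> next = new HashMap<>();
        boolean endsExpression;
    }

    // Build the tree from lines such as "time_out" or "expert_in_information".
    static Node buildTree(List<String> expressions) {
        Node root = new Node();
        for (String expression : expressions) {
            String[] words = expression.trim().split("[ \t_]+");
            if (words.length < 2) {
                continue; // a valid expression contains at least 2 words
            }
            Node node = root;
            for (String word : words) {
                node = node.next.computeIfAbsent(word, k -> new Node());
            }
            node.endsExpression = true;
        }
        return root;
    }

    // Replace each longest run of tokens that forms an expression by one joined token.
    static List<String> merge(List<String> tokens, Node root) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int j = i;
            int matchEnd = -1; // exclusive end index of the longest complete expression
            while (j < tokens.size() && node.next.containsKey(tokens.get(j))) {
                node = node.next.get(tokens.get(j));
                j++;
                if (node.endsExpression) {
                    matchEnd = j;
                }
            }
            if (matchEnd > i) {
                out.add(String.join("_", tokens.subList(i, matchEnd)));
                i = matchEnd;
            } else {
                out.add(tokens.get(i)); // no expression starts here: keep the token as-is
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Node root = buildTree(Arrays.asList("time_out", "expert_system", "expert_in_information"));
        // Prints [the, expert, in, time_out]
        System.out.println(merge(Arrays.asList("the", "expert", "in", "time", "out"), root));
    }
}
```

In the sample run, "expert in" starts an expression that is never completed, so both words are emitted unchanged, which corresponds to the "so as not to lose information" case in the description.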
RE: Tokenizing text custom way
Not exactly an answer to the question, but I haven't yet used the Token classes/functionality that come with Lucene. Can someone give me an idea of how and why one may use them?

-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 6:42 AM
To: Lucene Users List
Subject: Tokenizing text custom way

Hi. I need to tokenize text while indexing but I don't want space to be delimiter. Delimiter should be my custom character (for example comma). I understand that I would probably need to implement my own analyzer, but could someone help me where to start. Is there any other way to do this without writing custom analyzer?

This is what I want to achieve. If I have some text that will be indexed like following:

    man, people, time out, sun

and if I enter 'time' as a search word, I don't want to get "time out" in results. I need exact keyword matching. I would achieve this if I tokenize "time out" as one token while indexing.

Maybe someone had similar problem? If someone knows how to handle this, please help me.

Dragan Jotanovic
Re: Tokenizing text custom way
> Your solution isn't doing tokenizing, right?

You're absolutely right, I misunderstood. Now, instead of return true, I'd maybe put something like

    return !Character.toString(c).equals(",");

and then cut off surrounding spaces like

    "man, people, time out,..." --> "man" " people" " time out" --> "man" "people" "time out"

I haven't tested this though. Keep us posted when you find something that works. :-)

Best regards,
René
Re: Tokenizing text custom way
Hi Rene,

> I've had the same problem. On some fields, I do
> employ a "NonTokenizer" now,
> which looks similar to the other tokenizers except for:
> protected boolean isTokenChar(char c)
> {
>    return true;
> }
> So "time out" would be one token.

This is an OK solution if I have only "time out" in a field, but I will have dozens of words in one field of a document. Like I said in my previous message, I would have "man, people, time out, sun", and all those words would be in one field and all should be "searchable" (I need to tokenize them like "man" "people" "time out" "sun"). Your solution isn't doing tokenizing, right?

Dragan Jotanovic
Re: Tokenizing text custom way
Hi Dragan,

> and if I enter 'time' as a search word, I don't want to get "time out" in
> results. I need exact keyword matching. I would achieve this if I tokenize
> "time out" as one token while indexing.
> Maybe someone had similar problem? If someone knows how to handle this,
> please help me.

I've had the same problem. On some fields, I do employ a "NonTokenizer" now, which looks similar to the other tokenizers except for:

    protected boolean isTokenChar(char c)
    {
        return true;
    }

So "time out" would be one token.

HTH
René