Re: Eliminating duplicate result
When you are doing two searches are you searching for two different terms? No, I am searching for the same term. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: unexpected results from query
On Tuesday, November 25, 2003, at 10:45 PM, marc wrote:

Hi, assume a field has the following text: Adenylate kinase (mitochondrial GTP:AMP phosphotransferase). The following searches all return this document: AMP AMP AMP; Can someone explain this to me? I figured that only the first query would be successful.

This depends on the Analyzer you're using. I'm assuming you're using the QueryParser and an analyzer that strips special characters, so essentially the TermQuery underneath is always for AMP. Have a look at my first java.net article, which shows the analysis process. Run your sample text through the code provided there to see the effect first-hand.

Erik
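The effect Erik describes can be illustrated without Lucene at all. The following is a minimal sketch, not Lucene's analyzer: `ToyAnalyzer` is a hypothetical stand-in that splits on anything that is not a letter and lowercases the pieces, and the three query strings are made-up variants that differ only in punctuation. Under those assumptions, every variant reduces to the same single term, which is why they would all match the same document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy stand-in for an analyzer; this is NOT Lucene's implementation. */
final class ToyAnalyzer {
    /** Splits text on every run of non-letter characters and lowercases each token. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z]+")) {
            if (!piece.isEmpty()) {          // split() can yield a leading empty string
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String field = "Adenylate kinase (mitochondrial GTP:AMP phosphotransferase)";
        System.out.println(tokens(field));
        // Hypothetical query strings that differ only in punctuation.
        for (String query : new String[] {"AMP", "(AMP", "AMP;"}) {
            System.out.println(query + " -> " + tokens(query));
        }
    }
}
```

Running the real text through the actual Analyzer in use, as suggested above, remains the only way to see what a given Lucene configuration does.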
Re: Tokenizing text custom way
Whoa, that seems like an awfully complex answer to the question of how to tokenize at a comma rather than a space! %-)

On Tuesday, November 25, 2003, at 11:48 AM, MOYSE Gilles (Cetelem) wrote:

Hi. You should define expressions. To define expressions, you first have to define an expression file. An expression file contains one expression per line. For instance:

    time_out
    expert_system
    ...

You can use any character to specify the expression link. Here, I use the underscore (_).

Then, you have to build an expression loader. You can store expressions in recursive HashMaps. Such a HashMap must be built so that HashMap.get(word1) = HashMap, and (HashMap.get(word1)).get(word2) = null, if you want to code the expression word1_word2. In other words, 'HashMap.get(a_word)' returns a HashMap containing all the successors of the word 'a_word'. So, if your expression file looks like this:

    time_out
    expert_system
    expert_in_information

you'll have to build a loader which returns a HashMap H so that:

    H.keySet() = {time, expert}
    ((HashMap)H.get(time)).keySet() = {out}
    ((HashMap)H.get(time)).get(out) = null    // null indicates the end of the expression
    ((HashMap)H.get(expert)).keySet() = {system, in}
    ((HashMap)H.get(expert)).get(system) = null
    ((HashMap)((HashMap)H.get(expert)).get(in)).keySet() = {information}
    ((HashMap)((HashMap)H.get(expert)).get(in)).get(information) = null

These recursive HashMaps code the following tree:

    time --- out --- null
    expert -+- system --- null
            +- in --- information --- null

Such an expression loader may be designed this way:

    import java.io.File;
    import java.io.FileReader;
    import java.io.LineNumberReader;
    import java.util.HashMap;
    import java.util.StringTokenizer;

    // FILE_COMMENT_CHARACTER is a String constant defined elsewhere in the
    // class; its value is not given in this message.
    public static HashMap getExpressionMap(File wordfile) {
        HashMap result = new HashMap();
        try {
            String line = null;
            LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
            HashMap hashToAdd = null;
            while ((line = in.readLine()) != null) {
                // Skip comment lines and blank lines
                if (line.startsWith(FILE_COMMENT_CHARACTER)) continue;
                if (line.trim().length() == 0) continue;
                // Split on space, tab, and underscore
                StringTokenizer stok = new StringTokenizer(line, " \t_");
                String curTok = "";
                HashMap currentHash = result;
                // Test whether the expression contains at least 2 words
                if (stok.countTokens() < 2) {
                    System.err.println("Warning : '" + line + "' in file '"
                        + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
                        + " is not an expression.\n\tA valid expression contains at least 2 words.");
                    continue;
                }
                while (stok.hasMoreTokens()) {
                    curTok = stok.nextToken();
                    // A comment at the end of the line ends the expression
                    if (curTok.startsWith(FILE_COMMENT_CHARACTER)) break;
                    // Inner words map to the next level; the last word maps to null
                    if (stok.hasMoreTokens()) hashToAdd = new HashMap(6);
                    else hashToAdd = null;
                    if (!currentHash.containsKey(curTok)) currentHash.put(curTok, hashToAdd);
                    // Note: this is null when one expression is a prefix of a later, longer one
                    currentHash = (HashMap) currentHash.get(curTok);
                }
            }
            in.close();
            return result;
        }
        // On error, use an empty table
        catch (Exception e) {
            System.err.println("While processing '" + wordfile.getAbsolutePath() + "' : " + e.getMessage());
            e.printStackTrace();
            return new HashMap();
        }
    }

Then, you must build a filter with 2 FIFO stacks (queues): one is the expression stack, the other is the default stack. Then, you define a 'curMap' variable, initially pointing to the HashMap returned by the ExpressionFileLoader.

When you receive a token, you check whether it is null or not.

If it is null, you check whether the default stack is empty or not. If it is not empty, you pop a token from the default stack and return it. If it is empty, you return null.

If the token is not null, you check whether it is contained in the HashMap or not (curMap.containsKey(token)).

- If it is not contained and you were building an expression, you pop all the terms from the expression stack and push them onto the default stack (so as not to lose information).
- If it is not contained and the default stack is empty, you return the token.
- If it is not contained and the default stack is not empty, you return the token popped from the default stack and you push the current token.
- If the token is contained in curMap, then the token MAY be the first element of an expression. You push the token onto the expression stack, and you dive into the next level of your expression tree (curMap = curMap.get(token)). If the next level (now curMap) is null, then you have completed your expression. You can pop all the tokens from the expression stack, concatenate them separated by underscores, and push the resulting String as a token onto the default stack (so as to
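The expression tree and the merging step described in this message can be sketched compactly. This is a simplified batch version under stated assumptions, not the streaming two-stack filter from the message: `ExpressionMerger`, `Node`, `add`, and `merge` are hypothetical names, an explicit `end` flag replaces the null marker so that one expression may be a prefix of another, and the whole token list is available up front so the longest match can be found by looking ahead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Joins known multi-word expressions into single underscore-separated tokens. */
final class ExpressionMerger {
    /** One level of the expression tree; 'end' marks a complete expression. */
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    /** Registers an expression written as word1_word2_... */
    void add(String expression) {
        Node node = root;
        for (String word : expression.split("_")) {
            node = node.next.computeIfAbsent(word, w -> new Node());
        }
        node.end = true;
    }

    /** Replaces the longest known expression at each position with one joined token. */
    List<String> merge(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int matchEnd = -1; // index just past the longest complete expression starting at i
            for (int j = i; j < tokens.size(); j++) {
                node = node.next.get(tokens.get(j));
                if (node == null) break;       // no expression continues with this word
                if (node.end) matchEnd = j + 1;
            }
            if (matchEnd > i) {
                out.add(String.join("_", tokens.subList(i, matchEnd)));
                i = matchEnd;
            } else {
                out.add(tokens.get(i));        // not part of any expression: pass through
                i++;
            }
        }
        return out;
    }
}
```

With the three expressions from the example file registered, the token sequence `expert, in, information` becomes the single token `expert_in_information`, while a partial match such as `expert, in, time` is passed through unchanged, which is the role the default stack plays in the streaming description.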
Re: Tokenizing text custom way
You will need to write a custom analyzer. Don't worry, though; it's quite straightforward. You will also need to write a Tokenizer, but Lucene helps you a lot here.

Wouldn't I achieve the same result if I indexed "time out" as time_out, using StandardAnalyzer? Later, if I search for "time out" (inside quotes) I should get the proper result, but if I search for time I shouldn't get a result. Is this right?
Re: Chinese input.
Maybe this will help? http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23545

Otis

--- Tun Lin [EMAIL PROTECTED] wrote:

Hi, may I know how to analyse Chinese input from Chinese text in Lucene? Do I use the Analyzer function in Lucene? If yes, how do I go about using it?
Re: Tokenizing text custom way
On Wednesday, November 26, 2003, at 06:12 AM, Dragan Jotanovic wrote:

You will need to write a custom analyzer. Don't worry, though; it's quite straightforward. You will also need to write a Tokenizer, but Lucene helps you a lot here.

Wouldn't I achieve the same result if I indexed "time out" as time_out, using StandardAnalyzer? Later, if I search for "time out" (inside quotes) I should get the proper result, but if I search for time I shouldn't get a result. Is this right?

I'm confused about what you are planning to do. Are you going to replace all spaces with an underscore before handing the text to the analyzer? StandardAnalyzer will still split at the underscores, though. If you have special tokenization needs, why try to hack it somehow rather than address it cleanly in the way Lucene was designed to work?

Erik
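The tokenization rule at the heart of this thread, splitting at commas rather than at whitespace, is small enough to sketch on its own. The class below is a hypothetical, Lucene-free illustration (`CommaTokenizer` and `tokenize` are assumed names); a real solution would put the same rule inside a custom Tokenizer and Analyzer as suggested above.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits text at commas only, so that phrases containing spaces stay whole. */
final class CommaTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(",")) {
            String token = piece.trim();   // drop whitespace around each phrase
            if (!token.isEmpty()) {        // ignore empty entries such as ",,"
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("time out, expert system, lucene"));
    }
}
```

Because "time out" is kept as one token here, a search for the bare word time would not match it, which is the behaviour the question asks for. The same rule has to be applied to the query text as well, or the indexed and searched terms will not line up.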
RE: Tokenizing text custom way
Do you want to define expressions, i.e. a set of terms that must be interpreted as a whole? For instance, when the Analyzer catches "time" followed by "out", it returns time_out?

-----Original Message-----
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 26, 2003 12:12
To: Lucene Users List
Subject: Re: Tokenizing text custom way

You will need to write a custom analyzer. Don't worry, though; it's quite straightforward. You will also need to write a Tokenizer, but Lucene helps you a lot here.

Wouldn't I achieve the same result if I indexed "time out" as time_out, using StandardAnalyzer? Later, if I search for "time out" (inside quotes) I should get the proper result, but if I search for time I shouldn't get a result. Is this right?
Re: log4j.properties
As I've said previously, it's a log4j problem and not a Lucene problem; you should post there.

sv

On Wed, 26 Nov 2003, Tun Lin wrote:

I have created the following log4j.properties and put it in the classpath but it still has that error. Can anyone help?

    log4j.rootCategory=stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
RE: log4j.properties
I have integrated Lucene and PDFBox and tried the following command to index files:

    java -Dlog4j.configuration=log4j.xml org.pdfbox.searchengine.lucene.IndexFiles -create -index c:\\index ..

But I get the following error message:

    log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
    log4j:WARN Please initialize the log4j system properly.

Can anyone help?

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 26, 2003 5:19 PM
To: Lucene Users List
Subject: Re: log4j.properties

What does this have to do with Lucene?

On Wednesday, November 26, 2003, at 01:04 AM, Tun Lin wrote:

I have created the following log4j.properties and put it in the classpath but it still has that error. Can anyone help?

    log4j.rootCategory=stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
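Two details in the messages above may explain the warning; this is offered as an assumption, not a confirmed diagnosis. In log4j 1.x the root category value is read as a level followed by appender names, so `log4j.rootCategory=stdout` names no appender, and the command line points at `log4j.xml` while the file that was created is `log4j.properties`. A configuration sketch under those assumptions (the `INFO` level is an arbitrary choice):

```properties
# Assumed fix: "<level>, <appender name>" instead of the appender name alone
log4j.rootCategory=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
```

The matching system property would then be `-Dlog4j.configuration=log4j.properties`, with the file reachable on the classpath.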
RE: Search Question - not returning desired results
Thanks, this helps a lot :)

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 26, 2003 4:58 AM
To: Lucene Users List
Subject: Re: Search Question - not returning desired results

On Tuesday, November 25, 2003, at 12:11 PM, Pleasant, Tracy wrote:

The documents I have indexed also contain information regarding file names. For instance, 'return_results.pl' or something like that may be in the document fields. I do not understand Lucene's way of searching:

1. If I search for 'return_results', the search does not return anything.
2. If I search for 'results' or 'return', the search does not return anything.
3. If I search for 'results.pl', the search does return the document containing 'return_results.pl'.
4. If I search for 'results~', the search does return the document containing 'return_results.pl'.
5. If I search for 'return_results~', the search does not return anything.

What is going on? I want it to return the document in all of these situations. I also don't want to have to use '~' all the time.

We sure do have a recurring theme lately :) Analysis! Please refer to my article at java.net: http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Look at the AnalysisDemo code. Copy it over and try it out on the text you're using and the Analyzer you're using. The bracketed pieces of text that come out are the tokens that you can search on. It is very important to understand this process and to really know what terms come out of the text you hand it; otherwise it is a mystery why some things can be found and some things cannot, despite your expectations to the contrary.

A follow-up to analysis is querying, and QueryParser has its own set of quirks and caveats related to how things are tokenized/analyzed. And I've got just the follow-up article for you handy: http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html

If you digest both of these articles (the analysis one first, please) then I think a lot of questions that get asked on this list will be implicitly answered. Understanding analysis is key.

Erik
RE: Search Question - not returning desired results
Erik, I think there may be a typo on the website. When I run the AnalysisDemo:

    Analzying xyz corporation - [EMAIL PROTECTED]
    org.apache.lucene.analysis.standard.StandardAnalyzer: [xyz] [corporation] [EMAIL PROTECTED]

Your website says:

    org.apache.lucene.analysis.standard.StandardAnalyzer: [xyz] [corporation] [EMAIL PROTECTED] [com]

When I run it, it keeps the entire email '[EMAIL PROTECTED]', but according to your website it separates the '[EMAIL PROTECTED]' from the 'com'. Is there a difference between the versions of Lucene? I'm using 1.3rc2.

Plus, I think what I want is a StandardAnalyzer with a little tweaking. The simple one was fine until I realized that it doesn't handle numbers, which I need as part of my search since numbers are important for what I'm doing. The Standard one handles numbers, but I need it to behave a little differently, of course.

Thanks for the site.

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 26, 2003 4:58 AM
To: Lucene Users List
Subject: Re: Search Question - not returning desired results
RE: Eliminating duplicate result
You are searching for the same term and you are searching the same index twice, so it will return the same results... I don't get what you are asking.

-----Original Message-----
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 26, 2003 3:19 AM
To: Lucene Users List
Subject: Re: Eliminating duplicate result

When you are doing two searches are you searching for two different terms?

No, I am searching for the same term.

What is the easiest way to eliminate duplicate documents if one is doing two searches on the same index? Has anybody done something similar?
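One way to drop the duplicates, whatever the reason for running two searches, is to merge the two result lists on a unique document identifier. The sketch below assumes each indexed document stores such an identifier and that it has already been read out of the hits into plain strings; `HitMerger` and `mergeUnique` are hypothetical names and no Lucene class is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Merges two lists of document identifiers, keeping the first occurrence of each. */
final class HitMerger {
    static List<String> mergeUnique(List<String> first, List<String> second) {
        // LinkedHashSet drops repeats while preserving insertion order,
        // so hits from the first search keep their ranking.
        Set<String> seen = new LinkedHashSet<>(first);
        seen.addAll(second);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> a = List.of("doc1", "doc2", "doc3");
        List<String> b = List.of("doc2", "doc4", "doc1");
        System.out.println(mergeUnique(a, b));
    }
}
```

The order of the merged list follows the first search and then appends whatever only the second search found; if scores from both searches need to be combined, a map from identifier to score would be the natural next step.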
Re: Search Question - not returning desired results
On Wednesday, November 26, 2003, at 11:33 AM, Pleasant, Tracy wrote:

Your website says:

    org.apache.lucene.analysis.standard.StandardAnalyzer: [xyz] [corporation] [EMAIL PROTECTED] [com]

When I run it, it keeps the entire email '[EMAIL PROTECTED]', but according to your website it separates the '[EMAIL PROTECTED]' from the 'com'. Is there a difference between the versions of Lucene? I'm using 1.3rc2.

Yes, I fixed the bug in the StandardTokenizer that caused e-mail addresses to get split, but I fixed it after the article was written. Good eye!
Re: Collaborative Filtering API
On Tue, Nov 25, 2003 at 01:18:19PM -0500, Michael Giles wrote:

Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota.

I've actually worked on a system that bundled GroupLens. I think it was Vignette StoryServer. The Vignette docs were incredibly dense with MarketingNewSpeak, so I could never quite figure out what they said GroupLens actually *did* (I'm not at a web-capable terminal right now, or I'd just google it).

Collaborative filtering in general is a topic I'm interested in, and is why I first got into Lucene. I wanted, and still want, to build a collaborative filtering search engine for mailing lists and the like.

I do remember that FireFly's engine was supposed to graph all of the users' ratings on a topic in an N-dimensional space, then find users close to the current user in that N-dimensional space, and suggest topics that they'd liked but that the current user hadn't rated.

I'm interested in more of a free-market sort of approach than in statistical analysis; I want to build a system that helps users express their opinions, then nurtures an emerging consensus. My experience has been that systems/technologies that try to facilitate the way users already do things, instead of replacing them with new ways of doing things, tend to work better.

-- 
Steven J. Owens [EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt." - Me at http://darksleep.com
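The nearest-neighbour idea recalled above can be sketched in a few lines. This is an illustration of the general technique only, not FireFly's or GroupLens's actual algorithm: `NeighborRecommender`, the Euclidean distance over shared topics, and the `likedAtLeast` threshold are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Suggests topics liked by the user whose ratings are closest to the given user's. */
final class NeighborRecommender {
    /** Euclidean distance over the topics both users rated; infinite if they share none. */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double sum = 0;
        int shared = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                double d = e.getValue() - other;
                sum += d * d;
                shared++;
            }
        }
        return shared == 0 ? Double.POSITIVE_INFINITY : Math.sqrt(sum);
    }

    /** Topics the nearest other user rated at least likedAtLeast and this user has not rated. */
    static Set<String> suggest(String user, Map<String, Map<String, Integer>> ratings, int likedAtLeast) {
        Set<String> out = new TreeSet<>();
        Map<String, Integer> mine = ratings.get(user);
        if (mine == null) return out;          // unknown user: nothing to compare against
        String nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : ratings.entrySet()) {
            if (e.getKey().equals(user)) continue;
            double d = distance(mine, e.getValue());
            if (d < best) {
                best = d;
                nearest = e.getKey();
            }
        }
        if (nearest == null) return out;       // no user shares any rated topic
        for (Map.Entry<String, Integer> e : ratings.get(nearest).entrySet()) {
            if (e.getValue() >= likedAtLeast && !mine.containsKey(e.getKey())) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

A single nearest neighbour and a distance measured only over shared topics are the simplest possible choices; they favour users with very few topics in common, which is one reason practical systems tend to weight by overlap or use several neighbours.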
RE: Dates and others
Hi guys - So I am getting happier with search, and just pushed the Lucene version live at http://www.theserverside.com (on the left bar) and http://www.theserverside.com/home/search/index.jsp

The only real item that I still want to tweak is getting recent results higher in the list. I was wondering if something like this could work (or if there is a better solution).

At index time, I have the date of the content. I could do some math where the higher the date (based on the time_t version or whatever), the larger the value passed to setBoost(). Or, for every month in the past, create a larger negative number for setBoost()... or something like that. Would something like this make sense?

Dion

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Sunday, November 23, 2003 3:52 PM
To: Lucene Users List
Subject: Re: Dates and others

On Saturday, November 22, 2003, at 06:33 PM, Dion Almaer wrote:

3. I have some fields such as title, owner, etc. as well as the content blob, which I index and use as the default search field. Is there an easy way to extend the QueryParser to merge it with a MultiTermQuery which can also search this metadata and give the fields certain weights? Or, if you go down this path, do you have to leave the QueryParser behind and build your own queries? Any best practices would be great.

And Ype said:

You can provide field weights at document indexing time (norms) and use a MultiTermQuery for searching multiple fields. At query time you can again use field weights. I don't know how the scoring of the MultiTermQuery is done; it might use the max. score over the fields of a document, or combine the scores in the fields of a document.

end Ype's reply cut and paste

I'm a little confused by this question and Ype's reply. MultiTermQuery is an abstract base class under Query, which is the parent of WildcardQuery and FuzzyQuery. What I think you're after is using MultiFieldQueryParser, but you want to weight the fields differently.

You can add the boosts at indexing time using Field.setBoost. Unfortunately, at the moment MultiFieldQueryParser is not very extensible. There are some open issues with its subclassability, but once those are resolved, subclassing MFQP and overriding getFieldQuery will do the trick, allowing you to boost at query time.

Making an educated guess at what you're doing with Lucene, Dion, I'd venture to say that boosting at indexing time is sufficient for your needs.

Erik
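Dion's idea of turning a content date into an index-time boost needs some mapping from age to a number. The sketch below is one possible mapping, assumed for illustration rather than taken from the thread: `RecencyBoost`, the exponential decay, and the half-life parameter are hypothetical choices. It returns 1.0 for brand-new content and halves the value for every half-life of age, so the result stays positive and could be handed to the setBoost call mentioned above.

```java
/** Maps the age of a document to a boost factor that decays with time. */
final class RecencyBoost {
    /**
     * Returns 1.0 for content of age zero and halves the value every
     * halfLifeDays days. Negative ages (dates in the future) count as zero.
     */
    static float boostFor(long ageDays, double halfLifeDays) {
        if (halfLifeDays <= 0) {
            throw new IllegalArgumentException("halfLifeDays must be positive");
        }
        long age = Math.max(0, ageDays);
        return (float) Math.pow(0.5, age / halfLifeDays);
    }

    public static void main(String[] args) {
        // Assumed half-life of 30 days; print the boost for a few ages.
        for (long age : new long[] {0, 30, 60, 365}) {
            System.out.println(age + " days -> " + boostFor(age, 30.0));
        }
    }
}
```

A multiplicative decay is used here instead of the "larger negative number" variant because a boost is a multiplier on the score; how aggressive the half-life should be depends on how quickly content goes stale and would need tuning against real queries. Since the boost is fixed at index time, older documents keep the value they were indexed with until they are re-indexed.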