bad link in mailing list archive?
When I load http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=2648 there are three replies listed at the bottom of the page, one by Otis Gospodnetic. The subject of his reply is "Concurency in Lucene". When I click on Otis' reply, my browser loads a post with a different subject; "Problems compiling the java source codes of lucene search engine". Just a heads up -- seems like something funky's going on :) cheers, Gerret - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Score
Pleasant, Tracy wrote:
> I tried using Boost but that did absolutely nothing.
>
> The documents I am using: Plain text, PDF Documents (I have two indexes)

I'm not sure what's causing your scores to be off -- unless, of course, your scores just look wrong to you but they're in fact just what you should be getting :)

One bug in my code was that, for an unrelated reason, terms in one field would never be matched. But since other fields contained the same term, the document was still being reported as a hit -- with a lower-than-expected score. Maybe you want to double-check that the content of each field is getting tokenized properly. When you have a term t in the title field that is unique to a particular document (i.e. not contained in any of the other fields of that document), do you still get a hit on the document when searching for t? Boost factors don't help, of course, if there's no hit in the first place.

> When you say you use different analyzers for different fields in your index, how would you accomplish that? When I create the index it has a parameter for analyzer.. unless you create different indexes, how do you use two different ones?

Use PerFieldAnalyzerWrapper: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

cheers
Gerret

> -Original Message-
> From: Gerret Apelt [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 24, 2003 3:25 PM
> To: Lucene Users List
> Subject: Re: Score
>
> Tracy -- it would help if you could give more detail on the types of documents, fields and analyzers you're using. Also what do you mean by "Multi Field Search"? I presume you're using the MultiFieldQueryParser to have query terms in a user-submitted query be searched for in each field in your index.
>
> If I am understanding your problem, then it might be the same one I had a few weeks ago -- highly relevant matches would not receive a high ranking. (This paragraph will apply to you only if you use more than just one Analyzer for the set of your fields). I had six fields in my index, most of which were populated with a standard analyzer. I used self-made Analyzers for two of the fields. This turned out to be my problem when using MultiFieldQueryParser: I told my MultiFieldQueryParser instance to use only the standard analyzer. Instead I discovered that I needed to make use of org.apache.lucene.analysis.PerFieldAnalyzerWrapper and feed that to the MultiFieldQueryParser. Unless you do this, your problem is what's described here: http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15.
>
> Most likely, if your scoring is off, you're "doing something wrong" in the way you use the Lucene API -- at least, that's what I've discovered to be the case when my ranking is off. If you're interested in the nitty-gritty of how scoring is done, check this FAQ entry: http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31
>
> cheers,
> Gerret
>
> Pleasant, Tracy wrote:
>> Hi,
>>
>> I'm using the Multi Field Search to search all the fields of my documents during the search. When it returns results the scores are numerically low - .06, .17, etc. I would think if I searched for "Dog" and there was a doc with "Dog" in the title and several times in the contents of a document that it would receive a score more like 1.0 or close to it.
>>
>> Is there a way that I can tweak the score? I tried using Boost but that did absolutely nothing.
Re: Score
Tracy -- it would help if you could give more detail on the types of documents, fields and analyzers you're using. Also what do you mean by "Multi Field Search"? I presume you're using the MultiFieldQueryParser to have query terms in a user-submitted query be searched for in each field in your index.

If I am understanding your problem, then it might be the same one I had a few weeks ago -- highly relevant matches would not receive a high ranking. (This paragraph will apply to you only if you use more than just one Analyzer for the set of your fields). I had six fields in my index, most of which were populated with a standard analyzer. I used self-made Analyzers for two of the fields. This turned out to be my problem when using MultiFieldQueryParser: I told my MultiFieldQueryParser instance to use only the standard analyzer. Instead I discovered that I needed to make use of org.apache.lucene.analysis.PerFieldAnalyzerWrapper and feed that to the MultiFieldQueryParser. Unless you do this, your problem is what's described here: http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15.

Most likely, if your scoring is off, you're "doing something wrong" in the way you use the Lucene API -- at least, that's what I've discovered to be the case when my ranking is off. If you're interested in the nitty-gritty of how scoring is done, check this FAQ entry: http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31

cheers,
Gerret

Pleasant, Tracy wrote:
> Hi,
>
> I'm using the Multi Field Search to search all the fields of my documents during the search. When it returns results the scores are numerically low - .06, .17, etc. I would think if I searched for "Dog" and there was a doc with "Dog" in the title and several times in the contents of a document that it would receive a score more like 1.0 or close to it.
>
> Is there a way that I can tweak the score? I tried using Boost but that did absolutely nothing.
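A minimal sketch of the analyzer mismatch described in this message, written in plain Java with no Lucene dependency. Everything in it is an assumption made for illustration: the class PerFieldAnalysisSketch, the two toy "analyzers" (standard and keyword), and the field names title and partNo are hypothetical and are not Lucene classes or API calls. It only shows why a field that was indexed with one analyzer may stop matching when the query text is run through a different one, which is the situation PerFieldAnalyzerWrapper is meant to address.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Toy model of per-field analysis; none of these names are Lucene classes. */
final class PerFieldAnalysisSketch {

    /** Stand-in for a standard analyzer: lowercase, split on non-alphanumerics. */
    static List<String> standard(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Stand-in for a self-made analyzer: keep the whole value as one exact token. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    /** Index-time tokens for one field value. */
    static Set<String> index(String value, Function<String, List<String>> analyzer) {
        return new HashSet<>(analyzer.apply(value));
    }

    /** True if every token produced from the query is among the indexed tokens. */
    static boolean matches(Set<String> indexed, String query,
                           Function<String, List<String>> analyzer) {
        return indexed.containsAll(analyzer.apply(query));
    }

    public static void main(String[] args) {
        // Hypothetical field-to-analyzer mapping, playing the role of a per-field wrapper.
        Map<String, Function<String, List<String>>> perField = new HashMap<>();
        perField.put("title", PerFieldAnalysisSketch::standard);
        perField.put("partNo", PerFieldAnalysisSketch::keyword);

        Set<String> indexedPartNo = index("AB-123", perField.get("partNo"));

        // Query analyzed with the standard analyzer only: tokens "ab" and "123", no hit.
        System.out.println(matches(indexedPartNo, "AB-123", PerFieldAnalysisSketch::standard));
        // Query analyzed with the analyzer that was used for this field: hit.
        System.out.println(matches(indexedPartNo, "AB-123", perField.get("partNo")));
    }
}
```

The first call prints false and the second prints true. In the toy model the document would still be found through other fields that contain the term, just without the contribution of the mismatched field, which is consistent with the lower-than-expected ranking described above.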
understanding IR topics on this list [was: Re: Vector Space Model in Lucene?]
Dror -- I just completed an introductory course in IR. I can recommend the textbook we used: "Managing Gigabytes: Compressing and Indexing Documents and Images". When I don't understand posts on this list I can typically look up the theory in that book, then come back to the list and have a better idea of what's going on. "Managing Gigabytes" appears to be getting good reviews from most readers, but I can't compare it to similar works as I haven't read any.

I've spent some time searching for websites that introduce advanced IR topics at a level that is less rigorous than academic papers, but I haven't really found anything I can recommend. Suggestions welcome :)

cheers,
Gerret

Dror Matalon wrote:
> Hi,
>
> I might be the only person on the list who's having a hard time following this discussion. Would one of you wise folks care to point me to a good "dummies", also known as an executive summary, resource about the theoretical background of all of this. I understand the basic premise of collecting the "words" and having pointers to documents and weights, but beyond that ...
>
> TIA,
> Dror
>
> On Fri, Nov 14, 2003 at 12:52:15PM -0500, Chong, Herb wrote:
>> i don't know of any open source search engine that incorporates interterm correlation. i have been looking into how to do this in Lucene and so far, it's not been promising. the indexing engine and file format needs to be changed. there are very few search engines that incorporate interterm correlation in any mathematically and linguistically rigorous manner. i designed a couple, but they were all research experiments.
>>
>> if you are familiar with the TREC automatic adhoc track? my experiments with the TREC-5 to TREC-7 questions produced about 0.05 to 0.10 improvement in average precision by proper use of interterm correlation. my project at the time was cancelled after TREC-7 and so there haven't been any new developments.
>>
>> Herb
>>
>> -Original Message-
>> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
>> Sent: Friday, November 14, 2003 12:39 PM
>> To: Lucene Users List
>> Subject: Re: Vector Space Model in Lucene?
>>
>> Herb Hmm... Are you perhaps familiar with some open system which doesn't? I'm curious because one of my projects (already using Lucene) could benefit from such feature. Right now I'm using a bastardized version of Markov chains, but it's more of a hack...
>>
>> --
>> Best regards,
>> Andrzej Bialecki
Re: fuzzy searches
Thomas Krämer wrote:
> Is there an overview of the structure of the index of lucene despite of the javadoc or any other fast access to understanding what happens inside lucene?

You mean something like this?: http://jakarta.apache.org/lucene/docs/fileformats.html

cheers,
Gerret
Re: term counts during indexing
Peter -- sorry for the delay; I just accidentally saw your reply in the mailing list archive -- must have overlooked it in my inbox :(

Peter Keegan wrote:
> As I understand it, the field text is being tokenized by the analyzer when IndexWriter.addDocument is called. At this point, the tokens are indexed and/or stored. Would it be possible for 'addDocument' to save and make the _actual_ counts of 'tokens stored' and 'tokens indexed' available in either the Document or IndexWriter object? I guess I may be turning this into a feature request :)

Lucene uses an inverted index, so the index is based on a mapping from "term" instances to the documents that contain them, as opposed to "document" instances mapping to a list of terms contained in that document (which is a fancy way of saying, "Lucene doesn't store documents; filesystems do that"). So in terms of the index representation, Lucene could not simply add a "term count" parameter to the entry for a given document, because (unless we're talking about a stored field) there is no table in which such an entry could exist. You would need to add a totally new data structure to the index, which can store document properties for un-stored fields. This sort of defeats the purpose of un-stored fields. It sounds wrong to have an un-stored field and store its term count.

Here's a proposal for a hack you could do: write an Analyzer wrapper that counts the tokens emitted by the next() method of the Analyzer's TokenStream, which is called by IndexWriter.addDocument(Document). When TokenStream.next() returns null, you can store the tokenCount that you have maintained in a file or database. This is fairly ugly but it has the advantage that it will work for non-stored fields.

I doubt there will be much support for extending Lucene to store field properties for unstored fields. Maybe there could be another field type called TERMCOUNTED_FIELD? Maybe some of the core coders could comment.

> Also, I can't find this method from the code snippet provided by Gerret (I'm using v1.2): String[] fieldTerms = doc.getValues(fieldName);

hmm, it must have been added later then: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html

cheers,
Gerret

> Thanks,
> Peter
>
> - Original Message -
> From: "Gerret Apelt" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, October 29, 2003 9:44 PM
> Subject: Re: term counts during indexing
>
> Peter Keegan wrote:
>> Is there a simple and efficient way of determining the number of tokens added to a document after adding each field ('Document.add'), as a result of the actions of the Analyzer, without having to re-parse the field
>
> Peter -- you can ask the Document instance.
>
>     // uses java.util.Enumeration and org.apache.lucene.document.Document / Field
>     Document doc = getDocumentInstanceFromSomewhere();
>     int termCount = 0;
>     Enumeration fields = doc.fields();
>     while (fields.hasMoreElements()) {
>         Field field = (Field) fields.nextElement();
>         String fieldName = field.name();
>         String[] fieldTerms = doc.getValues(fieldName);
>         termCount += fieldTerms.length;
>     }
>     System.out.println("The fields of the document together contain " + termCount + " terms.");
>
> Note that
> 1) I haven't tried to compile this code, so I'm not sure if it works
> 2) this will only work for those fields where field.isStored() == true. If the field isn't stored in the index, then you don't have a choice but to go back to the document.
>
> [not sure on the following, so please correct me if in error:] Remember that unStored fields are indexed, so you can query on them, but the field terms themselves are not stored in the index. Therefore you cannot count them by asking Lucene. A Lucene field instance also has no way to reference the source of the terms that are added to it. The field doesn't care where its terms came from. So if field.isStored() == false, then for that particular field Lucene cannot tell you how many terms are in it. You'll have to write your own code that analyzes the original data source in this case.
>
>> Alternatively, is there a way to determine the number of tokens added after adding the document to the index ('IndexWriter.addDocument')?
>
> Whether you want the termCount for a document before or after you add the document to the index doesn't matter, so the answer is "see above".
>
> cheers,
> Gerret
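A rough sketch of the counting-wrapper hack proposed in this message, in plain Java so that it runs without Lucene on the classpath. The names here are assumptions for illustration only: CountingSketch, TokenSource, Counting and whitespace are hypothetical, and TokenSource merely imitates the "next() returns null when the stream is exhausted" behaviour that the message attributes to TokenStream. Wiring something like this into a real Analyzer is left out because it depends on the Lucene version in use.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

/** Toy version of a token-counting wrapper; not Lucene API. */
final class CountingSketch {

    /** Hypothetical stand-in for a token stream: next() returns null at the end. */
    interface TokenSource {
        String next();
    }

    /** Passes tokens through unchanged while counting them. */
    static final class Counting implements TokenSource {
        private final TokenSource delegate;
        private int tokenCount = 0;
        private boolean exhausted = false;

        Counting(TokenSource delegate) {
            this.delegate = delegate;
        }

        @Override
        public String next() {
            String token = delegate.next();
            if (token == null) {
                // End of stream: this is where the count could be written
                // to a file or database, as suggested in the message.
                exhausted = true;
            } else {
                tokenCount++;
            }
            return token;
        }

        int tokenCount() {
            return tokenCount;
        }

        boolean isExhausted() {
            return exhausted;
        }
    }

    /** Whitespace tokenizer used only to drive the demo. */
    static TokenSource whitespace(String text) {
        String trimmed = text.trim();
        Iterator<String> it = trimmed.isEmpty()
                ? Collections.<String>emptyIterator()
                : Arrays.asList(trimmed.split("\\s+")).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        Counting counting = new Counting(whitespace("the quick brown fox"));
        while (counting.next() != null) {
            // An indexer would consume each token here.
        }
        System.out.println(counting.tokenCount()); // prints 4
    }
}
```

Because the wrapper only observes tokens as they pass through, the count is available whether or not the field is stored, which is the advantage the message points out.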
Re: term counts during indexing
Clarification: in the text quoted below I meant to say "... choice but to go back to the _original data source_".

cheers,
Gerret

Gerret Apelt wrote:
> Note that
> 1) I haven't tried to compile this code, so I'm not sure if it works
> 2) this will only work for those fields where field.isStored() == true. If the field isn't stored in the index, then you don't have a choice but to go back to the document.
Re: term counts during indexing
Peter Keegan wrote:
> Is there a simple and efficient way of determining the number of tokens added to a document after adding each field ('Document.add'), as a result of the actions of the Analyzer, without having to re-parse the field

Peter -- you can ask the Document instance.

    // uses java.util.Enumeration and org.apache.lucene.document.Document / Field
    Document doc = getDocumentInstanceFromSomewhere();
    int termCount = 0;
    Enumeration fields = doc.fields();
    while (fields.hasMoreElements()) {
        Field field = (Field) fields.nextElement();
        String fieldName = field.name();
        String[] fieldTerms = doc.getValues(fieldName);
        termCount += fieldTerms.length;
    }
    System.out.println("The fields of the document together contain " + termCount + " terms.");

Note that
1) I haven't tried to compile this code, so I'm not sure if it works
2) this will only work for those fields where field.isStored() == true. If the field isn't stored in the index, then you don't have a choice but to go back to the document.

[not sure on the following, so please correct me if in error:] Remember that unStored fields are indexed, so you can query on them, but the field terms themselves are not stored in the index. Therefore you cannot count them by asking Lucene. A Lucene field instance also has no way to reference the source of the terms that are added to it. The field doesn't care where its terms came from. So if field.isStored() == false, then for that particular field Lucene cannot tell you how many terms are in it. You'll have to write your own code that analyzes the original data source in this case.

> Alternatively, is there a way to determine the number of tokens added after adding the document to the index ('IndexWriter.addDocument')?

Whether you want the termCount for a document before or after you add the document to the index doesn't matter, so the answer is "see above".

cheers,
Gerret
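To illustrate the point about indexed-but-unstored fields, here is a toy inverted index in plain Java. It is only a sketch under loud assumptions: InvertedIndexSketch and its methods are hypothetical, the structure keeps nothing but document ids per term, and it is not meant to describe Lucene's actual index format. It shows that looking up documents by term is cheap, while asking "which terms does this document have" means walking every term's entry, and the original token count with repeats is not recoverable from this structure at all.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: term -> ids of documents containing it. Not Lucene's format. */
final class InvertedIndexSketch {

    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Tokenizes on whitespace and records the document id under each term. */
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Term lookup is a single map access. */
    SortedSet<Integer> docsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptySortedSet());
    }

    /** Per-document questions require scanning the entry of every term. */
    int distinctTermsIn(int docId) {
        int count = 0;
        for (SortedSet<Integer> docs : postings.values()) {
            if (docs.contains(docId)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "dog bites man");
        index.addDocument(2, "man bites dog again");

        System.out.println(index.docsContaining("dog")); // prints [1, 2]
        System.out.println(index.distinctTermsIn(2));    // prints 4
    }
}
```

Note that distinctTermsIn counts distinct terms only; a document containing "dog dog dog" would report 1, which is one more reason the message suggests going back to the original data source when the field is not stored.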