Re: document boost not showing up in Explanation
On Tuesday 28 December 2004 08:37, Erik Hatcher wrote:
> On Dec 27, 2004, at 9:54 PM, Vikas Gupta wrote:
> > I am using lucene-1.4.1.jar (with Nutch). For some reason, the effect of document boost is not showing up in the search results. Also, why is it not a part of the Explanation?
>
> It actually is part of it.
>
> > Below is the 'explanation' of a sample query, solar. I don't see the boost value (1.5514448) being used at all in the calculation of the document score, either in the 'explanation' below or in the quality of the search. How can I see the effect of document boost?

Document boost is not stored in the index as-is. A single normalization factor is stored per field, computed at indexing time from the field and document boosts as well as the length normalization factor (and perhaps other factors I'm forgetting at the moment). This also means that the explanation can only show the field normalization factor, as that is what is available from the index.

One reason that boosting does not necessarily show up in the quality of the search is that the byte encoding allows only 256 different values to be stored. The value stored in the index (called the norm) is the product of the document boost factor, the field boost factor, and the lengthNorm() of the field. For the search results to actually change because of the boost factors, this stored factor must change to another one of the 256 possible values.

The range of possible values stored in the index is roughly from 2x10^-9 to 7x10^9. See:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#setBoost(float)
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#encodeNorm(float)

The range of stored values (excluding the zero special case) is about 7x10^9 / 2x10^-9 = 3.5x10^18. The base-10 logarithm of that is about 18.5, so per factor of 10 there are about 255/18.5 = 13.8 encoded values.
So, the minimum boost factor that can change a stored norm is the ratio between adjacent encoded values, about 10^(1/13.8) = 1.18. Since the default lengthNorm() is an inverse square root of the field length, a field length should change by at least the square of that (roughly a factor of 1.4) to change the document score (assuming no hits in the changed field text). Finally, a change in document score only influences the document ordering in the search results when another document has a score within the range of the change.

Regards,
Paul Elschot.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
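The coarseness Paul describes can be illustrated without Lucene at all. The sketch below is not Lucene's actual encodeNorm() byte layout (which packs a mantissa and exponent into one byte); it is a simplified logarithmic quantizer over roughly the same value range, showing why a boost smaller than about 1.18 can land in the same stored bucket and therefore have no effect on scores.

```java
// Simplified sketch: quantize a positive norm into one of 255 logarithmically
// spaced buckets over roughly Lucene's norm range (2e-9 .. 7e9). This is NOT
// the real encodeNorm() encoding; it only illustrates the granularity argument.
public class NormQuantizer {
    static final double MIN = 2e-9, MAX = 7e9;

    /** Map a positive norm to a bucket 1..255; 0 is reserved for zero. */
    public static int encode(double norm) {
        if (norm <= 0) return 0;
        double fraction = Math.log(norm / MIN) / Math.log(MAX / MIN);
        return 1 + (int) (fraction * 254);
    }

    public static void main(String[] args) {
        // A boost of 1.1 lands in the same bucket as 1.0: no effect on scores.
        System.out.println(encode(1.0) == encode(1.1));  // true
        // A boost of 1.3 crosses a bucket boundary and can change scores.
        System.out.println(encode(1.0) != encode(1.3));  // true
    }
}
```

With 254 bucket steps spread over about 18.5 decades, adjacent buckets differ by a factor of about 10^(18.5/254) ≈ 1.18, matching the figure in the post.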
time of indexer
Hi everybody, and merry Christmas to all (especially those who, like me, are working today instead of staying with the family). I don't understand why my searches against the index give such bad results. I index 112 PHP files as plain text, on this machine: Pentium 4 2.4 GHz, 512 MB RAM, running Windows XP and Eclipse during indexing.

Total search time (Tiempo de búsqueda total): 80882 ms

The fields that I use are:

doc.add(Field.Keyword("filename", file.getCanonicalPath()));
doc.add(Field.UnStored("body", bodyText));
doc.add(Field.Text("titulo", title));

What am I doing wrong? Thanks.
Re: time of indexer
Download Luke; it makes life easy when you inspect the index, so you can actually look at what you've indexed, as opposed to what you may think you indexed.

Nader

Daniel Cortes wrote: (...)
Re: how often to optimize?
> Are the non-optimized indices causing you any problems (e.g. slow searches, a high number of open file handles)? If not, then you don't even need to optimize until those issues become... issues.

OK, I have changed the process to not call optimize() at all. So far so good. The number of files hovers between 10 and 40 during the indexing of 10,000 files, so Lucene seems to be doing some kind of self-maintenance to keep things in order. Is it right to say that optimize() is a totally optional operation? I got the impression from the IndexHTML example that it is the natural step to end an incremental update. Since it rewrites the whole index, it might be overkill for many applications to do daily.
Re: how often to optimize?
Correct. The self-maintenance you are referring to is Lucene's periodic segment merging. The frequency of that can be controlled through IndexWriter's mergeFactor.

Otis

--- aurora wrote: (...)
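The merging Otis describes can be illustrated with a toy simulation. This is not Lucene's merge code; it is a minimal model under assumed parameters (a new level-0 segment every minMergeDocs documents, and mergeFactor segments at a level merging into one segment at the next level) that shows why the live segment count oscillates in a small band instead of growing linearly with the number of documents.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of logarithmic segment merging (NOT Lucene's actual code):
// a new level-0 segment is flushed every minMergeDocs documents; whenever
// mergeFactor segments accumulate at a level, they merge into one segment
// at the next level up.
public class MergeModel {

    /** Largest number of live segments seen while indexing numDocs documents. */
    public static int maxConcurrentSegments(int numDocs, int minMergeDocs, int mergeFactor) {
        List<Integer> levels = new ArrayList<Integer>(); // segment count per level
        int max = 0;
        for (int flushed = minMergeDocs; flushed <= numDocs; flushed += minMergeDocs) {
            if (levels.isEmpty()) levels.add(0);
            levels.set(0, levels.get(0) + 1);          // flush one level-0 segment
            for (int i = 0; i < levels.size(); i++) {  // cascade merges upward
                if (levels.get(i) == mergeFactor) {
                    levels.set(i, 0);
                    if (i + 1 == levels.size()) levels.add(0);
                    levels.set(i + 1, levels.get(i + 1) + 1);
                }
            }
            int total = 0;
            for (int count : levels) total += count;
            max = Math.max(max, total);
        }
        return max;
    }

    public static void main(String[] args) {
        // 10,000 documents with factor 10 at each level never exceed 27 segments.
        System.out.println(maxConcurrentSegments(10000, 10, 10));  // 27
    }
}
```

With each segment made of several files in the old non-compound index format, a couple of dozen concurrent segments is consistent with the 10 to 40 files aurora observed.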
QueryParser, default operator
Hi, the following code

QueryParser qp = new QueryParser(itemContent, analyzer);
qp.setOperator(org.apache.lucene.queryParser.QueryParser.DEFAULT_OPERATOR_AND);
Query query = qp.parse(line, itemContent, analyzer);

doesn't produce the expected result: the query foo bar results in

itemContent:foo itemContent:bar

whereas foo AND bar results in

+itemContent:foo +itemContent:bar

If I understand the default operator correctly, then the first query should have been expanded to the same as the latter one, shouldn't it? Thanks a lot! Paul
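No answer appears in this thread, but one plausible cause (my assumption, not confirmed here) is that parse(line, itemContent, analyzer) resolves to QueryParser's static three-argument parse method, which builds its own parser and never sees the setOperator call made on the qp instance; Java happily lets you invoke a static method through an instance reference. The Lucene-free sketch below reproduces that pitfall with a made-up Parser class, not the real QueryParser.

```java
// Hypothetical stand-in for the pitfall (not Lucene's QueryParser): a static
// overload invoked through an instance silently ignores the instance's settings.
public class StaticOverloadPitfall {

    static class Parser {
        private String op = "OR";                 // default operator

        void setOperatorAnd() { op = "AND"; }

        // Instance method: honours the configured default operator.
        String parse(String query) {
            return String.join(" " + op + " ", query.split("\\s+"));
        }

        // Static overload: builds a fresh parser with default settings.
        static String parse(String query, String field, Object analyzer) {
            return new Parser().parse(query);
        }
    }

    public static void main(String[] args) {
        Parser qp = new Parser();
        qp.setOperatorAnd();
        // The three-argument call compiles even through the instance reference,
        // but it is the static method: the AND setting is lost.
        System.out.println(qp.parse("foo bar", "itemContent", null)); // foo OR bar
        System.out.println(qp.parse("foo bar"));                      // foo AND bar
    }
}
```

If this is indeed the cause, using the one-argument instance method qp.parse(line) should make the configured default operator take effect.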
how to get most frequent terms from hits
Hello, is it possible to get the most frequent terms from the hits? Thanks, miro
Re: Translation
Hi, I'm the author of the following Italian document, posted on this mailing list by Tun Lin on 3 December 2003. Sorry for the huge delay of this reply, but I've just come back here after a very long time. That document refers to Lucy, a Java application I developed using Lucene and other useful open source libraries. Lucy can index txt, html, pdf, doc, ppt, and xls documents written in English and/or Italian, with automatic language categorization and suitable stemming and filtering procedures. Unfortunately I haven't translated the documentation to English yet, but if someone needs help, like Tun Lin did, please feel free to write to my e-mail address. If there are enough requests, I will post something like a FAQ document on this mailing list. The latest release of Lucy (1.2) can be downloaded from this web page:

http://www.nsw2001.com/nsw2001/php/software.php

or directly from this URL:

http://www.nsw2001.com/kenshir/lucy/lucy1.2.exe

Cheers! :)
Gimmy Pegoraro

From: Tun Lin
Subject: Translation.
Date: Wed, 3 Dec 2003 09:42:02 +0800

Hi, can anyone translate this text for me? I cannot understand the instructions. Please help! Thanks.

===
LUCY 1.1 - readme.txt (last updated: 18/03/2003)

STRUCTURE (STRUTTURA)
Lucy 1.1
- Lucene 1.2
- HTMLParser 1.2
- PdfBox 0.5.6
- wvWare 0.7.2-3
- xlhtml 0.4.9
- antiword 0.33
- Xpdf 2.01
- Snowball 0.1
- NGramJ 01.12.11
- it.corila.lucy
  - IndexAll.java
  - SearchIndex.java
  - HTMLDocument.java
  - PDFDocument.java
  - ExternalParser.java
  - ItalianStemFilter.java
  - EnglishStemFilter.java
  - ApostropheFilter.java
  - IndexAnalyzer.java
  - SearchAnalyzer.java
  - LanguageCategorizer
    - NgramjCategorizer.java

DESCRIPTION (translated from the Italian DESCRIZIONE)
Lucy can index all files with extension txt, html, pdf, doc, ppt, or xls contained in a base folder and its subfolders. It allows searches from the DOS command line or through a web interface. It handles texts in Italian and English with language-specific lexical processing procedures. (...)
Asking Questions in a Search
Hi, is it possible to do something like this with Lucene: http://www.verity.com/products/response/index.html ? Thanks.
RE: Asking Questions in a Search
Verity acquired Native Minds; Verity Response appears to be that technology. It is not search technology at all; rather, it is a programmed question-answer script knowledge base. IMO, there are much better commercial solutions to this problem; e.g., see www.inquira.com, which integrates automated natural language search (i.e., finding specific answers to natural language questions within a text corpus) with question/answer scripting capabilities. I believe Lucene would be an excellent foundation for a system like this, but it would need to be extended with a natural language query parser / search-query generator and, if desired, some form of scripting knowledge base. Somebody may have gone down this path, but I'm not aware of it.

Chuck

-----Original Message-----
From: [EMAIL PROTECTED]
Sent: Tuesday, December 28, 2004 7:52 PM
To: lucene-user@jakarta.apache.org
Subject: Asking Questions in a Search

(...)
Re: Word co-occurrences counts
Thanks Doug, this appears to work like a charm.

Doug Cutting wrote:
> You could use a custom Similarity implementation for this query, where tf() is the identity function, idf() returns 1.0, etc., so that the final score is the occurrence count. You'll need to divide by Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get rid of the lengthNorm() and the field boost (if any).
>
> Much simpler would be to build a SpanNearQuery, call getSpans(), then loop, counting how many times Spans.next() returns true.
>
> Doug
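Doug's second suggestion amounts to counting, per document, how often the two terms occur near each other. The Lucene-free sketch below is my own illustration of that counting idea over a plain token array; it approximates an ordered SpanNearQuery with slop, not the exact Spans enumeration Lucene performs.

```java
// Count co-occurrences of two terms within a token window, roughly analogous
// to counting how many times Spans.next() returns true for an ordered
// SpanNearQuery with the given slop.
public class CooccurrenceCount {

    /** Number of (a, b) pairs where b occurs after a with at most slop tokens between. */
    public static int count(String[] tokens, String a, String b, int slop) {
        int matches = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(a)) continue;
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (tokens[j].equals(b)) matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] doc = {"solar", "power", "and", "solar", "heat", "power"};
        System.out.println(count(doc, "solar", "power", 0));  // 1
        System.out.println(count(doc, "solar", "power", 2));  // 2
    }
}
```

In real Lucene code the term positions would come from the index rather than a token array, so the counting loop runs over getSpans() instead of nested array scans.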