Re: document boost not showing up in Explanation

2004-12-28 Thread Paul Elschot
On Tuesday 28 December 2004 08:37, Erik Hatcher wrote:
 
 On Dec 27, 2004, at 9:54 PM, Vikas Gupta wrote:
  I am using lucene-1.4.1.jar (with Nutch). For some reason, the effect of
  document boost is not showing up in the search results. Also, why is it
  not a part of the Explanation?
 
 It actually is part of it
 
  Below is the 'explanation' of a sample query solar. I don't see the
  boost value (1.5514448) being used at all in the calculation of the
  document score - from the 'explanation' below and also from the
  quality of the search.
 
  How can I see the effect of document boost?
 
 Document boost is not stored in the index as-is.  A single 
 normalization factor is stored per-field and is computed at indexing 
 time using field and document boosts, as well as the length 
 normalization factor (and perhaps other factors I'm forgetting at the 
 moment?).

This also means that the explanation can only show a field normalisation
factor as it is available from the index.

One reason that boosting does not necessarily show up in the quality of
the search is that the byte encoding allows only 256 different values to
be stored.
The value stored in the index (called the norm) is the product of the
document boost factor, the field boost factor and the lengthNorm() of
the field.
For the search results to actually change because of the boost factors,
the stored factor must change to another one of the 256 possible values.

The range of possible values stored in the index is roughly from
7x10^9 to 2x10^-9 . See:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#setBoost(float)
and
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#encodeNorm(float)

The ratio between the largest and smallest stored values (excluding the
zero special case) is about 7x10^9 / 2x10^-9 = 3.5x10^18. The base-10
logarithm of that is about 18.5 .
Per factor 10 there are about 255/18.5 = 13.8 encoded values.
So, assuming the encoded values are evenly spaced on a logarithmic scale,
the minimum boost factor that should change a document score is about
10^(1/13.8) = 1.18 .
Since the default lengthNorm is the inverse square root of the field
length, a field length should change by at least the square of that
(roughly a factor 1.4) to change the document score (assuming no hits in
the changed field text).

Finally, a change in document score only influences the document
ordering in the search results when another document has a score
that is within the range of the change.

Regards,
Paul Elschot.
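
The estimate above can be reproduced with a few lines of plain Java. This is only a sketch of the arithmetic, not Lucene code: it assumes the 255 non-zero encoded values are evenly spaced on a logarithmic scale between the approximate limits quoted above, and takes the step between adjacent values as 10^(1/13.8). The class and method names are made up for illustration.

```java
// Sketch of the norm-quantisation estimate; NOT part of Lucene.
class NormStepEstimate {

    /**
     * Ratio between neighbouring encoded values, assuming the values are
     * evenly spaced on a log scale between min and max (an assumption).
     */
    static double stepRatio(double max, double min, int steps) {
        double decades = Math.log10(max / min);    // about 18.5 for 7e9 / 2e-9
        double valuesPerDecade = steps / decades;  // about 13.8
        return Math.pow(10.0, 1.0 / valuesPerDecade);
    }

    public static void main(String[] args) {
        double boostStep = stepRatio(7e9, 2e-9, 255);
        // The default lengthNorm is an inverse square root, so the field
        // length has to change by the square of the step.
        double lengthStep = boostStep * boostStep;
        System.out.printf("boost step ~ %.2f, field length step ~ %.2f%n",
                boostStep, lengthStep);
    }
}
```

Under these assumptions the boost step comes out at roughly 1.18 and the field length step at roughly 1.4. The real encoding is a small floating-point format, so the actual spacing varies somewhat from value to value.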


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



time of indexer

2004-12-28 Thread Daniel Cortes
Hi everybody, and merry Christmas to all (especially to those who, like
me, are working today instead of staying with the family).

I don't understand why searching my index gives such bad results.
I index 112 PHP files as plain text
on this machine:
Pentium 4 2.4 GHz, 512 MB RAM, running Windows XP and Eclipse during indexing
Total search time: 80882 ms
The fields that I use are:
doc.add(Field.Keyword(filename, file.getCanonicalPath()));
doc.add(Field.UnStored(body, bodyText));
doc.add(Field.Text(titulo, title));
What am I doing wrong?
Thanks


Re: time of indexer

2004-12-28 Thread Nader Henein
Download Luke. It makes life easy when you inspect the index, so you can 
actually look at what you've indexed, as opposed to what you may think 
you indexed.

Nader
Daniel Cortes wrote:
Hi everybody, and merry Christmas to all (especially to those who, like
me, are working today instead of staying with the family).

I don't understand why searching my index gives such bad results.
I index 112 PHP files as plain text
on this machine:
Pentium 4 2.4 GHz, 512 MB RAM, running Windows XP and Eclipse during indexing
Total search time: 80882 ms
The fields that I use are:
doc.add(Field.Keyword(filename, file.getCanonicalPath()));
doc.add(Field.UnStored(body, bodyText));
doc.add(Field.Text(titulo, title));
What am I doing wrong?
Thanks


Re: how often to optimize?

2004-12-28 Thread aurora
Are not optimized indices causing you any problems (e.g. slow searches,
high number of open file handles)?  If no, then you don't even need to
optimize until those issues become... issues.

OK, I have changed the process to not call optimize() at all. So far so
good. The number of files hovers between 10 and 40 during the indexing of
10,000 files. It seems Lucene is doing some kind of self-maintenance to
keep things in order.

Is it right to say optimize() is a totally optional operation? I probably
got the impression from the IndexHTML example that it is a natural step
to end an incremental update. Since it rewrites the whole index, it might
be overkill for many applications to do daily.




Re: how often to optimize?

2004-12-28 Thread Otis Gospodnetic
Correct.
The self-maintenance you are referring to is Lucene's periodic segment
merging.  The frequency of that can be controlled through IndexWriter's
mergeFactor.

Otis
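
The effect of periodic segment merging on the number of segments can be illustrated with a simplified model. The sketch below is not Lucene's implementation: it treats every added document as a one-document segment and merges whenever the last mergeFactor segments have the same size. MergeModel and segmentsAfter are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of merge-factor driven segment merging; NOT Lucene code.
class MergeModel {

    /** Returns the segment sizes left after adding docCount documents one by one. */
    static List<Integer> segmentsAfter(int docCount, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < docCount; i++) {
            segments.add(1); // each new document starts as its own segment
            // Cascade: merge while the last mergeFactor segments are equally large.
            while (segments.size() >= mergeFactor) {
                int n = segments.size();
                int last = segments.get(n - 1);
                boolean sameSize = true;
                for (int j = n - mergeFactor; j < n; j++) {
                    int size = segments.get(j);
                    if (size != last) {
                        sameSize = false;
                        break;
                    }
                }
                if (!sameSize) {
                    break;
                }
                for (int j = 0; j < mergeFactor; j++) {
                    segments.remove(segments.size() - 1);
                }
                segments.add(last * mergeFactor);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(segmentsAfter(9999, 10).size() + " segments after 9999 docs");
        System.out.println(segmentsAfter(10000, 10) + " after 10000 docs");
    }
}
```

In this model, with a merge factor of 10, the segment count never exceeds 36 while indexing 10,000 documents and collapses to a single segment at exactly 10,000. The number of files on disk depends on the index format, so this only shows why the count stays bounded without optimize().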

--- aurora [EMAIL PROTECTED] wrote:

  Are not optimized indices causing you any problems (e.g. slow searches,
  high number of open file handles)?  If no, then you don't even need to
  optimize until those issues become... issues.

 OK, I have changed the process to not call optimize() at all. So far so
 good. The number of files hovers between 10 and 40 during the indexing of
 10,000 files. It seems Lucene is doing some kind of self-maintenance to
 keep things in order.

 Is it right to say optimize() is a totally optional operation? I probably
 got the impression from the IndexHTML example that it is a natural step
 to end an incremental update. Since it rewrites the whole index, it might
 be overkill for many applications to do daily.
 





QueryParser, default operator

2004-12-28 Thread Paul
Hi,
the following code
  QueryParser qp = new QueryParser(itemContent, analyzer);
  qp.setOperator(org.apache.lucene.queryParser.QueryParser.DEFAULT_OPERATOR_AND);
  Query query = qp.parse(line, itemContent, analyzer);
doesn't produce the expected result, because the query foo bar results in:
  itemContent:foo itemContent:bar
whereas foo AND bar results in
  +itemContent:foo +itemContent:bar

If I understand the default operator correctly, then the first query
should have been expanded to the same form as the latter one, shouldn't it?

thanks a lot!
Paul
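
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the three-argument parse(line, itemContent, analyzer) is a static convenience method in Lucene 1.4 that builds its own parser with default settings, so the operator set on qp would not be used; the one-argument instance method qp.parse(line) would honour it. The toy class below is not Lucene's QueryParser. It only mimics that static/instance split to show the effect, and all names in it are hypothetical.

```java
// Toy stand-in for a query parser; NOT Lucene's QueryParser.
class ToyQueryParser {
    static final int OR = 0;
    static final int AND = 1;

    private final String field;
    private int operator = OR; // default operator, as in the thread

    ToyQueryParser(String field) {
        this.field = field;
    }

    void setOperator(int operator) {
        this.operator = operator;
    }

    /** Instance parse: honours the operator configured on this parser. */
    String parse(String line) {
        StringBuilder sb = new StringBuilder();
        for (String term : line.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (operator == AND) {
                sb.append('+');
            }
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }

    /** Static convenience parse: creates a fresh parser with default settings. */
    static String parse(String line, String field) {
        return new ToyQueryParser(field).parse(line);
    }

    public static void main(String[] args) {
        ToyQueryParser qp = new ToyQueryParser("itemContent");
        qp.setOperator(AND);
        System.out.println(ToyQueryParser.parse("foo bar", "itemContent")); // static: setting ignored
        System.out.println(qp.parse("foo bar"));                            // instance: setting used
    }
}
```

If the same holds for the real parser, replacing the last line of the posted code with Query query = qp.parse(line); should give the expected +itemContent:foo +itemContent:bar.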




how to get most frequented terms from hits

2004-12-28 Thread Miro Max
Hello,

Is it possible to get the most frequent terms from the
hits?

thx

miro







Re: Translation

2004-12-28 Thread Gimmy Pegoraro
Hi, I'm the author of the following Italian document, which Tun Lin posted
to this mailing list on 3 December 2003.

Sorry for the huge delay in replying, but I've only just come back here
after a very long time.
That document refers to Lucy, a Java application I developed using Lucene
and other useful open source libraries.

Lucy can index txt, html, pdf, doc, ppt and xls documents written in English
and/or Italian, with automatic language categorization and suitable stemming
and filtering procedures.
Unfortunately I haven't translated the documentation into English yet, but if
someone needs help, as Tun Lin did, please feel free to write to my e-mail
address.
If there are enough requests, I will post something like a FAQ document on
this mailing list.

The latest release of Lucy (1.2) can be downloaded from this webpage:
http://www.nsw2001.com/nsw2001/php/software.php
or directly from this URL:
http://www.nsw2001.com/kenshir/lucy/lucy1.2.exe

Cheers! :)
Gimmy Pegoraro





From: Tun Lin [EMAIL PROTECTED]
Subject: Translation.
Date: Wed, 3 Dec 2003 09:42:02 +0800

 Hi,
 
 Can anyone translate this text for me? I cannot understand the
 instructions.
 Please help!
 
 Thanks.
 
 ===
 LUCY 1.1   readme.txt   Last updated: 18/03/2003
 ===

 STRUCTURE

 Lucy 1.1  - Lucene 1.2
           - HTMLParser 1.2
           - PdfBox 0.5.6
           - wvWare 0.7.2-3
           - xlhtml 0.4.9
           - antiword 0.33
           - Xpdf 2.01
           - Snowball 0.1
           - NGramJ 01.12.11
           - it.corila.lucy   - IndexAll.java
                              - SearchIndex.java
                              - HTMLDocument.java
                              - PDFDocument.java
                              - ExternalParser.java
                              - ItalianStemFilter.java
                              - EnglishStemFilter.java
                              - ApostropheFilter.java
                              - IndexAnalyzer.java
                              - SearchAnalyzer.java
                              - LanguageCategorizer
                              - NgramjCategorizer.java

 DESCRIPTION

 Lucy can index all files with the extensions txt, html, pdf, doc, ppt
 and xls contained in a base folder and its subfolders. It supports
 searches from the DOS command line or through a web interface. It
 handles texts in Italian and English with language-specific lexical
 processing procedures.
 
 (...)



Asking Questions in a Search

2004-12-28 Thread aneesha
Hi

Is it possible to do something like this with Lucene:
http://www.verity.com/products/response/index.html

Thanks





RE: Asking Questions in a Search

2004-12-28 Thread Chuck Williams
Verity acquired Native Minds -- Verity Response appears to be that
technology.  It is not search technology at all -- rather it is a
programmed question-answer script knowledge base.  IMO, there are much
better commercial solutions to this problem; e.g., see www.inquira.com,
which integrates automated natural language search (i.e., finding
specific answers to natural language questions from within a text
corpus) with question/answer scripting capabilities.

I believe Lucene would be an excellent foundation for a system like
this, but it would need to be extended with a natural language query
parser / search-query generator and, if desired, some form of scripting
knowledge base.  Somebody may have gone down this path, but I'm not
aware of it.

Chuck

   -Original Message-
   From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 28, 2004 7:52 PM
   To: lucene-user@jakarta.apache.org
   Subject: Asking Questions in a Search
   
   Hi
   
   Is it possible to do something like this with lucene:
   http://www.verity.com/products/response/index.html
   
   Thanks
   
   
  



Re: Word co-occurrences counts

2004-12-28 Thread Andrew Cunningham
Thanks Doug,

This appears to work like a charm.

Doug Cutting wrote:
Doug Cutting wrote:
You could use a custom Similarity implementation for this query, 
where tf() is the identity function, idf() returns 1.0, etc., so that 
the final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to 
get rid of the lengthNorm() and field boost (if any).

Much simpler would be to build a SpanNearQuery, call getSpans(), then 
loop, counting how many times Spans.next() returns true.

Doug
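
The counting idea behind the span-based suggestion can be sketched without Lucene. The plain-Java example below is an illustration under stated assumptions, not Lucene's Spans implementation: it works on an already tokenized document and counts every ordered pair in which the second term follows the first within a given number of positions. Lucene's span enumeration may report matches differently, and the class and method names here are hypothetical.

```java
// Plain-Java illustration of counting term co-occurrences; NOT Lucene's Spans API.
class CooccurrenceCount {

    /**
     * Counts pairs (i, j) where tokens[i] equals first, tokens[j] equals
     * second, and second follows first within `window` positions.
     */
    static int count(String[] tokens, String first, String second, int window) {
        int count = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            int end = Math.min(tokens.length - 1, i + window);
            for (int j = i + 1; j <= end; j++) {
                if (tokens[j].equals(second)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] tokens = "solar panel and solar power panel".split(" ");
        System.out.println(count(tokens, "solar", "panel", 2));
    }
}
```

With the real API, the loop over Spans.next() would play the role of the inner counting loop here.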