I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches). The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least
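The ratio idea above can be sketched in plain Java (this is not Lucene code; the raw scores are made up for illustration):

```java
public class NormalizeScores {
    // Scale each raw score by the top score of the same search, turning
    // absolute values into within-search ratios.
    static float[] normalizeToMax(float[] scores) {
        float max = 0f;
        for (float s : scores) {
            if (s > max) max = s;
        }
        if (max == 0f) return scores.clone();
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / max;  // ratio to the best hit
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical raw scores from one search
        float[] norm = normalizeToMax(new float[] {2.4f, 1.2f, 0.6f});
        System.out.println(norm[0] + " " + norm[1] + " " + norm[2]);
    }
}
```

Note that a 1.0 from one search and a 1.0 from another still do not mean the same thing; the ratios are only comparable within a single result list.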
Hi Michael,
I'd suggest first using the explain() mechanism to figure out what's
going on. Besides lengthNorm(), another factor that is likely skewing
your results in my experience is idf(), which Lucene typically makes
very large by squaring the intrinsic value. I've found it helpful to
I think that depends on what you want to do. The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML
Like any other field, A.I. is only elusive until you master it. There
are plenty of companies using A.I. techniques in various IR applications
successfully. LSI in particular has been around a long time and is well
understood.
Chuck
-Original Message-
From: jian chen
Can this [boost the full website] be achieved in Lucene's search based
on the search word? If so, please explain, with examples.
with regards
karthik
-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 11
Karthik,
I don't think the boost in your example does much since you are using an
AND query, i.e. all hits will have to contain both vendor:nike and
contents:shoes. If you used an OR, then the boost would put nike
products above (non-nike) shoes, unless there was some other factor that
causes
If I understand what you are trying to do, you don't have a problem.
You can OR to your heart's content and Lucene will properly create the
union of the results. I.e., there will be no duplicates.
There is built-in support for this kind of thing. See
MultiFieldQueryParser, and for better
I use it and have yet to have a problem with it. It uses the Xerces API
so you parse and access html files just like xml files. Very cool,
Chuck
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 04, 2005 2:05 PM
To: Lucene Users List
Verity acquired Native Minds -- Verity Response appears to be that
technology. It is not search technology at all -- rather it is a
programmed question-answer script knowledge base. IMO, there are much
better commercial solutions to this problem; e.g., see www.inquira.com,
which integrates
I think you are confusing lengthNorm and the overall normalization of the
score. For overall normalization (prior to a final forced normalization in
Hits), Lucene uses the formula you cite, except that it never sums tf_d*idf_t,
using instead tf_q*idf_t again, because the former is
All of your Document.add's need to be doc.add's. You are adding the
field to the document, not the class.
Chuck
-Original Message-
From: Jim Lynch [mailto:[EMAIL PROTECTED]
Sent: Friday, December 24, 2004 8:30 AM
To: Lucene Users List
Subject: I though I understood, but
: Wednesday, December 22, 2004 11:59 PM
To: lucene-user@jakarta.apache.org
Subject: Re: Relevance percentage
On Thursday 23 December 2004 08:13, Gururaja H wrote:
Hi Chuck Williams,
Thanks much for the reply.
If your queries are all BooleanQuery's of
TermQuery's
Depending on what you are doing, there are some problems with
MultiSearcher. See
http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 for a
description of the issues and possible patch(es) to fix.
Chuck
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent:
The coord() value is not saved anywhere so you would need to recompute
it. You could either call explain() and parse the result string, or
better, look at explain() and implement what it does more efficiently
just for coord(). If your queries are all BooleanQuery's of
TermQuery's, then this is
Chuck
-Original Message-
From: Gururaja H [mailto:[EMAIL PROTECTED]
Sent: Sunday, December 19, 2004 10:10 PM
To: Lucene Users List
Subject: RE: Relevance and ranking ...
Chuck Williams,
Thanks for the reply. Source code and Output are below.
Please
This is not the official recommendation, but I'd suggest you at least
consider: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674
If you're not using Java 1.5 and you decide you want to use it, you'd
need to take out those dependencies. If you improve it, please share.
Chuck
The coord is the fraction of clauses matched in a BooleanQuery, so with
your example of a 5-word BooleanQuery, the coord factors should be .4,
.8, .8, 1.0 respectively for doc1, doc2, doc3 and doc4.
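The coord computation described here is just overlap divided by the number of query clauses; a minimal sketch (the matched-clause counts 2, 4, 4, 5 are the ones implied by the factors above):

```java
public class CoordDemo {
    // coord() as described: the fraction of BooleanQuery clauses a
    // document matches.
    static float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    public static void main(String[] args) {
        // 5-word BooleanQuery; docs 1-4 match 2, 4, 4, and 5 clauses
        int[] matched = {2, 4, 4, 5};
        for (int m : matched) {
            System.out.println(coord(m, 5));  // .4, .8, .8, 1.0
        }
    }
}
```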
One big issue you've got here is lengthNorm. Doc2 is 1/10 the size of
doc4, so its lengthNorm is
Another issue will likely be the tf() and idf() computations. I have a
similar desired relevance ranking and was not getting what I wanted due
to the idf() term dominating the score. Lucene squares the contribution
of this term, which is not considered best practice in IR. To address
these
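To see why the squared idf term can dominate, here is a small numeric sketch in plain Java using the classic idf formula; the document frequencies and collection size are hypothetical:

```java
public class IdfSquaring {
    // idf in the usual form: 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 100000;
        double rare = idf(10, numDocs);       // a rare term
        double common = idf(10000, numDocs);  // a common term
        // With idf applied once, the rare term outweighs the common one
        // by rare/common; because idf enters both the query weight and
        // the term weight, the effective advantage is (rare/common)^2.
        System.out.println("once: " + (rare / common)
                + "  squared: " + (rare / common) * (rare / common));
    }
}
```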
That looks right to me, assuming you have done an optimize. All of your
index segments are merged into the one .cfs file (which is large,
right?). Try searching -- it should work.
Chuck
-Original Message-
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16,
Karthik,
RangeQuery expands into a BooleanQuery containing all of the terms in
the index that fall within the range. By default, BooleanQuery's can
have at most 1,024 terms. So, if your index has more than 1,024
different prices that fall within your range then you will hit this
exception.
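A plain-Java illustration of why the limit is hit (the class and the zero-padded price terms are made up; the real limit can be raised with BooleanQuery.setMaxClauseCount):

```java
import java.util.TreeSet;

public class RangeExpansion {
    static final int MAX_CLAUSES = 1024;  // BooleanQuery's default limit

    // A RangeQuery rewrites to one clause per distinct indexed term that
    // falls inside the range, so just count those terms.
    static int clauseCount(TreeSet<String> indexedTerms, String lo, String hi) {
        int n = indexedTerms.subSet(lo, true, hi, true).size();
        if (n > MAX_CLAUSES) {
            throw new RuntimeException("TooManyClauses: " + n + " terms in range");
        }
        return n;
    }

    public static void main(String[] args) {
        TreeSet<String> prices = new TreeSet<String>();
        for (int cents = 0; cents < 2000; cents++) {
            prices.add(String.format("%06d", cents));  // zero-padded price terms
        }
        try {
            clauseCount(prices, "000100", "001500");  // 1401 distinct terms
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```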
Errata:
b. [$2 to 4]
Chuck
-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 9:58 PM
To: Lucene Users List
Subject: RE: NUMERIC RANGE BOOLEAN
Karthik,
RangeQuery expands into a BooleanQuery containing all
I'll try to address all the comments here.
The normalization I proposed a while back on lucene-dev is specified.
Its properties can be analyzed, so there is no reason to guess about
them.
Re. Hoss's example and analysis, yes, I believe it can be demonstrated
that the proposed normalization would
to compute in incremental indexing because, when one document is added, the
idf of each term changes. But dropping it is not a good choice.
What is the role of norm_d_t ?
Nhan.
--- Chuck Williams [EMAIL PROTECTED] wrote:
Nhan,
Re. your two differences:
1 is not a difference. Norm_d and Norm_q are both independent of t, so summing
over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the
summation, so it doesn't matter if the sum is over just the numerator or over
the entire fraction, the
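The claim that a factor independent of t can be moved outside the sum is easy to check numerically (the norm values and per-term scores below are arbitrary):

```java
public class NormFactoring {
    public static void main(String[] args) {
        double normD = 0.5, normQ = 0.25;       // arbitrary Norm_d, Norm_q
        double[] termScores = {1.0, 2.0, 4.0};  // arbitrary per-term values

        double inside = 0.0;   // sum_t (Norm_d * Norm_q * score_t)
        for (double s : termScores) inside += normD * normQ * s;

        double outside = 0.0;  // Norm_d * Norm_q * sum_t score_t
        for (double s : termScores) outside += s;
        outside *= normD * normQ;

        System.out.println(inside == outside);  // the two forms agree
    }
}
```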
You could support only terms with no operators at all, which will work
in most search engines (except those that require combining operators).
Using just terms and phrases embedded in quotes is pretty universal.
After that, you might want to add +/- required/prohibited restrictions,
which many engines
There is an easier way. You should use a custom Similarity, which
allows you to define your own coord() method. Look at DefaultSimilarity
(which specializes Similarity).
I'd suggest analyzing your scores first with explain() to decide what
you really want to tweak. Just a guess, but your issue
Lucene contains a complete set of Boolean query operators, and it uses
the vector space model to determine scores for relevance ranking. It's
fast. It works.
Chuck
-Original Message-
From: John Wang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 08, 2004 7:13 PM
To:
Since it's untokenized, are you searching with the exact string stored
in the field?
Chuck
-Original Message-
From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 07, 2004 3:29 PM
To: 'Lucene Users List'; 'Chris Fraschetti'
Subject: RE: Sorting in Lucene
-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 07, 2004 4:04 PM
To: Lucene Users List
Subject: RE: Sorting in Lucene
Since it's untokenized, are you searching with the exact string
stored
in the field?
Chuck
I'm wondering what values of mergeFactor, minMergeDocs and maxMergeDocs
people have found to yield the best performance for different
configurations. Is there a repository of this information anywhere?
I've got about 30k documents and have 3 indexing scenarios:
1. Full indexing and
If you want this to be efficient in your application, I'd suggest
integrating at a lower level. E.g., take a look at TermScorer.explain()
to see how it determines whether or not a term matches in a field of a
document.
Another approach might be to specialize BooleanQuery to keep track of
which
Try the explain() capability to see what factors are influencing the
order of your results. Probably these other factors are overwhelming
your boost. I had similar problems and resolved them by tweaking these
other contributions, especially idf. You can do that in a custom
Similarity.
Chuck
doc.add(Field.Keyword("title", title));
doc.add(Field.Keyword("keywords", keywords));
doc.add(Field.Keyword("type", type));
writer.addDocument(doc);
- Original Message -
From: Chuck Williams [EMAIL PROTECTED]
To: Lucene Users List [EMAIL
Does keyIter return the keys in sorted order? This should reduce seeks,
especially if the keys are dense.
Also, you should be able to localReader.delete(term) instead of
iterating over the docs (of which I presume there is only one doc since
keys are unique). This won't improve performance as
Are you sure you have a performance problem with
TermInfosReader.get(Term)? It looks to me like it scans sequentially
only within a small buffer window (of size
SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
See TermInfosReader.getIndexOffset(Term).
Chuck
2004 12:07:05 +0100, Paul Elschot
[EMAIL PROTECTED] wrote:
On Friday 12 November 2004 22:56, Chuck Williams wrote:
I had a similar need and wrote MaxDisjunctionQuery and
MaxDisjunctionScorer. Unfortunately these are not available as a patch,
but I've included
A good way to do this is to add a keyword field with whatever unique id
you have for the document. Then you can delete the term containing a
unique id to delete the document from the index (look at
IndexReader.delete(Term)). You can look at the demo class IndexHTML to
see how it does incremental
Lucene does support stemming, but that is not what your example requires
(stemming equates roaming, roam, roamed, etc.). For stemming,
look at PorterStemFilter or better, the Snowball stemmers in the
sandbox. For your similar word list, I think you are looking for the
class FuzzyTermEnum. This
If you are going to compare scores across multiple indices, I'd suggest
considering one of the patches here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31841
Chuck
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, November 22, 2004 6:30 AM
It sounds like you need to pad your numbers with leading zeroes, i.e.
use the same type of encoding as is required by RangeQuery's. If you
query with 05 instead of 5 do you get what you expect? If all your
document id's are fixed length, then string comparison will be
isomorphic to integer
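A quick sketch of the padding point in plain Java (the width of 2 is arbitrary; use whatever fixed width covers your largest id):

```java
public class PadIds {
    // Left-pad with zeroes so lexicographic order matches numeric order.
    static String pad(int id, int width) {
        return String.format("%0" + width + "d", id);
    }

    public static void main(String[] args) {
        // Unpadded, "5" sorts after "10" because '5' > '1'
        System.out.println("5".compareTo("10") > 0);
        // Padded to a fixed width, string order agrees with integer order
        System.out.println(pad(5, 2).compareTo(pad(10, 2)) < 0);  // "05" < "10"
    }
}
```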
I do most of these same things and made these relevant design decisions:
1. Use a combination of query expansion to search across multiple
fields and field concatenation to create document fields that combine
separate object fields. I use multiple fields only when it is important
to weight them
Take a look at this:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31841
Not my initial patch, but the latest patch from Wolf Siberski. I
haven't used it yet, but it looks like what you are looking for, and
something I want to use too.
Chuck
-Original Message-
From: Ken
My Lucene application includes multi-faceted navigation that does a more
complex version of the below. I've got 5 different taxonomies into
which every indexed item is classified. The largest of the taxonomies
has over 15,000 entries while the other 4 are much smaller. For every
search query, I
I've done some customization of scoring/ranking and plan to do more. A
good place to start is with your own Similarity, extending Lucene's
DefaultSimilarity. Like you, I found the default length normalization
to not work well with my dataset. I separately weight each indexed
field according to
didn't need that functionality (since I'm generating the
multi-field expansions for which max is a much better scoring choice
than sum).
Chuck
Included message:
-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED]
Sent: Monday, October 11, 2004 9:55 PM
To: [EMAIL PROTECTED]
Sent: Friday, November 05, 2004 10:00 AM
To: Lucene Users List
Subject: Re: Is there an easy way to have indexing ignore a CVS
subdirectory in the index directory?
On Friday 05 November 2004 18:03, Chuck Williams wrote:
The Lucene index is not in CVS -- neither
tested this (I used a file, not a directory) for Lucene in Action. What
error are you getting? I know there is a -I CVS option for ignoring
files; perhaps it works with directories, too.
Otis
--- Chuck Williams [EMAIL PROTECTED] wrote:
I have a Tomcat web module
Yes, by one or multiple criteria.
Chuck
-Original Message-
From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 04, 2004 6:21 PM
To: 'Lucene Users List'
Subject: Sorting in Lucene.
Hi All,
Does Lucene support sorting on the search
Can you please point me to some articles or FAQ about Sorting in
Lucene?
Thanks a lot for your reply.
Thanks,
Ramon
-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 04, 2004 9:44 PM
To: Lucene Users List
Subject
Looks like you produced a PhraseQuery rather than a BooleanQuery. You
want
+GAME:(doom3 3 doom)
Chuck
-Original Message-
From: Abhay Saswade [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 26, 2004 10:22 AM
To: [EMAIL PROTECTED]
Subject: Aliasing problem
Hi,
Karthik,
It is all spelled out in a Lucene HowTo here:
http://wiki.apache.org/jakarta-lucene/SearchNumericalFields
Have fun with it,
Chuck
-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 12:15 AM
To: Lucene Users List;
Range queries use a lexicographic (dictionary) order. So, assuming all
your values are positive, you need to ensure that the integer part of
each number has a fixed number of digits (pad with leading 0's). The
fractional part should be fine, although 1.0 will follow 1. If you have
negative
You haven't provided enough information for anybody to help. Have you added indexed
Field's to your document? If not, there is nothing to search. I don't think you are
looking for a parameter to the IndexWriter constructor. I expect the advice from
Aviran is best. You should read and
I had this same problem a while back. It should be resolved if you move
the writer = new IndexWriter(...) until after the reader.close(). I.e.,
complete all the deletions and close the reader before creating the
writer.
Chuck
-Original Message-
From: MATL (Mats Lindberg)
, 2004 11:22 AM
To: [EMAIL PROTECTED]
Subject: RE: Filtering Results?
Thanks Chuck.
Meanwhile searching on net and found this link
http://wiki.apache.org/jakarta-lucene/SearchNumericalFields
Thanks again
From: Chuck Williams [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL