RE: Limiting Hits with a score threshold

2005-02-14 Thread Chuck Williams
I would not recommend doing this because absolute score values in Lucene are not meaningful (e.g., scores are not directly comparable across searches). The ratio of a score to the highest score returned is meaningful, but there is no absolute calibration for the highest score returned, at least
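To make the ratio idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name, method name, and sample scores are hypothetical and only illustrate dividing each raw score by the highest score of the same search.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Lucene: rescales raw scores relative to the best hit. */
final class RelativeScores {
    /** Returns rawScores[i] / max(rawScores); empty or all-zero input yields zeros. */
    static double[] relativeToTop(double[] rawScores) {
        double top = 0.0;
        for (double s : rawScores) {
            top = Math.max(top, s);
        }
        double[] out = new double[rawScores.length];
        if (top == 0.0) {
            return out; // nothing to scale against
        }
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] / top;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only.
        System.out.println(Arrays.toString(relativeToTop(new double[] {2.0, 1.0, 0.5})));
    }
}
```

The ratio only says how a hit compares with the best hit of the same search. It says nothing about whether the best hit is itself a good match, which is the missing calibration the message describes.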

RE: Similarity coord,lengthNorm

2005-02-07 Thread Chuck Williams
Hi Michael, I'd suggest first using the explain() mechanism to figure out what's going on. Besides lengthNorm(), another factor that is likely skewing your results in my experience is idf(), which Lucene typically makes very large by squaring the intrinsic value. I've found it helpful to

RE: which HTML parser is better?

2005-02-01 Thread Chuck Williams
I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML

RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Chuck Williams
Like any other field, A.I. is only elusive until you master it. There are plenty of companies using A.I. techniques in various IR applications successfully. LSI in particular has been around a long time and is well understood. Chuck -Original Message- From: jian chen

RE: QUERYPARSIN BOOSTING

2005-01-12 Thread Chuck Williams
. Can This [ boost the Full WEBSITE ] be achieved in Lucene's search based on searchword If So Please Explain /examples ???. with regards karthik -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 11

RE: QUERYPARSIN BOOSTING

2005-01-11 Thread Chuck Williams
Karthik, I don't think the boost in your example does much since you are using an AND query, i.e. all hits will have to contain both vendor:nike and contents:shoes. If you used an OR, then the boost would put nike products above (non-nike) shoes, unless there was some other factor that causes

RE: SQL Distinct sintax in Lucen

2005-01-11 Thread Chuck Williams
If I understand what you are trying to do, you don't have a problem. You can OR to your heart's content and Lucene will properly create the union of the results. I.e., there will be no duplicates. There is built-in support for this kind of thing. See MultiFieldQueryParser, and for better

RE: Parsing issue

2005-01-04 Thread Chuck Williams
I use it and have yet to have a problem with it. It uses the Xerces API so you parse and access html files just like xml files. Very cool, Chuck -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 04, 2005 2:05 PM To: Lucene Users List

RE: Asking Questions in a Search

2004-12-28 Thread Chuck Williams
Verity acquired Native Minds -- Verity Response appears to be that technology. It is not search technology at all -- rather it is a programmed question-answer script knowledge base. IMO, there are much better commercial solutions to this problem; e.g., see www.inquira.com, which integrates

RE: Poor Lucene Ranking for Short Text

2004-12-24 Thread Chuck Williams
I think you are confusing lengthNorm and the overall normalization of the score. For overall normalization (prior to a final forced normalization in Hits), Lucene uses the formula you cite, except that it never sums tf_d*idf_t, using instead tf_q*idf_t again, because the former is

RE: I though I understood, but obviously I missed something.

2004-12-24 Thread Chuck Williams
All of your Document.add's need to be doc.add's. You are adding the field to the document, not the class. Chuck -Original Message- From: Jim Lynch [mailto:[EMAIL PROTECTED] Sent: Friday, December 24, 2004 8:30 AM To: Lucene Users List Subject: I though I understood, but

RE: Relevance percentage

2004-12-23 Thread Chuck Williams
: Wednesday, December 22, 2004 11:59 PM To: lucene-user@jakarta.apache.org Subject: Re: Relevance percentage On Thursday 23 December 2004 08:13, Gururaja H wrote: Hi Chuck Williams, Thanks much for the reply. If your queries are all BooleanQuery's of TermQuery's

RE: Lucene index files from two different applications.

2004-12-21 Thread Chuck Williams
Depending on what you are doing, there are some problems with MultiSearcher. See http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 for a description of the issues and possible patch(es) to fix. Chuck -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent:

RE: Relevance percentage

2004-12-20 Thread Chuck Williams
The coord() value is not saved anywhere so you would need to recompute it. You could either call explain() and parse the result string, or better, look at explain() and implement what it does more efficiently just for coord(). If your queries are all BooleanQuery's of TermQuery's, then this is

RE: Relevance and ranking ...

2004-12-20 Thread Chuck Williams
). Chuck -Original Message- From: Gururaja H [mailto:[EMAIL PROTECTED] Sent: Sunday, December 19, 2004 10:10 PM To: Lucene Users List Subject: RE: Relevance and ranking ... Chuck Williams, Thanks for the reply. Source code and Output are below. Please

RE: determination of matching hits

2004-12-20 Thread Chuck Williams
This is not the official recommendation, but I'd suggest you at least consider: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 If you're not using Java 1.5 and you decide you want to use it, you'd need to take out those dependencies. If you improve it, please share. Chuck

RE: Relevance and ranking ...

2004-12-18 Thread Chuck Williams
The coord is the fraction of clauses matched in a BooleanQuery, so with your example of a 5-word BooleanQuery, the coord factors should be .4, .8, .8, 1.0 respectively for doc1, doc2, doc3 and doc4. One big issue you've got here is lengthNorm. Doc2 is 1/10 the size of doc4, so its lengthNorm is
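The coord arithmetic above can be sketched in a few lines of plain Java. The class name is hypothetical, and the matched-clause counts (2, 4, 4, 5 out of 5) are inferred from the coord values quoted in the message rather than taken from the original documents.

```java
/** Hypothetical stand-in with the same shape as a coord(overlap, maxOverlap) method. */
final class CoordSketch {
    /** Fraction of query clauses that a document matched. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        int[] matched = {2, 4, 4, 5}; // inferred counts for doc1..doc4
        for (int i = 0; i < matched.length; i++) {
            System.out.println("doc" + (i + 1) + " coord = " + coord(matched[i], 5));
        }
    }
}
```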

RE: Relevance and ranking ...

2004-12-17 Thread Chuck Williams
Another issue will likely be the tf() and idf() computations. I have a similar desired relevance ranking and was not getting what I wanted due to the idf() term dominating the score. Lucene squares the contribution of this term, which is not considered best practice in IR. To address these
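A rough sketch of why a squared idf can dominate, in plain Java with no Lucene dependency. The idf formula used here, ln(numDocs / (docFreq + 1)) + 1, is an assumption about the default of that era, and the square-root dampening is only one illustrative option, not necessarily the change made in the message. All names are hypothetical.

```java
/** Hypothetical illustration of idf growth; not Lucene code. */
final class IdfSketch {
    /** Assumed default-style idf: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** The squared contribution described in the message. */
    static double squaredContribution(int docFreq, int numDocs) {
        double v = idf(docFreq, numDocs);
        return v * v;
    }

    /** Illustrative dampening: squaring this value gives back plain idf. */
    static double dampenedIdf(int docFreq, int numDocs) {
        return Math.sqrt(idf(docFreq, numDocs));
    }

    public static void main(String[] args) {
        // A rare term (9 of 1000 docs) versus a common one (499 of 1000 docs).
        System.out.println("rare:   idf=" + idf(9, 1000) + " squared=" + squaredContribution(9, 1000));
        System.out.println("common: idf=" + idf(499, 1000) + " squared=" + squaredContribution(499, 1000));
    }
}
```

Under these assumptions the gap between a rare and a common term widens considerably once the value is squared, which is consistent with idf swamping tf and other factors.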

RE: Indexing with Lucene 1.4.3

2004-12-16 Thread Chuck Williams
That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work. Chuck -Original Message- From: Hetan Shah [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16,

RE: NUMERIC RANGE BOOLEAN

2004-12-16 Thread Chuck Williams
Karthik, RangeQuery expands into a BooleanQuery containing all of the terms in the index that fall within the range. By default, BooleanQuery's can have at most 1,024 terms. So, if your index has more than 1,024 different prices that fall within your range then you will hit this exception.
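The expansion can be simulated without Lucene. In this sketch a sorted set stands in for the index's term dictionary, and the range is "expanded" by counting the distinct terms that fall inside it. The class and method names are hypothetical; the 1,024 limit is the default quoted in the message.

```java
import java.util.TreeSet;

/** Hypothetical simulation of a range expanding into one clause per indexed term. */
final class RangeExpansionSketch {
    static final int ASSUMED_MAX_CLAUSES = 1024; // default limit cited in the message

    /** Number of distinct indexed terms in [lower, upper], i.e. clauses after expansion. */
    static int clausesFor(TreeSet<String> indexedTerms, String lower, String upper) {
        return indexedTerms.subSet(lower, true, upper, true).size();
    }

    /** True when the expansion would go past the assumed clause limit. */
    static boolean wouldExceedLimit(TreeSet<String> indexedTerms, String lower, String upper) {
        return clausesFor(indexedTerms, lower, upper) > ASSUMED_MAX_CLAUSES;
    }

    public static void main(String[] args) {
        TreeSet<String> prices = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            prices.add(String.format("%04d", i)); // zero-padded so string order matches numeric order
        }
        System.out.println(wouldExceedLimit(prices, "0000", "0099"));
        System.out.println(wouldExceedLimit(prices, "0000", "1999"));
    }
}
```

What matters is the number of distinct terms inside the range, not the number of documents: many documents sharing a few prices expand to only a few clauses.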

RE: NUMERIC RANGE BOOLEAN

2004-12-16 Thread Chuck Williams
Errata: b. [$2 to 4] Chuck -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16, 2004 9:58 PM To: Lucene Users List Subject: RE: NUMERIC RANGE BOOLEAN Karthik, RangeQuery expands into a BooleanQuery containing all

RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
I'll try to address all the comments here. The normalization I proposed a while back on lucene-dev is specified. Its properties can be analyzed, so there is no reason to guess about them. Re. Hoss's example and analysis, yes, I believe it can be demonstrated that the proposed normalization would

RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
to computed in incremental indexing because when one document is added, idf of each term changed. But drop it is not a good choice. What is the role of norm_d_t ? Nhan. --- Chuck Williams [EMAIL PROTECTED] wrote: Nhan, Re. your two differences: 1

RE: A question about scoring function in Lucene

2004-12-14 Thread Chuck Williams
Nhan, Re. your two differences: 1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter if the sum is over just the numerator or over the entire fraction, the
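The argument can be written out as a short identity. Here x_t is a placeholder for whatever per-term quantity sits in the numerator; the symbol is an assumption, since the full formula under discussion is not shown in this excerpt.

```latex
\sum_{t} \frac{x_t}{\mathrm{Norm}_d \, \mathrm{Norm}_q}
  \;=\; \frac{1}{\mathrm{Norm}_d \, \mathrm{Norm}_q} \sum_{t} x_t
  \;=\; \frac{\sum_{t} x_t}{\mathrm{Norm}_d \, \mathrm{Norm}_q}
```

The factor Norm_d * Norm_q does not depend on t, so it can be pulled outside the sum.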

RE: A simple Query Language

2004-12-10 Thread Chuck Williams
You could support only terms with no operators at all, which will work in most search engines (except those that require combining operators). Using just terms and phrases embedded in double quotes is pretty universal. After that, you might want to add +/- required/prohibited restrictions, which many engines

RE: Coordination value

2004-12-09 Thread Chuck Williams
There is an easier way. You should use a custom Similarity, which allows you to define your own coord() method. Look at DefaultSimilarity (which specializes Similarity). I'd suggest analyzing your scores first with explain() to decide what you really want to tweak. Just a guess, but your issue

RE: Lucene Vs Ixiasoft

2004-12-08 Thread Chuck Williams
Lucene contains a complete set of Boolean query operators, and it uses the vector space model to determine scores for relevance ranking. It's fast. It works. Chuck -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 08, 2004 7:13 PM To:

RE: Sorting in Lucene

2004-12-07 Thread Chuck Williams
Since it's untokenized, are you searching with the exact string stored in the field? Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 3:29 PM To: 'Lucene Users List'; 'Chris Fraschetti' Subject: RE: Sorting in Lucene

RE: Sorting in Lucene

2004-12-07 Thread Chuck Williams
-Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 4:04 PM To: Lucene Users List Subject: RE: Sorting in Lucene Since it's untokenized, are you searching with the exact string stored in the field? Chuck

Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-03 Thread Chuck Williams
I'm wondering what values of mergeFactor, minMergeDocs and maxMergeDocs people have found to yield the best performance for different configurations. Is there a repository of this information anywhere? I've got about 30k documents and have 3 indexing scenarios: 1. Full indexing and

RE: Search multiple Fields

2004-12-02 Thread Chuck Williams
If you want this to be efficient in your application, I'd suggest integrating at a lower level. E.g., take a look at TermScorer.explain() to see how it determines whether or not a term matches in a field of a document. Another approach might be to specialize BooleanQuery to keep track of which

RE: boosting challenge

2004-11-29 Thread Chuck Williams
Try the explain() capability to see what factors are influencing the order of your results. Probably these other factors are overwhelming your boost. I had similar problems and resolved them by tweaking these other contributions, especially idf. You can do that in a custom Similarity. Chuck

RE: modifying existing index

2004-11-24 Thread Chuck Williams
(Field.Keyword(title,title)); doc.add(Field.Keyword(keywords,keywords)); doc.add(Field.Keyword(type,type)); writer.addDocument(doc); - Original Message - From: Chuck Williams [EMAIL PROTECTED] To: Lucene Users List [EMAIL

RE: URGENT: Help indexing large document set

2004-11-24 Thread Chuck Williams
Does keyIter return the keys in sorted order? This should reduce seeks, especially if the keys are dense. Also, you should be able to call localReader.delete(term) instead of iterating over the docs (of which I presume there is only one doc since keys are unique). This won't improve performance as

RE: URGENT: Help indexing large document set

2004-11-23 Thread Chuck Williams
Are you sure you have a performance problem with TermInfosReader.get(Term)? It looks to me like it scans sequentially only within a small buffer window (of size SegmentTermEnum.indexInterval) and that it uses binary search otherwise. See TermInfosReader.getIndexOffset(Term). Chuck
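The lookup pattern described here, a binary search over a sparse index followed by a short sequential scan, can be sketched in plain Java. This is an illustration of the general technique under assumed names, not the actual TermInfosReader code; the sparse index is rebuilt on every call only to keep the example short.

```java
import java.util.Arrays;

/** Hypothetical sketch: sparse index + bounded sequential scan over sorted terms. */
final class SparseIndexLookup {
    /**
     * @param sortedTerms all terms in ascending order
     * @param interval    every interval-th term is kept in the sparse index (must be > 0)
     * @return position of target in sortedTerms, or -1 if absent
     */
    static int find(String[] sortedTerms, int interval, String target) {
        // Sparse index: terms at positions 0, interval, 2 * interval, ...
        int indexSize = (sortedTerms.length + interval - 1) / interval;
        String[] index = new String[indexSize];
        for (int i = 0; i < indexSize; i++) {
            index[i] = sortedTerms[i * interval];
        }

        // Binary search for the last indexed term that is <= target.
        int pos = Arrays.binarySearch(index, target);
        int block = pos >= 0 ? pos : -pos - 2; // insertion point minus one
        if (block < 0) {
            return -1; // target sorts before the first term
        }

        // Sequential scan limited to one window of at most `interval` terms.
        int start = block * interval;
        int end = Math.min(start + interval, sortedTerms.length);
        for (int i = start; i < end; i++) {
            int cmp = sortedTerms[i].compareTo(target);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // passed the spot where target would be
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "c", "d", "e", "f", "g"};
        System.out.println(find(terms, 3, "e"));
    }
}
```

The sequential part never touches more than `interval` terms, so the cost per lookup is logarithmic in the index size plus a small constant scan.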

RE: lucene Scorers

2004-11-23 Thread Chuck Williams
2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] wrote: On Friday 12 November 2004 22:56, Chuck Williams wrote: I had a similar need and wrote MaxDisjunctionQuery and MaxDisjunctionScorer. Unfortunately these are not available as a patch but I've included

RE: modifying existing index

2004-11-23 Thread Chuck Williams
A good way to do this is to add a keyword field with whatever unique id you have for the document. Then you can delete the term containing a unique id to delete the document from the index (look at IndexReader.delete(Term)). You can look at the demo class IndexHTML to see how it does incremental

RE: fetching similar wordlist as given word

2004-11-23 Thread Chuck Williams
Lucene does support stemming, but that is not what your example requires (stemming equates roaming, roam, roamed, etc.). For stemming, look at PorterStemFilter or better, the Snowball stemmers in the sandbox. For your similar word list, I think you are looking for the class FuzzyTermEnum. This

RE: Question about multi-searching [re-post]

2004-11-22 Thread Chuck Williams
If you are going to compare scores across multiple indices, I'd suggest considering one of the patches here: http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 Chuck -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Monday, November 22, 2004 6:30 AM

RE: Need help with filtering

2004-11-22 Thread Chuck Williams
It sounds like you need to pad your numbers with leading zeroes, i.e. use the same type of encoding as is required by RangeQuery's. If you query with 05 instead of 5 do you get what you expect? If all your document id's are fixed length, then string comparison will be isomorphic to integer

RE: Lucene - index fields design question

2004-11-16 Thread Chuck Williams
I do most of these same things and made these relevant design decisions: 1. Use a combination of query expansion to search across multiple fields and field concatenation to create document fields that combine separate object fields. I use multiple fields only when it is important to weight them

RE: setting Similarity at search time

2004-11-15 Thread Chuck Williams
Take a look at this: http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 Not my initial patch, but the latest patch from Wolf Siberski. I haven't used it yet, but it looks like what you are looking for, and something I want to use too. Chuck -Original Message- From: Ken

RE: How to efficiently get # of search results, per attribute

2004-11-13 Thread Chuck Williams
My Lucene application includes multi-faceted navigation that does a more complex version of the below. I've got 5 different taxonomies into which every indexed item is classified. The largest of the taxonomies has over 15,000 entries while the other 4 are much smaller. For every search query, I

RE: Anyone implemented custom hit ranking?

2004-11-13 Thread Chuck Williams
I've done some customization of scoring/ranking and plan to do more. A good place to start is with your own Similarity, extending Lucene's DefaultSimilarity. Like you, I found the default length normalization to not work well with my dataset. I separately weight each indexed field according to

RE: lucene Scorers

2004-11-12 Thread Chuck Williams
didn't need that functionality (since I'm generating the multi-field expansions for which max is a much better scoring choice than sum). Chuck Included message: -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Monday, October 11, 2004 9:55 PM To: [EMAIL PROTECTED
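As a rough illustration of the max-versus-sum point for multi-field expansions, the sketch below combines per-field scores for one term both ways. It is plain Java with hypothetical names and made-up scores; it does not reproduce MaxDisjunctionQuery or MaxDisjunctionScorer, whose details are not shown in this excerpt.

```java
/** Hypothetical comparison of two ways to combine per-field scores for one term. */
final class FieldCombination {
    /** Sum grows with every additional field the term appears in. */
    static double sum(double[] perFieldScores) {
        double total = 0.0;
        for (double s : perFieldScores) {
            total += s;
        }
        return total;
    }

    /** Max keeps only the best field, so repeated matches are not double-counted. */
    static double max(double[] perFieldScores) {
        double best = 0.0;
        for (double s : perFieldScores) {
            best = Math.max(best, s);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] titleAndBody = {0.5, 0.25}; // made-up scores for one term in two fields
        System.out.println("sum = " + sum(titleAndBody) + ", max = " + max(titleAndBody));
    }
}
```

With sum, one term appearing in several fields can outscore a document that matches more of the query's distinct terms; with max, each term contributes only through its best field.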

RE: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-05 Thread Chuck Williams
PROTECTED] Sent: Friday, November 05, 2004 10:00 AM To: Lucene Users List Subject: Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory? On Friday 05 November 2004 18:03, Chuck Williams wrote: The Lucene index is not in CVS -- neither

RE: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-04 Thread Chuck Williams
tested this (I used a file, not a directory) for Lucene in Action. What error are you getting? I know there is -I CVS option for ignoring files; perhaps it works with directories, too. Otis --- Chuck Williams [EMAIL PROTECTED] wrote: I have a Tomcat web module

RE: Sorting in Lucene.

2004-11-04 Thread Chuck Williams
Yes, by one or multiple criteria. Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 6:21 PM To: 'Lucene Users List' Subject: Sorting in Lucene. Hi All, Does Lucene supports sorting on the search

RE: Sorting in Lucene.

2004-11-04 Thread Chuck Williams
, Can you please point me to some articles or FAQ about Sorting in Lucene? Thanks a lot for your reply. Thanks, Ramon -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 9:44 PM To: Lucene Users List Subject

RE: Aliasing problem

2004-10-26 Thread Chuck Williams
Looks like you produced a PhraseQuery rather than a BooleanQuery. You want +GAME:(doom3 3 doom) Chuck -Original Message- From: Abhay Saswade [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 26, 2004 10:22 AM To: [EMAIL PROTECTED] Subject: Aliasing problem Hi,

RE: Range Query

2004-10-20 Thread Chuck Williams
Karthik, It is all spelled out in a Lucene HowTo here: http://wiki.apache.org/jakarta-lucene/SearchNumericalFields Have fun with it, Chuck -Original Message- From: Karthik N S [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 20, 2004 12:15 AM To: Lucene Users List;

RE: Range Query

2004-10-19 Thread Chuck Williams
Range queries use a lexicographic (dictionary) order. So, assuming all your values are positive, you need to ensure that the integer part of each number has a fixed number of digits (pad with leading 0's). The fractional part should be fine, although 1.0 will follow 1. If you have negative
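A minimal sketch of the padding rule for non-negative values, in plain Java. The class name, method name, and chosen width are hypothetical; the point is only that string comparison matches numeric comparison once the integer part has a fixed number of digits.

```java
/** Hypothetical helper: left-pads the integer part of a non-negative decimal string. */
final class PaddedDecimals {
    /** Example: padIntegerPart("2.5", 4) returns "0002.5". */
    static String padIntegerPart(String value, int width) {
        int dot = value.indexOf('.');
        String intPart = dot < 0 ? value : value.substring(0, dot);
        String rest = dot < 0 ? "" : value.substring(dot); // keeps the '.' and the fraction
        StringBuilder sb = new StringBuilder();
        for (int i = intPart.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(intPart).append(rest).toString();
    }

    public static void main(String[] args) {
        // Unpadded, "10" sorts before "2.5"; padded, the order matches the numeric order.
        System.out.println("10".compareTo("2.5") < 0);
        System.out.println(padIntegerPart("10", 4).compareTo(padIntegerPart("2.5", 4)) > 0);
    }
}
```

As the message notes, "1.0" still sorts after "1" even with padding, because the longer string follows its own prefix.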

RE: Index and Search Phrase Documents

2004-10-18 Thread Chuck Williams
You haven't provided enough information for anybody to help. Have you added indexed Field's to your document? If not, there is nothing to search. I don't think you are looking for a parameter to the IndexWriter constructor. I expect the advice from Aviran is best. You should read and

RE: index, reindexing problem

2004-10-17 Thread Chuck Williams
I had this same problem a while back. It should be resolved if you move the writer = new IndexWriter(...) until after the reader.close(). I.e., complete all the deletions and close the reader before creating the writer. Chuck -Original Message- From: MATL (Mats Lindberg)

RE: Filtering Results?

2004-10-14 Thread Chuck Williams
, 2004 11:22 AM To: [EMAIL PROTECTED] Subject: RE: Filtering Results? Thanks Chuck. Meanwhile searching on net and found this link http://wiki.apache.org/jakarta-lucene/SearchNumericalFields Thanks again From: Chuck Williams [EMAIL PROTECTED] Reply-To: Lucene Users List [EMAIL