Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Paul Elschot
On Tuesday 15 November 2005 20:30, Doug Cutting wrote: > Paul Elschot wrote: > > Not using the document term frequencies in PrefixQuery would still > > leave these as a surprise factor between PrefixQuery and TermQuery. > > Should we dynamically decide to switch to FieldNormQuery when > BooleanQu

Re: Lucene Index backboned by DB

2005-11-15 Thread jian chen
Dear All, I have some thoughts on this issue as well. 1) It might be OK to implement retrieving field values separately for a document. However, I think from a simplicity point of view, it might be better to have the application code do this drudgery. Adding this feature could complicate the nice

Re: Lucene Index backboned by DB

2005-11-15 Thread Robert Kirchgessner
Hi, the discussion at http://issues.apache.org/jira/browse/LUCENE-196 might be of interest to you. Have you thought about storing the large pieces of documents in a database to reduce the size of the Lucene index? I think there are good reasons for adding support for storing fields in separate files: 1

[jira] Closed: (LUCENE-465) surround test code is incompatible with *Test pattern in test target.

2005-11-15 Thread Daniel Naber (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-465?page=all ] Daniel Naber closed LUCENE-465: --- Fix Version: 1.9 Resolution: Fixed Thanks, committed. > surround test code is incompatible with *Test pattern in test target. >

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Totally untested, but here is a hack at what the scorer might look like when the number of terms is large. -Yonik package org.apache.lucene.search; import org.apache.lucene.index.TermEnum; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.TermDocs; import java.io.IOExc

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
> However, one problem I don't know how to solve is > Weight.sumOfSquares(), which needs to know the idf of every single > term, before the scorer is even created! Darn, even if one leaves out idf(), Weight.sumOfSquares() still depends on the number of terms in the query. I guess it's not possibl
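The dependence on the number of terms can be illustrated with a small standalone sketch. It assumes the usual scheme in which each term's query weight is idf times boost and the normalization factor is one over the square root of the summed squares. The class and method names below are hypothetical and are not the Lucene classes themselves.

```java
public class QueryNormSketch {
    /** Sum of squared per-term query weights, each weight assumed to be idf * boost. */
    static float sumOfSquares(float[] idfs, float boost) {
        float sum = 0f;
        for (float idf : idfs) {
            float weight = idf * boost;
            sum += weight * weight;
        }
        return sum;
    }

    /** Assumed normalization factor: 1 / sqrt(sum of squares). */
    static float queryNorm(float sumOfSquares) {
        return (float) (1.0 / Math.sqrt(sumOfSquares));
    }

    public static void main(String[] args) {
        // idf fixed at 1 for every term, i.e. idf effectively left out.
        float[] fourTerms = {1f, 1f, 1f, 1f};
        float[] sixteenTerms = new float[16];
        java.util.Arrays.fill(sixteenTerms, 1f);
        System.out.println(queryNorm(sumOfSquares(fourTerms, 1f)));
        System.out.println(queryNorm(sumOfSquares(sixteenTerms, 1f)));
    }
}
```

Even with every idf set to 1, the factor still shrinks as the term count grows, so the full expansion has to be known before scoring starts.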

Lucene Index backboned by DB

2005-11-15 Thread Karel Tejnora
Hi all, we are using Lucene 1.4.3 in our test application. Thank you guys for the great job. We have an index of around 12 GiB in one (merged) file. Retrieving hits takes a nicely small amount of time, but reading the stored fields takes 10-100 times longer. I think because all the fields

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread markharw00d
I was thinking about the challenges of holding a score per document recently whilst trying to optimize the Lucene-embedded-in-Derby/HSQLDB code. I found myself actually wanting to visualize the problem and to see the distribution of scores for a query in a graphical form eg how sparse the resu

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
On 11/15/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > One option is to keep a score per document, That's what I was thinking... float[maxDoc[]] scores scores[doc] += tf(term) * idf(term) * norm(term.field) It would be nice to keep score compatibility with the current BooleanQuery, then that
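A minimal standalone sketch of the score-per-document idea, written against plain arrays rather than the Lucene index API. The `Postings` holder, the pre-decoded `norms` array, and the choice of tf as the square root of the in-document frequency are all assumptions made for illustration.

```java
public class ScoreAccumulatorSketch {
    /** Hypothetical postings for one expanded term: parallel doc ids and frequencies. */
    static final class Postings {
        final int[] docs;
        final int[] freqs;
        final float idf;

        Postings(int[] docs, int[] freqs, float idf) {
            this.docs = docs;
            this.freqs = freqs;
            this.idf = idf;
        }
    }

    /** Adds tf * idf * norm into one float slot per document, term by term. */
    static float[] accumulate(int maxDoc, Postings[] terms, float[] norms) {
        float[] scores = new float[maxDoc];
        for (Postings p : terms) {
            for (int i = 0; i < p.docs.length; i++) {
                int doc = p.docs[i];
                float tf = (float) Math.sqrt(p.freqs[i]); // assumed tf function
                scores[doc] += tf * p.idf * norms[doc];
            }
        }
        return scores;
    }
}
```

Memory use is one float per document regardless of how many terms the query expands to, which is the trade-off under discussion in this thread.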

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting
Yonik Seeley wrote: Scoring recap... I think I've seen 4 different types of scoring mentioned in this thread for a term expanding query on a single field: 1) query_boost 2) query_boost * (field_boost * lengthNorm) 3) query_boost * (field_boost * lengthNorm) * tf(t in q) 4) query_boost * (field_b

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Scoring recap... I think I've seen 4 different types of scoring mentioned in this thread for a term expanding query on a single field: 1) query_boost 2) query_boost * (field_boost * lengthNorm) 3) query_boost * (field_boost * lengthNorm) * tf(t in q) 4) query_boost * (field_boost * lengthNorm) * t

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Here's a diff to ConstantScoreQuery that optionally folds in norms (minus explain() functionality right now). Should it be added, or do the differences warrant a new Query class, or if kept together, should ConstantScoreQuery be renamed since it's not quite so constant? -Yonik Now hiring -- http:/

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
I'm not crazy about the idea of scoring changing dramatically. I think people need to be able to specify the scoring style and have it always score that way. Indices change size and composition over time, making it difficult to predict when one would be hit with wildly different scoring (and mo

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting
Paul Elschot wrote: Not using the document term frequencies in PrefixQuery would still leave these as a surprise factor between PrefixQuery and TermQuery. Should we dynamically decide to switch to FieldNormQuery when BooleanQuery.maxClauseCount is exceeded? That way queries that currently wo

Re: svn commit: r332431 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java src/test/org/apache/lucene/search/TestCustomSearcherSort.java

2005-11-15 Thread Yonik Seeley
It's flagged as ASF in the bug: http://issues.apache.org/jira/browse/LUCENE-456 I'll add the header. -Yonik On 11/15/05, Bernhard Messer <[EMAIL PROTECTED]> wrote: > Yonik, > > TestCustomSearcherSort.java you added a few days ago shows that the > author is Martin Seitz from T-Systems and doesn't

[jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields

2005-11-15 Thread paul.elschot (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12357721 ] paul.elschot commented on LUCENE-323: - The ScorerDocQueue.java here has a single operation for something very similar to the heap-remove/generate/heap-insert: http://issue

Re: svn commit: r332431 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java src/test/org/apache/lucene/search/TestCustomSearcherSort.java

2005-11-15 Thread Bernhard Messer
Yonik, the TestCustomSearcherSort.java you added a few days ago shows that the author is Martin Seitz from T-Systems, and it doesn't have the Apache license agreement in its header. Is it OK to have this test in SVN? Bernhard [EMAIL PROTECTED] wrote: Author: yonik Date: Thu Nov 10 19:13:10 2005

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Paul Elschot
On Tuesday 15 November 2005 19:35, Yonik Seeley wrote: > On 11/15/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > > Paul Elschot wrote: > > > I think losing the field boosts for PrefixQuery and friends would not be > > > advisable. Field boosts have a very big range and from that a very big > > > i

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting
Yonik Seeley wrote: As far as API goes, I guess there should be a constructor ConstantScoreQuery(Filter filter, String field) If field is non-null, then the field norm can be multiplied into the score. You could implement this with a scorer subclass that multiplies by the norm, removing a condi
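The scorer-subclass suggestion might look roughly like the following standalone sketch. The class names are hypothetical, and a nullable, already-decoded `float[]` of norms stands in for the proposed nullable field name; none of this is the actual ConstantScoreQuery code.

```java
public class ConstantScoreSketch {
    /** Scores every matching document with the same value. */
    static class ConstantScorer {
        final float value;

        ConstantScorer(float value) {
            this.value = value;
        }

        float score(int doc) {
            return value;
        }
    }

    /** Variant that multiplies the constant by the document's field norm. */
    static class NormScorer extends ConstantScorer {
        final float[] norms;

        NormScorer(float value, float[] norms) {
            super(value);
            this.norms = norms;
        }

        @Override
        float score(int doc) {
            return value * norms[doc];
        }
    }

    /** Picks the scorer once, up front; score() itself never tests for norms. */
    static ConstantScorer create(float boost, float[] normsOrNull) {
        return normsOrNull == null
                ? new ConstantScorer(boost)
                : new NormScorer(boost, normsOrNull);
    }
}
```

Choosing the scorer class once at construction keeps the per-document `score` call free of any check for whether norms are in use.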

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
On 11/15/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > Paul Elschot wrote: > > I think losing the field boosts for PrefixQuery and friends would not be > > advisable. Field boosts have a very big range and from that a very big > > influence on the score and the order of the results in Hits. > > It

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting
Paul Elschot wrote: I think losing the field boosts for PrefixQuery and friends would not be advisable. Field boosts have a very big range and from that a very big influence on the score and the order of the results in Hits. It should not be hard to add these. If a field name is provided, the

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
On 11/15/05, Paul Elschot <[EMAIL PROTECTED]> wrote: > TermQuery relies on field boost and document term frequency, so > having PrefixQuery ignore these would also lead to unexpected > surprises. The surprise from a field boost not working should be found during development. The surprise of queri

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Good point about FuzzyQuery... it has already mostly solved the "too many clauses" thing anyway. I also think the idf should go. There are two different use cases: 1) relevancy: give highest relevance and closest matches, but I don't care if I get 100% of the matches. 2) matching: must give all

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread mark harwood
> That would use more memory, but still permit ranked > searches. Worth it? Not sure. I expect FuzzyQuery results would suffer if the edit distance could no longer be factored in. At least there's a quality threshold to limit the more tenuous matches but all matches below the threshold would be