Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread mark harwood
 That would use more memory, but still permit ranked
 searches.  Worth it? 

Not sure. I expect FuzzyQuery results would suffer if
the edit distance could no longer be factored in. At
least there's a quality threshold to limit the more
tenuous matches, but all matches below the threshold
would be scored equally. I've certainly had no use for
the idf in fuzzy queries (it favours rare
mis-spellings), so I'm happy to see that go.  I'm not sure
what the lack of edit-distance would do without seeing
some example results.






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Good point about FuzzyQuery... it has already mostly solved the too
many clauses thing anyway.  I also think the idf should go.

There are two different use cases:
 1) relevancy: give highest relevance and closest matches, but I don't
care if I get 100% of the matches.
 2) matching: must give all matches, but we don't really care about
relevance (it's more like a filter).

Range queries are normally only used in case 2.
Prefix queries are used in both cases, but since stemming handles word
variants, I think more people use them for case 2.
Fuzzy queries are normally only used in case 1, I think.

Now it gets a little confusing because queries in the matching
category may still rely on the field boost (but probably don't want
any other relevancy factor).  An example of this is boosting more
recent documents when building an index.  There are alternate ways to
solve this (some of them more flexible, like the FunctionQuery I'm
refactoring now).

I'd still argue for making ConstantScoreRangeQuery the default of the
QueryParser.


On 11/15/05, mark harwood [EMAIL PROTECTED] wrote:
  That would use more memory, but still permit ranked
  searches.  Worth it?

 Not sure. I expect FuzzyQuery results would suffer if
 the edit distance could no longer be factored in. At
 least there's a quality threshold to limit the more
 tenuous matches but all matches below the threshold
 would be scored equally. I've certainly had no use for
 the idf in fuzzy queries (it favours rare
 mis-spellings) so happy to see that go.  I'm not sure
 what the lack of edit-distance would do without seeing
 some examples results.

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
On 11/15/05, Paul Elschot [EMAIL PROTECTED] wrote:
 TermQuery relies on field boost and document term frequency, so
 having PrefixQuery ignore these would also lead to unexpected
 surprises.

The surprise from a field boost not working should be found during
development.  The surprise of queries that suddenly stop working after
things have been in production for 6 months is worse I think.

However it's done, TooManyClauses should never happen unless the user
explicitly adds that many clauses.

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting

Paul Elschot wrote:

I think losing the field boosts for PrefixQuery and friends would not be
advisable. Field boosts have a very big range and therefore a very big
influence on the score and on the order of the results in Hits.


It should not be hard to add these.  If a field name is provided, then 
it would be a one or two line change to ConstantScoreQuery.java to have 
it boost scores according to that field's norm values.  Right, Yonik?


Doug




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting

Yonik Seeley wrote:

As far as API goes, I guess there should be a constructor
ConstantScoreQuery(Filter filter, String field)
If field is non-null, then the field norm can be multiplied into the score.


You could implement this with a scorer subclass that multiplies by the 
norm, removing a conditional from the inner loop.



Now how would the user go about choosing a truly constant scoring
range query vs. a constant scoring range query with the norms folded
in?
An additional flag called includeFieldBoost on ConstantScoreRangeQuery?


Sounds good to me.

Doug




Re: svn commit: r332431 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java src/test/org/apache/lucene/search/TestCustomSearcherSort.java

2005-11-15 Thread Bernhard Messer

Yonik,

The TestCustomSearcherSort.java you added a few days ago shows that the 
author is Martin Seitz from T-Systems and doesn't have the Apache license 
agreement in its header. Is it OK to have this test in SVN?


Bernhard


[EMAIL PROTECTED] wrote:


Author: yonik
Date: Thu Nov 10 19:13:10 2005
New Revision: 332431

URL: http://svn.apache.org/viewcvs?rev=332431&view=rev
Log:
break sorting ties by index order: LUCENE-456

Added:
   
lucene/java/trunk/src/test/org/apache/lucene/search/TestCustomSearcherSort.java
Modified:
   lucene/java/trunk/CHANGES.txt
   
lucene/java/trunk/src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java

Modified: lucene/java/trunk/CHANGES.txt
URL: 
http://svn.apache.org/viewcvs/lucene/java/trunk/CHANGES.txt?rev=332431&r1=332430&r2=332431&view=diff
==
--- lucene/java/trunk/CHANGES.txt (original)
+++ lucene/java/trunk/CHANGES.txt Thu Nov 10 19:13:10 2005
@@ -245,6 +245,10 @@
change the sort order when sorting by string for documents without
a value for the sort field.
(Luc Vanlerberghe via Yonik, LUCENE-453)
+
+16. Fixed a sorting problem with MultiSearchers that can lead to
+missing or duplicate docs due to equal docs sorting in an arbitrary order.
+(Yonik Seeley, LUCENE-456)

Optimizations
 


Modified: 
lucene/java/trunk/src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java
URL: 
http://svn.apache.org/viewcvs/lucene/java/trunk/src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java?rev=332431&r1=332430&r2=332431&view=diff
==
--- 
lucene/java/trunk/src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java 
(original)
+++ 
lucene/java/trunk/src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java 
Thu Nov 10 19:13:10 2005
@@ -157,6 +157,11 @@
c = -c;
}
}
-   return c > 0;
+
+// avoid random sort order that could lead to duplicates (bug #31241):
+if (c == 0)
+  return docA.doc > docB.doc;
+
+return c > 0;
}
}

Added: 
lucene/java/trunk/src/test/org/apache/lucene/search/TestCustomSearcherSort.java
URL: 
http://svn.apache.org/viewcvs/lucene/java/trunk/src/test/org/apache/lucene/search/TestCustomSearcherSort.java?rev=332431&view=auto
==
--- 
lucene/java/trunk/src/test/org/apache/lucene/search/TestCustomSearcherSort.java 
(added)
+++ 
lucene/java/trunk/src/test/org/apache/lucene/search/TestCustomSearcherSort.java 
Thu Nov 10 19:13:10 2005
@@ -0,0 +1,268 @@
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.Calendar;
+import java.util.GregorianCalendar;
+import java.util.Map;
+import java.util.Random;
+import java.util.TreeMap;
+
+import junit.framework.Test;
+import junit.framework.TestCase;
+import junit.framework.TestSuite;
+import junit.textui.TestRunner;
+
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.document.DateTools;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.RAMDirectory;
+
+/**
+ * Unit test for sorting code.
+ *
+ * @author  Martin Seitz (T-Systems)
+ */
+
+public class TestCustomSearcherSort
+extends TestCase
+implements Serializable {
+
+private Directory index = null;
+private Query query = null;
+// reduced from 2 to 2000 to speed up test...
+private final static int INDEX_SIZE = 2000;
+
+   public TestCustomSearcherSort (String name) {
+   super (name);
+   }
+
+   public static void main (String[] argv) {
+   TestRunner.run (suite());
+   }
+
+   public static Test suite() {
+   return new TestSuite (TestCustomSearcherSort.class);
+   }
+
+
+   // create an index for testing
+   private Directory getIndex()
+   throws IOException {
+   RAMDirectory indexStore = new RAMDirectory ();
+   IndexWriter writer = new IndexWriter (indexStore, new 
StandardAnalyzer(), true);
+   RandomGen random = new RandomGen();
+   for (int i=0; i<INDEX_SIZE; ++i) { // don't decrease; if too low 
the problem doesn't show up
+   Document doc = new Document();
+   if((i%5)!=0) { // some documents must not have an entry in 
the first sort field
+   doc.add (new Field("publicationDate_", 
random.getLuceneDate(), Field.Store.YES, Field.Index.UN_TOKENIZED));
+   }
+	if((i%7)==0) { // some documents to match the query (see below) 
+	doc.add (new 

[jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields

2005-11-15 Thread paul.elschot (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12357721 ] 

paul.elschot commented on LUCENE-323:
-

The ScorerDocQueue.java here has a single operation for something
very similar to the heap-remove/generate/heap-insert:

http://issues.apache.org/jira/browse/LUCENE-365

There is also a test class for testing performance of disjunction scorers
which could be used to find out which k is big enough to warrant the use
of a heap (priority queue).

Regards,
Paul Elschot


 [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate 
 support for queries across multiple fields
 -

  Key: LUCENE-323
  URL: http://issues.apache.org/jira/browse/LUCENE-323
  Project: Lucene - Java
 Type: Bug
   Components: QueryParser
 Versions: 1.4
  Environment: Operating System: Windows XP
 Platform: PC
 Reporter: Chuck Williams
 Assignee: Lucene Developers
  Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java, 
 TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java, TestRanking.zip, 
 TestRanking.zip, TestRanking.zip, WikipediaSimilarity.java, 
 WikipediaSimilarity.java, WikipediaSimilarity.java

 The attached test case demonstrates this problem and provides a fix:
   1.  Use a custom similarity to eliminate all tf and idf effects, just to 
 isolate what is being tested.
   2.  Create two documents doc1 and doc2, each with two fields title and 
 description.  doc1 has elephant in title and elephant in description.  
 doc2 has elephant in title and albino in description.
   3.  Express query for albino elephant against both fields.
 Problems:
   a.  MultiFieldQueryParser won't recognize either document as containing 
 both terms, due to the way it expands the query across fields.
   b.  Expressing query as title:albino description:albino title:elephant 
 description:elephant will score both documents equivalently, since each 
 matches two query terms.
   4.  Comparison to MaxDisjunctionQuery and my method for expanding queries 
 across fields.  Using notation that () represents a BooleanQuery and ( | ) 
 represents a MaxDisjunctionQuery, albino elephant expands to:
 ( (title:albino | description:albino)
   (title:elephant | description:elephant) )
 This will recognize that doc2 has both terms matched while doc1 only has 1 
 term matched, score doc2 over doc1.
 Refinement note:  the actual expansion for albino query that I use is:
 ( (title:albino | description:albino)~0.1
   (title:elephant | description:elephant)~0.1 )
 This causes the score of each MaxDisjunctionQuery to be the score of highest 
 scoring MDQ subclause plus 0.1 times the sum of the scores of the other MDQ 
 subclauses.  Thus, doc1 gets some credit for also having elephant in the 
 description but only 1/10 as much as doc2 gets for covering another query 
 term 
 in its description.  If doc3 has elephant in title and both albino 
 and elephant in the description, then with the actual refined expansion, it 
 gets the highest score of all (whereas with pure max, without the 0.1, it 
 would get the same score as doc2).
 In real apps, tf's and idf's also come into play of course, but can affect 
 these either way (i.e., mitigate this fundamental problem or exacerbate it).
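The tie-breaker arithmetic described above can be checked numerically. This toy Java class is not part of Lucene; the per-field match score of 1.0 is an assumption standing in for the custom similarity that neutralizes tf and idf:

```java
public class DisMaxDemo {
    // dismax score: highest subclause score plus tieBreaker times the rest
    static double disMax(double tieBreaker, double... scores) {
        double max = 0, sum = 0;
        for (double s : scores) { max = Math.max(max, s); sum += s; }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double t = 0.1; // the 0.1 refinement factor from the issue
        // each disMax(...) call is one (title:X | description:X) clause,
        // with a score of 1.0 per matching field (tf/idf neutralized)
        double doc1 = disMax(t, 0, 0) + disMax(t, 1, 1); // elephant in both fields
        double doc2 = disMax(t, 0, 1) + disMax(t, 1, 0); // albino desc, elephant title
        double doc3 = disMax(t, 0, 1) + disMax(t, 1, 1); // elephant both + albino desc
        System.out.println(doc1 + " " + doc2 + " " + doc3);
    }
}
```

With a tie-breaker of 0.1 this yields 1.1, 2.0, and 2.1 for doc1, doc2, and doc3, matching the ordering claimed above.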

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira





Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting

Paul Elschot wrote:

Not using the document term frequencies in PrefixQuery would still
leave these as a surprise factor between PrefixQuery and TermQuery.


Should we dynamically decide to switch to FieldNormQuery when 
BooleanQuery.maxClauseCount is exceeded?  That way queries that 
currently work would continue to work, and queries that formerly failed 
would now work.  How complicated would this be to implement?


Doug




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Scoring recap... I think I've seen 4 different types of scoring
mentioned in this thread for a term expanding query on a single field:

1) query_boost
2) query_boost * (field_boost * lengthNorm)
3) query_boost * (field_boost * lengthNorm) * tf(t in q)
4) query_boost * (field_boost * lengthNorm) * tf(t in q) * idf(t in q)

1 & 2 can be done with ConstantScoreQuery
4 is currently done via rewrite to BooleanQuery and limiting the
number of terms.
3 is unimplemented AFAIK.
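As a toy numeric illustration of the four variants (every boost, norm, tf, and idf value below is invented for the example, not taken from any real index or from Lucene's Similarity):

```java
public class ScoringRecapDemo {
    // Compute one of the four score variants by switching factors on.
    static float score(float queryBoost, float fieldBoost, float lengthNorm,
                       float tf, float idf,
                       boolean useNorms, boolean useTf, boolean useIdf) {
        float s = queryBoost;                        // variant 1: constant score
        if (useNorms) s *= fieldBoost * lengthNorm;  // variant 2 adds the field norm
        if (useTf)    s *= tf;                       // variant 3 adds term frequency
        if (useIdf)   s *= idf;                      // variant 4 adds idf
        return s;
    }

    public static void main(String[] args) {
        float qb = 2.0f, fb = 1.5f, ln = 0.25f, tf = 3.0f, idf = 1.2f; // invented
        for (int variant = 1; variant <= 4; variant++) {
            System.out.println(variant + ") " +
                score(qb, fb, ln, tf, idf, variant >= 2, variant >= 3, variant >= 4));
        }
    }
}
```

Variant 1 is the pure constant score; each later variant multiplies one more factor in.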

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread markharw00d
I was thinking about the challenges of holding a score per document 
recently whilst trying to optimize the  Lucene-embedded-in-Derby/HSQLDB 
code.
I found myself wanting to visualize the problem: to see, in graphical 
form, how sparse the result sets for a query were and how the scores 
were distributed.


I ended up adding a panel to Luke which does exactly this. I didn't get 
any blinding insights, but it may be of interest anyway.
I've already supplied Andrzej with this visualisation code, and he is 
waiting for Lucene 1.9 before releasing it as part of an updated Luke.

Let me know if you want the code before then and I can mail it to you.


Cheers,
Mark


Doug Cutting wrote:


Yonik Seeley wrote:


Scoring recap... I think I've seen 4 different types of scoring
mentioned in this thread for a term expanding query on a single field:

1) query_boost
2) query_boost * (field_boost * lengthNorm)
3) query_boost * (field_boost * lengthNorm) * tf(t in q)
4) query_boost * (field_boost * lengthNorm) * tf(t in q) * idf(t in q)

1 & 2 can be done with ConstantScoreQuery
4 is currently done via rewrite to BooleanQuery and limiting the
number of terms.
3 is unimplemented AFAIK.



3 is easy to implement as a subcase of 4, no?

The challenge is to implement 3 or 4 efficiently for very large 
queries w/o using gobs of RAM.  One option is to keep a score per 
document, making the RAM use proportional to the size of the 
collection (or at least the number of non-zero matches, if a sparse 
representation is used) or, as in 4, proportional to the number of 
terms in the query (with a large constant--an i/o buffer).


Doug














Lucene Index backboned by DB

2005-11-15 Thread Karel Tejnora

Hi all,
   in our testing we use an application built on Lucene 1.4.3. Thank you 
guys for that great job.
We have an index file of around 12 GiB, one (merged) file. Retrieving hits 
takes a nicely small amount of time, but reading the stored fields takes 
10-100 times longer, I think because all of the fields are read.
I would like to try implementing the Lucene index files as tables in a DB, 
with some lazy loading of fields. Searching the web, I have found only 
implementations of store.Directory (bdb), but they hold data only as 
binary streams. That technique will not help much, because BLOB operations 
do not perform fast. On the other hand I would lose some of the freedom of 
variable document fields, but I could omit a lot of the skipping and the 
many open files. IndexWriter could also have document/term locking 
granularity.
So I think the way forward is to extend IndexWriter / IndexReader and 
provide my own implementation of the index.Segment* classes. Is that the 
best way, or am I missing something about how to achieve this?

If it is a bad idea, I will be happy to hear other possibilities.

I would also like to join the development of Lucene. Are there some 
pointers on how to start?


Thanks for reading this, and
sorry if I made some mistakes.

Karel





Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
 However, one problem I don't know how to solve is
 Weight.sumOfSquares(), which needs to know the idf of every single
 term, before the scorer is even created!

Darn, even if one leaves out idf(), Weight.sumOfSquares() still
depends on the number of terms in the query.  I guess it's not
possible to match the score of BooleanQuery without first iterating
over all terms in the Weight?

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Totally untested, but here is a hack at what the scorer might look
like when the number of terms is large.

-Yonik


package org.apache.lucene.search;

import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;

import java.io.IOException;

/**
 * @author yonik
 * @version $Id$
 */
public class MultiTermScorer extends Scorer{
  protected final float[] scores;
  protected int pos;
  protected float docScore;

  public MultiTermScorer(Similarity similarity, IndexReader reader,
Weight w, TermEnum terms, byte[] norms, boolean include_idf, boolean
include_tf) throws IOException {
super(similarity);
float weightVal = w.getValue();
int maxDoc = reader.maxDoc();
this.scores = new float[maxDoc];
float[] normDecoder = Similarity.getNormDecoder();

TermDocs tdocs = reader.termDocs();
while (terms.next()) {
  tdocs.seek(terms);
  float termScore = weightVal;
  if (include_idf) {
termScore *= similarity.idf(terms.docFreq(),maxDoc);
  }
  while (tdocs.next()) {
int doc = tdocs.doc();
float subscore = termScore;
if (include_tf) subscore *= tdocs.freq();
if (norms!=null) subscore *= normDecoder[norms[doc & 0xff]];
scores[doc] += subscore;
  }
}

pos=-1;
  }

  // could also use a bitset to keep track of docs in the set...
  public boolean next() throws IOException {
while (++pos < scores.length) {
  if (scores[pos] != 0) return true;
}
return false;
  }

  public int doc() {
return pos;
  }

  public float score() throws IOException {
return scores[pos];
  }

  public boolean skipTo(int target) throws IOException {
pos=target-1;
return next();
  }

  public Explanation explain(int doc) throws IOException {
return null;
  }
}



-Yonik
Now hiring -- http://forms.cnet.com/slink?231706




[jira] Closed: (LUCENE-465) surround test code is incompatible with *Test pattern in test target.

2005-11-15 Thread Daniel Naber (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-465?page=all ]
 
Daniel Naber closed LUCENE-465:
---

Fix Version: 1.9
 Resolution: Fixed

Thanks, committed.

 surround test code is incompatible with *Test pattern in test target.
 -

  Key: LUCENE-465
  URL: http://issues.apache.org/jira/browse/LUCENE-465
  Project: Lucene - Java
 Type: Bug
   Components: Other
 Versions: CVS Nightly - Specify date in submission
 Reporter: paul.elschot
 Priority: Minor
  Fix For: 1.9
  Attachments: BooleanQueryTst.java, ExceptionQueryTst.java, 
 SrndRenameTestPatch1.txt

 Attachments to follow:
 renamed BooleanQueryTest to BooleanQueryTst,
 renamed ExceptionQueryTest to ExceptionQueryTst,
 patch for the remainder of the test code to use the renamed classes.






Re: Lucene Index backboned by DB

2005-11-15 Thread jian chen
Dear All,

I have some thoughts on this issue as well.

1) It might be OK to implement retrieving field values separately for a
document. However, I think from a simplicity point of view, it might be
better to have the application code do this drudgery. Adding this feature
could complicate the nice and simple design of Lucene without much benefit.

2) The application could separate a document into several documents, for
example, one document mainly for indexing, and other documents for storing
binary values for different fields. Thus, given the relevant doc id, its
associated binary value for a particular field could be loaded very fast
with just a disk lookup (looking up the fdx file).

This way, only the relevant field is loaded into memory rather than all of
the fields for a doc. There is no change on the Lucene side, only some more
work for the application code.

My view is that a search library (or, in general, any library) should be
small and efficient; since it is used by lots of applications, any
additional feature could potentially impact its robustness or drag its
performance down.

Any critiques or comments are welcome.

Jian

On 11/15/05, Robert Kirchgessner [EMAIL PROTECTED] wrote:

 Hi,

 a discussion in

 http://issues.apache.org/jira/browse/LUCENE-196

 might be of interest to you.

 Did you think about storing the large pieces of documents
 in a database to reduce the size of Lucene index?

 I think there are good reasons to add support for
 storing fields in separate files:

 1. One could define a binary field of fixed length, store it
 in a separate file, then load it into memory and have fast
 access to the field contents.

 A use case might be: store a calendar date (YYYY-MM-DD)
 in three bytes: 4 bits for the month, 5 bits for the day, and up to
 15 bits for the year. If you want to retrieve hits sorted by date,
 you can load the fields file of size (3 * documents in index) bytes
 and support sorting by date without accessing the hard drive
 to read dates.

 2. One could store document contents in a separate
 file, and fields of small size like the title and some metadata
 in the way they are stored now. This could speed up access to
 fields. It would be interesting to know whether you gain
 significant performance by leaving the big chunks out, i.e.
 not storing them in the index.

 In my opinion 1. is the most interesting case: storing some
 binary fields (dates, prices, lengths, any numeric metrics of
 documents) would enable *really* fast sorting of hits.

 Any thoughts about this?
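The three-byte packing sketched in point 1 could look like this. This is a hypothetical helper, not existing Lucene code; the 15/4/5-bit split follows the layout suggested above, with the low 24 bits of the int fitting in three bytes on disk:

```java
public class DatePack {
    // Pack a date into 24 bits: 15 bits year, 4 bits month, 5 bits day.
    static int pack(int year, int month, int day) {
        return (year << 9) | (month << 5) | day;
    }

    // Recover {year, month, day} from the packed value.
    static int[] unpack(int packed) {
        return new int[] { packed >>> 9, (packed >>> 5) & 0xF, packed & 0x1F };
    }

    public static void main(String[] args) {
        int p = pack(2005, 11, 15);
        int[] u = unpack(p);
        System.out.println(u[0] + "-" + u[1] + "-" + u[2]);
    }
}
```

Because year occupies the high bits, sorting by date reduces to comparing the packed ints, with the whole column held in 3 bytes per document.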

 Regards,

 Robert



 We have a similar problem.

 Am Dienstag, 15. November 2005 23:23 schrieb Karel Tejnora:
  Hi all,
  in our testing application using lucene 1.4.3. Thanks you guys for
  that great job.
  We have index file around 12GiB, one file (merged). To retrieve hits it
  takes nice small amount of the time, but reading fields takes 10-100
  times more (the stored ones). I think because all the fields are read.
  I would like to try implement lucene index files as tables in db with
  some lazy fields loading. As I have searched web I have found only impl.
  of the store.Directory (bdb), but it only holds data as binary streams.
  This technique will be not so helpful because BLOB operations are not
  fast performing. On another side I will have a lack of the freedom from
  documents fields variability but I can omit a lot of the skipping and
  many opened files. Also IndexWriter can have document/term locking
  granuality.
  So I think that way leads to extends IndexWriter / IndexReader and have
  own implementation of index.Segment* classes. It is the best way or I
  missing smthg how achieve this?
  If it is bad idea, I will be happy to heard another possibilities.
 
  I would like also join development of the lucene. Is there some points
  how to start?
 
  Thx for reading this,
  sorry if I did some mistakes
 
  Karel
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Paul Elschot
On Tuesday 15 November 2005 20:30, Doug Cutting wrote:
 Paul Elschot wrote:
  Not using the document term frequencies in PrefixQuery would still
  leave these as a surprise factor between PrefixQuery and TermQuery.
 
 Should we dynamically decide to switch to FieldNormQuery when 
 BooleanQuery.maxClauseCount is exceeded?  That way queries that 
 currently work would continue to work, and queries that formerly failed 
 would now work.  How complicated would this be to implement?

Why not leave that decision to the program using the query?
Something like this:
- catch the TooManyClauses exception,
- adapt (the offending parts of) the query to make them use
  a FieldNormQuery,
- retry, with a warning to the provider of the query that
  the term frequencies have been ignored.

The only disadvantage of this is that the term expansion
during rewrite has to be redone.
Also, when enough terms are involved, they might still cause
a memory problem because they are all needed at the same
time.
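The catch-and-retry flow above might be sketched like this. Since FieldNormQuery does not exist yet, the Query interface and TooManyClauses class here are stand-ins for Lucene's real types, not its actual API:

```java
public class RetryDemo {
    // Stand-in for BooleanQuery.TooManyClauses.
    static class TooManyClauses extends RuntimeException {}

    // Stand-in for a query whose execution may blow the clause limit.
    interface Query { int[] run() throws TooManyClauses; }

    // Run the query; on TooManyClauses, warn the caller and retry with a
    // constant-scoring (FieldNormQuery-style) rewrite of the query.
    static int[] searchWithFallback(Query primary, Query fallback) {
        try {
            return primary.run();
        } catch (TooManyClauses e) {
            System.err.println("warning: term frequencies ignored for this query");
            return fallback.run();
        }
    }

    public static void main(String[] args) {
        Query big = () -> { throw new TooManyClauses(); };
        Query constantScore = () -> new int[] {1, 2, 3}; // pretend results
        System.out.println(searchWithFallback(big, constantScore).length);
    }
}
```

As noted above, the rewrite's term expansion is redone on the retry; this sketch does not avoid that cost.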

Regards,
Paul Elschot

