lucene webcrawler/dbms indexing framework

2004-04-01 Thread Woolly Mammoth
Hi All, I have seen some discussion in the past around LARM & other web crawler indexing code, but not much output. I have started a project on SF http://sourceforge.net/projects/knine, and have committed some initial framework code to CVS (despite the front page saying there are not commits

Ordered span query with more than 2 subqueries: avoid?

2004-04-01 Thread Paul Elschot
Dear readers, (Not sure whether this would be better posted to lucene-user.) A test of the ordered span query with three terms: w1 w2 w3 and slop 1 against document: w1 w3 w2 w3 fails. The javadoc (1.4 rc3) of SpanNearQuery gives: Matches spans which are near one another. One can spec

Term Occurrence Weight

2004-04-01 Thread Roy
Hello, Mr. Cutting: I am doing some experiments in IR. I am trying to add additional attributes about term occurrences in documents to improve precision. The attributes may include capitalization, font size, font color, etc. How can I implement that with Lucene? Can you give me some hints? Thank yo

DO NOT REPLY [Bug 28108] - 1.3 Final Release - TooManyClauses error not allowing search to work

2004-04-01 Thread bugzilla
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT . ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE. http://issues.apache.org/bugzilla/show_bu

Re: MultiReader

2004-04-01 Thread Doug Cutting
Christoph Goller wrote: As far as I understand, MultiReader is a kind of hybrid, trying to achieve two different and partly conflicting goals: 1) It replaces SegmentsReader, which was an IndexReader accessing several SegmentReaders in one directory and doing the synchronization (for delete) for the

MultiReader

2004-04-01 Thread Christoph Goller
Hi folks, you did a great job during the last 4 months. Due to other obligations I was not able to keep track of all these contributions and new features such as: MultiReader, TermDocs.skipTo & Scorer.skipTo, TermVectors, result sorting, SpanQueries. Fortunately, I have now been able to catch up.

Re: Performance of hit highlighting and finding term positions for

2004-04-01 Thread markharw00d
730 msecs is the correct number for 10 * 16k docs with StandardTokenizer! The 11ms per doc figure in my post was for highlighting using a lower-case-filter-only analyzer. 5ms of this figure was the cost of the lower-case-filter-only analyzer. 73 msecs is the cost of JUST StandardTokenizer (n