Re: Caching of TermDocs

2004-07-27 Thread John Patterson
Cool.  I'll give it a try.  Looks like extending FilterIndexReader is the
way to go.  Or possibly I could cache the compressed form at a lower level,
getting the best of both worlds.  I'll look into both ways, profile the app,
and post my results.
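
For anyone curious, here is roughly the shape I have in mind -- a sketch
only, untested, against the Lucene 1.4-era API.  PostingsCache is a made-up
name, the map here is unbounded (a real cache would need eviction), and to
plug it into search you would still need a TermDocs implementation over the
cached arrays, e.g. via a FilterIndexReader wrapper:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

/** Caches decoded postings (doc numbers and frequencies) per term. */
public class PostingsCache {

  private final IndexReader reader;
  private final Map cache = new HashMap(); // Term -> int[2][] {docs, freqs}

  public PostingsCache(IndexReader reader) {
    this.reader = reader;
  }

  /** Returns {docs, freqs} for the term, decoding and caching on first use. */
  public synchronized int[][] postings(Term term) throws IOException {
    int[][] cached = (int[][]) cache.get(term);
    if (cached == null) {
      cached = decode(term);
      cache.put(term, cached); // NOTE: unbounded; a real cache must evict
    }
    return cached;
  }

  // Decode the term's full posting list into parallel arrays.
  private int[][] decode(Term term) throws IOException {
    TermDocs td = reader.termDocs(term);
    int[] docs = new int[16];
    int[] freqs = new int[16];
    int n = 0;
    try {
      while (td.next()) {
        if (n == docs.length) { // grow both arrays
          int[] d = new int[n * 2];
          int[] f = new int[n * 2];
          System.arraycopy(docs, 0, d, 0, n);
          System.arraycopy(freqs, 0, f, 0, n);
          docs = d;
          freqs = f;
        }
        docs[n] = td.doc();
        freqs[n] = td.freq();
        n++;
      }
    } finally {
      td.close();
    }
    int[] d = new int[n];
    int[] f = new int[n];
    System.arraycopy(docs, 0, d, 0, n);
    System.arraycopy(freqs, 0, f, 0, n);
    return new int[][] { d, f };
  }
}

Per Doug's point below, the decoded form is several times the on-disk size,
so bounding the cache is essential.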

- Original Message - 
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, July 27, 2004 8:33 PM
Subject: Re: Caching of TermDocs


> John Patterson wrote:
> > I would like to hold a significant amount of the index in memory but use
> > the disk index as a spill-over.  Obviously the best situation is to hold
> > in memory only the information that is likely to be used again soon.  It
> > seems that caching TermDocs would allow popular search terms to be
> > searched more efficiently while the less common terms would need to be
> > read from disk.
>
> The operating system already caches recent disk i/o.  So what you'd save
> primarily would be the overhead of parsing the data.  However the parsed
> form, a sequence of docNo and freq ints, is nearly eight times as large
> as its compressed size in the index.  So your cache would consume a lot
> of memory.
>
> Whether this provides much overall speedup depends on the distribution
> of common terms in your query traffic.  If you have a few terms that are
> searched very frequently then it might pay off.  In my experience with
> general-purpose search engines this is not usually the case: folks seem
> to use rarer words in queries than they do in ordinary text.  But in
> some search applications perhaps the traffic is more skewed.  Only some
> experiments would tell for sure.
>
> Doug


Re: Caching of TermDocs

2004-07-27 Thread John Patterson
The caching by TermScorer of the next 32 docs is a way to speed up the
serial (in-order) reading of docs from the TermDocs object (probably coming
directly from disk).
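
Concretely, that read-ahead amounts to something like the following (a
sketch against the 1.4-era API; score() is a made-up stand-in for
TermScorer's inner loop):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

void scoreAll(IndexReader reader, Term term) throws IOException {
  int[] docs = new int[32];   // TermScorer uses 32-entry buffers
  int[] freqs = new int[32];
  TermDocs termDocs = reader.termDocs(term);
  try {
    int count;
    while ((count = termDocs.read(docs, freqs)) > 0) { // decode one batch
      for (int i = 0; i < count; i++) {
        score(docs[i], freqs[i]); // hypothetical per-document hook
      }
    }
  } finally {
    termDocs.close();
  }
}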

I would like to hold a significant amount of the index in memory but use the
disk index as a spill-over.  Obviously the best situation is to hold in
memory only the information that is likely to be used again soon.  It seems
that caching TermDocs would allow popular search terms to be searched more
efficiently while the less common terms would need to be read from disk.

Has anyone else done this?  Know of a better approach?

- Original Message - 
From: "Paul Elschot" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, July 27, 2004 3:07 AM
Subject: Re: Caching of TermDocs


> On Monday 26 July 2004 21:41, John Patterson wrote:
>
> > Is there any way to cache TermDocs?  Is this a good idea?
>
> Lucene does this internally by buffering
> up to 32 document numbers in advance for a query Term.
> You can view the details here in case you're interested:
>
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java
> It uses the TermDocs.read() method to fill a buffer of document numbers.
>
> Is this what you had in mind?
>
> Regards,
> Paul


Caching of TermDocs

2004-07-26 Thread John Patterson
Is there any way to cache TermDocs?  Is this a good idea?



Re: Lucene vs. MySQL Full-Text

2004-07-22 Thread John Patterson
I used the MySQL full text search to index about 70K business directory
records.  It became impossibly slow, and I ended up creating my own text
search engine, similar in concept to Lucene but database-driven.  It worked
much faster than the native MySQL full text search.

Other limitations of MySQL MATCH syntax:
- only words of four letters or more are indexed (if you lower this limit,
searches become VERY slow)
- the relevance value MATCH returns is next to useless (it ranges wildly and
is not normalized the way Lucene scores are)
- you cannot weight certain fields as more important than others.

Really, it is very limited.
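
For reference, the MATCH syntax in question, driven from JDBC (a sketch;
the table and column names are invented).  Note that a single MATCH applies
one score across its whole column list, so there is no per-column weighting:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

ResultSet search(Connection con, String words) throws SQLException {
  PreparedStatement ps = con.prepareStatement(
      "SELECT id, name, MATCH(name, description) AGAINST(?) AS score " +
      "FROM business WHERE MATCH(name, description) AGAINST(?) " +
      "ORDER BY score DESC");
  ps.setString(1, words);
  ps.setString(2, words);
  return ps.executeQuery(); // 'score' is the raw, unnormalized MATCH value
}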

John.

- Original Message - 
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, July 23, 2004 1:23 AM
Subject: RE: Lucene vs. MySQL Full-Text


I also question whether it could handle extreme volume with such good query
speed.

Has anyone done numbers with 1+ million documents?

-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 20, 2004 5:44 PM
To: Lucene Users List
Subject: Re: Lucene vs. MySQL Full-Text


On Tuesday 20 July 2004 21:29, Tim Brennan wrote:

> Does anyone out there have
> anything more concrete they can add?

Stemming is still on the MySQL TODO list:
http://dev.mysql.com/doc/mysql/en/Fulltext_TODO.html

Also, for most people it's easier to extend Lucene than MySQL (as MySQL is
written in C(++?)), and there are more powerful queries in Lucene, e.g.
fuzzy phrase search.

Regards
 Daniel

-- 
http://www.danielnaber.de
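
To make the "more powerful queries" point concrete, here are a few examples
against the 1.4-era QueryParser (a sketch; the field names are invented):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

Query examples() throws ParseException {
  StandardAnalyzer analyzer = new StandardAnalyzer();
  // Fuzzy term: matches "restaurant" despite the typo.
  Query fuzzy = QueryParser.parse("restarant~", "name", analyzer);
  // Proximity ("sloppy") phrase: terms within 3 positions of each other.
  Query sloppy = QueryParser.parse("\"italian food\"~3", "description", analyzer);
  // Per-field boost: a "name" match counts four times a "description" match.
  return QueryParser.parse("name:pizza^4 OR description:pizza", "description",
      analyzer);
}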



Unnecessary scan with required terms

2004-07-22 Thread John Patterson
Hi,

I have been looking at how Lucene operates with queries where all terms are
required.  I expected that the algorithm would step through each term to
confirm that it exists in the index and, as soon as a clause is found that
does not occur, abort the search.  As far as I can tell this does not
happen: the search continues on to find the frequency of the other terms
even though no hits will be returned.
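
In effect, I expected something like the following pre-check (a hypothetical
helper, not existing Lucene API, though IndexReader.docFreq() itself is real):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Hypothetical: returns false as soon as any required term is absent,
// in which case the whole search can be skipped (zero hits guaranteed).
boolean allRequiredTermsExist(IndexReader reader, Term[] required)
    throws IOException {
  for (int i = 0; i < required.length; i++) {
    if (reader.docFreq(required[i]) == 0) {
      return false;
    }
  }
  return true;
}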

This occurs during the call to Query.weight(), when the weightings are
calculated before terms are scored.

Is this correct?

Thanks,

John.



Re: Weighting database fields

2004-07-21 Thread John Patterson
Thanks, that was what I was after!
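
For the archive, a minimal sketch of both suggestions against the 1.4-era
API (the field names, values, and boost factor are invented):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Index time: boost the field itself, as Erik suggests.
Document makeDoc(String name, String description) {
  Document doc = new Document();
  Field nameField = Field.Text("name", name);
  nameField.setBoost(3.0f); // matches in "name" score higher
  doc.add(nameField);
  doc.add(Field.Text("description", description));
  return doc;
}

// Search time: boost the clause against "name", as Anson suggests.
BooleanQuery makeQuery(String word) {
  BooleanQuery query = new BooleanQuery();
  TermQuery nameClause = new TermQuery(new Term("name", word));
  nameClause.setBoost(3.0f);
  query.add(nameClause, false, false); // optional, not prohibited
  query.add(new TermQuery(new Term("description", word)), false, false);
  return query;
}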

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, July 21, 2004 9:52 PM
Subject: Re: Weighting database fields


> On Jul 21, 2004, at 10:09 AM, Anson Lau wrote:
> > Apply boost factor to fields when you do a lucene search.
> 
> Or... set the boost on the Field during indexing.
> 
> Erik
> 
> 
> >
> > Anson
> >
> > -Original Message-
> > From: John Patterson [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, July 22, 2004 12:07 AM
> > To: [EMAIL PROTECTED]
> > Subject: Weighting database fields
> >
> > Hi,
> >
> > What is the best way to get Lucene to assign weightings to certain 
> > fields
> > from a database?  For example, the 'name' field should be weighted 
> > higher
> > than the 'description' field.
> >
> > Thanks,
> >
> > John.


Weighting database fields

2004-07-21 Thread John Patterson
Hi,

What is the best way to get Lucene to assign weightings to certain fields
from a database?  For example, the 'name' field should be weighted higher
than the 'description' field.

Thanks,

John.
