SV: Sort problematics

Marcus Falck Thu, 18 May 2006 13:53:52 -0700

 

________________________________

From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Thu 2006-05-18 20:39
To: java-user@lucene.apache.org
Subject: Re: Sort problematics

On 5/18/06, Marcus Falck <[EMAIL PROTECTED]> wrote:
>> I'm well aware of the trade-offs. But if you were aware of the large amounts
>> of data that this system should be able to search, you wouldn't propose the
>> usage of a database.

>If you have a hard requirement of instantly seeing any update, you
>can't use Lucene.  That's more database-like functionality. That's
>why I asked.

Instant in this case isn't really instant. Let's say that the MAXIMUM delay that
will be accepted is 5 minutes. Since I have the alert service up and running,
all users who pay for immediate alerts will get their hits from it.

>> Since I have a separate alert service for immediate alerts up and running,
>> I may be able to trade off data availability timings and hold
>> the IndexSearcher open for a longer period.

>That's pretty much a requirement for using Lucene to support a decent
>query rate.

I thought I had good query rates when I instantiated the IndexSearcher for
every search. In my example index of 10GB I had response times under a
second for a boolean query containing approximately 200 terms.
But I will redesign to keep the IndexSearcher open for longer and
recreate it on a regular basis. After this redesign I suppose the search will
benefit from the FieldCache (until I recreate the searcher). And as you say,
800MB is nothing. I will without problems have at least 4GB RAM in each search
machine.

>> But still, the memory is the problem.
>> I mean, how much memory would the FieldCache take for 500 million newsletter
>> articles? Probably a lot.
>> OK, the system is scaled out over different machines, so in reality each
>> machine won't have 500 million docs but maybe around 100 million.

>Depends on what you are sorting by... for an int/float 100M*4 or
>800MB.  Big, but possible.
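
As a rough check of the arithmetic above, here is a minimal sketch that
assumes the sort cache holds exactly one fixed-width value per document and
nothing else. The class and method names are made up for illustration and are
not part of Lucene. Under that assumption, 100 million docs at 4 bytes each
come to about 400MB per sort field, so the 800MB figure corresponds to roughly
twice that (for example two such arrays in memory at once, which is only a
guess at what was meant).

```java
// Back-of-the-envelope estimate only; real memory use depends on the Lucene
// version, the field type and JVM overhead.
class FieldCacheEstimate {

    // One fixed-width value per document (e.g. 4 bytes for an int or float).
    static long bytesFor(long docs, int bytesPerValue) {
        return docs * bytesPerValue;
    }

    // Decimal megabytes (1 MB = 1,000,000 bytes).
    static double toMegabytes(long bytes) {
        return bytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        long perMachine = bytesFor(100_000_000L, 4);
        long wholeIndex = bytesFor(500_000_000L, 4);
        System.out.println(toMegabytes(perMachine) + " MB per machine");
        System.out.println(toMegabytes(wholeIndex) + " MB for all docs");
    }
}
```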

>> So I'm still interested in changing the relevance.
>> Any ideas?

>Depends on what you are sorting by, and how many different ways you
>want to sort.  If it's a single sort criteria, you can use index-time
>boosts.  If you can sort multiple ways, avoiding the fieldcache
>probably won't help you because the time to retrieve the per-doc sort
>info via termvectors or stored fields will take too long.

There is, however, a problem with boosting the docs, since the boost factor is
of type float. A float doesn't have the resolution needed to distinguish
documents whose dates differ by only a second.
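
To make the resolution problem concrete, here is a small sketch in plain Java
(no Lucene involved). It assumes the date would be carried in the boost as
seconds since 1970, which is only one possible encoding; the class name and
the sample timestamp are illustrative. A float has a 24-bit significand, so
at values around 1.1 billion the adjacent representable floats are 128 apart,
and timestamps one second apart collapse to the same value.

```java
// Shows that epoch-second timestamps lose one-second resolution as floats.
class FloatResolution {

    // True when both timestamps round to the same float value.
    static boolean collide(long a, long b) {
        return (float) a == (float) b;
    }

    public static void main(String[] args) {
        long t = 1147985632L; // an epoch-second value from May 2006

        // Two timestamps one second apart become indistinguishable.
        System.out.println(collide(t, t + 1)); // prints "true"

        // Spacing between adjacent floats at this magnitude, in seconds.
        System.out.println(Math.ulp((float) t)); // prints "128.0"
    }
}
```

So even before the boost is stored in the index, dates encoded this way could
only be told apart in steps of about two minutes.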

I will illustrate my sort/ranking need with an example: 

If I use Lucene's default implementation of the TermScorer and search for

"you" OR "her"

the term scorer will give a higher score to documents containing both terms.
This is a problem (in our application), since in this case we want the same
score for all documents as long as they contain at least one of the terms. We
are dealing with newsletter monitoring for companies, and they want the hits
ordered by date to get the complete overview. I tried rewriting the TermScorer
to give me the same score, and it worked. So my question is:
if I can modify the score at search time using the Scorer class, why does
everybody talk about the Similarity class?
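
The ordering I am after can be sketched in plain Java, without any Lucene
classes. Everything here (the Doc class, the search method, the sample data)
is hypothetical and only illustrates the desired behavior: every document
containing at least one query term counts as an equal hit, and hits are then
ordered newest first.

```java
import java.util.*;

// Illustration of "flat score, then order by date"; not a Lucene API.
class FlatScoreByDate {

    static final class Doc {
        final String id;
        final Set<String> terms;
        final long epochSeconds;

        Doc(String id, long epochSeconds, String... terms) {
            this.id = id;
            this.epochSeconds = epochSeconds;
            this.terms = new HashSet<>(Arrays.asList(terms));
        }
    }

    // A doc is a hit if it shares at least one term with the query; matching
    // more terms gives no advantage. Hits are returned newest first.
    static List<String> search(List<Doc> docs, Set<String> queryTerms) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (!Collections.disjoint(d.terms, queryTerms)) {
                hits.add(d);
            }
        }
        hits.sort((a, b) -> Long.compare(b.epochSeconds, a.epochSeconds));

        List<String> ids = new ArrayList<>();
        for (Doc d : hits) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("a", 100L, "you", "her"),
            new Doc("b", 300L, "you"),
            new Doc("c", 250L, "them"),
            new Doc("d", 200L, "her"));
        Set<String> query = new HashSet<>(Arrays.asList("you", "her"));
        System.out.println(search(docs, query)); // prints "[b, d, a]"
    }
}
```

Doc "a" contains both terms but is the oldest, so it comes last.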

/ 

Marcus
