________________________________
Från: Yonik Seeley [mailto:[EMAIL PROTECTED] Skickat: to 2006-05-18 20:39 Till: [email protected] Ämne: Re: Sort problematics On 5/18/06, Marcus Falck <[EMAIL PROTECTED]> wrote: >> I'm well aware of the trade offs. But if you were aware of the large amounts >> of data that this system should be able to search you woldn't propose the >> usage of a database. >If you have a hard requirement of instantly seeing any update, you can't use Lucene. That's more database-like functionallity. That's why I asked. Instant in this case isn't really instant. Lets say that the MAXIMUM time that will be accepted is 5 minutes. Since i have the altert service up and running all users that pays for immediate alerts will get their hits from it. >> Since I have an separate alert service for immediatly alerts up and running >> i may be able to do trade offs with the data availability timings, and hold >> the indexsearcher open for a longer period. >That's pretty much a requirement for using Lucene to support a decent query rate. I thought i had good query rates when i instantiated the IndexSearcher for every search. I mean in my example index on 10GB i had response times under a second when queried using a boolean query containing approximently 200 terms. But I will redesign using a more static behavior for the IndexSearcher and recreate it on a reqular basis. After this redesign i suppose the search will benefit from the fieldcache (until i recreate). And as you say 800MB is nothing. I will without problems have atleast 4GB RAM in each search machine. >> But still. The memory is the problem. >> I mean how much memory would the fieldcache take for 500 Millon newsletter >> articles? Probably a lot, >> ok the system is scaled out over different machines so in reality each >> machine won't have 500 Million docs but maybe around 100Million. >Depends on what you are sorting by... for an int/float 100M*4 or 800MB. Big, but possible. >> So i'm still interesting in changing the relevance. >> Any ideas? >Depends on what you are sorting by, and how many different ways you >want to sort. If it's a single sort criteria, you can use index-time >boosts. If you can sort multiple ways, avoiding the fieldcache >probably won't help you because the time to retrieve the per-doc sort >info via termvectors or stored fields will take too long. There is however a problem with boosting the docs since the boost factor is of type float. A float doesn't have the resolution needed to differ on a second basis. I will illustrate my sort/ranking need with an example: If i use lucene default implementation of the TermScorer and search for "you" OR "her" The term scorer will give higher score on documents containing both terms. This is a problem (in our application) since in this case want the same score on documents as long as they contain 1 of the terms (since we are dealing with newsletter observation for companies they want to get the hits ordered by date to get the complete overview). I tested to rewrite the TermScorer to give me the same score with success. So my question is. If i can modify the score at search time using the Score class why does everybody talk about the Similarity class? / Marcus
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
