Chris,

Thanks for all your invaluable comments. The killer was the fact that the 
timestamp for each document was unique. For a search with millions of results, 
this resulted in allocation of millions of strings during the sorting step 
(FieldCacheImpl.getStrings). With some loss of precision, I rounded down the 
timestamp to DateTools.MINUTE resolution in the indexing phase (but I still 
store it as a string). The cool thing about this is that it limits the maximum 
number of distinct strings for this field (about 525,600 per year of data). So 
getStrings never allocates more than roughly 0.5 million strings per year 
covered. With this, the sort is suitably fast 
(amazingly fast in fact, 13,000,000 documents sorted in < 10 seconds on a 
middling hardware configuration). I don't need a 64-bit JVM either :). 
Beautiful.
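
For anyone following along, here is a minimal plain-JDK sketch of the rounding 
idea. It does not call Lucene; the class and method names are made up for 
illustration, and the yyyyMMddHHmm/UTC format is my assumption of what a 
minute-resolution DateTools string looks like. It only shows why the number of 
distinct terms collapses once timestamps are truncated to the minute.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;
import java.util.TimeZone;

// Hypothetical helper, not part of Lucene.
class MinuteTimestamps {
    private static final long MILLIS_PER_MINUTE = 60_000L;

    /** Rounds a millisecond timestamp down to the start of its minute. */
    static long roundDownToMinute(long millis) {
        return Math.floorDiv(millis, MILLIS_PER_MINUTE) * MILLIS_PER_MINUTE;
    }

    /** Formats the rounded value as a sortable yyyyMMddHHmm string in UTC (assumed format). */
    static String toMinuteString(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(roundDownToMinute(millis)));
    }

    /** Counts distinct minute strings when there is one document per second. */
    static int distinctTerms(long startMillis, int seconds) {
        Set<String> terms = new HashSet<>();
        for (int s = 0; s < seconds; s++) {
            terms.add(toMinuteString(startMillis + s * 1000L));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        // One hour of per-second timestamps yields one term per minute.
        System.out.println(distinctTerms(0L, 3600));
    }
}
```

With unique per-document timestamps the term count grows with the number of 
documents; after rounding it grows only with the time span covered.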

Of course, I can improve this by storing the modified timestamp as an 
integer...that's the next step.  
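
A rough sketch of what I have in mind for that next step, assuming the value 
stored is "minutes since the epoch" (my choice, nothing Lucene mandates). The 
class and method names are hypothetical. At minute resolution the value fits 
in an int with a few thousand years to spare, so it could be sorted as an int 
rather than as a string.

```java
// Hypothetical helper, not part of Lucene.
class MinuteInts {
    private static final long MILLIS_PER_MINUTE = 60_000L;

    /** Converts epoch milliseconds to whole minutes since the epoch; throws if it overflows an int. */
    static int toEpochMinutes(long millis) {
        return Math.toIntExact(Math.floorDiv(millis, MILLIS_PER_MINUTE));
    }

    /** Converts minutes since the epoch back to epoch milliseconds (start of that minute). */
    static long toMillis(int epochMinutes) {
        return epochMinutes * MILLIS_PER_MINUTE;
    }

    public static void main(String[] args) {
        int minutes = toEpochMinutes(System.currentTimeMillis());
        System.out.println(minutes + " -> " + toMillis(minutes));
    }
}
```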


----- Original Message ----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, January 26, 2007 11:15:05 AM
Subject: Re: Extending scoring to eliminate sorting on timestamp


: I used String because the timestamp is a Long and there wasn't any
: SortField.LONG (I guess I should have used SortField.CUSTOM). In this
: case, what should the indexing call look like? Currently, I have:

:     doc.add(new Field("timestamp", Long.toString(timestamp), Field.Store.NO, Field.Index.UN_TOKENIZED));

for the record, what you've got there won't sort correctly as a string
anyway, the number 923456 will sort after the number 12345678 because as a
string it's lexicographically larger ... you'd need to zero-pad the value, or
use something like NumberTools.
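
a minimal plain-JDK sketch of the padding idea, assuming non-negative values
and a fixed width of 19 digits (enough for any positive long). the class and
method names are hypothetical, and this is not how NumberTools encodes its
values, it just illustrates why fixed-width strings compare in numeric order:

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
class PaddedLongs {
    /** Left-pads a non-negative long with zeros to 19 digits so string order matches numeric order. */
    static String pad(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%019d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "923456" compares greater than "12345678" because '9' > '1'.
        System.out.println("923456".compareTo("12345678") > 0);
        // Padded: the comparison agrees with the numeric order.
        System.out.println(pad(923456L).compareTo(pad(12345678L)) < 0);
    }
}
```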

i never really noticed there was an int field cache, but no long field
cache ... either way, if you are trying to avoid the FieldCache because of
the time it takes to initialize and not the memory it takes up, then long
support in FieldCache wouldn't help you.

: The other thing I was considering is to automatically limit the number
: of results (there is no way a user can grok 3 million results anyway) by
: breaking down the range filter into a series of range filters and
: executing multiple searches in series until the max number of results

i'm not sure how that would help you unless you did all of the sorting
external to Lucene (or used something like the patch i mentioned)

: case (when the number of results is reasonable). One way around this is
: to execute an initial search just to figure out the number of hits
: (without sorting, without scoring) and then apply different strategies,

if you try this use Sort.INDEXORDER .. short of writing your own
HitCollector i believe it's the best way to eliminate any extra work
Lucene might do for you to sort results.

: The patch you pointed out looks very very promising.

There's been some disagreement as to how useful it is -- i don't know if
anyone has had time to do a thorough performance analysis of it in
different situations.  if you try it out please comment in the issue with
your experiences (good or bad)


----- Original Message ----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, January 25, 2007 9:09:12 PM
Subject: Re: Extending scoring to eliminate sorting on timestamp


: For various reasons, we'd like to eliminate the sort step.

can you elaborate on what those reasons are?

FunctionQuery (in the solr code base, you'll find lots of discussion in
the archives of this list) can let you use a numeric field value in the
score calculation, but it still uses the FieldCache so if you are trying
to avoid that for space/time reasons it won't help.

you may also be interested in this patch...

  https://issues.apache.org/jira/browse/LUCENE-769

in the general case it should be slower than standard sorting, but if
you are dealing with an extremely large index and your result sets all
tend to be small, it may be faster (and it won't pay the initial
FieldCache setup time on frequently modified indexes)

:                    new SortField("timestamp",SortField.STRING,true)}));

why are you sorting timestamps as strings? ... if you sort them as ints,
your FieldCache will be a whole hell of a lot smaller (i'm guessing very
few documents have identical timestamps, so your FieldCache should shrink
by at least half if you sort on ints, and probably by a lot more).


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



