I have had a similar problem. What I do is load all the date field values at index startup and convert each date (timestamp) to an integer offset: the number of seconds since 1970-01-01. I then pre-sort that array using a very fast O(n) distribution sort, and keep an array of integers that is the pre-sorted permutation of all documents in the index, so that for docid = N, perm[N] = sorted order. Getting the sorted order of a result set then only takes enumerating the docids in the results (from a bit array). Our index is approximately 38 million docs; sorting by date takes around 20 ms.
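Below is a minimal sketch of that permutation idea in plain Java. It is not the poster's actual code: the class and method names are made up, the per-document timestamps are assumed to be loaded elsewhere at startup, and Arrays.sort stands in for the O(n) distribution sort mentioned above.

    import java.util.Arrays;
    import java.util.BitSet;
    import java.util.Comparator;

    class DatePermutation {
        final int[] perm;                        // perm[docId] = rank of the doc in date order

        // timestampForDoc[docId] = seconds since 1970-01-01, loaded at index startup.
        DatePermutation(final long[] timestampForDoc) {
            int maxDoc = timestampForDoc.length;
            Integer[] byDate = new Integer[maxDoc];
            for (int i = 0; i < maxDoc; i++) byDate[i] = i;
            // Arrays.sort stands in for the O(n) distribution sort described above.
            Arrays.sort(byDate, new Comparator<Integer>() {
                public int compare(Integer a, Integer b) {
                    long x = timestampForDoc[a], y = timestampForDoc[b];
                    return x < y ? -1 : (x > y ? 1 : 0);
                }
            });
            perm = new int[maxDoc];
            for (int rank = 0; rank < maxDoc; rank++) perm[byDate[rank]] = rank;
        }

        // Query time: enumerate the matching doc ids from the result bit set and
        // order them by their precomputed rank. Only the hits get sorted, as ints.
        int[] sortResults(BitSet hits) {
            long[] keyed = new long[hits.cardinality()];
            int n = 0;
            for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
                keyed[n++] = ((long) perm[doc] << 32) | doc;   // rank in high bits, doc id in low bits
            }
            Arrays.sort(keyed);                                // ascending by rank
            int[] docs = new int[n];
            for (int i = 0; i < n; i++) docs[i] = (int) keyed[i];
            return docs;
        }
    }

The key property is that the expensive sort over all documents happens once per index (re)load, while each query only sorts one small array of keys, one per hit.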
-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]]
Sent: Friday, October 10, 2008 11:45 AM
To: java-user@lucene.apache.org
Subject: Re: Question regarding sorting and memory consumption in lucene

That's a really good idea Mark! :) Thanks! Will try to see if I can make a
quick change with your suggestion. (Too bad quick isn't really a word in my
vocabulary when it's 6 o'clock on a Friday :( Guess it'll be a looong
night.. :(

Cheers,
Aleks

On Fri, 10 Oct 2008 17:07:31 +0200, mark harwood <[EMAIL PROTECTED]> wrote:

> Update: The statement "...cost is field size (10 bytes?) times number of
> documents" is wrong.
> What you actually have is the cost of the unique strings (estimated at
> 10 * 1460, effectively nothing) BUT you have to add the cost of the array
> of object references to those strings, so
>
> 30 million x 8 bytes on 64-bit Java = 240 MB
> or
> 30 million x 4 bytes on 32-bit = 120 MB
>
> ...which is where the bulk of the cost comes in.
>
> How about using a field cache of "short", which is effectively:
>
> new short[reader.maxDoc]
> or
> 2 bytes * 30 million = 60 MB.
>
> Each short could represent up to 65536 values - capable of representing a
> date range of 179 years.
>
> ----- Original Message ----
> From: mark harwood <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Friday, 10 October, 2008 15:43:35
> Subject: Re: Question regarding sorting and memory consumption in lucene
>
> I think you have your memory cost calculation wrong.
> The cost is field size (10 bytes?) times the number of documents, NOT the
> number of unique terms.
> The cache is essentially an array of size reader.maxDoc() which is indexed
> into directly by docId to retrieve field values.
>
> You are right in needing to factor in the cost of keeping one active cache
> while warming up a new one, so that effectively doubles the RAM
> requirements.
>
> ----- Original Message ----
> From: Aleksander M. Stensby <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Friday, 10 October, 2008 15:25:29
> Subject: Re: Question regarding sorting and memory consumption in lucene
>
> Unfortunately no, since the documents that are added may come from a new
> "source" containing old documents as well... :/
> I tried deploying our web application without any searcher objects and it
> consumes basically ~200 MB of memory in Tomcat.
> With 6 searchers the same application manages to consume over 2.5 GB of
> memory when warming... :(
> I might have done some super-idiotic logic in the way I handle searching,
> but I can seriously not see what that might be...
>
> But I assume that people deal with much larger indexes than this, right?
>
> cheers,
> Aleksander
>
> On Fri, 10 Oct 2008 15:18:46 +0200, mark harwood <[EMAIL PROTECTED]> wrote:
>
>> Assuming content is added in chronological order and with no updates to
>> existing docs, couldn't you rely on the internal Lucene document id to
>> give a chronological sort order?
>> That would require no memory cache at all when sorting.
>>
>> Querying across multiple indexes simultaneously, however, may present an
>> added complication...
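For reference, here is a rough sketch of the short[] idea mark harwood suggests above: one 2-byte day offset per document instead of a String reference per document. The yyyy-MM-dd parsing, the UTC base date and the way the per-document date strings are obtained are illustrative assumptions, and the wiring into Lucene's own sort machinery (a custom comparator source) is left out.

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.TimeZone;

    class DayOffsetCache {
        // days[docId] = day offset of the document's date from 1970-01-01 (UTC).
        // 2 bytes per doc is roughly 60 MB for 30 million documents, versus an
        // object reference per doc. A short holds 65536 distinct values, i.e.
        // about 179 years' worth of days around the chosen base date.
        static short[] build(String[] dateForDoc) throws ParseException {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            long msPerDay = 24L * 60 * 60 * 1000;
            short[] days = new short[dateForDoc.length];
            for (int doc = 0; doc < dateForDoc.length; doc++) {
                // dateForDoc[doc] is the stored "2008-10-10"-style value, read
                // from the index during warm-up (however you choose to read it).
                days[doc] = (short) (fmt.parse(dateForDoc[doc]).getTime() / msPerDay);
            }
            return days;
        }
    }

Sorting then compares days[docA] against days[docB] rather than comparing strings, which is where the memory saving comes from.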
>> ----- Original Message ----
>> From: Aleksander M. Stensby <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Friday, 10 October, 2008 13:51:50
>> Subject: Re: Question regarding sorting and memory consumption in lucene
>>
>> I'll follow up on my own question...
>> Let's say that we have 4 years of data, meaning that there will be roughly
>> 4 * 365 = 1460 unique terms for our sort field.
>> For one index, let's say with 30 million docs, the cache should use approx
>> 100 MB, or am I wrong? And thus for 6 indexes we would need approx 600 MB
>> for the caches? (And an additional 100 MB every time we warm a new searcher
>> and swap it out...) As far as the string versus int or long goes, I don't
>> really see any big gain in changing it, since 1460 * 10 bytes of extra
>> memory doesn't really make much difference. Or?
>>
>> I guess we should consider reducing the index size, or at least only allow
>> sorted search on a subset of the index (or a pruned version of the
>> index...)? Would that be better for us?
>> But then again, I assume that there are much larger Lucene-based indexes
>> out there than ours, and you guys must have some solution to this issue,
>> right? :)
>>
>> best regards,
>> Aleksander
>>
>> On Fri, 10 Oct 2008 14:09:36 +0200, Aleksander M. Stensby
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Hello, I've read a lot of threads now on memory consumption and sorting,
>>> and I think I have a pretty good understanding of how things work, but I
>>> could still use some input here...
>>>
>>> We currently have a system consisting of 6 different Lucene indexes (all
>>> have the same structure, so you could say it is a form of sharding). We
>>> use this approach because we want to be able to give users access to
>>> different indexes (but not necessarily all indexes) etc.
>>>
>>> (We are planning to move to a Solr-based system, but for now we would
>>> like to solve this issue with our current Lucene-based system.)
>>>
>>> The thing is, the indexes are rather big (ranging from 5 GB to 20 GB per
>>> index and 10-30 million entries per index).
>>> We keep one searcher object open per index, and when the index is changed
>>> (new documents added in batches several times a day), we update the
>>> searcher objects.
>>> In the warmup procedure we did a couple of searches and things worked
>>> fine, BUT I realized that in our application we return hits sorted by
>>> date by default, and our warmup procedure did non-sorted queries... so
>>> the first searches done by a user after an update were still slow
>>> (obviously).
>>>
>>> To cope, I changed the warmup procedure to include a sorted search, and
>>> now the user will not notice slow queries. Good!
>>> But the problem at hand is that we are running into memory problems (and
>>> I understand that sorting does consume a lot of memory...). Is there any
>>> "best practice" way to deal with this? The field we sort on is an
>>> un_indexed text field representing the date, typically "2008-10-10". I am
>>> aware that sorting on a string field consumes a lot of memory, so should
>>> we change this field to something different? Would this help us with the
>>> memory problems?
>>>
>>> As a side note / curiosity question: does it matter if we use the search
>>> method returning Hits versus the search method returning TopFieldDocs?
>>> (We are not iterating them in any way when this memory issue occurs.)
>>>
>>> Thanks in advance for any guidance we may get.
>>>
>>> Best regards,
>>> Aleksander M. Stensby
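Since the thread keeps coming back to warming a new searcher with a sorted search (and the doubled RAM while two caches coexist), here is a bare-bones sketch of that warm-then-swap pattern. It is written from memory against the Lucene 2.x API of the period, so method signatures may differ in other versions; the index path, the "date" field name and the single-searcher holder are placeholders, not the actual setup described in the thread.

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    class SearcherHolder {
        private volatile IndexSearcher current;

        // After each batch of index updates: open a new searcher, run one sorted
        // query against it so the sort cache gets built, then swap it in. While
        // both searchers are alive, both caches are in memory (the doubled RAM
        // requirement mentioned above).
        void refresh(String indexPath) throws Exception {
            IndexSearcher warmed = new IndexSearcher(indexPath);       // 2.x constructor: opens its own reader
            Sort byDate = new Sort(new SortField("date", SortField.STRING));
            warmed.search(new MatchAllDocsQuery(), null, 1, byDate);   // forces the sort cache to load
            IndexSearcher old = current;
            current = warmed;                                          // new requests now hit the warm searcher
            if (old != null) {
                // In real code, wait for in-flight searches on the old searcher first.
                old.close();                                           // also closes the reader it opened
            }
        }

        IndexSearcher get() { return current; }
    }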
--
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
+47 41 22 82 72
[EMAIL PROTECTED]