I tried the term index divisor prior to posting and didn't see much difference in memory usage.
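For reference, that knob looks roughly like this (a sketch against the Lucene 2.4 IndexReader API; the divisor value and the helper method are illustrative, not our production code):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class DivisorExample {
        // Open a reader that loads only every Nth indexed term into RAM.
        // The divisor must be set before the terms index is first loaded,
        // i.e. before the first term lookup against this reader.
        public static IndexReader openWithDivisor(File indexDir) throws IOException {
            IndexReader reader = IndexReader.open(FSDirectory.getDirectory(indexDir));
            reader.setTermInfosIndexDivisor(4); // 4 is illustrative; higher saves RAM, slows lookups
            return reader;
        }
    }

A divisor of N cuts the in-memory terms index by roughly a factor of N, at the cost of slower term lookups, which is why we hoped it would help here.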
I don't think we can turn off field norms, because we use boosting to push some content to the front of the results. We will definitely spend some time with Solr, Terracotta, and possibly Hadoop to see if any of that helps. We're also planning to run the example code Mark posted to see if we can find any optimizations.

We really appreciate the help.

Todd

On Thu, Oct 30, 2008 at 5:21 AM, mark harwood <[EMAIL PROTECTED]> wrote:

> One issue with the existing field cache implementation is that it uses int
> arrays to reference into the list of unique terms, where short or even byte
> arrays may suffice for fields with smaller numbers of unique terms.
> How many unique terms do you have?

I don't know the actual number of unique terms, but it's vast, since the terms are user-entered data on 300M media objects.

> I posted some code that measures the potential RAM savings to be had, given
> your particular index. See here:
> http://www.nabble.com/Re%3A-Question-regarding-sorting-and-memory-consumption-in-lucene-p19997753.html
>
> This code shows what could theoretically be saved given a different approach
> to field caching. Unfortunately the use of int arrays is not abstracted in
> Lucene's StringIndex, so this optimisation cannot be dropped in without some
> disruption. A shame, as there is a big difference between byte[reader.maxDoc]
> and int[reader.maxDoc] when you are working with large indexes and are tight
> on RAM.
>
> If you're prepared to write a more optimized custom field cache, the code
> above may be a useful starting point.
>
> Cheers,
> Mark
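To make the narrower-ordinals idea concrete, here is a minimal sketch (a hypothetical OrdinalColumn helper; this is not the code Mark posted and not Lucene's StringIndex): pick the narrowest array type that can still address the field's unique-term count.

    public class OrdinalColumns {

        // Read-only view of the per-document term ordinal used for sorting.
        public interface OrdinalColumn {
            int get(int docId);
        }

        // Back the column with the narrowest array that can index
        // uniqueTerms distinct values. A real field cache would fill
        // the array while walking the field's terms; only sizing is
        // shown here.
        public static OrdinalColumn forTermCount(final int maxDoc, int uniqueTerms) {
            if (uniqueTerms <= (1 << 8)) {
                final byte[] ords = new byte[maxDoc];   // 1 byte per doc
                return new OrdinalColumn() {
                    public int get(int docId) { return ords[docId] & 0xFF; }
                };
            } else if (uniqueTerms <= (1 << 16)) {
                final short[] ords = new short[maxDoc]; // 2 bytes per doc
                return new OrdinalColumn() {
                    public int get(int docId) { return ords[docId] & 0xFFFF; }
                };
            } else {
                final int[] ords = new int[maxDoc];     // 4 bytes per doc
                return new OrdinalColumn() {
                    public int get(int docId) { return ords[docId]; }
                };
            }
        }
    }

At 300M documents, dropping from int[] to short[] saves about 572 MB per sorted field, and dropping to byte[] saves about 858 MB.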
> ----- Original Message ----
> From: Mark Miller <[EMAIL PROTECTED]>
> To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> Sent: Thursday, 30 October, 2008 10:37:48
> Subject: Re: OutOfMemory Problems Lucene 2.4 / Tomcat
>
> Michael's got some great points (he's the Lucene master), especially about
> possibly turning off norms if you can, but for an index like that I'd
> recommend Solr. Solr sharding can be scaled to billions of documents (a
> billion or two at a minimum, anyway) with few limitations (of course there
> are a few). Plus it has further caching options, IndexReader refresh
> management, etc.
>
> - Mark
>
> On Oct 29, 2008, at 10:30 PM, "Todd Benge" <[EMAIL PROTECTED]> wrote:
>
>> Thanks Mark. I appreciate the help.
>>
>> I thought our memory might be low, but I wanted to verify whether there is
>> any way to control memory usage. I think we'll likely upgrade the memory
>> on the machines, but that may just delay the inevitable.
>>
>> I'm wondering if anyone else has encountered similar issues with indices
>> of a similar size. I've been thinking we will need to move to a clustered
>> solution and have been reading up on Hadoop, Nutch, Solr, and Terracotta
>> for possibilities such as index sharding.
>>
>> Has anyone implemented a solution using Hadoop or Terracotta for a
>> large-scale system? Just wondering about the pros and cons of the various
>> approaches.
>>
>> Thanks,
>>
>> Todd
>>
>> On Wed, Oct 29, 2008 at 6:07 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
>>> The Term/TermInfo/IndexReader internals stuff is probably on the low end
>>> compared to the size of your field caches (needed for sorting). If you
>>> are sorting by String, I think the space needed is 32 bits x number of
>>> docs, plus an array to hold all of the unique terms.
>>>
>>> So, checking 300 million docs (I know you are actually breaking it up
>>> smaller than that, but for example), ignoring things like String chars
>>> being variable byte lengths, storing the length, etc., and randomly
>>> picking 50,000 unique terms at 6 chars each:
>>>
>>> 32 bits x 300,000,000 + 50,000 x 6 x 16 bits ≈ 1,144.98 MB
>>>
>>> That's per field you're sorting on. If you are sorting on an int field,
>>> it should be closer to 32 bits x num docs; for shorts, 16 bits x num
>>> docs, etc.
>>>
>>> So you have those field caches, plus the IndexReader Term/TermInfo
>>> stuff, plus whatever RAM your app needs beyond Lucene. 4 GB might just
>>> not *quite* cut it, is my guess.
>>>
>>> Todd Benge wrote:
>>>>
>>>> There are usually only a couple of sort fields, plus a bunch of terms
>>>> in the various indices. The terms are user-entered on various media,
>>>> so the number of terms is very large.
>>>>
>>>> Thanks for the help.
>>>>
>>>> Todd
>>>>
>>>> On 10/29/08, Todd Benge <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm the lead engineer for search on a large website that uses Lucene
>>>>> for search.
>>>>>
>>>>> We're indexing about 300M documents in ~100 indices. The indices add
>>>>> up to ~60 GB.
>>>>>
>>>>> The indices are split across 4 different MultiSearchers, with the
>>>>> largest handling ~50 GB.
>>>>>
>>>>> The code is basically like the following:
>>>>>
>>>>> private static MultiSearcher searcher;
>>>>>
>>>>> public void init(File[] files) throws IOException {
>>>>>     // One IndexSearcher per index directory, combined behind a
>>>>>     // single shared MultiSearcher.
>>>>>     IndexSearcher[] searchers = new IndexSearcher[files.length];
>>>>>     int i = 0;
>>>>>     for (File file : files) {
>>>>>         searchers[i++] = new IndexSearcher(FSDirectory.getDirectory(file));
>>>>>     }
>>>>>     searcher = new MultiSearcher(searchers);
>>>>> }
>>>>>
>>>>> public Searcher getSearcher() {
>>>>>     return searcher;
>>>>> }
>>>>>
>>>>> We're seeing a high cache rate with Term & TermInfo in Lucene 2.4.
>>>>> Performance is good, but the servers are consistently hanging with
>>>>> OutOfMemory errors.
>>>>>
>>>>> We're allocating 4 GB of heap to each server.
>>>>>
>>>>> Is there any way to control the amount of memory Lucene consumes for
>>>>> caching? Any other suggestions on fixing the memory errors?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Todd
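As a sanity check on Mark's arithmetic above, here is the same estimate as runnable code (the doc count, term count, and term length are the example numbers from this thread, not measurements):

    public class SortCacheEstimate {
        public static void main(String[] args) {
            long maxDoc = 300000000L;       // example doc count from the thread
            long uniqueTerms = 50000L;      // example unique-term count
            long avgTermChars = 6L;         // example average term length

            long ordinalBytes = maxDoc * 4L;                   // one int ordinal per doc
            long termBytes = uniqueTerms * avgTermChars * 2L;  // Java chars are 2 bytes (UTF-16)
            double mb = (ordinalBytes + termBytes) / (1024.0 * 1024.0);
            System.out.println(mb + " MB per sorted String field"); // prints ~1144.98
        }
    }

This confirms the roughly 1,145 MB per sorted String field figure, before counting the Term/TermInfo caches or the application's own heap use.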