Memory usage
Hi, I am using Lucene 1.3 final. Say there are 10,000 files, each 20 MB in size, so the total file system size = 10,000 * 20 MB = 200 GB. I want to index these files. Say the merge factor = 10; then the minimum heap size required by the JVM = 10 * 20 MB = 200 MB.

From the http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html article:

"For instance, if we set mergeFactor to 10, a new segment will be created on the disk for every 10 documents added to the index. When the 10th segment of size 10 is added, all 10 will be merged into a single segment of size 100. When 10 such segments of size 100 have been added, they will be merged into a single segment containing 1000 documents, and so on. Therefore, at any time, there will be no more than 9 segments in each power of 10 index size."

If Lucene is indexing the 1000th document, then the current segment size would be 100. At that time, how many documents would Lucene hold in memory: 10 or 100? If Lucene holds 100 documents, then the minimum heap required would be 100 * 20 MB = 2 GB, which is unlikely.

Is the optimize process memory intensive? How much memory would Lucene take while doing the optimize?

Is it safe to assume that the maximum heap size required by Lucene is mergeFactor * maximum_file_size? I am planning to keep maxMergeDocs at its default, as stated in the article.

Any help on the above questions is highly appreciated.

Thanks,
-Sreedhar
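For illustration, a minimal sketch (not from the original thread) of the segment arithmetic in the quoted article: with merge factor m, the on-disk segment structure after n added documents mirrors the base-m digits of n. It assumes the pure geometric merge policy described in the article, only counts segments, and deliberately says nothing about which documents sit in RAM -- that is the question being asked, and the replies below address it.

    public class SegmentArithmetic {
        // Prints the on-disk segment structure after n adds: with merge
        // factor m, it mirrors the base-m digits of n (at most m-1 segments
        // per size tier, matching the article's "no more than 9").
        static void printSegments(int n, int mergeFactor) {
            long size = 1;
            while (n > 0) {
                int count = n % mergeFactor;
                if (count > 0) {
                    System.out.println(count + " segment(s) of " + size + " doc(s)");
                }
                n /= mergeFactor;
                size *= mergeFactor;
            }
        }

        public static void main(String[] args) {
            printSegments(999, 10);  // 9 of 1, 9 of 10, 9 of 100
            printSegments(1000, 10); // 1 segment of 1000 docs, as in the reply below
        }
    }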
Re: IndexWriter.optimize and memory usage
On Friday 03 December 2004 08:43, Paul Elschot wrote:
> On Friday 03 December 2004 07:50, Chris Hostetter wrote:
> > ... So, if I'm understanding you (and the javadocs) correctly, the real key here is maxMergeDocs. It seems like addDocument will never merge a segment until maxMergeDocs have been added? ... meaning that I need a value less than the default (Integer.MAX_VALUE) if I want IndexWriter to do incremental merges as I go ... except ... if that were the case, then what exactly is the meaning of mergeFactor?

Oops, correction: minMergeDocs should be replaced by mergeFactor below.

maxMergeDocs controls the sizes of the intermediate segments when adding documents. With maxMergeDocs at the default, adding a document can take as much time as (and have the same effect as) optimize(). E.g. with mergeFactor at 10, the 1000th added document will create a segment of size 1000. With maxMergeDocs at a value lower than 1000, the last merge (of the 10 segments with 100 docs each) will not be done. optimize() uses mergeFactor for its final merges, but it ignores maxMergeDocs.

Meanwhile, these fields have been deprecated in the development version in favour of set... methods. Setting minMergeDocs is deprecated, to be replaced by setMaxBufferedDocs(). The javadoc for this reads: "Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. Since Documents are merged in a RAMDirectory, a larger value gives faster indexing. At the same time, mergeFactor limits the number of files open in an FSDirectory."

Regards,
Paul Elschot
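A sketch of the two API generations Paul describes. The public fields are the Lucene 1.3/1.4 API; the setter exists only in the development version he mentions, so treat that call as an assumption based on his deprecation note. The index path is made up.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class WriterKnobs {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
            writer.mergeFactor = 10;                 // segments merged in groups of 10
            writer.maxMergeDocs = Integer.MAX_VALUE; // default: intermediate merges unbounded
            writer.minMergeDocs = 10;                // docs buffered in a RAMDirectory before a segment is written
            // Development-version replacement, per the deprecation note above:
            // writer.setMaxBufferedDocs(10);
            writer.close();
        }
    }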
IndexWriter.optimize and memory usage
I've been running into an interesting situation that I wanted to ask about. I've been doing some testing by building up indexes with code that looks like this...

    Date start = new Date();
    Date end;
    IndexWriter writer = null;
    try {
        writer = new IndexWriter(index, new StandardAnalyzer(), true);
        writer.mergeFactor = MERGE_FACTOR;
        PooledExecutor queue = new PooledExecutor(NUM_UPDATE_THREADS);
        queue.waitWhenBlocked();
        for (int min = low; min < high; min += BATCH_SIZE) {
            int max = min + BATCH_SIZE;
            if (high < max) {
                max = high;
            }
            queue.execute(new BatchIndexer(writer, min, max));
        }
        end = new Date();
        System.out.println("Build Time: " + (end.getTime() - start.getTime()) + "ms");
        start = end;
        writer.optimize();
    } finally {
        if (null != writer) {
            try { writer.close(); } catch (Exception ignore) { /*NOOP*/ }
        }
    }
    end = new Date();
    System.out.println("Optimize Time: " + (end.getTime() - start.getTime()) + "ms");

(where BatchIndexer is a class I have that gets a DB connection, slurps all records from my DB between min and max, builds some simple Documents out of them, and calls writer.addDocument(doc) on each)

This was working fine with small ranges, but then I tried building up a nice big index for doing some performance testing. I left it running overnight, and when I came back in the morning I discovered that after successfully building up the whole index (~112K docs, ~1.5GB disk) it crashed with an OutOfMemory exception while trying to optimize.

I then realized I was only running my JVM with a 256m upper limit on RAM, and I figured that PooledExecutor was still in scope and maybe it was maintaining some state that was using up a lot of space, so I whipped up a quick little app to solve my problem...

    public static void main(String[] args) throws Exception {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(index, new StandardAnalyzer(), false);
            writer.optimize();
        } finally {
            if (null != writer) {
                try { writer.close(); } catch (Exception ignore) { /*NOOP*/ }
            }
        }
    }

...but I was disappointed to discover that even this couldn't run with only 256m of RAM. I bumped it up to 512m and then it managed to complete successfully (the final index was only 1.1GB of disk). This raises a few questions in my mind:

1) Is there a rule of thumb for knowing how much memory it takes to optimize an index?

2) Is there a Best Practice to follow when building up a large index from scratch in order to reduce the amount of memory needed to optimize once the whole index is built? (ie: would spinning up a thread that called writer.optimize() every N minutes be a good idea?)

3) Given an unoptimized index that's already been built (ie: in the case where my builder crashed and I wanted to try to optimize it without having to rebuild from scratch), is there any way to get IndexWriter to use less RAM and more disk (trading speed for a smaller form factor -- and apparently greater stability, so that the app doesn't crash)?

I imagine that the answers to #1 and #2 are largely dependent on the nature of the data in the index (ie: the frequency of terms), but I'm wondering if there is a high-level formula that could be used to say "based on the nature of your data, you want to take this approach to optimizing when you build".

-Hoss
Re: IndexWriter.optimize and memory usage
Hello, and quick answers: see the IndexWriter javadoc, in particular mergeFactor, minMergeDocs, and maxMergeDocs. These let you control the size of your segments, the frequency of segment merges, the number of Documents buffered in RAM between segment merges, and such. Also, you ask about calling optimize periodically - no need, Lucene already merges segments once in a while for you. Optimize at the end. You can also experiment with different JVM args for various GC algorithms.

Otis

--- Chris Hostetter [EMAIL PROTECTED] wrote:
> I've been running into an interesting situation that I wanted to ask about. [...]
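A sketch of the build pattern Otis recommends -- let the writer merge incrementally on its own and optimize exactly once, at the end. This uses the 1.3/1.4-era API; the path, field name, and buffer values are illustrative assumptions, not figures from the thread.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class BuildThenOptimize {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
            writer.mergeFactor = 10;   // incremental merges happen automatically as docs are added
            writer.minMergeDocs = 100; // buffer more docs in RAM for faster indexing
            try {
                for (int i = 0; i < 112000; i++) {
                    Document doc = new Document();
                    doc.add(Field.Text("body", "document " + i)); // stand-in content
                    writer.addDocument(doc);
                }
                writer.optimize(); // once, at the end -- no periodic optimize thread
            } finally {
                writer.close();
            }
        }
    }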
Re: Memory usage: IndexSearcher Sort
On Sep 29, 2004, at 3:11 PM, Bryan Dotzour wrote:
> 3. Certainly some of you on this list are using Lucene in a web-app environment. Can anyone list some best practices on managing reading/writing/searching a Lucene index in that context?

Beyond the advice already given on this thread: since you said you were using Tapestry, I keep an IndexSearcher as a transient, lazy-init'd property of my Global object. It needs to be transient in case you are scaling with distributed servers in a farm, and lazy-init'd so it is instantiated on first use. Global, in Tapestry, makes a good place to put index operations.

As for searching - a good first try is to re-query for each page of search results (if you're implementing paging of results, that is). It is often fast enough.

Erik
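A minimal sketch of what Erik describes, stripped of the Tapestry specifics: a transient, lazily initialized Searcher held by a long-lived object. The class name and index path are made up for illustration.

    import java.io.IOException;
    import java.io.Serializable;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Searcher;

    public class SearchGlobal implements Serializable {
        // transient: each server in a farm opens its own Searcher instead of
        // having one serialized across the wire
        private transient Searcher searcher;

        public synchronized Searcher getSearcher() throws IOException {
            if (searcher == null) {
                // lazy init: opened the first time it is needed
                searcher = new IndexSearcher("/path/to/index"); // hypothetical path
            }
            return searcher;
        }
    }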
RE: Memory usage: IndexSearcher Sort
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 29, 2004 18:28
To: Lucene Users List
Subject: RE: Memory usage: IndexSearcher Sort

> > 2. How does this approach work with multiple, simultaneous users?
> IndexSearcher is thread-safe.

You mean one can invoke the search method of a single Searchable in two different threads at the same time, don't you?
RE: Memory usage: IndexSearcher Sort
Correct. I think there is a FAQ entry at jguru.com that answers this.

Otis

--- Cocula Remi [EMAIL PROTECTED] wrote:
> You mean one can invoke the search method of a single Searchable in two different threads at the same time, don't you?
Memory usage: IndexSearcher Sort
I have been investigating a serious memory problem in our web app (using Tapestry, Hibernate, Lucene) and have reduced it to the way in which we are using Lucene to search. Being a webapp, we have focused on doing our work within a user's request, so we basically end up opening at least one new IndexSearcher on each individual page view. In one particular case, we were doing this in a loop, eventually opening ~20-40 IndexSearchers, which caused our memory usage to skyrocket. After viewing that one page 3 or 4 times we would exhaust the server's memory allocation.

Most helpful in this search was the following thread from Bugzilla: http://issues.apache.org/bugzilla/show_bug.cgi?id=30628

From this thread, it sounds like constantly opening and closing IndexSearcher objects is a BAD THING, but it is exactly what we are doing in our app. There are a few things that puzzle me and I'd love it if anyone has some input that might clear up some of these questions.

1. According to the Bugzilla thread, and from my own testing, you can open lots of IndexSearchers in a loop and do a search WITHOUT SORTING and not have this memory problem. Is there an issue with the Sort code?

2. Can anyone give a brief, technical explanation as to why opening multiple IndexSearcher objects is bad?

3. Certainly some of you on this list are using Lucene in a web-app environment. Can anyone list some best practices on managing reading/writing/searching a Lucene index in that context?

Thank you all,
Bryan

---
Some extra information about my Lucene setup:
Lucene 1.4.1
We maintain 5 different indexes, all in RAMDirectories. The indexes aren't especially big (< 100,000 total objects combined).
Re: Memory usage: IndexSearcher Sort
> Most helpful in this search was the following thread from Bugzilla: http://issues.apache.org/bugzilla/show_bug.cgi?id=30628

We had a similar problem in our webapp. Please look at this bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=31240

My co-worker Rafa has fixed this bug and submitted a patch today.

Have fun ;)
--
Damian Gajda
Caltha Sp. j.
Warszawa 02-807
ul. Kukuki 2
tel. +48 22 643 20 20
mobile: +48 501 032 506
http://www.caltha.pl/
RE: Memory usage: IndexSearcher Sort
My solution is: I have bound in an RMI registry one RemoteSearchable object for each index. Thus I do not have to create any IndexSearcher, and I can execute queries from any application. This has been implemented in the Lucene Server project that I have just begun: http://sourceforge.net/projects/luceneserver/ I use it in a web app. It would be nice if some people could test it (don't you want to?).

-----Original Message-----
From: Bryan Dotzour [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 29, 2004 15:11
To: '[EMAIL PROTECTED]'
Subject: Memory usage: IndexSearcher Sort

> I have been investigating a serious memory problem in our web app (using Tapestry, Hibernate, Lucene) and have reduced it to the way in which we are using Lucene to search. [...]
Re: Memory usage: IndexSearcher Sort
Hello,

--- Bryan Dotzour [EMAIL PROTECTED] wrote:
> I have been investigating a serious memory problem in our web app (using Tapestry, Hibernate, Lucene)... [...]
> 1. According to the Bugzilla thread, and from my own testing, you can open lots of IndexSearchers in a loop and do a search WITHOUT SORTING and not have this memory problem. Is there an issue with the Sort code?

Yes, there is a memory leak in the Sort code. A kind person from Poland contributed a patch earlier today. It's not in CVS yet.

> 2. Can anyone give a brief, technical explanation as to why opening multiple IndexSearcher objects is bad?

Very simple. A Lucene index consists of X number of files that reside on disk. Every time you open a new IndexSearcher, some of these files need to be read. If the files do not change (no documents added/removed), why do this repetitive work? Just do it once. When these files are read, some data is stored in memory; if you read them multiple times, you will store the same data in memory multiple times.

> 3. Certainly some of you on this list are using Lucene in a web-app environment. Can anyone list some best practices on managing reading/writing/searching a Lucene index in that context?
I use something like this for http://www.simpy.com/ and it works well for me:

    private IndexDescriptor getIndexDescriptor(String indexID) throws SearcherException {
        File indexDir = validateIndex(indexID);
        IndexDescriptor indexDescriptor = getIndexDescriptorFromCache(indexDir);
        try {
            // if this is a known index
            if (indexDescriptor != null) {
                // if the index has changed since this Searcher was created, make a new Searcher
                long currentVersion = IndexReader.getCurrentVersion(indexDir);
                if (currentVersion > indexDescriptor.lastKnownVersion) {
                    indexDescriptor.lastKnownVersion = currentVersion;
                    indexDescriptor.searcher = new LuceneUserSearcher(indexDir);
                }
            }
            // if this is a new index
            else {
                indexDescriptor = new IndexDescriptor();
                indexDescriptor.indexDir = indexDir;
                indexDescriptor.lastKnownVersion = IndexReader.getCurrentVersion(indexDir);
                indexDescriptor.searcher = new LuceneUserSearcher(indexDir);
            }
            return cacheIndexDescriptor(indexDescriptor);
        } catch (IOException e) {
            throw new SearcherException("Cannot open index: " + indexDir, e);
        }
    }

IndexDescriptor is a simple 'struct' with everything public (not good practice, you should change that):

    final class IndexDescriptor {
        public File indexDir;
        public long lastKnownVersion;
        public Searcher searcher;

        public String toString() {
            return IndexDescriptor.class.getName()
                + ": index directory: " + indexDir.getAbsolutePath()
                + ", last known version: " + lastKnownVersion
                + ", searcher: " + searcher;
        }
    }

These two things combined allow me to re-open an IndexSearcher when the index changes, and re-use the same IndexSearcher while the index remains unmodified. Of course, that LuceneUserSearcher could probably be Lucene's IndexSearcher.

Otis
http://www.simpy.com/ -- Index, Search and Share your bookmarks
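A hypothetical caller for the snippet above. It assumes the searcher field behaves like Lucene's org.apache.lucene.search.Searcher (LuceneUserSearcher is Otis's own class, so its exact interface is a guess); the index ID, field, and term are made up.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.TermQuery;

    // Re-uses a cached Searcher unless the index version has moved forward.
    public Hits searchIndex(String indexID) throws Exception {
        IndexDescriptor descriptor = getIndexDescriptor(indexID); // Otis's method above
        return descriptor.searcher.search(new TermQuery(new Term("body", "lucene")));
    }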
RE: Memory usage: IndexSearcher Sort
Thanks very much for the reply Otis. Your code snippet is pretty interesting and made me think about a few questions.

1. Do you just have one IndexReader for a given index? It looks like you are handing out a new IndexSearcher when the IndexReader has been modified.

2. How does this approach work with multiple, simultaneous users?

3. When does the reader need to get closed?

Thanks again,
Bryan

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 29, 2004 8:47 AM
To: Lucene Users List
Subject: Re: Memory usage: IndexSearcher Sort

[...]
RE: Memory usage: IndexSearcher Sort
Hello Bryan,

--- Bryan Dotzour [EMAIL PROTECTED] wrote:
> 1. Do you just have one IndexReader for a given index? It looks like you are handing out a new IndexSearcher when the IndexReader has been modified.

One IndexReader for each index. Simpy has a LOT of Lucene indices.

> 2. How does this approach work with multiple, simultaneous users?

IndexSearcher is thread-safe.

> 3. When does the reader need to get closed?

Just leave it open. You can close it when you are sure you no longer need it, if you can determine that in your application.

Otis

[...]
Re: Memory usage
Sorry if I'm stating the obvious. Is this happening in some stand-alone unit tests, or are you running things from some application and in some environment, like Tomcat, Jetty, or some non-web app? Your queries are pretty big (although I recall some people using even bigger ones... but it all depends on the hardware they had), but are you sure running out of memory is due to Lucene, or could it be a leak in the app from which you are running queries?

Otis

--- James Dunn [EMAIL PROTECTED] wrote:
> Doug, We only search on analyzed text fields. [...]
Re: Memory usage
Otis,

My app does run within Tomcat, but when I started getting these OutOfMemoryErrors I wrote a little unit test to watch the memory usage without Tomcat in the middle, and I still see the same memory usage.

Thanks,
Jim

--- Otis Gospodnetic [EMAIL PROTECTED] wrote:
> Sorry if I'm stating the obvious. Is this happening in some stand-alone unit tests, or are you running things from some application and in some environment...? [...]
Memory usage
Hello,

I was wondering if anyone has had problems with memory usage and MultiSearcher. My index is composed of two sub-indexes that I search with a MultiSearcher. The total size of the index is about 3.7GB, with the larger sub-index being 3.6GB and the smaller being 117MB. I am using Lucene 1.3 Final with the compound file format. I search across about 50 fields, but I don't use wildcard or range queries.

Doing repeated searches in this way seems to eventually chew up about 500MB of memory, which seems excessive to me. Does anyone have any ideas where I could look to reduce the memory my queries consume?

Thanks,
Jim
RE: Memory usage
This sounds like a memory leakage situation. If you are using Tomcat, I would suggest you make sure you are on a recent version, as version 4 is known to have some memory leaks. It doesn't make sense that repeated queries would use more memory than the most demanding query unless objects are not getting freed from memory.

-Will

-----Original Message-----
From: James Dunn [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 26, 2004 3:02 PM
To: [EMAIL PROTECTED]
Subject: Memory usage

> Hello, I was wondering if anyone has had problems with memory usage and MultiSearcher. [...]
RE: Memory usage
Will,

Thanks for your response. It may be an object leak. I will look into that.

I just ran some more tests; this time I created a 20GB index by repeatedly merging my large index into itself. When I ran my test query against that index I got an OutOfMemoryError on the very first query. I have my heap set to 512MB. Should a query against a 20GB index require that much memory? I page through the results 100 at a time, so I should never have more than 100 Document objects in memory.

Any help would be appreciated, thanks!

Jim

--- [EMAIL PROTECTED] wrote:
> This sounds like a memory leakage situation. [...]
Re: Memory usage
How big are your actual Documents? Are you caching Hits? It stores, internally, up to 200 documents.

Erik

On May 26, 2004, at 4:08 PM, James Dunn wrote:
> Will, Thanks for your response. It may be an object leak. I will look into that. [...]
Re: Memory usage
James Dunn wrote:
> Also I search across about 50 fields but I don't use wildcard or range queries.

Lucene uses one byte of RAM per document per searched field, to hold the normalization values. So if you search a 10M document collection with 50 fields, then you'll end up using 500MB of RAM.

If you're using unanalyzed fields, then an easy workaround to reduce the number of fields is to combine many in a single field. So, instead of, e.g., using an "f1" field with value "abc", and an "f2" field with value "efg", use a single field named "f" with values "1_abc" and "2_efg".

We could optimize this in Lucene. If no values of an indexed field are analyzed, then we could store no norms for the field and hence read none into memory. This wouldn't be too hard to implement...

Doug
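A short sketch of the workaround Doug describes, using the 1.3/1.4-era Field.Keyword factory for unanalyzed fields. The field names and values are Doug's examples; folding two fields into one halves the per-document norms cost here.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    public class FieldFolding {
        public static void main(String[] args) {
            Document doc = new Document();
            // Instead of one unanalyzed field apiece (one norms byte/doc/field):
            //   doc.add(Field.Keyword("f1", "abc"));
            //   doc.add(Field.Keyword("f2", "efg"));
            // fold them into a single field, prefixing each value:
            doc.add(Field.Keyword("f", "1_abc"));
            doc.add(Field.Keyword("f", "2_efg"));
            // Queries change the same way:
            TermQuery q = new TermQuery(new Term("f", "1_abc")); // was ("f1", "abc")
            System.out.println(q);
        }
    }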
Re: Memory usage
Erik,

Thanks for the response. My actual documents are fairly small. Most docs only have about 10 fields. Some of those fields are stored, however, like the OBJECT_ID, NAME and DESC fields. The stored fields are pretty small as well: none should be more than 4KB, and very few will approach that limit. I'm also using the default maxFieldSize value of 1. I'm not caching Hits, either.

Could it be my query? I have about 80 total unique fields in the index, although no document has all 80. My query ends up looking like this:

+(F1:test F2:test .. F80:test)

From previous mails, that doesn't look like an enormous number of fields to be searching against. Is there some formula for the amount of memory required for a query, based on the number of clauses and terms?

Jim

--- Erik Hatcher [EMAIL PROTECTED] wrote:
> How big are your actual Documents? Are you caching Hits? It stores, internally, up to 200 documents. [...]
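For reference, Jim's query shape can be built programmatically like this, using the 1.3/1.4-era BooleanQuery.add(Query, required, prohibited) signature; field names and the term follow his example. By Doug's one-byte-per-document-per-searched-field figure elsewhere in this thread, touching 80 fields costs roughly 80 bytes of norms per document in the index.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class MultiFieldQuery {
        public static void main(String[] args) {
            BooleanQuery inner = new BooleanQuery();
            for (int i = 1; i <= 80; i++) {
                // SHOULD clause: required=false, prohibited=false
                inner.add(new TermQuery(new Term("F" + i, "test")), false, false);
            }
            BooleanQuery query = new BooleanQuery();
            query.add(inner, true, false); // the leading '+' makes the group required
            System.out.println(query);     // roughly +(F1:test F2:test ... F80:test)
        }
    }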
Re: Memory usage
Doug,

Thanks! I just asked a question regarding how to calculate the memory requirements for a search. Does this memory get used only during the search operation itself, or is it referenced by the Hits object or anything else after the actual search completes?

Thanks again,
Jim

--- Doug Cutting [EMAIL PROTECTED] wrote:
> Lucene uses one byte of RAM per document per searched field, to hold the normalization values. [...]
Re: Memory usage
It is cached by the IndexReader and lives until the index reader is garbage collected. 50-70 searchable fields is a *lot*. How many are analyzed text, and how many are simply keywords?

Doug

James Dunn wrote:
> Doug, Thanks! I just asked a question regarding how to calculate the memory requirements for a search. [...]
Re: Memory usage
Doug,

We only search on analyzed text fields. There are a couple of additional fields in the index, like OBJECT_ID, that are keywords, but we don't search against those; we only use them once we get a result back, to find the thing that document represents.

Thanks,
Jim

--- Doug Cutting [EMAIL PROTECTED] wrote:
> It is cached by the IndexReader and lives until the index reader is garbage collected. 50-70 searchable fields is a *lot*. [...]
RE: Memory Usage?
This was a single query? How many terms, and of what type, are in the query? From the trace it looks like there could be over 40,000 terms in the query! Is this a prefix or wildcard query? These can generate *very* large queries...

Doug

-----Original Message-----
From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
Sent: Sunday, November 11, 2001 6:59 AM
To: Lucene Users List
Subject: RE: Memory Usage?

I am not very familiar with the output of -Xrunhprof, but I've attached the output of a run of a search through an index of 50,000 documents. It gave me out-of-memory errors until I allocated 100 megabytes of heap space. The top 10:

SITES BEGIN (ordered by live bytes) Sun Nov 11 15:50:31 2001
          percent           live           alloc'ed         stack  class
 rank   self  accum     bytes  objs       bytes     objs    trace  name
    1 26.41% 26.41%  12485200 12005    45566560    43814     1783  [B
    2 25.18% 51.59%  11904880 11447    44867680    43142     1796  [B
    3  4.15% 55.74%   1962904 69214   171546352  5510292     1632  [C
    4  3.83% 59.58%   1812096  3432     1812096     3432     1768  [I
    5  3.83% 63.41%   1812096  3432     1812096     3432     1769  [I
    6  3.34% 66.75%   1580688 65862   130618992  5442458     1631  java.lang.String
    7  3.19% 69.95%   1509584 44763     1509584    44763      458  [C
    8  3.03% 72.98%   1432416 44763     1432416    44763      459  org.apache.lucene.index.TermInfo
    9  2.27% 75.25%   1074312 44763     1074312    44763      457  java.lang.String
   10  2.23% 77.48%   1053792 65862    87079328  5442458     1631  org.apache.lucene.index.Term

and the top 3 traces were:

TRACE 1783:
  org.apache.lucene.store.InputStream.refill(InputStream.java:165)
  org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
  org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
  org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:101)

TRACE 1796:
  org.apache.lucene.store.InputStream.refill(InputStream.java:165)
  org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
  org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
  org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:100)

TRACE 1632:
  java.lang.String.<init>(String.java:198)
  org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:134)
  org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:114)
  org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:166)

I've attached the whole trace as gzipped.txt

regards,
Anders Nielsen

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: November 10, 2001 04:35
To: 'Lucene Users List'
Subject: RE: Memory Usage?

I'm surprised that your memory use is that high.

An IndexReader requires:
  one byte per field per document in index (norms)
  one open file per file in index
  1/128 of the Terms in the index
    (a Term has two pointers (8 bytes) and a String (4 pointers = 24 bytes, one to 16-bit chars))

A Search requires:
  1 1024-byte buffer per TermQuery
  2 128-int buffers per TermQuery
  2 1024-byte buffers per PhraseQuery term
  1 1024-element bucket array per BooleanQuery
    (each bucket has 5 fields, and hence requires ~20 bytes)
  1 bit per document in index per DateFilter

A Hits requires:
  up to n+100 ScoreDocs (float+int, 8 bytes), where n is the highest Hits.doc(n) accessed
  up to 200 Document objects

I may have forgotten something...

Let's assume that your 1M document index has 2M unique terms, that you only look at the top 100 hits, that your index has three fields, and that the typical document has two stored fields, each 20 characters.
Your 30-term boolean query over a 1M document index should use around the following numbers of bytes:

IndexReader:
  3,000,000 (norms)
  1,000,000 (1/128 of 2M terms, each requiring ~50 bytes)

during search:
  50,000 (TermQuery buffers)
  20,000 (BooleanQuery buckets)
  100,000 (DateFilter bit vector)

in Hits:
  2,000 (200 ScoreDocs)
  30,000 (up to 200 cached Documents)

So searches should run in a 5MB heap. Are my assumptions off?

You can also see why it is useful to keep a single IndexReader and use it for all queries. (IndexReader is thread safe.)

You could also try 'java -Xrunhprof:heap=sites' to see what's using memory.

Doug
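Doug's accounting, restated as arithmetic. All constants come from his two messages above except the ~150 bytes per cached Document, which is an assumption chosen so his 30,000-byte Hits figure comes out; the exact totals differ slightly from his rounded numbers.

    public class HeapEstimate {
        public static void main(String[] args) {
            long norms       = 3L * 1000000;              // 1 byte/doc/field, 3 fields, 1M docs
            long termCache   = (2000000 / 128) * 50L;     // 1/128 of 2M terms, ~50 bytes each (~his 1,000,000)
            long termBuffers = 30 * (1024 + 2 * 128 * 4); // 30 TermQuery byte + int buffers
            long boolBuckets = 1024 * 20;                 // 1024 buckets, ~20 bytes each
            long dateFilter  = 1000000 / 8;               // 1 bit per document
            long hits        = 200 * 8 + 200 * 150;       // ScoreDocs + ~150 bytes/cached Document (assumed)
            long total = norms + termCache + termBuffers + boolBuckets + dateFilter + hits;
            // ~4,000,000 bytes, in line with the "searches should run in a 5MB heap" conclusion
            System.out.println(total + " bytes");
        }
    }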
RE: Memory Usage?
hmm, I seem to be getting a different number of hits when I use the files you sent out.

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: November 12, 2001 20:47
To: 'Lucene Users List'
Subject: RE: Memory Usage?

> From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
> this was a big boolean query, with several prefix queries but no wildcard queries in the or-branches.

Well, it looks like those prefixes are expanding to a lot of terms: a total of over 40,000! (A prefix query expands into a BooleanQuery with all the terms matching the prefix.)

If most of these expansions are low-frequency, then a simple fix should improve things considerably. I've attached an optimized version of TermQuery that will hold less memory per low-frequency term. In particular, if a term occurs fewer than 128 times, then a 1024-byte InputStream buffer is freed immediately. Tell me how this works. Please send another heap dump.

Longer term, or if lots of the expanded terms occur more than 128 times, perhaps BooleanScorer should use a different algorithm when there are thousands of terms. In this case it might use less memory to construct an array of score buckets for all documents. If (query.termCount() * 1024) > (12 * getMaxDoc()), then this would use less memory. In your case, with 500,000 documents and a 40,000-term query, it's currently taking 40MB/query, and could be done in 6MB/query. This optimization would not be too difficult, as it could be mostly isolated to BooleanQuery and BooleanScorer.

Doug
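A sketch of Doug's proposed switch-over rule, as stated in the message above. This is an illustration of the heuristic, not actual Lucene code; the method and parameter names are made up.

    public class ScorerMemoryRule {
        // True when a per-document bucket array would use less memory than
        // per-term buffers, per the inequality Doug gives.
        static boolean bucketArrayUsesLess(int termCount, int maxDoc) {
            long perTerm = (long) termCount * 1024; // ~1024 bytes of buffers per term
            long perDoc = 12L * maxDoc;             // ~12 bytes of score bucket per doc
            return perTerm > perDoc;
        }

        public static void main(String[] args) {
            // 40,000 terms over 500,000 docs: ~40MB vs 6MB, matching Doug's figures
            System.out.println(bucketArrayUsesLess(40000, 500000)); // true
        }
    }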
RE: Memory Usage?
> From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
> hmm, I seem to be getting a different number of hits when I use the files you sent out.

Please provide more information! Is it larger or smaller than before? By how much? What differences show up in the hits? That's a terrible bug report...

I think before it may have been possible to get a spurious hit if a query term only occurred in deleted documents. A wildcard query with 40,000 terms might make this sort of thing happen more often, and unless you tried to access the Hits.doc() for such a hit, you would not see an error. If this was in fact a problem, the code I just sent out would have fixed it. So your results may in fact be better. Or there may be a bug in what I sent. Or both! For the cases I have tried, I get the same results with and without those changes.

Doug