Re: Running OutOfMemory while optimizing and searching
Doug, thank you for confirming this.

ZJ

Doug Cutting [EMAIL PROTECTED] wrote:

John Z wrote:
We have indexes of around 1 million docs and around 25 searchable fields. We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii files are around 8-10 MB in size and our startup memory footprint is around 60-70 MB. Then, when we start doing our searches, the memory goes up, depending on the fields we search on. We are noticing that if we start searching on new fields, the memory goes up further.

Doug, does your calculation below of what is taken up by the searcher take into account the .tii file being read into memory, or am I not making any sense?

1 byte * number of searchable fields in your index * number of docs in your index
plus 1 KB * number of terms in the query
plus 1 KB * number of phrase terms in the query

You make perfect sense. The formula above does not include the .tii. My mistake: I forgot that. By default, every 128th Term in the index is read into memory, to permit random access to terms. These are stored in the .tii file, compressed. So it is not surprising that they require 7x the size of the .tii file in memory.

Doug
Re: Running OutOfMemory while optimizing and searching
John Z wrote:
We have indexes of around 1 million docs and around 25 searchable fields. We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii files are around 8-10 MB in size and our startup memory footprint is around 60-70 MB. Then, when we start doing our searches, the memory goes up, depending on the fields we search on. We are noticing that if we start searching on new fields, the memory goes up further.

Doug, does your calculation below of what is taken up by the searcher take into account the .tii file being read into memory, or am I not making any sense?

1 byte * number of searchable fields in your index * number of docs in your index
plus 1 KB * number of terms in the query
plus 1 KB * number of phrase terms in the query

You make perfect sense. The formula above does not include the .tii. My mistake: I forgot that. By default, every 128th Term in the index is read into memory, to permit random access to terms. These are stored in the .tii file, compressed. So it is not surprising that they require 7x the size of the .tii file in memory.

Doug
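A rough back-of-the-envelope sketch that plugs the numbers discussed in this thread into Doug's formula, plus the roughly 7x .tii expansion observed above. The constants and sample inputs are assumptions for illustration, not measured values, and norms are only loaded for fields that are actually searched:

    // Rough estimate of per-searcher memory, based on the formula in this thread.
    public class SearcherMemoryEstimate {
        public static long estimateBytes(long numDocs, int searchableFields,
                                         int queryTerms, int phraseTerms,
                                         long tiiFileSizeBytes) {
            long norms = (long) searchableFields * numDocs;     // 1 byte per searched field per doc (norms)
            long buffers = 1024L * (queryTerms + phraseTerms);  // ~1 KB of i/o buffers per term
            long termIndex = 7L * tiiFileSizeBytes;             // ~7x expansion of the .tii observed in this thread
            return norms + buffers + termIndex;
        }

        public static void main(String[] args) {
            // John Z's figures: ~1M docs, ~25 searchable fields, ~9 MB .tii, small query
            long est = estimateBytes(1000000L, 25, 5, 0, 9L * 1024 * 1024);
            System.out.println("Estimated searcher footprint: about " + (est / (1024 * 1024)) + " MB");
        }
    }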
Re: Running OutOfMemory while optimizing and searching
Hi,

We are trying to get a handle on the memory footprint of our searchers. We have indexes of around 1 million docs and around 25 searchable fields. We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii files are around 8-10 MB in size and our startup memory footprint is around 60-70 MB. Then, when we start doing our searches, the memory goes up, depending on the fields we search on. We are noticing that if we start searching on new fields, the memory goes up further.

Doug, does your calculation below of what is taken up by the searcher take into account the .tii file being read into memory, or am I not making any sense?

1 byte * number of searchable fields in your index * number of docs in your index
plus 1 KB * number of terms in the query
plus 1 KB * number of phrase terms in the query

Thank you,
ZJ

Doug Cutting [EMAIL PROTECTED] wrote:

What do your queries look like? The memory required for a query can be computed by the following equation:

1 byte * number of fields in your query * number of docs in your index

So if your query searches on all 50 fields of your 3.5 million document index, then each search would take about 175 MB. If your 3-4 searches run concurrently, that's about 525 MB to 700 MB chewed up at once.

That's not quite right. If you use the same IndexSearcher (or IndexReader) for all of the searches, then only 175 MB are used. The arrays in question (the norms) are read-only and can be shared by all searches.

In general, the amount of memory required is:

1 byte * number of searchable fields in your index * number of docs in your index
plus 1 KB * number of terms in the query
plus 1 KB * number of phrase terms in the query

The latter are for i/o buffers. There are a few other things, but these are the major ones.

Doug
Re: Running OutOfMemory while optimizing and searching
Note that "force" is really just "suggest". Regardless, I have seen apps running under a 1.3.1 JVM where this worked.

Otis

--- David Spencer [EMAIL PROTECTED] wrote:

This in theory should not help, but anyway, just in case: the idea is to call gc() periodically to force garbage collection. This is the code I use, which tries to force it:

    public static long gc() {
        long bef = mem();
        System.gc();
        sleep(100);
        System.runFinalization();
        sleep(100);
        System.gc();
        long aft = mem();
        return aft - bef;
    }

Mark Florence wrote:

Thanks, Jim. I'm pretty sure I'm throwing OOM for real, and not because I've run out of file handles. I can easily recreate the latter condition, and it is always reported accurately. I've also monitored the OOM as it occurs using top, and I can see memory usage climbing until it is exhausted -- if you will excuse the pun! I'm not familiar with the new compound file format. Where can I look to find more information?

-- Mark

-----Original Message-----
From: James Dunn [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 02, 2004 01:29 pm
To: Lucene Users List
Subject: Re: Running OutOfMemory while optimizing and searching

Ah yes, I don't think I made that clear enough. From Mark's original post, I believe he mentioned that he used separate readers for each simultaneous query. His other issue was that he was getting an OOM during an optimize, even when he set the JVM heap to 2 GB. He said his index was about 10.5 GB spread over ~7,000 files on Linux. My guess is that the OOM might actually be a "too many open files" error. I have seen that type of error reported by the JVM as an OutOfMemory error on Linux before. I had the same problem, but once I switched to the new Lucene compound file format, I haven't had it since.

Mark, have you tried switching to the compound file format?

Jim

--- Doug Cutting [EMAIL PROTECTED] wrote:

What do your queries look like? The memory required for a query can be computed by the following equation:

1 byte * number of fields in your query * number of docs in your index

So if your query searches on all 50 fields of your 3.5 million document index, then each search would take about 175 MB. If your 3-4 searches run concurrently, that's about 525 MB to 700 MB chewed up at once.

That's not quite right. If you use the same IndexSearcher (or IndexReader) for all of the searches, then only 175 MB are used. The arrays in question (the norms) are read-only and can be shared by all searches.

In general, the amount of memory required is:

1 byte * number of searchable fields in your index * number of docs in your index
plus 1 KB * number of terms in the query
plus 1 KB * number of phrase terms in the query

The latter are for i/o buffers. There are a few other things, but these are the major ones.

Doug
RE: Running OutOfMemory while optimizing and searching
Thanks, Jim. I'm pretty sure I'm throwing OOM for real, and not because I've run out of file handles. I can easily recreate the latter condition, and it is always reported accurately. I've also monitored the OOM as it occurs using top, and I can see memory usage climbing until it is exhausted -- if you will excuse the pun! I'm not familiar with the new compound file format. Where can I look to find more information?

-- Mark

-----Original Message-----
From: James Dunn [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 02, 2004 01:29 pm
To: Lucene Users List
Subject: Re: Running OutOfMemory while optimizing and searching

Ah yes, I don't think I made that clear enough. From Mark's original post, I believe he mentioned that he used separate readers for each simultaneous query. His other issue was that he was getting an OOM during an optimize, even when he set the JVM heap to 2 GB. He said his index was about 10.5 GB spread over ~7,000 files on Linux. My guess is that the OOM might actually be a "too many open files" error. I have seen that type of error reported by the JVM as an OutOfMemory error on Linux before. I had the same problem, but once I switched to the new Lucene compound file format, I haven't had it since.

Mark, have you tried switching to the compound file format?

Jim

--- Doug Cutting [EMAIL PROTECTED] wrote:

What do your queries look like? The memory required for a query can be computed by the following equation:

1 byte * number of fields in your query * number of docs in your index

So if your query searches on all 50 fields of your 3.5 million document index, then each search would take about 175 MB. If your 3-4 searches run concurrently, that's about 525 MB to 700 MB chewed up at once.

That's not quite right. If you use the same IndexSearcher (or IndexReader) for all of the searches, then only 175 MB are used. The arrays in question (the norms) are read-only and can be shared by all searches.

In general, the amount of memory required is:

1 byte * number of searchable fields in your index * number of docs in your index
plus 1 KB * number of terms in the query
plus 1 KB * number of phrase terms in the query

The latter are for i/o buffers. There are a few other things, but these are the major ones.

Doug
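For what it's worth, the compound file format Jim refers to is switched on through the IndexWriter. A minimal sketch, assuming a Lucene 1.4-era API; the index path and analyzer choice are placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class CompoundFormatExample {
        public static void main(String[] args) throws Exception {
            // false = open an existing index rather than create a new one
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
            writer.setUseCompoundFile(true);  // pack each segment's files into a single .cfs file,
                                              // greatly reducing the number of open file handles
            writer.optimize();                // rewrites the segments, which then use the compound format
            writer.close();
        }
    }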
Re: Running OutOfMemory while optimizing and searching
This in theory should not help, but anyway, just in case: the idea is to call gc() periodically to force garbage collection. This is the code I use, which tries to force it:

    public static long gc() {
        long bef = mem();
        System.gc();
        sleep(100);
        System.runFinalization();
        sleep(100);
        System.gc();
        long aft = mem();
        return aft - bef;
    }

Mark Florence wrote:

Thanks, Jim. I'm pretty sure I'm throwing OOM for real, and not because I've run out of file handles. I can easily recreate the latter condition, and it is always reported accurately. I've also monitored the OOM as it occurs using top, and I can see memory usage climbing until it is exhausted -- if you will excuse the pun! I'm not familiar with the new compound file format. Where can I look to find more information?

-- Mark

-----Original Message-----
From: James Dunn [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 02, 2004 01:29 pm
To: Lucene Users List
Subject: Re: Running OutOfMemory while optimizing and searching

Ah yes, I don't think I made that clear enough. From Mark's original post, I believe he mentioned that he used separate readers for each simultaneous query. His other issue was that he was getting an OOM during an optimize, even when he set the JVM heap to 2 GB. He said his index was about 10.5 GB spread over ~7,000 files on Linux. My guess is that the OOM might actually be a "too many open files" error. I have seen that type of error reported by the JVM as an OutOfMemory error on Linux before. I had the same problem, but once I switched to the new Lucene compound file format, I haven't had it since.

Mark, have you tried switching to the compound file format?

Jim

--- Doug Cutting [EMAIL PROTECTED] wrote:

What do your queries look like? The memory required for a query can be computed by the following equation:

1 byte * number of fields in your query * number of docs in your index

So if your query searches on all 50 fields of your 3.5 million document index, then each search would take about 175 MB. If your 3-4 searches run concurrently, that's about 525 MB to 700 MB chewed up at once.

That's not quite right. If you use the same IndexSearcher (or IndexReader) for all of the searches, then only 175 MB are used. The arrays in question (the norms) are read-only and can be shared by all searches.

In general, the amount of memory required is:

1 byte * number of searchable fields in your index * number of docs in your index
plus 1 KB * number of terms in the query
plus 1 KB * number of phrase terms in the query

The latter are for i/o buffers. There are a few other things, but these are the major ones.

Doug
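The snippet above relies on mem() and sleep() helpers that are not shown in the post. A self-contained version, assuming mem() reports used heap and sleep() wraps Thread.sleep(); these assumed definitions are not the original author's code:

    public class ForceGc {
        // Assumed helper: currently used heap in bytes.
        static long mem() {
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }

        // Assumed helper: sleep, ignoring interruption.
        static void sleep(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException ignored) {
            }
        }

        // Tries to force garbage collection; returns the change in used heap
        // (a negative value means memory was reclaimed).
        public static long gc() {
            long bef = mem();
            System.gc();
            sleep(100);
            System.runFinalization();
            sleep(100);
            System.gc();
            long aft = mem();
            return aft - bef;
        }

        public static void main(String[] args) {
            System.out.println("Heap delta after forced gc: " + gc() + " bytes");
        }
    }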
Re: Running OutOfMemory while optimizing and searching
What do your queries look like? The memory required for a query can be computed by the following equation:

1 byte * number of fields in your query * number of docs in your index

So if your query searches on all 50 fields of your 3.5 million document index, then each search would take about 175 MB. If your 3-4 searches run concurrently, that's about 525 MB to 700 MB chewed up at once.

That's not quite right. If you use the same IndexSearcher (or IndexReader) for all of the searches, then only 175 MB are used. The arrays in question (the norms) are read-only and can be shared by all searches.

In general, the amount of memory required is:

1 byte * number of searchable fields in your index * number of docs in your index
plus 1 KB * number of terms in the query
plus 1 KB * number of phrase terms in the query

The latter are for i/o buffers. There are a few other things, but these are the major ones.

Doug
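Doug's point about reusing one IndexSearcher is the key practical takeaway. A minimal sketch of the pattern, assuming a Lucene 1.4-era API; the index path, field name, and class name are placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SharedSearcher {
        // One searcher shared by all threads: the norms arrays are read-only,
        // so concurrent searches reuse them instead of each loading a copy.
        private static IndexSearcher searcher;

        public static synchronized IndexSearcher getSearcher() throws Exception {
            if (searcher == null) {
                searcher = new IndexSearcher("/path/to/index");  // placeholder path
            }
            return searcher;
        }

        public static int countHits(String field, String text) throws Exception {
            Query query = QueryParser.parse(text, field, new StandardAnalyzer());
            Hits hits = getSearcher().search(query);
            return hits.length();
        }
    }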
RE: Running OutOfMemory while optimizing and searching
Wow, I have to say that numbers like those are concerning to me. Now, I know 3.5 million documents is a lot, but still... What would cause a query to require and hold that much memory? I can understand that it would be doing a lot of memory work, but why would it need to grab and hold onto that much memory for the length of the query? (This is getting into the internals a bit, but it's always good to know what's going on under the hood from a design point of view, and how I would have to structure an app to handle this sort of load.)

Cheers,
Paul Smith

1 byte * number of fields in your query * number of docs in your index

So if your query searches on all 50 fields of your 3.5 million document index, then each search would take about 175 MB. If your 3-4 searches run concurrently, that's about 525 MB to 700 MB chewed up at once. Also, if your queries use wildcards, the memory requirements could be much greater.
Re: Running OutOfMemory while optimizing and searching
Mark,

Tough situation. I hate when things like this happen in production :(. You are not mentioning what you are using for the various IndexWriter parameters. You may be able to get this working by tweaking them (see http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#field_summary). Hm, now that I think about it, I am not sure whether those are considered during index optimization. I'll try checking the sources later.

Otis

--- Mark Florence [EMAIL PROTECTED] wrote:

Hi, I'm using Lucene to index ~3.5M documents, over about 50 fields. The Lucene index itself is ~10.5 GB, spread over ~7,000 files. Some of these files are large -- that is, several .prx files are ~1.5 GB. Lucene runs on a dedicated server (Linux on a 1 GHz Dell, with 1 GB RAM). Clients on other machines use RMI to perform reads and writes.

Each night the server automatically performs an optimize. The problem is that the optimize now dies with an OutOfMemory exception, even when the JVM heap size is set to its maximum of 2 GB. I need to optimize, because as the number of Lucene files grows, search performance becomes unacceptable.

Search performance is also adversely affected because I've had to effectively single-thread reads and writes. I was using a simple read/write lock mechanism, allowing multiple readers to search simultaneously, but now more than 3-4 simultaneous readers will also cause an OutOfMemory condition. Searches can take as long as 30-40 seconds, and with single-threading, that's crippling the main client application.

Needless to say, the Lucene index is mission-critical and must run 24/7. I've seen other posts along this same vein, but no definite consensus. Is my problem simply inadequate hardware? Should I run on a 64-bit platform, where I can allocate a Java heap larger than 2 GB? Or could there be something fundamentally wrong with my index? I should add that I've just spent about a week (!!) rebuilding it from scratch, over all 3.5M documents.

Many thanks for any help!
Mark Florence
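For reference, the IndexWriter parameters Otis links to are exposed as public fields in this era of Lucene (per the field summary he points at). A hedged sketch of tweaking them before an optimize; the values shown are illustrative placeholders, not recommendations:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class WriterTuning {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);

            // Lower mergeFactor: less RAM and fewer segments merged at once, slower indexing.
            // Higher mergeFactor: faster indexing, more RAM and more open files.
            writer.mergeFactor = 10;

            // Roughly, how many documents are buffered in memory before a new segment is written.
            writer.minMergeDocs = 10;

            // Cap on how many documents are ever merged into one segment by addDocument().
            writer.maxMergeDocs = Integer.MAX_VALUE;

            writer.optimize();
            writer.close();
        }
    }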