Re: best way to get the size of an index
On Oct 1, 2009, at 12:18 PM, Phillip Farber wrote:

> Resuming this discussion in a new thread to focus only on this question:
>
> What is the best way to get the size of an index so it does not get too big to be optimized (or to allow a very large segment merge) given space limits?
>
> I already have the largest 15,000rpm SCSI direct attached storage so buying storage is not an option. I don't do deletes.
>
> From what I've read, I expect no more than a 2x increase during optimization and have not seen more in practice.
>
> I'm thinking: stop indexing, commit, do a du.

That sounds reasonable but, as on the other thread, I'd still plan for a 3x increase, even if you aren't doing deletes, just to be on the safe side.

I wonder if there is a way to report it back via Java/Lucene in a Request Handler or in the Luke Request Handler? It may be worth taking the time to add.

-- Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
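For anyone who wants the "do a du" step from inside Java rather than from the shell, here is a minimal sketch using only the JDK. The class and method names (`IndexSize`, `sizeOfDirectory`) are made up for illustration and are not Lucene or Solr APIs. It sums file lengths, so the result is the apparent size and may differ slightly from what `du` reports in disk blocks.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene or Solr: sums the lengths of all
// regular files under a directory (the apparent size of the index on disk).
final class IndexSize {

    static long sizeOfDirectory(Path dir) throws IOException {
        // Files.walk must be closed, hence the try-with-resources.
        try (Stream<Path> files = Files.walk(dir)) {
            return files
                .filter(Files::isRegularFile)
                // File.length() returns 0 if a file vanishes mid-walk,
                // which avoids a checked exception inside the lambda.
                .mapToLong(p -> p.toFile().length())
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: IndexSize <index-directory>");
            return;
        }
        Path indexDir = Paths.get(args[0]);
        System.out.println(sizeOfDirectory(indexDir) + " bytes");
    }
}
```

As with `du`, the number is only meaningful if indexing has stopped and a commit has completed before the directory is measured.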
Re: best way to get the size of an index
Phillip Farber wrote:

> Resuming this discussion in a new thread to focus only on this question:
>
> What is the best way to get the size of an index so it does not get too big to be optimized (or to allow a very large segment merge) given space limits?
>
> I already have the largest 15,000rpm SCSI direct attached storage so buying storage is not an option. I don't do deletes.

Even if you did do deletes, it's not really a 3x problem. That's just theory, and you'd have to work to get there. Deletes are merged out as you index additional docs and segments are merged over time. The 3x scenario that was brought up is more of a fun mind exercise than anything that would realistically happen.

> From what I've read, I expect no more than a 2x increase during optimization and have not seen more in practice.
>
> I'm thinking: stop indexing, commit, do a du.
>
> Will this give me the number I need for what I'm trying to do? Is there a better way?

Should work fine. When you do the commit, onCommit will be called on the IndexDeletionPolicy, and all of the snapshots of the index other than the latest one will be removed. You should have a clean index to gauge the size with. Using something like Java Replication complicates this though: in that case, older commit points can be reserved while they are being copied.

> Phil

-- 
- Mark
http://www.lucidimagination.com
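Once the committed index has been measured, the space check discussed in this thread comes down to simple arithmetic. The sketch below is one way to write it, assuming the "2x" and "3x" figures mean peak disk usage as a multiple of the current index size. `OptimizeHeadroom` and `hasHeadroom` are hypothetical names, not Lucene or Solr APIs. The only library call is the JDK's `File.getUsableSpace()`.

```java
import java.io.File;

// Hypothetical helper, not a Lucene or Solr API.
final class OptimizeHeadroom {

    // peakFactor is the ASSUMED peak disk usage during an optimize, as a
    // multiple of the current index size (2.0 or 3.0 in this thread).
    // The index already occupies 1x, so the free space only has to cover
    // the remaining (peakFactor - 1)x.
    static boolean hasHeadroom(long indexBytes, long usableBytes, double peakFactor) {
        double extraNeeded = indexBytes * (peakFactor - 1.0);
        return usableBytes >= extraNeeded;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: OptimizeHeadroom <index-directory> <index-size-in-bytes>");
            return;
        }
        File indexDir = new File(args[0]);
        long indexBytes = Long.parseLong(args[1]);
        // Free space available to this JVM on the volume holding the index.
        long usable = indexDir.getUsableSpace();
        System.out.println("room for 2x: " + hasHeadroom(indexBytes, usable, 2.0));
        System.out.println("room for 3x: " + hasHeadroom(indexBytes, usable, 3.0));
    }
}
```

Which factor to plug in is exactly the disagreement in this thread, so the sketch takes it as a parameter rather than picking one.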
Re: best way to get the size of an index
Mark Miller wrote:

> Phillip Farber wrote:
>
>> Resuming this discussion in a new thread to focus only on this question:
>>
>> What is the best way to get the size of an index so it does not get too big to be optimized (or to allow a very large segment merge) given space limits?
>>
>> I already have the largest 15,000rpm SCSI direct attached storage so buying storage is not an option. I don't do deletes.
>
> Even if you did do deletes, it's not really a 3x problem. That's just theory, and you'd have to work to get there. Deletes are merged out as you index additional docs and segments are merged over time. The 3x scenario that was brought up is more of a fun mind exercise than anything that would realistically happen.

And for completeness, for those following along:

Let's say you did do some crazy deleting, and deleted half the docs in your index. Those docs stay around, and the ids are just added to a list that keeps those docs from being seen. Later, as natural merging occurs, or if you force merges with an optimize, those deleted docs will physically be removed.

Let's then say you managed to re-add all of those docs without any merging occurring while adding them (say you wanted to see this effect so badly that you wrote and put in a custom merge policy that doesn't find any segments to merge).

Even if you do all that, before you do the optimize, you're going to look at the size of your index and see it's n GB. That's your current index size. Now say you kick off the optimize. It's not even going to take 2x that size n to optimize, because all those deletes will be removed as the index is optimized down to one segment. It's going to take at most 2x.

This delete thing, as I said, is more of a fun mental exercise. It has little relation to how much space you need to optimize in comparison to how big your index is before optimizing.

And it's really worse than a worst-case scenario unless you write a custom merge policy, or crank some settings insanely high and have enough RAM to do it (all the indexing would have to take place in one huge segment in RAM that would then get flushed).

-- 
- Mark
http://www.lucidimagination.com
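To put rough numbers on the scenario above, here is a small back-of-the-envelope model in Java. It rests on simplifying assumptions that are not from the thread: segment size is proportional to document count, no merging happens before the optimize, and the optimize needs the old index plus the new single segment on disk at the same time. The class and method names are made up for illustration.

```java
// Back-of-the-envelope model only; all sizes are in units of "live data",
// i.e. the size of the index if it held each live doc exactly once.
final class DeleteScenario {

    // Old segments still hold every original doc (1.0), including the ones
    // flagged as deleted; the re-added docs sit in new segments on top.
    static double sizeBeforeOptimize(double readdedFraction) {
        return 1.0 + readdedFraction;
    }

    // During the optimize the old index and the new single segment
    // (live docs only, size 1.0) are assumed to coexist on disk.
    static double peakDuringOptimize(double readdedFraction) {
        return sizeBeforeOptimize(readdedFraction) + 1.0;
    }

    // Peak usage expressed as a multiple of n, the size measured just
    // before the optimize starts.
    static double peakRelativeToCurrent(double readdedFraction) {
        return peakDuringOptimize(readdedFraction) / sizeBeforeOptimize(readdedFraction);
    }

    public static void main(String[] args) {
        for (double f : new double[] {0.0, 0.5, 1.0}) {
            System.out.println("re-added fraction " + f
                + ": n = " + sizeBeforeOptimize(f)
                + ", peak = " + peakDuringOptimize(f)
                + ", peak / n = " + peakRelativeToCurrent(f));
        }
    }
}
```

Under these assumptions, deleting and re-adding half the docs gives n = 1.5 units and a peak of 2.5 units, which is about 1.67 times n. Deleting and re-adding everything gives a peak of 3 units, but that is only 1.5 times the n you would measure with du beforehand. With no deletes at all, the peak is exactly 2 times n.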
Re: best way to get the size of an index
Thanks, Mark. I really appreciate your confirmation.

Phil