Re: best way to get the size of an index

2009-10-02 Thread Grant Ingersoll


On Oct 1, 2009, at 12:18 PM, Phillip Farber wrote:



> Resuming this discussion in a new thread to focus only on this
> question:
>
> What is the best way to get the size of an index so it does not get
> too big to be optimized (or to allow a very large segment merge)
> given space limits?
>
> I already have the largest 15,000rpm SCSI direct attached storage so
> buying storage is not an option.  I don't do deletes.
>
> From what I've read, I expect no more than a 2x increase during
> optimization and have not seen more in practice.
>
> I'm thinking: stop indexing, commit, do a du.


That sounds reasonable, but as discussed on the other thread, I'd still
plan for a 3x increase, even if you aren't doing deletes, just to be on
the safe side.
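
As a rough sketch of the "commit, then du" check with a safety factor: the
Python below sums file sizes under an index directory and tests whether the
free space covers the extra room an optimize might need. The function names
and the reading of "Nx" as peak total usage relative to the current index
size are assumptions made for illustration; none of this is a Solr or Lucene
API.

```python
import os
import shutil


def index_size_bytes(index_dir):
    """Sum the apparent sizes of all regular files under index_dir."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def free_bytes(path):
    """Free space on the filesystem that holds path."""
    return shutil.disk_usage(path).free


def can_optimize(index_bytes, free, factor=3.0):
    """True if free space covers the extra (factor - 1) x index size.

    Assumption: 'factor' is peak total usage during optimize divided by the
    current index size, so the index itself already accounts for 1x.
    """
    extra_needed = (factor - 1.0) * index_bytes
    return free >= extra_needed
```

Calling can_optimize(index_size_bytes(d), free_bytes(d)) with the default
factor of 3 follows the more cautious estimate; pass factor=2.0 for the 2x
estimate. This sums apparent file sizes, so the result can differ slightly
from du's block-based figure.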



I wonder if there is a way to report it back via Java/Lucene in a  
Request Handler or in the Luke Request Handler?  May be worth taking  
the time to add.


--
Grant Ingersoll
http://www.lucidimagination.com/




Re: best way to get the size of an index

2009-10-02 Thread Mark Miller
Phillip Farber wrote:

> Resuming this discussion in a new thread to focus only on this question:
>
> What is the best way to get the size of an index so it does not get
> too big to be optimized (or to allow a very large segment merge) given
> space limits?
>
> I already have the largest 15,000rpm SCSI direct attached storage so
> buying storage is not an option.  I don't do deletes.

Even if you did do deletes, it's not really a 3x problem - that's just
theory - you'd have to work to get there. Deletes are merged out as you
index additional docs as segments are merged over time. The 3x scenario
brought up is more of a fun mind exercise than anything that would
realistically happen.

> From what I've read, I expect no more than a 2x increase during
> optimization and have not seen more in practice.
>
> I'm thinking: stop indexing, commit, do a du.
>
> Will this give me the number I need for what I'm trying to do? Is
> there a better way?

Should work fine. When you do the commit, onCommit will be called on the
IndexDeletionPolicy, and all of the snapshots of the index other than
the latest one will be removed. You should have a clean index to gauge
the size with. Using something like Java Replication complicates this
though - in that case, older commit points can be reserved while they
are being copied.
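
One way to check whether older commit points are still sitting in the
index directory before trusting the du figure is to count the commit-point
files. The sketch below assumes each commit point is recorded in a file
named "segments_<generation>" (generation in base 36) and that
"segments.gen" is not itself a commit point; the function names are made up
for illustration.

```python
import os
import re

# Assumption: commit points are files named "segments_<gen>", with <gen>
# written in base 36. "segments.gen" does not match this pattern.
_COMMIT_FILE = re.compile(r"^segments_[0-9a-z]+$")


def commit_point_files(index_dir):
    """Return commit-point file names in index_dir, oldest generation first."""
    names = [n for n in os.listdir(index_dir) if _COMMIT_FILE.match(n)]
    return sorted(names, key=lambda n: int(n.split("_", 1)[1], 36))


def is_clean_for_sizing(index_dir):
    """True when exactly one commit point remains in index_dir."""
    return len(commit_point_files(index_dir)) == 1
```

If is_clean_for_sizing returns False, files belonging to older commit
points may still be on disk (for example while replication is copying
them), and du will overstate the size of the latest commit.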

> Phil


-- 
- Mark

http://www.lucidimagination.com





Re: best way to get the size of an index

2009-10-02 Thread Mark Miller
Mark Miller wrote:
> Phillip Farber wrote:
>
>> Resuming this discussion in a new thread to focus only on this question:
>>
>> What is the best way to get the size of an index so it does not get
>> too big to be optimized (or to allow a very large segment merge) given
>> space limits?
>>
>> I already have the largest 15,000rpm SCSI direct attached storage so
>> buying storage is not an option.  I don't do deletes.
>
> Even if you did do deletes, it's not really a 3x problem - that's just
> theory - you'd have to work to get there. Deletes are merged out as you
> index additional docs as segments are merged over time. The 3x scenario
> brought up is more of a fun mind exercise than anything that would
> realistically happen.

And for completeness for those following along:

Let's say you did do some crazy deleting, and deleted half the docs in
your index. Those docs stay around, and their ids are just added to a list
that keeps those docs from being seen. Later, as natural merging
occurs, or if you force merges with an optimize, those deleted docs will
physically be removed. Let's then say you managed to re-add all of
those docs without any merging occurring while adding them (say
you wanted to see this effect so badly that you wrote and put in a custom
merge policy that doesn't find any segments to merge). Even if you do
all that, before you do the optimize, you're going to look at the size of
your index and see it's n GB. That's your current index size. Now say you
kick off the optimize. It's not even going to take 2x that n size to
optimize - this is because all those deletes will be removed as the
index is optimized down to one segment. It's going to take 2x.

This delete thing, as I said, is more of a fun mental exercise. It has
little relation to how much space you need to optimize in comparison to
how big your index is before optimizing. And it's really worse than a
worst-case scenario unless you write a custom merge policy, or crank
some settings insanely high and have enough RAM to do it (all the indexing
would have to take place in one huge segment in RAM that would then get
flushed).
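
To put rough numbers on the scenario above, here is a back-of-the-envelope
sketch. It assumes on-disk size is proportional to doc count and that an
optimize writes one new segment holding only the live docs while the old
segments are still on disk; both are simplifications, and the function name
is made up for illustration.

```python
def optimize_peak_ratio(live_docs, deleted_docs):
    """Peak disk use during optimize as a multiple of the pre-optimize size.

    Simplifying assumptions: size is proportional to doc count, and the
    optimized segment (live docs only) coexists with the old segments until
    the optimize finishes.
    """
    before = live_docs + deleted_docs  # deleted docs still occupy space
    if before <= 0:
        raise ValueError("index must contain at least one doc")
    after = live_docs                  # optimize drops the deleted docs
    return (before + after) / before


# No deletes: the new segment is as large as the existing index.
no_deletes = optimize_peak_ratio(100, 0)      # 2.0

# Half the docs deleted and then re-added with no merging in between:
# 100 live docs plus 50 deleted copies still on disk.
crazy_deletes = optimize_peak_ratio(100, 50)  # about 1.67
```

Under these assumptions the peak never exceeds 2x the size you measure
before optimizing, and pending deletes only push it lower.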


-- 
- Mark

http://www.lucidimagination.com





Re: best way to get the size of an index

2009-10-02 Thread Phillip Farber

Thanks, Mark. I really appreciate your confirmation.

Phil
