Re: index size before and after commit
Ha! Searching "partial optimize" on http://www.lucidimagination.com/search , we discover SOLR-603 which gives the 'maxSegments' option to the command. The text does not include the word 'partial'. It's on http://wiki.apache.org/solr/UpdateXmlMessages. The command gives a number of Lucene segments, and I have no idea how this will translate to disk space. To minimize disk space, you could run it repetitively with the number of segments decreasing to one. On Thu, Oct 1, 2009 at 11:49 AM, Lance Norskog wrote: > I've heard there is a new "partial optimize" feature in Lucene, but it > is not mentioned in the Solr or Lucene wikis so I cannot advise you > how to use it. > > On a previous project we had a 500GB index for 450m documents. It took > 14 hours to optimize. We found that Solr worked well (given enough RAM > for sorting and faceting requests) but that the IT logistics of a 500G > fileset were too much. > > Also, if you want your query servers to continue serving while > propogating the newly optimized index, you need 2X space to store both > copies on the slave during the transfer. For us this 35 minutes over > 1G ethernet. > > On Thu, Oct 1, 2009 at 7:36 AM, Walter Underwood > wrote: >> I've now worked on three different search engines and they all have a 3X >> worst >> case on space, so I'm familiar with this case. --wunder >> >> On Oct 1, 2009, at 7:15 AM, Mark Miller wrote: >> >>> Nice one ;) Its not technically a case where optimize requires > 2x >>> though in case the user asking gets confused. Its a case unrelated to >>> optimize that can grow your index. Then you need < 2x for the optimize, >>> since you won't copy the deletes. >>> >>> It also requires that you jump hoops to delete everything. If you delete >>> everything with *:*, that is smart enough not to just do a delete on >>> every document - it just creates a new index, allowing the removal of >>> the old very efficiently. >>> >>> Def agree on the more disk space. 
>>> >>> Walter Underwood wrote: Here is how you need 3X. First, index everything and optimize. Then delete everything and reindex without any merges. You have one full-size index containing only deleted docs, one full-size index containing reindexed docs, and need that much space for a third index. Honestly, disk is cheap, and there is no way to make Lucene work reliably with less disk. 1TB is a few hundred dollars. You have a free search engine, buy some disk. wunder On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote: >> 151GB or as little as from 183GB to 182GB. Is that size after a >> commit close to the size the index would be after an optimize? For >> that matter, are there cases where optimization can take more than >> 2x? I've heard of cases but have not observed them in my system. > > I seem to recall a case where it can be 3x, but I don't know that it > has been observed much. >>> >>> >>> -- >>> - Mark >>> >>> http://www.lucidimagination.com >>> >>> >>> >> >> > > > > -- > Lance Norskog > goks...@gmail.com > -- Lance Norskog goks...@gmail.com
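For reference, maxSegments is passed as an attribute on the optimize update message; the target count below is illustrative, not a recommendation:

```xml
<!-- POST to Solr's update handler. Merges the index down to at most
     10 segments instead of all the way to 1; the count is illustrative. -->
<optimize maxSegments="10"/>
```

Re-running it with a smaller maxSegments each time is the stepwise approach to reaching one segment.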
Re: index size before and after commit
I've heard there is a new "partial optimize" feature in Lucene, but it is not mentioned in the Solr or Lucene wikis so I cannot advise you how to use it. On a previous project we had a 500GB index for 450m documents. It took 14 hours to optimize. We found that Solr worked well (given enough RAM for sorting and faceting requests) but that the IT logistics of a 500G fileset were too much. Also, if you want your query servers to continue serving while propogating the newly optimized index, you need 2X space to store both copies on the slave during the transfer. For us this 35 minutes over 1G ethernet. On Thu, Oct 1, 2009 at 7:36 AM, Walter Underwood wrote: > I've now worked on three different search engines and they all have a 3X > worst > case on space, so I'm familiar with this case. --wunder > > On Oct 1, 2009, at 7:15 AM, Mark Miller wrote: > >> Nice one ;) Its not technically a case where optimize requires > 2x >> though in case the user asking gets confused. Its a case unrelated to >> optimize that can grow your index. Then you need < 2x for the optimize, >> since you won't copy the deletes. >> >> It also requires that you jump hoops to delete everything. If you delete >> everything with *:*, that is smart enough not to just do a delete on >> every document - it just creates a new index, allowing the removal of >> the old very efficiently. >> >> Def agree on the more disk space. >> >> Walter Underwood wrote: >>> >>> Here is how you need 3X. First, index everything and optimize. Then >>> delete everything and reindex without any merges. >>> >>> You have one full-size index containing only deleted docs, one >>> full-size index containing reindexed docs, and need that much space >>> for a third index. >>> >>> Honestly, disk is cheap, and there is no way to make Lucene work >>> reliably with less disk. 1TB is a few hundred dollars. You have a free >>> search engine, buy some disk. 
>>> >>> wunder >>> >>> On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote: >>> > 151GB or as little as from 183GB to 182GB. Is that size after a > commit close to the size the index would be after an optimize? For > that matter, are there cases where optimization can take more than > 2x? I've heard of cases but have not observed them in my system. I seem to recall a case where it can be 3x, but I don't know that it has been observed much. >>> >> >> >> -- >> - Mark >> >> http://www.lucidimagination.com >> >> >> > > -- Lance Norskog goks...@gmail.com
Re: index size before and after commit
I've now worked on three different search engines and they all have a 3X worst case on space, so I'm familiar with this case. --wunder

On Oct 1, 2009, at 7:15 AM, Mark Miller wrote:
> Nice one ;) It's not technically a case where optimize requires > 2x,
> though, in case the user asking gets confused. It's a case unrelated to
> optimize that can grow your index. Then you need < 2x for the optimize,
> since you won't copy the deletes.
>
> It also requires that you jump through hoops to delete everything. If
> you delete everything with *:*, that is smart enough not to just do a
> delete on every document - it just creates a new index, allowing the
> removal of the old very efficiently.
>
> Def agree on the more disk space.
Re: index size before and after commit
bq. and reindex without any merges.

That's actually quite a hoop to jump through as well - though if you're determined and you have tons of RAM, it's somewhat doable.

--
- Mark

http://www.lucidimagination.com
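For anyone attempting the no-merges reindex, the relevant knobs live in the index settings of solrconfig.xml; the values here are illustrative, not recommendations, and assume a correspondingly large heap:

```xml
<!-- Illustrative: buffer heavily in RAM and raise the merge factor so
     segments are large and merges are rare. Sizes depend on your heap. -->
<indexDefaults>
  <ramBufferSizeMB>1024</ramBufferSizeMB>
  <mergeFactor>100</mergeFactor>
</indexDefaults>
```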
Re: index size before and after commit
Nice one ;) It's not technically a case where optimize requires > 2x, though, in case the user asking gets confused. It's a case unrelated to optimize that can grow your index. Then you need < 2x for the optimize, since you won't copy the deletes.

It also requires that you jump through hoops to delete everything. If you delete everything with *:*, that is smart enough not to just do a delete on every document - it just creates a new index, allowing the removal of the old very efficiently.

Def agree on the more disk space.

Walter Underwood wrote:
> Here is how you need 3X. First, index everything and optimize. Then
> delete everything and reindex without any merges.
>
> You have one full-size index containing only deleted docs, one
> full-size index containing reindexed docs, and need that much space
> for a third index.
>
> Honestly, disk is cheap, and there is no way to make Lucene work
> reliably with less disk. 1TB is a few hundred dollars. You have a free
> search engine, buy some disk.
>
> wunder

--
- Mark

http://www.lucidimagination.com
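The delete-everything special case Mark describes is triggered by the standard delete-by-query update message with the match-all query:

```xml
<!-- Deleting by *:* lets Solr replace the index wholesale instead of
     marking every document deleted one by one. -->
<delete><query>*:*</query></delete>
<commit/>
```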
Re: index size before and after commit
Here is how you need 3X. First, index everything and optimize. Then delete everything and reindex without any merges.

You have one full-size index containing only deleted docs, one full-size index containing reindexed docs, and need that much space for a third index.

Honestly, disk is cheap, and there is no way to make Lucene work reliably with less disk. 1TB is a few hundred dollars. You have a free search engine, buy some disk.

wunder

On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote:
>> 151GB or as little as from 183GB to 182GB. Is that size after a
>> commit close to the size the index would be after an optimize? For
>> that matter, are there cases where optimization can take more than
>> 2x? I've heard of cases but have not observed them in my system.
>
> I seem to recall a case where it can be 3x, but I don't know that it
> has been observed much.
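Walter's worst case tallies up like this in round numbers (the 100GB figure is hypothetical, not from the thread):

```shell
# Hypothetical 100GB optimized index, walking through Walter's scenario.
full=100                            # GB: optimized index of all docs
deleted=$full                       # those segments now hold only deleted docs
reindexed=$full                     # a freshly reindexed copy of every doc
merge_target=$full                  # space to write the merged/optimized result
peak=$((deleted + reindexed + merge_target))
echo "peak disk needed: ${peak}GB"  # 3x the logical index size
```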
Re: index size before and after commit
Whoops - the way I have mail come in, it's not easy to tell if I'm replying to the Lucene or the Solr list ;)

The way Solr works with Searchers and reopen, it shouldn't run into a situation that requires greater than 2x to optimize. I won't guarantee it ;) But based on what I know, it shouldn't happen under normal circumstances.

--
- Mark

http://www.lucidimagination.com
Re: index size before and after commit
Phillip Farber wrote:
> I am trying to automate a build process that adds documents to 10
> shards over 5 machines and need to limit the size of a shard to no
> more than 200GB because I only have 400GB of disk available to
> optimize a given shard.
>
> Why does the size (du) of an index typically decrease after a commit?
> I've observed a decrease in size of as much as from 296GB down to
> 151GB or as little as from 183GB to 182GB. Is that size after a
> commit close to the size the index would be after an optimize?

Likely. Until you commit or close the Writer, the unoptimized index is the "live" index. And then you also have the optimized index. Once you commit and make the optimized index the "live" index, the unoptimized index can be removed (depending on your delete policy, which by default only keeps the latest commit point).

> For that matter, are there cases where optimization can take more than
> 2x? I've heard of cases but have not observed them in my system. I
> only do adds to the shards, never query them. An LVM snapshot of the
> shard receives the queries.

There are cases where it takes over 2x - but they involve using reopen. If you have more than one Reader on the index, and only reopen some of them, the new Readers created can hold open the partially optimized segments that existed at that moment, creating a need for greater than 2x.

> Is doing a commit before I take a du a reliable way to gauge the size
> of the shard? It is really bad news to allow a shard to go over 200GB
> in my use case. How do others manage this problem of 2x space needed
> to optimize with "limited" disk space?

Get more disk space ;) Or don't optimize. A lower mergeFactor can make optimizations less necessary.

--
- Mark

http://www.lucidimagination.com
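The lower merge factor Mark suggests is a one-line solrconfig.xml change; the value below is illustrative (10 is the default):

```xml
<!-- Fewer segments accumulate between merges, so a full optimize
     matters less; indexing pays with more frequent merging. -->
<mergeFactor>2</mergeFactor>
```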
Re: index size before and after commit
It may take some time before resources are released and garbage collected, so that may be part of the reason why things hang around and du doesn't report much of a drop.

On Oct 1, 2009, at 8:54 AM, Phillip Farber wrote:
> I am trying to automate a build process that adds documents to 10
> shards over 5 machines and need to limit the size of a shard to no
> more than 200GB because I only have 400GB of disk available to
> optimize a given shard.
>
> Why does the size (du) of an index typically decrease after a commit?
> I've observed a decrease in size of as much as from 296GB down to
> 151GB or as little as from 183GB to 182GB. Is that size after a
> commit close to the size the index would be after an optimize? For
> that matter, are there cases where optimization can take more than
> 2x? I've heard of cases but have not observed them in my system.

I seem to recall a case where it can be 3x, but I don't know that it has been observed much.

> I only do adds to the shards, never query them. An LVM snapshot of
> the shard receives the queries.
>
> Is doing a commit before I take a du a reliable way to gauge the size
> of the shard? It is really bad news to allow a shard to go over 200GB
> in my use case. How do others manage this problem of 2x space needed
> to optimize with "limited" disk space?

Do you need to optimize at all?

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
index size before and after commit
I am trying to automate a build process that adds documents to 10 shards over 5 machines and need to limit the size of a shard to no more than 200GB, because I only have 400GB of disk available to optimize a given shard.

Why does the size (du) of an index typically decrease after a commit? I've observed a decrease in size of as much as from 296GB down to 151GB, or as little as from 183GB to 182GB. Is the size after a commit close to the size the index would be after an optimize?

For that matter, are there cases where optimization can take more than 2x? I've heard of cases but have not observed them in my system. I only do adds to the shards, never query them. An LVM snapshot of the shard receives the queries.

Is doing a commit before I take a du a reliable way to gauge the size of the shard? It is really bad news to allow a shard to go over 200GB in my use case. How do others manage this problem of 2x space needed to optimize with "limited" disk space?

Advice greatly appreciated.

Phil
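One way to wire the 200GB cap into the build automation is a du check run after each commit; this script is a hypothetical sketch (the shard path is made up):

```shell
#!/bin/sh
# Guard a shard against growing past the 200GB cap. Run after a commit,
# since du is only meaningful once the old segments have been released.
SHARD_DIR=${1:-/solr/shard01/data/index}   # hypothetical path; substitute yours
LIMIT_KB=$((200 * 1024 * 1024))            # 200GB expressed in KB for du -k

size_kb=$(du -sk "$SHARD_DIR" | cut -f1)
if [ "$size_kb" -gt "$LIMIT_KB" ]; then
    echo "shard at ${size_kb}KB exceeds 200GB cap; stop adding docs" >&2
    exit 1
fi
echo "shard size OK: ${size_kb}KB"
```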