Re: index size before and after commit
Ha! Searching "partial optimize" on http://www.lucidimagination.com/search , we discover SOLR-603 which gives the 'maxSegments' option to the command. The text does not include the word 'partial'. It's on http://wiki.apache.org/solr/UpdateXmlMessages. The command gives a number of Lucene segments, and I have no idea how this will translate to disk space. To minimize disk space, you could run it repetitively with the number of segments decreasing to one. On Thu, Oct 1, 2009 at 11:49 AM, Lance Norskog wrote: > I've heard there is a new "partial optimize" feature in Lucene, but it > is not mentioned in the Solr or Lucene wikis so I cannot advise you > how to use it. > > On a previous project we had a 500GB index for 450m documents. It took > 14 hours to optimize. We found that Solr worked well (given enough RAM > for sorting and faceting requests) but that the IT logistics of a 500G > fileset were too much. > > Also, if you want your query servers to continue serving while > propogating the newly optimized index, you need 2X space to store both > copies on the slave during the transfer. For us this 35 minutes over > 1G ethernet. > > On Thu, Oct 1, 2009 at 7:36 AM, Walter Underwood > wrote: >> I've now worked on three different search engines and they all have a 3X >> worst >> case on space, so I'm familiar with this case. --wunder >> >> On Oct 1, 2009, at 7:15 AM, Mark Miller wrote: >> >>> Nice one ;) Its not technically a case where optimize requires > 2x >>> though in case the user asking gets confused. Its a case unrelated to >>> optimize that can grow your index. Then you need < 2x for the optimize, >>> since you won't copy the deletes. >>> >>> It also requires that you jump hoops to delete everything. If you delete >>> everything with *:*, that is smart enough not to just do a delete on >>> every document - it just creates a new index, allowing the removal of >>> the old very efficiently. >>> >>> Def agree on the more disk space. 
>>> >>> Walter Underwood wrote: Here is how you need 3X. First, index everything and optimize. Then delete everything and reindex without any merges. You have one full-size index containing only deleted docs, one full-size index containing reindexed docs, and need that much space for a third index. Honestly, disk is cheap, and there is no way to make Lucene work reliably with less disk. 1TB is a few hundred dollars. You have a free search engine, buy some disk. wunder On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote: >> 151GB or as little as from 183GB to 182GB. Is that size after a >> commit close to the size the index would be after an optimize? For >> that matter, are there cases where optimization can take more than >> 2x? I've heard of cases but have not observed them in my system. > > I seem to recall a case where it can be 3x, but I don't know that it > has been observed much. >>> >>> >>> -- >>> - Mark >>> >>> http://www.lucidimagination.com >>> >>> >>> >> >> > > > > -- > Lance Norskog > goks...@gmail.com > -- Lance Norskog goks...@gmail.com
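For reference, maxSegments is passed as an attribute on the optimize update message; the target count below is illustrative, not a recommendation:

```xml
<!-- POST to Solr's update handler. Merges the index down to at most
     10 segments instead of all the way to 1; the count is illustrative. -->
<optimize maxSegments="10"/>
```

Re-running it with a smaller maxSegments each time is the stepwise approach to reaching one segment.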
Re: index size before and after commit
I've heard there is a new "partial optimize" feature in Lucene, but it is not mentioned in the Solr or Lucene wikis so I cannot advise you how to use it. On a previous project we had a 500GB index for 450m documents. It took 14 hours to optimize. We found that Solr worked well (given enough RAM for sorting and faceting requests) but that the IT logistics of a 500G fileset were too much. Also, if you want your query servers to continue serving while propogating the newly optimized index, you need 2X space to store both copies on the slave during the transfer. For us this 35 minutes over 1G ethernet. On Thu, Oct 1, 2009 at 7:36 AM, Walter Underwood wrote: > I've now worked on three different search engines and they all have a 3X > worst > case on space, so I'm familiar with this case. --wunder > > On Oct 1, 2009, at 7:15 AM, Mark Miller wrote: > >> Nice one ;) Its not technically a case where optimize requires > 2x >> though in case the user asking gets confused. Its a case unrelated to >> optimize that can grow your index. Then you need < 2x for the optimize, >> since you won't copy the deletes. >> >> It also requires that you jump hoops to delete everything. If you delete >> everything with *:*, that is smart enough not to just do a delete on >> every document - it just creates a new index, allowing the removal of >> the old very efficiently. >> >> Def agree on the more disk space. >> >> Walter Underwood wrote: >>> >>> Here is how you need 3X. First, index everything and optimize. Then >>> delete everything and reindex without any merges. >>> >>> You have one full-size index containing only deleted docs, one >>> full-size index containing reindexed docs, and need that much space >>> for a third index. >>> >>> Honestly, disk is cheap, and there is no way to make Lucene work >>> reliably with less disk. 1TB is a few hundred dollars. You have a free >>> search engine, buy some disk. 
>>> >>> wunder >>> >>> On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote: >>> > 151GB or as little as from 183GB to 182GB. Is that size after a > commit close to the size the index would be after an optimize? For > that matter, are there cases where optimization can take more than > 2x? I've heard of cases but have not observed them in my system. I seem to recall a case where it can be 3x, but I don't know that it has been observed much. >>> >> >> >> -- >> - Mark >> >> http://www.lucidimagination.com >> >> >> > > -- Lance Norskog goks...@gmail.com
Re: index size before and after commit
I've now worked on three different search engines and they all have a 3X worst case on space, so I'm familiar with this case. --wunder

On Oct 1, 2009, at 7:15 AM, Mark Miller wrote:
> Nice one ;) It's not technically a case where optimize requires > 2x,
> though, in case the user asking gets confused. It's a case unrelated to
> optimize that can grow your index. Then you need < 2x for the optimize,
> since you won't copy the deletes.
>
> It also requires that you jump through hoops to delete everything. If
> you delete everything with *:*, that is smart enough not to just do a
> delete on every document - it just creates a new index, allowing the
> removal of the old very efficiently.
>
> Def agree on the more disk space.
Re: index size before and after commit
bq. and reindex without any merges.

That's actually quite a hoop to jump through as well - though if you're determined and you have tons of RAM, it's somewhat doable.

--
- Mark

http://www.lucidimagination.com
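For anyone attempting the no-merges reindex, the relevant knobs live in the index settings of solrconfig.xml; the values here are illustrative, not recommendations, and assume a correspondingly large heap:

```xml
<!-- Illustrative: buffer heavily in RAM and raise the merge factor so
     segments are large and merges are rare. Sizes depend on your heap. -->
<indexDefaults>
  <ramBufferSizeMB>1024</ramBufferSizeMB>
  <mergeFactor>100</mergeFactor>
</indexDefaults>
```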
Re: index size before and after commit
Nice one ;) It's not technically a case where optimize requires > 2x, though, in case the user asking gets confused. It's a case unrelated to optimize that can grow your index. Then you need < 2x for the optimize, since you won't copy the deletes.

It also requires that you jump through hoops to delete everything. If you delete everything with *:*, that is smart enough not to just do a delete on every document - it just creates a new index, allowing the removal of the old very efficiently.

Def agree on the more disk space.

Walter Underwood wrote:
> Here is how you need 3X. First, index everything and optimize. Then
> delete everything and reindex without any merges.
>
> You have one full-size index containing only deleted docs, one
> full-size index containing reindexed docs, and need that much space
> for a third index.
>
> Honestly, disk is cheap, and there is no way to make Lucene work
> reliably with less disk. 1TB is a few hundred dollars. You have a free
> search engine, buy some disk.
>
> wunder

--
- Mark

http://www.lucidimagination.com
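The delete-everything special case Mark describes is triggered by the standard delete-by-query update message with the match-all query:

```xml
<!-- Deleting by *:* lets Solr replace the index wholesale instead of
     marking every document deleted one by one. -->
<delete><query>*:*</query></delete>
<commit/>
```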
Re: index size before and after commit
Here is how you need 3X. First, index everything and optimize. Then delete everything and reindex without any merges.

You have one full-size index containing only deleted docs, one full-size index containing reindexed docs, and need that much space for a third index.

Honestly, disk is cheap, and there is no way to make Lucene work reliably with less disk. 1TB is a few hundred dollars. You have a free search engine, buy some disk.

wunder

On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote:
>> 151GB or as little as from 183GB to 182GB. Is that size after a
>> commit close to the size the index would be after an optimize? For
>> that matter, are there cases where optimization can take more than
>> 2x? I've heard of cases but have not observed them in my system.
>
> I seem to recall a case where it can be 3x, but I don't know that it
> has been observed much.
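Walter's worst case tallies up like this in round numbers (the 100GB figure is hypothetical, not from the thread):

```shell
# Hypothetical 100GB optimized index, walking through Walter's scenario.
full=100                            # GB: optimized index of all docs
deleted=$full                       # those segments now hold only deleted docs
reindexed=$full                     # a freshly reindexed copy of every doc
merge_target=$full                  # space to write the merged/optimized result
peak=$((deleted + reindexed + merge_target))
echo "peak disk needed: ${peak}GB"  # 3x the logical index size
```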
Re: index size before and after commit
Whoops - the way I have mail come in, it's not easy to tell if I'm replying to the Lucene or the Solr list ;)

The way Solr works with Searchers and reopen, it shouldn't run into a situation that requires greater than 2x to optimize. I won't guarantee it ;) But based on what I know, it shouldn't happen under normal circumstances.

--
- Mark

http://www.lucidimagination.com
Re: index size before and after commit
Phillip Farber wrote:
> I am trying to automate a build process that adds documents to 10
> shards over 5 machines and need to limit the size of a shard to no
> more than 200GB because I only have 400GB of disk available to
> optimize a given shard.
>
> Why does the size (du) of an index typically decrease after a commit?
> I've observed a decrease in size of as much as from 296GB down to
> 151GB or as little as from 183GB to 182GB. Is that size after a
> commit close to the size the index would be after an optimize?

Likely. Until you commit or close the Writer, the unoptimized index is the "live" index. And then you also have the optimized index. Once you commit and make the optimized index the "live" index, the unoptimized index can be removed (depending on your delete policy, which by default only keeps the latest commit point).

> For that matter, are there cases where optimization can take more than
> 2x? I've heard of cases but have not observed them in my system. I
> only do adds to the shards, never query them. An LVM snapshot of the
> shard receives the queries.

There are cases where it takes over 2x - but they involve using reopen. If you have more than one Reader on the index, and only reopen some of them, the new Readers created can hold open the partially optimized segments that existed at that moment, creating a need for greater than 2x.

> Is doing a commit before I take a du a reliable way to gauge the size
> of the shard? It is really bad news to allow a shard to go over 200GB
> in my use case. How do others manage this problem of 2x space needed
> to optimize with "limited" disk space?

Get more disk space ;) Or don't optimize. A lower mergeFactor can make optimizations less necessary.

--
- Mark

http://www.lucidimagination.com
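The lower merge factor Mark suggests is a one-line solrconfig.xml change; the value below is illustrative (10 is the default):

```xml
<!-- Fewer segments accumulate between merges, so a full optimize
     matters less; indexing pays with more frequent merging. -->
<mergeFactor>2</mergeFactor>
```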
Re: index size before and after commit
It may take some time before resources are released and garbage collected, so that may be part of the reason why things hang around and du doesn't report much of a drop.

On Oct 1, 2009, at 8:54 AM, Phillip Farber wrote:
> I am trying to automate a build process that adds documents to 10
> shards over 5 machines and need to limit the size of a shard to no
> more than 200GB because I only have 400GB of disk available to
> optimize a given shard.
>
> Why does the size (du) of an index typically decrease after a commit?
> I've observed a decrease in size of as much as from 296GB down to
> 151GB or as little as from 183GB to 182GB. Is that size after a
> commit close to the size the index would be after an optimize? For
> that matter, are there cases where optimization can take more than
> 2x? I've heard of cases but have not observed them in my system.

I seem to recall a case where it can be 3x, but I don't know that it has been observed much.

> I only do adds to the shards, never query them. An LVM snapshot of
> the shard receives the queries.
>
> Is doing a commit before I take a du a reliable way to gauge the size
> of the shard? It is really bad news to allow a shard to go over 200GB
> in my use case. How do others manage this problem of 2x space needed
> to optimize with "limited" disk space?

Do you need to optimize at all?

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
index size before and after commit
I am trying to automate a build process that adds documents to 10 shards over 5 machines and need to limit the size of a shard to no more than 200GB, because I only have 400GB of disk available to optimize a given shard.

Why does the size (du) of an index typically decrease after a commit? I've observed a decrease in size of as much as from 296GB down to 151GB, or as little as from 183GB to 182GB. Is the size after a commit close to the size the index would be after an optimize?

For that matter, are there cases where optimization can take more than 2x? I've heard of cases but have not observed them in my system. I only do adds to the shards, never query them. An LVM snapshot of the shard receives the queries.

Is doing a commit before I take a du a reliable way to gauge the size of the shard? It is really bad news to allow a shard to go over 200GB in my use case. How do others manage this problem of 2x space needed to optimize with "limited" disk space?

Advice greatly appreciated.

Phil
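One way to wire the 200GB cap into the build automation is a du check run after each commit; this script is a hypothetical sketch (the shard path is made up):

```shell
#!/bin/sh
# Guard a shard against growing past the 200GB cap. Run after a commit,
# since du is only meaningful once the old segments have been released.
SHARD_DIR=${1:-/solr/shard01/data/index}   # hypothetical path; substitute yours
LIMIT_KB=$((200 * 1024 * 1024))            # 200GB expressed in KB for du -k

size_kb=$(du -sk "$SHARD_DIR" | cut -f1)
if [ "$size_kb" -gt "$LIMIT_KB" ]; then
    echo "shard at ${size_kb}KB exceeds 200GB cap; stop adding docs" >&2
    exit 1
fi
echo "shard size OK: ${size_kb}KB"
```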