There's really no good way to purge deleted documents from the index other than to wait until merging happens.
Optimize/forceMerge and expungeDeletes both suffer from the problem that
they create massive segments that then stick around for a very long
time, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Best,
Erick

On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner <mich...@newsrx.com> wrote:
> Based on experience, 2x head room is not always enough, sometimes
> not even 3x, if you are optimizing from many segments down to 1 segment
> in a single go.
>
> We have however figured out a way that can work with as little as 51% free
> space via the following iteration cycle:
>
> import java.io.IOException;
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.SolrServerException;
>
> // Lower the segment cap one step at a time instead of merging
> // everything down to a single segment in one pass.
> public void solrOptimize() {
>     int initialMaxSegments = 256;
>     int finalMaxSegments = 1;
>     if (isShowSegmentCounter()) {
>         log.info("Optimizing ...");
>     }
>     try (SolrClient solrServerInstance = getSolrClientInstance()) {
>         for (int segments = initialMaxSegments;
>                 segments >= finalMaxSegments; segments--) {
>             if (isShowSegmentCounter()) {
>                 System.out.println("Optimizing to a max of " + segments
>                         + " segments.");
>             }
>             // waitFlush = true, waitSearcher = true, maxSegments = segments
>             solrServerInstance.optimize(true, true, segments);
>         }
>     } catch (SolrServerException | IOException e) {
>         throw new RuntimeException(e);
>     }
> }
>
>
> On 04/30/2018 04:23 PM, Walter Underwood wrote:
>>
>> You need 2X the minimum index size in disk space anyway, so don’t worry
>> about keeping the indexes as small as possible. Worry about having enough
>> headroom.
>>
>> If your indexes are 250 GB, you need 250 GB of free space.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Apr 30, 2018, at 1:13 PM, Antony A <antonyaugus...@gmail.com> wrote:
>>>
>>> Thanks Erick/Deepak.
>>>
>>> The cloud is running on baremetal (128 GB/24 cpu).
>>>
>>> Is there an option to run a compact on the data files to make the size
>>> equal on both the clouds? I am trying to find all the options before I
>>> add the new fields into the production cloud.
>>>
>>> Thanks
>>> AA
>>>
>>> On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson
>>> <erickerick...@gmail.com> wrote:
>>>
>>>> Anthony:
>>>>
>>>> You are probably seeing the results of removing deleted documents from
>>>> the shards as they're merged. Even on replicas in the same _shard_,
>>>> the size of the index on disk won't necessarily be identical. This has
>>>> to do with which segments are selected for merging, which are not
>>>> necessarily coordinated across replicas.
>>>>
>>>> The test is if the number of docs on each collection is the same. If
>>>> it is, then don't worry about index sizes.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel <deic...@gmail.com> wrote:
>>>>>
>>>>> Could you please also give the machine details of the two clouds you
>>>>> are running?
>>>>>
>>>>> Deepak
>>>>> "The greatness of a nation can be judged by the way its animals are
>>>>> treated. Please stop cruelty to Animals, become a Vegan"
>>>>>
>>>>> +91 73500 12833
>>>>> deic...@gmail.com
>>>>>
>>>>> Facebook: https://www.facebook.com/deicool
>>>>> LinkedIn: www.linkedin.com/in/deicool
>>>>>
>>>>> "Plant a Tree, Go Green"
>>>>>
>>>>> Make In India : http://www.makeinindia.com/home
>>>>>
>>>>> On Mon, Apr 30, 2018 at 9:51 PM, Antony A <antonyaugus...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Shawn,
>>>>>>
>>>>>> The cloud is running version 6.2.1. with ClassicIndexSchemaFactory
>>>>>>
>>>>>> The sum of size from admin UI on all the shards is around 265 G vs
>>>>>> 224 G between the two clouds.
>>>>>>
>>>>>> I created the collection using "numShards" so compositeId router.
>>>>>>
>>>>>> If you need more information, please let me know.
>>>>>>
>>>>>> Thanks
>>>>>> AA
>>>>>>
>>>>>> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey <apa...@elyograg.org>
>>>>>> wrote:
>>>>>>
>>>>>>> On 4/30/2018 9:51 AM, Antony A wrote:
>>>>>>>
>>>>>>>> I am running two separate solr clouds.
>>>>>>>> I have 8 shards in each with a total of 300 million documents.
>>>>>>>> Both the clouds are indexing the document from the same
>>>>>>>> source/configuration.
>>>>>>>>
>>>>>>>> I am noticing there is a difference in the size of the collection
>>>>>>>> between them. I am planning to add more shards to see if that helps
>>>>>>>> solve the issue. Has anyone come across similar issue?
>>>>>>>>
>>>>>>> There's no information here about exactly what you are seeing, what
>>>>>>> you are expecting to see, and why you believe that what you are
>>>>>>> seeing is wrong.
>>>>>>>
>>>>>>> You did say that there is "a difference in size". That is a very
>>>>>>> vague problem description.
>>>>>>>
>>>>>>> FYI, unless a SolrCloud collection is using the implicit router, you
>>>>>>> cannot add shards. And if it *IS* using the implicit router, then
>>>>>>> you are 100% in control of document routing -- Solr cannot influence
>>>>>>> that at all.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shawn
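
For anyone who wants to try the stepwise approach Michael Joyner describes in this thread without touching a live cluster first, here is a minimal, dependency-free sketch of the same countdown. The class name StepwiseOptimize, the methods targets and run, and the IntConsumer callback are all hypothetical and not part of SolrJ. The callback only stands in for a real call such as solrClient.optimize(true, true, segments), so the sketch shows the iteration order and nothing about actual disk usage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical helper, not a SolrJ class: models the countdown of
// maxSegments targets used by the stepwise optimize loop in the thread.
final class StepwiseOptimize {

    // Segment-count targets from initialMax down to finalMax, inclusive.
    static List<Integer> targets(int initialMax, int finalMax) {
        if (finalMax < 1 || initialMax < finalMax) {
            throw new IllegalArgumentException("need initialMax >= finalMax >= 1");
        }
        List<Integer> result = new ArrayList<>();
        for (int segments = initialMax; segments >= finalMax; segments--) {
            result.add(segments);
        }
        return result;
    }

    // Invokes the callback once per target and returns the number of passes.
    // In real use the callback would wrap the actual optimize request.
    static int run(int initialMax, int finalMax, IntConsumer optimizeTo) {
        int passes = 0;
        for (int segments : targets(initialMax, finalMax)) {
            optimizeTo.accept(segments);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // Assumption: printing stands in for the optimize request here.
        int passes = run(8, 1, segments ->
                System.out.println("Optimizing to a max of " + segments + " segments."));
        System.out.println(passes + " passes");
    }
}
```

Passing the optimize call in as a callback keeps the loop testable on its own. Whether the stepwise schedule really stays within the free space you have is something to verify against your own index.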