You can always increase the maximum segment size. For large indexes that should reduce the number of segments. But watch your indexing stats; I can't predict the consequences of bumping it to 100G, for instance. I'd _expect_ bursty I/O when those large segments started to be created or merged....
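As a back-of-the-envelope illustration (my own numbers, not measured anywhere): with a fixed maximum segment size, the floor on segment count for a fully merged index is roughly the index size divided by the max segment size, rounded up. A minimal sketch, assuming sizes in GB and the 5 GB TieredMergePolicy default:

```java
// Rough sketch, not from the thread: the minimum possible segment count
// for a fully merged index, given a hard cap on segment size.
// Sizes in GB; the 250 GB index and 100 GB cap are illustrative.
public class SegmentFloor {
    static long minSegments(double indexGb, double maxSegmentGb) {
        return (long) Math.ceil(indexGb / maxSegmentGb);
    }

    public static void main(String[] args) {
        // At the 5 GB TieredMergePolicy default, a 250 GB index can't
        // merge below 50 segments.
        System.out.println(minSegments(250, 5));   // 50
        // Raising the cap to 100G drops that floor to 3.
        System.out.println(minSegments(250, 100)); // 3
    }
}
```

The trade-off is exactly the bursty I/O mentioned above: fewer, larger segments mean each merge rewrites far more data at once.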
You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably) the idea of increasing the segment sizes, and/or a related JIRA that allows you to tweak how aggressively Solr merges segments that have deleted docs. NOTE: that JIRA has the consequence that _by default_ optimize with no parameters respects the maximum segment size, which is a change from the current behavior. Finally, expungeDeletes may be useful, as that too will respect the max segment size, again after LUCENE-7976 is committed.

Best,
Erick

On Wed, May 2, 2018 at 9:22 AM, Michael Joyner <mich...@newsrx.com> wrote:
> The main reason we go this route is that after a while (with default
> settings) we end up with hundreds of segments, and performance of course
> drops abysmally as a result. By using a stepped optimize a) we don't run
> into the "we need 3x+ head room" issue, and b) the performance penalty
> during the optimize is less than the penalty of leaving hundreds of
> segments unoptimized.
>
> BTW, as we use a batched insert/update cycle [once daily], we only
> optimize down to a single segment after a complete batch has been run.
> During the batch we reduce segment counts down to a max of 16 every 250K
> inserts/updates to prevent the large-segment-count performance penalty.
>
> On 04/30/2018 07:10 PM, Erick Erickson wrote:
>> There's really no good way to purge deleted documents from the index
>> other than to wait until merging happens.
>>
>> Optimize/forceMerge and expungeDeletes both suffer from the problem
>> that they create massive segments that then stick around for a very
>> long time, see:
>>
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner <mich...@newsrx.com>
>> wrote:
>>> Based on experience, 2x head room is not always enough, sometimes not
>>> even 3x, if you are optimizing from many segments down to 1 segment
>>> in a single go.
>>>
>>> We have however figured out a way that can work with as little as 51%
>>> free space via the following iteration cycle:
>>>
>>> public void solrOptimize() {
>>>     int initialMaxSegments = 256;
>>>     int finalMaxSegments = 1;
>>>     if (isShowSegmentCounter()) {
>>>         log.info("Optimizing ...");
>>>     }
>>>     try (SolrClient solrServerInstance = getSolrClientInstance()) {
>>>         for (int segments = initialMaxSegments;
>>>                 segments >= finalMaxSegments; segments--) {
>>>             if (isShowSegmentCounter()) {
>>>                 System.out.println("Optimizing to a max of "
>>>                         + segments + " segments.");
>>>             }
>>>             solrServerInstance.optimize(true, true, segments);
>>>         }
>>>     } catch (SolrServerException | IOException e) {
>>>         throw new RuntimeException(e);
>>>     }
>>> }
>>>
>>> On 04/30/2018 04:23 PM, Walter Underwood wrote:
>>>> You need 2X the minimum index size in disk space anyway, so don't
>>>> worry about keeping the indexes as small as possible. Worry about
>>>> having enough headroom.
>>>>
>>>> If your indexes are 250 GB, you need 250 GB of free space.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On Apr 30, 2018, at 1:13 PM, Antony A <antonyaugus...@gmail.com> wrote:
>>>>>
>>>>> Thanks Erick/Deepak.
>>>>>
>>>>> The cloud is running on bare metal (128 GB RAM / 24 CPUs).
>>>>>
>>>>> Is there an option to run a compact on the data files to make the
>>>>> size equal on both the clouds? I am trying to find all the options
>>>>> before I add the new fields into the production cloud.
>>>>>
>>>>> Thanks,
>>>>> AA
>>>>>
>>>>> On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson
>>>>> <erickerick...@gmail.com> wrote:
>>>>>
>>>>>> Anthony:
>>>>>>
>>>>>> You are probably seeing the results of removing deleted documents
>>>>>> from the shards as they're merged. Even on replicas in the same
>>>>>> _shard_, the size of the index on disk won't necessarily be identical.
>>>>>> This has to do with which segments are selected for merging, which
>>>>>> is not necessarily coordinated across replicas.
>>>>>>
>>>>>> The test is whether the number of docs in each collection is the
>>>>>> same. If it is, then don't worry about index sizes.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel <deic...@gmail.com>
>>>>>> wrote:
>>>>>>> Could you please also give the machine details of the two clouds
>>>>>>> you are running?
>>>>>>>
>>>>>>> Deepak
>>>>>>> "The greatness of a nation can be judged by the way its animals are
>>>>>>> treated. Please stop cruelty to Animals, become a Vegan"
>>>>>>>
>>>>>>> +91 73500 12833
>>>>>>> deic...@gmail.com
>>>>>>>
>>>>>>> Facebook: https://www.facebook.com/deicool
>>>>>>> LinkedIn: www.linkedin.com/in/deicool
>>>>>>>
>>>>>>> "Plant a Tree, Go Green"
>>>>>>>
>>>>>>> Make In India : http://www.makeinindia.com/home
>>>>>>>
>>>>>>> On Mon, Apr 30, 2018 at 9:51 PM, Antony A <antonyaugus...@gmail.com>
>>>>>>> wrote:
>>>>>>>> Hi Shawn,
>>>>>>>>
>>>>>>>> The cloud is running version 6.2.1 with ClassicIndexSchemaFactory.
>>>>>>>>
>>>>>>>> The sum of sizes from the admin UI across all the shards is around
>>>>>>>> 265 GB vs 224 GB between the two clouds.
>>>>>>>>
>>>>>>>> I created the collection using "numShards", so the compositeId
>>>>>>>> router.
>>>>>>>>
>>>>>>>> If you need more information, please let me know.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> AA
>>>>>>>>
>>>>>>>> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey <apa...@elyograg.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 4/30/2018 9:51 AM, Antony A wrote:
>>>>>>>>>
>>>>>>>>>> I am running two separate Solr clouds. I have 8 shards in each,
>>>>>>>>>> with a total of 300 million documents. Both clouds are indexing
>>>>>>>>>> documents from the same source/configuration.
>>>>>>>>>>
>>>>>>>>>> I am noticing there is a difference in the size of the collection
>>>>>>>>>> between them. I am planning to add more shards to see if that
>>>>>>>>>> helps solve the issue. Has anyone come across a similar issue?
>>>>>>>>>
>>>>>>>>> There's no information here about exactly what you are seeing,
>>>>>>>>> what you are expecting to see, and why you believe that what you
>>>>>>>>> are seeing is wrong.
>>>>>>>>>
>>>>>>>>> You did say that there is "a difference in size". That is a very
>>>>>>>>> vague problem description.
>>>>>>>>>
>>>>>>>>> FYI, unless a SolrCloud collection is using the implicit router,
>>>>>>>>> you cannot add shards. And if it *IS* using the implicit router,
>>>>>>>>> then you are 100% in control of document routing -- Solr cannot
>>>>>>>>> influence that at all.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
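Pulling together the headroom numbers from the thread: Walter's rule of thumb is free space equal to the index size for a one-shot optimize, while Michael reported getting by with roughly 51% free using the stepped optimize. A small sketch of that arithmetic; the method names and the 250 GB figure are illustrative, not from any Solr API:

```java
// Sketch of the free-space arithmetic discussed in this thread.
// A one-shot optimize rewrites the entire index into new segments, so
// worst case you need roughly a full copy's worth of free space. The
// stepped approach reported here got by with ~51% of the index size.
public class Headroom {
    // Free space (GB) for a one-shot optimize: one full index copy.
    static double singlePassFreeGb(double indexGb) {
        return indexGb;
    }

    // Free space (GB) under the ~51% figure reported for stepped optimize.
    static double steppedFreeGb(double indexGb) {
        return indexGb * 0.51;
    }

    public static void main(String[] args) {
        System.out.println(singlePassFreeGb(250)); // 250.0, Walter's rule of thumb
        System.out.println(steppedFreeGb(250));    // ~127.5, per the 51% claim
    }
}
```

The stepped loop pays for this saving in total I/O: each pass from 256 down to 1 segment can re-merge data that earlier passes already rewrote.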