You might want to test a soft commit interval of hours vs. 5 minutes for heavy indexing + light query -- even though there is internal memory-structure overhead when you never soft commit, in our testing a 5-minute soft commit (via commitWithin) resulted in very large heap usage, which I suspect is because of other overhead associated with it.
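For concreteness, here is a minimal SolrJ sketch of the commitWithin mechanism in question; the URL, collection name and the 5-minute interval are placeholder values, not a recommendation:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder base URL; point this at your own cluster.
            try (HttpSolrClient client =
                    new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
                List<SolrInputDocument> docs = new ArrayList<>();
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                docs.add(doc);
                // commitWithin = 300000 ms: Solr guarantees a commit (by
                // default a soft commit) within 5 minutes of these adds.
                client.add("mycollection", docs, 300_000);
            }
        }
    }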
On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson <erickerick...@gmail.com> wrote:

> I need to update that, didn't understand the bits about retaining
> internal memory structures at the time.
>
> > On Jun 4, 2019, at 2:10 AM, John Davis <johndavis925...@gmail.com> wrote:
> >
> > Erick - These conflict, what's changed?
> >
> > So if I were going to recommend settings, they'd be something like this:
> > Do a hard commit with openSearcher=false every 60 seconds.
> > Do a soft commit every 5 minutes.
> >
> > vs
> >
> > Index-heavy, Query-light
> > Set your soft commit interval quite long, up to the maximum latency you
> > can stand for documents to be visible. This could be just a couple of
> > minutes or much longer. Maybe even hours, with the capability of issuing
> > a hard commit (openSearcher=true) or soft commit on demand.
> > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >>> I've looked through SolrJ, DIH and others -- is the bottom line
> >>> across all of them to "batch updates" and not commit as long as
> >>> possible?
> >>
> >> Of course it's more complicated than that ;)...
> >>
> >> But to start, yes, I urge you to batch. Here are some stats:
> >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> >>
> >> Note that at about 100 docs/batch you hit diminishing returns.
> >> _However_, that test was run on a single-shard collection, so if you
> >> have 10 shards you'd have to send 1,000 docs/batch. I wouldn't sweat
> >> that number much, just don't send one at a time. And there are the
> >> usual gotchas if your documents are 1M vs. 1K.
> >>
> >> About committing: no, don't hold off as long as possible. When you
> >> commit, segments are merged. _However_, the default 100M internal
> >> buffer size (ramBufferSizeMB) means that segments are written anyway
> >> whenever you accumulate 100M of index data, even if you never hit a
> >> commit point, and merges happen anyway. So you won't save anything on
> >> merging by holding off commits, and you'll incur penalties. Here's
> >> more than you want to know about commits:
> >> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >>
> >> But some key take-aways... If for some reason Solr terminates
> >> abnormally, the accumulated documents since the last hard commit are
> >> replayed. So say you don't commit for an hour of furious indexing and
> >> someone does a "kill -9". When you restart Solr, it'll try to re-index
> >> all the docs from the last hour. Hard commits with openSearcher=false
> >> aren't all that expensive. I usually set mine for a minute and forget
> >> about it.
> >>
> >> Transaction logs hold a window, _not_ the entire set of operations
> >> since time began. When you do a hard commit, the current tlog is
> >> closed, a new one is opened, and ones that are "too old" are deleted.
> >> If you never commit, you have a huge transaction log to no good
> >> purpose.
> >>
> >> Also, while indexing, in order to accommodate "Real Time Get", all the
> >> docs indexed since the last searcher was opened have a pointer kept in
> >> memory. So if you _never_ open a new searcher, that internal structure
> >> can get quite large. So in bulk-indexing operations, I suggest you
> >> open a searcher every so often.
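A minimal SolrJ sketch of the batching advice above -- the URL, collection name and loop bounds are made-up values, and the 1,000-doc batch assumes the 10-shard rule of thumb mentioned earlier:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                List<SolrInputDocument> batch = new ArrayList<>(1000);
                for (int i = 0; i < 100_000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Integer.toString(i));
                    batch.add(doc);
                    // ~100 docs/batch per shard; 1000 assumes ~10 shards.
                    if (batch.size() == 1000) {
                        client.add(batch);   // no commit per batch...
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                // ...document visibility is left to the autoCommit /
                // autoSoftCommit settings in solrconfig.xml, per the
                // advice above.
            }
        }
    }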
> >> Opening a new searcher isn't terribly expensive if you have no
> >> autowarming going on. Autowarming is what's defined in solrconfig.xml
> >> on filterCache, queryResultCache, etc.
> >>
> >> So if I were going to recommend settings, they'd be something like this:
> >> Do a hard commit with openSearcher=false every 60 seconds.
> >> Do a soft commit every 5 minutes.
> >>
> >> I'd actually be surprised if you were able to measure differences
> >> between those settings and just a hard commit with openSearcher=true
> >> every 60 seconds and a soft commit at -1 (never)...
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com> wrote:
> >>>
> >>> If we assume there is no query load, then effectively this boils down
> >>> to the most effective way of adding a large number of documents to
> >>> the solr index. I've looked through SolrJ, DIH and others -- is the
> >>> bottom line across all of them to "batch updates" and not commit as
> >>> long as possible?
> >>>
> >>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com> wrote:
> >>>
> >>>> Oh, there are about a zillion reasons ;).
> >>>>
> >>>> First of all, most tools that show heap usage also count uncollected
> >>>> garbage, so your 10G could actually be much less "live" data. A
> >>>> quick way to test is to attach jconsole to the running Solr and hit
> >>>> the button that forces a full GC.
> >>>>
> >>>> Another way is to reduce your heap when you start Solr (on a test
> >>>> system, of course) until bad stuff happens. If you reduce it to very
> >>>> close to what Solr needs, you'll get slower as more and more cycles
> >>>> are spent on GC; if you reduce it a little more, you'll get OOMs.
> >>>>
> >>>> You can take heap dumps, of course, to see where all the memory is
> >>>> being used, but that's tricky as it also includes garbage.
> >>>>
> >>>> I've seen cache sizes (filterCache in particular) be something that
> >>>> uses lots of memory, but that requires queries to be fired. Each
> >>>> filterCache entry can take up to roughly maxDoc/8 bytes + overhead...
> >>>>
> >>>> A classic error is to sort, group or facet on a docValues=false
> >>>> field. Starting with Solr 7.6, you can add an option to fields to
> >>>> throw an error if you do this, see:
> >>>> https://issues.apache.org/jira/browse/SOLR-12962
> >>>>
> >>>> In short, there's not enough information to tell until you dive in
> >>>> and test a bunch of stuff.
> >>>>
> >>>> Best,
> >>>> Erick
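To put that maxDoc/8 figure in perspective, a quick back-of-the-envelope sketch -- the maxDoc and cache-size numbers below are assumptions, not from this thread:

    public class FilterCacheEstimate {
        public static void main(String[] args) {
            // Each filterCache entry can be a bitset over all documents:
            // one bit per doc, i.e. maxDoc/8 bytes, plus overhead.
            long maxDoc = 100_000_000L;        // assumed 100M-doc core
            long bytesPerEntry = maxDoc / 8;   // 12.5 MB per cached filter
            int cacheSize = 512;               // assumed filterCache size
            long worstCaseBytes = bytesPerEntry * cacheSize;
            System.out.printf("worst case: ~%.1f GB of heap%n",
                    worstCaseBytes / 1_000_000_000.0);  // ~6.4 GB
        }
    }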
> >>>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com> wrote:
> >>>>>
> >>>>> This makes sense. Any ideas why lucene/solr will use 10g heap for a
> >>>>> 20g index? My hypothesis was that merging segments was trying to
> >>>>> read it all, but if that's not the case I am out of ideas. The one
> >>>>> caveat is we are trying to add the documents quickly (~1g an hour),
> >>>>> but if lucene writes 100m segments and does streaming merges, it
> >>>>> shouldn't matter?
> >>>>>
> >>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org> wrote:
> >>>>>
> >>>>>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> 2. Merging segments - does solr load the entire segment in memory
> >>>>>>> or chunks of it? if the latter, how large are these chunks?
> >>>>>>
> >>>>>> No, it does not read the entire segment into memory.
> >>>>>>
> >>>>>> A fundamental part of the Lucene design is streaming posting lists
> >>>>>> into memory and processing them sequentially. The same amount of
> >>>>>> memory is needed for small or large segments. Each posting list is
> >>>>>> in document-id order. The merge is a merge of sorted lists, writing
> >>>>>> a new posting list in document-id order.
> >>>>>>
> >>>>>> wunder
> >>>>>> Walter Underwood
> >>>>>> wun...@wunderwood.org
> >>>>>> http://observer.wunderwood.org/ (my blog)
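A toy sketch of the sorted-list merge Walter describes -- not Lucene's actual code, and it ignores details like doc-id remapping, but it shows why the memory needed doesn't grow with segment size: only the heads of the two lists are ever examined.

    import java.util.Arrays;

    public class PostingMerge {
        // Merge two doc-id-sorted posting lists into one sorted list.
        // Beyond the output itself this needs O(1) memory no matter how
        // long the inputs are; Lucene streams the inputs from disk.
        static int[] merge(int[] a, int[] b) {
            int[] out = new int[a.length + b.length];
            int i = 0, j = 0, k = 0;
            while (i < a.length && j < b.length) {
                out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
            }
            while (i < a.length) out[k++] = a[i++];
            while (j < b.length) out[k++] = b[j++];
            return out;
        }

        public static void main(String[] args) {
            int[] seg1 = {2, 5, 9};
            int[] seg2 = {1, 5, 7, 12};
            // Prints [1, 2, 5, 5, 7, 9, 12]
            System.out.println(Arrays.toString(merge(seg1, seg2)));
        }
    }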