What would be the best way to understand where heap is being used?

On Tue, Jun 4, 2019 at 9:31 PM Greg Harris <harrisgre...@gmail.com> wrote:
> Just a couple of points I'd make here. I did some testing a while back in
> which, if no commit is made (hard or soft), there are internal memory
> structures holding tlogs, and it will continue to get worse the more docs
> that come in. I don't know if that's changed in later versions. I'd
> recommend doing commits with some frequency in indexing-heavy apps,
> otherwise you are likely to have heap issues. I'd personally advocate for
> some of the points already made: there are too many variables going on
> here, and too many ways to modify things, to make sizing decisions and
> think you're doing anything other than a pure guess if you don't test and
> monitor. I'd advocate for a process in which testing is done regularly to
> figure out questions like number of shards/replicas, heap size, memory,
> etc. Hard data, a good process and regular testing will trump guesswork
> every time.
>
> Greg
>
> On Tue, Jun 4, 2019 at 9:22 AM John Davis <johndavis925...@gmail.com> wrote:
>
>> You might want to test with a soft commit of hours vs. 5m for heavy
>> indexing + light query -- even though there is internal memory-structure
>> overhead when there are no soft commits, in our testing a 5m soft commit
>> (via commitWithin) resulted in very, very large heap usage, which I
>> suspect is because of other overhead associated with it.
>>
>> On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> I need to update that; I didn't understand the bits about retaining
>>> internal memory structures at the time.
>>>
>>>> On Jun 4, 2019, at 2:10 AM, John Davis <johndavis925...@gmail.com> wrote:
>>>>
>>>> Erick - These conflict, what's changed?
>>>>
>>>> So if I were going to recommend settings, they'd be something like this:
>>>> Do a hard commit with openSearcher=false every 60 seconds.
>>>> Do a soft commit every 5 minutes.
>>>> vs
>>>>
>>>> Index-heavy, Query-light
>>>> Set your soft commit interval quite long, up to the maximum latency you
>>>> can stand for documents to be visible. This could be just a couple of
>>>> minutes or much longer. Maybe even hours, with the capability of issuing
>>>> a hard commit (openSearcher=true) or soft commit on demand.
>>>>
>>>> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>>>
>>>> On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>>> > I've looked through SolrJ, DIH and others -- is the bottom line
>>>>> > across all of them to "batch updates" and not commit as long as
>>>>> > possible?
>>>>>
>>>>> Of course it's more complicated than that ;)....
>>>>>
>>>>> But to start, yes, I urge you to batch. Here are some stats:
>>>>> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>>>>>
>>>>> Note that at about 100 docs/batch you hit diminishing returns.
>>>>> _However_, that test was run on a single-shard collection, so if you
>>>>> have 10 shards you'd have to send 1,000 docs/batch. I wouldn't sweat
>>>>> that number much, just don't send one at a time. And there are the
>>>>> usual gotchas if your documents are 1M vs. 1K.
>>>>>
>>>>> About committing: no, don't hold off as long as possible. When you
>>>>> commit, segments are merged. _However_, the default 100M internal
>>>>> buffer size means that segments are written anyway once you have 100M
>>>>> of index data, even if you don't hit a commit point, and merges happen
>>>>> anyway. So you won't save anything on merging by holding off commits,
>>>>> and you'll incur penalties.
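The "batch, don't send one at a time" advice boils down to amortizing per-request overhead. A minimal sketch of the idea (the `batches` helper below is hypothetical, not part of SolrJ; with SolrJ you would pass each sub-list to `SolrClient.add`):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {
    // Hypothetical helper: split a document list into fixed-size batches
    // so each update request carries ~100 docs instead of one.
    static <T> List<List<T>> batches(List<T> docs, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            out.add(new ArrayList<>(docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 250; i++) docs.add(i);
        List<List<Integer>> b = batches(docs, 100);
        System.out.println(b.size());          // prints 3 (batches of 100, 100, 50)
        System.out.println(b.get(2).size());   // prints 50
    }
}
```

Per the thread, ~100 docs/batch per shard is where returns diminish, so the right batch size scales with shard count.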
>>>>> Here's more than you want to know about commits:
>>>>> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>>>>
>>>>> But some key take-aways... If for some reason Solr terminates
>>>>> abnormally, the documents accumulated since the last hard commit are
>>>>> replayed. So say you don't commit for an hour of furious indexing and
>>>>> someone does a "kill -9". When you restart Solr, it'll try to re-index
>>>>> all the docs from the last hour. Hard commits with openSearcher=false
>>>>> aren't all that expensive. I usually set mine for a minute and forget
>>>>> about it.
>>>>>
>>>>> Transaction logs hold a window, _not_ the entire set of operations
>>>>> since time began. When you do a hard commit, the current tlog is
>>>>> closed, a new one is opened, and ones that are "too old" are deleted.
>>>>> If you never commit, you have a huge transaction log to no good
>>>>> purpose.
>>>>>
>>>>> Also, while indexing, in order to accommodate "Real Time Get", all the
>>>>> docs indexed since the last searcher was opened have a pointer kept in
>>>>> memory. So if you _never_ open a new searcher, that internal structure
>>>>> can get quite large. So in bulk-indexing operations, I suggest you
>>>>> open a searcher every so often.
>>>>>
>>>>> Opening a new searcher isn't terribly expensive if you have no
>>>>> autowarming going on. Autowarming is defined in solrconfig.xml on
>>>>> filterCache, queryResultCache, etc.
>>>>>
>>>>> So if I were going to recommend settings, they'd be something like this:
>>>>> Do a hard commit with openSearcher=false every 60 seconds.
>>>>> Do a soft commit every 5 minutes.
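The recommended settings above translate directly into solrconfig.xml; a sketch using the intervals from the thread (adjust the soft-commit interval to your visibility-latency tolerance):

```xml
<indexConfig>
  <!-- Default: segments are flushed once ~100MB of index data
       accumulates, commit or no commit. -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 60 seconds without opening a new searcher:
       flushes segments and rolls the tlog, bounding replay on restart. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit every 5 minutes: opens a new searcher so recently
       indexed docs become visible and the RTG pointer list is released. -->
  <autoSoftCommit>
    <maxTime>300000</maxTime>
  </autoSoftCommit>
</updateHandler>
```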
>>>>> I'd actually be surprised if you were able to measure differences
>>>>> between those settings and just a hard commit with openSearcher=true
>>>>> every 60 seconds and a soft commit at -1 (never)...
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>>> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com> wrote:
>>>>>>
>>>>>> If we assume there is no query load, then effectively this boils down
>>>>>> to the most effective way of adding a large number of documents to
>>>>>> the solr index. I've looked through SolrJ, DIH and others -- is the
>>>>>> bottom line across all of them to "batch updates" and not commit as
>>>>>> long as possible?
>>>>>>
>>>>>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>>> Oh, there are about a zillion reasons ;).
>>>>>>>
>>>>>>> First of all, most tools that show heap usage also count uncollected
>>>>>>> garbage, so your 10G could actually be much less "live" data. A
>>>>>>> quick way to test is to attach jconsole to the running Solr and hit
>>>>>>> the button that forces a full GC.
>>>>>>>
>>>>>>> Another way is to reduce your heap when you start Solr (on a test
>>>>>>> system, of course) until bad stuff happens. If you reduce it to very
>>>>>>> close to what Solr needs, you'll get slower as more and more cycles
>>>>>>> are spent on GC; if you reduce it a little more, you'll get OOMs.
>>>>>>>
>>>>>>> You can take heap dumps, of course, to see where all the memory is
>>>>>>> being used, but that's tricky as it also includes garbage.
>>>>>>>
>>>>>>> I've seen cache sizes (filterCache in particular) be something that
>>>>>>> uses lots of memory, but that requires queries to be fired. Each
>>>>>>> filterCache entry can take up to roughly maxDoc/8 bytes + overhead....
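The maxDoc/8 figure comes from each filterCache entry being, in the worst case, a bitset with one bit per document in the core. A back-of-the-envelope calculation (the 100M-doc core and the cache size of 512 are illustrative; 512 is a common default in shipped solrconfig.xml files):

```java
public class FilterCacheEstimate {
    // Worst case, each filterCache entry is a bitset over all docs:
    // maxDoc / 8 bytes per entry, times the number of cached entries.
    static long filterCacheBytes(long maxDoc, int cacheSize) {
        return (maxDoc / 8) * cacheSize;
    }

    public static void main(String[] args) {
        long maxDoc = 100_000_000L; // hypothetical 100M-doc core
        int cacheSize = 512;        // common default filterCache size
        long bytes = filterCacheBytes(maxDoc, cacheSize);
        System.out.println(bytes / (1024 * 1024) + " MB"); // prints 6103 MB
    }
}
```

So a full filterCache on a large core can account for gigabytes of heap on its own, which is why it only shows up once queries are fired.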
>>>>>>> A classic error is to sort, group or facet on a docValues=false
>>>>>>> field. Starting with Solr 7.6, you can add an option to fields to
>>>>>>> throw an error if you do this; see
>>>>>>> https://issues.apache.org/jira/browse/SOLR-12962.
>>>>>>>
>>>>>>> In short, there's not enough information to tell until you dive in
>>>>>>> and test bunches of stuff.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> This makes sense. Any ideas why lucene/solr would use a 10g heap
>>>>>>>> for a 20g index? My hypothesis was that merging segments was trying
>>>>>>>> to read it all, but if that's not the case I am out of ideas. The
>>>>>>>> one caveat is that we are trying to add documents quickly (~1g an
>>>>>>>> hour), but if lucene does write 100m segments and does a streaming
>>>>>>>> merge, it shouldn't matter?
>>>>>>>>
>>>>>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>>>>
>>>>>>>>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> 2. Merging segments - does solr load the entire segment into
>>>>>>>>>> memory or chunks of it? If the latter, how large are these chunks?
>>>>>>>>>
>>>>>>>>> No, it does not read the entire segment into memory.
>>>>>>>>>
>>>>>>>>> A fundamental part of the Lucene design is streaming posting lists
>>>>>>>>> into memory and processing them sequentially. The same amount of
>>>>>>>>> memory is needed for small or large segments. Each posting list is
>>>>>>>>> in document-id order. The merge is a merge of sorted lists,
>>>>>>>>> writing a new posting list in document-id order.
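The streaming merge described above is why segment size doesn't drive merge memory: merging already-sorted lists needs only a cursor per input. A toy sketch of merging two posting lists in document-id order (illustrative only; real Lucene merges go through codecs and remap doc ids across segments, they do not operate on int arrays):

```java
import java.util.ArrayList;
import java.util.List;

public class PostingMergeSketch {
    // Merge two posting lists (sorted doc ids) into one sorted list,
    // reading each input sequentially; memory beyond the output is O(1).
    static List<Integer> merge(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            out.add(a[i] <= b[j] ? a[i++] : b[j++]);
        }
        while (i < a.length) out.add(a[i++]);
        while (j < b.length) out.add(b[j++]);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(new int[]{1, 4, 7}, new int[]{2, 4, 9}));
        // prints [1, 2, 4, 4, 7, 9]
    }
}
```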
>>>>>>>>> wunder
>>>>>>>>> Walter Underwood
>>>>>>>>> wun...@wunderwood.org
>>>>>>>>> http://observer.wunderwood.org/ (my blog)