You might want to test with a soft commit interval of hours vs. 5 minutes for
heavy indexing + light query. Even though there is internal memory-structure
overhead when you never soft commit, in our testing a 5-minute soft commit (via
commitWithin) resulted in very large heap usage, which I suspect is because of
other overhead associated with it.
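
By "via commitWithin" I mean passing the visibility window on the add call
itself. A rough SolrJ sketch (the 5-minute value and the surrounding indexing
code are illustrative, not our exact setup):

  // Assumes a SolrClient pointed at the target collection and the usual
  // SolrJ imports (SolrClient, SolrInputDocument).
  void addBatch(SolrClient client, List<SolrInputDocument> docs) throws Exception {
    // commitWithin = 5 minutes; Solr satisfies this with a soft commit by default.
    client.add(docs, 5 * 60 * 1000);
    // The "hours" variant we compared against: client.add(docs) with no
    // commitWithin, relying on autoCommit (openSearcher=false) plus a hard or
    // soft commit issued on demand.
  }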

On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> I need to update that; I didn’t understand the bits about retaining internal
> memory structures at the time.
>
> > On Jun 4, 2019, at 2:10 AM, John Davis <johndavis925...@gmail.com>
> wrote:
> >
> > Erick - These conflict, what's changed?
> >
> > So if I were going to recommend settings, they’d be something like this:
> > Do a hard commit with openSearcher=false every 60 seconds.
> > Do a soft commit every 5 minutes.
> >
> > vs
> >
> > Index-heavy, Query-light
> > Set your soft commit interval quite long, up to the maximum latency you
> > can stand for documents to be visible. This could be just a couple of
> > minutes or much longer. Maybe even hours with the capability of issuing a
> > hard commit (openSearcher=true) or soft commit on demand.
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> >
> >
> >
> > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >>> I've looked through SolrJ, DIH and others -- is the bottomline
> >>> across all of them to "batch updates" and not commit as long as
> possible?
> >>
> >> Of course it’s more complicated than that ;)….
> >>
> >> But to start, yes, I urge you to batch. Here’s some stats:
> >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> >>
> >> Note that at about 100 docs/batch you hit diminishing returns. _However_,
> >> that test was run on a single-shard collection, so if you have 10 shards
> >> you’d have to send 1,000 docs/batch. I wouldn’t sweat that number much,
> >> just don’t send one at a time. And there are the usual gotchas if your
> >> documents are 1M vs. 1K.
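
A minimal SolrJ sketch of the "batch, don't send one at a time" advice (the
URL, field names, batch size and record source below are made up purely for
illustration):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
    public static void main(String[] args) throws Exception {
      try (SolrClient client = new HttpSolrClient.Builder(
          "http://localhost:8983/solr/mycollection").build()) {
        List<SolrInputDocument> batch = new ArrayList<>();
        for (String[] rec : fetchRecords()) {   // stand-in for your real data source
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rec[0]);
          doc.addField("body_txt", rec[1]);
          batch.add(doc);
          if (batch.size() >= 1000) {           // ~100 docs/shard on a 10-shard collection
            client.add(batch);                  // one update request, no explicit commit
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          client.add(batch);                    // flush the tail
        }
        // Visibility is left to autoCommit/autoSoftCommit, or one commit at the end.
      }
    }

    static List<String[]> fetchRecords() {
      return new ArrayList<>();                 // replace with your real record source
    }
  }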
> >>
> >> About committing. No, don’t hold off as long as possible. When you commit,
> >> segments are merged. _However_, the default 100M internal buffer size means
> >> that once you have 100M of index data, segments are written anyway even if
> >> you don’t hit a commit point, and merges happen anyway. So you won’t save
> >> anything on merging by holding off commits.
> >> And you’ll incur penalties. Here’s more than you want to know about
> >> commits:
> >>
> >>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
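
As a side note, the 100M internal buffer mentioned above is ramBufferSizeMB in
solrconfig.xml (100 is the shipped default); shown only for reference:

  <ramBufferSizeMB>100</ramBufferSizeMB>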
> >>
> >> But some key take-aways… If for some reason Solr abnormally
> >> terminates, the accumulated documents since the last hard
> >> commit are replayed. So say you don’t commit for an hour of
> >> furious indexing and someone does a “kill -9”. When you restart
> >> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> >> with openSearcher=false aren’t all that expensive. I usually set mine
> >> for a minute and forget about it.
> >>
> >> Transaction logs hold a window, _not_ the entire set of operations
> >> since time began. When you do a hard commit, the current tlog is
> >> closed and a new one opened and ones that are “too old” are deleted. If
> >> you never commit you have a huge transaction log to no good purpose.
> >>
> >> Also, while indexing, in order to accommodate “Real Time Get”, all
> >> the docs indexed since the last searcher was opened have a pointer
> >> kept in memory. So if you _never_ open a new searcher, that internal
> >> structure can get quite large. So in bulk-indexing operations, I
> >> suggest you open a searcher every so often.
> >>
> >> Opening a new searcher isn’t terribly expensive if you have no autowarming
> >> going on. Autowarming is defined in solrconfig.xml on the filterCache,
> >> queryResultCache, etc.
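
Concretely, autowarming is the autowarmCount attribute on each cache in
solrconfig.xml; a sketch with the stock cache classes and made-up sizes:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>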
> >>
> >> So if I were going to recommend settings, they’d be something like this:
> >> Do a hard commit with openSearcher=false every 60 seconds.
> >> Do a soft commit every 5 minutes.
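
In solrconfig.xml those settings would look roughly like this (times are in
milliseconds; the values are just the ones from this thread):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60 seconds -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>           <!-- soft commit every 5 minutes -->
  </autoSoftCommit>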
> >>
> >> I’d actually be surprised if you were able to measure differences between
> >> those settings and just hard commit with openSearcher=true every 60
> >> seconds and soft commit at -1 (never)…
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com>
> >> wrote:
> >>>
> >>> If we assume there is no query load then effectively this boils down to
> >>> most effective way for adding a large number of documents to the solr
> >>> index. I've looked through SolrJ, DIH and others -- is the bottomline
> >>> across all of them to "batch updates" and not commit as long as
> possible?
> >>>
> >>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com
> >
> >>> wrote:
> >>>
> >>>> Oh, there are about a zillion reasons ;).
> >>>>
> >>>> First of all, most tools that show heap usage also count uncollected
> >>>> garbage. So your 10G could actually be much less “live” data. A quick
> >>>> way to test is to attach jconsole to the running Solr and hit the button
> >>>> that forces a full GC.
> >>>>
> >>>> Another way is to reduce your heap when you start Solr (on a test
> >>>> system, of course) until bad stuff happens. If you reduce it to very
> >>>> close to what Solr needs, you’ll get slower as more and more cycles are
> >>>> spent on GC; if you reduce it a little more, you’ll get OOMs.
> >>>>
> >>>> You can take heap dumps of course to see where all the memory is being
> >>>> used, but that’s tricky as it also includes garbage.
> >>>>
> >>>> I’ve seen cache sizes (the filterCache in particular) use lots of
> >>>> memory, but that requires queries to be fired. Each filterCache entry
> >>>> can take up to roughly maxDoc/8 bytes + overhead….
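
(For a sense of scale, that formula means an index with maxDoc around 80
million costs roughly 10 MB per filterCache entry before overhead.)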
> >>>>
> >>>> A classic error is to sort, group or facet on a docValues=false field.
> >>>> Starting with Solr 7.6, you can add an option to fields to throw an
> >>>> error if you do this, see:
> >>>> https://issues.apache.org/jira/browse/SOLR-12962.
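
If I'm reading that JIRA right, the option it added is the uninvertible field
property (Solr 7.6+); a hypothetical schema entry (field name invented) that
fails fast instead of silently uninverting onto the heap:

  <field name="category_s" type="string" indexed="true" stored="true"
         docValues="false" uninvertible="false"/>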
> >>>>
> >>>> In short, there’s not enough information to tell until you dive in and
> >>>> test a bunch of stuff.
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>>
> >>>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> This makes sense. Any idea why Lucene/Solr would use a 10g heap for a
> >>>>> 20g index? My hypothesis was that merging segments was trying to read
> >>>>> it all, but if that's not the case I am out of ideas. The one caveat is
> >>>>> we are trying to add the documents quickly (~1g an hour), but if Lucene
> >>>>> does write 100m segments and does a streaming merge, it shouldn't
> >>>>> matter?
> >>>>>
> >>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <
> wun...@wunderwood.org
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>>> On May 31, 2019, at 11:27 PM, John Davis <
> johndavis925...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> 2. Merging segments - does Solr load the entire segment in memory or
> >>>>>>> chunks of it? If the latter, how large are these chunks?
> >>>>>>
> >>>>>> No, it does not read the entire segment into memory.
> >>>>>>
> >>>>>> A fundamental part of the Lucene design is streaming posting lists
> >>>>>> into memory and processing them sequentially. The same amount of
> >>>>>> memory is needed for small or large segments. Each posting list is in
> >>>>>> document-id order. The merge is a merge of sorted lists, writing a new
> >>>>>> posting list in document-id order.
> >>>>>>
> >>>>>> wunder
> >>>>>> Walter Underwood
> >>>>>> wun...@wunderwood.org
> >>>>>> http://observer.wunderwood.org/  (my blog)
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>
