I doubt GC alone would make nearly that difference. More likely
it’s I/O interacting with MMapDirectory. Lucene uses OS memory
space for much of its index, i.e. the RAM left over after the
running Solr process (and any other processes, of course) has
taken its share. See:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

So if you don’t leave much OS memory space for Lucene’s use via
MMap, that can lead to swapping. My bet is that’s what was
happening and that your CPU utilization was low; Lucene, and thus
Solr, was spending all its time waiting around for I/O. If that
theory is true, your disk I/O should have been much higher before
you reduced your heap.
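
If you want to check that theory the next time it happens, something
like the following (standard Linux tools; adjust for your OS) will show
whether the box is swapping or waiting on disk:

# non-zero si/so columns under load mean the box is swapping
vmstat 5

# how much RAM is actually left over for the OS page cache
free -h

# %util pinned near 100 on the index’s disk means you’re I/O-bound
iostat -x 5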

IOW, I claim that if you left the Java heap at 12G and increased the
physical memory to 24G you’d see an identical (or nearly identical)
speedup. GC for a 12G heap is rarely a bottleneck. That said, you want
to use as little heap for your Java process as possible, but if you
reduce it too much you wind up with other problems: OOM for one, and
I’ve also seen GC take an inordinate amount of time when the heap is
_barely_ big enough to run. You hit a GC that recovers, say, 10M of heap,
which is only enough to continue for a few milliseconds before you hit
another GC… As you can tell, “this is more art than science”…
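
When you do settle on a heap size, the usual place to set it is solr.in.sh
rather than ad-hoc -Xmx flags. A minimal sketch (the 4g figure is just your
current experiment, not a recommendation):

# solr.in.sh — sets both -Xms and -Xmx for the Solr JVM
SOLR_HEAP="4g"

# equivalent, if you prefer to spell out the JVM flags yourself
# SOLR_JAVA_MEM="-Xms4g -Xmx4g"

Recent Solr versions also write GC logs under the logs directory by
default, so you can confirm rather than guess how much time is going to
collection.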

Glad to hear you’re making progress!
Erick

> On Dec 11, 2019, at 5:06 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> 
> Just to update, I kept the defaults. The indexing got only a little boost,
> though I have decided to continue with the defaults and do only incremental
> experiments. To my surprise, our development server had only 12GB RAM, of
> which 8G was allocated to Java. Because I could not increase the RAM, I
> tried decreasing the heap to 4G and guess what! My indexing speed got a
> boost of over *50x*. Erick, thanks for helping. I think I should also do
> more homework on GC. Your GC guess seems to be valid. I have raised a
> request to increase the RAM on the development server to 24GB.
> 
> On Mon, 9 Dec 2019 at 20:23, Erick Erickson <erickerick...@gmail.com> wrote:
> 
>> Note that that article is from 2011. That was in the Solr 3x days when
>> many, many, many things were different. There was no SolrCloud for
>> instance. Plus Tom’s problem space is indexing _books_. Whole, complete,
>> books. Which is, actually, not “normal” indexing at all, as most Solr
>> indexes hold much smaller documents. Books are a perfectly reasonable
>> use-case of course, but have a whole bunch of special requirements.
>> 
>> get-by-id should be very efficient, _except_ that the longer you spend
>> before opening a new searcher, the larger the internal data buffers
>> supporting get-by-id need to be.
>> 
>> Anyway, best of luck
>> Erick
>> 
>>> On Dec 9, 2019, at 1:05 AM, Paras Lehana <paras.leh...@indiamart.com>
>> wrote:
>>> 
>>> Hi Erick,
>>> 
>>> I have reverted to the original values and yes, I did see an improvement. I
>>> will collect more stats. *Thank you for helping. :)*
>>> 
>>> Also, here is the reference article that I had referred for changing
>>> values:
>>> 
>> https://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1
>>> 
>>> The article was perhaps about normal indexing and thus suggested
>>> increasing mergeFactor and then optimizing at the end. In my case, could a
>>> large number of segments have impacted the get-by-id of atomic updates?
>>> Just being curious.
>>> 
>>> On Fri, 6 Dec 2019 at 19:02, Paras Lehana <paras.leh...@indiamart.com>
>>> wrote:
>>> 
>>>> Hey Erick,
>>>> 
>>>> We have just upgraded to 8.3 before starting the indexing. We were on
>> 6.6
>>>> before that.
>>>> 
>>>> Thank you for your continued support and resources. Again, I have already
>>>> taken your suggestion to start afresh, and that's what I'm going to do.
>>>> Don't get me wrong, I have just been asking questions. I will surely get
>>>> back with my experience after performing the full indexing.
>>>> 
>>>> Thanks again! :)
>>>> 
>>>> On Fri, 6 Dec 2019 at 18:48, Erick Erickson <erickerick...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Nothing implicitly handles optimization; you must continue to do that
>>>>> externally.
>>>>> 
>>>>> Until you get to the bottom of your indexing slowdown, I wouldn’t bother
>>>>> with it at all. Trying to do all these things at once is what led to your
>>>>> problem in the first place, so please change one thing at a time. You say:
>>>>> 
>>>>> “For a full indexing, optimizations occurred 30 times between batches”.
>>>>> 
>>>>> This is horrible. I’m not sure what version of Solr you’re using. If it’s
>>>>> 7.4 or earlier, this means the entire index was rewritten 30 times.
>>>>> The first time it would condense all segments into a single segment, or
>>>>> 1/30 of the total. The second time it would rewrite all of that, now 2/30
>>>>> of the index, into a new segment. The third time 3/30. And so on.
>>>>> 
>>>>> If it’s Solr 7.5 or later, it wouldn’t be as bad, assuming your index was
>>>>> over 5G. But still.
>>>>> 
>>>>> See:
>>>>> 
>> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
>>>>> for 7.4 and earlier,
>>>>> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
>> for
>>>>> 7.5 and later
>>>>> 
>>>>> Eventually you can optimize by sending an HTTP request (e.g. via curl)
>>>>> like this:
>>>>> ../solr/collection/update?optimize=true
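>>>>> 
>>>>> For example, something along these lines (substitute your own host, port,
>>>>> and collection name):
>>>>> 
>>>>> curl 'http://localhost:8983/solr/collection/update?optimize=true'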
>>>>> 
>>>>> You also changed to using StandardDirectory. The default has heuristics
>>>>> built in
>>>>> to choose the best directory implementation.
>>>>> 
>>>>> I can’t emphasize enough that you’re changing lots of things at one time.
>>>>> I _strongly_ urge you to go back to the standard setup, make _no_
>>>>> modifications, and change things one at a time. Some very bright people
>>>>> have done a lot of work to try to make Lucene/Solr work well.
>>>>> 
>>>>> Make one change at a time. Measure. If that change isn’t helpful, undo it
>>>>> and move to the next one. You’re trying to second-guess the Lucene/Solr
>>>>> developers, who have years of understanding of how this all works. Assume
>>>>> they picked reasonable options for defaults and that Lucene/Solr performs
>>>>> reasonably well. When I get inexplicably poor results, I usually assume it
>>>>> was the last thing I changed….
>>>>> 
>>>>> Best,
>>>>> Erick
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Dec 6, 2019, at 1:31 AM, Paras Lehana <paras.leh...@indiamart.com>
>>>>> wrote:
>>>>>> 
>>>>>> Hi Erick,
>>>>>> 
>>>>>> I believed optimizing explicitly merges segments and that's why I was
>>>>>> expecting it to give a performance boost. I know that optimizations should
>>>>>> not be done very frequently. For a full indexing, optimizations occurred 30
>>>>>> times between batches. I take your suggestion to undo all the changes, and
>>>>>> that's what I'm going to do. I mentioned the optimizations giving an
>>>>>> indexing boost (for some time) only to support your point about my
>>>>>> mergePolicy backfiring. I will certainly read about the merge process again.
>>>>>> 
>>>>>> Taking your suggestions - so commits would be handled by autoCommit. What
>>>>>> implicitly handles optimizations? I think the merge policy, or is there
>>>>>> any other setting I'm missing?
>>>>>> 
>>>>>> I'm indexing via curl on the same server. The current speed of curl is
>>>>>> only 50k (down from 1300k in the first batch). I think the documents are
>>>>>> getting indexed as curl transmits the XML, because only then would the
>>>>>> speed be so low. I don't think the whole XML is taking up the memory - I
>>>>>> remember I had to change the curl options to get rid of the transmission
>>>>>> error for large files.
>>>>>> 
>>>>>> This is my curl request:
>>>>>> 
>>>>>> curl 'http://localhost:$port/solr/product/update?commit=true' -T
>>>>>> batch1.xml -X POST -H 'Content-type:text/xml'
>>>>>> 
>>>>>> Although we have been doing this for ages, I think I should now consider
>>>>>> using the Solr post tool (since the indexing files stay on the same
>>>>>> server) or using Solarium (we use PHP to make the XMLs).
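>>>>>> 
>>>>>> If it's the bin/post tool, a minimal sketch would be something like the
>>>>>> following (product and batch1.xml are just our collection and file; I
>>>>>> believe it targets localhost:8983 and issues a commit at the end by
>>>>>> default, which I'd have to revisit given your advice on commits):
>>>>>> 
>>>>>> # run from the Solr installation directory
>>>>>> bin/post -c product batch1.xml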
>>>>>> 
>>>>>> On Thu, 5 Dec 2019 at 20:00, Erick Erickson <erickerick...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>>> I think I should have also done optimize between batches, no?
>>>>>>> 
>>>>>>> No, no, no, no. Absolutely not. Never. Never, never, never between
>>>>> batches.
>>>>>>> I don’t  recommend optimizing at _all_ unless there are demonstrable
>>>>>>> improvements.
>>>>>>> 
>>>>>>> Please don’t take this the wrong way, but the whole merge process is
>>>>>>> really hard to get your head around. The very fact that you’d suggest
>>>>>>> optimizing between batches shows that the entire merge process is
>>>>>>> opaque to you. I’ve seen many people just start changing things and
>>>>>>> get themselves into a bad place, then try to change more things to get
>>>>>>> out of that hole. Rinse. Repeat.
>>>>>>> 
>>>>>>> I _strongly_ recommend that you undo all your changes. Neither
>>>>>>> commit nor optimize from outside Solr. Set your autocommit
>>>>>>> settings to something like 5 minutes with openSearcher=true.
>>>>>>> Set all autowarm counts in your caches in solrconfig.xml to 0,
>>>>>>> especially filterCache and queryResultCache.
>>>>>>> 
>>>>>>> Do not set soft commit at all, leave it at -1.
>>>>>>> 
>>>>>>> I repeat: do _not_ commit or optimize from the client! Just let your
>>>>>>> autocommit settings do the commits.
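>>>>>>> 
>>>>>>> In recent Solr versions you can also set those via the Config API rather
>>>>>>> than hand-editing solrconfig.xml. Roughly something like this (adjust
>>>>>>> host, port, and collection to your setup; 300000 is five minutes in
>>>>>>> milliseconds):
>>>>>>> 
>>>>>>> curl http://localhost:8983/solr/yourcollection/config \
>>>>>>>   -H 'Content-type:application/json' -d '{
>>>>>>>   "set-property": {
>>>>>>>     "updateHandler.autoCommit.maxTime": 300000,
>>>>>>>     "updateHandler.autoCommit.openSearcher": true,
>>>>>>>     "updateHandler.autoSoftCommit.maxTime": -1
>>>>>>>   }}'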
>>>>>>> 
>>>>>>> It’s also pushing things to send 5M docs in a single XML packet.
>>>>>>> That all has to be held in memory and then indexed, adding to
>>>>>>> pressure on the heap. I usually index from SolrJ in batches
>>>>>>> of 1,000. See:
>>>>>>> https://lucidworks.com/post/indexing-with-solrj/
>>>>>>> 
>>>>>>> Simply put, your slowdown should not be happening. I strongly
>>>>>>> believe that it’s something in your environment, most likely
>>>>>>> 1> your changes eventually shoot you in the foot OR
>>>>>>> 2> you are running in too little memory and eventually GC is killing
>>>>> you.
>>>>>>> Really, analyze your GC logs. OR
>>>>>>> 3> you are running on underpowered hardware which just can’t take the
>>>>> load
>>>>>>> OR
>>>>>>> 4> something else in your environment
>>>>>>> 
>>>>>>> I’ve never heard of a Solr installation with such a massive slowdown
>>>>> during
>>>>>>> indexing that was fixed by tweaking things like the merge policy etc.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Erick
>>>>>>> 
>>>>>>> 
>>>>>>>> On Dec 5, 2019, at 12:57 AM, Paras Lehana <
>> paras.leh...@indiamart.com
>>>>>> 
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hey Erick,
>>>>>>>> 
>>>>>>>> This is a huge red flag to me: "(but I could only test for the first
>>>>> few
>>>>>>>>> thousand documents”.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yup, that's probably where the culprit lies. I could only test for the
>>>>>>>> starting batch because I had to wait for a day to actually compare. I
>>>>>>>> tweaked the merge values and kept whatever gave a speed boost. My first
>>>>>>>> batch of 5 million docs took only 40 minutes (atomic updates included) and
>>>>>>>> the last batch of 5 million took more than 18 hours. If this is a
>>>>>>>> mergePolicy issue, I think I should have also done an optimize between
>>>>>>>> batches, no? I remember that when I indexed a single XML of 80 million
>>>>>>>> docs after optimizing the core already indexed with 30 XMLs of 5 million
>>>>>>>> each, I could post the 80 million in only a day.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> The indexing rate you’re seeing is abysmal unless these are _huge_
>>>>>>>>> documents
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Documents only contain the suggestion name, possible titles,
>>>>>>>> phonetics/spellcheck/synonym fields and numerical fields for
>> boosting.
>>>>>>> They
>>>>>>>> are far smaller than what a Search Document would contain.
>>>>> Auto-Suggest
>>>>>>> is
>>>>>>>> only concerned about suggestions so you can guess how simple the
>>>>>>> documents
>>>>>>>> would be.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Some data is held on the heap and some in the OS RAM due to
>>>>> MMapDirectory
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I'm using StandardDirectory (which will make Solr choose the right
>>>>>>>> implementation). Also, I'm planning to read more about these (and looking
>>>>>>>> forward to using MMap). Thanks for the article!
>>>>>>>> 
>>>>>>>> 
>>>>>>>> You're right. I should change one thing at a time. Let me experiment
>>>>> and
>>>>>>>> then I will summarize here what I tried. Thank you for your
>>>>> responses. :)
>>>>>>>> 
>>>>>>>> On Wed, 4 Dec 2019 at 20:31, Erick Erickson <
>> erickerick...@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> This is a huge red flag to me: "(but I could only test for the
>> first
>>>>> few
>>>>>>>>> thousand documents”
>>>>>>>>> 
>>>>>>>>> You’re probably right that that would speed things up, but pretty
>>>>> soon
>>>>>>>>> when you’re indexing
>>>>>>>>> your entire corpus there are lots of other considerations.
>>>>>>>>> 
>>>>>>>>> The indexing rate you’re seeing is abysmal unless these are _huge_
>>>>>>>>> documents, but you
>>>>>>>>> indicate that at the start you’re getting 1,400 docs/second so I
>>>>> don’t
>>>>>>>>> think the complexity
>>>>>>>>> of the docs is the issue here.
>>>>>>>>> 
>>>>>>>>> Do note that when we’re throwing RAM figures out, we need to draw a
>>>>>>> sharp
>>>>>>>>> distinction
>>>>>>>>> between Java heap and total RAM. Some data is held on the heap and
>>>>> some
>>>>>>> in
>>>>>>>>> the OS
>>>>>>>>> RAM due to MMapDirectory, see Uwe’s excellent article:
>>>>>>>>> 
>>>>>>> 
>>>>> 
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>>>>>> 
>>>>>>>>> Uwe recommends about 25% of your available physical RAM be
>> allocated
>>>>> to
>>>>>>>>> Java as
>>>>>>>>> a starting point. Your particular Solr installation may need a
>> larger
>>>>>>>>> percent, IDK.
>>>>>>>>> 
>>>>>>>>> But basically I’d go back to all default settings and change one
>>>>> thing
>>>>>>> at
>>>>>>>>> a time.
>>>>>>>>> First, I’d look at GC performance. Is it taking all your CPU? If so, you
>>>>>>>>> probably need to increase your heap. I pick this first because it’s very
>>>>>>>>> common that this is a root cause.
>>>>>>>>> 
>>>>>>>>> Next, I’d put a profiler on it to see exactly where I’m spending
>>>>> time.
>>>>>>>>> Otherwise you wind
>>>>>>>>> up making random changes and hoping one of them works.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Erick
>>>>>>>>> 
>>>>>>>>>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <
>>>>> paras.leh...@indiamart.com>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> (but I could only test for the first few
>>>>>>>>>> thousand documents
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> -- 
> -- 
> Regards,
> 
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
> 
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
> 
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
> 
