>  I think I should have also done optimize between batches, no?

No, no, no, no. Absolutely not. Never. Never, never, never between batches.
I don’t recommend optimizing at _all_ unless there are demonstrable
improvements.

Please don’t take this the wrong way; the whole merge process is really
hard to get your head around. But the very fact that you’d suggest
optimizing between batches shows that the entire merge process is
opaque to you. I’ve seen many people just start changing things and
get themselves into a bad place, then try to change more things to get
out of that hole. Rinse. Repeat.

I _strongly_ recommend that you undo all your changes. Neither
commit nor optimize from outside Solr. Set your autocommit
settings to something like 5 minutes with openSearcher=true.
Set all autowarm counts in your caches in solrconfig.xml to 0,
especially filterCache and queryResultCache.

Do not set soft commit at all, leave it at -1.
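
For reference, a rough sketch of what those settings might look like in
solrconfig.xml (the values are just the example above: 5-minute hard commits
that open a searcher, soft commits disabled, autowarm at 0; keep whatever
cache class/size attributes you already have, the important part is
autowarmCount="0"):

    <!-- inside <updateHandler> -->
    <autoCommit>
      <maxTime>300000</maxTime>          <!-- 5 minutes -->
      <openSearcher>true</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>-1</maxTime>              <!-- soft commit disabled -->
    </autoSoftCommit>

    <!-- inside <query>; class and size shown are only placeholders -->
    <filterCache      class="solr.FastLRUCache" size="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache"     size="512" autowarmCount="0"/>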

Repeat: do _not_ commit or optimize from the client! Just let your
autocommit settings do the commits.

It’s also pushing things to send 5M docs in a single XML packet.
That whole packet has to be held in memory and then indexed, adding to the
pressure on the heap. I usually index from SolrJ in batches
of 1,000. See:
https://lucidworks.com/post/indexing-with-solrj/
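
If it helps, here's a minimal SolrJ sketch of that batching approach. The
URL, collection name, field names and loop are made-up placeholders, not
your actual setup; the point is sending batches of ~1,000 docs and never
committing from the client:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchIndexer {
      public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/mycollection").build()) {
          List<SolrInputDocument> batch = new ArrayList<>();
          for (int i = 0; i < 5_000_000; i++) {      // however you iterate your source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("suggestion", "example suggestion " + i);
            batch.add(doc);
            if (batch.size() == 1000) {              // send in batches of 1,000
              client.add(batch);                     // no commit here
              batch.clear();
            }
          }
          if (!batch.isEmpty()) {
            client.add(batch);                       // flush the final partial batch
          }
          // Deliberately no client.commit() or optimize();
          // the autoCommit settings in solrconfig.xml do the commits.
        }
      }
    }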

Simply put, your slowdown should not be happening. I strongly
believe that it’s something in your environment, most likely
1> your changes eventually shoot you in the foot OR
2> you are running in too little memory and eventually GC is killing you. 
Really, analyze your GC logs. OR
3> you are running on underpowered hardware which just can’t take the load OR
4> something else in your environment

I’ve never heard of a Solr installation with such a massive slowdown during
indexing that was fixed by tweaking things like the merge policy etc.

Best,
Erick


> On Dec 5, 2019, at 12:57 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> 
> Hey Erick,
> 
> This is a huge red flag to me: "(but I could only test for the first few
>> thousand documents”.
> 
> 
> Yup, that's probably where the culprit lies. I could only test for the
> starting batch because I had to wait for a day to actually compare. I
> tweaked the merge values and kept whatever gave a speed boost. My first
> batch of 5 million docs took only 40 minutes (atomic updates included) and
> the last batch of 5 million took more than 18 hours. If this is an issue of
> mergePolicy, I think I should have also done optimize between batches, no?
> I remember that when I indexed a single XML of 80 million docs into a core
> that had already been indexed with 30 XMLs of 5 million each and then
> optimized, the whole 80 million went in within a day.
> 
> 
> 
>> The indexing rate you’re seeing is abysmal unless these are _huge_
>> documents
> 
> 
> Documents only contain the suggestion name, possible titles,
> phonetics/spellcheck/synonym fields and numerical fields for boosting. They
> are far smaller than what a Search Document would contain. Auto-Suggest is
> only concerned about suggestions so you can guess how simple the documents
> would be.
> 
> 
> Some data is held on the heap and some in the OS RAM due to MMapDirectory
> 
> 
> I'm using StandardDirectory (which will make Solr choose the right
> implementation). Also, I'm planning to read more about these (I'm looking
> forward to using MMap). Thanks for the article!
> 
> 
> You're right. I should change one thing at a time. Let me experiment and
> then I will summarize here what I tried. Thank you for your responses. :)
> 
> On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:
> 
>> This is a huge red flag to me: "(but I could only test for the first few
>> thousand documents”
>> 
>> You’re probably right that that would speed things up, but pretty soon
>> when you’re indexing
>> your entire corpus there are lots of other considerations.
>> 
>> The indexing rate you’re seeing is abysmal unless these are _huge_
>> documents, but you
>> indicate that at the start you’re getting 1,400 docs/second so I don’t
>> think the complexity
>> of the docs is the issue here.
>> 
>> Do note that when we’re throwing RAM figures out, we need to draw a sharp
>> distinction
>> between Java heap and total RAM. Some data is held on the heap and some in
>> the OS
>> RAM due to MMapDirectory, see Uwe’s excellent article:
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> 
>> Uwe recommends about 25% of your available physical RAM be allocated to
>> Java as
>> a starting point. Your particular Solr installation may need a larger
>> percent, IDK.
>> 
>> But basically I’d go back to all default settings and change one thing at
>> a time.
>> First, I’d look at GC performance. Is it taking all your CPU? If so, you
>> probably need to
>> increase your heap. I pick this first because it’s very common that this
>> is a root cause.
>> 
>> Next, I’d put a profiler on it to see exactly where I’m spending time.
>> Otherwise you wind
>> up making random changes and hoping one of them works.
>> 
>> Best,
>> Erick
>> 
>>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com>
>> wrote:
>>> 
>>> (but I could only test for the first few
>>> thousand documents
>> 
>> 
> 
> -- 
> Regards,
> 
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
> 
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
> 
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
> 
