Hi Erick,

I believed that optimizing explicitly merges segments, which is why I was
expecting it to give a performance boost. I know that optimizations should
not be done very frequently - for a full indexing run, optimizations occurred
30 times between batches. I'll take your suggestion to undo all the changes,
and that's what I'm going to do. I mentioned the optimizations giving an
indexing boost (for some time) only to support your point about my mergePolicy
backfiring. I will certainly read about the merge process again.

Taking your suggestions - so, commits would be handled by autoCommit. What
implicitly handles optimizations? I assume it's the merge policy, or is there
another setting I'm missing?
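
Just so I'm sure I've read the autoCommit part right, this is roughly what I
plan to put in solrconfig.xml - a sketch based on your numbers (5-minute hard
commits with openSearcher=true, soft commits left disabled), so please correct
me if I've misunderstood anything:

<autoCommit>
  <maxTime>300000</maxTime>          <!-- 5 minutes -->
  <openSearcher>true</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>-1</maxTime>              <!-- soft commits left disabled -->
</autoSoftCommit>

plus autowarmCount="0" on filterCache and queryResultCache.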

I'm indexing via curl on the same server. The current speed reported by curl
is only 50k (down from 1300k for the first batch). I think the documents are
getting indexed while curl is still transmitting the XML - only then would the
speed be so low. I don't think the whole XML is taking up memory; I remember I
had to change the curl options to get rid of a transmission error for large
files.

This is my curl request:

curl 'http://localhost:$port/solr/product/update?commit=true' -T batch1.xml -X POST -H 'Content-type:text/xml'
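
If I stop committing from the client as you suggest, I assume the request
would simply drop the commit parameter and look something like this, with
autoCommit taking care of the commits:

curl 'http://localhost:$port/solr/product/update' -T batch1.xml -X POST -H 'Content-type:text/xml'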

Although we have been doing this for ages, I think I should now consider using
Solr's post tool (since the files to index stay on the same server) or
Solarium (we use PHP to build the XMLs).
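
If I switch to the post tool, I believe the call would be something along
these lines (using -c for the core and -p for the port - that's my assumption,
I still have to verify the exact flags):

bin/post -c product -p $port batch1.xml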

On Thu, 5 Dec 2019 at 20:00, Erick Erickson <erickerick...@gmail.com> wrote:

> >  I think I should have also done optimize between batches, no?
>
> No, no, no, no. Absolutely not. Never. Never, never, never between batches.
> I don’t  recommend optimizing at _all_ unless there are demonstrable
> improvements.
>
> Please don’t take this the wrong way, the whole merge process is really
> hard to get your head around. But the very fact that you’d suggest
> optimizing between batches shows that the entire merge process is
> opaque to you. I’ve seen many people just start changing things and
> get themselves into a bad place, then try to change more things to get
> out of that hole. Rinse. Repeat.
>
> I _strongly_ recommend that you undo all your changes. Neither
> commit nor optimize from outside Solr. Set your autocommit
> settings to something like 5 minutes with openSearcher=true.
> Set all autowarm counts in your caches in solrconfig.xml to 0,
> especially filterCache and queryResultCache.
>
> Do not set soft commit at all, leave it at -1.
>
> Repeat do _not_ commit or optimize from the client! Just let your
> autocommit settings do the commits.
>
> It’s also pushing things to send 5M docs in a single XML packet.
> That all has to be held in memory and then indexed, adding to
> pressure on the heap. I usually index from SolrJ in batches
> of 1,000. See:
> https://lucidworks.com/post/indexing-with-solrj/
>
> Simply put, your slowdown should not be happening. I strongly
> believe that it’s something in your environment, most likely
> 1> your changes eventually shoot you in the foot OR
> 2> you are running in too little memory and eventually GC is killing you.
> Really, analyze your GC logs. OR
> 3> you are running on underpowered hardware which just can’t take the load
> OR
> 4> something else in your environment
>
> I’ve never heard of a Solr installation with such a massive slowdown during
> indexing that was fixed by tweaking things like the merge policy etc.
>
> Best,
> Erick
>
>
> > On Dec 5, 2019, at 12:57 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >
> > Hey Erick,
> >
> > This is a huge red flag to me: "(but I could only test for the first few
> >> thousand documents”.
> >
> >
> > Yup, that's probably where the culprit lies. I could only test for the
> > starting batch because I had to wait for a day to actually compare. I
> > tweaked the merge values and kept whatever gave a speed boost. My first
> > batch of 5 million docs took only 40 minutes (atomic updates included) and
> > the last batch of 5 million took more than 18 hours. If this is an issue of
> > mergePolicy, I think I should have also done optimize between batches, no?
> > I remember, when I indexed a single XML of 80 million after optimizing the
> > core already indexed with 30 XMLs of 5 million each, I could post 80
> > million in a day only.
> >
> >
> >
> >> The indexing rate you’re seeing is abysmal unless these are _huge_
> >> documents
> >
> >
> > Documents only contain the suggestion name, possible titles,
> > phonetics/spellcheck/synonym fields and numerical fields for boosting. They
> > are far smaller than what a Search Document would contain. Auto-Suggest is
> > only concerned about suggestions so you can guess how simple the documents
> > would be.
> >
> >
> > Some data is held on the heap and some in the OS RAM due to MMapDirectory
> >
> >
> > I'm using StandardDirectory (which will make Solr choose the right
> > implementation). Also, planning to read more about these (looking forward
> > to use MMap). Thanks for the article!
> >
> >
> > You're right. I should change one thing at a time. Let me experiment and
> > then I will summarize here what I tried. Thank you for your responses. :)
> >
> > On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> This is a huge red flag to me: "(but I could only test for the first few
> >> thousand documents”
> >>
> >> You’re probably right that that would speed things up, but pretty soon
> >> when you’re indexing
> >> your entire corpus there are lots of other considerations.
> >>
> >> The indexing rate you’re seeing is abysmal unless these are _huge_
> >> documents, but you
> >> indicate that at the start you’re getting 1,400 docs/second so I don’t
> >> think the complexity
> >> of the docs is the issue here.
> >>
> >> Do note that when we’re throwing RAM figures out, we need to draw a sharp
> >> distinction between Java heap and total RAM. Some data is held on the heap
> >> and some in the OS RAM due to MMapDirectory, see Uwe’s excellent article:
> >> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >>
> >> Uwe recommends about 25% of your available physical RAM be allocated to
> >> Java as
> >> a starting point. Your particular Solr installation may need a larger
> >> percent, IDK.
> >>
> >> But basically I’d go back to all default settings and change one thing at
> >> a time.
> >> First, I’d look at GC performance. Is it taking all your CPU? In which
> >> case you probably need to
> >> increase your heap. I pick this first because it’s very common that this
> >> is a root cause.
> >>
> >> Next, I’d put a profiler on it to see exactly where I’m spending time.
> >> Otherwise you wind
> >> up making random changes and hoping one of them works.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >>>
> >>> (but I could only test for the first few
> >>> thousand documents
> >>
> >>
> >
> > --
> > --
> > Regards,
> >
> > *Paras Lehana* [65871]
> > Development Engineer, Auto-Suggest,
> > IndiaMART Intermesh Ltd.
> >
> > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > Noida, UP, IN - 201303
> >
> > Mob.: +91-9560911996
> > Work: 01203916600 | Extn:  *8173*
> >
>
>

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

