We have also seen updates have a large impact on queries when filtering is 
involved, for instance with value queries. I’d start by closely examining the 
search process first.
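
A minimal sketch of one way to watch the search process from outside (Python; the 
host, port, credentials and query string are placeholders, and it assumes a REST 
API instance that returns the default search metrics):

    import time
    import requests
    from requests.auth import HTTPDigestAuth

    SEARCH_URL = "http://ml-host:8000/v1/search"            # placeholder REST instance
    AUTH = HTTPDigestAuth("rest-reader-user", "password")    # placeholder credentials

    while True:
        start = time.time()
        resp = requests.get(SEARCH_URL,
                            params={"q": "representative query", "format": "json"},
                            auth=AUTH)
        elapsed = time.time() - start
        resp.raise_for_status()
        # The search response includes a "metrics" section by default
        # (return-metrics), with the server-side resolution timings.
        metrics = resp.json().get("metrics", {})
        print(time.strftime("%H:%M:%S"), "wall-clock %.2fs" % elapsed, metrics)
        time.sleep(5)

Running this while an update job is and is not active should show whether the 
extra time goes into query resolution (filtering) or somewhere else.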

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of Andreas Hubmer 
<andreas.hub...@ebcont.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Friday, June 17, 2016 at 9:40 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Mysterious, Dramatic Query Slowdown on 
Multi-Node Cluster

Hi Ron,

By default, app servers run in contemporaneous mode, which means that "queries 
can block waiting for the contemporaneous transactions to fully commit". It 
might help to switch to nonblocking mode.
Details are described here: 
https://docs.marklogic.com/guide/app-dev/transactions#id_41639
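
A minimal sketch of flipping that setting through the Management REST API on 
port 8002 (the server name, group and credentials are placeholders, and the 
property key "multi-version-concurrency-control" should be verified against 
your own /manage/v2/servers/.../properties output first):

    import requests
    from requests.auth import HTTPDigestAuth

    PROPS_URL = "http://ml-host:8002/manage/v2/servers/MyAppServer/properties"
    AUTH = HTTPDigestAuth("admin", "password")
    PARAMS = {"group-id": "Default", "format": "json"}

    # Read the current properties to confirm the key name and current value.
    current = requests.get(PROPS_URL, params=PARAMS, auth=AUTH).json()
    print(current.get("multi-version-concurrency-control"))

    # Switch to nonblocking: queries then run at the latest fully-committed
    # timestamp instead of waiting for contemporaneous transactions.
    resp = requests.put(PROPS_URL, params=PARAMS, auth=AUTH,
                        json={"multi-version-concurrency-control": "nonblocking"})
    resp.raise_for_status()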

Cheers,
Andreas

2016-06-17 0:15 GMT+02:00 Danny Sokolsky <danny.sokol...@marklogic.com>:
And of course (I figure you are doing all of these things, but for posterity):

* set the log to debug and look for any warnings or errors (perhaps something 
there can correlate with a fixed bug)
* make sure the system is set up properly (swap space per MarkLogic 
recommendations, transparent huge pages disabled, etc)
* check i/o levels and i/o wait and look for hot spots
* check CPU levels and look for hot spots
* use the meters database and monitoring dashboards to look for any anomalies
* profile queries and look for hot spots
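
For the i/o, CPU and meters checks above, a small monitoring sketch (host, 
credentials and the output path are placeholders; the exact shape of the 
status payload should be checked against your own instance):

    import json
    import time
    import requests
    from requests.auth import HTTPDigestAuth

    STATUS_URL = "http://ml-host:8002/manage/v2/hosts"
    AUTH = HTTPDigestAuth("admin", "password")

    # Snapshot cluster-wide host status once a minute so anomalies can be
    # lined up against the times when queries slow down.
    while True:
        resp = requests.get(STATUS_URL,
                            params={"view": "status", "format": "json"},
                            auth=AUTH)
        resp.raise_for_status()
        stamp = time.strftime("%Y%m%dT%H%M%S")
        with open("host-status-%s.json" % stamp, "w") as out:
            json.dump(resp.json(), out, indent=2)
        time.sleep(60)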

-Danny


-----Original Message-----
From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Ron Hitchens
Sent: Thursday, June 16, 2016 2:52 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Mysterious, Dramatic Query Slowdown on 
Multi-Node Cluster


Hi Danny,

   We’re spinning up a cloud cluster right now to see if we can reproduce the 
problem outside Prod.  If so, we’ll upgrade the test cluster to the latest 
release to see if it fixes the problem.  We can’t experiment on Prod, but if we 
can show that upgrading solves the problem, then we can recommend a course of 
action.  We’d still have to certify that the app code works properly on the 
latest ML version, but it would give us a solution.

   I reviewed all the posted bug fixes since 8.0-3 but couldn’t find anything 
with a description that seemed relevant.

---
Ron Hitchens {r...@overstory.co.uk}  +44 7879 358212

> On Jun 16, 2016, at 9:33 PM, Danny Sokolsky <danny.sokol...@marklogic.com> wrote:
>
> Hi Ron,
>
> It is hard to say for sure, but there have been many bug fixes since 8.0-3.2 
> that can account for some or all of this.
>
> Do you have an environment where you can try out the latest (8.0-5.4)?
>
> -Danny
>
> -----Original Message-----
> From: general-boun...@developer.marklogic.com 
> [mailto:general-boun...@developer.marklogic.com] On Behalf Of Ron Hitchens
> Sent: Thursday, June 16, 2016 1:18 PM
> To: MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] Mysterious, Dramatic Query Slowdown on 
> Multi-Node Cluster
>
>
>   We’re seeing a very odd phenomenon on a client project here in the UK.  
> Queries (as in read-only, no updates) slow down dramatically (from 1.2 
> seconds to 30-40 seconds or longer) while “Jobs” are running that do 
> relatively light updates.  But only on multi-node clusters (3 nodes in this 
> case).
>
> Details:
> MarkLogic 8.0-3.2
> Production (pre-launch): a three-node cluster running on Linux in AWS
> JVM app nodes (also in AWS) that perform different tasks, talking to the same 
> ML cluster
>
> QA is a single E+D MarkLogic node in AWS
>
>   The operational scenario is this.
>
> o Prod cluster (3 nodes) has about 14+ million documents (articles and books).
> o Some number of “API app nodes” which present a REST API dedicated to queries
> o Some number of “worker bee” nodes that process batch jobs for ingestion and 
> content enrichment
>
>   The intention is that the worker bees handle the slow, lumpy work of 
> processing and validating content before ingesting it into ML.  There is a 
> job processing framework that is used on the worker bees to queue, throttle 
> and process jobs asynchronously.
>
>
>   The API nodes respond to queries from the web app front end and other 
> clients within the system to do searches, fetch documents, etc.  These, for 
> the most part, are pure queries that don’t do any updates.
>
>   The issue we’ve bumped up against is this: We have a worker bee job that 
> enriches content by taking each binary associated with a particular content 
> document (such as an article) and submitting it to a thumbnail service.  A 
> thread then polls the service until the results are ready.  Those results 
> (the URIs of the thumbnail images) are then written to the content document.
>
>   In the course of processing these jobs, this is what happens (several can 
> run at once, but we see this problem even with only one running):
>
>   o A job is pulled off the queue.  The queue is just a bunch of job XML 
> documents in ML.
>   o The job’s state is updated to running in its XML doc
>   o Code starts running in the JVM to process the job
>   o During execution, messages can be logged for the job, which results in a 
> node insert to the job XML doc
>   o The thumbnail job reads a list of binary references from the content doc
>   o For each one it issues a request to an external service, then starts a 
> polling thread to check for good completion
>      o There can be up to 10 of these polling threads going at once
>      o They are waiting most of the time, not talking to ML
>   o Messages can be logged to the job doc in the previous step, but the 
> content doc is not touched
>   o When the thumbnail result is ready, then the results are inserted into 
> the content doc in ML
>   o The job finishes up and updates the state of the job doc
>
>   There is some lock contention for the job doc from multiple threads logging 
> messages, but it’s not normally significant.  We see the deadlocks logged by 
> ML at debug level; they seem to resolve within a few milliseconds, as 
> expected, and the updates always complete quickly.
>
>   When the results come back and the content doc is updated, there can be 
> contention there as well.  Some jitter is introduced to prevent the pollers 
> from all waking up at once, but again this shouldn’t matter even if they do.
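>
>   A rough sketch of that polling-with-jitter pattern, purely for illustration 
> (the names, timings and the thumbnail-service call are placeholders, not the 
> actual job code):
>
>     import random
>     import time
>
>     POLL_BASE_SECONDS = 4.0    # roughly the 4-5 second poll interval
>     POLL_JITTER_SECONDS = 1.0  # spread pollers out so they don't wake together
>
>     def poll_for_thumbnail(check_ready, on_ready, timeout_seconds=600):
>         """Poll the external service until a result is ready, with jitter."""
>         deadline = time.time() + timeout_seconds
>         while time.time() < deadline:
>             time.sleep(POLL_BASE_SECONDS + random.uniform(0, POLL_JITTER_SECONDS))
>             result = check_ready()   # ask the thumbnail service for the result
>             if result is not None:
>                 on_ready(result)     # write the thumbnail URIs to the content doc
>                 return result
>         raise TimeoutError("thumbnail result not ready in time")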
>
>   The odd phenomenon is that while one of these jobs is running (spending 
> most of its time waiting, about 4-5 seconds between polls) on one of the 
> worker bee nodes, a query sent from one of the API JVM nodes will take many 
> tens of seconds to complete.  Once the job has finished, query times 
> will return to normal (a few milliseconds to 1-2 seconds depending on the 
> specifics of the query).
>
>   So the mystery is this: why would a pure query apparently block for a long 
> time in this scenario?  Queries should run lock-free, so even if there is 
> lock contention happening with the thumbnail job, queries should not be held 
> up.  MarkLogic is not busy at all; nothing else is going on.
>
>   This doesn’t happen on a single node, which makes me suspect something to 
> do with cross-node lock propagation.  But like I said, logging doesn’t 
> indicate any sort of pathological lock storm or anything like that.
>
>   If someone can give me some assurance that the latest ML release will solve 
> this problem I’d be happy to recommend that to the client.  But I’ve reviewed 
> all the documented bug fixes since 8.0-3 and nothing seems relevant.
>
>   This is a rather urgent problem since all this thumbnail processing must 
> be completed soon without making the rest of the system unusable.
>
>   Thanks in advance.
>
> ---
> Ron Hitchens {r...@overstory.co.uk}  +44 7879 358212
>





--
Andreas Hubmer
Senior IT Consultant

EBCONT enterprise technologies GmbH
Millennium Tower
Handelskai 94-96
A-1200 Vienna

Web: http://www.ebcont.com

OUR TEAM IS YOUR SUCCESS

_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
