Which OS is this?

The "Hung..." messages mean that the OS was not letting MarkLogic do anything 
for N seconds. Sometimes that means memory is stressed by either swapping or 
fragmentation. Sometimes it means disk I/O capacity is overloaded. Hardware 
problems are also a possibility. Look into these areas.
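
As a rough first check for swapping, and assuming the hosts run Linux (I don't know the OS yet, so this is only an assumption), something like the sketch below reads the SwapTotal and SwapFree fields from /proc/meminfo. The function name is mine, not anything from MarkLogic.

```python
def swap_usage(meminfo_text):
    """Return (swap_total_kb, swap_used_kb) from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            total_kb, used_kb = swap_usage(f.read())
        print(f"swap used: {used_kb} kB of {total_kb} kB")
    except FileNotFoundError:
        # Assumption check failed: no /proc/meminfo on this host.
        print("/proc/meminfo not found; this host is probably not Linux")
```

A non-zero and growing "swap used" figure on the host that logs the "Hung..." messages would point toward memory pressure rather than disk or hardware.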

If you don't have file-log-level=debug already, set that. It's in the group 
settings in the admin UI. You may see some interesting new information.

The "Hung..." messages fit nicely into the erratic load. If the database on one 
host is blocked by the OS, the other two hosts will have to wait until it 
comes back before advancing timestamps. So any updates will have to wait for 
that host to come back. Queries that need results from forests on the blocked 
host will have to wait, too.
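
To see how well the two sets of log lines line up, here is a minimal sketch that treats each "Hung N sec" warning as the end of an N-second stall and checks whether each disconnect on the other hosts falls inside one of those windows. That reading of the timestamp is my assumption, and clock skew between hosts is ignored. The log lines are the ones from the quoted message, with the disconnect messages shortened.

```python
import re
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
HUNG_RE = re.compile("^" + TS + r" Warning: Hung (\d+) sec")
DISCONNECT_RE = re.compile("^" + TS + r" Info: Disconnecting from domestic host")

HUNG_LINES = [
    "2013-02-01 00:02:01.327 Warning: Hung 65 sec",
    "2013-02-01 00:03:19.243 Warning: Hung 54 sec",
    "2013-02-01 00:04:00.802 Warning: Hung 41 sec",
    "2013-02-01 00:06:40.061 Warning: Hung 130 sec",
]
DISCONNECT_LINES = [
    "2013-02-01 00:01:08.567 Info: Disconnecting from domestic host ml-c1-u3.swets.nl",
    "2013-02-01 00:02:54.634 Info: Disconnecting from domestic host ml-c1-u3.swets.nl",
    "2013-02-01 00:03:50.673 Info: Disconnecting from domestic host ml-c1-u3.swets.nl",
    "2013-02-01 00:05:01.473 Info: Disconnecting from domestic host ml-c1-u3.swets.nl",
]


def hang_windows(lines):
    """Return (start, end) pairs, assuming the warning is logged when the stall ends."""
    windows = []
    for line in lines:
        m = HUNG_RE.match(line)
        if m:
            end = datetime.strptime(m.group(1), TS_FORMAT)
            windows.append((end - timedelta(seconds=int(m.group(2))), end))
    return windows


def disconnect_times(lines):
    """Return the timestamps of the disconnect messages."""
    return [datetime.strptime(m.group(1), TS_FORMAT)
            for m in map(DISCONNECT_RE.match, lines) if m]


def disconnects_inside_hangs(hung_lines, disconnect_lines):
    """For each disconnect, report whether it falls inside any hang window."""
    windows = hang_windows(hung_lines)
    return [any(start <= t <= end for start, end in windows)
            for t in disconnect_times(disconnect_lines)]


print(disconnects_inside_hangs(HUNG_LINES, DISCONNECT_LINES))
# prints [True, True, True, True]
```

Under that assumption, all four disconnects land inside a stall on the third host, which is what you would expect if the other two hosts are simply waiting for it.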

You don't have to worry about the config files differing from host to host 
within a cluster. The cluster takes care of that.

The CPF setup sounds odd to me. Normally you'd let CPF manage the state, and 
wouldn't need that scheduled task. I don't see how the scheduled task would 
reduce load, at least not over the long haul. Maybe that's the idea? You're 
trying to maintain insert performance and then run CPF in less busy times?
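
For a sense of scale, here is the back-of-the-envelope arithmetic using the figures from your message (around 40 million documents, 50k documents per run, one run every 2 minutes). It assumes every run picks up a full batch and finishes within its 2-minute slot, so it is a lower bound, not a measurement.

```python
def min_drain_minutes(total_docs, docs_per_run, minutes_between_runs):
    """Lower bound on the time for a fixed-rate scheduled task to touch every document."""
    runs = -(-total_docs // docs_per_run)  # integer ceiling division
    return runs * minutes_between_runs


minutes = min_drain_minutes(40_000_000, 50_000, 2)
print(minutes, "minutes, about", round(minutes / 60, 1), "hours")
# prints: 1600 minutes, about 26.7 hours
```

So the scheduled task caps CPF at 25k documents per minute and spreads the same total work over more than a day; it smooths the load but does not remove any of it.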

-- Mike

On 1 Feb 2013, at 05:28 , Miguel Rodríguez González <[email protected]> 
wrote:

> Hi all,
> we are using CPF for post-processing a set of documents that we load via 
> content-pump into a 3 node cluster (version 6.0-2). 
> When we do, we experience an uneven load on one of the servers (it hangs 
> every now and then, while the other 2 seem to be waiting for more work), and 
> so far we have not managed to get a grip on what could be wrong.
> 
> In short, this is the process we follow:
> - the ETL creates the xml files (around 40 million docs).
> - content-pump pushes the documents into MarkLogic (10 threads with 100 
> documents per transaction).
> - a CPF pipeline adds some collections to the uploaded documents.
> 
> These are the steps of the CPF pipeline:
> - Creation or update of a document changes the document status to 
> "unprocessed". This is saved in a document property.
> - A scheduled task picks up batches of 50k documents and changes the state to 
> processing every 2 minutes (here we spawn 50 batches of 1k documents to have 
> 50 transactions).
> * we opted for using a scheduled task instead of relying solely on CPF, 
> because the servers were choking on high volume.
> - The state change triggers CPF (on-state-change event) and the document 
> receives its collections after a query. 
> - Once the collections are set, the status is changed to done.
> 
> We did verify that the 3 nodes have the same configuration. To do so, we 
> checked the following files:
> 
> - assignments.xml 
> - clusters.xml 
> - databases.xml 
> - groups.xml 
> - hosts.xml 
> - server.xml (it has 2 obvious differences: the host-id and the ssl private 
> key)
> 
> The only difference between the 3 of them is the memory. These are the specs:
> - CPU: 2x X5650, 6 cores, in total 12 cores
> - MEM: 48 GB ( 64 GB on the third one)
> - DISK: 6x 600 GB 15K in RAID 10 config
> 
> Attached to this email there are 6 pictures, which clearly show the problem 
> we are facing:
> - System load (system load, 5 minutes) for each of the 3 nodes
> - CPU usage on a 100% scale, again for the 3 boxes
> 
> On the 3rd machine we see these warnings every time the CPU is being hogged 
> (Error.log):
> 
> 2013-02-01 00:02:01.327 Warning: Hung 65 sec
> 2013-02-01 00:03:19.243 Warning: Hung 54 sec
> 2013-02-01 00:04:00.802 Warning: Hung 41 sec
> 2013-02-01 00:06:40.061 Warning: Hung 130 sec
> 
> And some lost connections/timeouts on the other 2 machines of the cluster:
> 
> 2013-02-01 00:01:08.567 Info: Disconnecting from domestic host 
> ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not 
> responded for 30 seconds.
> 2013-02-01 00:02:54.634 Info: Disconnecting from domestic host 
> ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not 
> responded for 30 seconds.
> 2013-02-01 00:03:50.673 Info: Disconnecting from domestic host 
> ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not 
> responded for 30 seconds.
> 2013-02-01 00:05:01.473 Info: Disconnecting from domestic host 
> ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not 
> responded for 30 seconds.
> 
> 
> Could you please provide advice?
> 
> Miguel Rodríguez
> Lead Developer 
> E [email protected]
> I www.swets.com
>  
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
> 