Hi all,
we are using CPF to post-process a set of documents that we load via 
content-pump into a 3-node cluster (version 6.0-2). 
While loading, we see an uneven load on one of the servers (it hangs 
every now and then, while the other 2 seem to be waiting for more work), and so 
far we have not managed to work out what is wrong.

In short, this is the process we follow:
- the ETL creates the XML files (around 40 million docs).
- content-pump pushes the documents into MarkLogic (10 threads with 100 
documents per transaction).
- a CPF pipeline adds some collections to the uploaded documents.

These are the steps of the CPF pipeline:
- Creating or updating a document sets the document status to 
"unprocessed". The status is stored in a document property.
- Every 2 minutes, a scheduled task picks up a batch of 50k documents and 
changes their state to processing (we spawn 50 batches of 1k documents, so 
the work is split over 50 transactions).
* We opted for a scheduled task instead of relying solely on CPF, because 
the servers were choking on the high volume.
- The state change triggers CPF (on-state-change event), and the document 
receives its collections after a query. 
- Once the collections are set, the status is changed to done.
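
To make the batching numbers concrete, here is a minimal sketch of the arithmetic behind the scheduled task. It is written in Python purely for illustration (the real task runs inside MarkLogic), and the function names and placeholder URIs are hypothetical. It assumes the task handles exactly 50k documents per 2-minute run with no backlog or retries.

```python
from typing import Iterator, List, Sequence


def split_into_batches(uris: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size URIs (one slice per transaction)."""
    for start in range(0, len(uris), batch_size):
        yield list(uris[start:start + batch_size])


def hours_to_drain(total_docs: int, docs_per_run: int = 50_000, minutes_per_run: int = 2) -> float:
    """Lower bound on the time needed to move every document through the task."""
    runs = -(-total_docs // docs_per_run)  # ceiling division
    return runs * minutes_per_run / 60


# Hypothetical URIs standing in for one 50k pick-up of unprocessed documents.
pickup = [f"/docs/{i}.xml" for i in range(50_000)]
batches = list(split_into_batches(pickup))
print(len(batches), len(batches[0]))          # 50 1000
print(round(hours_to_drain(40_000_000), 1))   # 26.7
```

Under those assumptions, the 40 million documents need at least 800 runs, so roughly 27 hours, before CPF has seen all of them.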

We verified that the 3 nodes have the same configuration by comparing the 
following files:

- assignments.xml 
- clusters.xml 
- databases.xml 
- groups.xml 
- hosts.xml 
- server.xml (it has 2 obvious differences: the host-id and the ssl private key)

The only difference between the 3 machines is the memory. These are the specs:
- CPU: 2x X5650, 6 cores each, 12 cores in total
- MEM: 48 GB (64 GB on the third one)
- DISK: 6x 600 GB 15K in RAID 10 config

Attached to this email are 6 pictures that show the problem we are facing:
- System load (system load, 5 minutes) for each of the 3 nodes
- CPU usage on a 100% scale, again for the 3 boxes

On the 3rd machine we see these warnings every time the CPU is being hogged 
(Error.log):

2013-02-01 00:02:01.327 Warning: Hung 65 sec
2013-02-01 00:03:19.243 Warning: Hung 54 sec
2013-02-01 00:04:00.802 Warning: Hung 41 sec
2013-02-01 00:06:40.061 Warning: Hung 130 sec

And some lost connections/timeouts on the other 2 machines of the cluster:

2013-02-01 00:01:08.567 Info: Disconnecting from domestic host 
ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded 
for 30 seconds.
2013-02-01 00:02:54.634 Info: Disconnecting from domestic host 
ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded 
for 30 seconds.
2013-02-01 00:03:50.673 Info: Disconnecting from domestic host 
ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded 
for 30 seconds.
2013-02-01 00:05:01.473 Info: Disconnecting from domestic host 
ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded 
for 30 seconds.
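
One way to check whether the disconnects line up with the hangs is to turn each "Hung N sec" warning into a time window and test each disconnect timestamp against those windows. The sketch below is Python for illustration only and rests on two assumptions that are not confirmed anywhere in the logs: that the warning is written at the end of the hang (so the hang started N seconds earlier), and that the clocks of the 3 hosts are in sync. The function names are hypothetical, and the wrapped disconnect lines are shortened to their first part.

```python
import re
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
HUNG_RE = re.compile(r"^(\S+ \S+) Warning: Hung (\d+) sec")
DISC_RE = re.compile(r"^(\S+ \S+) Info: Disconnecting from domestic host")


def hang_windows(lines):
    """Return (start, end) per warning, ASSUMING the warning is logged when the hang ends."""
    windows = []
    for line in lines:
        m = HUNG_RE.match(line)
        if m:
            end = datetime.strptime(m.group(1), TS_FORMAT)
            windows.append((end - timedelta(seconds=int(m.group(2))), end))
    return windows


def disconnect_times(lines):
    """Return the timestamp of every disconnect message."""
    return [datetime.strptime(m.group(1), TS_FORMAT)
            for m in map(DISC_RE.match, lines) if m]


def disconnects_during_hangs(hung_lines, disconnect_lines):
    """For each disconnect, report whether it falls inside any hang window."""
    windows = hang_windows(hung_lines)
    return [any(start <= t <= end for start, end in windows)
            for t in disconnect_times(disconnect_lines)]


hung = [
    "2013-02-01 00:02:01.327 Warning: Hung 65 sec",
    "2013-02-01 00:03:19.243 Warning: Hung 54 sec",
    "2013-02-01 00:04:00.802 Warning: Hung 41 sec",
    "2013-02-01 00:06:40.061 Warning: Hung 130 sec",
]
disconnects = [
    "2013-02-01 00:01:08.567 Info: Disconnecting from domestic host ml-c1-u3.swets.nl",
    "2013-02-01 00:02:54.634 Info: Disconnecting from domestic host ml-c1-u3.swets.nl",
    "2013-02-01 00:03:50.673 Info: Disconnecting from domestic host ml-c1-u3.swets.nl",
    "2013-02-01 00:05:01.473 Info: Disconnecting from domestic host ml-c1-u3.swets.nl",
]
print(disconnects_during_hangs(hung, disconnects))  # [True, True, True, True]
```

With the excerpts above, and under the stated assumptions, all 4 disconnects fall inside a hang window on the 3rd machine.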


Could you please provide advice?

Miguel Rodríguez
Lead Developer 
E mrgonza...@nl.swets.com
I www.swets.com
 
