Hi all,

We are using CPF to post-process a set of documents that we load via content-pump into a 3-node cluster (version 6.0-2). While loading, we see an uneven load on one of the servers: it hangs every now and then, while the other 2 seem to be waiting for more work. So far we have not managed to work out what is wrong.

In short, this is the process we follow:

- The ETL creates the XML files (around 40 million docs).
- content-pump pushes the documents into MarkLogic (10 threads with 100 documents per transaction).
- A CPF pipeline adds some collections to the uploaded documents.

These are the steps of the CPF pipeline:

- Creating or updating a document sets its status to "unprocessed". The status is saved in a document property.
- Every 2 minutes, a scheduled task picks up a batch of 50k documents and changes their state to "processing" (here we spawn 50 batches of 1k documents to get 50 transactions).
  * We opted for a scheduled task instead of relying solely on CPF because the servers were choking on high volume.
- The state change triggers CPF (on-state-change event), and the document receives its collections after a query.
- Once the collections are set, the status is changed to "done".

We verified that the 3 nodes have the same configuration by checking the following files:

- assignments.xml
- clusters.xml
- databases.xml
- groups.xml
- hosts.xml
- server.xml (it has 2 obvious differences: the host-id and the SSL private key)

The only difference between the 3 machines is the memory.
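
To make the batching in the scheduled task concrete, here is a minimal sketch of the arithmetic only: 50k pending documents split into 50 batches of 1k, one batch per transaction. It is written in Python purely for illustration and is not the code that runs in MarkLogic; the function name `split_into_batches` and the `/docs/N.xml` URIs are hypothetical.

```python
from typing import Iterator, List


def split_into_batches(uris: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size URIs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(uris), batch_size):
        yield uris[start:start + batch_size]


# Hypothetical URIs standing in for documents whose status is "unprocessed".
pending = [f"/docs/{i}.xml" for i in range(50_000)]

# One scheduled run: 50k documents become 50 batches of 1k each.
batches = list(split_into_batches(pending, 1_000))
print(len(batches), len(batches[0]))  # 50 1000
```

Each batch would then be handed to its own transaction, so a single run of the task never updates more than 1k documents at once.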

These are the hardware specs:

- CPU: 2x X5650 (6 cores each), 12 cores in total
- MEM: 48 GB (64 GB on the third one)
- DISK: 6x 600 GB 15K in RAID 10 config

Attached to this email are 6 pictures that clearly show the problem we are facing:

- System load (5-minute load average) for each of the 3 nodes
- CPU usage on a 100% scale, again for the 3 boxes

On the 3rd machine we see these warnings every time the CPU is being hogged (Error.log):

2013-02-01 00:02:01.327 Warning: Hung 65 sec
2013-02-01 00:03:19.243 Warning: Hung 54 sec
2013-02-01 00:04:00.802 Warning: Hung 41 sec
2013-02-01 00:06:40.061 Warning: Hung 130 sec

And on the other 2 machines of the cluster we see lost connections and timeouts:

2013-02-01 00:01:08.567 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
2013-02-01 00:02:54.634 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
2013-02-01 00:03:50.673 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
2013-02-01 00:05:01.473 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.

Could you please provide advice?

Miguel Rodríguez
Lead Developer
E mrgonza...@nl.swets.com
I www.swets.com
_______________________________________________
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general