This is an interesting use case. Most of the external distributed-processing tools (like mlcp) are designed around initial ingestion. That is a case which, if carefully distributed, can scale much better than a simplistic approach. However, even in ingestion there are only very specific cases that benefit from this over much simpler concepts like batching and multithreading. Even for ingestion it's not always desirable to 'over think' the server rather than letting it manage document distribution across forests itself.
But bulk processing of the nature you're describing (adding an element to every document) may not benefit much from distributing the task load across servers. The user-mode (XQuery/JS) CPU and memory overhead may be very low compared to the IO. It's conceivable (but I don't know if it's implemented) that something like the following actually executes on the 'd-node' containing the document. https://docs.marklogic.com/xdmp:node-insert-after

xdmp:node-insert-after(doc("/example.xml")/a/b, <c>ccc</c>);

If it did, then it wouldn't make much difference at all where it was executed; a few threads at once doing this (via CoRB or manually) should be able to load the system to capacity efficiently. If it pulls the document to the calling node, there's additional overhead if it's remote, but often not as much as people think. The latency between hosts on a good network can be smaller than the latency to disk. And once you are IO bound the battle is over: you can send data back and forth like a tennis match and it won't matter. Then re-indexing and merging are going to kick in, and the 'minor' work of inserting a node will not be the main contributing factor in the total time.

Also, the task queue is shared by all nodes in the group, and since xdmp:spawn and xdmp:spawn-function make use of the task queue, I believe (need to check) that they will also make use of all hosts. So either way, as long as you get some parallelism, the load should spread fairly well and get close to the ideal maximum, provided you use the lowest-level document update calls possible and take care not to keep transactions and locks open as a side effect of searching for the documents. It's worth trying a simple approach first before attempting to optimize. A basic 'split into parallel batches' should then get about as close to the theoretical maximum as possible.
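To make the 'split into parallel batches' idea concrete, here is a rough XQuery sketch. It is only an illustration, not a tested recipe: the collection name, batch size, and the /a/b target element are assumptions, and you would adapt the cts:uris() query to however you select your documents.

```
(: Sketch: collect the URIs up front in one read-only query, split them
   into fixed-size batches, and spawn one task per batch. Each spawned
   task is its own transaction, so no locks from the URI search are held
   across the updates. Collection name, batch size, and the /a/b target
   are illustrative assumptions. :)
let $uris := cts:uris((), (), cts:collection-query("docs-to-update"))
let $batch-size := 100
for $i in 1 to xs:integer(math:ceil(fn:count($uris) div $batch-size))
let $batch := fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)
return
  xdmp:spawn-function(function() {
    for $uri in $batch
    return xdmp:node-insert-after(fn:doc($uri)/a/b, <c>ccc</c>)
  })
```

Writing the pre-calculated URI list to the database or filesystem first, as described below, is a variation on the same shape: the outer query only produces the batches, and only the spawned tasks touch the documents.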
If you don't use an existing tool, I find it's both easier overall, and easier to avoid an unexpected locking problem, to pre-calculate the URIs of all the documents, split that list, and store it (in the DB or on the filesystem). Then launch your batches, giving each one its list of URIs. Very well written tools can do better than this, but it is trickier than it seems to iterate over the URIs and create batches all at once without running into some kind of locking or scheduling problem.

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
d...@marklogic.com
Phone: +1 812-482-5224
Cell: +1 812-630-7622
www.marklogic.com

From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Geert Josten
Sent: Thursday, July 30, 2015 4:49 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Distributing Tasks

Hi Andreas,

Interesting slides, good find!

If you are talking about more ad hoc processing, you could look into things like https://github.com/mblakele/taskbot and https://github.com/marklogic/corb2. These are tools that can batch up the work very well. They won't spread load across a cluster automatically though. You could, however, try to split the load somehow and run multiple instances in parallel, each against a different host. That works best if you are targeting the host that actually holds the data you want to touch, but that is difficult. MLCP does it with its -fastload option. Would the MLCP copy feature with a transform perhaps work?

MarkLogic also provides Hadoop integration, so maybe that is also worth looking at?
Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of Andreas Hubmer <andreas.hub...@ebcont.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Thursday, July 30, 2015 at 8:56 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Distributing Tasks

Hi Geert,

Thanks for the update. Triggers and the CPF aren't exactly what I'm looking for. What I want to do is distribute one-time tasks, like adding new elements to all existing documents.

I've found some slides <http://developer.marklogic.com/media/mlw12/Distributed-Content-Processing-in-MarkLogic.pdf> from an ML consultant on "Distributed Content Processing in MarkLogic", but the code builds on ML 4.

I'll probably create a lightweight library myself, either using one-time scheduled tasks or an HTTP server for distributing the tasks.

Regards,
Andreas

2015-07-29 17:56 GMT+02:00 Geert Josten <geert.jos...@marklogic.com>:

Hi Andreas,

I haven't heard about anything in this direction recently, but FWIW I added a +1 to the RFE.

Could post-commit triggers, or CPF, help out in some way? From what I have heard, they should run on the host that holds the forest that holds the document at hand.
Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of Andreas Hubmer <andreas.hub...@ebcont.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, July 28, 2015 at 5:20 PM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: [MarkLogic Dev General] Distributing Tasks

Hello,

In this Knowledgebase article <https://help.marklogic.com/knowledgebase/article/View/112/0/techniques-for-dividing-tasks-between-hosts-in-a-cluster> there is talk of an RFE (2763) that would make it possible to pass options into xdmp:spawn() to allow the execution of code on a specific host in a cluster. Are there still any plans for this feature?

Thanks and cheers,
Andreas

--
Andreas Hubmer
IT Consultant

_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general

--
Andreas Hubmer
IT Consultant
EBCONT enterprise technologies GmbH
Millennium Tower
Handelskai 94-96
A-1200 Vienna
Mobile: +43 664 60651861
Fax: +43 2772 512 69-9
Email: andreas.hub...@ebcont.com
Web: http://www.ebcont.com
OUR TEAM IS YOUR SUCCESS
UID-Nr. ATU68135644 HG St.Pölten - FN 399978