Hi. Yes, Geert, thank you for translating my vague answer.
If you have 5000 docs, chunk them into 10 sets of 500 and pass those URIs to
xdmp:spawn-function or xdmp:spawn, where you have the logic to iterate over
them individually. That's the rough idea. If you try a sample like that,
you'll have a bit of understanding of what taskbot does for you.

One other note: I often make these batches a bit more controllable. I love
serialized maps, so if I want accounting information or some other way to
control the processing, I'll store the URIs for each batch in a map of maps,
and then I can also write accounting info as I go.

Kind Regards,
David Ennis
Content Engineer

On 16 January 2015 at 09:22, Geert Josten <geert.jos...@marklogic.com> wrote:

> Hi Alexei,
>
> Naively spawning for each document doesn’t work well indeed. To improve
> that slightly, you should make batches of roughly 100 docs, and spawn
> processing of batches. I think that is what David meant.
>
> To take it up even one more level, you could have your spawning query
> spawn only a limited number of batches (10 maybe), and then spawn itself to
> do the remainder. All spawns end up on the task queue, where they are
> processed in parallel already. The creation of tasks on the queue would be
> paced down when spawning the query that creates the batches, so only a very
> limited number of tasks would be on the queue on average. That would prevent
> overflow, and also leave room for other tasks, like evals from other
> processes and scheduled tasks.
>
> Batching is used by most tools, including Corb, MLCP, and I would expect
> also taskbot. I am less sure about the spawning of the batching.
> I know it is used internally by Information Studio, but that is designed
> for loading of content, not for processing content already in the database.
>
> Kind regards,
> Geert
>
> From: Alexei Betin <abe...@elevate.com>
> Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: Friday, January 16, 2015 at 1:52 AM
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Subject: Re: [MarkLogic Dev General] Bulk content processing in MarkLogic
>
> Thanks, Paul and David!
>
> Using simply xdmp:spawn-function() did not help much since I ran into the
> “maximum tasks” limit when naively spawning a function for each document.
>
> But CoRB worked beautifully and it’s definitely parallelizing the job just
> the way I wanted.
>
> I am also going to take a look at Taskbot, which seems really useful.
>
> Thanks again!
>
> Alexei Betin
> Principal Architect; Big Data
>
> From: general-boun...@developer.marklogic.com On Behalf Of Paul Hoehne
> Sent: Thursday, January 15, 2015 10:50 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Bulk content processing in MarkLogic
>
> There’s also the CORB facility.
> I would try Taskbot, but if you’re not familiar with it, I would also try
> to do a version of it using xdmp:spawn-function. xdmp:spawn-function is a
> sometimes overlooked but extremely useful function.
>
> Paul Hoehne
> Senior Consultant
> MarkLogic Corporation
>
> From: David Ennis <david.en...@hinttech.com>
> Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: Thursday, January 15, 2015 at 1:43 PM
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Subject: Re: [MarkLogic Dev General] Bulk content processing in MarkLogic
>
> Hi.
>
> I usually spawn these types of things in batches.
>
> Also, there is a nice utility by Michael Blakeley out there to help
> manage this and make good use of the resources of your particular setup,
> including a nice sample to start with:
>
> https://github.com/mblakele/taskbot
>
> It uses pretty much the same functions you would likely use on your own,
> but organized nicely in a reusable/configurable way.
>
> Kind Regards,
> David Ennis
>
> On 15 January 2015 at 19:33, Alexei Betin <abe...@elevate.com> wrote:
>
> Hello,
>
> I stumbled upon what seems to be a straightforward task of making a bulk
> modification of XML documents in MarkLogic (such as adding a new element to
> every document in the collection). I’ve looked at CPF first, but it seems
> like it only supports event-based processing (triggers) and does not have
> any facility for batch processing.
>
> So I just wrote a simple XQuery as follows:
>
>   for $x in collection()/A
>   return (
>     xdmp:node-delete( $x/test ),
>     xdmp:node-insert-child( $x, <test>test</test> )
>   )
>
> but it runs out of memory on a large collection: “Expanded tree cache
> full”. So it looks like the above query is trying to fetch all documents
> into memory first, then iterate over them.
>
> Whereas what I want is to perform the work on smaller chunks of data that
> fit into memory and, ideally, do several such chunks in parallel (think
> “map” without “reduce”).
>
> Is there another approach? I am reading about CoRB, which seems to be just
> the thing I need, but I wonder if I am missing another potential solution
> here.
>
> Also, while the CoRB description mentions that it can run updates on disk
> (not in memory), it does not mention parallelization, which eventually will
> be quite important for my use case.
>
> Thanks,
>
> Alexei Betin
> Principal Architect; Big Data
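
A minimal sketch of the batching idea from the top of this message: split the URIs into fixed-size batches and spawn one task per batch. It assumes the URI lexicon is enabled so that cts:uris can be used. The collection name and batch size are hypothetical placeholders, and the per-document update is simply the one from Alexei's original query. It is only an illustration of the approach, not what taskbot or CoRB do internally.

```xquery
xquery version "1.0-ml";

(: Hypothetical values for illustration only; not taken from the thread. :)
declare variable $COLLECTION as xs:string := "my-collection";
declare variable $BATCH-SIZE as xs:integer := 500;

(: Assumption: the URI lexicon is enabled, so URIs can be listed
   without loading the documents themselves into memory. :)
let $uris := cts:uris((), (), cts:collection-query($COLLECTION))
let $count := fn:count($uris)
(: Number of batches, rounded up so a partial last batch is included. :)
let $batches := ($count + $BATCH-SIZE - 1) idiv $BATCH-SIZE
for $i in 0 to $batches - 1
let $batch := fn:subsequence($uris, $i * $BATCH-SIZE + 1, $BATCH-SIZE)
return
  (: One task per batch; each task updates its documents in its own
     transaction on the task server. :)
  xdmp:spawn-function(
    function() {
      for $uri in $batch
      let $root := fn:doc($uri)/A
      where fn:exists($root)
      return (
        xdmp:node-delete($root/test),
        xdmp:node-insert-child($root, <test>test</test>)
      )
    },
    <options xmlns="xdmp:eval">
      <transaction-mode>update-auto-commit</transaction-mode>
    </options>
  )
```

With 5000 documents and a batch size of 500, this queues 10 tasks instead of 5000, which avoids the "maximum tasks" limit that spawning one function per document ran into.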
_______________________________________________
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general