Re: [MarkLogic Dev General] Bulk content processing in MarkLogic

David Ennis Fri, 16 Jan 2015 00:39:31 -0800

Hi.

Yes, Geert, thank you for translating my vague answer.


If you have 5000 docs, chunk them into 10 sets of 500 and pass each set of
URIs to xdmp:spawn-function or xdmp:spawn, where you have the logic to
iterate over those documents individually.  That's the rough idea.

If you can try a sample like that, then you'll have a bit of understanding
of what taskbot does for you.
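
A minimal sketch of that idea, not a definitive implementation. The
collection name "my-docs" and the batch size are placeholders I made up, the
per-document update is borrowed from the query in the original question, and
cts:uris assumes the URI lexicon is enabled on the database:

```xquery
xquery version "1.0-ml";

(: Hypothetical settings: adjust the collection name and batch size. :)
let $batch-size := 500
(: cts:uris reads URIs from the URI lexicon without loading documents. :)
let $uris := cts:uris((), (), cts:collection-query("my-docs"))
let $batch-count := xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
for $i in 1 to $batch-count
let $batch := fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)
return
  (: One task per batch on the task queue, not one task per document. :)
  xdmp:spawn-function(
    function() {
      for $uri in $batch
      let $root := fn:doc($uri)/A
      where fn:exists($root)
      return (
        (: Delete any existing test element, then add the new one. :)
        for $old in $root/test return xdmp:node-delete($old),
        xdmp:node-insert-child($root, <test>test</test>)
      )
    },
    <options xmlns="xdmp:eval">
      <transaction-mode>update-auto-commit</transaction-mode>
    </options>)
```

With 5000 documents this puts 10 tasks on the queue, and each task only
touches its own 500 documents, so no single transaction has to hold the
whole collection.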

One other note:  I often make these batches a bit more controllable.  I
love serialized maps, so if I want accounting information or some other way
to control the processing, I'll store the URIs for each batch in a map of
maps, and then I can also write accounting info as I go.
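
Roughly like this, as a sketch only. The job URI "/jobs/my-job.xml", the
"my-docs" collection, the "uris" and "status" keys, and the <job> wrapper
element are all names invented for illustration:

```xquery
xquery version "1.0-ml";

(: Hypothetical layout: the outer map is keyed by batch number; each inner
   map holds that batch's URIs plus a status field for accounting. :)
let $batch-size := 500
let $uris := cts:uris((), (), cts:collection-query("my-docs"))
let $batches := map:map()
let $_ :=
  for $i in 1 to xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
  let $batch := map:map()
  return (
    map:put($batch, "uris",
      fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)),
    map:put($batch, "status", "pending"),
    map:put($batches, xs:string($i), $batch)
  )
(: A map placed inside an element constructor is serialized as XML,
   so the whole job description can be stored as a document. :)
return
  xdmp:document-insert("/jobs/my-job.xml", <job>{ $batches }</job>)
```

A task can later rebuild the map from the stored document's map:map element
with map:map(), pick up its batch, and record progress by updating the
status entry before writing the job document back.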




Kind Regards,
David Ennis


*Content Engineer*

HintTech <http://www.hinttech.com/>
Mastering the value of content
creative | technology | content

Delftechpark 37i
2628 XJ Delft
The Netherlands
T: +31 88 268 25 00
M: +31 63 091 72 80


On 16 January 2015 at 09:22, Geert Josten <geert.jos...@marklogic.com>
wrote:

>  Hi Alexei,
>
>  Indeed, naively spawning a task for each document doesn’t work well. To
> improve on that, you should make batches of roughly 100 docs and spawn the
> processing of each batch. I think that is what David meant.
>
>  To take it one level further, you could have your spawning query spawn
> only a limited number of batches (10 maybe), and then spawn itself to do
> the remainder. All spawned tasks end up on the task queue, where they are
> already processed in parallel. Because the query that creates the batches
> is itself spawned, task creation is paced, so only a very limited number
> of tasks would be on the queue on average. That would prevent overflow,
> and also leave room for other tasks, like evals from other processes and
> scheduled tasks.
>
>  Batching is used by most tools, including Corb, MLCP, and I would expect
> also taskbot. I am less sure about spawning the batching query itself. I
> know it is used internally by Information Studio, but that is designed for
> loading content, not for processing content already in the database.
>
>  Kind regards,
> Geert
>
>   From: Alexei Betin <abe...@elevate.com>
> Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: Friday, January 16, 2015 at 1:52 AM
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Subject: Re: [MarkLogic Dev General] Bulk content processing in MarkLogic
>
>   Thanks, Paul and David!
>
>
>
> Simply using xdmp:spawn-function() did not help much, since I ran into the
> “maximum tasks” limit when naively spawning a function for each document.
>
>
>
> But CoRB worked beautifully and it’s definitely parallelizing the job just
> the way I wanted.
>
>
>
> I am also going to take a look at Taskbot which seems really useful.
>
>
>
> Thanks again!
>
>
>
> *Alexei Betin*
>
> Principal Architect; Big Data
> P: (817) 928-1643 | Elevate.com <http://www.elevate.com>
> 4150 International Plaza, Suite 300
> Fort Worth, TX 76109
>
>
>
> Privileged and Confidential. This e-mail, and any attachments thereto, is
> intended only for use by the addressee(s) named herein and may contain
> privileged and/or confidential information. If you have received this
> e-mail in error, please notify me immediately by a return e-mail and delete
> this e-mail. You are hereby notified that any dissemination, distribution
> or copying of this e-mail and/or any attachments thereto, is strictly
> prohibited.
>
>
>
> *From:* general-boun...@developer.marklogic.com *On Behalf Of* Paul Hoehne
> *Sent:* Thursday, January 15, 2015 10:50 AM
> *To:* MarkLogic Developer Discussion
> *Subject:* Re: [MarkLogic Dev General] Bulk content processing in
> MarkLogic
>
>
>
> There’s also the CORB facility.  I would try Taskbot, but if you’re not
> familiar with it, I would also try to do a version of it using
> xdmp:spawn-function, which is a sometimes overlooked but extremely useful
> function that is worth learning.
>
>
>
> *Paul Hoehne*
>
> Senior Consultant
>
> MarkLogic Corporation
>
> paul.hoe...@marklogic.com
>
> mobile: +1 571 830 4735
>
> www.marklogic.com
>
>
>
>
>
> *From: *David Ennis <david.en...@hinttech.com>
> *Reply-To: *MarkLogic Developer Discussion <
> general@developer.marklogic.com>
> *Date: *Thursday, January 15, 2015 at 1:43 PM
> *To: *MarkLogic Developer Discussion <general@developer.marklogic.com>
> *Subject: *Re: [MarkLogic Dev General] Bulk content processing in
> MarkLogic
>
>
>
> Hi.
>
>
>
> I usually spawn these types of things in batches.
>
>
>
> There is also a nice utility by Michael Blakeley that helps manage this
> and make good use of the resources of your particular setup, including a
> nice sample to start with:
>
>
>
> https://github.com/mblakele/taskbot
>
>
>
> It uses pretty much the same functions you would likely use on your own -
> but organized nicely in a reusable/configurable way.
>
>
>
>
>
> Kind Regards,
>
> David Ennis
>
>
>
> On 15 January 2015 at 19:33, Alexei Betin <abe...@elevate.com> wrote:
>
> Hello,
>
>
>
> I have stumbled on what seems like a straightforward task: making a bulk
> modification of XML documents in MarkLogic (such as adding a new element to
> every document in the collection). I first looked at CPF, but it seems to
> support only event-based processing (triggers) and to lack any facility
> for batch processing.
>
>
>
> So I just wrote a simple XQuery as follows:
>
>
>
> for $x in collection()/A  return ( xdmp:node-delete( $x/test ),
> xdmp:node-insert-child( $x, <test>test</test> ) )
>
>
>
> but it runs out of memory on a large collection with “Expanded tree cache
> full”. So it looks like the above query tries to fetch all documents into
> memory first, and then iterates over them.
>
>
>
> Whereas what I want is to perform the work on smaller chunks of data that
> fit into memory and, ideally, do several such chunks in parallel (think
> “map” without “reduce”).
>
>
>
> Is there another approach? I am reading about CoRB that seems to be just
> the thing I need, but I wonder if I am missing another potential solution
> here.
>
>
>
> Also, while the CoRB description mentions that it can run updates on disk
> (not in memory), it does not mention parallelization, which will
> eventually be quite important for my use case.
>
>
>
> Thanks,
>
>
> *Alexei Betin*
>
> _______________________________________________
> General mailing list
> General@developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
