Yes, I get it now.
By 'number of threads' you mean the Task Server "threads" setting
(max threads per server).
It still seems the queue might get up to 2x-1 tasks, but that's vastly better
than filling the entire queue ...
I had misread 'very low' to mean low compared to the max threads setting, not
the queue size.



-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
d...@marklogic.com
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com<http://www.marklogic.com/>

From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Geert Josten
Sent: Friday, January 16, 2015 5:39 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Bulk content processing in MarkLogic

Yes, taskbot is definitely worth a look. :)

"unless the processing time is exactly equal this could get ahead of itself ... 
if T' finishes before all T (say 3 T's are left when T is done and spawns 10 
more T's , repeat)."

@David Lee: not sure what you are trying to say there. There is always just one 
T(m) running, and it will always spawn just one T(m+1). If there are more tasks 
on the queue than there are threads available (because of T' or other tasks 
still busy), T(m+1) will automatically wait until the queue has dropped below 
the number of available threads. That way the number of tasks in the queue will 
always be very low, regardless of the total number of docs you intend to process.
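
To make the queue-depth argument concrete, here is a small simulation of the pattern, not MarkLogic code. It is only a sketch under stated assumptions: the task queue is strictly FIFO, nothing else uses it, a controller T(m) enqueues its T' workers and then T(m+1) as soon as it starts, and every name and number (simulate, fanout, work_ticks, 1000 batches, 4 threads) is made up for illustration. The real Task Server may schedule differently.

```python
from collections import deque


def simulate(total_batches, fanout, threads, work_ticks=3):
    """Return the peak task-queue depth for the spawning pattern.

    fanout=None models the naive approach, where a single task spawns
    every batch up front. All parameters are hypothetical.
    """
    queue = deque([("controller", 0)])  # FIFO task queue, seeded with T(0)
    running = []                        # remaining ticks for each busy thread
    peak = 0
    while queue or running:
        # Hand queued tasks to free threads, oldest first.
        while queue and len(running) < threads:
            kind, start = queue.popleft()
            if kind == "controller":
                if fanout is None:
                    stop = total_batches
                else:
                    stop = min(start + fanout, total_batches)
                for batch in range(start, stop):
                    queue.append(("worker", batch))    # T' for one batch
                if stop < total_batches:
                    queue.append(("controller", stop))  # T(m+1)
                running.append(1)           # the controller itself is quick
            else:
                running.append(work_ticks)  # a worker processes its batch
            peak = max(peak, len(queue))
        # Advance the clock one tick and retire finished tasks.
        running = [t - 1 for t in running if t > 1]
    return peak


print(simulate(1000, 10, 4))    # self-respawning controller: prints 11
print(simulate(1000, None, 4))  # spawn everything up front: prints 1000
```

Under these assumptions the peak is fanout + 1, because T(m+1) sits behind its own workers in the queue and cannot spawn more until all of them have been picked up by a thread.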

Cheers,
Geert

From: David Lee <david....@marklogic.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Friday, January 16, 2015 at 10:35 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Bulk content processing in MarkLogic

---->

To take it up even one more level, you could have your spawning query spawn only 
a limited number of batches (10 maybe), and then spawn itself to do the 
remainder. All spawns end up on the task queue, which is processed in parallel 
already. The creation of tasks on the queue would be paced down when spawning 
the query that creates the batches, so only a very limited number of tasks 
would be on the queue on average. That would prevent overflow, and also leave 
room for other tasks, like evals from other processes and scheduled tasks.
         --<

This is interesting. If I am reading it right, it's like this:
T: spawn controller ( processes a batch, then respawns T )
T': spawn worker ( processes a batch only )
n=batch#, m=controller#

T(m): spawns 10 T' workers, one per batch  ( T'(n,n+10+m) )
       (runs 1 batch?) then spawns itself as T(m+1)
T': processes 1 batch, then exits

unless the processing time is exactly equal this could get ahead of itself ... 
if T' finishes before all T (say 3 T's are left when T is done and spawns 10 
more T's , repeat).

I've done similar things also ... It's a great technique, but I find every time 
I do it, it's trickier than I thought, and each job has different needs, so 
reusing the old code is hard.

This really calls for a generic framework/library that separates out the task 
queue management from the work process and all the tiddly bits that add up to 
99% of the work.

< read back one message > <  face palm > wow!
https://github.com/mblakele/taskbot

Mike, why didn't you read my mind when I needed this for [insert recent 
project] ?
This is really nice.
Of course I can immediately see some additions I would have needed ... (replace 
the list with a function, checkpointing state for restart persistence, ...)
and a billion features I don't need but would be potentially useful ( resource 
capping by querying the meters DB occasionally, cross-server spawns - may need
to serialize function items for that ... hmmm, and of course a GUI ! )

if only it were open source, on GitHub, had a license that allowed commercial 
use without worrying about the lawyers ... had logging and exception support
and were written by a nice person who wouldn't mind the pull requests ...
... oh wait ... it is ! wow.

One question before I put this on my infinite queue of jobs for my clones to 
work on in parallel universes:

How does using $tb:OPTIONS-SYNC avoid the problem of the calling task 
timing out if the job takes too long?

That and the problem of the original list itself being too expensive to create 
were my big stumbling blocks on a recent project.
( the rest was just a PITA of repetitive work this would have eliminated)
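
As a rough illustration of the two ideas mentioned above (replacing the up-front list with a function, and checkpointing for restart), here is a minimal sketch. It is not taskbot's API and not MarkLogic code: lazy_batches, make_items and resume_from are hypothetical names, and a real job would have to persist the checkpoint somewhere durable after each batch succeeds.

```python
from itertools import islice


def lazy_batches(make_items, batch_size, resume_from=0):
    """Yield (checkpoint, batch) pairs without building the full list.

    make_items is a hypothetical zero-argument callable that returns an
    iterator over the work items. resume_from is the checkpoint saved by
    an earlier, interrupted run.
    """
    # Skip work that was already completed. Note that islice still walks
    # over the skipped items; it just does not hand them out.
    items = islice(make_items(), resume_from, None)
    done = resume_from
    while True:
        batch = list(islice(items, batch_size))
        if not batch:
            return
        done += len(batch)
        yield done, batch  # the caller would persist `done` after success


for checkpoint, batch in lazy_batches(lambda: iter(range(7)), 3):
    print(checkpoint, batch)
```

Only one batch is held in memory at a time, and restarting with resume_from set to the last saved checkpoint picks up at the first unprocessed item.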





