Re: DUCC doesn't use all available machines

Jim Challenger Mon, 17 Nov 2014 13:33:13 -0800

It is also possible that RM "prediction" has decided that additional
processes are not needed. It appears that there were likely 64 work
items dispatched, plus the 6 completed, leaving only 30 that were
"idle". If these work items appeared to be completing quickly, the RM
would decide that scale-up would be wasteful and not do it.

Very gory details if you're interested:
The time to start a new process is measured by the RM based on the
observed initialization time of the processes plus an estimate of how
long it would take to get a new process actually running. A fudge
factor is added on top of this because in a large operation it is
wasteful to start processes (with associated preemptions) that only
end up doing a "few" work items. All of this is subjective and
configurable.


The average time-per-work item is also reported to the RM.

The RM then looks at the number of work items remaining and the
estimated time needed to process this work based on the above. If it
determines that the job will be completed before new processes can be
scaled up and initialized, it does not scale up.
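
The decision described above can be sketched roughly as follows. This
is only an illustration of the idea, not the RM's actual code: the
function name, the parameter names, and the simple "waves of work
items" estimate are all assumptions made for the example. Only the
120000 ms default mirrors the ducc.rm.prediction.fudge value in
ducc.properties.

```python
import math

def should_scale_up(items_remaining, avg_item_ms, threads_running,
                    init_ms, startup_ms, fudge_ms=120_000):
    """Return True if new processes would be ready before the job ends.

    Hypothetical sketch: names and the estimate are illustrative only.
    """
    if threads_running <= 0:
        return True  # nothing is running yet, so more capacity helps
    # Rough time to drain the remaining work at the current capacity.
    waves = math.ceil(items_remaining / threads_running)
    time_to_finish_ms = waves * avg_item_ms
    # Time before a newly started process could do useful work.
    time_to_expand_ms = init_ms + startup_ms + fudge_ms
    return time_to_finish_ms > time_to_expand_ms

# 30 idle items, 64 busy threads, 10 s per item: the job ends first.
print(should_scale_up(30, 10_000, 64, init_ms=30_000, startup_ms=5_000))
# Same job with 10 min per item: expanding is worthwhile.
print(should_scale_up(30, 600_000, 64, init_ms=30_000, startup_ms=5_000))
```

The initialization and startup times in the two calls are made-up
numbers chosen only to show both outcomes.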

For short jobs, this can be a bit inaccurate, but those jobs are short :)

For longer jobs, the time-per-work-item becomes increasingly accurate,
so the RM prediction tends to improve and ramp-up WILL occur if the
work-item time turns out to be larger than originally thought. (Our
experience is that work-item times are mostly uniform with occasional
outliers, but the prediction seems to work well.)

Relevant configuration parameters in ducc.properties:

# Predict when a job will end and avoid expanding if not needed. Set to false to disable prediction.

   ducc.rm.prediction = true

# Add this fudge factor (milliseconds) to the expansion target when using prediction

   ducc.rm.prediction.fudge = 120000

You can observe this in the RM log; see the example below. I'm
preparing a guide to this log; for now, the net of these two log lines
is: the projection for the job in question (job 208927) is that 16
processes are needed to complete this job, even though the job could
use 20 processes at full expansion (the BaseCap), so a max of 16 will
be scheduled for it, subject to the fair-share constraint.

17 Nov 2014 15:07:38,880 INFO RM.RmJob - */getPrjCap/* 208927 bobuser O 2 T 343171 NTh 128 TI 143171 TR 6748.601431980907 R 1.8967e-02 QR 5043 P 6509 F 0 ST 1416254363603 */return 16/*
17 Nov 2014 15:07:38,880 INFO RM.RmJob - */initJobCap/* 208927 bobuser O 2 */Base cap:/* 20 Expected future cap: 16 potential cap 16 actual cap 16
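
If you want to pull those two numbers out of the log yourself, a small
sketch along these lines may help. It assumes the initJobCap line
looks like the example above; the log layout may differ between DUCC
releases, and parse_caps is a name invented for this example, not part
of DUCC.

```python
import re

def parse_caps(line):
    """Extract (base cap, expected future cap) from an initJobCap line.

    Returns None if either number is missing. The patterns skip any
    emphasis markers that follow the "Base cap:" label.
    """
    base = re.search(r"Base cap:\D*?(\d+)", line)
    future = re.search(r"Expected future cap:\s*(\d+)", line)
    if not (base and future):
        return None
    return int(base.group(1)), int(future.group(1))

line = ("17 Nov 2014 15:07:38,880 INFO RM.RmJob - */initJobCap/* 208927 "
        "bobuser O 2 */Base cap:/* 20 Expected future cap: 16 "
        "potential cap 16 actual cap 16")
base_cap, future_cap = parse_caps(line)
# The job is held to the smaller of the two values.
print(base_cap, future_cap, min(base_cap, future_cap))
```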


Jim

On 11/17/14, 3:44 PM, Eddie Epstein wrote:

DuccRawTextSpec.job specifies that each job process (JP)
run 8 analytic pipeline threads. So for this job with 100 work
items, no more than 13 JPs would ever be started.
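
The 13 comes from rounding up 100 / 8. As a quick check (the function
name is just for illustration):

```python
import math

def max_job_processes(work_items, threads_per_jp):
    """Upper bound on JPs: every thread needs at least one work item."""
    return math.ceil(work_items / threads_per_jp)

print(max_job_processes(100, 8))  # 13
```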

After successful initialization of the first JP, DUCC begins scaling
up the number of JPs using doubling. During JP scale up the
scheduler monitors the work item completion rate, compares that
with the JP initialization time, and stops scaling up JPs when
starting more JPs will not make the job run any faster.
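
A rough sketch of the doubling pattern, for illustration only and not
DUCC's scheduler code. The cap argument is an assumption that stands
for whichever limit applies first: the work-item bound, the
scheduler's prediction, or fair share.

```python
def doubling_schedule(cap):
    """JP counts seen while doubling from 1 until the cap is reached."""
    counts = []
    n = 1
    while n < cap:
        counts.append(n)
        n *= 2
    counts.append(cap)  # the final step is clipped to the cap
    return counts

print(doubling_schedule(13))  # [1, 2, 4, 8, 13]
print(doubling_schedule(8))   # [1, 2, 4, 8]
```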

Of course JP scale up is also limited by the job's "fair share"
of resources relative to total resources available for all preemptable jobs.

To see more JPs, increase the number and/or size of the input text files,
or decrease the number of pipeline threads per JP.

Note that it can be counterproductive to run "too many" pipeline
threads per machine. Assuming analytic threads are 100% CPU bound,
running more threads than real cores will often slow down the overall
document processing rate.


On Mon, Nov 17, 2014 at 6:48 AM, Simon Hafner <reactorm...@gmail.com> wrote:

I fired the DuccRawTextSpec.job on a cluster consisting of three
machines, with 100 documents. The scheduler only runs the processes on
two machines instead of all three. Can I mess with a few config
variables to make it use all three?

id:22 state:Running total:100 done:0 error:0 retry:0 procs:1
id:22 state:Running total:100 done:0 error:0 retry:0 procs:2
id:22 state:Running total:100 done:0 error:0 retry:0 procs:4
id:22 state:Running total:100 done:1 error:0 retry:0 procs:8
id:22 state:Running total:100 done:6 error:0 retry:0 procs:8
