I've thrown some numbers at it (doubling each) and it's running at a comfortable 125 procs. However, at about 6.1k of 6.5k items, the procs drop down to 30.
2014-11-28 18:13 GMT-06:00 Eddie Epstein <eaepst...@gmail.com>:
> Now you are hitting a limit configured in ducc.properties:
>
> # Max number of work-item CASes for each job
> ducc.threads.limit = 500
>
> 62 job processes * 8 threads per process = 496 max concurrent work items.
> This was put in to limit the memory required by the job driver. This value
> can probably be pushed up into the range of 700-800 before the job driver
> will go OOM. There are configuration parameters to increase JD memory:
>
> # Memory size in MB allocated for each JD
> ducc.jd.share.quantum = 450
> # JD max heap size. Should be smaller than the JD share quantum
> ducc.driver.jvm.args = -Xmx400M -DUimaAsCasTracking
>
> DUCC would have to be restarted for the JD size parameters to take effect.
>
> One of the current DUCC development items is to significantly reduce the
> memory needed per work item, and raise the default limit for concurrent
> work items by two or three orders of magnitude.
>
> On Fri, Nov 28, 2014 at 6:40 PM, Simon Hafner <reactorm...@gmail.com> wrote:
>
>> I've put the fudge to 12000, and it jumped immediately to 62 procs.
>> However, it doesn't spawn more procs even though it has about 6k items
>> left.
>>
>> 2014-11-17 15:30 GMT-06:00 Jim Challenger <chall...@gmail.com>:
>>> It is also possible that RM "prediction" has decided that additional
>>> processes are not needed. It appears that there were likely 64 work
>>> items dispatched, plus the 6 completed, leaving only 30 that were
>>> "idle". If these work items appeared to be completing quickly, the RM
>>> would decide that scale-up would be wasteful and not do it.
>>>
>>> Very gory details if you're interested:
>>> The time to start a new process is measured by the RM based on the
>>> observed initialization time of the processes plus an estimate of how
>>> long it would take to get a new process actually running. A fudge
>>> factor is added on top of this because in a large operation it is
>>> wasteful to start processes (with associated preemptions) that only end
>>> up doing a "few" work items. All is subjective and configurable.
>>>
>>> The average time per work item is also reported to the RM.
>>>
>>> The RM then looks at the number of work items remaining and the
>>> estimated time needed to process this work based on the above, and if
>>> it determines that the job will be completed before new processes can
>>> be scaled up and initialized, it does not scale up.
>>>
>>> For short jobs, this can be a bit inaccurate, but those jobs are short :)
>>>
>>> For longer jobs, the time-per-work-item becomes increasingly accurate,
>>> so the RM prediction tends to improve, and ramp-up WILL occur if the
>>> work-item time turns out to be larger than originally thought. (Our
>>> experience is that work-item times are mostly uniform with occasional
>>> outliers, but the prediction seems to work well.)
>>>
>>> Relevant configuration parameters in ducc.properties:
>>> # Predict when a job will end and avoid expanding if not needed.
>>> # Set to false to disable prediction.
>>> ducc.rm.prediction = true
>>> # Add this fudge factor (milliseconds) to the expansion target when
>>> # using prediction
>>> ducc.rm.prediction.fudge = 120000
>>>
>>> You can observe this in the rm log, see the example below.
>>> I'm preparing a guide to this log; for now, the net of these two log
>>> lines is: the projection for the job in question (job 208927) is that
>>> 16 processes are needed to complete this job, even though the job could
>>> use 20 processes at full expansion - the BaseCap - so a max of 16 will
>>> be scheduled for it, subject to fair-share constraints.
>>>
>>> 17 Nov 2014 15:07:38,880 INFO RM.RmJob - getPrjCap 208927 bobuser O 2
>>> T 343171 NTh 128 TI 143171 TR 6748.601431980907 R 1.8967e-02 QR 5043
>>> P 6509 F 0 ST 1416254363603 return 16
>>> 17 Nov 2014 15:07:38,880 INFO RM.RmJob - initJobCap 208927 bobuser O 2
>>> Base cap: 20 Expected future cap: 16 potential cap 16 actual cap 16
>>>
>>> Jim
>>>
>>> On 11/17/14, 3:44 PM, Eddie Epstein wrote:
>>>> DuccRawTextSpec.job specifies that each job process (JP) run 8
>>>> analytic pipeline threads. So for this job with 100 work items, no
>>>> more than 13 JPs would ever be started.
>>>>
>>>> After successful initialization of the first JP, DUCC begins scaling
>>>> up the number of JPs using doubling. During JP scale-up the scheduler
>>>> monitors the work item completion rate, compares that with the JP
>>>> initialization time, and stops scaling up JPs when starting more JPs
>>>> will not make the job run any faster.
>>>>
>>>> Of course JP scale-up is also limited by the job's "fair share" of
>>>> resources relative to the total resources available for all
>>>> preemptable jobs.
>>>>
>>>> To see more JPs, increase the number and/or size of the input text
>>>> files, or decrease the number of pipeline threads per JP.
>>>>
>>>> Note that it can be counterproductive to run "too many" pipeline
>>>> threads per machine. Assuming analytic threads are 100% CPU bound,
>>>> running more threads than real cores will often slow down the overall
>>>> document processing rate.
>>>>
>>>> On Mon, Nov 17, 2014 at 6:48 AM, Simon Hafner <reactorm...@gmail.com>
>>>> wrote:
>>>>
>>>>> I fired the DuccRawTextSpec.job on a cluster consisting of three
>>>>> machines, with 100 documents. The scheduler only runs the processes
>>>>> on two machines instead of all three. Can I mess with a few config
>>>>> variables to make it use all three?
>>>>>
>>>>> id:22 state:Running total:100 done:0 error:0 retry:0 procs:1
>>>>> id:22 state:Running total:100 done:0 error:0 retry:0 procs:2
>>>>> id:22 state:Running total:100 done:0 error:0 retry:0 procs:4
>>>>> id:22 state:Running total:100 done:1 error:0 retry:0 procs:8
>>>>> id:22 state:Running total:100 done:6 error:0 retry:0 procs:8
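For anyone else hitting the same ceiling: the ducc.properties edits Eddie
describes above would look roughly like the snippet below. The specific
numbers (800, 700, 650M) are only illustrative guesses within the range he
mentions, not tested recommendations; the hard points from his note are that
the JD heap must stay smaller than the JD share quantum and that DUCC has to
be restarted for the JD size parameters to take effect.

    # Max number of work-item CASes for each job
    # (default 500; 800 is the top of the range Eddie mentions - a guess,
    # not a tested value)
    ducc.threads.limit = 800

    # Memory size in MB allocated for each JD (default 450); raised here
    # only as an illustration to give the job driver more headroom
    ducc.jd.share.quantum = 700

    # JD max heap size. Must stay smaller than the JD share quantum
    ducc.driver.jvm.args = -Xmx650M -DUimaAsCasTracking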
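And for readers who want Jim's prediction logic in one place, here is a
minimal sketch of the decision he describes. It is a paraphrase of the
behavior, not DUCC's actual RM code; the class, method, and parameter names
are invented for illustration.

    // Sketch of the RM expansion decision Jim describes above.
    // NOT the real DUCC RM source; names and structure are illustrative only.
    class PredictionSketch {
        // true  -> allow the job to expand to more processes
        // false -> job is predicted to finish before new processes could help
        static boolean shouldExpand(long workItemsRemaining,
                                    long avgMsPerWorkItem,
                                    int  threadsCurrentlyRunning,
                                    long observedInitMs,      // measured JP initialization time
                                    long predictionFudgeMs) { // ducc.rm.prediction.fudge
            // Estimated time for the processes already running to finish
            // the remaining work items.
            long projectedFinishMs =
                    (workItemsRemaining * avgMsPerWorkItem)
                            / Math.max(1, threadsCurrentlyRunning);

            // Estimated time before a newly started process could contribute:
            // observed initialization time plus the configured fudge factor.
            long timeToUsefulNewProcessMs = observedInitMs + predictionFudgeMs;

            // If the job would finish before a new process becomes useful,
            // expanding is wasteful and the RM holds at the current size.
            return projectedFinishMs > timeToUsefulNewProcessMs;
        }
    }

Raising ducc.rm.prediction.fudge makes the RM more willing to expand;
setting ducc.rm.prediction = false disables the check entirely.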