Michael,
What settings do you use to achieve these results?
If search-algorithm = 1, what cube-pruning pop limit?
On 08/10/2015 19:05, Michael Denkowski wrote:
Hi all,
I extended the multi_moses.py script to support multi-threaded moses
instances for cases where memory limits the number of decoders that
can run in parallel. The threads arg now takes the form "--threads
P:T:E" to run P processes using T threads each and an optional extra
process running E threads. The script sends input lines to instances
as their threads become free, so all CPUs stay busy for the full decoding run.
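For anyone curious, the scheduling idea is roughly the following (a
sketch only, not the actual multi_moses.py code; decode() is a stand-in
for piping one line to a moses process and reading the translation back):

    from concurrent.futures import ThreadPoolExecutor
    import queue

    def run_pool(lines, instances):
        # instances: list of (decode, n_threads) pairs, one per process
        slots = queue.Queue()
        for decode, n_threads in instances:
            for _ in range(n_threads):      # one slot per decoder thread
                slots.put(decode)

        def translate(line):
            decode = slots.get()            # block until a thread is free
            try:
                return decode(line)
            finally:
                slots.put(decode)           # hand the slot to the next line

        # map() keeps outputs in input order even if lines finish out of order
        with ThreadPoolExecutor(max_workers=slots.qsize()) as pool:
            return list(pool.map(translate, lines))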
I ran some more benchmarks with the CompactPT system, trading off
between threads and processes:
procs x threads   sent/sec
1 x 16                5.46
2 x 8                 7.58
4 x 4                 9.71
8 x 2                12.50
16 x 1               14.08
From the results so far, it's best to use as many instances as will
fit into memory and evenly distribute CPUs. For example, a system
with 32 CPUs that could fit 3 copies of moses into memory could use
"--threads 2:11:10" to run 2 instances with 11 threads each and 1
instance with 10 threads. The script can be used with mert-moses.pl
via the --multi-moses flag and --decoder-flags='--threads P:T:E'.
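Picking P:T:E for a given machine can be automated; here is a small
sketch (cpu_split is a hypothetical helper, not part of the script):

    def cpu_split(cpus, instances):
        # Pick (P, T, E) for --threads P:T:E so that P*T + E == cpus
        # and the per-process thread counts stay as even as possible.
        assert cpus >= instances
        if cpus % instances == 0:
            return instances, cpus // instances, 0
        p = instances - 1
        best = None
        for t in (cpus // instances, -(-cpus // instances)):  # floor, ceil
            e = cpus - p * t
            if e > 0 and (best is None or abs(t - e) < best[0]):
                best = (abs(t - e), p, t, e)
        return best[1:]

    print(cpu_split(32, 3))  # (2, 11, 10), i.e. "--threads 2:11:10" as above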
Best,
Michael
On Tue, Oct 6, 2015 at 4:39 PM, Michael Denkowski
<michael.j.denkow...@gmail.com> wrote:
Hi Hieu and all,
I just checked in a bug fix for the multi_moses.py script. I
forgot to override the number of threads for each moses command,
so if [threads] was specified in the moses.ini, the multi-moses
runs were cheating by running a bunch of multi-threaded
instances. If threads were only being specified on the command
line, the script was correctly stripping the flag so everything
should be good. I finished a benchmark on my system with an
unpruned compact PT (with the fixed script) and got the following:
16 threads 5.38 sent/sec
16 procs 13.51 sent/sec
This definitely used a lot more memory though. Based on some very
rough estimates looking at free system memory, the memory mapped
suffix array PT went from 2G to 6G with 16 processes while the
compact PT went from 3G to 37G. For cases where everything fits
into memory, I've seen significant speedup from multi-process
decoding.
For cases where things don't fit into memory, the multi-moses
script could be extended to start as many multi-threaded instances
as will fit into RAM and farm out sentences in a way that keeps
all of the CPUs busy. I know Marcin has mentioned using GNU parallel.
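As a starting point for the "as many as will fit" check, something like
this could work (a rough sketch, Linux-only; per_instance_gb and
reserve_gb are hypothetical numbers you would measure for your own
models):

    def instances_that_fit(per_instance_gb, reserve_gb=2.0):
        # Estimate how many copies of moses fit in RAM by reading
        # MemAvailable (in kB) from /proc/meminfo.
        with open('/proc/meminfo') as f:
            fields = dict(line.split(':', 1) for line in f)
        avail_gb = int(fields['MemAvailable'].split()[0]) / (1024.0 ** 2)
        return max(1, int((avail_gb - reserve_gb) // per_instance_gb))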
Best,
Michael
On Tue, Oct 6, 2015 at 4:16 PM, Hieu Hoang <hieuho...@gmail.com> wrote:
I've just run some comparisons between the multithreaded decoder
and the multi_moses.py script. It's good stuff.
It makes me seriously wonder whether we should abandon
multi-threading and go all out for the multi-process approach.
There are some advantages to multi-threading, e.g. where model files
are loaded into memory rather than memory-mapped. But there are
disadvantages too: it's more difficult to maintain, and there's
about a 10% overhead.
What do people think?
Phrase-based:

                1           5           10          15          20          25          30
32 Baseline (Compact pt)
  real     4m37.000s   1m15.391s   0m51.217s   0m48.287s   0m50.719s   0m52.027s   0m53.045s
  user     4m21.544s   5m28.597s   6m38.227s   8m0.975s    8m21.122s   8m3.195s    8m4.663s
  sys      0m15.451s   0m34.669s   0m53.867s   1m10.515s   1m20.746s   1m24.368s   1m23.677s

34 (32) + multi_moses
  real     4m49.474s   1m17.867s   0m43.096s   0m31.999s   0m26.497s   0m26.296s   killed
  user     4m33.580s   4m40.486s   4m56.749s   5m6.692s    5m43.845s   7m34.617s
  sys      0m15.957s   0m32.347s   0m51.016s   1m11.106s   1m44.115s   2m21.263s

38 Baseline (Probing pt)
  real     4m46.254s   1m16.637s   0m49.711s   0m48.389s   0m49.144s   0m51.676s   0m52.472s
  user     4m30.596s   5m32.500s   6m23.706s   7m40.791s   7m51.946s   7m52.892s   7m53.569s
  sys      0m15.624s   0m36.169s   0m49.433s   1m6.812s    1m9.614s    1m13.108s   1m12.644s

39 (38) + multi_moses
  real     4m43.882s   1m17.849s   0m34.245s   0m31.318s   0m28.054s   0m24.120s   0m22.520s
  user     4m29.212s   4m47.693s   5m5.750s    5m33.573s   6m18.847s   7m19.642s   8m38.013s
  sys      0m15.835s   0m25.398s   0m36.716s   0m41.349s   0m48.494s   1m0.843s    1m13.215s

Hiero:

                1           5           10          15          20          25          30          ?
3 6/10 baseline
  real     5m33.011s   1m28.935s   0m59.470s   1m0.315s    0m55.619s   0m57.347s   0m59.191s   1m2.786s
  user     4m53.187s   6m23.521s   8m17.170s   12m48.303s  14m45.954s  17m58.109s  20m22.891s  21m13.605s
  sys      0m39.696s   0m51.519s   1m3.788s    1m22.125s   1m58.718s   2m51.249s   4m4.807s    4m37.691s

4 (3) + multi_moses
  real                 1m27.215s   0m40.495s   0m36.206s   0m28.623s   0m26.631s   0m25.817s   0m25.401s
  user                 5m4.819s    5m42.070s   5m35.132s   6m46.001s   7m38.151s   9m6.500s    10m32.739s
  sys                  0m38.039s   0m45.753s   0m44.117s   0m52.285s   0m56.655s   1m6.749s    1m16.935s
On 05/10/2015 16:05, Michael Denkowski wrote:
Hi Philipp,
Unfortunately I don't have a precise measurement. If anyone
knows of a good way to benchmark a process tree in which many
processes memory-map the same files, I would be glad to run it.
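The closest thing I can think of, as a rough sketch (assuming Linux,
where each mapping's proportional set size shows up in
/proc/<pid>/smaps): sum PSS over the process tree. PSS charges each
shared page 1/N to each of the N processes mapping it, so models
memory-mapped by several instances are not double-counted.

    import os

    def pss_kb(pid):
        # Sum the 'Pss:' lines of /proc/<pid>/smaps (values are in kB).
        total = 0
        try:
            with open('/proc/%d/smaps' % pid) as f:
                for line in f:
                    if line.startswith('Pss:'):
                        total += int(line.split()[1])
        except IOError:              # process exited or access denied
            pass
        return total

    def children(pid):
        # Direct children: ppid is the 4th field of /proc/<pid>/stat,
        # read after the ')' so command names with spaces don't break it.
        kids = []
        for entry in os.listdir('/proc'):
            if not entry.isdigit():
                continue
            try:
                with open('/proc/%s/stat' % entry) as f:
                    stat = f.read()
            except IOError:
                continue
            if int(stat.rsplit(')', 1)[1].split()[1]) == pid:
                kids.append(int(entry))
        return kids

    def tree_pss_kb(pid):
        # Total PSS of a process and all of its descendants.
        return pss_kb(pid) + sum(tree_pss_kb(k) for k in children(pid))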
--Michael
On Mon, Oct 5, 2015 at 10:26 AM, Philipp Koehn <p...@jhu.edu> wrote:
Hi,
great - that will be very useful.
Since you just ran the comparison - do you have any
numbers on "still allowed everything to fit into memory",
i.e., how much more memory is used by running parallel
instances?
-phi
On Mon, Oct 5, 2015 at 10:15 AM, Michael Denkowski
<michael.j.denkow...@gmail.com> wrote:
Hi all,
Like some other Moses users, I noticed diminishing
returns from running Moses with several threads. To
work around this, I added a script to run multiple
single-threaded instances of moses instead of one
multi-threaded instance. In practice, this sped
things up by about 2.5x for 16 CPUs, and using
memory-mapped models still allowed everything to fit
into memory.
If anyone else is interested in using this, you can
prefix a moses command with
scripts/generic/multi_moses.py. To use multiple
instances in mert-moses.pl,
specify --multi-moses and control the number of
parallel instances with --decoder-flags='-threads N'.
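For example (hypothetical paths, following the description above;
-f is moses's config flag):

    scripts/generic/multi_moses.py $MOSES/bin/moses -f moses.ini \
        -threads 16 < input.txt > output.txt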
Below is a benchmark on WMT fr-en data (2M training
sentences, 400M words mono, suffix array PT, compact
reordering, 5-gram KenLM) testing default stack
decoding vs cube pruning without and with the
parallelization script (+multi):
---
1 CPU     sent/sec
stack         1.04
cube          2.10
---
16 CPUs   sent/sec
stack         7.63
+multi       12.20
cube          7.63
+multi       18.18
---
--Michael
--
Hieu Hoang
http://www.hoang.co.uk/hieu
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support