Michael,
What settings do you use to achieve these results?
If search-algorithm = 1, what cube-pruning pop limit?

On 08/10/2015 19:05, Michael Denkowski wrote:
Hi all,

I extended the multi_moses.py script to support multi-threaded moses instances for cases where memory limits the number of decoders that can run in parallel. The threads argument now takes the form "--threads P:T:E" to run P processes with T threads each, plus an optional extra process running E threads. The script sends input lines to instances as they have free threads, so all CPUs stay busy for the full decoding run.
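
For anyone curious how a P:T:E value expands, here is a minimal sketch of one way to turn it into a per-instance list of thread counts. This is only an illustration of the semantics described above, not the actual multi_moses.py code, and it assumes a plain "--threads N" still means N single-threaded instances as in the original script:

# Illustrative only: expand a "--threads P:T:E" value into one thread count
# per decoder instance, following the semantics described above.  The real
# multi_moses.py may parse and validate this differently.
def expand_threads(spec):
    """'2:11:10' -> [11, 11, 10]."""
    parts = [int(x) for x in spec.split(':')]
    if len(parts) == 1:                    # plain "--threads N": assume N
        return [1] * parts[0]              # single-threaded processes
    counts = [parts[1]] * parts[0]         # P processes with T threads each
    if len(parts) == 3 and parts[2] > 0:   # optional extra process, E threads
        counts.append(parts[2])
    return counts

print(expand_threads('2:11:10'))           # [11, 11, 10]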

I ran some more benchmarks with the CompactPT system, trading off between threads and processes:

procs x threads   sent/sec
1 x 16                5.46
2 x 8                 7.58
4 x 4                 9.71
8 x 2                12.50
16 x 1               14.08

From the results so far, it's best to use as many instances as will fit into memory and distribute the CPUs evenly among them. For example, a system with 32 CPUs that could fit 3 copies of moses into memory could use "--threads 2:11:10" to run 2 instances with 11 threads each and 1 instance with 10 threads. The script can be used with mert-moses.pl via the --multi-moses flag and --decoder-flags='--threads P:T:E'.
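
As a quick illustration of that arithmetic, a small helper along these lines (hypothetical, not part of the script) turns a CPU count and the number of moses copies that fit into memory into a P:T:E value:

# Hypothetical helper: distribute `cpus` as evenly as possible over `copies`
# decoder instances and print the corresponding P:T:E spec.
def suggest_spec(cpus, copies):
    """e.g. suggest_spec(32, 3) -> '2:11:10' (2 x 11 threads + 1 x 10)."""
    if cpus % copies == 0:
        return '{}:{}'.format(copies, cpus // copies)     # even split, no extra process
    hi = (cpus + copies - 1) // copies                    # ceiling of cpus / copies
    extra = cpus - (copies - 1) * hi
    if extra >= 1:                                        # copies-1 instances at hi threads,
        return '{}:{}:{}'.format(copies - 1, hi, extra)   # plus one slightly smaller instance
    lo = cpus // copies                                   # otherwise one slightly larger instance
    return '{}:{}:{}'.format(copies - 1, lo, cpus - (copies - 1) * lo)

print(suggest_spec(32, 3))   # 2:11:10, the example above
print(suggest_spec(32, 4))   # 4:8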

Best,
Michael


On Tue, Oct 6, 2015 at 4:39 PM, Michael Denkowski <michael.j.denkow...@gmail.com> wrote:

    Hi Hieu and all,

    I just checked in a bug fix for the multi_moses.py script.  I
    forgot to override the number of threads for each moses command,
    so if [threads] were specified in the moses.ini, the multi-moses
    runs were cheating by running a bunch of multi-threaded
    instances.  If threads were only being specified on the command
    line, the script was correctly stripping the flag so everything
    should be good.  I finished a benchmark on my system with an
    unpruned compact PT (with the fixed script) and got the following:

    16 threads 5.38 sent/sec
    16 procs  13.51 sent/sec

    This definitely used a lot more memory though.  Based on some very
    rough estimates looking at free system memory, the memory mapped
    suffix array PT went from 2G to 6G with 16 processes while the
    compact PT went from 3G to 37G. For cases where everything fits
    into memory, I've seen significant speedup from multi-process
    decoding.

    For cases where things don't fit into memory, the multi-moses
    script could be extended to start as many multi-threaded instances
    as will fit into RAM and farm out sentences in a way that keeps
    all of the CPUs busy.  I know Marcin has mentioned using GNU parallel.
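
    One way such farming could look is sketched below, purely as an
    illustration rather than the actual multi_moses.py code: each decoder
    instance gets its own worker thread that pulls the next sentence from
    a shared queue, so whichever instance finishes first picks up more
    work.  The moses command line and instance count are placeholders.

# Sketch only: farm input lines out to several decoder instances so that
# whichever instance is free takes the next sentence.  The moses command
# here is a placeholder; a real script would handle flags, ini files, etc.
import queue
import subprocess
import sys
import threading

MOSES_CMD = ['moses', '-f', 'moses.ini', '-threads', '1']  # placeholder
NUM_INSTANCES = 4                                          # placeholder

def worker(jobs, results):
    # Each worker owns one long-running decoder process.
    proc = subprocess.Popen(MOSES_CMD, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            universal_newlines=True, bufsize=1)
    while True:
        item = jobs.get()
        if item is None:                  # poison pill: no more input
            break
        idx, line = item
        proc.stdin.write(line + '\n')
        proc.stdin.flush()
        results[idx] = proc.stdout.readline().rstrip('\n')
    proc.stdin.close()
    proc.wait()

lines = [l.rstrip('\n') for l in sys.stdin]
results = [None] * len(lines)
jobs = queue.Queue()
workers = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(NUM_INSTANCES)]
for w in workers:
    w.start()
for item in enumerate(lines):
    jobs.put(item)                        # (index, sentence) pairs
for _ in workers:
    jobs.put(None)                        # one pill per worker
for w in workers:
    w.join()
print('\n'.join(results))                 # output restored to input order

    With each instance kept busy like this, the CPUs stay fed for the
    whole input, which is roughly the behavior of the extended script
    described earlier in the thread.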

    Best,
    Michael

        On Tue, Oct 6, 2015 at 4:16 PM, Hieu Hoang <hieuho...@gmail.com> wrote:

        I've just run some comparisons between the multi-threaded decoder
        and the multi_moses.py script. It's good stuff.

        It makes me seriously wonder whether we should abandon
        multi-threading and go all out for the multi-process approach.

        There are some advantages to multi-threading - e.g. where model files
        are loaded into memory rather than memory mapped. But there are
        disadvantages too - it's more difficult to maintain and there's
        about a 10% overhead.

        What do people think?

        Phrase-based (columns = 1, 5, 10, 15, 20, 25, 30 threads or processes):

        Run 32: Baseline (Compact pt)
          real   4m37.000s   1m15.391s   0m51.217s   0m48.287s   0m50.719s   0m52.027s   0m53.045s
          user   4m21.544s   5m28.597s   6m38.227s   8m0.975s    8m21.122s   8m3.195s    8m4.663s
          sys    0m15.451s   0m34.669s   0m53.867s   1m10.515s   1m20.746s   1m24.368s   1m23.677s

        Run 34: (32) + multi_moses
          real   4m49.474s   1m17.867s   0m43.096s   0m31.999s   0m26.497s   0m26.296s   killed
          user   4m33.580s   4m40.486s   4m56.749s   5m6.692s    5m43.845s   7m34.617s   -
          sys    0m15.957s   0m32.347s   0m51.016s   1m11.106s   1m44.115s   2m21.263s   -

        Run 38: Baseline (Probing pt)
          real   4m46.254s   1m16.637s   0m49.711s   0m48.389s   0m49.144s   0m51.676s   0m52.472s
          user   4m30.596s   5m32.500s   6m23.706s   7m40.791s   7m51.946s   7m52.892s   7m53.569s
          sys    0m15.624s   0m36.169s   0m49.433s   1m6.812s    1m9.614s    1m13.108s   1m12.644s

        Run 39: (38) + multi_moses
          real   4m43.882s   1m17.849s   0m34.245s   0m31.318s   0m28.054s   0m24.120s   0m22.520s
          user   4m29.212s   4m47.693s   5m5.750s    5m33.573s   6m18.847s   7m19.642s   8m38.013s
          sys    0m15.835s   0m25.398s   0m36.716s   0m41.349s   0m48.494s   1m0.843s    1m13.215s


        Hiero:

        Run 3: 6/10 baseline
          real   5m33.011s   1m28.935s   0m59.470s   1m0.315s     0m55.619s    0m57.347s    0m59.191s    1m2.786s
          user   4m53.187s   6m23.521s   8m17.170s   12m48.303s   14m45.954s   17m58.109s   20m22.891s   21m13.605s
          sys    0m39.696s   0m51.519s   1m3.788s    1m22.125s    1m58.718s    2m51.249s    4m4.807s     4m37.691s

        Run 4: (3) + multi_moses
          real   -           1m27.215s   0m40.495s   0m36.206s    0m28.623s    0m26.631s    0m25.817s    0m25.401s
          user   -           5m4.819s    5m42.070s   5m35.132s    6m46.001s    7m38.151s    9m6.500s     10m32.739s
          sys    -           0m38.039s   0m45.753s   0m44.117s    0m52.285s    0m56.655s    1m6.749s     1m16.935s


        On 05/10/2015 16:05, Michael Denkowski wrote:
        Hi Philipp,

        Unfortunately I don't have a precise measurement.  If anyone
        knows of a good way to benchmark a process tree with many
        processes memory-mapping the same files, I would be glad to run it.
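
        One option on Linux (an assumption on my part, not something
        anyone in this thread has verified) is to sum the proportional
        set size (PSS) from /proc/<pid>/smaps across the process tree;
        PSS charges each shared page fractionally to every process
        mapping it, so memory-mapped models are not double counted.
        A rough sketch:

# Rough sketch: estimate total memory of a set of decoder processes that
# mmap the same model files by summing PSS (proportional set size), which
# charges each shared page fractionally to every process mapping it.
import sys

def pss_kb(pid):
    total = 0
    with open('/proc/{}/smaps'.format(pid)) as f:
        for line in f:
            if line.startswith('Pss:'):
                total += int(line.split()[1])   # values are reported in kB
    return total

pids = [int(p) for p in sys.argv[1:]]           # e.g. the PIDs of all moses instances
print('{:.2f} GB'.format(sum(pss_kb(p) for p in pids) / (1024.0 * 1024.0)))

        Invoked, say, as "python sum_pss.py $(pgrep moses)" (the script
        name here is hypothetical).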

        --Michael

            On Mon, Oct 5, 2015 at 10:26 AM, Philipp Koehn <p...@jhu.edu> wrote:

            Hi,

            great - that will be very useful.

            Since you just ran the comparison - do you have any
            numbers on "still allowed everything to fit into memory",
            i.e., how much more memory is used by running parallel
            instances?

            -phi

            On Mon, Oct 5, 2015 at 10:15 AM, Michael Denkowski <michael.j.denkow...@gmail.com> wrote:

                Hi all,

                Like some other Moses users, I noticed diminishing
                returns from running Moses with several threads.  To
                work around this, I added a script to run multiple
                single-threaded instances of moses instead of one
                multi-threaded instance.  In practice, this sped
                things up by about 2.5x for 16 CPUs, and using
                memory-mapped models still allowed everything to fit
                into memory.

                If anyone else is interested in using this, you can
                prefix a moses command with
                scripts/generic/multi_moses.py. To use multiple
                instances in mert-moses.pl,
                specify --multi-moses and control the number of
                parallel instances with --decoder-flags='-threads N'.

                Below is a benchmark on WMT fr-en data (2M training
                sentences, 400M words mono, suffix array PT, compact
                reordering, 5-gram KenLM) testing default stack
                decoding vs cube pruning without and with the
                parallelization script (+multi):

                ---
                1cpu   sent/sec
                stack      1.04
                cube       2.10
                ---
                16cpu  sent/sec
                stack      7.63
                +multi    12.20
                cube       7.63
                +multi    18.18
                ---

                --Michael







        --
        Hieu Hoang
        http://www.hoang.co.uk/hieu




