Oh, I forgot to attach the results:

Threads                              1           5           10          15          20          25          30          35
Current master             real  1m50.835s   0m24.373s   0m14.991s   0m12.999s   0m11.012s   0m10.012s   0m10.108s   0m11.226s
                           user  1m48.409s   1m51.587s   2m6.720s    2m37.313s   2m42.219s   2m54.870s   3m20.491s   4m5.443s
                           sys   0m2.412s    0m2.663s    0m3.350s    0m5.051s    0m7.094s    0m12.152s   0m17.519s   0m20.036s

v0.91                      real  9m48.969s   2m0.393s    1m3.791s    0m48.806s   0m41.872s   0m42.046s   0m39.679s   0m43.448s
                           user  9m47.835s   9m34.309s   9m33.838s   10m24.816s  11m46.145s  13m41.972s  15m13.436s  15m28.816s
                           sys   0m1.064s    0m1.473s    0m1.318s    0m2.586s    0m3.890s    0m4.922s    0m7.033s    0m25.970s

Current master (cube pruning 400)
                           real  0m17.623s   0m5.720s    0m4.512s    0m4.605s    0m4.791s    0m4.965s    0m5.035s    0m4.940s
                           user  0m14.856s   0m18.203s   0m21.629s   0m26.831s   0m29.115s   0m29.152s   0m27.692s   0m27.632s
                           sys   0m2.780s    0m3.922s    0m6.138s    0m8.377s    0m10.442s   0m12.397s   0m12.446s   0m10.814s

v0.91 (cube pruning 400)
                           real  1m30.621s   0m38.222s   0m41.789s   1m17.326s   1m34.733s   1m49.563s   2m15.697s   2m20.562s
                           user  1m21.425s   2m12.044s   3m19.893s   5m41.552s   6m27.785s   6m25.569s   6m25.127s   5m40.199s
                           sys   0m9.163s    0m9.914s    0m6.334s    0m18.870s   0m46.127s   2m53.603s   5m33.046s   6m59.326s


On 08/10/2015 14:15, Hieu Hoang wrote:
Thanks for all your comments. It looks like we'll keep both multi-process and multi-threading for the time being. There may be uses for both further down the line.

Vito - no-one has written a wrapper to do multi-process, rather than multi-thread, with mosesserver. I would expect the speed gain to be the same.

Mike Ladwig - do you still have the results and/or model files you can share? I just did a comparison between v0.91 and current master. Master is considerably faster in both multi-process and multi-threaded runs. The old code is similarly afflicted by the problem of not scaling above 10-15 threads. I'm surprised that the old code is slower single-threaded, but I'm not surprised about the multi-threading; we haven't traditionally looked at that problem. However, v0.91 does use less memory.
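
For anyone who wants to experiment with multi-process mosesserver in the meantime, the outline would be: start several single-threaded instances on separate ports, then put a dispatcher in front that spreads requests across them. A minimal launch sketch, assuming a mosesserver build that accepts --server-port (ports and instance count are illustrative, and the request-dispatching front-end is left out):

    #!/bin/bash
    # Launch four single-threaded mosesserver instances on ports 8081-8084.
    # A front-end (load balancer or client-side round-robin) still has to
    # spread translation requests across the ports.
    for port in 8081 8082 8083 8084; do
        mosesserver -f moses.ini --server-port $port &
    done
    wait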

On 08/10/2015 09:25, Vito Mandorino wrote:
Hi all,

What about mosesserver? Do you think the same speed gains would occur?

Best,
Vito

2015-10-06 22:39 GMT+02:00 Michael Denkowski <michael.j.denkow...@gmail.com>:

    Hi Hieu and all,

    I just checked in a bug fix for the multi_moses.py script.  I
    forgot to override the number of threads for each moses command,
    so if [threads] was specified in the moses.ini, the multi-moses
    runs were cheating by running a bunch of multi-threaded
    instances.  If threads were only specified on the command line,
    the script was correctly stripping the flag, so everything
    should be good.  I finished a benchmark on my system with an
    unpruned compact PT (with the fixed script) and got the following:

    16 threads 5.38 sent/sec
    16 procs  13.51 sent/sec

    This definitely used a lot more memory though.  Based on some
    very rough estimates looking at free system memory, the memory
    mapped suffix array PT went from 2G to 6G with 16 processes while
    the compact PT went from 3G to 37G.  For cases where everything
    fits into memory, I've seen significant speedup from
    multi-process decoding.

    For cases where things don't fit into memory, the multi-moses
    script could be extended to start as many multi-threaded
    instances as will fit into RAM and farm out sentences in a way
    that keeps all of the CPUs busy. I know Marcin has mentioned
    using GNU parallel for this; a sketch of that route is below.
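
    Roughly, the GNU parallel route might look like this (a sketch, not
    the actual multi_moses.py; the config path, chunk size, and job
    count are illustrative):

        # Split stdin into 100-line chunks (--pipe -N 100), keep four
        # decoder instances busy (-j 4), and reassemble the output in
        # input order (-k).
        parallel --pipe -k -N 100 -j 4 "moses -f moses.ini -threads 4" \
            < input.txt > output.txt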

    Best,
    Michael

    On Tue, Oct 6, 2015 at 4:16 PM, Hieu Hoang <hieuho...@gmail.com>
    wrote:

        I've just run some comparisons between the multi-threaded
        decoder and the multi_moses.py script. It's good stuff.

        It makes me seriously wonder whether we should abandon
        multi-threading and go all out for the multi-process approach.

        There are some advantages to multi-threading - e.g. where model
        files are loaded into memory rather than memory-mapped. But
        there are disadvantages too - it's more difficult to maintain
        and there's about a 10% overhead.

        What do people think?

        Phrase-based:

        Threads                          1          5          10         15         20         25         30
        32: Baseline (Compact pt)  real  4m37.000s  1m15.391s  0m51.217s  0m48.287s  0m50.719s  0m52.027s  0m53.045s
                                   user  4m21.544s  5m28.597s  6m38.227s  8m0.975s   8m21.122s  8m3.195s   8m4.663s
                                   sys   0m15.451s  0m34.669s  0m53.867s  1m10.515s  1m20.746s  1m24.368s  1m23.677s

        34: (32) + multi_moses     real  4m49.474s  1m17.867s  0m43.096s  0m31.999s  0m26.497s  0m26.296s  killed
                                   user  4m33.580s  4m40.486s  4m56.749s  5m6.692s   5m43.845s  7m34.617s  -
                                   sys   0m15.957s  0m32.347s  0m51.016s  1m11.106s  1m44.115s  2m21.263s  -

        38: Baseline (Probing pt)  real  4m46.254s  1m16.637s  0m49.711s  0m48.389s  0m49.144s  0m51.676s  0m52.472s
                                   user  4m30.596s  5m32.500s  6m23.706s  7m40.791s  7m51.946s  7m52.892s  7m53.569s
                                   sys   0m15.624s  0m36.169s  0m49.433s  1m6.812s   1m9.614s   1m13.108s  1m12.644s

        39: (38) + multi_moses     real  4m43.882s  1m17.849s  0m34.245s  0m31.318s  0m28.054s  0m24.120s  0m22.520s
                                   user  4m29.212s  4m47.693s  5m5.750s   5m33.573s  6m18.847s  7m19.642s  8m38.013s
                                   sys   0m15.835s  0m25.398s  0m36.716s  0m41.349s  0m48.494s  1m0.843s   1m13.215s

        Hiero:
        Threads                       1          5          10         15         20         25         30         ?
        3: 6/10 baseline       real  5m33.011s  1m28.935s  0m59.470s  1m0.315s   0m55.619s  0m57.347s  0m59.191s  1m2.786s
                               user  4m53.187s  6m23.521s  8m17.170s  12m48.303s 14m45.954s 17m58.109s 20m22.891s 21m13.605s
                               sys   0m39.696s  0m51.519s  1m3.788s   1m22.125s  1m58.718s  2m51.249s  4m4.807s   4m37.691s

        4: (3) + multi_moses   real  -          1m27.215s  0m40.495s  0m36.206s  0m28.623s  0m26.631s  0m25.817s  0m25.401s
                               user  -          5m4.819s   5m42.070s  5m35.132s  6m46.001s  7m38.151s  9m6.500s   10m32.739s
                               sys   -          0m38.039s  0m45.753s  0m44.117s  0m52.285s  0m56.655s  1m6.749s   1m16.935s

        On 05/10/2015 16:05, Michael Denkowski wrote:
        Hi Philipp,

        Unfortunately I don't have a precise measurement.  If anyone
        knows of a good way to benchmark a process tree with lots of
        memory mapping the same files, I would be glad to run it.
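
        One possibility on Linux is to sum PSS (proportional set size)
        across the tree: /proc/<pid>/smaps charges shared pages
        fractionally to each process that maps them, which avoids
        double-counting the memory-mapped models. A rough sketch (the
        pgrep pattern is illustrative):

            # Total PSS over all running moses processes, in kB.
            for pid in $(pgrep moses); do cat /proc/$pid/smaps; done |
                awk '/^Pss:/ {s += $2} END {printf "total PSS: %d kB\n", s}'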

        --Michael

        On Mon, Oct 5, 2015 at 10:26 AM, Philipp Koehn <p...@jhu.edu>
        wrote:

            Hi,

            great - that will be very useful.

            Since you just ran the comparison - do you have any
            numbers on "still allowed everything to fit into
            memory", i.e., how much more memory is used by running
            parallel instances?

            -phi

            On Mon, Oct 5, 2015 at 10:15 AM, Michael Denkowski
            <michael.j.denkow...@gmail.com> wrote:

                Hi all,

                Like some other Moses users, I noticed diminishing
                returns from running Moses with several threads.  To
                work around this, I added a script to run multiple
                single-threaded instances of moses instead of one
                multi-threaded instance.  In practice, this sped
                things up by about 2.5x for 16 CPUs, and using
                memory-mapped models still allowed everything to fit
                into memory.

                If anyone else is interested in using this, you can
                prefix a moses command with
                scripts/generic/multi_moses.py.  To use multiple
                instances in mert-moses.pl, specify --multi-moses and
                control the number of parallel instances with
                --decoder-flags='-threads N'.
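
                For a plain decoding run, that might look like the
                following (a sketch only - it assumes the script reads
                -threads as the number of single-threaded instances to
                launch, and the file paths are illustrative):

                    scripts/generic/multi_moses.py moses -f moses.ini \
                        -threads 16 < input.txt > output.txt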

                Below is a benchmark on WMT fr-en data (2M training
                sentences, 400M words mono, suffix array PT, compact
                reordering, 5-gram KenLM) testing default stack
                decoding vs cube pruning without and with the
                parallelization script (+multi):

                ---
                1cpu sent/sec
                stack 1.04
                cube 2.10
                ---
                16cpu sent/sec
                stack 7.63
                +multi 12.20
                cube 7.63
                +multi 18.18
                ---

                --Michael


--
Hieu Hoang
http://www.hoang.co.uk/hieu



--
M. Vito MANDORINO - Chief Scientist
The Translation Trustee
1, Place Charles de Gaulle, 78180 Montigny-le-Bretonneux
Tel: +33 1 30 44 04 23  Mobile: +33 6 84 65 68 89
Email: vito.mandor...@linguacustodia.com
Website: www.linguacustodia.com - www.thetranslationtrustee.com




--
Hieu Hoang
http://www.hoang.co.uk/hieu


_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
