Hi Matt,
Again, the batch digest email hasn't come through yet.
You're absolutely right... having run the command so many times, I made a
trivial but very costly mistake: I switched the source and target languages,
from --source ru --target en (which is what it should have been) to
--source en --target ru (which is hopelessly incorrect for my particular
purpose).
As the famous saying goes... "Sometimes you just can't see the wood for the
trees!"
Well... I've kicked off a new pipeline and it now looks like the community
will have two new language packs to play with.
Thanks for all your help in debugging this.
Lewis

On Wed, Nov 2, 2016 at 8:35 PM, lewis john mcgibbney <lewi...@apache.org>
wrote:

> Hi Matt,
> OK, seeing as I get digest emails and the next batch has not arrived, I'll
> just reply to the thread anyway.
> Glad to see that the file downloaded fine.
>
> You stated "Something is obviously wrong. What BLEU scores did you get on
> your tuning and testing sets? Can you give me the first ten lines of
> grammar.gz and lm.gz? Impossible to do sanity checks without the raw
> grammars and LMs."
>
> OK so
> 1) Yes something is wrong, there is no doubt about that!
> 2) Regarding BLEU scores, I can't find an individual 'bleu' file for
> tuning; all I could find was the file at $rundir/tune/mert.log, which states
>
> ----------------------------------------------------
> Z-MERT run ended @ Thu Oct 27 15:07:36 PDT 2016
> ----------------------------------------------------
>
> FINAL lambda: {0.32811579836756727, 0.13451331312861647,
> 3.854583349699589, 1.9529831007694043, -0.14803956975598645,
> 0.9677828073898326, 1.9314652655618239, 0.04035458882374297,
> -6.304458466023295, 1.0, -3.9877496903616656, 0.8720746273758616} (BLEU:
> 0.5627727651628539)
> Warning: after normalization, lambda[12]=0.8721 is outside its critical
> value range.
>
> With regards to testing, I was able to find the file at $rundir/test/bleu
> which contains
>
> Processing 5000 sentences...
> Evaluating candidate translations in plain file
> /usr/local/joshua_resources/russian_experiments/exp4/test/output...
> BLEU_precision(1) = 94856 / 144986 = 0.6542
> BLEU_precision(2) = 77801 / 139986 = 0.5558
> BLEU_precision(3) = 68491 / 134986 = 0.5074
> BLEU_precision(4) = 60862 / 129986 = 0.4682
> BLEU_precision = 0.5421
>
> Length of candidate corpus = 144986
> Effective length of reference corpus = 143814
> BLEU_BP = 1.0000
>
>   => BLEU = 0.5421
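As a sanity check on those numbers (my own arithmetic, not Joshua's scorer), BLEU should equal the brevity penalty times the geometric mean of the four n-gram precisions, and the figures above are self-consistent:

```python
import math

# n-gram precisions reported above
precisions = [94856 / 144986, 77801 / 139986, 68491 / 134986, 60862 / 129986]

# Brevity penalty: 1.0 because the candidate corpus (144986) is longer
# than the effective reference length (143814)
c, r = 144986, 143814
bp = 1.0 if c > r else math.exp(1 - r / c)

# BLEU = BP * geometric mean of the n-gram precisions
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(round(bleu, 4))  # 0.5421, matching the reported score
```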
>
> 3) Regarding the first ten lines of grammar.gz and lm.gz, here they are
> below
>
> First, grammar.gz is a gzip-compressed file; if I unzip it and then take
> the top 10 lines, I get the following
>
> lmcgibbn@LMC-056430 ~/Desktop $ head -10 grammar
> [X] ||| "application ||| "application ||| 0 0.69315 1 1.00000 0 0.69315
> ||| 0-0
> [X] ||| "application server ||| "application server ||| 1.42471 0.74547 1
> 1.00000 0 0.69315 ||| 0-0 1-1
> [X] ||| "application server " ||| "application server " ||| 1.49321
> 1.08115 1 1.00000 0 0.69315 ||| 0-0 1-1 2-2
> [X] ||| "application server " mode ||| режиме "application server " |||
> 2.48813 1.75963 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 3-0
> [X] ||| "application server " mode . ||| режиме "application server " .
> ||| 2.50691 1.77390 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 3-0 4-4
> [X] ||| "application server " mode [X,1] ||| режиме "application server "
> [X,1] ||| 2.48813 1.75963 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 3-0
> [X] ||| "application server " [X,1] ||| [X,1] "application server " |||
> 1.49321 1.08115 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3
> [X] ||| "application server " [X,1] . ||| [X,1] "application server " .
> ||| 1.51199 1.09541 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 4-4
> [X] ||| "application server [X,1] ||| "application server [X,1] |||
> 1.42471 0.74547 1 1.00000 0 0.69315 ||| 0-0 1-1
> [X] ||| "application server [X,1] mode ||| режиме "application server
> [X,1] ||| 2.41964 1.42395 1 1.00000 0 0.69315 ||| 0-1 1-2 3-0
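For anyone skimming, each grammar line is a Hiero-style rule with fields separated by " ||| ": left-hand side, source phrase, target phrase, feature scores, and word alignments. A rough parse of the first rule above (my own sketch, not Joshua's grammar reader):

```python
rule = '[X] ||| "application ||| "application ||| 0 0.69315 1 1.00000 0 0.69315 ||| 0-0'

# Split the rule into its five ' ||| '-delimited fields
lhs, source, target, scores, alignment = rule.split(" ||| ")

# Feature scores are whitespace-separated floats
features = [float(x) for x in scores.split()]

# Alignments are "src-tgt" word-index pairs
alignments = [tuple(map(int, pair.split("-"))) for pair in alignment.split()]

print(lhs)         # [X]
print(features)    # [0.0, 0.69315, 1.0, 1.0, 0.0, 0.69315]
print(alignments)  # [(0, 0)]
```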
>
> If I do the same with lm.gz I get the following
>
> lmcgibbn@LMC-056430 ~/Desktop $ head -10 lm
> # Input file: fd 3
> # Token count: 17545420
> # Smoothing: Modified Kneser-Ney
> \data\
> ngram 1=632340
> ngram 2=5054267
> ngram 3=10057059
> ngram 4=12250396
> ngram 5=12726506
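That header is the standard ARPA \data\ section, listing how many n-grams of each order the model contains. A quick way to total them up (a sketch over the header text above, not Joshua's LM loader):

```python
# The "ngram N=count" lines from the ARPA \data\ section above
header = """\
ngram 1=632340
ngram 2=5054267
ngram 3=10057059
ngram 4=12250396
ngram 5=12726506"""

# Map n-gram order -> count
counts = {}
for line in header.splitlines():
    order, count = line.removeprefix("ngram ").split("=")
    counts[int(order)] = int(count)

print(sum(counts.values()))  # total n-grams across all orders
```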
>
> @Matt, if you want, I can make absolutely everything I have available to
> you as a tar.gz if it will aid in debugging what is going on.
> Thanks
> Lewis
>
>
>
>
> On Wed, Nov 2, 2016 at 11:19 AM, lewis john mcgibbney <lewi...@apache.org>
> wrote:
>
>> Hi Matt,
>> Thanks for looking into this.
>>
>> On Wed, Nov 2, 2016 at 10:49 AM, <dev-digest-help@joshua.incubator.apache.org>
>> wrote:
>>
>>> From: Matt Post <p...@cs.jhu.edu>
>>> To: dev@joshua.incubator.apache.org
>>> Cc:
>>> Date: Tue, 1 Nov 2016 16:26:29 -0400
>>> Subject: Re: Community Review of New Language Pack
>>> Lewis, can I get an MD5 or SHA1 checksum? I'm getting errors unpacking.
>>>
>>
>> Yes, please see
>> http://home.apache.org/~lewismc/language-pack-ru-en-2016-10-28.tar.gz.md5
>>
>>
>>>
>>> I do see that you built the LP with the old scripts. I'll write up
>>> instructions on how to do it with the new set.
>>>
>>>
>> Correct. I would greatly appreciate that, thank you Matt.
>> Lewis
>>
>
>
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>



-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
