okay, the BLEU scores are fine, but from the grammar it looks like you might
have built an English → Russian system instead of the other way around. Piping
English text into the decoder gets me Russian, so that seems to be the
problem. (When you send Russian in, none of the words are in the grammar, so
they get pushed through untranslated.)
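One quick way to check the direction yourself — just a sketch, using a sample rule from your dump as a stand-in for `zcat grammar.gz | head` (field 2 of each rule is the source side):

```shell
# Sketch: check which side of a Joshua grammar rule holds the Russian text.
# Rule format: "[LHS] ||| source ||| target ||| features ||| alignments".
# A sample rule from the grammar dump stands in for the real file here.
rule='[X] ||| "application server " mode ||| режиме "application server " ||| 2.48813 1.75963 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 3-0'
src=$(printf '%s\n' "$rule" | awk -F' \\|\\|\\| ' '{print $2}')
if printf '%s\n' "$src" | grep -q '[^ -~]'; then
  echo "source side looks Russian"     # non-ASCII (Cyrillic) bytes found
else
  echo "source side looks English"     # what an en->ru grammar shows
fi
```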

Don't pack up everything, but if you give me the first 100k lines of lm.gz and 
grammar.gz, that should help confirm.
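Pulling those out is just a zcat/head/gzip pipeline — sketch below (the output filename is illustrative, and a tiny stand-in file is generated so the snippet runs as-is):

```shell
# Sketch: extract the first 100k lines of a gzipped file and re-compress.
# On the real data:  zcat grammar.gz | head -n 100000 | gzip > grammar.first100k.gz
seq 1 500 | gzip > grammar.gz                                # stand-in for the real grammar
zcat grammar.gz | head -n 100 | gzip > grammar.first100k.gz  # 100 stands in for 100000
zcat grammar.first100k.gz | wc -l                            # prints 100
```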

matt


> On Nov 2, 2016, at 11:35 PM, lewis john mcgibbney <lewi...@apache.org> wrote:
> 
> Hi Matt,
> OK, seeing as I get digest emails and the next batch has not arrived, I'll
> just reply to the thread anyway.
> Glad to see that the file downloaded fine.
> 
> You stated "Something is obviously wrong. What BLEU scores did you get on
> your tuning and testing sets? Can you give me the first ten lines of
> grammar.gz and lm.gz? Impossible to do sanity checks without the raw
> grammars and LMs."
> 
> OK so
> 1) Yes something is wrong, there is no doubt about that!
> 2) Regarding BLEU scores, I can't find an individual 'bleu' file for tuning;
> all I could find was the file at $rundir/tune/mert.log, which states
> 
> ----------------------------------------------------
> Z-MERT run ended @ Thu Oct 27 15:07:36 PDT 2016
> ----------------------------------------------------
> 
> FINAL lambda: {0.32811579836756727, 0.13451331312861647, 3.854583349699589,
> 1.9529831007694043, -0.14803956975598645, 0.9677828073898326,
> 1.9314652655618239, 0.04035458882374297, -6.304458466023295, 1.0,
> -3.9877496903616656, 0.8720746273758616} (BLEU: 0.5627727651628539)
> Warning: after normalization, lambda[12]=0.8721 is outside its critical
> value range.
> 
> With regard to testing, I was able to find the file at $rundir/test/bleu,
> which contains
> 
> Processing 5000 sentences...
> Evaluating candidate translations in plain file /usr/local/joshua_resources/russian_experiments/exp4/test/output...
> BLEU_precision(1) = 94856 / 144986 = 0.6542
> BLEU_precision(2) = 77801 / 139986 = 0.5558
> BLEU_precision(3) = 68491 / 134986 = 0.5074
> BLEU_precision(4) = 60862 / 129986 = 0.4682
> BLEU_precision = 0.5421
> 
> Length of candidate corpus = 144986
> Effective length of reference corpus = 143814
> BLEU_BP = 1.0000
> 
>  => BLEU = 0.5421
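(For what it's worth, that score is at least internally consistent — recombining your reported precisions and brevity penalty reproduces it. A quick sanity-check sketch:

```shell
# Sanity check: corpus BLEU is the brevity penalty times the geometric mean
# of the four n-gram precisions reported above.
awk 'BEGIN {
  p1 = 0.6542; p2 = 0.5558; p3 = 0.5074; p4 = 0.4682  # BLEU_precision(1..4)
  bp = 1.0                                            # BLEU_BP
  printf "BLEU = %.4f\n", bp * exp((log(p1) + log(p2) + log(p3) + log(p4)) / 4)
}'
# prints: BLEU = 0.5421
```

So the scoring itself is fine; the oddity is that 0.54 is implausibly high for this task, which points at the data or the direction rather than the evaluator.)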
> 
> 3) Regarding the first ten lines of grammar.gz and lm.gz, here they are
> below
> 
> First, grammar.gz is a binary file; however, if I unzip it and take the top
> 10 lines, I get the following
> 
> lmcgibbn@LMC-056430 ~/Desktop $ head -10 grammar
> [X] ||| "application ||| "application ||| 0 0.69315 1 1.00000 0 0.69315 ||| 0-0
> [X] ||| "application server ||| "application server ||| 1.42471 0.74547 1 1.00000 0 0.69315 ||| 0-0 1-1
> [X] ||| "application server " ||| "application server " ||| 1.49321 1.08115 1 1.00000 0 0.69315 ||| 0-0 1-1 2-2
> [X] ||| "application server " mode ||| режиме "application server " ||| 2.48813 1.75963 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 3-0
> [X] ||| "application server " mode . ||| режиме "application server " . ||| 2.50691 1.77390 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 3-0 4-4
> [X] ||| "application server " mode [X,1] ||| режиме "application server " [X,1] ||| 2.48813 1.75963 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 3-0
> [X] ||| "application server " [X,1] ||| [X,1] "application server " ||| 1.49321 1.08115 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3
> [X] ||| "application server " [X,1] . ||| [X,1] "application server " . ||| 1.51199 1.09541 1 1.00000 0 0.69315 ||| 0-1 1-2 2-3 4-4
> [X] ||| "application server [X,1] ||| "application server [X,1] ||| 1.42471 0.74547 1 1.00000 0 0.69315 ||| 0-0 1-1
> [X] ||| "application server [X,1] mode ||| режиме "application server [X,1] ||| 2.41964 1.42395 1 1.00000 0 0.69315 ||| 0-1 1-2 3-0
> 
> If I do the same with lm.gz I get the following
> 
> lmcgibbn@LMC-056430 ~/Desktop $ head -10 lm
> # Input file: fd 3
> # Token count: 17545420
> # Smoothing: Modified Kneser-Ney
> \data\
> ngram 1=632340
> ngram 2=5054267
> ngram 3=10057059
> ngram 4=12250396
> ngram 5=12726506
> 
> @Matt, if you want, I can make absolutely everything I have available to you
> as a tar.gz if it will aid in debugging what is going on.
> Thanks
> Lewis
> 
> 
> 
> 
> On Wed, Nov 2, 2016 at 11:19 AM, lewis john mcgibbney <lewi...@apache.org>
> wrote:
> 
>> Hi Matt,
>> Thanks for looking into this.
>> 
>> On Wed, Nov 2, 2016 at 10:49 AM, <dev-digest-help@joshua.incubator.apache.org> wrote:
>> 
>>> From: Matt Post <p...@cs.jhu.edu>
>>> To: dev@joshua.incubator.apache.org
>>> Cc:
>>> Date: Tue, 1 Nov 2016 16:26:29 -0400
>>> Subject: Re: Community Review of New Language Pack
>>> Lewis, can I get an MD5 or SHA1 checksum? I'm getting errors unpacking.
>>> 
>> 
>> Yes, please see
>> http://home.apache.org/~lewismc/language-pack-ru-en-2016-10-28.tar.gz.md5
>> 
>> 
>>> 
>>> I do see that you built the LP with the old scripts. I'll write up
>>> instructions on how to do it with the new set.
>>> 
>>> 
>> Correct. I would greatly appreciate that thank you Matt.
>> Lewis
>> 
> 
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
