Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-24 Thread Matthias Huck
Hi Vincent,

On Thu, 2015-09-24 at 22:37 +0200, Vincent Nguyen wrote:
> Thanks, Matthias, for the detailed explanation.
> I think I follow most of it, except that I don't really understand how
> this one works:
> 
> "Difficult sentences generally have worse model score than easy ones but
> may still be useful for training."

Well, your data selection method may discard training instances that are
somehow hard to decode, e.g. because of complex sentence structure or
because of rare vocabulary. But that doesn't necessarily mean that it's
bad sentence pairs that you're removing. You should manually inspect
some samples if possible.
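
A minimal sketch of the manual inspection suggested above, assuming you hold the sentence pairs and their decoder scores in two parallel Python lists (the function name, the threshold, and the sample size are hypothetical, not part of Moses):

```python
import random


def sample_discarded(pairs, scores, threshold, k=20, seed=0):
    """Return up to k random (source, target, score) triples below threshold.

    pairs  -- list of (source, target) sentence pairs
    scores -- one score per pair, higher meaning "better" (an assumption)
    """
    discarded = [
        (src, tgt, score)
        for (src, tgt), score in zip(pairs, scores)
        if score < threshold
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(discarded, min(k, len(discarded)))


if __name__ == "__main__":
    pairs = [("Merci !", "Thanks !"), ("a b c", "x y z"), ("d e", "u v")]
    scores = [0.9, 0.1, 0.15]
    for src, tgt, score in sample_discarded(pairs, scores, threshold=0.2):
        print(f"{score:.2f}\t{src}\t{tgt}")
```

Reading a few dozen discarded pairs this way is usually enough to see whether the threshold removes noise or merely removes hard sentences.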

I didn't try, but I suspect that you'd get a higher decoder score on the
1-best decoder output of the first of the following two input sentences:

(1) " Merci ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! "
(2) " Je l' ai vécu moi-même en personne quand j' ai eu mon diplôme à Barnard 
College en 2002 . "

(Just as a simple made-up example.)

If we assume that you have a correct English target sentence for both of
those sentences in your training data, I wonder which of the two you
could learn more from?

If you're doing what I think, then you're also basically just assessing
whether the source side of the sentence pair is easy to translate. Does
this tell you anything about the target sentence? The target side might
be misaligned or in a different third language if your data is noisy.

Cheers,
Matthias



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-24 Thread Vincent Nguyen
Thanks, Matthias, for the detailed explanation.
I think I follow most of it, except that I don't really understand how
this one works:

"Difficult sentences generally have worse model score than easy ones but
may still be useful for training."


But yes, what you describe is more or less what I did, in order to better
understand the mechanism.
And I know I have to tune on in-domain data to get a proper end result.

Cheers,
Vincent



Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-24 Thread Matthias Huck
Hi Vincent,

This is a different topic, and I'm not completely clear about what
exactly you did here. Did you decode the source side of the parallel
training data, conduct sentence selection by applying a threshold on the
decoder score, and extract a new phrase table from the selected fraction
of the original parallel training data? If this is the case, I have some
comments:


- Be careful when you translate training data. The system knows these
sentences and does things like frequently applying long singleton
phrases that have been extracted from the very same sentence.
https://aclweb.org/anthology/P/P10/P10-1049.pdf

- Longer sentences may have worse model score than shorter sentences.
Consider normalizing by sentence length if you use model score for data
selection.

- Difficult sentences generally have worse model score than easy ones but
may still be useful for training. With score-based selection you may end
up keeping only the parts of the data that are easy to translate or are
highly redundant in the corpus.

- You probably see no out-of-vocabulary words (OOVs) when translating
training data, or very few of them (depending on word alignment, phrase
extraction method, and phrase table pruning), but be aware that if there
are OOVs, this may affect the model score a lot.

- Check to what extent the sentence selection reduces the vocabulary of
your system.
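
The two checks in the list above (length normalization and vocabulary reduction) can be sketched as follows. This is only an illustration: it assumes the model score is a log-domain value where higher is better, that tokens are whitespace-separated, and all function names are hypothetical.

```python
def length_normalized(score, source):
    """Divide a sentence-level score by the number of source tokens."""
    n_tokens = max(1, len(source.split()))
    return score / n_tokens


def select(pairs, scores, threshold):
    """Keep the pairs whose per-token score is at least threshold."""
    return [
        pair
        for pair, score in zip(pairs, scores)
        if length_normalized(score, pair[0]) >= threshold
    ]


def vocabulary(pairs, side=0):
    """Set of token types on one side (0 = source, 1 = target)."""
    return {tok for pair in pairs for tok in pair[side].split()}


def vocabulary_loss(all_pairs, kept_pairs, side=0):
    """Token types that disappear from the corpus after selection."""
    return vocabulary(all_pairs, side) - vocabulary(kept_pairs, side)


if __name__ == "__main__":
    pairs = [
        ("Merci !", "Thanks !"),
        ("Je l' ai vécu moi-même .", "I experienced it myself ."),
    ]
    scores = [-2.0, -12.0]  # made-up log scores
    kept = select(pairs, scores, threshold=-1.5)
    lost = vocabulary_loss(pairs, kept)
    print(len(kept), "pairs kept;", len(lost), "source types lost")
```

If the set of lost types is large, the selected corpus will produce more OOVs at test time, whatever the average score says.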


Last but not least, two more general comments:

- You need dev and test sets that are similar to the type of real-world
documents that you're building your system for. Don't tune on Europarl
if you eventually want to translate pharmaceutical patents, for
instance. Try to collect in-domain training data as well.

- In case you have in-domain and out-of-domain training corpora, you can
try modified Moore-Lewis filtering for data selection. 
https://aclweb.org/anthology/D/D11/D11-1033.pdf
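
For orientation, the cross-entropy difference at the heart of modified Moore-Lewis filtering can be sketched as below. This is a toy illustration under strong assumptions: it uses add-one smoothed unigram models instead of the n-gram language models used in practice, and the class and function names are hypothetical. A lower score means the pair looks more like the in-domain data.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model (a stand-in for a real n-gram LM)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # one extra slot for unknown tokens

    def cross_entropy(self, sentence):
        """Per-token negative log-probability of the sentence."""
        tokens = sentence.split()
        if not tokens:
            return 0.0
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return -log_prob / len(tokens)


def mml_score(src, tgt, in_src, out_src, in_tgt, out_tgt):
    """Sum of in-domain minus out-of-domain cross-entropy on both sides."""
    return (
        in_src.cross_entropy(src) - out_src.cross_entropy(src)
        + in_tgt.cross_entropy(tgt) - out_tgt.cross_entropy(tgt)
    )


if __name__ == "__main__":
    in_fr, out_fr = ["le brevet pharmaceutique"], ["le parlement européen"]
    in_en, out_en = ["the pharmaceutical patent"], ["the european parliament"]
    vocab_fr = {t for s in in_fr + out_fr for t in s.split()}
    vocab_en = {t for s in in_en + out_en for t in s.split()}
    lms = (
        UnigramLM(in_fr, vocab_fr), UnigramLM(out_fr, vocab_fr),
        UnigramLM(in_en, vocab_en), UnigramLM(out_en, vocab_en),
    )
    print(mml_score("le brevet", "the patent", *lms))
    print(mml_score("le parlement", "the parliament", *lms))
```

Pairs are then ranked by this score and the best-scoring fraction is kept; how large that fraction should be is something to tune on the dev set.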


Cheers,
Matthias


On Thu, 2015-09-24 at 18:19 +0200, Vincent Nguyen wrote:
> This is an interesting subject.
> 
> As a matter of fact, I have done several tests.
> I arrived at this need after realizing that even though my results were
> good in a "standard dev + test set" situation,
> I had some strange results with real-world documents.
> That's why I investigated.
> 
> But you are right: removing some so-called bad entries could have
> unexpected results.
> 
> For instance, here is a test I did:
> 
> I trained a fr-en model on Europarl v7 (2 million sentences).
> I tuned on a subset of 3K sentences.
> I ran an evaluation on the full 2 million lines,
> then removed the 90K sentences for which the score was less than 0.2
> and retrained on 1917853 sentences.
> 
> In the end I got more sentences (in %) with a score above 0.2,
> but at > 0.3 the two systems become similar, and at > 0.4 the initial
> corpus is better.
> 
> Just weird.





Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-24 Thread Vincent Nguyen
> On 23/09/2015 15:12, Tom Hoar wrote:
> > Vincent,
> >
> > If you suspect bad entries, isn't it better to address the root of the
> > problem and prepare your training corpus better?
> >
> > On 9/23/2015 6:46 PM, moses-support-requ...@mit.edu wrote:
> > > Date: Tue, 22 Sep 2015 20:24:02 +0200
> > > From: Philipp Koehn
> > > Subject: Re: [Moses-support] is there a way to remove a bad entry in
> > > the phrase table ?
> > > To: Vincent Nguyen
> > > Cc: moses-support
> > >
> > > Hi,
> > >
> > > you can remove it manually (just edit the text file), there will be no
> > > negative consequences.
> > >
> > > However, it is not a realistic strategy to try to remove by hand every
> > > offending phrase table entry.
> > >
> > > -phi
> > >
> > > On Tue, Sep 22, 2015 at 4:05 PM, Vincent Nguyen wrote:
> > > > Hi,
> > > >
> > > > I was wondering: if, after an analysis of the BLEU-Annotation file,
> > > > we realize that there must be a bad entry in the phrase table,
> > > > could we remove it manually or in some other way?
> > > >
> > > > Thanks.
> > > > V.
> >
> > --
> > Best regards,
> >
> > Tom Hoar
> > Chief Executive Officer
> > Precision Translation Tools Pte Ltd
> > Singapore/Thailand
> > Web: http://www.precisiontranslationtools.com
> > Thailand Mobile: +66 87 345-1875
> > Skype: tahoar


Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-24 Thread Vincent Nguyen
||| 1-1 ||| 393 26 1 ||| |||
" 1 ||| one ||| 1.32368e-05 5.22671e-06 0.0391025 0.0141179 ||| 1-0 ||| 76806 26 2 ||| |||
" 1,1 % ||| 1.1 % ||| 0.0022504 0.00241746 0.103519 0.875731 ||| 1-0 2-1 ||| 46 1 1 ||| |||
" 1,1 milliard d' euros ||| EUR 1.1 billion ||| 0.00544835 6.98053e-05 0.0517593 0.110019 ||| 3-0 4-0 1-1 2-1 2-2 ||| 19 2 1 ||| |||
" 1,1 milliard d' euros ||| by EUR 1.1 billion ||| 0.0345062 6.98053e-05 0.0517593 0.000791519 ||| 3-1 4-1 1-2 2-2 2-3 ||| 3 2 1 ||| |||



On 24/09/2015 09:54, Felipe Sánchez Martínez wrote:

Hi,

This is quite common. If you look at the scores, they are pretty low for
the entries that do not make sense, so even though those entries are in
the phrase table, they will most probably never be used for translation.
I would not bother.


Cheers
--
Felipe
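
To make the two suggestions in this thread concrete (deleting an entry by editing the text file, and looking at the scores first), here is a minimal sketch that works on the plain-text phrase table, one entry per line with fields separated by `|||`. The function names and the threshold are hypothetical, and flagging an entry because one of its scores is tiny is only a heuristic for finding candidates to inspect, not a Moses feature.

```python
def parse_entry(line):
    """Split one phrase-table line into (source, target, scores)."""
    fields = [f.strip() for f in line.rstrip("\n").split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(x) for x in fields[2].split()]
    return source, target, scores


def drop_entries(lines, bad_pairs):
    """Yield every line whose (source, target) pair is not in bad_pairs."""
    for line in lines:
        source, target, _ = parse_entry(line)
        if (source, target) not in bad_pairs:
            yield line


def suspicious(lines, min_score=1e-4):
    """List (source, target) pairs with at least one score below min_score."""
    flagged = []
    for line in lines:
        source, target, scores = parse_entry(line)
        if any(s < min_score for s in scores):
            flagged.append((source, target))
    return flagged


if __name__ == "__main__":
    table = [
        "! ! ||| ! ! ||| 0.625 0.363573 0.769231 0.633844 ||| 0-0 1-1 ||| 16 13 10 ||| |||",
        "! ! ||| . ||| 4.65922e-08 6.71089e-07 0.00795487 0.140779 ||| 0-0 1-0 ||| 2.21954e+06 13 1 ||| |||",
    ]
    print(suspicious(table))
    for line in drop_entries(table, {("! !", ".")}):
        print(line)
```

If the table is stored compressed or in a binarised form, it would have to be unpacked before and rebuilt after such an edit.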

On 23/09/15 at 16:50, Vincent Nguyen wrote:

I agree and would like to.
But this is tricky: look at the first 30 lines of my phrase table below.

This happens a lot in the first lines of the table, where there are &apos
or other odd codes: the EN/FR pairs do not match.




! ! ! ! ||| ! ! ! ! ||| 0.103413 0.132185 0.103413 0.401758 ||| 0-0 1-1 2-2 3-3 ||| 1 1 1 ||| |||
! ! ! ) ||| ! ! ! ) ||| 0.339323 0.167884 0.508985 0.4246 ||| 0-0 1-0 2-0 2-1 2-2 3-3 ||| 3 2 2 ||| |||
! ! ! ||| ! ! ! ||| 0.501834 0.219223 0.716905 0.50463 ||| 0-0 1-1 2-2 ||| 10 7 6 ||| |||
! ! ! ||| budget ! ! ! ||| 0.0517067 0.219223 0.0147733 4.50635e-05 ||| 0-1 1-2 2-3 ||| 2 7 1 ||| |||
! ! ) , ||| ! ! ) - , ||| 0.103413 0.111989 0.103413 0.00192967 ||| 0-0 1-1 2-2 3-3 3-4 ||| 1 1 1 ||| |||
! ! ) ||| ! ! ) ||| 0.103413 0.278429 0.103413 0.533321 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||
! ! ||| ! ! ||| 0.625 0.363573 0.769231 0.633844 ||| 0-0 1-1 ||| 16 13 10 ||| |||
! ! ||| . ||| 4.65922e-08 6.71089e-07 0.00795487 0.140779 ||| 0-0 1-0 ||| 2.21954e+06 13 1 ||| |||
! ! ||| budget ! ! ||| 0.0517067 0.363573 0.00795487 5.66022e-05 ||| 0-1 1-2 ||| 2 13 1 ||| |||
! ! ||| nécessaire ! ! ||| 0.103413 0.363573 0.00795487 0.000130572 ||| 0-1 1-2 ||| 1 13 1 ||| |||
! [ never again ! ||| ! ||| 6.51628e-06 5.42074e-13 0.103413 0.796143 ||| 0-0 4-0 ||| 15870 1 1 ||| |||
! ] this is ||| tel est ||| 7.38667e-05 9.16191e-11 0.103413 0.00147917 ||| 2-0 3-1 ||| 1400 1 1 ||| |||
! ] this ||| tel ||| 1.09594e-05 1.44188e-10 0.103413 0.0035893 ||| 2-0 ||| 9436 1 1 ||| |||
! ] ||| ! ] ||| 0.103413 0.352335 0.103413 0.472387 ||| 0-0 1-1 ||| 1 1 1 ||| |||
! & quot ; ||| ! " . et ||| 0.0517067 2.36396e-12 0.0517067 1.88268e-05 ||| 0-0 1-1 2-1 3-3 ||| 2 2 1 ||| |||
! & quot ; ||| ! " ||| 0.000222394 1.44515e-11 0.0517067 0.518419 ||| 0-0 2-1 ||| 465 2 1 ||| |||
! & quot ||| ! " . ||| 0.000662906 8.30626e-09 0.0344711 0.00232791 ||| 0-0 1-1 2-1 ||| 156 3 1 ||| |||
! & quot ||| ! " ||| 0.00218918 8.30626e-09 0.339323 0.518419 ||| 0-0 2-1 ||| 465 3 2 ||| |||
! & ||| ! ||| 6.51628e-06 7.21755e-05 0.103413 0.796143 ||| 0-0 ||| 15870 1 1 ||| |||
! ' ] , addressed ||| ! " adressé ||| 0.103413 3.70838e-07 0.103413 0.00596848 ||| 0-0 1-1 2-1 4-2 ||| 1 1 1 ||| |||
! ' ] , ||| ! " ||| 0.000222394 2.49698e-06 0.103413 0.215573 ||| 0-0 1-1 2-1 ||| 465 1 1 ||| |||
! ' ] ||| ! " ||| 0.000222394 3.57128e-05 0.103413 0.215573 ||| 0-0 1-1 2-1 ||| 465 1 1 ||| |||
! ' ' Alstom shares ||| l' on constate un dysfonctionnement ||| 0.0344711 5.62605e-16 0.103413 1.03361e-14 ||| 1-0 2-0 1-1 3-4 4-4 ||| 3 1 1 ||| |||
! ' ' ||| l' on constate un ||| 0.0147733 1.56906e-11 0.0129267 2.2766e-12 ||| 1-0 2-0 1-1 ||| 7 8 1 ||| |||
! ' ' ||| l' on constate ||| 0.000984889 1.56906e-11 0.0129267 2.36929e-10 ||| 1-0 2-0 1-1 ||| 105 8 1 ||| |||
! ' ' ||| l' on ||| 6.76656e-06 1.56906e-11 0.0129267 6.18613e-06 ||| 1-0 2-0 1-1 ||| 15283 8 1 ||| |||
! ' ' ||| ou que l' on constate ||| 0.0344711 1.56906e-11 0.0129267 4.69534e-15 ||| 1-2 2-2 1-3 ||| 3 8 1 ||| |||
! ' ' ||| ou que l' on ||| 0.00304157 1.56906e-11 0.0129267 1.22594e-10 ||| 1-2 2-2 1-3 ||| 34 8 1 ||| |||
! ' ' ||| que l' on constate un ||| 0.0344711 1.56906e-11 0.0129267 4.56092e-14 ||| 1-1 2-1 1-2 ||| 3 8 1 ||| |||
! ' ' ||| que l' on constate ||| 0.00323167 1.56906e-11 0.0129267 4.74661e-12 ||| 1-1 2-1 1-2 ||| 32 8 1 ||| |||



On 23/09/2015 15:12, Tom Hoar wrote:

Vincent,

If you suspect bad entries, isn't it better to address the root of the
problem and prepare your training corpus better?


On 9/23/2015 6:46 PM, moses-support-requ...@mit.edu wrote:

Date: Tue, 22 Sep 2015 20:24:02 +0200
From: Philipp Koehn
Subject: Re: [Moses-support] is there a way to remove a bad entry in
the phrase table ?
To: Vincent Nguyen
Cc: moses-support

Hi,

you can remove it manually (just edit the text file), there will be no
negative consequences.

However, it is not a realistic strategy to try to remove by hand every
offending phrase

Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-24 Thread Hieu Hoang
||| 0.103519 0.25 0.00398148 5.61e-05 ||| 0-0 1-0 ||| 1 26 1 ||| |||
> " 1 ||| " 1 ||| 0.503492 0.361595 0.11619 0.187815 ||| 0-0 1-1 ||| 6 26 4 ||| |||
> " 1 ||| 1 ||| 0.0010136 0.00278649 0.461538 0.805151 ||| 1-0 ||| 11839 26 12 ||| |||
> *" 1 ||| One Million Roofs ||| 0.103519 0.00213892 0.00398148 3.32314e-15 ||| 0-0 1-0 0-1 0-2 ||| 1 26 1 ||| |||*
> " 1 ||| hardly 1 ||| 0.0258796 0.00278649 0.00398148 1.73108e-05 ||| 1-1 ||| 4 26 1 ||| |||
> " 1 ||| million solar ||| 0.0345062 3.55949e-06 0.00398148 3.29783e-09 ||| 1-0 ||| 3 26 1 ||| |||
> " 1 ||| million ||| 5.83433e-06 3.55949e-06 0.00398148 0.0019399 ||| 1-0 ||| 17743 26 1 ||| |||
> " 1 ||| of 1 ||| 0.000263406 0.00278649 0.00398148 0.0270917 ||| 1-1 ||| 393 26 1 ||| |||
> " 1 ||| one ||| 1.32368e-05 5.22671e-06 0.0391025 0.0141179 ||| 1-0 ||| 76806 26 2 ||| |||
> " 1,1 % ||| 1.1 % ||| 0.0022504 0.00241746 0.103519 0.875731 ||| 1-0 2-1 ||| 46 1 1 ||| |||
> " 1,1 milliard d' euros ||| EUR 1.1 billion ||| 0.00544835 6.98053e-05 0.0517593 0.110019 ||| 3-0 4-0 1-1 2-1 2-2 ||| 19 2 1 ||| |||
> " 1,1 milliard d' euros ||| by EUR 1.1 billion ||| 0.0345062 6.98053e-05 0.0517593 0.000791519 ||| 3-1 4-1 1-2 2-2 2-3 ||| 3 2 1 ||| |||
>
>
>
> On 24/09/2015 09:54, Felipe Sánchez Martínez wrote:
>
> Hi,
>
> This is quite common. If you look at the scores, they are pretty low when
> they do not make sense, so, even though they are in the phrase table, most
> probably they will never be used for translation. I would not bother.
>
> Cheers
> --
> Felipe
>
> On 23/09/15 at 16:50, Vincent Nguyen wrote:
>
> I agree and would like to.
> But this is tricky, look at the first 30 lines of my phrase table below.
>
> and this happens a lot in the first line of tables where there are &apos
> or weird codes, EN/FR pairs do not match.

Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-24 Thread Felipe Sánchez Martínez
Hi,

This is quite common. If you look at the scores, they are pretty low 
when they do not make sense, so, even though they are in the phrase 
table, most probably they will never be used for translation. I would 
not bother.

Cheers
--
Felipe
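
To make the point about low scores concrete, here is a minimal sketch that compares two entries from the table quoted in this thread by summing the logs of their four scores. The helper names are hypothetical, and the uniform weighting is an assumption made for illustration: the real decoder combines these features with tuned weights and with other models (language model, reordering, and so on), so this only shows the order of magnitude involved.

```python
import math

def parse_scores(line):
    """Return the list of feature scores from one phrase-table line.

    Assumes the usual text layout: source ||| target ||| scores ||| ...
    """
    fields = [f.strip() for f in line.split("|||")]
    return [float(s) for s in fields[2].split()]

def log_score(line):
    """Unweighted sum of log scores (an illustrative simplification)."""
    return sum(math.log(s) for s in parse_scores(line))

good = "! ! ||| ! ! ||| 0.625 0.363573 0.769231 0.633844 ||| 0-0 1-1 ||| 16 13 10 ||| |||"
bad = ("! ' ' ||| l' on ||| 6.76656e-06 1.56906e-11 0.0129267 6.18613e-06 "
       "||| 1-0 2-0 1-1 ||| 15283 8 1 ||| |||")

print(round(log_score(good), 2))
print(round(log_score(bad), 2))
```

The sensible pair scores close to zero in log space while the mismatched pair is dozens of log units lower, which is why such an entry would rarely win against any reasonable competitor.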

On 23/09/15 at 16:50, Vincent Nguyen wrote:
> I agree and would like to.
> But this is tricky, look at the first 30 lines of my phrase table below.
>
> and this happens a lot in the first line of tables where there are &apos
> or weird codes, EN/FR pairs do not match.
>

Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-23 Thread Vincent Nguyen

I agree and would like to.
But this is tricky: look at the first 30 lines of my phrase table below.

This happens a lot in the first lines of the table: wherever there is an &apos
or some other odd code, the EN/FR pairs do not match.





! ! ! ! ||| ! ! ! ! ||| 0.103413 0.132185 0.103413 0.401758 ||| 0-0 1-1 2-2 3-3 ||| 1 1 1 ||| |||
! ! ! ) ||| ! ! ! ) ||| 0.339323 0.167884 0.508985 0.4246 ||| 0-0 1-0 2-0 2-1 2-2 3-3 ||| 3 2 2 ||| |||
! ! ! ||| ! ! ! ||| 0.501834 0.219223 0.716905 0.50463 ||| 0-0 1-1 2-2 ||| 10 7 6 ||| |||
! ! ! ||| budget ! ! ! ||| 0.0517067 0.219223 0.0147733 4.50635e-05 ||| 0-1 1-2 2-3 ||| 2 7 1 ||| |||
! ! ) , ||| ! ! ) - , ||| 0.103413 0.111989 0.103413 0.00192967 ||| 0-0 1-1 2-2 3-3 3-4 ||| 1 1 1 ||| |||
! ! ) ||| ! ! ) ||| 0.103413 0.278429 0.103413 0.533321 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||
! ! ||| ! ! ||| 0.625 0.363573 0.769231 0.633844 ||| 0-0 1-1 ||| 16 13 10 ||| |||
! ! ||| . ||| 4.65922e-08 6.71089e-07 0.00795487 0.140779 ||| 0-0 1-0 ||| 2.21954e+06 13 1 ||| |||
! ! ||| budget ! ! ||| 0.0517067 0.363573 0.00795487 5.66022e-05 ||| 0-1 1-2 ||| 2 13 1 ||| |||
! ! ||| nécessaire ! ! ||| 0.103413 0.363573 0.00795487 0.000130572 ||| 0-1 1-2 ||| 1 13 1 ||| |||
! [ never again ! ||| ! ||| 6.51628e-06 5.42074e-13 0.103413 0.796143 ||| 0-0 4-0 ||| 15870 1 1 ||| |||
! ] this is ||| tel est ||| 7.38667e-05 9.16191e-11 0.103413 0.00147917 ||| 2-0 3-1 ||| 1400 1 1 ||| |||
! ] this ||| tel ||| 1.09594e-05 1.44188e-10 0.103413 0.0035893 ||| 2-0 ||| 9436 1 1 ||| |||
! ] ||| ! ] ||| 0.103413 0.352335 0.103413 0.472387 ||| 0-0 1-1 ||| 1 1 1 ||| |||
! & quot ; ||| ! " . et ||| 0.0517067 2.36396e-12 0.0517067 1.88268e-05 ||| 0-0 1-1 2-1 3-3 ||| 2 2 1 ||| |||
! & quot ; ||| ! " ||| 0.000222394 1.44515e-11 0.0517067 0.518419 ||| 0-0 2-1 ||| 465 2 1 ||| |||
! & quot ||| ! " . ||| 0.000662906 8.30626e-09 0.0344711 0.00232791 ||| 0-0 1-1 2-1 ||| 156 3 1 ||| |||
! & quot ||| ! " ||| 0.00218918 8.30626e-09 0.339323 0.518419 ||| 0-0 2-1 ||| 465 3 2 ||| |||
! & ||| ! ||| 6.51628e-06 7.21755e-05 0.103413 0.796143 ||| 0-0 ||| 15870 1 1 ||| |||
! ' ] , addressed ||| ! " adressé ||| 0.103413 3.70838e-07 0.103413 0.00596848 ||| 0-0 1-1 2-1 4-2 ||| 1 1 1 ||| |||
! ' ] , ||| ! " ||| 0.000222394 2.49698e-06 0.103413 0.215573 ||| 0-0 1-1 2-1 ||| 465 1 1 ||| |||
! ' ] ||| ! " ||| 0.000222394 3.57128e-05 0.103413 0.215573 ||| 0-0 1-1 2-1 ||| 465 1 1 ||| |||
! ' ' Alstom shares ||| l' on constate un dysfonctionnement ||| 0.0344711 5.62605e-16 0.103413 1.03361e-14 ||| 1-0 2-0 1-1 3-4 4-4 ||| 3 1 1 ||| |||
! ' ' ||| l' on constate un ||| 0.0147733 1.56906e-11 0.0129267 2.2766e-12 ||| 1-0 2-0 1-1 ||| 7 8 1 ||| |||
! ' ' ||| l' on constate ||| 0.000984889 1.56906e-11 0.0129267 2.36929e-10 ||| 1-0 2-0 1-1 ||| 105 8 1 ||| |||
! ' ' ||| l' on ||| 6.76656e-06 1.56906e-11 0.0129267 6.18613e-06 ||| 1-0 2-0 1-1 ||| 15283 8 1 ||| |||
! ' ' ||| ou que l' on constate ||| 0.0344711 1.56906e-11 0.0129267 4.69534e-15 ||| 1-2 2-2 1-3 ||| 3 8 1 ||| |||
! ' ' ||| ou que l' on ||| 0.00304157 1.56906e-11 0.0129267 1.22594e-10 ||| 1-2 2-2 1-3 ||| 34 8 1 ||| |||
! ' ' ||| que l' on constate un ||| 0.0344711 1.56906e-11 0.0129267 4.56092e-14 ||| 1-1 2-1 1-2 ||| 3 8 1 ||| |||
! ' ' ||| que l' on constate ||| 0.00323167 1.56906e-11 0.0129267 4.74661e-12 ||| 1-1 2-1 1-2 ||| 32 8 1 ||| |||
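
For readers who want to inspect entries like these programmatically, the following is a rough sketch that splits a phrase-table line into its fields and flags entries containing fragments of tokenised HTML entities such as "& quot". The function names and the regular expression are hypothetical, and the field order (source, target, scores, alignment, counts) is assumed from the lines shown above.

```python
import re

# Matches pieces of a split-up entity such as "& quot" or "& apos".
ENTITY_FRAGMENT = re.compile(r"&\s*(?:quot|apos|amp|lt|gt)\b")

def split_entry(line):
    """Split one phrase-table line into a dict of its fields."""
    fields = [f.strip() for f in line.split("|||")]
    return {
        "source": fields[0],
        "target": fields[1],
        "scores": [float(x) for x in fields[2].split()],
        "alignment": fields[3].split(),
        "counts": [float(x) for x in fields[4].split()],
    }

def has_entity_debris(entry):
    """True if either side contains a fragment of an HTML entity."""
    return bool(ENTITY_FRAGMENT.search(entry["source"])
                or ENTITY_FRAGMENT.search(entry["target"]))

suspect = split_entry(
    '! & quot ; ||| ! " ||| 0.000222394 1.44515e-11 0.0517067 0.518419 '
    "||| 0-0 2-1 ||| 465 2 1 ||| |||")
clean = split_entry(
    "! ! ||| ! ! ||| 0.625 0.363573 0.769231 0.633844 ||| 0-0 1-1 ||| 16 13 10 ||| |||")

print(has_entity_debris(suspect), has_entity_debris(clean))
```

Flagged entries are only candidates for inspection; whether they are actually wrong still has to be judged by looking at them.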




On 23/09/2015 15:12, Tom Hoar wrote:

Vincent,

If you suspect bad entries, isn't it better to address the root of the 
problem and prepare your training corpus better?



On 9/23/2015 6:46 PM, moses-support-requ...@mit.edu wrote:

Date: Tue, 22 Sep 2015 20:24:02 +0200
From: Philipp Koehn
Subject: Re: [Moses-support] is there a way to remove a bad entry in
the phrase table ?
To: Vincent Nguyen
Cc: moses-support

Hi,

you can remove it manually (just edit the text file), there will be no
negative consequences.

However, it is not a realistic strategy to try to remove by hand every
offending phrase table entry.

-phi

On Tue, Sep 22, 2015 at 4:05 PM, Vincent Nguyen  wrote:


>Hi,
>
>I was wondering: if, after analysing the BLEU-Annotation file, we
>realize that there must be a bad entry in the phrase table,
>can we remove it manually or in some other way?
>
>Thanks.
>V.


--
Best regards,

Tom Hoar
Chief Executive Officer
Precision Translation Tools Pte Ltd
Singapore/Thailand
Web: www.precisiontranslationtools.com

Thailand Mobile: +66 87 345-1875
Skype: tahoar




Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-23 Thread Tom Hoar

Vincent,

If you suspect bad entries, isn't it better to address the root of the 
problem and prepare your training corpus better?
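
As one possible illustration of what "preparing the corpus better" could involve, here is a minimal sketch that decodes HTML entities on both sides of a sentence pair and drops pairs that are empty or badly unbalanced in length. This is not a Moses script: the function name, the ratio threshold of 3.0 and the idea of applying it to raw, untokenised text are all assumptions made for the example.

```python
import html

def clean_pair(src, tgt, max_ratio=3.0):
    """Return a cleaned (src, tgt) pair, or None if the pair should be dropped.

    Entities such as &quot; are decoded and whitespace is normalised.
    Pairs with an empty side or an extreme token-count ratio are rejected.
    """
    src = " ".join(html.unescape(src).split())
    tgt = " ".join(html.unescape(tgt).split())
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return None
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
        return None
    return src, tgt

print(clean_pair("He said &quot;never again&quot; !",
                 "Il a dit &quot;plus jamais&quot; !"))
print(clean_pair("!", "l' on constate un dysfonctionnement"))
```

A length-ratio filter is a blunt instrument and will not catch every misaligned pair, so it is worth inspecting a sample of what it removes.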



On 9/23/2015 6:46 PM, moses-support-requ...@mit.edu wrote:

Date: Tue, 22 Sep 2015 20:24:02 +0200
From: Philipp Koehn
Subject: Re: [Moses-support] is there a way to remove a bad entry in
        the phrase table ?
To: Vincent Nguyen
Cc: moses-support

Hi,

you can remove it manually (just edit the text file), there will be no
negative consequences.

However, it is not a realistic strategy to try to remove by hand every
offending phrase table entry.

-phi

On Tue, Sep 22, 2015 at 4:05 PM, Vincent Nguyen  wrote:


>Hi,
>
>I was wondering: if, after analysing the BLEU-Annotation file, we
>realize that there must be a bad entry in the phrase table,
>can we remove it manually or in some other way?
>
>Thanks.
>V.


--
Best regards,

Tom Hoar
Chief Executive Officer
Precision Translation Tools Pte Ltd
Singapore/Thailand
Web: www.precisiontranslationtools.com

Thailand Mobile: +66 87 345-1875
Skype: tahoar


Re: [Moses-support] is there a way to remove a bad entry in the phrase table ?

2015-09-22 Thread Philipp Koehn
Hi,

you can remove it manually (just edit the text file), there will be no
negative consequences.

However, it is not a realistic strategy to try to remove by hand every
offending phrase table entry.

-phi
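
For the manual removal described above, a small script can stand in for hand-editing when the offending pairs are already known. The sketch below is only an illustration under that assumption: the function name is hypothetical, and it works on the plain-text phrase table, one entry per line.

```python
def drop_entries(lines, bad_pairs):
    """Yield phrase-table lines whose (source, target) pair is not in bad_pairs."""
    for line in lines:
        fields = line.split("|||")
        if len(fields) < 2:
            # Not a recognisable entry; pass it through untouched.
            yield line
            continue
        pair = (fields[0].strip(), fields[1].strip())
        if pair not in bad_pairs:
            yield line

table = [
    '" 1 ||| One Million Roofs ||| 0.103519 0.00213892 0.00398148 3.32314e-15 '
    "||| 0-0 1-0 0-1 0-2 ||| 1 26 1 ||| |||",
    '" 1 ||| million ||| 5.83433e-06 3.55949e-06 0.00398148 0.0019399 '
    "||| 1-0 ||| 17743 26 1 ||| |||",
]
bad_pairs = {('" 1', "One Million Roofs")}

kept = list(drop_entries(table, bad_pairs))
print(len(kept))
```

If the table is stored gzip-compressed, Python's `gzip.open` in text mode can supply the input lines and receive the output. A table that has been converted to a binary format would presumably need to be rebuilt from the edited text file.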

On Tue, Sep 22, 2015 at 4:05 PM, Vincent Nguyen  wrote:

> Hi,
>
> I was wondering: if, after analysing the BLEU-Annotation file, we
> realize that there must be a bad entry in the phrase table,
> can we remove it manually or in some other way?
>
> Thanks.
> V.