[Moses-support] phrase-table with ' " and other strange things. Additional corpus cleaning necessary?

Artem Shevchenko Sat, 04 Apr 2020 16:40:30 -0700

Hello,

Following the manual for creating the baseline system, I trained a model on
the Europarl v9 de-en pair.
I now see that the resulting phrase table contains a lot of noise.


E.g. there are many "&apos;" and "&quot;" tokens, which seem to distort the
model and the decoder.
Truecasing also did not work properly around these special symbols:

&quot; ( Das sind sehr ||| &apos; ( these are very ||| 0.5 2.47962e-05 0.333333 7.4064e-05 ||| 0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||

Did you do any additional cleaning of the corpus before training?
Please share your experience.

Artem Shevchenko
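
For readers inspecting entries like the one quoted above, here is a minimal sketch of how such a line could be split into fields and its entity tokens turned back into plain characters. It assumes the usual " ||| " separated layout (source, target, scores, alignment, counts) and assumes that "&quot;" and "&apos;" are escaped forms of " and '. The function names are hypothetical, and Python's html.unescape is used as a stand-in here, not Moses' own de-escaping.

```python
import html

# The entry quoted in the message, joined onto one line.
LINE = ("&quot; ( Das sind sehr ||| &apos; ( these are very ||| "
        "0.5 2.47962e-05 0.333333 7.4064e-05 ||| "
        "0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||")


def parse_phrase_table_line(line):
    """Split one phrase-table line into its '|||'-separated fields.

    Assumed field order: source, target, scores, alignment, counts.
    """
    fields = [field.strip() for field in line.split("|||")]
    source, target, scores, alignment = fields[0], fields[1], fields[2], fields[3]
    return {
        "source": source.split(),
        "target": target.split(),
        "scores": [float(s) for s in scores.split()],
        # Each alignment point "i-j" links source token i to target token j.
        "alignment": [tuple(int(i) for i in p.split("-")) for p in alignment.split()],
    }


def unescape_tokens(tokens):
    """Turn entity tokens such as &quot; and &apos; back into plain characters."""
    return [html.unescape(token) for token in tokens]


entry = parse_phrase_table_line(LINE)
print(unescape_tokens(entry["source"]))
print(unescape_tokens(entry["target"]))
```

Printing the unescaped tokens makes it easier to see that this pair aligns a double quote on the German side with a single quote on the English side, which may point to inconsistent quoting in the corpus rather than to the escaping itself.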

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
