I am training my decoder on a string-to-tree model, English to Arabic. I am using these options:

-hierarchical -glue-grammar -max-phrase-length 5 -ghkm --extract-options="--UnknownWordMinRelFreq 0.01 --MaxNodes 40 --MaxRuleDepth 7 --MaxRuleSize 7 --AllowUnary" --score-options="--GoodTuring --LowCountFeature --MinCountHierarchical 2 --MinScore 2:0.0001"

I am planning on filtering the duplicates manually, but can you point out whether I might be doing something wrong, or which option is causing this? Could duplicate sentences cause this? Maybe I need to filter my training set.

On Fri, Jul 1, 2016 at 7:31 PM, Hieu Hoang <hieuho...@gmail.com> wrote:

> It shouldn't. How did you create it? Can you give an example?
>
> Hieu Hoang
> http://www.hoang.co.uk/hieu
>
> On 1 July 2016 at 18:29, Ayah El Maghraby <ayah.elmaghr...@gmail.com> wrote:
>
>> I mean that the file rule-table.gz also contains duplicate entries.
>>
>> On Fri, Jul 1, 2016 at 7:27 PM Hieu Hoang <hieuho...@gmail.com> wrote:
>>
>>> The extract.sorted.gz file is not your rule table.
>>>
>>> The training should create another file called phrase-table.*gz. This is
>>> your rule table.
>>>
>>> On 30/06/2016 22:55, Ayah El Maghraby wrote:
>>>
>>> Hello,
>>> I have duplicate entries in my rule table, the extract.sorted.gz files. I am
>>> training on a data set of around 1,000,000 lines.
>>> I am translating from English to Arabic. Is this normal?
>>> Will removing duplicates affect my decoder?
>>>
>>> Regards,
>>> Ayah
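
To produce the kind of example asked for above, a minimal sketch along these lines could list the lines that occur more than once in a gzipped rule table. It is only an illustration: the function name count_duplicate_lines is hypothetical, "duplicate" is assumed to mean byte-identical lines, and the distinct lines are assumed to fit in memory, which may not hold for a table built from a million sentence pairs.

```python
import gzip
from collections import Counter


def count_duplicate_lines(path):
    """Return {line: count} for every line that occurs more than once.

    Assumption: a "duplicate" is an exact, whole-line match.
    """
    counts = Counter()
    # Read the gzipped table as text, one rule per line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            counts[line.rstrip("\n")] += 1
    # Keep only the lines seen more than once.
    return {line: n for line, n in counts.items() if n > 1}
```

Calling count_duplicate_lines("rule-table.gz") and pasting a few of the returned entries would show whether the repeated rules are really identical or differ in some field such as the scores or alignments.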
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support