Jehan,

A brute-force method to give some phrases more weight is simply to create intentional duplicates in your training data set. Miles' option has more finesse.
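As a rough illustration of the duplication approach, here is a minimal Python sketch. It assumes the corpus is held in memory as (source, target) pairs and that the "certified" pairs are known as a set. The function name `upweight_by_duplication` and the `copies` factor are hypothetical, not part of Moses or GIZA++, and the right factor would have to be found by experiment. Like the GIZA++ weights, this only nudges the statistics and guarantees nothing.

```python
from typing import Iterable, List, Set, Tuple

# A sentence pair is (source sentence, target sentence).
Pair = Tuple[str, str]


def upweight_by_duplication(
    pairs: Iterable[Pair], trusted: Set[Pair], copies: int = 3
) -> List[Pair]:
    """Emit every occurrence of a trusted pair `copies` times.

    Pairs that are not in `trusted` are emitted once per occurrence,
    so duplicates already present in the corpus are left untouched.
    Hypothetical helper for illustration only.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    weighted: List[Pair] = []
    for pair in pairs:
        repeat = copies if pair in trusted else 1
        weighted.extend([pair] * repeat)
    return weighted


if __name__ == "__main__":
    corpus = [
        ("hello", "bonjour"),
        ("thank you", "merci"),
        ("hello", "bonjour"),  # a naturally occurring duplicate
    ]
    certified = {("thank you", "merci")}
    for src, tgt in upweight_by_duplication(corpus, certified, copies=3):
        print(f"{src}\t{tgt}")
```

The result would then be written back out as the two line-aligned source and target files used for training. The tuning set should stay free of these pairs either way.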
Tom

On Fri, 18 Nov 2011 18:29:50 +0900, Jehan Pages <je...@mygengo.com> wrote:
> Hi,
>
> On Fri, Nov 18, 2011 at 6:00 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>> re: not tuning on training data, in principle this shouldn't matter
>> (especially if the tuning set is large and/or representative of the
>> task).
>>
>> in reality, Moses will assign far too much weight to these examples,
>> to the detriment of the others. (it will drastically overfit). this
>> is why the tuning and training sets are typically disjoint. this is a
>> standard tactic in NLP and not just Moses.
>
> Ok thanks. Actually I think that reminds me indeed what I learned
> years ago on the topic (when I was still in university, in fact
> working on these kinds of topics, though now that's kind of far away).
>
> [Also, Tom Hoar, forget my questions on what you answer at this point
> (when I asked "how do you do so?" and such). I misunderstood the
> meaning of your answer! Now with Miles's answer, and rereading your
> first one, I understand]
>
>> re: assigning more weight to certain translations, you have two
>> options here. the first would be to assign more weight to these pairs
>> when you run Giza++. (you can assign per-sentence pair weights at
>> this stage). this is really just a hint and won't guarantee anything.
>> the second option would be to force translations (using the XML
>> markup).
>
> I see. Interesting. For what I want, the weights on GIZA++ look nice.
> I'll try to find information on this.
>
> Thanks a lot for the answers.
>
> Jehan
>
>> Miles
>>
>> On 18 November 2011 08:42, Jehan Pages <je...@mygengo.com> wrote:
>>> Hi,
>>>
>>> On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
>>> <tah...@precisiontranslationtools.com> wrote:
>>>> Jehan, here are my strategies, others may vary.
>>>
>>> Thanks.
>>>
>>>> 1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++,
>>>> not just a convenience for speed. If you make the effort to use the
>>>> BerkeleyAligner, this limit disappears.
>>>
>>> Ok I didn't know this alternative to GIZA++. I see there are some
>>> explanations on the website for switching to this aligner. I may give
>>> it a try someday then. :-)
>>>
>>>> 2/ From a statistics and survey methodology point of view, your
>>>> training data is a subset of individual samples selected from a whole
>>>> population (linguistic domain) so as to estimate the characteristics
>>>> of the whole population. So, duplicates can exist and they play an
>>>> important role in determining statistical significance and
>>>> calculating probabilities. Some data sources, however, repeat
>>>> information with little relevance to the linguistic balance of the
>>>> whole domain. One example is a web site with repetitive menus on
>>>> every page. Therefore, for our use, we keep duplicates where we
>>>> believe they represent a balanced sampling and results we want to
>>>> achieve. We remove them when they do not. Not everyone, however,
>>>> agrees with this approach.
>>>
>>> I see. And that confirms my thoughts. I don't know for sure what will
>>> be my strategy, but I think that will be keeping them all then, most
>>> probably. Making conditional removal like you do is interesting, but
>>> that would prove hard to do on our platform as we don't have context
>>> on translations stored.
>>>
>>>> 3/ Yes, none of the data pairs in the tuning set should be present in
>>>> your training data. To do so skews the tuning weights to give
>>>> excellent BLEU scores on the tuning results, but horrible scores on
>>>> "real world" translations.
>>>
>>> I am not sure I understand what you say. How do you do so? Also why
>>> would we want to give horrible scores to real world translations?
>>> Isn't the point exactly that the tuning data should actually
>>> "represent" these real world translations that we want to get close to?
>>>
>>>
>>> 4/ Also I was wondering something else that I just remember. So that
>>> will be a fourth question!
>>> Suppose in our system, we have some translations we know for sure are
>>> very good (all are good but some are supposed to be more like
>>> "certified quality"). Is there no way in Moses to give some more
>>> weight to some translations in order to influence the system towards
>>> quality data (still keeping all data though)?
>>>
>>> Thanks again!
>>>
>>> Jehan
>>>
>>>> Tom
>>>>
>>>>
>>>> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages
>>>> <je...@mygengo.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a few questions about quality of training and tuning. If
>>>>> anyone has any clarifications, that would be nice! :-)
>>>>>
>>>>> 1/ According to the documentation:
>>>>> «
>>>>> sentences longer than 100 words (and their corresponding
>>>>> translations) have to be eliminated
>>>>> (note that a shorter sentence length limit will speed up training)
>>>>> »
>>>>> So is it only for the sake of training speed or can too long
>>>>> sentences end up being a liability in MT quality? In other words,
>>>>> when I finally need to train "for real usage", should I really
>>>>> remove long sentences?
>>>>>
>>>>> 2/ My data is taken from real crowd-sourced translated data. As a
>>>>> consequence, we end up with some duplicates (same original text and
>>>>> same translation). I wonder if for training, that either doesn't
>>>>> matter, or else we should remove duplicates, or finally that's
>>>>> better to have duplicates.
>>>>>
>>>>> I would imagine the latter (keep duplicates) is the best as this is
>>>>> "statistical machine learning" and after all, these represent "real
>>>>> life" duplicates (text we often encounter and that we apparently
>>>>> usually translate the same way) so that would be good to "insist on"
>>>>> these translations during training.
>>>>> Am I right?
>>>>>
>>>>> 3/ Do training and tuning data have necessarily to be different? I
>>>>> guess for it to be meaningful, it should, and various examples on
>>>>> the website seem to go in that way, but I could not read anything
>>>>> clearly stating this.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Jehan
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> Moses-support@mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>>
>>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>