Hi, I found this older tutorial to be very useful as well:

"Practical Domain Adaptation" by Marcello Federico and Nicola Bertoldi
http://www.mt-archive.info/10/AMTA-2012-Bertoldi-ppt.pdf
(The document formatting is unfortunately slightly messed up.)

SMT research survey wiki:
http://www.statmt.org/survey/Topic/DomainAdaptation

Cheers,
Matthias

On Fri, 2015-08-14 at 20:37 +0100, Barry Haddow wrote:
> You could try this tutorial
>
> http://www.statmt.org/mtma15/uploads/mtma15-domain-adaptation.pdf
>
> On 14/08/15 20:20, Vincent Nguyen wrote:
> > I had read this section, which deals with translation model combination.
> > It says little about the language model or tuning.
> >
> > For instance: suppose I want to make sure that a specific expression,
> > "titres", is translated as "equities" from French to English.
> >
> > Do these 2 words specifically have to be in the monolingual corpus of
> > the language model, or in the parallel corpus?
> >
> > If 2 "parallel expressions" are in the tuning set but present in neither
> > the parallel corpora nor the monolingual LM data, can that trigger a
> > good translation?
> >
> > I am not sure I am being clear ....
> >
> > Thanks again for your help.
> >
> >
> > On 14/08/2015 20:52, Rico Sennrich wrote:
> >> Hi Vincent,
> >>
> >> this section describes some domain adaptation methods that are
> >> implemented in Moses: http://www.statmt.org/moses/?n=Advanced.Domain
> >>
> >> It is incomplete (focusing on parallel data and the translation model),
> >> and does not recommend best practices.
> >>
> >> In general, my recommendation is to use in-domain data whenever possible
> >> (for the language model, translation model, and held-out in-domain data
> >> for tuning/testing). Out-of-domain data can help, but also hurt your
> >> system: the effect depends on your domains and the amount of data you
> >> have for each. Data selection, instance weighting, model interpolation
> >> and domain features are different methods that give you the benefits of
> >> out-of-domain data, but reduce its harmful effects, and are often better
> >> than just concatenating all the data you have.
> >>
> >> best wishes,
> >> Rico
> >>
> >>
> >> On 14/08/15 16:22, Vincent Nguyen wrote:
> >>> Hi,
> >>>
> >>> I can't find a sort of "tutorial" on which domain adaptation path to
> >>> follow. I read this in the doc:
> >>> The language model should be trained on a corpus that is suitable to the
> >>> domain. If the translation model is trained on a parallel corpus, then
> >>> the language model should be trained on the output side of that corpus,
> >>> although using additional training data is often beneficial.
> >>>
> >>> And in the training section of the EMS, there is a subsection with
> >>> domain-features=....
> >>>
> >>> What is the best practice?
> >>>
> >>> Let's say, for instance, that I would like to specialize my model in
> >>> finance translation, with a specific corpus.
> >>>
> >>> Should I train the language model with finance data?
> >>> Should I include the parallel corpus in the translation model training?
> >>> Should I tune with financial data sets?
> >>>
> >>> Please help me to understand.
> >>> Vincent

-- 
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
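
To make the "model interpolation" idea from the quoted thread a bit more concrete, here is a minimal toy sketch in Python. It is not how Moses or any LM toolkit implements it; it only illustrates the principle of mixing an in-domain and an out-of-domain language model and picking the mixing weight on held-out in-domain text. The unigram models, the add-one smoothing, the grid search, all function names and the example sentences are assumptions made up for illustration.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(tokens, p_in, p_out, lam):
    """Perplexity of tokens under the mixture lam * p_in + (1 - lam) * p_out."""
    log_prob = sum(math.log(lam * p_in[w] + (1 - lam) * p_out[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

def best_weight(heldout, p_in, p_out, steps=20):
    """Grid-search the interpolation weight on held-out in-domain text."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda lam: perplexity(heldout, p_in, p_out, lam))

# Toy data, invented for this example only.
in_domain = "equities rose as bond yields fell and equities traders bought".split()
out_domain = "the titles of the books are on the shelf near the door".split()
heldout = "equities fell as traders sold bond holdings".split()

# Shared vocabulary so that both models assign a probability to every word.
vocab = set(in_domain) | set(out_domain) | set(heldout)
p_in = train_unigram(in_domain, vocab)
p_out = train_unigram(out_domain, vocab)

lam = best_weight(heldout, p_in, p_out)
print("chosen in-domain weight:", lam)
print("held-out perplexity:", perplexity(heldout, p_in, p_out, lam))
```

With real data one would use proper n-gram models from an LM toolkit instead of these toy unigram models, but the idea stays the same: the held-out in-domain set decides how much the out-of-domain model is trusted.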
_______________________________________________ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support