Hi, yes, this is what the RECASER section in EMS enables.
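For concreteness, here is roughly what the relevant parts of an EMS config look like. This is a sketch from memory rather than a verbatim copy of the shipped examples (in the sample configs the section is spelled [RECASING]); check the setting names against scripts/ems/experiment.meta, and note that all paths and corpus names below are illustrative:

    # recasing route: train and decode on lowercased data,
    # restore case in a post-processing step
    [RECASING]
    # cased target-side text to train the recasing model on
    # (corpus name "target" is illustrative)
    tokenized = [LM:target:tokenized-corpus]

    # truecasing route: truecase the data before training instead
    [TRUECASER]
    trainer = $moses-script-dir/recaser/train-truecaser.perl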
-phi

On Wed, May 20, 2015 at 2:50 PM, Lane Schwartz <dowob...@gmail.com> wrote:

> Got it. So then, how was casing handled in the "mbr/mp" column? Was all
> of the data lowercased, then models trained, then recasing applied after
> decoding? Or something else?
>
> On Wed, May 20, 2015 at 1:30 PM, Philipp Koehn <p...@jhu.edu> wrote:
>
>> Hi,
>>
>> no, the changes are made incrementally.
>>
>> So the recased "baseline" is the previous "mbr/mp" column.
>>
>> -phi
>>
>> On Wed, May 20, 2015 at 2:01 PM, Lane Schwartz <dowob...@gmail.com>
>> wrote:
>>
>>> Philipp,
>>>
>>> In Table 2 of the WMT 2009 paper, are the "baseline" and "truecased"
>>> columns directly comparable? In other words, do the two columns indicate
>>> identical conditions other than a single variable (how and/or when
>>> casing was handled)?
>>>
>>> In the baseline condition, how and when was casing handled?
>>>
>>> Thanks,
>>> Lane
>>>
>>> On Wed, May 20, 2015 at 12:43 PM, Philipp Koehn <p...@jhu.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> see Section 2.2 of our WMT 2009 submission:
>>>> http://www.statmt.org/wmt09/pdf/WMT-0929.pdf
>>>>
>>>> One practical reason to avoid recasing is the need
>>>> for a second large cased language model.
>>>>
>>>> But there is of course also the practical issue of
>>>> having a unique truecasing scheme for each data
>>>> condition, handling of headlines, all-caps emphasis,
>>>> etc.
>>>>
>>>> It would be worth revisiting this issue under
>>>> different data conditions / language pairs. Both
>>>> options are readily available in EMS.
>>>>
>>>> Each of the two alternative methods could be
>>>> improved as well. See for instance:
>>>> http://www.aclweb.org/anthology/N06-1001
>>>>
>>>> -phi
>>>>
>>>> On Wed, May 20, 2015 at 12:31 PM, Lane Schwartz <dowob...@gmail.com>
>>>> wrote:
>>>>
>>>>> Philipp (and others),
>>>>>
>>>>> I'm wondering what people's experience is regarding when truecasing
>>>>> is applied.
>>>>>
>>>>> One option is to truecase the training data, then train your TM and
>>>>> LM on that truecased data. Another option would be to lowercase the
>>>>> data, train the TM and LM on the lowercased data, and then perform
>>>>> truecasing after decoding.
>>>>>
>>>>> I assume that the former gives better results, but the latter
>>>>> approach has an advantage in terms of extensibility (namely, if you
>>>>> get more data and update your truecase model, you don't have to
>>>>> re-train all of your TMs and LMs).
>>>>>
>>>>> Does anyone have any insights they would care to share on this?
>>>>>
>>>>> Thanks,
>>>>> Lane
>>>
>>> --
>>> When a place gets crowded enough to require ID's, social collapse is not
>>> far away. It is time to go elsewhere. The best thing about space travel
>>> is that it made it possible to go elsewhere.
>>> -- R.A. Heinlein, "Time Enough For Love"
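For anyone finding this thread in the archives: the two workflows Lane describes map onto the standard scripts under scripts/recaser (plus scripts/tokenizer/lowercase.perl) roughly as follows. This is a sketch only; filenames are illustrative, and the exact flags of train-recaser.perl and recase.perl have changed between Moses versions, so check the script headers.

    # Option 1: truecase the training data, detruecase after decoding
    train-truecaser.perl --model truecase-model.en --corpus cased-corpus.en
    truecase.perl --model truecase-model.en < corpus.en > corpus.tc.en
    # ... train TM and LM on the truecased data, decode, then:
    detruecase.perl < decoder-output.tc.en > decoder-output.en

    # Option 2: lowercase everything, recase after decoding
    lowercase.perl < corpus.en > corpus.lc.en
    # ... train TM and LM on the lowercased data, decode, then:
    train-recaser.perl --dir recaser-model --corpus cased-corpus.en
    # ($MOSES_BIN is an illustrative path to the moses binary)
    recase.perl --model recaser-model/moses.ini --in decoder-output.lc.en \
        --moses $MOSES_BIN > decoder-output.en

Note that the recaser in option 2 is itself a small Moses system trained on cased target-side text, which is exactly the "second large cased language model" mentioned above.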
_______________________________________________ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support