Text to be translated needs to be in the same format as the data used
for training and decoding.  Typically, this means:

-- tokenising
-- lower-casing

But there is nothing in the framework that forces you to do this;
for example, you might want to preserve case information.

Best practice will depend on the volume of material you have.  If
you have a lot of data, then it makes sense to keep as much of the
original format (information) as possible.  Whenever the text is
transformed, you run the risk of throwing information away, and
reconstructing it later might introduce extra errors.

But if you do not have much data, or you suspect that it contains
noise, then cleaning etc. might yield good results.
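For concreteness, the usual round trip is: tokenise and lowercase the input with the same scripts used on the training data (Moses ships tokenizer.perl and lowercase.perl under scripts/tokenizer/, and recasing/detokenising scripts for the output), decode, then reverse the transforms.  A toy stand-in for the preprocessing step -- the real tokenizer handles far more cases, and you must use the same one at training and translation time or the phrase table will not match:

```shell
#!/bin/sh
# Toy preprocessing sketch, NOT the Moses tokenizer: splits a few
# punctuation marks off words, then lowercases.  In practice, pipe
# through scripts/tokenizer/tokenizer.perl -l en and
# scripts/tokenizer/lowercase.perl instead.
preprocess() {
  sed -E 's/([.,!?;:])/ \1/g' | tr '[:upper:]' '[:lower:]'
}

echo "Hello, world!" | preprocess
# prints: hello , world !
```

The key point is symmetry: whatever transformation the training corpus went through, the input must go through the same one, and the output needs the inverse (recasing, detokenising) applied before it reaches the user.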

Miles

On 6 August 2010 14:36, Gary Daine <gda...@gmail.com> wrote:
> I have a very basic-sounding question, but I've not been able to find
> any reference in the documentation.
>
> Since Moses is trained on tokenized, lowercased corpora, is it necessary
> to tokenize and lowercase the text to be translated as well (and do the
> reverse to the output)?
>
> TIA
>
> Gary
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.