I am having problems using the tagged and annotated output on dev/corpus files, specifically news-test2008, when there are intermediate periods (.) in the text. If a period is truly an end-of-sentence marker, both TreeTagger and BitPar interpret it correctly. If it marks an abbreviation, however, TreeTagger treats it as an abbreviation, but BitPar misinterprets it as an end-of-sentence marker, and the two outputs fall out of sync.
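To see how many lines are affected before deciding how to handle them, a small check along these lines may help. This is only a sketch in Python, and it rests on assumptions not stated above: that the file has one sentence per line, that tokens are separated by whitespace, and that any token ending in a period before the last token of the line counts as an "intermediate period". The function names are made up for illustration.

```python
def find_intermediate_periods(line):
    """Return tokens that end in a period but are not the last token on the line.

    Assumption: one sentence per line, whitespace-separated tokens.
    """
    tokens = line.split()
    # The final token is skipped: a period there is a normal sentence end.
    return [tok for tok in tokens[:-1] if tok.endswith(".")]


def flag_lines(lines):
    """Yield (line_number, tokens) for each line with an intermediate period."""
    for number, line in enumerate(lines, start=1):
        hits = find_intermediate_periods(line)
        if hits:
            yield number, hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_periods.py < news-test2008.de
    for number, hits in flag_lines(sys.stdin):
        print(f"{number}\t{' '.join(hits)}")
```

The flagged lines are only candidates for places where the two tools may disagree. The check does not know which periods BitPar actually splits on, so the list would still need to be compared against the real parser output.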
For example, BitPar thinks each of the following contains two sentences:

# Am 9. Dezember
# wie z.B. die geringe Fahrpraxis

It is trivial to write a Perl script to change the intermediate dots to, for example, "Am 9., Dezember". The question is: what would be the best substitution (and is this the right approach in the first place), and what ramifications would it have on tuning (this is being used as a tuning corpus)?

Thank you!

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
