Matt Post created JOSHUA-307:
--------------------------------

Summary: Java-based tokenization and normalization
Key: JOSHUA-307
URL: https://issues.apache.org/jira/browse/JOSHUA-307
Project: Joshua
Issue Type: Improvement
Reporter: Matt Post
Priority: Minor
Fix For: 6.2

Currently, Joshua expects input data to be already lowercased, normalized, and tokenized in the same way the training data was prepared. This requires running Perl scripts on the input before passing it in. It would be nice if these Perl scripts (located under $JOSHUA/scripts/preparation) were rewritten in Java (under org.apache.joshua.util) so that Joshua could do this preprocessing itself. This would be particularly useful for the language packs.
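
To make the proposal concrete, here is a rough sketch of what a pure-Java replacement might look like. It is only an illustration: the class name SimplePreprocessor and its methods are hypothetical, it is not a port of the actual Perl scripts, and its rules (NFKC normalization, curly-quote unification, splitting off ASCII punctuation) are assumptions that would have to be matched against whatever was applied to the training data.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, NOT part of Joshua: a rough stand-in for the
// normalize / tokenize / lowercase steps currently done by Perl scripts.
final class SimplePreprocessor {

    private SimplePreprocessor() {}

    // Unicode-normalize (NFKC), map curly quotes to ASCII, collapse whitespace.
    static String normalize(String line) {
        String s = Normalizer.normalize(line, Normalizer.Form.NFKC);
        s = s.replace('\u2018', '\'').replace('\u2019', '\'')
             .replace('\u201C', '"').replace('\u201D', '"');
        return s.replaceAll("\\s+", " ").trim();
    }

    // Naive tokenizer: put spaces around every ASCII punctuation character.
    // Assumption: this is far cruder than the real tokenizer; it also
    // splits contractions ("don't") and decimals ("3.14").
    static String tokenize(String line) {
        return line.replaceAll("(\\p{Punct})", " $1 ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Locale-independent lowercasing.
    static String lowercase(String line) {
        return line.toLowerCase(Locale.ROOT);
    }

    // Full pipeline in the order described in this issue's summary.
    static String prepare(String line) {
        return lowercase(tokenize(normalize(line)));
    }
}
```

With this sketch, SimplePreprocessor.prepare("  Hello,   World! ") returns "hello , world !". The important property is not these particular rules but that the same code runs at training and at decoding time, which is what would let a language pack ship without a Perl dependency.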