Matt Post created JOSHUA-307:
--------------------------------

             Summary: Java-based tokenization and normalization
                 Key: JOSHUA-307
                 URL: https://issues.apache.org/jira/browse/JOSHUA-307
             Project: Joshua
          Issue Type: Improvement
            Reporter: Matt Post
            Priority: Minor
             Fix For: 6.2


Currently, Joshua expects input data to be lowercased, normalized, and tokenized
before it is passed in, consistent with the way the training data was prepared.
This requires calling Perl scripts on the input data. It would be nice if these
Perl scripts (located under $JOSHUA/scripts/preparation) were rewritten in Java
(under org.apache.joshua.util) so that Joshua could do this preprocessing
itself. This would be particularly useful for the language packs.
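
As a rough illustration of the idea, here is a minimal sketch of what an
in-process preparation step might look like. The class name SimplePreprocessor
and its methods are hypothetical and do not exist in Joshua, and the rules
(NFKC normalization, splitting off every punctuation or symbol character) are
simplifying assumptions, not a port of the Perl scripts, whose exact behavior
would have to be matched for the output to stay consistent with the training
data.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch only: not an existing Joshua class, and not
// equivalent to the scripts under $JOSHUA/scripts/preparation.
final class SimplePreprocessor {

    // Apply Unicode NFKC normalization, then collapse runs of whitespace.
    static String normalize(String line) {
        String s = Normalizer.normalize(line, Normalizer.Form.NFKC);
        return s.replaceAll("\\s+", " ").trim();
    }

    // Crude tokenization (an assumption): put spaces around every
    // punctuation or symbol character, then collapse whitespace.
    // This also splits cases such as "3.14" or "don't".
    static String tokenize(String line) {
        String s = line.replaceAll("([\\p{P}\\p{S}])", " $1 ");
        return s.replaceAll("\\s+", " ").trim();
    }

    // Locale-independent lowercasing.
    static String lowercase(String line) {
        return line.toLowerCase(Locale.ROOT);
    }

    // Run the three steps in order: normalize, tokenize, lowercase.
    static String prepare(String line) {
        return lowercase(tokenize(normalize(line)));
    }
}
```

With something along these lines, a caller could run
SimplePreprocessor.prepare(line) on each input sentence before decoding
instead of piping the input through external scripts.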



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
