On 04/12/2014 07:25 PM, Rafi Kamal wrote:
Another question: in the script, I need to split the corpus into sentences. Is there any tool in Apertium that does this (like the NLTK tokenizer for Python)? I could simply split the corpus on ".", "?" or "!", but then sentences where "." is not the last character will cause problems. For example, "Washington D.C. is a beautiful place." will be split into three parts: "Washington D", "C", and "is a beautiful place".
First of all, the sentence is not a unit in Apertium. Apertium works with the chunks its modules deal with (lexical units, patterns, etc.). If you check the English dictionary, you will see that some entries contain "."; for instance, "Washington D.C." is an entry in Apertium-en-es, and there are many others. Symbols such as "?" and "!" are analysed as "sent", that is, as sentence markers, just like ".".
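
To make the idea concrete, here is a minimal Python sketch of a splitter that treats entries containing "." as indivisible units and cuts only at ".", "?" or "!" outside them. It is not an Apertium tool and does not read the Apertium dictionaries: the names PROTECTED_ENTRIES and split_sentences are hypothetical, and the entry list is a hand-written stand-in for whatever dotted entries the real dictionary contains.

```python
import re

# Hypothetical stand-in for dictionary entries that contain "."
# (in practice these would have to be extracted from the dictionary).
PROTECTED_ENTRIES = ["Washington D.C.", "D.C."]


def split_sentences(text, protected=PROTECTED_ENTRIES):
    """Split text after ".", "?" or "!", but never inside a protected entry."""
    # Try longer entries first so "Washington D.C." wins over "D.C.".
    entries = sorted(protected, key=len, reverse=True)
    parts = [re.escape(entry) for entry in entries] + [r"[.?!]"]
    scanner = re.compile("|".join(parts))

    sentences = []
    start = 0
    for match in scanner.finditer(text):
        if match.group() in protected:
            # The periods here belong to a lexical unit, not to a sentence end.
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()

    # Keep any trailing text that has no final sentence marker.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences(
        "Washington D.C. is a beautiful place. Is it far? Yes!"
    ):
        print(sentence)
```

One known limitation of this sketch: when a protected entry is itself the last thing in a sentence (for example "I live in D.C. It is nice."), its final period is not treated as a sentence marker, so the two sentences stay joined.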

On the other hand, I believe the -m option (used in connection with translation memories) does deal with sentences in some way, but you should check.

All the best

Mikel

--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326

_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
