Hi. I am the coordinator of the project to translate the book "Version Control with Subversion", found at http://svnbook.red-bean.com/, into Spanish. You can find out more about this project in the Spanish translation of that page. We use aspell as a safety net against silly typos in our translations, and we thank you for it. I would like to describe the way we work, to see whether something can be done to make our job easier.

The book is written in XML, which is not a problem for aspell and its html/sgml mode. Each chapter is stored in a separate file, and so far translators have gone through each file individually. Once a file has had an initial translation, we run aspell on it. Since the text contains lots of technical words not included in the default dictionaries, and sometimes English text which is left verbatim, aspell reports many false positives which have to be ignored.

Once the translator has gone through the XML file with aspell, we create an "ignore" file from the spell-checked document. This ignore file is created with the list command, whose output is redirected to a hidden file. On subsequent aspell runs, this hidden file is converted into a custom dictionary (with "create master") and added on the command line. Although the document has already been translated and spell checked at that point, the hidden file makes it easier for us to verify the spelling again if we need to modify the translation further. The whole spell checking process is hidden behind two makefile targets, aspell_add_words and aspell_check, which use non-portable bash for loops to process all the files we have translated to date.

Possibly the best improvement for us would be for aspell to recognise the different languages in the document being scanned. Reading the mailing list archives, I found out that this feature is not planned because of the intrinsic difficulty of detecting a language correctly. However, in the kind of documents we translate, English text is usually left alone inside specific tags, like <screen> or <quote>. Possibly heresy in itself, but it could be useful if aspell had a basic XML scanner and was more aware of the format of the document it is parsing, providing user-customised hooks whenever specific tags are found.

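For readers who want to picture the list / "create master" / check cycle described above, here is a minimal sketch in Python. It is not the project's actual makefile. The hidden-file naming scheme, the `.rws` suffix, the `es` language code and all function names are assumptions made for illustration, and encoding handling is left out.

```python
import subprocess
from pathlib import Path

LANG = "es"  # assumption: chapters are checked against the Spanish dictionary


def ignore_file_for(chapter: Path) -> Path:
    """Hypothetical naming scheme: ch01.xml -> .ch01.xml.aspell_ignore"""
    return chapter.with_name(f".{chapter.name}.aspell_ignore")


def unique_words(aspell_list_output: str) -> list[str]:
    """Deduplicate and sort the words printed by `aspell list`."""
    return sorted({w.strip() for w in aspell_list_output.splitlines() if w.strip()})


def list_cmd() -> list[str]:
    # `aspell list` reads a document on stdin and prints the unknown words.
    return ["aspell", f"--lang={LANG}", "--mode=sgml", "list"]


def create_master_cmd(dictionary: Path) -> list[str]:
    # `aspell create master` reads a word list on stdin and writes a dictionary.
    return ["aspell", f"--lang={LANG}", "create", "master", str(dictionary)]


def check_cmd(chapter: Path, dictionary: Path) -> list[str]:
    # Interactive check with the per-chapter dictionary added as an extra one.
    return ["aspell", f"--lang={LANG}", "--mode=sgml",
            f"--add-extra-dicts={dictionary}", "check", str(chapter)]


def add_words(chapter: Path) -> None:
    """Rough equivalent of an aspell_add_words step for a single chapter."""
    text = chapter.read_text(encoding="utf-8")
    result = subprocess.run(list_cmd(), input=text, capture_output=True,
                            text=True, check=True)
    words = unique_words(result.stdout)
    ignore_file_for(chapter).write_text("\n".join(words) + "\n", encoding="utf-8")


def check(chapter: Path) -> None:
    """Rough equivalent of an aspell_check step for a single chapter."""
    ignore_file = ignore_file_for(chapter)
    # Absolute path, so aspell treats it as a file rather than a dictionary name.
    dictionary = ignore_file.with_suffix(".rws").resolve()
    words = ignore_file.read_text(encoding="utf-8")
    subprocess.run(create_master_cmd(dictionary), input=words, text=True, check=True)
    subprocess.run(check_cmd(chapter, dictionary), check=True)
```

Looping over every translated chapter would then be a plain `for chapter in sorted(Path("book").glob("*.xml"))`, where the `book` directory name is again only an example.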
By a basic XML scanner I mean really dumb word matching: if the tag <quote is found and the user has specified it as a hook, aspell could change to another dictionary on the fly, prompt the user whether this change is OK (showing the text on the screen at the same time for the user to judge), maybe pipe that bit of text to yet another program, etc., until the byte sequence </quote> is found.

Another improvement we would find useful is being able to mark words with different properties when they go into a dictionary (or rather, our current ignore list). Maybe it is already possible, but I don't know how to insert a word in the custom dictionary and tell aspell that I always want it to be lower case or upper case, or how to tell it that a word actually belongs to a different language. The latter may not seem useful until you realise that most of the ignored words in our translations are repeated from chapter to chapter. By marking words as "other language", it would be possible to create a global "English words" dictionary for all of the files, removing the tedious work of adding words like "Subversion" or "HTTP" to each ignore file.

Not directly related to these questions: I have hacked together a Python script which extracts the "positive lines" (the added lines, those starting with "+") from unified diffs into a temporary file, which is then spell checked by aspell. I have found this very useful, and was wondering whether anybody else is interested in it.

_______________________________________________
Aspell-user mailing list
http://lists.gnu.org/mailman/listinfo/aspell-user
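The diff-filtering script mentioned above is not reproduced in this message. The following is only a guess at what such a tool could look like, a minimal sketch assuming standard unified diff hunk headers (`@@ -a,b +c,d @@`). The function names are hypothetical. It counts lines per hunk so that file headers such as `+++ b/file` are not mistaken for added text.

```python
import re
import tempfile

# Hunk header of a unified diff; a missing line count means 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def added_lines(diff_text: str) -> list[str]:
    """Return the text of every added line, without the leading '+'."""
    added = []
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: skip file headers, wait for the next "@@" line.
            match = HUNK_RE.match(line)
            if match:
                old_left = int(match.group(1)) if match.group(1) is not None else 1
                new_left = int(match.group(2)) if match.group(2) is not None else 1
            continue
        if line.startswith("+"):
            added.append(line[1:])
            new_left -= 1
        elif line.startswith("-"):
            old_left -= 1
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        else:
            # Context line, present in both the old and the new file.
            old_left -= 1
            new_left -= 1
    return added


def write_for_spellcheck(diff_text: str) -> str:
    """Write the added lines to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                     encoding="utf-8") as handle:
        handle.write("\n".join(added_lines(diff_text)) + "\n")
        return handle.name
```

The returned path could then be handed to an interactive `aspell check` run. Corrections made there end up in the temporary file only, so they would still have to be carried back to the real chapter by hand.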
