[Aspell-user] Looking for usage advice

Grzegorz Adam Hankiewicz Sun, 30 Jan 2005 11:43:05 -0800

Hi.

I am the coordinator of the project to translate the book
"Version Control with Subversion", found at
http://svnbook.red-bean.com/, into Spanish.  You can find out more
about this project in the Spanish translation of that page. We use
aspell as a safety net against silly typos in our translations, and
we thank you for it. I'd like to describe here the way we work, to
see whether something can be done to make our job easier.


The book is written in XML. This is not a problem for aspell and its
html/sgml mode. Each chapter is stored in a separate file, and so far
translators have gone through each file individually. Once a file
has had an initial translation, we run aspell on it.

Since the text contains lots of technical words not included in the
default dictionaries, and also sometimes text in English which is
left verbatim, aspell reports many false positives which have to
be ignored.

Once the translator has gone through the XML file with aspell, we
create an "ignore" file from the spell checked document. This ignore
file is produced by piping the output of the list command to a hidden
file. On later aspell runs, this hidden file is converted into a
custom dictionary (with "create master") and added on the command
line.
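
The two steps above can be sketched as follows. This is only a
minimal illustration in Python, not our actual makefile: the "es"
language code, the hidden-file naming scheme and the helper names are
assumptions made up for the example. It only builds the shell command
lines (using aspell's list, create master and check commands and the
--add-extra-dicts option) so they can be inspected before running.

```python
import os
import shlex
from pathlib import Path


def ignore_paths(chapter):
    """Return (word list, compiled dictionary) paths for one chapter.

    The hidden-file naming scheme here is hypothetical.
    """
    chapter = Path(chapter)
    words = chapter.with_name("." + chapter.name + ".words")
    dictionary = chapter.with_name("." + chapter.name + ".dict")
    return words, dictionary


def _explicit(path):
    # Prefix relative paths with "./", mirroring the form used in the
    # aspell documentation for dictionaries stored outside the
    # standard dictionary directory.
    return shlex.quote(os.path.join(os.curdir, str(path)))


def list_command(chapter, lang="es"):
    """Shell command that stores the words aspell rejects in a hidden file."""
    words, _ = ignore_paths(chapter)
    return "aspell --lang=%s --mode=sgml list < %s | sort -u > %s" % (
        lang, _explicit(chapter), _explicit(words))


def check_commands(chapter, lang="es"):
    """Shell commands that compile the hidden word list and re-check."""
    words, dictionary = ignore_paths(chapter)
    dct = _explicit(dictionary)
    return [
        "aspell --lang=%s create master %s < %s" % (
            lang, dct, _explicit(words)),
        "aspell --lang=%s --mode=sgml --add-extra-dicts=%s check %s" % (
            lang, dct, _explicit(chapter)),
    ]
```

Each returned string could be passed to a shell, for instance with
subprocess.run(cmd, shell=True, check=True).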

Although the document has already been translated and spell checked
at that point, this hidden file makes it easier for us to re-verify
the spelling if we later need to modify the translation.

The whole spell checking process is hidden behind two makefile
targets, aspell_add_words and aspell_check, which use non-portable
bash for-loops to process all the files we have translated to date.

Possibly the biggest improvement for us would be for aspell to
recognise different languages in the document being scanned.
Reading the mailing list archives, I found that this feature is
not planned because of the intrinsic difficulty of detecting a
language correctly.  However, in the kind of documents we translate,
English text is usually left untouched inside specific tags, like
<screen> or <quote>.

Possibly heresy in itself, but it could be useful if aspell had a
basic XML scanner and was more aware of the format of the document it
is parsing, providing user-customised hooks that fire whenever
specific tags are found. By basic I mean really dumb word matching:
if the tag <quote is found and the user specified this as a hook,
aspell could maybe switch to another dictionary on the fly, prompt
the user whether this change is OK (showing the text on the screen at
the same time for the user to judge), maybe pipe the bit of text to
yet another program, etc., until the byte sequence </quote> is found.
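
To make the idea concrete, here is a rough sketch in Python of such a
dumb scanner working outside aspell. Nothing in it is an existing
aspell feature: the function name, the hooks mapping and the "es"
default are assumptions for illustration. It does no real XML
parsing; it only looks for an opening hook tag and then for the
literal closing sequence.

```python
import re


def split_by_hooks(text, hooks, default="es"):
    """Split text into (language, chunk) pairs by plain tag matching.

    hooks maps a tag name such as "quote" to the language to use until
    the literal closing tag is seen. Nested or self-closing hook tags
    are not handled.
    """
    if not hooks:
        return [(default, text)] if text else []
    names = "|".join(re.escape(name) for name in hooks)
    # \b keeps <screen> from also matching longer names like <screenshot>.
    opener = re.compile(r"<(%s)\b[^>]*>" % names)
    segments = []
    pos = 0
    while True:
        match = opener.search(text, pos)
        if match is None:
            break
        tag = match.group(1)
        closer = "</%s>" % tag
        end = text.find(closer, match.end())
        if end == -1:
            break  # unterminated hook: leave the rest as default text
        if match.start() > pos:
            segments.append((default, text[pos:match.start()]))
        segments.append((hooks[tag], text[match.end():end]))
        pos = end + len(closer)
    if pos < len(text):
        segments.append((default, text[pos:]))
    return segments
```

Each chunk could then be handed to a separate aspell run with the
matching --lang value, or shown to the user for confirmation first.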

Another improvement we would find useful is the ability to attach
attributes to words when they go into a dictionary (or rather, into
our current ignore list). Maybe it is already possible, but I
don't know how to insert a word in the custom dictionary and tell
aspell that I always want it to be lower case or upper case, or how
to mark a word as belonging to a different language. The latter may
not seem useful until you realise that most of the ignored
words in our translations are repeated from chapter to chapter.

By marking words as "other language", it could be possible to
create a global "english words" dictionary for all of the files,
removing the tedious work of adding words like "Subversion" or
"HTTP" to each ignore file.
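
Without any such marking, a shared list could at least be approximated
from the existing per-chapter ignore files. The following Python
sketch assumes the ignore lists have already been read into memory;
the function name and the "appears in at least two files" threshold
are arbitrary choices for illustration, and it does not tell English
words apart from any other repeated word.

```python
from collections import Counter


def split_shared_words(ignore_lists, min_files=2):
    """Separate words repeated across chapters from chapter-only words.

    ignore_lists maps a chapter name to its list of ignored words.
    Returns (shared words, {chapter: remaining words}), both sorted.
    """
    counts = Counter()
    for words in ignore_lists.values():
        counts.update(set(words))  # count each word once per chapter
    shared = {word for word, n in counts.items() if n >= min_files}
    remaining = {
        name: sorted(set(words) - shared)
        for name, words in ignore_lists.items()
    }
    return sorted(shared), remaining
```

The shared list would be compiled once into a global dictionary, and
each chapter would keep only its own remainder.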

Not directly related to these questions: I've hacked together a
Python script which extracts the added ("+") lines from unified diffs
and writes them to a temporary file, which is then spell checked by
aspell. I've found this very useful, and was wondering whether anybody
else would be interested in it.
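
For readers curious about the approach, the core of such a filter can
be sketched as below. This is an independent minimal illustration, not
the script mentioned above; the function name is made up. It relies on
the line counts in each "@@" hunk header, so that "+++" file headers
and added lines whose text itself begins with "++" are told apart.

```python
import re

# Hunk header, e.g. "@@ -12,7 +12,9 @@"; a missing count means 1.
HUNK = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def added_lines(diff_text):
    """Return the lines added by a unified diff, without the leading '+'."""
    added = []
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: skip file headers until the next "@@".
            match = HUNK.match(line)
            if match:
                old_left = int(match.group(1)) if match.group(1) is not None else 1
                new_left = int(match.group(2)) if match.group(2) is not None else 1
            continue
        if line.startswith("+"):
            added.append(line[1:])
            new_left -= 1
        elif line.startswith("-"):
            old_left -= 1
        elif line.startswith("\\"):
            pass  # "\ No newline at end of file" marker
        else:
            old_left -= 1  # context line counts on both sides
            new_left -= 1
    return added
```

The returned lines can be written to a temporary file and passed to
aspell in the same way as a whole chapter.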


_______________________________________________
Aspell-user mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/aspell-user
