[Aspell-user] aspell and hyphen at EOL

Bauke Jan Douma Wed, 05 Oct 2005 14:04:24 -0700

Hi,

I have a large number of (ASCII) *.txt files that I need to spell check,
preferably with aspell.


These files (which are in Dutch) contain many broken-up words that are
hyphenated at EOL.  In Dutch, the constituent parts of a hyphenated word
are not necessarily valid words themselves.  For instance, you could have

... ... ... ... gete-
kend ... ... .... ...

where one, or in this case both, parts are invalid words in Dutch, but
the whole word 'getekend' is perfect Dutch.
The hyphen is the common dash character, '-'.

The *.txt files I am talking about are actually OCR'd scans of newspaper
articles, and I have the important requirement that their structure must
be preserved exactly as it is now.  This includes hyphenation.
(I have a second requirement, unrelated to the present problem: spelling
mistakes in the original are to be preserved in the *.txt files.  The
spell-check run is /only/ to weed out OCR errors; spelling errors in the
original I plan to have annotated in some way.)

Back to the hyphenation. I noticed to my surprise that aspell does not
recognize words that are hyphenated this way.  Instead it tries to match
each part as a separate word.

As far as I can see, and judging from my own experiments, aspell has no
option to make this work as expected.
I tried OpenOffice's spell-checker, and it has the same behaviour (no
great surprise, I think).

Do any of you have any ideas on how to approach this?  Right now, I am
thinking of creating a 'before-aspell' and an 'after-aspell' filter (just
shell scripts, hopefully, or else in C), but I wonder whether there is an
easier way, and whether I'm the first to run into this situation.
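
To make the 'before' half of that idea concrete, here is a minimal sketch
in Python (not shell or C, purely for brevity).  It is only an assumption
about how such a filter might look, not something aspell provides: the
function name dehyphenate and the pattern EOL_HYPHEN are made up for this
example, and it assumes words consist of letters only.  It produces a
temporary copy in which each EOL-hyphenated word is joined on its first
line; the original file is never touched, and the line count stays the
same, so any misspelling reported for the copy can still be traced back to
the same line number in the original.

```python
import re

# A word fragment followed by '-' at end of line, then the fragment that
# starts the next line.  [^\W\d_] matches any letter (including accented
# ones); treating words as letters-only is an assumption of this sketch.
EOL_HYPHEN = re.compile(r"([^\W\d_]+)-[ \t]*\n[ \t]*([^\W\d_]+)")


def dehyphenate(text: str) -> str:
    """Return a copy of text with EOL-hyphenated words joined.

    The joined word stays on the first line and the line break is kept,
    so the number of lines does not change.
    """
    return EOL_HYPHEN.sub(lambda m: m.group(1) + m.group(2) + "\n", text)


# The example from this message: 'gete-' / 'kend' becomes 'getekend'.
sample = "... ... ... ... gete-\nkend ... ... .... ...\n"
print(dehyphenate(sample), end="")
```

The joined copy could then be fed to something like
"aspell --lang=nl list", assuming a Dutch dictionary is installed.  One
known limitation: a word that genuinely contains a hyphen and happens to
break at that hyphen would be joined incorrectly, so such cases would
still need a manual look.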

Thanks for your time.

Regards,

Bauke Jan Douma
[EMAIL PROTECTED]



