Hi, I have a large number of (ASCII) *.txt files that I need to spell check, preferably with aspell.
These files (which are in Dutch) contain many words that are broken up and hyphenated at EOL. In Dutch, the constituent parts of a hyphenated word are not necessarily valid words themselves. For instance, you could have:

    ... ... ... ... gete-
    kend ... ... .... ...

where one, or in this case both, parts are invalid words in Dutch, but the whole word 'getekend' is perfect Dutch. The hyphen is the common dash character, '-'.

The *.txt files I am talking about are actually OCR'd scans of newspaper articles, and I have the important requirement that their structure must be preserved exactly as it is now. This includes hyphenation. (I have a second requirement, which does not bear on the problem discussed here: spelling mistakes in the original are to be preserved in the *.txt files. The spell-check run is /only/ to weed out OCR errors; spelling errors in the original I plan to have annotated in some way.)

Back to the hyphenation. I noticed to my surprise that aspell does not recognize words that are hyphenated this way. Instead, it tries to match each part as a separate word. As far as I can tell from experimenting, aspell has no option to make this work as expected. I tried OpenOffice's spell-checker, and it has the same behaviour (little surprise, I think).

Do any of you have any ideas on how to approach this? Right now, I am thinking of creating a 'before-aspell' and 'after-aspell' filter (just shell scripts, hopefully, or else in C), but I wonder if there isn't an easier way, and whether I'm the first to run into this situation.

Thanks for your time.

Regards,
Bauke Jan Douma
[EMAIL PROTECTED]

_______________________________________________
Aspell-user mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/aspell-user
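
To make the 'before-aspell' filter idea concrete, here is a rough sketch in Python (rather than shell or C, purely for brevity). It is an illustration under assumptions, not an aspell feature: the function name `words_for_spellcheck` is hypothetical, the input is assumed to be plain ASCII, and every hyphen at end of line is assumed to be a line-break hyphen. The original file is never modified; the joining happens only on an in-memory copy, so the structure requirement is not affected.

```python
import re

# A word fragment ending in '-' at end of line, followed by the
# continuation fragment at the start of the next line.
# Assumption: ASCII letters only, as the files are said to be ASCII.
_EOL_HYPHEN = re.compile(r"([A-Za-z]+)-[ \t]*\n[ \t]*([A-Za-z]+)")


def words_for_spellcheck(text):
    """Return the words in `text`, with EOL-hyphenated words joined.

    Works on a copy of the text; the caller's original is untouched.
    """
    # Join "gete-\nkend" into "getekend" in the working copy only.
    joined = _EOL_HYPHEN.sub(r"\1\2", text)
    # Split the working copy into plain words for the spell checker.
    return re.findall(r"[A-Za-z]+", joined)
```

The resulting words could be written one per line and piped to `aspell --lang=nl list`, which prints the words it does not recognize. One known limitation of this sketch: a genuine hyphenated compound that happens to break at its hyphen would be joined too, so such words may need a second check with the hyphen kept.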
