With reference to aspell source file modules/tokenizer/basic.cpp, procedure
TokenizerBasic::advance(), the following block of code occurs at around the
18th line of the procedure:
    if (is_begin(*cur) && is_word(cur[1]))
    {
      cur_pos += cur->width;
      ++cur;
    }
This code applies to the case where the relevant *.dat file declares a
non-letter to be a valid word-initial symbol, and a token beginning with this
symbol is found in text being checked.
As the code stands, the valid word-initial symbol is not stored with the
extracted token (unlike a valid non-letter in the middle or at the end of a
token). I would suggest the inclusion of a statement
    word.append(*cur);
as an additional first statement within the braces.
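To illustrate the effect of that one statement, here is a minimal sketch. It
is a toy model and not aspell's code: the function first_token(), the flag
keep_initial_symbol and the simplified is_word()/is_begin() tests are all
assumptions made for the example, and the text is a plain std::string rather
than aspell's filtered character buffer.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the tokenizer's character classes:
// letters are word characters, and the apostrophe (ASCII hex 27) is the
// only symbol declared valid at the beginning of a word.
static bool is_word(char c)  { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
static bool is_begin(char c) { return c == '\''; }

// Toy model of token extraction: returns the first token found in `text`.
// keep_initial_symbol == false imitates the behaviour described above;
// keep_initial_symbol == true imitates the behaviour with the extra statement.
std::string first_token(const std::string& text, bool keep_initial_symbol) {
    std::string word;
    std::size_t i = 0;
    const std::size_t n = text.size();
    // Safe lookahead: reading past the end yields a terminating NUL.
    auto at = [&](std::size_t k) { return k < n ? text[k] : '\0'; };

    // Skip everything that cannot start a token.
    while (i < n && !(is_word(at(i)) || (is_begin(at(i)) && is_word(at(i + 1)))))
        ++i;

    // A valid word-initial symbol followed by a word character.
    if (is_begin(at(i)) && is_word(at(i + 1))) {
        if (keep_initial_symbol)
            word += at(i);  // counterpart of the proposed word.append(*cur);
        ++i;
    }

    // Collect the word characters that follow.
    while (is_word(at(i))) {
        word += at(i);
        ++i;
    }
    return word;
}
```

In this model, first_token("'twas", false) yields twas while
first_token("'twas", true) yields 'twas, so only the second form can be
matched against a dictionary entry that includes the apostrophe.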
The non-retention of the initial symbol in the token produces the situation
where, given a dictionary which contains the form "'twas" but not the form
"twas", the token "'twas" in text is refused, and the suggested replacement is
"'twas", the very form that was typed.
Inclusion of the suggested statement repairs this behaviour, and in doing so it
makes aspell conformant to its stated behaviour in
http://aspell.net/man-html/Words-With-Symbols-in-Them.html:
"The case where the symbol can appear at the beginning or end of the
word is more difficult to deal with...
Aspell currently handles this case by first trying to spell check the
word with the symbol and if that fails, try it without."
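The documented strategy can be sketched as follows. This is only my reading of
the quoted passage, not aspell's implementation: the Dictionary type, the
function check_with_fallback() and the is_marginal_symbol() test are
hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical dictionary: a plain set of accepted spellings.
using Dictionary = std::unordered_set<std::string>;

// Assumption: the apostrophe is the only symbol allowed at a word margin.
static bool is_marginal_symbol(char c) { return c == '\''; }

// First try the token with its marginal symbols; if that fails, try it
// again with a leading and/or trailing symbol removed.
bool check_with_fallback(const Dictionary& dict, const std::string& token) {
    if (dict.count(token) > 0)
        return true;

    std::string stripped = token;
    if (!stripped.empty() && is_marginal_symbol(stripped.front()))
        stripped.erase(0, 1);
    if (!stripped.empty() && is_marginal_symbol(stripped.back()))
        stripped.pop_back();

    // Only a genuinely shorter form counts as a second attempt.
    return stripped != token && dict.count(stripped) > 0;
}
```

With a dictionary holding only 'twas, this check accepts the token 'twas but
refuses twas. The first attempt can therefore succeed only if the token
reaches the check with its initial symbol still attached.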
The effect of the proposed change on English should be an improvement, though
not a significant one: English examples are few and unimportant ('tis, 'twas,
'twill, 'twould). However, in many languages word-initial (and word-final)
apostrophes are common, and moreover ASCII hex 27 is not used as a quotation
mark. In such languages an apostrophe is not a discardable punctuation mark but
part of the spelling of the word; removing the apostrophe produces a different
word (or, more usually, a non-word). Nevertheless this is what aspell normally
does: it includes in the dictionary the residue of the word without any
marginal apostrophe, and (per the above quotation) checks the token, minus its
marginal apostrophe, against the dictionary. This strategy may be serviceable,
if ugly, for some languages, but the texts I wish to check contain so many
words of this type that it would be necessary to admit legions of non-words
into the dictionary, and the whole operation breaks down.
The better way to proceed is to add the words to the dictionary complete with
their marginal apostrophes, and to check the tokens complete with their
marginal apostrophes against the dictionary. For this checking to work in the
case of word-initial apostrophes, the suggested change to aspell is a necessary
first step. At least two further steps will be needed also, before aspell will
be able to reproduce the ability of the MS Word spell-checker to handle these
situations satisfactorily.
I considered making a bug report on this, but I thought it needed more
explanation than a bug report would normally contain. Also it would be
necessary to be sure that there are no circumstances in which the present
behaviour is preferable; if there are, any change should be made to depend on
a configuration option.
This proposal concerns valid word-initial symbols in general, including in
particular ASCII hex 27, and is independent of any consideration of the Unicode
apostrophe.
My experiments have been conducted using the Hatier port of aspell for Windows
at http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2 .
Ciarán Ó Duibhín