This is the second part (change #2) of my consideration of apostrophes and 
hyphens in aspell.  The first part (change #1) was "Tokenization of 
word-initial specials" dated June 14 2013.

Currently, when *.dat marks apostrophe as valid initially, the dictionary form 
well validates the token 'well (in addition to the token well).  And, when 
*.dat marks apostrophe as valid finally, the dictionary form well also 
validates the token well' .  However, neither of the tokens 'well or well' 
should ever be validated by the form well; each should be approved only if 
that exact form is present in the dictionary.

There are two cases.  First, when the apostrophe is encountered in a token in a 
position, initial or final, where it IS NOT valid in *.dat (and note that this 
applies to en.dat), it is immediately dropped from the token, and only the 
token without the apostrophe is checked against the dictionary.  (Before change 
#1, even a valid initial apostrophe was dropped from the token, but not a valid 
final apostrophe.)  So if the purpose of "trying the token without the special" 
is to accept an English token which has picked up a neighbouring quotation 
mark, that situation never reaches the comparison code, and removing the code 
will have no effect on it.

Second, when the apostrophe is encountered in a token in a position, initial or 
final, where it IS valid in *.dat, the token should be accepted only if the 
dictionary contains the word including the apostrophe.  The current practice of 
accepting the token merely because the corresponding form without the 
apostrophe is in the dictionary amounts to accepting an invalid word, possibly 
one resulting from a mistaken use of the apostrophe (ASCII hex 27) as a 
quotation mark.  (Remember that languages which accept valid word-marginal 
apostrophes in *.dat do not use ASCII hex 27 as a quotation mark.)
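
To make the first case concrete, here is a minimal sketch of dropping a 
marginal apostrophe that *.dat does not allow in that position.  It is an 
illustration only, not aspell's tokenizer: SpecialFlags and 
trim_invalid_margins are hypothetical names, and the two booleans merely stand 
in for the "valid initially" and "valid finally" settings discussed above.

```cpp
#include <string>

// Hypothetical stand-in for the positional settings that *.dat gives a
// special such as the apostrophe; this is NOT aspell's own data structure.
struct SpecialFlags {
    bool valid_initially;
    bool valid_finally;
};

// Drop a leading or trailing apostrophe from a raw token wherever the
// flags say it is not valid in that position; valid ones are kept.
std::string trim_invalid_margins(std::string token, SpecialFlags apostrophe) {
    if (!token.empty() && token.front() == '\'' && !apostrophe.valid_initially)
        token.erase(0, 1);
    if (!token.empty() && token.back() == '\'' && !apostrophe.valid_finally)
        token.pop_back();
    return token;
}
```

With both flags false (the en.dat situation described above), the token 'well' 
is reduced to well before any dictionary check, so the comparison code never 
sees the apostrophes at all.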

The code for "trying the token with and without any initial or final special" 
is found in procedure SensitiveCompare in modules/speller/default/language.cpp 
at around line 428.  The suggested change #2 is to remove the code which, when 
the token begins or ends with a valid special, and has failed to match the 
dictionary, compares the token minus the special to the dictionary.  (Note 
again that a token will never be found to begin or end with an INVALID special, 
as that special will have been dropped during tokenization.)  Specifically, I 
suggest removal of the four separate lines which use the special() function.  
Having no previous experience of C++ programming I cannot say that everything 
has been done which ought to be done, but the concept has been tried and shown 
to work.  I do not at present see any reason to make it conditional, i.e. I 
cannot see any situation where the present behaviour is preferable.
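
The difference between the present and the proposed behaviour can be sketched 
as follows.  This is a simplified model and not the code in language.cpp: the 
dictionary is assumed to be a plain std::set of strings, only the apostrophe is 
treated as a special, and the function names accepts_with_fallback and 
accepts_exact_only are hypothetical.

```cpp
#include <set>
#include <string>

// Assumption: the dictionary is modelled as a simple set of exact forms.
using Dictionary = std::set<std::string>;

// Present behaviour, as described above: if the exact token is not in the
// dictionary, retry with a leading or a trailing apostrophe removed.
bool accepts_with_fallback(const Dictionary& dict, const std::string& token) {
    if (dict.count(token) != 0) return true;
    if (token.size() > 1 && token.front() == '\'' &&
        dict.count(token.substr(1)) != 0)
        return true;
    if (token.size() > 1 && token.back() == '\'' &&
        dict.count(token.substr(0, token.size() - 1)) != 0)
        return true;
    return false;
}

// Proposed behaviour (change #2): no retry; only the exact form counts.
bool accepts_exact_only(const Dictionary& dict, const std::string& token) {
    return dict.count(token) != 0;
}
```

With a dictionary containing well and 'tis, the first function accepts 'well 
and well' while the second rejects both; both functions accept well and 'tis.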

This suggestion will enable a language like Italian, for example, to have a new 
it.dat in which word-final apostrophe is allowed, and non-words like anch may 
be replaced in the dictionary by anch' .  Even for English, a new en.dat 
allowing marginal apostrophes and a new dictionary (with, for example, 'twas 
and 'twill in place of twas and twill, and adding 'tis and 'twould) could 
produce an improvement, but only with English texts in which an encoding 
distinction has been made between apostrophe and quotation mark.  The main 
beneficiaries of the suggestion will be among languages other than English.

As before, my experiments have been conducted using the Hatier port of aspell 
for Windows at http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2 .

Third and final part to follow.

Ciarán Ó Duibhín
