[aspell-devel] Tokenization of word-initial specials

Ciarán Ó Duibhín Fri, 14 Jun 2013 04:28:44 -0700

With reference to aspell source file modules/tokenizer/basic.cpp, procedure 
TokenizerBasic::advance(), the following block of code occurs at around the 
18th line of the procedure:


    if (is_begin(*cur) && is_word(cur[1]))
    {
      cur_pos += cur->width;
      ++cur;
    }

This code applies to the case where the relevant *.dat file declares a 
non-letter to be a valid word-initial symbol, and a token beginning with this 
symbol is found in text being checked.

As the code stands, the valid word-initial symbol is not stored with the 
extracted token (unlike a valid non-letter in the middle or at the end of a 
token).  I would suggest the inclusion of a statement

      word.append(*cur);

as an additional first statement within the braces.


The non-retention of the initial symbol in the token produces the following 
situation: given a dictionary which contains the form 'twas but not the form 
twas, the token 'twas in text is refused (it is looked up as twas), and the 
suggested replacement is 'twas.
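
The difference between the two behaviours can be illustrated with a small 
stand-alone sketch.  This is not aspell's code: toy_tokenize, is_word_char and 
is_begin_char are hypothetical names invented for the illustration, the 
apostrophe is simply assumed to be the only special declared word-initial, and 
specials in the middle or at the end of a word are not modelled at all.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the character classes that aspell takes from a
// language's *.dat file.  Assumption: only the apostrophe is word-initial.
static bool is_word_char(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
static bool is_begin_char(char c) { return c == '\''; }

// Toy tokenizer (not aspell's).  With keep_initial == false a valid
// word-initial symbol is skipped, as in the block quoted above; with
// keep_initial == true it is stored with the token, as proposed.
std::vector<std::string> toy_tokenize(const std::string& text, bool keep_initial)
{
    std::vector<std::string> tokens;
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        // Skip anything that cannot start a token: a token starts at a
        // letter, or at a word-initial symbol directly followed by a letter.
        while (i < n && !is_word_char(text[i]) &&
               !(is_begin_char(text[i]) && i + 1 < n && is_word_char(text[i + 1])))
            ++i;
        if (i >= n) break;

        std::string word;
        if (is_begin_char(text[i])) {
            assert(i + 1 < n && is_word_char(text[i + 1]));
            if (keep_initial) word += text[i];  // the suggested extra statement
            ++i;
        }
        // Collect the run of letters that forms the rest of the token.
        while (i < n && is_word_char(text[i])) word += text[i++];
        tokens.push_back(word);
    }
    return tokens;
}
```

Calling toy_tokenize("'twas brillig", false) yields the tokens twas and 
brillig, whereas toy_tokenize("'twas brillig", true) yields 'twas and brillig; 
only the latter can match a dictionary entry 'twas.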

Inclusion of the suggested statement repairs this behaviour, and in doing so it 
makes aspell conformant to its stated behaviour in 
http://aspell.net/man-html/Words-With-Symbols-in-Them.html:

    "The case where the symbol can appear at the beginning or end of the 
    word is more difficult to deal with...
    Aspell currently handles this case by first trying to spell check the 
    word with the symbol and if that fails, try it without."
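
The lookup order described in that quotation might be sketched as follows.  
This is only an illustration of the documented strategy, not aspell's 
implementation: check_with_fallback is a hypothetical helper, the dictionary 
is assumed to be a plain std::set, and only a leading apostrophe is handled.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper (not part of aspell): try the token with its leading
// symbol first, and only if that fails try it without the symbol.
bool check_with_fallback(const std::set<std::string>& dict,
                         const std::string& token)
{
    assert(!token.empty());                    // tokens are assumed non-empty
    if (dict.count(token) > 0) return true;    // first: with the symbol
    if (token.front() == '\'')                 // then: without it
        return dict.count(token.substr(1)) > 0;
    return false;
}
```

With a dictionary containing only 'twas, check_with_fallback accepts 'twas but 
rejects twas.  The first lookup can only succeed if the tokenizer hands over 
the token with its initial symbol still attached, which is the point of the 
suggested change.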


The effect of the proposed change on English should be an improvement, though 
not a significant one: English examples are few and unimportant ('tis, 'twas, 
'twill, 'twould).  However, in many languages word-initial (and word-final) 
apostrophes are common, and moreover ASCII hex 27 is not used in them as a 
quotation mark.  There an apostrophe is not a discardable punctuation mark, but 
part of the spelling of the word; removing the apostrophe produces a different 
word (or, more usually, a non-word).  Nevertheless this is what aspell normally 
does: it includes in the dictionary the residue of the word without any 
marginal apostrophe, and (per the above quotation) checks the token, less any 
marginal apostrophe, against the dictionary.  This strategy may be serviceable, 
if ugly, for some languages, but the texts I wish to check contain so many 
words of this type that it would be necessary to admit legions of non-words 
into the dictionary, and the whole operation breaks down.

The better way to proceed is to add the words to the dictionary complete with 
their marginal apostrophes, and to check the tokens complete with their 
marginal apostrophes against the dictionary.  For this checking to work in the 
case of word-initial apostrophes, the suggested change to aspell is a necessary 
first step.  At least two further steps will be needed also, before aspell will 
be able to reproduce the ability of the MS Word spell-checker to handle these 
situations satisfactorily.


I considered making a bug report on this, but I thought it needed more 
explanation than a bug report would normally contain.  Also it would be 
necessary to be sure that there are no circumstances in which the present 
behaviour is preferable - if there are, any change should be made to depend on 
a configuration option.


This proposal concerns valid word-initial symbols in general, including in 
particular ASCII hex 27, and is independent of any consideration of the Unicode 
apostrophe.


My experiments have been conducted using the Hatier port of aspell for Windows 
at http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2 .


Ciarán Ó Duibhín

