Hello spell checking colleagues,
As you probably know, I have added spell checking facilities to Vim. I'm
using the word lists in Myspell compatible format as input. I hope we
can continue sharing the word lists, since that is where most effort is
spent.
Spell checking in Vim works very well now. I'm converting the word
list to a binary format ".spl" file, so that reading the word list is
very fast. There are a few tricks to keep the file small and spell
checking fast. The speed is required, because in Vim spell checking is
done when displaying the text. A trie with tail compression is used.
More about that below.
What I'm looking into now is compound words. I implemented the
Myspell compatible simplistic mechanism that allows words to be
concatenated without condition (as used for Danish, for example).
Clearly this allows lots of bad words.
I'm hoping that we can agree on a way to define compound words better.
Then the authors of word lists can start using this. It's obviously
very important that we agree on one format, so that the word lists can
be used with all the different programs. Or at least it should be
possible to convert the formats into each other, thus the mechanisms
must be similar.
I found an outline of a format in the online Aspell documentation. But
it's not finished. Let me propose something, then we can discuss it,
try it out and finally decide on what it's going to be.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
PROPOSAL
1. Use flags after the word to define which words can be used in a
compound word, just like affixes. This also means the number of flags
is limited, and they must be different from the affix IDs. A word can
have several compound flags; all combinations will be valid, just
like with affixes.
2. Use items in the affix file to define what the flags mean and
restrictions on what words may be concatenated to others. The
following describes this (excerpt from the Vim help draft):
COMPOUND WORDS *spell-affix-compound*
A compound word is a longer word made by concatenating words. To specify
which words may be concatenated a character is used. This character is put in
the list of affixes after the word. We will call this character a flag here.
Obviously these flags must be different from any affix IDs used.
*spell-COMPOUNDFLAG*
The Myspell compatible method uses one flag, specified with COMPOUNDFLAG.
All words with this flag combine in any order and without limit in length.
This means there is no control over which word comes first. Example:
COMPOUNDFLAG c
*spell-COMPOUNDFLAGS*
The method added by Vim allows specifying which words can be prepended to
other words, and which words can be appended to other words. The value is a
list of comma-separated items. Each item may contain zero or more dashes and
plus signs.
NOTE: At this moment COMPOUNDFLAGS has not been implemented yet!
An item without dashes or plus signs specifies words that combine in any order
and any number of times. Example:
COMPOUNDFLAGS c,m
This allows all words with the "c" flag to be combined and all words with the
"m" flag to be combined, but a word with the "c" flag doesn't combine with a
word with the "m" flag.
Flags that are put together, without a separating comma, are considered
interchangeable. Example:
COMPOUNDFLAGS cm
This allows all words with the "c" and/or "m" flag to be combined.
An item with one dash specifies flags for a leading word and flags for a
trailing word. Thus only two-word combinations are made. Example:
COMPOUNDFLAGS f-d
Here the 'f' flag can be used for food and 'd' for dishes, such that you can
use these words in the dictionary:
tomato/f
onion/f
soup/d
salat/d
Which makes the words:
tomato
onion
soup
salat
tomatosoup
tomatosalat
onionsoup
onionsalat
Note that something like "souptomato" is not possible. Also note that it is
actually easier to list all the words when there are so few.
More dashes can be used to allow more words to combine. For example:
COMPOUNDFLAGS f-d,f-f-d
Would allow "tomatoonionsoup" (OK, so this is a bad example, but you get the
idea).
When a word can be used any number of times, use a plus instead of a dash.
Example:
COMPOUNDFLAGS f+d
Then you can make tasty "oniononiontomatotomatosoup".
The "+" may also appear at the end, which means that the last flags can be
repeated many times. Example:
COMPOUNDFLAGS f-d+
Which allows the use of "onionsoupsoupsoupsoupsoupsoup".
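The rules above can be made concrete with a minimal Python sketch. This is
only one possible reading of the draft, not how Vim does it (COMPOUNDFLAGS is
not implemented yet). The function names are made up for illustration, "+" is
assumed to mean "one or more times", and the word splitting is brute force.

```python
import re

def parse_compoundflags(value):
    """Parse e.g. "f-d,f+d" into rules; a rule is a list of (flags, repeatable)."""
    rules = []
    for item in value.split(","):
        parts = re.findall(r"([^+\-]+)([+\-]?)", item)
        if len(parts) == 1:
            # No dash or plus: the flags combine in any order, any number of times.
            rules.append([(set(parts[0][0]), True)])
        else:
            # A "+" after a group means the group may repeat (assumed: one or more).
            rules.append([(set(flags), sep == "+") for flags, sep in parts])
    return rules

def rule_matches(rule, word_flags):
    """True if the flag sets of the words, in order, fit the rule."""
    def match(ri, wi):
        if wi == len(word_flags):
            return ri == len(rule)
        if ri == len(rule):
            return False
        flags, repeatable = rule[ri]
        if not (flags & word_flags[wi]):
            return False
        # Consume this word, then move to the next group or repeat this one.
        return match(ri + 1, wi + 1) or (repeatable and match(ri, wi + 1))
    return len(word_flags) >= 2 and match(0, 0)

def splits(text, words):
    """Yield every way to cut text into words from the word list."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        if text[:i] in words:
            for rest in splits(text[i:], words):
                yield [text[:i]] + rest

def is_compound(text, words, rules):
    """True if text is a concatenation of two or more words allowed by a rule."""
    for parts in splits(text, words):
        flag_seq = [words[p] for p in parts]
        if any(rule_matches(rule, flag_seq) for rule in rules):
            return True
    return False

# Word list copied from the food and dish example above.
words = {"tomato": {"f"}, "onion": {"f"}, "soup": {"d"}, "salat": {"d"}}
rules = parse_compoundflags("f-d,f-f-d")
print(is_compound("tomatoonionsoup", words, rules))  # True
print(is_compound("souptomato", words, rules))       # False
```

A real implementation would walk the word tree while matching instead of
trying every split, and would also have to honour COMPOUNDMIN.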
*spell-COMPOUNDMIN*
The minimal length of a word used for concatenation is specified with
COMPOUNDMIN. Example:
COMPOUNDMIN 5
When omitted, a minimal length of 3 bytes is used. Obviously you could just
leave out the compound flag from short words instead; this feature is present
for compatibility with Myspell.
*spell-CMP*
NOTE: At this moment CMP has not been implemented yet!
Sometimes it is necessary to change a word when concatenating it to another,
by removing a few letters, inserting something or both. It can also be useful
to restrict concatenation to words that match a pattern. For this purpose CMP
items can be used. They look like this:
CMP {flag} {strip} {add} {cond} {cond2}
{flag} the flag, as used in COMPOUNDFLAGS for the lead word
{strip} text to remove from the end of the lead word (zero
for no stripping)
{add} text to insert between the words (zero for no
addition)
{cond} condition to match at the end of the lead word
{cond2} condition to match at the start of the following word
This is exactly the same as what is used for SFX and PFX items, except there
is an extra condition. Example:
CMP f 0 - . .
When used with the food and dish word list above, this means that a dash is
inserted after each food item. Thus you get "onion-soup" and
"onion-tomato-salat".
When there are CMP items for a compound flag the concatenation is only done
when a CMP item matches.
When there are no CMP items for a compound flag, then all words will be
concatenated, as if there was an item:
CMP {flag} 0 0 . .
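A minimal Python sketch of how a single CMP item could be applied to a pair of
words. CMP is not implemented yet, so this is only an illustration: the names
CmpItem, parse_cmp and join_words are hypothetical, and the conditions are
assumed to be simple enough to be handed to Python's regular expressions
unchanged (".", "[abc]", "[^abc]").

```python
import re
from typing import NamedTuple, Optional

class CmpItem(NamedTuple):
    flag: str
    strip: str
    add: str
    cond: str
    cond2: str

def parse_cmp(line: str) -> CmpItem:
    """Parse "CMP {flag} {strip} {add} {cond} {cond2}"; "0" means empty."""
    keyword, flag, strip, add, cond, cond2 = line.split()
    if keyword != "CMP":
        raise ValueError("not a CMP item: " + line)
    strip = "" if strip == "0" else strip
    add = "" if add == "0" else add
    return CmpItem(flag, strip, add, cond, cond2)

def join_words(lead: str, follow: str, item: CmpItem) -> Optional[str]:
    """Return the concatenation, or None when the item does not match."""
    if not lead.endswith(item.strip):
        return None
    # cond must match at the end of the lead word.
    if not re.search("(?:" + item.cond + ")$", lead):
        return None
    # cond2 must match at the start of the following word.
    if not re.match(item.cond2, follow):
        return None
    stem = lead[: len(lead) - len(item.strip)]
    return stem + item.add + follow

item = parse_cmp("CMP f 0 - . .")
print(join_words("onion", "soup", item))  # onion-soup
```

Returning None for a non-matching item corresponds to the rule that, once CMP
items exist for a flag, concatenation only happens when one of them matches.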
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Note that I have no experience in trying to make a word list with
compound words, thus I'm not aware of language-specific issues. I'm
trying to provide a mechanism that's as generic as possible without
making the implementation too complicated.
Questions:
- Do we need a "not combining" indication on the compound flag, to
indicate only the basic word can be used, not the word plus affixes
applied? I guess not, you can repeat the word with different flags
when needed.
- Won't we run into the problem that one byte only allows a limited
number of affix flags? In Vim utf-8 can be used, thus allowing lots
of flags, but it won't work with Aspell or Myspell.
An alternative could be to separate the affix IDs from the compound
flags with a special character, e.g., a '('. That doubles the number
of different flags available. It's also easier to see the different
flags in the word list. Mistakes will be noticed quicker.
Other solutions require IDs with multiple letters, which would
complicate things considerably.
- The CMP items are used for the lead word and have no way to specify
which following compound flag is valid. Only the condition can be
used. I would think that it's useful to do something different when a
certain kind of word follows. Adding a field that specifies the
possible compound flags on the following word would be possible:
CMP {flag} {strip} {add} {cond} {cond2} {flags2}
{flags2} list of accepted compound flags on following
word. Use . to accept all.
Then you can define two items that only differ in the compound flags
for the following word:
CMP m 0 - . . x
CMP m 0 + . . y
In words: if an x word follows an m word, insert a dash; if a y word
follows, insert a plus.
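A small Python sketch of how the extra {flags2} field could drive the choice
between items. The names CmpItem2, parse_cmp2 and select_item are hypothetical,
and the {cond} and {cond2} fields are stored but not checked here, to keep the
example short.

```python
from typing import List, NamedTuple, Optional, Set

class CmpItem2(NamedTuple):
    flag: str
    strip: str
    add: str
    cond: str
    cond2: str
    flags2: str

def parse_cmp2(line: str) -> CmpItem2:
    """Parse "CMP {flag} {strip} {add} {cond} {cond2} {flags2}"."""
    keyword, flag, strip, add, cond, cond2, flags2 = line.split()
    if keyword != "CMP":
        raise ValueError("not a CMP item: " + line)
    return CmpItem2(flag, strip, add, cond, cond2, flags2)

def select_item(items: List[CmpItem2], lead_flag: str,
                follow_flags: Set[str]) -> Optional[CmpItem2]:
    """Return the first item for lead_flag that accepts the following word."""
    for item in items:
        if item.flag != lead_flag:
            continue
        # "." accepts every compound flag on the following word.
        if item.flags2 == "." or set(item.flags2) & follow_flags:
            return item
    return None

items = [parse_cmp2("CMP m 0 - . . x"), parse_cmp2("CMP m 0 + . . y")]
print(select_item(items, "m", {"x"}).add)  # -
print(select_item(items, "m", {"y"}).add)  # +
```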
Back to the ".spl" file format that I'm using for Vim. A few advantages:
- The file size is smaller than the .dic file used for input.
The size reduction goes up to 88%. For most languages it's about 50%.
Hebrew works best:
3103107 he_IL.dic
369213 he.iso-8859-8.spl
- Loading is much faster, since all pre-processing has been done
already.
- For nearly all languages affixes are expanded into all possible words,
thus conditions are checked when making the .spl file, not when
checking the words. Great speed increase.
- Works for multi-byte encodings, most notably utf-8. For Vim I
generate .spl files in multiple encodings to avoid the conversion at
runtime, but that is not required. Only using utf-8 .spl files is an
attractive method if you don't mind the runtime conversions.
- One .spl file can define multiple regions. For English there is one
.spl file that covers "en" (all regional words allowed), "en_ca",
"en_us", "en_au", "en_nz" and "en_gb".
There is one disadvantage: the ".spl" file must be generated from
the word list before it can be used.
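The trie with tail compression mentioned at the start can be illustrated
roughly as follows. This Python sketch is not the .spl layout (that is
described in the comments of spell.c); it only shows the general idea, assumed
here to be that identical tails of the trie are stored once and shared. All
function names are made up for the example.

```python
def build_trie(words):
    """Plain trie as nested dicts; the "" key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def compress(node, seen):
    """Replace identical subtrees by one shared object, working bottom-up."""
    for ch in list(node):
        node[ch] = compress(node[ch], seen)
    # Children are already shared, so their identity describes their content.
    key = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return seen.setdefault(key, node)

def count_nodes(node, visited=None):
    """Count distinct node objects reachable from node."""
    visited = set() if visited is None else visited
    if id(node) in visited:
        return 0
    visited.add(id(node))
    return 1 + sum(count_nodes(child, visited) for child in node.values())

def contains(node, word):
    """Look up a word; works the same on the plain and the shared trie."""
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return "" in node

words = ["tomato", "onion", "soup", "tomatosoup", "onionsoup"]
plain = build_trie(words)
shared = compress(build_trie(words), {})
print(count_nodes(plain), count_nodes(shared))
print(contains(shared, "onionsoup"))  # True
```

Lookup cost is unchanged by the sharing; only the number of stored nodes goes
down, which is what keeps the file small.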
If you are interested in the Vim code, you can find the latest version
in CVS (the version with Myspell compatible compound flags is going to
be checked in tonight):
http://cvs.sourceforge.net/viewcvs.py/vim/vim7/src/spell.c
The comments at the start explain how the trie structure is used.
Look in the on-line help to see what extra items I added:
http://cvs.sourceforge.net/viewcvs.py/vim/vim7/runtime/doc/spell.txt
A few that you may want to take over:
- SLASH defines a character that replaces the slash. Useful to be able
to define "TCP/IP" as a word.
- Keep-case words defined with a KEP item. Useful for words that must
not be capitalised.
- Rare words defined with a RAR item. These are highlighted
differently and less attractive for suggestions.
- Bad words defined with BAD, e.g. to mark "the the" as bad.
Comments are welcome! Vim 7 is still in alpha testing, I can change
anything.
- Bram
PS. If you want to try out Vim 7 spelling in action, here are
instructions on installing it: http://www.vim.org/develop.php
--
Never go to the toilet in a paperless office.
/// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net \\\
/// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ Project leader for A-A-P -- http://www.A-A-P.org ///
\\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///