Re: [ql-users] dictionaries

Dilwyn Jones Tue, 02 Aug 2005 15:12:24 -0700

Many thanks, I'd figured out most of the .aff file meanings and
written a little SBASIC program to do the processing. I have to
manually enter rules for each letter key, because I was too lazy to
write a parser, but basically it worked first time apart from a little
typo which created some, err, interesting output files because I
forgot to take into account some files might have CR or CR+LF rather
than the LF I got on some of the first files. I wrote a few simple
utility functions for parsing the word list, like
EndsWith(any_of_these_characters$) and PrecededBy(any_of_these$) and
so on, which meant all I had to do was manually amend the rules in the
26 possible ones I allowed for. Those are written in such a way as to
only need one or two lines per rule to make them work. It's not the
most elegant program I've ever written but it works really fast, a
real tribute to SBASIC. The 25,000 word file expands to about 52,000
words now, as opposed to the 34k or so I got with my first fumbling
efforts.
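
To illustrate the line-ending problem mentioned above, here is a minimal
Python sketch (not the SBASIC program itself) of splitting a word list that
may use LF, CR or CR+LF. The function name split_words is made up for this
example.

```python
def split_words(text):
    """Split a word list into words, accepting LF, CR or CR+LF line ends."""
    # Turn CR+LF into LF first, so the lone-CR pass cannot create
    # an extra empty line out of each CR+LF pair.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop empty entries (blank lines, or the one after a trailing newline).
    return [line for line in normalised.split("\n") if line]


if __name__ == "__main__":
    print(split_words("cat\r\ndog\rbird\nfish\n"))
```

Doing the CR+LF replacement before the lone-CR one is the important detail;
the other way round, every CR+LF would turn into two line breaks.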


I don't know if you are interested in taking this any further, but I'd
gladly send you a copy of my program if you wish to tinker with it.

I don't have knowledge of unixy things so I wouldn't use unixy
parsing, searching and scanning tools which are available for the QL
as it would probably take me longer to learn them than to write
simple little procedures like my program which were good enough for
the job.

  dot (.)  matches any single character
  [...]    matches any character in the square brackets
  [^...]   matches any character NOT in the square brackets (note the
           caret (^) as the first character).
  char     matches exactly that character

Thought that was the case, glad to see it confirmed.
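
As a rough illustration of those four elements, the following Python sketch
parses a condition such as [^AEIOU]Y into one test per character and checks
a piece of a word against it. The names parse_condition, char_matches and
matches are invented for this example, and ignoring case is an assumption
of the sketch, not something the affix format is known to require.

```python
def parse_condition(condition):
    """Turn a condition such as '[^AEIOU]Y' into a list of per-character tests.

    Each test is a (negated, characters) pair; characters is None for a dot.
    """
    tests = []
    i = 0
    while i < len(condition):
        ch = condition[i]
        if ch == ".":
            tests.append((False, None))        # dot: any single character
            i += 1
        elif ch == "[":
            close = condition.index("]", i)    # ValueError if never closed
            body = condition[i + 1:close]
            negated = body.startswith("^")     # caret first means NOT
            if negated:
                body = body[1:]
            tests.append((negated, body.upper()))
            i = close + 1
        else:
            tests.append((False, ch.upper()))  # plain char: exactly that one
            i += 1
    return tests


def char_matches(test, ch):
    """Apply one (negated, characters) test to a single character."""
    negated, chars = test
    if chars is None:
        return True
    return (ch.upper() in chars) != negated


def matches(chunk, condition):
    """True if chunk has one character per test and every test passes."""
    tests = parse_condition(condition)
    if len(chunk) != len(tests):
        return False
    return all(char_matches(t, c) for t, c in zip(tests, chunk))


if __name__ == "__main__":
    for chunk, cond in [("ay", "[AEIOU]Y"), ("ty", "[AEIOU]Y"), ("ty", "[^AEIOU]Y")]:
        print(chunk, cond, matches(chunk, cond))
```

The chunk passed in would be the first or last few characters of a word,
depending on whether the rule is a prefix or a suffix rule.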

There is more to regexps, but that's all (I think) you need for
the affix files.  I suspect that the affix files are well defined
for each language. Try looking at the ispell web site for the other
language dictionaries - they'll have the relevant .aff files.  If
you want, I can send you the english.aff file, but (a) you can get
it from the ispell website, (b) I extracted the relevant info in my
last message.

I downloaded one to try, I've only tinkered with the English one so
far. Although the .AFF files were missing from the word lists I got,
they must have used the 'standard' English affix files because they
process correctly with that.

The way the rules work appears to be that, for prefixes, it matches the
first n characters.  As a dot (.) is specified, it matches any first
character, and then the rule says insert whatever.

I wasn't sure about the dot, luckily my routines assume no match
necessary before adding prefix.

For suffixes, it looks at the given end characters for a match, eg:

  [AEIOU]Y

matches any (word) string that ends "AY", "EY", "IY", "OY", "UY",
and then applies the rule.

The way I wrote my routines, it's actually processed as "Ends with Y"
and "Preceded by a or e or i or o or u". Amounts to the same thing,
slightly less efficient, but easier for me to manually amend the rules
procedures.
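
The "ends with" plus "preceded by" approach might look something like this
in Python. This is only a sketch: the SBASIC originals are not shown in this
message, so the parameter lists here (taking the word as an argument) and
the helper matches_vowel_y are assumptions made for the example.

```python
def ends_with(word, any_of_these):
    """True if the last character of word is one of any_of_these."""
    return word != "" and word[-1].lower() in any_of_these.lower()


def preceded_by(word, any_of_these):
    """True if the character before the last one is one of any_of_these."""
    return len(word) >= 2 and word[-2].lower() in any_of_these.lower()


def matches_vowel_y(word):
    # Same test as the condition [AEIOU]Y, split into two simpler checks.
    return ends_with(word, "y") and preceded_by(word, "aeiou")


if __name__ == "__main__":
    for w in ["play", "try", "KEY", "y"]:
        print(w, matches_vowel_y(w))
```

Each rule then only needs a line or two combining these checks, at the cost
of testing the word twice instead of once.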

The rules are simple.  If the first char is a minus (-) it specifies
the character(s) to remove up to a comma; then the characters to add
are specified.  They are specified in caps, so you may want to
convert to lowercase.
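
A minimal Python sketch of applying such a rule to the end of a word is
given below. The function name apply_suffix and the example rule strings
("-Y,IES", "S", "-E,ION") are illustrative assumptions, not copied from
any particular .aff file, and raising an error when the letters to remove
are not there is a choice made for this sketch.

```python
def apply_suffix(word, rule):
    """Apply a rule like '-Y,IES' (strip Y, add IES) or 'S' (just add S)."""
    rule = rule.strip()
    strip, add = "", rule
    if rule.startswith("-"):
        # Everything between the minus and the comma is removed first.
        strip, _, add = rule[1:].partition(",")
    # Rules are written in caps; work in lowercase here.
    strip = strip.strip().lower()
    add = add.strip().lower()
    if strip:
        if not word.lower().endswith(strip):
            raise ValueError("%r does not end with %r" % (word, strip))
        word = word[:len(word) - len(strip)]
    return word + add


if __name__ == "__main__":
    print(apply_suffix("fly", "-Y,IES"))
    print(apply_suffix("cat", "S"))
```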

My first attempts tend to avoid the issue of case. I haven't included
the code yet, but planned on looking at whether the root word is all
lower, mixed case or all upper case and change case of root and prefix
accordingly. I did write some of this earlier tonight when I got your
email and although it seems to work it needs a bit of testing
tomorrow. What tends to happen by adding prefixes is that the sort
order of the words is lost, so one of two solutions is possible:

1. accept it's lost and let QTYP do the sorting, then re-extract the
   words to get the sorted plain text list.
2. write the new-prefix words to a temporary file, sort that (as it's
   smaller) and merge them in order with the big file.


Laziness threatens to let option 1 do the work!
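
For comparison, the second option (sort only the new prefixed words, then
merge them into the already sorted list) could be sketched in Python as
follows. The name merge_new_words is made up, and the sketch assumes the
big list is already sorted ignoring case, which may not be how QTYP orders
its words.

```python
import heapq


def merge_new_words(sorted_words, new_words):
    """Merge unsorted new_words into sorted_words, keeping the sort order.

    Assumes sorted_words is already sorted case-insensitively.
    """
    # Only the (smaller) list of new words needs a full sort.
    new_sorted = sorted(new_words, key=str.lower)
    # heapq.merge walks both sorted lists once, picking the smaller head.
    return list(heapq.merge(sorted_words, new_sorted, key=str.lower))


if __name__ == "__main__":
    print(merge_new_words(["apple", "do", "zebra"], ["undo", "redo"]))
```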

google is my friend!  (If you're interested, I found it like this:
tried googling for 'dgs dgnsx' which found that 'ht://dig'
reference, which then led me to ispell on the htdig website.)

Google is everyone's friend I think, it certainly found all sorts of
word lists for me although I hadn't found iSpell.

May I suggest that you write your converter to read the conversion
from an .aff file.

Eventually, as it's a one-off job for the few dictionaries I plan on
doing, I may not get round to this, since the routines I wrote are
easily enough amended for the few times I'll need to run it, much
easier and less time consuming than writing a little parser for the
.AFF files for the sake of changing 20-30 lines of SBASIC as it stands
for rule changes.

If you're stuck, you could download the ispell tarball (about 620K)
as there's an unmucher in there (but it's integrated with the ispell
package, so (obviously) relies on initialisation elsewhere), along
with english.aff.

Not at the moment :-(

Also, you may have noticed the capitalisation of the words - along
with the normal first letter, ispell can handle various
capitalisations...intended for case matching in dictionary
searching.  If you're not bothered, just convert to lowercase.

...see above for case comments.
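
One possible way to make the affix follow the case of the root word is
sketched below in Python. The function name join_with_case is invented,
and the handling of mixed-case roots (leave the root alone, add the affix
in lowercase) is purely an assumption of this sketch, since the message
does not say what should happen there.

```python
def join_with_case(root, prefix="", suffix=""):
    """Attach prefix and/or suffix to root, following the root's case."""
    if root.isupper():
        # All-capitals root: the affix goes to capitals too.
        return prefix.upper() + root + suffix.upper()
    # All-lowercase or mixed-case root: keep the root, lowercase the affix.
    return prefix.lower() + root + suffix.lower()


if __name__ == "__main__":
    print(join_with_case("NATO", suffix="S"))
    print(join_with_case("happy", prefix="UN"))
    print(join_with_case("London", suffix="ER"))
```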

Wow, I've had fun with this! Not often I get a programming project
which I get so stuck into as this one! Rather more QTYP dictionaries
possible now than I thought. As there are so many out there, I can
offer more than one size for most languages probably.


Thanks again for your help.

--
Dilwyn Jones




_______________________________________________
QL-Users Mailing List
http://www.q-v-d.demon.co.uk/smsqe.htm
