----- Original Message -----
From: "Kevin Atkinson" <[EMAIL PROTECTED]>
To: "Pablo Saratxaga" <[EMAIL PROTECTED]>
Cc: <[email protected]>
Sent: Tuesday, October 25, 2005 9:42 PM
Subject: Re: [Aspell-user] small bug: two following non alpha characters
> On Tue, 25 Oct 2005, Pablo Saratxaga wrote:
>
> > I found a bug in the latest versions of aspell that wasn't there previously.
> > It probably doesn't show in the majority of languages, but
> > it shows in Walloon, as you can have words like "hait-l'-ovraedje";
> > note the apostrophe followed by a hyphen.
>
> This is not a bug but a limitation of Aspell. Aspell cannot handle the
> case of two "middle" characters in a row. Aspell 0.50 accepted the word
> when creating a dictionary, but it would never be able to check such a word,
> since the word will always be split into something like "hait-l"
> "ovraedje". Aspell 0.60 checks that words are valid before accepting them.

Hi,

Back in August I was trying to make my program work with Unicode and the
koi8-r character set. One of the problems was tokenizing the text into words.
It seemed aspell was treating all character sets as ASCII. The speller object
does have a language member, and the language member does know the
characteristics of each character in the character set.

What are the characteristics of the apostrophe and dash in your character set?
Might aspell make use of those character-set-specific characteristics to
tokenize "hait-l'-ovraedje" as one word?

In my port, I added functions to find the offset to the next word and to find
the length of the next word. Currently, it tokenizes based only on the alpha
characteristic. I'm not sure that is proper in all cases, and I'm open to other
ideas on how this should be done. I can submit a patch to aspell, if that would
be helpful.

Best regards,
Gary Setter
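
To make the idea concrete, here is a minimal sketch in Python of the two functions described above (offset to the next word, length of the next word), extended with a set of "middle" characters. It is not Aspell's code and not the code from the port. The names MIDDLE_CHARS, next_word_offset, next_word_length, tokenize and the allow_middle_runs flag are all made up for illustration, and the sketch assumes that the apostrophe and hyphen are the only middle characters and that a middle character must sit between two letters.

```python
# Hypothetical table of "middle" characters for one character set.
# Assumption for this sketch: only apostrophe and hyphen qualify.
MIDDLE_CHARS = {"'", "-"}


def next_word_offset(text, start=0):
    """Return the index of the first letter at or after start, or -1 if none."""
    for i in range(start, len(text)):
        if text[i].isalpha():
            return i
    return -1


def next_word_length(text, offset, allow_middle_runs=False):
    """Return the length of the word starting at offset (offset must be a letter).

    A run of middle characters is kept inside the word only if it is followed
    by a letter. Runs longer than one character end the word unless
    allow_middle_runs is True.
    """
    i = offset
    n = len(text)
    while i < n:
        if text[i].isalpha():
            i += 1
            continue
        # Measure the run of middle characters starting at i.
        j = i
        while j < n and text[j] in MIDDLE_CHARS:
            j += 1
        run = j - i
        if run == 0:
            break  # neither a letter nor a middle character
        if run > 1 and not allow_middle_runs:
            break  # two middle characters in a row split the word
        if j >= n or not text[j].isalpha():
            break  # trailing middle characters are not part of the word
        i = j
    return i - offset


def tokenize(text, allow_middle_runs=False):
    """Split text into words using the two helper functions above."""
    words = []
    pos = next_word_offset(text, 0)
    while pos != -1:
        length = next_word_length(text, pos, allow_middle_runs)
        words.append(text[pos:pos + length])
        pos = next_word_offset(text, pos + length)
    return words


print(tokenize("hait-l'-ovraedje"))
print(tokenize("hait-l'-ovraedje", allow_middle_runs=True))
```

With the default setting the sketch splits the Walloon word into "hait-l" and "ovraedje", the kind of split described in the quoted reply. With allow_middle_runs=True the word stays in one piece. Whether such a rule is appropriate for every language is a separate question.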
