Re: Bug report : Spell checking doesn't know about HTML entities
Bram Moolenaar wrote:
> Tony Mechelynck wrote:
>> In languages using accented letters, the Vim spell checker doesn't recognise HTML entities (in HTML text): for example, the letters outside of the &...; entities are highlighted as spellBad (after :set spell spelllang=fr) in the following French words:
>>
>>     o&ugrave;                        meaning: où (where)
>>     apr&egrave;s                     après (after)
>>     c&eacute;r&eacute;monie          cérémonie (ceremony)
>>     courrou&ccedil;a                 courrouça ([he] angered)
>>     d&eacute;sesp&eacute;r&eacute;   désespéré (desperate)
>>     n&eacute;cessaire                nécessaire (necessary)
>>     ann&eacute;e                     année (year)
>>     etc.
>>
>> They are perfectly valid French words, if one takes into account the following equivalences:
>>
>>     &ugrave; = ù
>>     &egrave; = è
>>     &eacute; = é
>>     &ccedil; = ç
>>     etc.
>>
>> I don't know how to solve the problem; maybe an interpretation layer to resolve the entities between the HTML text and the French (or other non-English language) dictionary?
>
> Well, words with HTML things in them are NOT French words. Why don't you use utf-8 encoded HTML?

I started that particular site some years ago, in 7-bit ASCII plus entities. I'm loath to change it now, and risk making it incompatible with some older browsers. It already holds quite a bit of text.

I disagree with the statement that these words are not French words. In an HTML file, where HTML syntax must be taken into account, they are.

> If you really want to recognize these words, you could take the French dictionary, do a global replace and build a spell file from that.

Actually, I don't use spell (I am blessed with a good sense of orthography); but I wondered if there couldn't (someday) be a solution for people who don't share the same blessing. The proposed solution would mean creating an additional spell file, slightly larger than the French dictionary, for use only with HTML text. I'm not convinced of such a solution's viability, especially since it would have to be repeated for German, Swedish, Turkish, Polish, etc., etc., etc. Maybe even for words like risqué and garçon in English.
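For illustration only, the "global replace" step suggested above might look like the following Python sketch. It assumes the dictionary is available as a plain list of UTF-8 words; the helper name `to_entities` and the sample word list are hypothetical, and building the actual Vim spell file from the result is not covered here.

```python
from html.entities import codepoint2name

def to_entities(word):
    """Replace each non-ASCII character by its named HTML entity, if one exists.

    Hypothetical helper: characters without a named entity are left unchanged.
    """
    out = []
    for ch in word:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name is not None:
            out.append("&" + name + ";")
        else:
            out.append(ch)
    return "".join(out)

# Sample words taken from the message above (a stand-in for a real dictionary).
words = ["où", "après", "cérémonie", "courrouça", "année"]
converted = [to_entities(w) for w in words]
print(converted)
```

Every language would need its own converted word list, which is the maintenance cost objected to above.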
> You'll have to check if using & and ; in the middle of a word is causing trouble. Adding them to word characters will probably create different problems.

The semicolon can also be a literal semicolon, which is a punctuation mark and not a word character, and can be used as such after a word with no intervening space (or with &nbsp; preceding it, depending on typesetting conventions). The case of the ampersand is simpler: to obtain a true ampersand in the rendered text, one must use one of &amp; (symbolic entity), &#38; (decimal entity) or &#x26; (hex entity) in the HTML.

Best regards,
Tony.
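A minimal sketch of the "interpretation layer" idea from this message, written in Python rather than inside Vim: decode the entities first, then split the result into words, so that neither & nor ; has to become a word character. The function name `words_for_spellcheck` is hypothetical, and the sketch ignores HTML tags and the problem of mapping word positions back to the original source text.

```python
import html
import re

def words_for_spellcheck(html_text):
    """Decode character references, then return runs of letters as words."""
    decoded = html.unescape(html_text)
    # [^\W\d_]+ matches one or more Unicode letters (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", decoded)

sample = "c&eacute;r&eacute;monie; apr&egrave;s"
print(words_for_spellcheck(sample))   # ['cérémonie', 'après']

# The three spellings of a true ampersand all decode to the same character.
print(html.unescape("&amp; &#38; &#x26;"))   # & & &
```

Because decoding happens before tokenising, the semicolon that closes an entity disappears, while the semicolon used as punctuation remains a separator.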
Re: Bug report : Spell checking doesn't know about HTML entities
François Pinard wrote:
> [Bram Moolenaar]
>> Tony Mechelynck wrote:
>>> In languages using accented letters, the Vim spell checker doesn't recognise HTML entities (in HTML text) [...]
>>
>> You'll have to check if using & and ; in the middle of a word is causing trouble. Adding them to word characters will probably create different problems.
>
> Character entities come from the old days when people were still trying to salvage the 8th bit of each byte, on communication channels, to convey byte parity. And also, whatever justification people may invent, to protect their laziness about using tools able to do more than ASCII.

They also bypass compatibility problems for users who have to upload HTML pages to servers where they don't control the headers which will be sent with the HTML. (Yes, now I know about the BOM and the META HTTP-EQUIV=Content-Type tag, but the former isn't mentioned and the latter is only mentioned but not explained, in the books I have about HTML.)

Even now, email channels aren't guaranteed to be able to convey 8-bit text other than by downgrading it to 7-bit by means of conversion schemes like quoted-printable or base64: some servers are 8-bit-compliant, others still aren't. In the email I get, I sometimes notice that the body has been autoconverted between 8-bit, quoted-printable and base64 by my ISP's routers, with no apparent rule to such behaviour.

> One property of character entities which is apparently not so well known (or maybe that property was withdrawn since then) is that the semicolon is optional. It is only mandatory where ambiguity would otherwise arise (for example, when a letter follows, a fairly common case after all).

That property is not part of the present rules; it is obsolete and deprecated: ce n'est pas la règle, c'est une tolérance (it is not the rule, it is a tolerance). It is only recognised for downward compatibility; IIUC, it does not apply to XHTML.
The semicolon has of course always been mandatory when the entity is immediately followed by a letter or semicolon (or by a digit, but that is rarer).

> I presume that if software (or people) generating HTML were sparing those semicolons wherever they may be spared, a lot of other software would break, we would get a riot against people following standards :-).

I suppose that's why the most recent standards require the semicolons.

Best regards,
Tony.
--
Everything is worth precisely as much as a belch, the difference being that a belch is more satisfying.
    -- Ingmar Bergman
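The semicolon tolerance discussed in this message can still be observed in some decoders. As a hedged illustration, Python's `html.unescape` is documented as following the HTML 5 rules for valid and invalid character references, which may differ in detail from the older rules the thread refers to. Under those rules a small set of legacy entity names is accepted without the closing semicolon, while other names are not.

```python
import html

# With the semicolon: the normal, standard-conforming form.
with_semicolon = html.unescape("caf&eacute; au lait")

# Without the semicolon, followed by a space: tolerated for legacy names.
without_semicolon = html.unescape("caf&eacute au lait")

# A name outside the legacy set is left untouched when the semicolon is missing.
not_tolerated = html.unescape("wait&hellip")

print(with_semicolon)
print(without_semicolon)
print(not_tolerated)
```

A spell-checking layer built on such a decoder would inherit this tolerance, so it is safer for generated HTML to always include the semicolon.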