Re: [fpc-devel] fpdoc and unicode characters
On Thu, 14 Aug 2008, Graeme Geldenhuys wrote:

> Hi,
>
> In researching how to type Unicode characters on different platforms, I came across an interesting argument regarding Unicode characters and HTML. The argument might apply to fpdoc documentation (xml) files as well—hence the reason for this post.
>
> With W3C embracing UTF-8 as the de facto standard for HTML pages, do we still need to escape characters like ampersand [’ U+2019] to [&amp;] etc.? Unicode has been around for some time now, so surely all half-decent software should be able to read and display the actual character correctly by now (sensitive subject for FPC and Delphi at the moment), instead of having to bother with the escaped version.
>
> How does this argument fit with XML, which also uses UTF-8 as the de facto standard encoding? And seeing that fpdoc uses XML for the documentation files, can I use the actual Unicode characters in my fpdoc documentation, or must I still stick with the—what now seems to be outdated—escaped method?
>
> BTW: These are the characters I was interested in.
>
>   — (U+2014): emphasis dash
>   … (U+2026): horizontal ellipsis
>   ’ (U+2019): right single quotation
>   “ (U+201C): left double quotation
>   ” (U+201D): right double quotation
>   ― (U+2015): quotation dash (introducing quoted text)

Fixed, with the help of Sergei Gorelkin. (bug id 11881)

Michael.
___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 1:14 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
>> How does this argument fit with XML, which also uses UTF-8 as the de facto standard encoding? And seeing that fpdoc uses XML for the documentation files, can I use the actual Unicode characters in my fpdoc documentation, or must I still stick with the - what now seems to be outdated - escaped method?
>
> Depends. Is & a steering character in all of XML, or only the xhtml like standards?

I think only XHTML.

But what are fpdoc's xml files? Pure XML, XHTML or some custom/hybrid format? The layout of fpdoc's files seems to be XML, but the documentation content seems to be some hybrid HTML - hence the confusion with what is allowed!

Anybody know the rules of strict XML files and Unicode? Can I use Unicode characters as data in XML nodes? I would imagine I may, because most well-formed XML files specify UTF-8 as the encoding type.

Also, something I think has been resolved in recent versions: older 'makeskel' versions did not include the encoding in the generated .xml file. So what are we supposed to treat such files' encoding as? Default to W3C standards and assume UTF-8?

LCL and fpGUI's fpdoc documentation (mostly) has no encoding specified in the .xml files. FPC's documentation specifies ISO8859-1 as the encoding type, though I found one file (dateutils.xml) in the FPC docs that hasn't got an encoding (but my doc update is out of date).

Regards,
 - Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
Re: [fpc-devel] fpdoc and unicode characters
Date: Thu, 14 Aug 2008 13:44:14 +0200
From: Graeme Geldenhuys [EMAIL PROTECTED]

> But what are fpdoc's xml files? Pure XML, XHTML or some custom/hybrid format? The layout of fpdoc's files seems to be XML, but the documentation content seems to be some hybrid HTML - hence the confusion with what is allowed!

Well, what's the DTD saying about it? ;)

> Anybody know the rules of strict XML files and Unicode? Can I use Unicode characters as data in XML nodes?

As long as the encoding doesn't say otherwise, yes.

> I would imagine I may, because most well-formed XML files specify UTF-8 as the encoding type.

Yes, you can.

> Also, something I think has been resolved in recent versions: older 'makeskel' versions did not include the encoding in the generated .xml file. So what are we supposed to treat such files' encoding as? Default to W3C standards and assume UTF-8?

Errmm, yes?

> LCL and fpGUI's fpdoc documentation (mostly) has no encoding specified in the .xml files. FPC's documentation specifies ISO8859-1 as the encoding type, though I found one file (dateutils.xml) in the FPC docs that hasn't got an encoding (but my doc update is out of date).

Well, as long as the content is English, it doesn't matter too much; UTF-8 is fully compatible with 7-bit ASCII. ;)

If you're unsure about the encoding, stick to the &#x...; Unicode entities; that way you can encode anything in plain, portable 7-bit ASCII.

Vinzent.
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 1:51 PM, Vinzent Höfler [EMAIL PROTECTED] wrote:
>> But what are fpdoc's xml files? Pure XML, XHTML or some custom/hybrid format? The layout of fpdoc's files seems to be XML, but the documentation content seems to be some hybrid HTML - hence the confusion with what is allowed!
>
> Well, what's the DTD saying about it? ;)

Errmm???

>> Also, something I think has been resolved in recent versions: older 'makeskel' versions did not include the encoding in the generated .xml file. So what are we supposed to treat such files' encoding as? Default to W3C standards and assume UTF-8?
>
> Errmm, yes?

Then we have a problem.

>> LCL and fpGUI's fpdoc documentation (mostly) has no encoding specified in the .xml files. FPC's documentation specifies ISO8859-1 as the encoding type, though I found one file (dateutils.xml) in the FPC docs that hasn't got an encoding (but my doc update is out of date).
>
> Well, as long as the content is English, it doesn't matter too much; UTF-8 is fully compatible with 7-bit ASCII. ;) If you're unsure about the encoding, stick to the &#x...; Unicode entities; that way you can encode anything in plain, portable 7-bit ASCII.

I just tried that and it failed miserably!

Steps to reproduce:
1. Make sure the fpdoc .xml file specifies an encoding of UTF-8. [By the way, I think this encoding gets ignored totally.]
2. Type Unicode characters in any format, actual or escaped.
3. Generate HTML documentation with fpdoc.

Problems:
1. The generated HTML always specifies encoding ISO8859-1! So why bother specifying the encoding in the .xml file??? Is the encoding in the xml file actually used anywhere?
2. Actual and escaped Unicode characters end up being '??' garbage characters in the generated HTML.

Regards,
 - Graeme -
Re: [fpc-devel] fpdoc and unicode characters
Graeme Geldenhuys wrote:
> On Thu, Aug 14, 2008 at 1:14 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
>>> How does this argument fit with XML, which also uses UTF-8 as the de facto standard encoding? And seeing that fpdoc uses XML for the documentation files, can I use the actual Unicode characters in my fpdoc documentation, or must I still stick with the - what now seems to be outdated - escaped method?
>>
>> Depends. Is & a steering character in all of XML, or only the xhtml like standards?
>
> I think only XHTML.

XML too. In XML, you *must* escape the ampersand (U+0026) and the less-than sign (U+003C). The greater-than sign (U+003E) must also be escaped if it is preceded by a ']]' sequence. Additionally, in attribute values, quotes (U+0022) must be escaped if they are used as value delimiters (the other option is to delimit values with apostrophes (U+0027)).

Here I mean the XML file, not the DOM tree. You may freely use the mentioned characters in plain text while manipulating the DOM; the writer will escape them on output.

> But what are fpdoc's xml files? Pure XML, XHTML or some custom/hybrid format? The layout of fpdoc's files seems to be XML, but the documentation content seems to be some hybrid HTML - hence the confusion with what is allowed!

XHTML is XML with a defined 'vocabulary' (DTD). These formats have no character-level differences.

> Anybody know the rules of strict XML files and Unicode? Can I use Unicode characters as data in XML nodes? I would imagine I may, because most well-formed XML files specify UTF-8 as the encoding type.
>
> Also, something I think has been resolved in recent versions: older 'makeskel' versions did not include the encoding in the generated .xml file. So what are we supposed to treat such files' encoding as? Default to W3C standards and assume UTF-8?
>
> LCL and fpGUI's fpdoc documentation (mostly) has no encoding specified in the .xml files. FPC's documentation specifies ISO8859-1 as the encoding type, though I found one file (dateutils.xml) in the FPC docs that hasn't got an encoding (but my doc update is out of date).

W3C demands that an XML file without an encoding label be treated as UTF-8 (unless it has a UTF-16 BOM, in which case it should be treated as UTF-16). Therefore UTF-8 labeling is optional.

In older times, makeskel used to write an 'ISO8859-1' label, which btw is invalid (the IANA-recognized names are ISO-8859-1 and ISO_8859-1). Later, when the parser got more compliant, the labeling was removed. The parser has a workaround to understand the ISO8859-1 labeling. The XML writer always produces UTF-8 encoding and writes no label.

To summarize: Unicode can be used in fpdoc xml files. If the file has an ISO8859-1 encoding label, it should be removed or replaced with a UTF-8 label. The output stages of fpdoc may or may not have problems with Unicode - that requires additional research.

Sergei
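[Editor's illustration] The escaping and default-encoding rules Sergei describes can be checked with any conforming XML toolkit. The sketch below uses Python's standard library purely as a neutral example; it is not fpdoc or fcl-xml code, and the sample strings are made up for the demonstration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Character data: '&' and '<' must be escaped. escape() also escapes '>',
# which covers the ']]>' case mentioned above.
escaped = escape("a < b & c ]]> d")
print(escaped)

# Attribute values: quoteattr() chooses a delimiter. Because this sample
# value contains double quotes but no apostrophes, it is wrapped in
# apostrophes instead of having its quotes escaped.
attr = quoteattr('say "hi"')
print(attr)

# A document with no XML declaration and no BOM is decoded as UTF-8.
doc = "<descr>Wait\u2026 done</descr>".encode("utf-8")
root = ET.fromstring(doc)
print(root.text == "Wait\u2026 done")
```

The last check only shows that the parser side accepts literal Unicode without an encoding label; it says nothing about what an output stage does with the character afterwards.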
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 2:24 PM, Michael Van Canneyt [EMAIL PROTECTED] wrote:
> You can specify --charset=UTF8 on the command line of fpdoc, that will set

OK, but it should be --charset=UTF-8. Thanks.

> the encoding of the generated HTML page, but it does NO conversion whatsoever.

Well, this is not entirely true then! Something is screwing up my documentation content.

--- xml file ---
<?xml version="1.0" encoding="UTF-8"?>
<fpdoc-descriptions>
<package name="CoreLib">
<module name="gfx_UTF8utils">
<short></short>
<descr>Is this character: &#x2026; displayed correctly?
</descr>
[snip]
--- end ---

U+2026 is the Horizontal Ellipsis character. The following page has more details and shows the different encodings in HTML etc...
http://www.fileformat.info/info/unicode/char/2026/index.htm

--- HTML generated output ---
Overview
Is this character: ? displayed correctly?
--- end ---

Exactly as is! I viewed the HTML source and the '...' character is not encoded or anything. It became an actual question mark character.

Regards,
 - Graeme -
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 2:40 PM, Graeme Geldenhuys [EMAIL PROTECTED] wrote:
> --- HTML generated output ---
> Overview
> Is this character: ? displayed correctly?
> --- end ---
>
> Exactly as is! I viewed the HTML source and the '...' character is not encoded or anything. It became an actual question mark character.

I forgot the actual HTML...

--- generated HTML source ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<title>Reference for unit 'gfx_UTF8utils'</title>
[...snip...]
<h2>Overview</h2>
<p>Is this character: ? displayed correctly?
</p>
</body>
</html>
--- end ---

Regards,
 - Graeme -
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 2:24 PM, Sergei Gorelkin [EMAIL PROTECTED] wrote:
> To summarize: Unicode can be used in fpdoc xml files. If the file has an ISO8859-1 encoding label, it should be removed or replaced with a UTF-8 label.

I'll assume this is all in theory then. :-) See my previous reply. Even if I escape a Unicode character as follows:

  &#x2026;

it becomes a literal '?' question mark character in the generated HTML source output.

Regards,
 - Graeme -
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 2:46 PM, Graeme Geldenhuys [EMAIL PROTECTED] wrote:
> On Thu, Aug 14, 2008 at 2:24 PM, Sergei Gorelkin [EMAIL PROTECTED] wrote:
>> To summarize: Unicode can be used in fpdoc xml files. If the file has an ISO8859-1 encoding label, it should be removed or replaced with a UTF-8 label.
>
> I'll assume this is all in theory then. :-) See my previous reply. Even if I escape a Unicode character as follows: &#x2026; it becomes a literal '?' question mark character in the generated HTML source output.

Here is another example:

--- part of fpdoc xml file ---
<module name="gfx_UTF8utils">
<short></short>
<descr>Is this character: &lt;&#x2026;&gt; displayed correctly?
</descr>
--- end ---

... and the generated html source with UTF-8 encoding ...

--- html source ---
<h2>Overview</h2>
<p>Is this character: &lt;?&gt; displayed correctly?
</p>
</body>
</html>
--- end ---

NOTE: The < and > characters went through fine, but the ellipsis character did not.

Regards,
 - Graeme -
Re: [fpc-devel] fpdoc and unicode characters
Date: Thu, 14 Aug 2008 14:12:33 +0200
From: Graeme Geldenhuys [EMAIL PROTECTED]

> On Thu, Aug 14, 2008 at 1:51 PM, Vinzent Höfler [EMAIL PROTECTED] wrote:
>>> But what are fpdoc's xml files? Pure XML, XHTML or some custom/hybrid format? The layout of fpdoc's files seems to be XML, but the documentation content seems to be some hybrid HTML - hence the confusion with what is allowed!
>>
>> Well, what's the DTD saying about it? ;)
>
> Errmm???

The Document Type Definition? Well, I suppose, there is none...

>> If you're unsure about the encoding, stick to the &#x...; Unicode entities; that way you can encode anything in plain, portable 7-bit ASCII.
>
> I just tried that and it failed miserably!

Huh?

> Steps to reproduce:
> 1. Make sure the fpdoc .xml file specifies an encoding of UTF-8. [By the way, I think this encoding gets ignored totally.]

Yes, last time I checked, it did.

> 2. Actual and escaped Unicode characters end up being '??' garbage characters in the generated HTML.

So the entities are probably not resolved correctly. I suppose someone just connected all the text nodes without bothering to resolve any contained entities. Things like &#x2462; should work regardless of the chosen encoding, as those are Unicode entities.

Vinzent.
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 2:54 PM, Vinzent Höfler [EMAIL PROTECTED] wrote:
>>> Well, what's the DTD saying about it? ;)
>>
>> Errmm???
>
> The Document Type Definition? Well, I suppose, there is none...

I know what DTD means; I meant there is none I know of. :)

>> 2. Actual and escaped Unicode characters end up being '??' garbage characters in the generated HTML.
>
> So the entities are probably not resolved correctly. I suppose someone just connected all the text nodes without bothering to resolve any contained entities. Things like &#x2462; should work regardless of the chosen encoding, as those are Unicode entities.

I tried encoding it in decimal and hexadecimal notation and it always ends up being '?'. So Michael's theory that the documentation content gets copied as-is is not quite true. My example using &lt;&#x2026;&gt; and becoming &lt;?&gt; in the .html file proves that somewhere fpdoc is doing something with the documentation content.

Regards,
 - Graeme -
Re: [fpc-devel] fpdoc and unicode characters
In our previous episode, Graeme Geldenhuys said:
>> To summarize: Unicode can be used in fpdoc xml files. If the file has an ISO8859-1 encoding label, it should be removed or replaced with a UTF-8 label.
>
> I'll assume this is all in theory then. :-) See my previous reply. Even if I escape a Unicode character as follows: &#x2026; it becomes a literal '?' question mark character in the generated HTML source output. Here is another example:

Sounds like a browser font issue, rather than an encoding issue.
Re: [fpc-devel] fpdoc and unicode characters
Date: Thu, 14 Aug 2008 15:30:51 +0200
From: Graeme Geldenhuys [EMAIL PROTECTED]

> On Thu, Aug 14, 2008 at 3:06 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
>>> it becomes a literal '?' question mark character in the generated HTML source output. Here is another example:
>>
>> Sounds like a browser font issue, rather than an encoding issue.
>
> No, but that was my first thought as well. That is why I viewed the actual generated HTML file that fpdoc produced. It has a literal ? character in the .html file. I used Midnight Commander's editor and GNOME's gedit to view the .html file.

Try again with a Unicode entity below 256 (like &#xFC; for instance). If that works, the reason is probably here:

-- 8< -- snip -- rtl/inc/wstrings.inc --
procedure DefaultWide2AnsiMove(source:pwidechar;var dest:ansistring;len:SizeInt);
var
  i : SizeInt;
begin
  setlength(dest,len);
  for i:=1 to len do
   begin
     if word(source^)<256 then
      dest[i]:=char(word(source^))
     else
      dest[i]:='?';
     inc(source);
   end;
end;
-- 8< -- snip --

Vinzent.
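[Editor's illustration] The effect of the fallback routine quoted above can be mimicked in a few lines. This is a minimal sketch in Python, not the RTL code: `wide_to_ansi_lossy` is a hypothetical name, and it iterates over code points where the Pascal routine iterates over 16-bit code units (the two only differ for characters outside the Basic Multilingual Plane).

```python
def wide_to_ansi_lossy(text: str) -> bytes:
    """Keep characters with a code below 256, replace everything else by '?'."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        out.append(code if code < 256 else ord("?"))
    return bytes(out)


# U+00FC is below 256 and survives as a single byte.
print(wide_to_ansi_lossy("\u00fc"))    # b'\xfc'
# U+2026 is above 255 and is replaced, matching the '<?>' seen in the HTML.
print(wide_to_ansi_lossy("<\u2026>"))  # b'<?>'
```

If fpdoc's output path does go through such a conversion, this would explain why the test with an entity below 256 is a useful way to tell the two cases apart.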
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 3:42 PM, Sergei Gorelkin [EMAIL PROTECTED] wrote:
> Graeme Geldenhuys wrote:
>> No, but that was my first thought as well. That is why I viewed the actual generated HTML file that fpdoc produced. It has a literal ? character in the .html file. I used Midnight Commander's editor and GNOME's gedit to view the .html file.
>
> It looks like the problem is in the htmwrite.pp unit. It is AnsiString-based and therefore all Unicode gets simply stripped away and replaced by '?'. Maybe adding cwstring to the fpdoc uses clause can make things better for Linux with a UTF-8 locale.

That makes no sense, because I used the escaped Unicode character format, just like HTML does. So those characters (documentation content) should be copied as-is to the HTML. The following characters, interpreted by a web browser, will display a Unicode character, but on their own (as-is) they are valid ASCII characters:

  &#x2026;

&#x2026; should be treated exactly the same as &lt; or &gt; or &amp; when generating HTML output from fpdoc.

Regards,
 - Graeme -
Re: [fpc-devel] fpdoc and unicode characters
Date: Thu, 14 Aug 2008 16:06:14 +0200
From: Graeme Geldenhuys [EMAIL PROTECTED]

> That makes no sense, because I used the escaped Unicode character format, just like HTML does. So those characters (documentation content) should be copied as-is to the HTML.

I suspect those entities get parsed by the DOM unit as entities (which is generally the right thing to do) and then simply get lost in the transformation back to the byte stream (aka AnsiString).

> &#x2026; should be treated exactly the same as &lt; or &gt; or &amp; when generating HTML output from fpdoc.

Yes. But the latter have a character code below 256 (even below 128, so they're plain 7-bit ASCII).

Vinzent.
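[Editor's illustration] One way an output stage could avoid losing such characters is to re-escape anything outside the target charset as a numeric character reference when writing. The sketch below shows the idea in Python; `to_ascii_html` is a hypothetical helper, and nothing in this thread says that fpdoc was fixed this way.

```python
from xml.sax.saxutils import escape


def to_ascii_html(text: str) -> bytes:
    """Escape markup characters, then write non-ASCII as numeric references."""
    # escape() handles '&', '<' and '>'; the 'xmlcharrefreplace' error
    # handler turns every character the ASCII codec cannot encode into
    # a decimal reference such as &#8230;.
    return escape(text).encode("ascii", errors="xmlcharrefreplace")


print(to_ascii_html("<\u2026>"))  # b'&lt;&#8230;&gt;'
```

The result is pure 7-bit ASCII, so it displays correctly regardless of the charset declared in the generated page.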
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 4:12 PM, Vinzent Höfler [EMAIL PROTECTED] wrote:
> I suspect those entities get parsed by the DOM unit as entities (which is generally the right thing to do) and then simply get lost in the transformation back to the byte stream (aka AnsiString).
>
>> &#x2026; should be treated exactly the same as &lt; or &gt; or &amp; when generating HTML output from fpdoc.
>
> Yes. But the latter have a character code below 256 (even below 128, so they're plain 7-bit ASCII).

I think you have a point with the DOM unit parsing the documentation content. So we can safely say the actual content is NOT copied as-is!

  &#x2026;

If the above was interpreted as-is (with the rest of the content), it would be 8 ASCII characters, all with a character code below 256. No issues then!

So yes, I think I agree with you. Somewhere the above is being parsed, found to have a character code above 255 as a whole, and simply replaced with a ? character.

I simply found this confusing, because Michael is well versed in fpdoc, and when he said the content (not the XML tags) is copied as-is, I would not have envisioned any issues with Unicode escaped characters. Now we know better! :-)

I'll file a bug report in Mantis.

Regards,
 - Graeme -
Re: [fpc-devel] fpdoc and unicode characters
Graeme Geldenhuys wrote:
> I think you have a point with the DOM unit parsing the documentation content. So we can safely say the actual content is NOT copied as-is!
>
>   &#x2026;
>
> If the above was interpreted as-is (with the rest of the content), it would be 8 ASCII characters, all with a character code below 256. No issues then!
>
> So yes, I think I agree with you. Somewhere the above is being parsed,

It is parsed in the xmlreader. The DOM simply contains one single widechar, not these 8 chars. I suspect it is copied as-is from the DOM to the output file.

Vincent
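[Editor's illustration] Vincent's point, that the reader resolves the 8-character reference into a single character before the DOM ever sees it, is standard XML parser behaviour and easy to observe. The sketch below uses Python's ElementTree as a stand-in for the FPC xmlreader; the sample element is taken from the example earlier in the thread.

```python
import xml.etree.ElementTree as ET

source = "<descr>Is this character: &lt;&#x2026;&gt; displayed correctly?</descr>"
text = ET.fromstring(source).text

# The reference as typed in the file is 8 ASCII characters long ...
print(len("&#x2026;"))           # 8
# ... but after parsing, the tree holds the single character U+2026.
print("&#x2026;" in text)        # False
print(text.count("\u2026"))      # 1
# The &lt; and &gt; references are resolved in the same way.
print("<\u2026>" in text)        # True
```

Whatever writes the tree back out therefore has to deal with a real U+2026 character, not with the ASCII text of the reference.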
Re: [fpc-devel] fpdoc and unicode characters
On Thu, Aug 14, 2008 at 4:21 PM, Graeme Geldenhuys [EMAIL PROTECTED] wrote:
> I'll file a bug report in Mantis.

Reported as: http://bugs.freepascal.org/view.php?id=11881

Regards,
 - Graeme -