Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-25 Thread Markus Schaber
Hi, Hannu,

Hannu Krosing wrote:

   Are you sure it's UCS-4 ? I've always thought that XML is what is given
   in the xml tag, and utf-8 if no charset is given.
  You have to distinguish between the supported charset, and the document
  encoding.
 UCS-4 and UTF-8 are both encodings for UNICODE
 see: http://en.wikipedia.org/wiki/UTF-32

Yes, I know.

The point I wanted to make was that the document encoding is independent
of the allowed charset (except that it has to cover a subset of it).

That is what XML entities were defined for.

So even in a document using LATIN-1 as its encoding, the charset is still
Unicode, which lets us use entities to write non-Latin-1 characters.
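
A minimal sketch of that point, in Python (the sample document and the names `source`, `latin1_bytes` and `parse_name` are made up for illustration, not taken from the PostgreSQL docs). The file is encoded as Latin-1, yet a numeric character reference still delivers a character that Latin-1 cannot encode:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Its *encoding* is Latin-1, so the e-acute
# is stored as the single byte 0xE9. The dotless i (U+0131) is not in
# Latin-1, so it is written as the character reference &#305; instead.
source = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<name>Volkan Yaz&#305;c&#305; (caf\u00e9)</name>"
)
latin1_bytes = source.encode("latin-1")


def parse_name(data: bytes) -> str:
    """Parse the document and return the text of its root element."""
    return ET.fromstring(data).text


name = parse_name(latin1_bytes)
# The parsed text is plain Unicode, whatever the file encoding was.
print(ascii(name))
```

The parser resolves the reference against the document character set (Unicode), not against the encoding named in the XML declaration.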

HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org





Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-25 Thread Markus Schaber
Hi, Bruce,

Bruce Momjian wrote:

 I don't think that any of our SGML documentation is actually in UCS-4
 encoding.
 The source files use nothing beyond plain ASCII (and should remain that
 way, IMHO) so there isn't any need to inquire very far into exactly what
 the toolchain thinks the document encoding is.  The issue at hand here
 is what the *output* character set is, which is to say the document
 character set if I have the jargon right.  That is the space over which
 we are permitted to use &-entities.
 
 Just for reference, if we could support UTF8, I was hoping to add
 non-Latin names as alternates to the ASCII versions, so we could have
 Japanese and Russian-lettered names in the release notes.  I thought it
 would be a nice touch.

We don't need UTF8 encoding for this. It's also possible using ASCII
encoding + &#4711; entities.

But we need the charset to be Unicode.
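
As a rough illustration of "ASCII encoding + entities" (a Python sketch; the function name `to_ascii_with_references` is invented here, and the name is the one from the release note discussed in this thread):

```python
def to_ascii_with_references(text: str) -> bytes:
    """Encode text as pure ASCII, replacing every non-ASCII character
    with a decimal numeric character reference such as &#305;."""
    # 'xmlcharrefreplace' is a standard Python codec error handler.
    # Note that it does not escape '&', '<' or '>' themselves.
    return text.encode("ascii", errors="xmlcharrefreplace")


encoded = to_ascii_with_references("Volkan Yaz\u0131c\u0131")
print(encoded)
```

The resulting bytes are plain ASCII, but a processor whose document character set is Unicode can still recover the original characters.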

HTH,
Markus
-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org





Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Peter Eisentraut
Alvaro Herrera wrote:
 On the other hand, I don't understand why DocBook would be Latin-1
 only. What would be the point of that limitation?  Some googling
 seems to reveal that people indeed use other charsets, UTF-8 in
 particular (but also Big5, Latin-2, etc), so apparently this isn't
 set in stone.  (I admit that they mainly talk about XML Docbook
 though).

DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Hannu Krosing
On a fine day, Sun, 2006-09-24 at 10:20, Peter Eisentraut wrote:
 Alvaro Herrera wrote:
  On the other hand, I don't understand why DocBook would be Latin-1
  only. What would be the point of that limitation?  Some googling
  seems to reveal that people indeed use other charsets, UTF-8 in
  particular (but also Big5, Latin-2, etc), so apparently this isn't
  set in stone.  (I admit that they mainly talk about XML Docbook
  though).
 
 DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

Are you sure it's UCS-4 ? I've always thought that XML is what is given
in the xml tag, and utf-8 if no charset is given.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Markus Schaber
Hi, Hannu,

Hannu Krosing wrote:

 Are you sure it's UCS-4 ? I've always thought that XML is what is given
 in the xml tag, and utf-8 if no charset is given.

You have to distinguish between the supported charset, and the document
encoding.

HTH,
Markus
-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org





Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread David Fetter
On Sun, Sep 24, 2006 at 10:20:22AM +0200, Peter Eisentraut wrote:
 Alvaro Herrera wrote:
  On the other hand, I don't understand why DocBook would be Latin-1
  only. What would be the point of that limitation?  Some googling
  seems to reveal that people indeed use other charsets, UTF-8 in
  particular (but also Big5, Latin-2, etc), so apparently this isn't
  set in stone.  (I admit that they mainly talk about XML Docbook
  though).
 
 DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

This sheds a new light on the XML vs. SGML thing you said before.
While it's not necessarily compelling enough to force a switch, it is
a substantive difference that we can actually see.

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778    AIM: dfetter666
  Skype: davidfetter

Remember to vote!



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Hannu Krosing
On a fine day, Sun, 2006-09-24 at 14:56, Markus Schaber wrote:
 Hi, Hannu,
 
 Hannu Krosing wrote:
 
  Are you sure it's UCS-4 ? I've always thought that XML is what is given
  in the xml tag, and utf-8 if no charset is given.
 
 You have to distinguish between the supported charset, and the document
 encoding.

UCS-4 and UTF-8 are both encodings for UNICODE 

see: http://en.wikipedia.org/wiki/UTF-32
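
To make that concrete, here is a small Python sketch (the variable names are illustrative only) encoding the same Unicode character, the dotless i, in both ways:

```python
# One abstract character from the Unicode repertoire: U+0131, dotless i.
dotless_i = "\u0131"

# Two different encodings of that same character.
utf8_bytes = dotless_i.encode("utf-8")      # variable width, 2 bytes here
ucs4_bytes = dotless_i.encode("utf-32-be")  # fixed width, always 4 bytes

print(utf8_bytes.hex(), ucs4_bytes.hex())

# Decoding either byte sequence gives back the identical character.
same_character = utf8_bytes.decode("utf-8") == ucs4_bytes.decode("utf-32-be")
```

The character set is the same in both cases; only the byte representation differs.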


 HTH,
 Markus
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Andrew Dunstan



Hannu Krosing wrote:
 On a fine day, Sun, 2006-09-24 at 14:56, Markus Schaber wrote:
  Hi, Hannu,

  Hannu Krosing wrote:
   Are you sure it's UCS-4 ? I've always thought that XML is what is given
   in the xml tag, and utf-8 if no charset is given.

  You have to distinguish between the supported charset, and the document
  encoding.

 UCS-4 and UTF-8 are both encodings for UNICODE

 see: http://en.wikipedia.org/wiki/UTF-32


If we want to quote references, we should quote the XML standard. For 
example, see here to see the exact charset supported by XML: 
http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.


A little lower down it defines the encodings allowed too.


cheers

andrew



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Peter Eisentraut
Andrew Dunstan wrote:
 If we want to quote references, we should quote the XML standard. For
 example, see here to see the exact charset supported by XML:
 http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.

The actual cause of the processing problems we have been seeing is the
character set definitions in the SGML declarations of the respective
document types.

For DocBook SGML 4.2:

CHARSET

BASESET
  ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0
DESCSET
    0   9   UNUSED
    9   2   9
   11   2   UNUSED
   13   1   13
   14  18   UNUSED
   32  95   32
  127   1   UNUSED

BASESET
  ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1
DESCSET
  128  32   UNUSED
  160  96   32

For XML:

 CHARSET
 BASESET
  ISO Registration Number 177//CHARSET
  ISO/IEC 10646-1:1993 UCS-4 with implementation
  level 3//ESC 2/5 2/15 4/6
 DESCSET
      0        9  UNUSED
      9        2  9
     11        2  UNUSED
     13        1  13
     14       18  UNUSED
     32       95  32
    127        1  UNUSED
    128       32  UNUSED
    160    55136  160
  55296     2048  UNUSED -- surrogates --
  57344     8190  57344
  65534        2  UNUSED -- FFFE and FFFF --
  65536  1048576  65536 -- 16 planes outside BMP --
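
One way to read these two declarations is as lists of (first code point, count) ranges; every row not marked UNUSED is part of the document character set. The following Python sketch transcribes only those rows (the names `DOCBOOK_SGML_42`, `XML_DECL` and `in_document_charset` are invented for this illustration; a real SGML parser does considerably more than this):

```python
# (first code point, number of code points) for each DESCSET row that is
# not marked UNUSED, transcribed from the two declarations above.
DOCBOOK_SGML_42 = [(9, 2), (13, 1), (32, 95), (160, 96)]
XML_DECL = [(9, 2), (13, 1), (32, 95), (160, 55136),
            (57344, 8190), (65536, 1048576)]


def in_document_charset(code_point, descset):
    """Return True if code_point falls inside one of the declared ranges."""
    return any(start <= code_point < start + count
               for start, count in descset)


DOTLESS_I = 305  # U+0131, the character written as &#305;

print(in_document_charset(DOTLESS_I, DOCBOOK_SGML_42))
print(in_document_charset(DOTLESS_I, XML_DECL))
```

Under this reading, 305 lies beyond the last DocBook SGML range (which ends at 255) but inside the 160..55295 range of the XML declaration, which matches the failure described in this thread.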

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Hannu Krosing
On a fine day, Mon, 2006-09-25 at 00:23, Peter Eisentraut wrote:
 Andrew Dunstan wrote:
  If we want to quote references, we should quote the XML standard. For
  example, see here to see the exact charset supported by XML:
  http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.
 
 The actual cause of the processing problems we have been seeing is the
 character set definitions in the SGML declarations of the respective
 document types.

I see charsets, but where are encodings defined ?

I don't think that any of our SGML documentation is actually in UCS-4
encoding.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Tom Lane
Hannu Krosing [EMAIL PROTECTED] writes:
 I don't think that any of our SGML documentation is actually in UCS-4
 encoding.

The source files use nothing beyond plain ASCII (and should remain that
way, IMHO) so there isn't any need to inquire very far into exactly what
the toolchain thinks the document encoding is.  The issue at hand here
is what the *output* character set is, which is to say the document
character set if I have the jargon right.  That is the space over which
we are permitted to use &-entities.

regards, tom lane



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Bruce Momjian
Tom Lane wrote:
 Hannu Krosing [EMAIL PROTECTED] writes:
  I don't think that any of our SGML documentation is actually in UCS-4
  encoding.
 
 The source files use nothing beyond plain ASCII (and should remain that
 way, IMHO) so there isn't any need to inquire very far into exactly what
 the toolchain thinks the document encoding is.  The issue at hand here
 is what the *output* character set is, which is to say the document
 character set if I have the jargon right.  That is the space over which
 we are permitted to use &-entities.

Just for reference, if we could support UTF8, I was hoping to add
non-Latin names as alternates to the ASCII versions, so we could have
Japanese and Russian-lettered names in the release notes.  I thought it
would be a nice touch.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-24 Thread Martijn van Oosterhout
On Sun, Sep 24, 2006 at 07:38:20PM -0400, Tom Lane wrote:
 Hannu Krosing [EMAIL PROTECTED] writes:
  I don't think that any of our SGML documentation is actually in UCS-4
  encoding.
 
 The source files use nothing beyond plain ASCII (and should remain that
 way, IMHO) so there isn't any need to inquire very far into exactly what
 the toolchain thinks the document encoding is.  The issue at hand here
 is what the *output* character set is, which is to say the document
 character set if I have the jargon right.  That is the space over which
 we are permitted to use &-entities.

What you're talking about is generally referred to as the character
repertoire, the abstract set of characters a document is considered to
be composed of. For example: HTML4 (and XML IIRC) explicitly defines
the character repertoire to be Unicode, even though the character
encoding may only point to a subset of the total. Any others can be
generated via the &#xxx; escape syntax.

I'm surprised about the difference in installations. I didn't use your
-c option because that directory does not exist on my computer, but
maybe that's all the difference...

http://www.unicode.org/unicode/reports/tr17/

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] pgsql: We're going to have to spell dotless i as plain i, because

2006-09-23 Thread Martijn van Oosterhout
On Fri, Sep 22, 2006 at 12:29:05PM -0300, Tom Lane wrote:
 Log Message:
 ---
 We're going to have to spell dotless i as plain i, because dotless i is
 not in the character set supported by DocBook nor standard HTML.  (Sorry
 Volkan.)  Also replace random character-set references by a pointer to
 the actual standard.

Well you could always use the HTML4 &#305; which most tools should
understand. At least browsers have good support for this kind of
entity.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] pgsql: We're going to have to spell dotless i as plain i, because

2006-09-23 Thread Peter Eisentraut
Martijn van Oosterhout wrote:
 Well you could always use the HTML4 &#305; which most tools should
 understand. At least browsers have good support for this kind of
 entity.

Please review the recent thread on pgsql-docs before reiterating all the 
suggestions.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] pgsql: We're going to have to spell dotless i as plain i, because

2006-09-23 Thread Martijn van Oosterhout
On Sat, Sep 23, 2006 at 11:54:47AM +0200, Peter Eisentraut wrote:
 Martijn van Oosterhout wrote:
  Well you could always use the HTML4 &#305; which most tools should
  understand. At least browsers have good support for this kind of
  entity.
 
 Please review the recent thread on pgsql-docs before reiterating all the 
 suggestions.

Oh sorry, it wasn't clear from the commit entry. It's not that DocBook
doesn't support the character or that it can't be represented. It's
just not supported in the document encoding we're using.

Sorry for the noise.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-23 Thread Bruce Momjian
Martijn van Oosterhout wrote:
 On Sat, Sep 23, 2006 at 11:54:47AM +0200, Peter Eisentraut wrote:
  Martijn van Oosterhout wrote:
   Well you could always use the HTML4 &#305; which most tools should
   understand. At least browsers have good support for this kind of
   entity.
  
  Please review the recent thread on pgsql-docs before reiterating all the 
  suggestions.
 
 Oh sorry, it wasn't clear from the commit entry. It's not that DocBook
 doesn't support the character or that it can't be represented. It's
 just not supported in the document encoding we're using.

That's not how I understand it.  The document encoding is only related
to how high-bit characters are interpreted, I am told by Peter, but for
some reason the toolchain just doesn't support UTF8, even though if you
use &#305; in SGML it does come out right in HTML, but new toolchains
throw an error for it.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-23 Thread Martijn van Oosterhout
On Sat, Sep 23, 2006 at 08:49:02AM -0400, Bruce Momjian wrote:
 That's not how I understand it.  The document encoding is only related
 to how high-bit characters are interpreted, I am told by Peter, but for
 some reason the toolchain just doesn't support UTF8, even though if you
 use &#305; in SGML it does come out right in HTML, but new toolchains
 throw an error for it.

Dunno about UTF-8, but openjade only supports one character repertoire,
and that's Unicode (under character handling in the man page).

According to the docbook reference, a way to specify the dotless i
is &inodot;

http://www.oasis-open.org/docbook/documentation/reference/html/iso-lat2.html

But it's part of Latin-2, and if your stylesheet declares latin1 as
the only valid characters, then that character is invalid, no matter
how you represent it. I was just surprised, because &inodot; has been
part of docbook since version 3, which is quite some time ago now.

So to me (more of a docbook novice) it seems like it's the stylesheet
that's limiting you to latin1, not the docbook parser.

Anyway, the problem has been solved, so we can all get back to testing
the beta now.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-23 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 So to me (more of a docbook novice) it seems like it's the stylesheet
 that's limiting you to latin1, not the docbook parser.

But the stylesheet in question is part of the basic docbook
infrastructure, so the above distinction is academic.  (Or at least
that's what Peter stated upthread.)

To my mind the real problem is that one of the principal output formats
we are interested in is HTML, and there is no dotless-i entity in any
version of the HTML standard.  I trust I need not point out again the
difference between "my browser recognizes this construct" and "it's in
the standard".
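
For what it's worth, the named-entity half of this can be checked against Python's copy of the HTML 4 entity table, `html.entities.name2codepoint` (a small sketch; the variable names are illustrative, and this says nothing about what any particular browser accepts):

```python
from html.entities import name2codepoint

DOTLESS_I = 0x0131

# Names defined by HTML 4 that map to the dotless i, if any.
names_for_dotless_i = [name for name, code_point in name2codepoint.items()
                       if code_point == DOTLESS_I]

# For contrast, a Latin-1 character that HTML 4 does name.
names_for_e_acute = [name for name, code_point in name2codepoint.items()
                     if code_point == 0x00E9]

print(names_for_dotless_i, names_for_e_acute)
```

The first list comes out empty and the second contains `eacute`, so a named reference for the dotless i would have to come from outside the HTML 4 entity sets.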

regards, tom lane



Re: [HACKERS] pgsql: We're going to have to spell dotless i as plain i, because

2006-09-23 Thread Peter Eisentraut
Martijn van Oosterhout wrote:
 Oh sorry, it wasn't clear from the commit entry. It's not that
 DocBook doesn't support the character or that it can't be
 represented. It's just not supported in the document encoding we're
 using.

No, no, and no.

The reason that it doesn't work is that the document character set for
DocBook is Latin 1, so any attempt to refer to a character not in this 
set is going to fail.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-23 Thread Martijn van Oosterhout
On Sat, Sep 23, 2006 at 12:27:51PM -0400, Tom Lane wrote:
 To my mind the real problem is that one of the principal output formats
 we are interested in is HTML, and there is no dotless-i entity in any
 version of the HTML standard.  I trust I need not point out again the
 difference between "my browser recognizes this construct" and "it's in
 the standard".

Sure there is, HTML4 includes all of Unicode, thus also the dotless-i.
They gave up assigning names to them after latin1, but numerical
references are in the standard also (decimal and hex).
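
As a small sketch of the two numeric forms (Python; `numeric_references` is a made-up helper name), both the decimal and the hexadecimal reference name the same code point:

```python
import html


def numeric_references(character: str) -> tuple:
    """Return the decimal and hexadecimal character references
    for a single character."""
    code_point = ord(character)
    return f"&#{code_point};", f"&#x{code_point:X};"


decimal_ref, hex_ref = numeric_references("\u0131")
print(decimal_ref, hex_ref)

# html.unescape resolves numeric character references to characters.
decoded_decimal = html.unescape(decimal_ref)
decoded_hex = html.unescape(hex_ref)
```

Both forms decode to U+0131; neither depends on the character having a name.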

I created a simple docbook document on my computer with &inodot; and
ran openjade over it, and in the output file it is converted to &#305;.
Openjade knows how to generate valid character references. The input
file is attached, I compiled it with the command:

openjade -V draft-mode -wall -wno-unused-param -wno-empty -i output-html -t sgml /tmp/a.sgml

For dsl file just copy the stylesheet.dsl file in the postgresql source
tree.

Why it doesn't work in the current docs I don't know, but I think we can
rule out limitations of HTML or Docbook.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.
<!-- $PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.78 2006/09/14 13:40:28 teodor Exp $ -->

<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
   "docbook/dtd/xml/4.2/docbookx.dtd">
<article>
  <articleinfo>
    <title>
      &inodot; &#305;
    </title>
  </articleinfo>
  <section>
    <title>Introduction</title>
    <para>
      &inodot; &#305;
    </para>
  </section>
</article>




Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-23 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 I created a simple docbook document on my computer with &inodot; and
 ran openjade over it, and in the output file it is converted to &#305;.

I experimented with that, and openjade didn't complain about it, but
it renders in my browser (Safari) as

Have the COPY command return a command tag that includes the number of rows 
copied (Volkan Yaz&inodot;c&inodot;)

So that hardly looks like a portable solution either.

regards, tom lane



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-23 Thread Alvaro Herrera
Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  I created a simple docbook document on my computer with &inodot; and
  ran openjade over it, and in the output file it is converted to &#305;.
 
 I experimented with that, and openjade didn't complain about it, but
 it renders in my browser (Safari) as
 
 Have the COPY command return a command tag that includes the number of rows 
 copied (Volkan Yaz&inodot;c&inodot;)

Well, if I put an &inodot; into an HTML document and open it in my
browser (Epiphany, which is Mozilla-based), it surely looks like
verbatim &inodot;.  However, if I replace it with &#305; then it looks
like a dotless i.  So maybe your Openjade is not exactly the same one
Martijn was using, because what I understood was that Openjade replaced
the &inodot; with &#305;, which should work.

Does your browser display it correctly if you replace it manually with &#305;?

On the other hand, I don't understand why DocBook would be Latin-1 only.
What would be the point of that limitation?  Some googling seems to
reveal that people indeed use other charsets, UTF-8 in particular (but
also Big5, Latin-2, etc), so apparently this isn't set in stone.  (I
admit that they mainly talk about XML Docbook though).

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] pgsql: We're going to have to spell dotless i

2006-09-23 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 So maybe your Openjade is not exactly the same
 Martijn was using, because what I understood was that Openjade replaced
 the &inodot; with &#305;, which should work.

I think it's more likely that he was running with a non-DocBook
stylesheet (his openjade command did not explicitly select a catalog and
stylesheet the way that our Makefiles do).  Or just a different version
of the stylesheet.  I'm testing with whatever ships in Fedora Core 5.
I see definitions of &inodot; in some of the files under
/usr/share/sgml, but evidently none of them are included by docbook...

 Does your browser display it correctly if you replace manually with &#305;?

Doesn't really matter whether it does or not, since my gripe about that
is that DocBook rejects it.

 On the other hand, I don't understand why DocBook would be Latin-1 only.

I'm surprised too that it couldn't be easily overridden.  Peter, any
idea why not?

regards, tom lane
