Re: [GENERAL] COPY command character set

2010-02-23 Thread Peter Headland
As Tom says, this doesn't really address my original issue, which was
not that I read the material on encoding and misunderstood it, but that
I didn't even see that material because it was mixed in with a bunch of
other notes on all sorts of random subjects.

To address this issue in the documentation at large, I would like to see
every command that has I18N/L10N-related behavior have a separate
sub-head for the explanation of that behavior. That way, anyone who
needs to know about that aspect (which should be everyone), just has to
look for the sub-head to be sure they have found what they need to know.
Whilst I know we can't do that for every single cross-command topic, it
seems to me that I18N/L10N is sufficiently important to users of pg that
it merits this treatment.

FWIW, I think error handling/behavior also merits its own sub-heads
throughout. And there should be links within the pages to sub-heads (cf.
DB2's online doc).

Of course, all of this would be a substantial project.

Note that for the COPY command the I18N/L10N material covers both
DateStyle and encoding.

In respect of Bruce's proposed changes, I prefer the original wording
(for the same reasons as Tom), but with the addition of the mention of
the server - ... read from or written to a file directly by the
server.

-- 
Peter Headland
Architect
Actuate Corporation

-Original Message-
From: Bruce Momjian [mailto:br...@momjian.us] 
Sent: Monday, February 22, 2010 22:01
To: Tom Lane
Cc: Peter Headland; Adrian Klaver; pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set

Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  I have updated the documentation to be more direct about COPY
encoding
  behavior.  Patch attached and applied.
 
 Uh, why exactly do you find that better?  Processes data seems a lot
 vaguer to me than the previous wording.  I certainly don't think that
 this does much to address Peter's original complaint.

I thought the problem was that we said input, then output and then
got to the point about the server, and I thought the reader just stopped
reading that far, so I tried to shorten it so the idea was sooner, and I
mentioned server at the end.  It might not be better, but I tried.

We don't want to highlight the input/output, we want to highlight that
all input and output are controlled by the client encoding.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2010-02-23 Thread Bruce Momjian
Peter Headland wrote:
 In respect of Bruce's proposed changes, I prefer the original wording
 (for the same reasons as Tom), but with the addition of the mention of
 the server - ... read from or written to a file directly by the
 server.

OK, done with the attached patch.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/ref/copy.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/copy.sgml,v
retrieving revision 1.94
diff -c -c -r1.94 copy.sgml
*** doc/src/sgml/ref/copy.sgml	23 Feb 2010 05:17:33 -	1.94
--- doc/src/sgml/ref/copy.sgml	23 Feb 2010 21:38:07 -
***
*** 1,5 
  !--
! $PostgreSQL: pgsql/doc/src/sgml/ref/copy.sgml,v 1.94 2010/02/23 05:17:33 momjian Exp $
  PostgreSQL documentation
  --
  
--- 1,5 
  !--
! $PostgreSQL: pgsql/doc/src/sgml/ref/copy.sgml,v 1.93 2010/02/17 04:19:39 tgl Exp $
  PostgreSQL documentation
  --
  
***
*** 367,376 
 /para
  
 para
! commandCOPY/command always processes data according to the
! current client encoding, even if the data does not pass through
! the client but is read from or written to a file directly by the
! server.
 /para
  
 para
--- 367,376 
 /para
  
 para
! Input data is interpreted according to the current client encoding,
! and output data is encoded in the the current client encoding, even
! if the data does not pass through the client but is read from or
! written to a file directly by the server.
 /para
  
 para

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2010-02-22 Thread Bruce Momjian

I have updated the documentation to be more direct about COPY encoding
behavior.  Patch attached and applied.

---

Peter Headland wrote:
  Maybe the link might help?
  
  http://www.postgresql.org/docs/8.4/interactive/multibyte.html
 
 That page is too generic; what would be helpful is a section in the doc for 
 each command that is affected by I18N/L10N considerations, that identifies 
 how that specific command behaves.
 
 Now that I have grasped the behavior, I'm more than happy to edit the COPY 
 doc page, if people think that would be helpful/worthwhile.
 
 -- 
 Peter Headland
 Architect
 Actuate Corporation
 
 
 -Original Message-
 From: Adrian Klaver [mailto:akla...@comcast.net] 
 Sent: Thursday, September 10, 2009 11:06
 To: Peter Headland
 Cc: pgsql-general@postgresql.org; Tom Lane
 Subject: Re: [GENERAL] COPY command character set
 
 
 - Peter Headland pheadl...@actuate.com wrote:
 
   The COPY command reference page saith
  
  Input data is interpreted according to the current client
  encoding,
  and output data is encoded in the the current client encoding,
  even
  if the data does not pass through the client but is read from or
  written to a file.
  
  Rats - I read the manual page twice and that didn't register on my
  feeble consciousness. I suspect that I didn't look beyond the word
  client, since I knew I wasn't interested in client behavior and I
  was
  speed-reading. On the assumption that I am not uniquely stupid, maybe
  we
  could re-phrase this slightly, with a for example, and add a
  heading
  Localization?
  
  As a general comment, I18N/L10N is a hairy enough topic that it
  merits
  its own heading in any commands where it is an issue.
  
  How about my suggestion to add a means (extend COPY syntax) to
  specify
  encoding explicitly and handle UTF lead bytes - would that be of
  interest?
  
  -- 
  Peter Headland
  Architect
  Actuate Corporation
  
 
  
  The COPY command reference page saith
  
  Input data is interpreted according to the current client
  encoding,
  and output data is encoded in the the current client encoding,
  even
  if the data does not pass through the client but is read from or
  written to a file. 
  
  Seems clear enough to me.
  
  regards, tom lane
 
 Maybe the link might help?
 
 http://www.postgresql.org/docs/8.4/interactive/multibyte.html
 
 
 Adrian Klaver
 akla...@comcast.net
 
 -- 
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/ref/copy.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/copy.sgml,v
retrieving revision 1.93
diff -c -c -r1.93 copy.sgml
*** doc/src/sgml/ref/copy.sgml	17 Feb 2010 04:19:39 -	1.93
--- doc/src/sgml/ref/copy.sgml	23 Feb 2010 05:15:00 -
***
*** 367,376 
 /para
  
 para
! Input data is interpreted according to the current client encoding,
! and output data is encoded in the the current client encoding, even
! if the data does not pass through the client but is read from or
! written to a file.
 /para
  
 para
--- 367,376 
 /para
  
 para
! commandCOPY/command always processes data according to the
! current client encoding, even if the data does not pass through
! the client but is read from or written to a file directly by the
! server.
 /para
  
 para

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2010-02-22 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 I have updated the documentation to be more direct about COPY encoding
 behavior.  Patch attached and applied.

Uh, why exactly do you find that better?  Processes data seems a lot
vaguer to me than the previous wording.  I certainly don't think that
this does much to address Peter's original complaint.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2010-02-22 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  I have updated the documentation to be more direct about COPY encoding
  behavior.  Patch attached and applied.
 
 Uh, why exactly do you find that better?  Processes data seems a lot
 vaguer to me than the previous wording.  I certainly don't think that
 this does much to address Peter's original complaint.

I thought the problem was that we said input, then output and then
got to the point about the server, and I thought the reader just stopped
reading that far, so I tried to shorten it so the idea was sooner, and I
mentioned server at the end.  It might not be better, but I tried.

We don't want to highlight the input/output, we want to highlight that
all input and output are controlled by the client encoding.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Peter Headland
 set client_encoding = 'utf8';
 copy from stdin/to stdout;

What if I want to do this on the server side (because it's much, much
faster)? Does COPY use the default encoding of the database? If not,
what?

If this is a restrictive as it appears, and there are no outstanding
enhancements planned in this area, I might be interested in improving
this command to allow specifying the encoding and to have it do obvious
stuff like recognize UTF lead bytes automatically. At the very least,
the documentation needs some work to explain these subtleties.

-- 
Peter Headland
Architect
Actuate Corporation

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Wednesday, September 09, 2009 19:14
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set 

Peter Headland pheadl...@actuate.com writes:
 The documentation of the COPY command does not state what character
 set(s) are recognized or written. I need to import and export UTF-8
 data; how can I do that?

set client_encoding = 'utf8';
copy from stdin/to stdout;

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Tom Lane
Peter Headland pheadl...@actuate.com writes:
 set client_encoding = 'utf8';
 copy from stdin/to stdout;

 What if I want to do this on the server side (because it's much, much
 faster)? Does COPY use the default encoding of the database? If not,
 what?

 If this is a restrictive as it appears, and there are no outstanding
 enhancements planned in this area, I might be interested in improving
 this command to allow specifying the encoding and to have it do obvious
 stuff like recognize UTF lead bytes automatically. At the very least,
 the documentation needs some work to explain these subtleties.

The COPY command reference page saith

Input data is interpreted according to the current client encoding,
and output data is encoded in the the current client encoding, even
if the data does not pass through the client but is read from or
written to a file. 

Seems clear enough to me.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Peter Headland
 The COPY command reference page saith

Input data is interpreted according to the current client encoding,
and output data is encoded in the the current client encoding, even
if the data does not pass through the client but is read from or
written to a file.

Rats - I read the manual page twice and that didn't register on my
feeble consciousness. I suspect that I didn't look beyond the word
client, since I knew I wasn't interested in client behavior and I was
speed-reading. On the assumption that I am not uniquely stupid, maybe we
could re-phrase this slightly, with a for example, and add a heading
Localization?

As a general comment, I18N/L10N is a hairy enough topic that it merits
its own heading in any commands where it is an issue.

How about my suggestion to add a means (extend COPY syntax) to specify
encoding explicitly and handle UTF lead bytes - would that be of
interest?

-- 
Peter Headland
Architect
Actuate Corporation


-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Thursday, September 10, 2009 10:38
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set 

Peter Headland pheadl...@actuate.com writes:
 set client_encoding = 'utf8';
 copy from stdin/to stdout;

 What if I want to do this on the server side (because it's much, much
 faster)? Does COPY use the default encoding of the database? If not,
 what?

 If this is a restrictive as it appears, and there are no outstanding
 enhancements planned in this area, I might be interested in improving
 this command to allow specifying the encoding and to have it do
obvious
 stuff like recognize UTF lead bytes automatically. At the very least,
 the documentation needs some work to explain these subtleties.

The COPY command reference page saith

Input data is interpreted according to the current client encoding,
and output data is encoded in the the current client encoding, even
if the data does not pass through the client but is read from or
written to a file. 

Seems clear enough to me.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Adrian Klaver

- Peter Headland pheadl...@actuate.com wrote:

  The COPY command reference page saith
 
 Input data is interpreted according to the current client
 encoding,
 and output data is encoded in the the current client encoding,
 even
 if the data does not pass through the client but is read from or
 written to a file.
 
 Rats - I read the manual page twice and that didn't register on my
 feeble consciousness. I suspect that I didn't look beyond the word
 client, since I knew I wasn't interested in client behavior and I
 was
 speed-reading. On the assumption that I am not uniquely stupid, maybe
 we
 could re-phrase this slightly, with a for example, and add a
 heading
 Localization?
 
 As a general comment, I18N/L10N is a hairy enough topic that it
 merits
 its own heading in any commands where it is an issue.
 
 How about my suggestion to add a means (extend COPY syntax) to
 specify
 encoding explicitly and handle UTF lead bytes - would that be of
 interest?
 
 -- 
 Peter Headland
 Architect
 Actuate Corporation
 

 
 The COPY command reference page saith
 
 Input data is interpreted according to the current client
 encoding,
 and output data is encoded in the the current client encoding,
 even
 if the data does not pass through the client but is read from or
 written to a file. 
 
 Seems clear enough to me.
 
   regards, tom lane

Maybe the link might help?

http://www.postgresql.org/docs/8.4/interactive/multibyte.html


Adrian Klaver
akla...@comcast.net

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Tom Lane
Peter Headland pheadl...@actuate.com writes:
 How about my suggestion to add a means (extend COPY syntax) to specify
 encoding explicitly and handle UTF lead bytes - would that be of
 interest?

There are no lead bytes in UTF-8, and we make no pretense of handling
UTF-16, so I don't think we'd be interested in some hack that cleans
up misencoding problems.

The idea of overriding client_encoding has been suggested before.  I
don't remember if it was rejected or is just languishing on the TODO
list.  I'd be a little worried about sending clients data in an encoding
they aren't expecting, but if it only works for I/O to a file it might
be okay.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Peter Headland
 There are no lead bytes in UTF-8

Sorry, sloppy use of terminology. I should have said UTF signatures
aka the byte order mark. IOW, the magic number bytes commonly found
at the front of UTF encoded files:

UTF-16 little-endian   FF FE
UTF-16 big-endian   FE FF
UTF-8   EF BB BF

These tend to be inserted automatically by text editors, so it would be
advantageous to have them handled automatically by COPY (at least as an
option). Right now, if I edit a UTF-8 file then load it with COPY, I get
errors or bad data if the editor chose to add the 3 signature bytes.

Whilst UTF-16 is not supported internally, COPY seems to be a legitimate
special case, because it is used for migration to/from other tools that
may emit or expect UTF-16. ISTR that Postgres uses UCI? If so it would
be near-trivial to allow COPY to read and write UTF-16. If done via a
syntax extension to COPY (which I think is the most desirable
implementation), this would have no adverse effect on any other
capability. It also seems sufficiently isolated from sensitive/complex
areas of the code that it might make a suitable first project for
someone who is interested in becoming a contributor...

-- 
Peter Headland
Architect
Actuate Corporation


-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Thursday, September 10, 2009 11:13
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set 

Peter Headland pheadl...@actuate.com writes:
 How about my suggestion to add a means (extend COPY syntax) to specify
 encoding explicitly and handle UTF lead bytes - would that be of
 interest?

There are no lead bytes in UTF-8, and we make no pretense of handling
UTF-16, so I don't think we'd be interested in some hack that cleans
up misencoding problems.

The idea of overriding client_encoding has been suggested before.  I
don't remember if it was rejected or is just languishing on the TODO
list.  I'd be a little worried about sending clients data in an encoding
they aren't expecting, but if it only works for I/O to a file it might
be okay.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Peter Headland
 Maybe the link might help?
 
 http://www.postgresql.org/docs/8.4/interactive/multibyte.html

That page is too generic; what would be helpful is a section in the doc for 
each command that is affected by I18N/L10N considerations, that identifies how 
that specific command behaves.

Now that I have grasped the behavior, I'm more than happy to edit the COPY doc 
page, if people think that would be helpful/worthwhile.

-- 
Peter Headland
Architect
Actuate Corporation


-Original Message-
From: Adrian Klaver [mailto:akla...@comcast.net] 
Sent: Thursday, September 10, 2009 11:06
To: Peter Headland
Cc: pgsql-general@postgresql.org; Tom Lane
Subject: Re: [GENERAL] COPY command character set


- Peter Headland pheadl...@actuate.com wrote:

  The COPY command reference page saith
 
 Input data is interpreted according to the current client
 encoding,
 and output data is encoded in the the current client encoding,
 even
 if the data does not pass through the client but is read from or
 written to a file.
 
 Rats - I read the manual page twice and that didn't register on my
 feeble consciousness. I suspect that I didn't look beyond the word
 client, since I knew I wasn't interested in client behavior and I
 was
 speed-reading. On the assumption that I am not uniquely stupid, maybe
 we
 could re-phrase this slightly, with a for example, and add a
 heading
 Localization?
 
 As a general comment, I18N/L10N is a hairy enough topic that it
 merits
 its own heading in any commands where it is an issue.
 
 How about my suggestion to add a means (extend COPY syntax) to
 specify
 encoding explicitly and handle UTF lead bytes - would that be of
 interest?
 
 -- 
 Peter Headland
 Architect
 Actuate Corporation
 

 
 The COPY command reference page saith
 
 Input data is interpreted according to the current client
 encoding,
 and output data is encoded in the the current client encoding,
 even
 if the data does not pass through the client but is read from or
 written to a file. 
 
 Seems clear enough to me.
 
   regards, tom lane

Maybe the link might help?

http://www.postgresql.org/docs/8.4/interactive/multibyte.html


Adrian Klaver
akla...@comcast.net

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Alvaro Herrera
Peter Headland wrote:

 As a general comment, I18N/L10N is a hairy enough topic that it merits
 its own heading in any commands where it is an issue.

I agree, this seems a good idea because people is often confused by
this.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-09 Thread Tom Lane
Peter Headland pheadl...@actuate.com writes:
 The documentation of the COPY command does not state what character
 set(s) are recognized or written. I need to import and export UTF-8
 data; how can I do that?

set client_encoding = 'utf8';
copy from stdin/to stdout;

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general