Re: [GENERAL] COPY command character set

2010-02-23 Thread Peter Headland
As Tom says, this doesn't really address my original issue, which was
not that I read the material on encoding and misunderstood it, but that
I didn't even see that material because it was mixed in with a bunch of
other notes on all sorts of random subjects.

To address this issue in the documentation at large, I would like to see
every command that has I18N/L10N-related behavior have a separate
sub-head for the explanation of that behavior. That way, anyone who
needs to know about that aspect (which should be everyone), just has to
look for the sub-head to be sure they have found what they need to know.
Whilst I know we can't do that for every single cross-command topic, it
seems to me that I18N/L10N is sufficiently important to users of pg that
it merits this treatment.

FWIW, I think error handling/behavior also merits its own sub-heads
throughout. And there should be links within the pages to sub-heads (cf.
DB2's online doc).

Of course, all of this would be a substantial project.

Note that for the COPY command the I18N/L10N material covers both
DateStyle and encoding.

In respect of Bruce's proposed changes, I prefer the original wording
(for the same reasons as Tom), but with the addition of the mention of
the server - "... read from or written to a file directly by the
server".

-- 
Peter Headland
Architect
Actuate Corporation

-Original Message-
From: Bruce Momjian [mailto:br...@momjian.us] 
Sent: Monday, February 22, 2010 22:01
To: Tom Lane
Cc: Peter Headland; Adrian Klaver; pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set

Tom Lane wrote:
> Bruce Momjian  writes:
> > I have updated the documentation to be more direct about COPY
encoding
> > behavior.  Patch attached and applied.
> 
> Uh, why exactly do you find that better?  "Processes data" seems a lot
> vaguer to me than the previous wording.  I certainly don't think that
> this does much to address Peter's original complaint.

I thought the problem was that we said "input", then "output" and then
got to the point about the server, and I thought the reader just stopped
reading that far, so I tried to shorten it so the idea was sooner, and I
mentioned "server" at the end.  It might not be better, but I tried.

We don't want to highlight the input/output, we want to highlight that
all input and output are controlled by the client encoding.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Peter Headland
> Maybe the link might help?
> 
> http://www.postgresql.org/docs/8.4/interactive/multibyte.html

That page is too generic; what would be helpful is a section in the doc for 
each command that is affected by I18N/L10N considerations, that identifies how 
that specific command behaves.

Now that I have grasped the behavior, I'm more than happy to edit the COPY doc 
page, if people think that would be helpful/worthwhile.

-- 
Peter Headland
Architect
Actuate Corporation


-Original Message-
From: Adrian Klaver [mailto:akla...@comcast.net] 
Sent: Thursday, September 10, 2009 11:06
To: Peter Headland
Cc: pgsql-general@postgresql.org; Tom Lane
Subject: Re: [GENERAL] COPY command character set


- "Peter Headland"  wrote:

> > The COPY command reference page saith
> >
> >Input data is interpreted according to the current client
> encoding,
> >and output data is encoded in the the current client encoding,
> even
> >if the data does not pass through the client but is read from or
> >written to a file.
> 
> Rats - I read the manual page twice and that didn't register on my
> feeble consciousness. I suspect that I didn't look beyond the word
> "client", since I knew I wasn't interested in client behavior and I
> was
> speed-reading. On the assumption that I am not uniquely stupid, maybe
> we
> could re-phrase this slightly, with a "for example", and add a
> heading
> "Localization"?
> 
> As a general comment, I18N/L10N is a hairy enough topic that it
> merits
> its own heading in any commands where it is an issue.
> 
> How about my suggestion to add a means (extend COPY syntax) to
> specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?
> 
> -- 
> Peter Headland
> Architect
> Actuate Corporation
> 

> 
> The COPY command reference page saith
> 
> Input data is interpreted according to the current client
> encoding,
> and output data is encoded in the the current client encoding,
> even
> if the data does not pass through the client but is read from or
> written to a file. 
> 
> Seems clear enough to me.
> 
>   regards, tom lane

Maybe the link might help?

http://www.postgresql.org/docs/8.4/interactive/multibyte.html


Adrian Klaver
akla...@comcast.net

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Peter Headland
> There are no lead bytes in UTF-8

Sorry, sloppy use of terminology. I should have said "UTF signatures"
aka the "byte order mark". IOW, the "magic number" bytes commonly found
at the front of UTF encoded files:

UTF-16 little-endian   FF FE
UTF-16 big-endian   FE FF
UTF-8   EF BB BF

These tend to be inserted automatically by text editors, so it would be
advantageous to have them handled automatically by COPY (at least as an
option). Right now, if I edit a UTF-8 file then load it with COPY, I get
errors or bad data if the editor chose to add the 3 signature bytes.

Whilst UTF-16 is not supported internally, COPY seems to be a legitimate
special case, because it is used for migration to/from other tools that
may emit or expect UTF-16. ISTR that Postgres uses UCI? If so it would
be near-trivial to allow COPY to read and write UTF-16. If done via a
syntax extension to COPY (which I think is the most desirable
implementation), this would have no adverse effect on any other
capability. It also seems sufficiently isolated from sensitive/complex
areas of the code that it might make a suitable first project for
someone who is interested in becoming a contributor...

-- 
Peter Headland
Architect
Actuate Corporation


-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Thursday, September 10, 2009 11:13
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set 

"Peter Headland"  writes:
> How about my suggestion to add a means (extend COPY syntax) to specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?

There are no lead bytes in UTF-8, and we make no pretense of handling
UTF-16, so I don't think we'd be interested in some hack that cleans
up misencoding problems.

The idea of overriding client_encoding has been suggested before.  I
don't remember if it was rejected or is just languishing on the TODO
list.  I'd be a little worried about sending clients data in an encoding
they aren't expecting, but if it only works for I/O to a file it might
be okay.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Peter Headland
> The COPY command reference page saith
>
>Input data is interpreted according to the current client encoding,
>and output data is encoded in the the current client encoding, even
>if the data does not pass through the client but is read from or
>written to a file.

Rats - I read the manual page twice and that didn't register on my
feeble consciousness. I suspect that I didn't look beyond the word
"client", since I knew I wasn't interested in client behavior and I was
speed-reading. On the assumption that I am not uniquely stupid, maybe we
could re-phrase this slightly, with a "for example", and add a heading
"Localization"?

As a general comment, I18N/L10N is a hairy enough topic that it merits
its own heading in any commands where it is an issue.

How about my suggestion to add a means (extend COPY syntax) to specify
encoding explicitly and handle UTF lead bytes - would that be of
interest?

-- 
Peter Headland
Architect
Actuate Corporation


-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Thursday, September 10, 2009 10:38
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set 

"Peter Headland"  writes:
>> set client_encoding = 'utf8';
>> copy from stdin/to stdout;

> What if I want to do this on the server side (because it's much, much
> faster)? Does COPY use the default encoding of the database? If not,
> what?

> If this is a restrictive as it appears, and there are no outstanding
> enhancements planned in this area, I might be interested in improving
> this command to allow specifying the encoding and to have it do
obvious
> stuff like recognize UTF lead bytes automatically. At the very least,
> the documentation needs some work to explain these subtleties.

The COPY command reference page saith

Input data is interpreted according to the current client encoding,
and output data is encoded in the the current client encoding, even
if the data does not pass through the client but is read from or
written to a file. 

Seems clear enough to me.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] COPY command character set

2009-09-10 Thread Peter Headland
> set client_encoding = 'utf8';
> copy from stdin/to stdout;

What if I want to do this on the server side (because it's much, much
faster)? Does COPY use the default encoding of the database? If not,
what?

If this is a restrictive as it appears, and there are no outstanding
enhancements planned in this area, I might be interested in improving
this command to allow specifying the encoding and to have it do obvious
stuff like recognize UTF lead bytes automatically. At the very least,
the documentation needs some work to explain these subtleties.

-- 
Peter Headland
Architect
Actuate Corporation

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Wednesday, September 09, 2009 19:14
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set 

"Peter Headland"  writes:
> The documentation of the COPY command does not state what character
> set(s) are recognized or written. I need to import and export UTF-8
> data; how can I do that?

set client_encoding = 'utf8';
copy from stdin/to stdout;

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] COPY command character set

2009-09-09 Thread Peter Headland
The documentation of the COPY command does not state what character
set(s) are recognized or written. I need to import and export UTF-8
data; how can I do that?

-- 
Peter Headland
Architect
Actuate Corporation




Re: [GENERAL] Race hazard deleting using CTID?

2009-08-11 Thread Peter Headland
> Well, the CTID of a row you can see can't be changed by another
> transaction while your transaction is still live.  However, if
> someone else does modify/delete one of those rows concurrently,
> it will fail the outer WHERE check and thus silently not be
> deleted.  Net effect is that you might delete fewer than 10
> rows.  Not sure if you'd consider that a race hazard or not.

Thanks for the explanation. My actual use-case is an UPDATE, and
updating fewer rows than I wanted to is definitely not the desired
behavior.

It's a race hazard condition by definition if the behavior of a system
is non-deterministic due to timing of another process.

This is all in a pl/pgsql function.

I was trying to avoid a cursor and loop solution in the hope of better
performance. Sounds like I need to separate the SELECT into a top-level
statement and get the list of ctids into an array variable that I then
use in the DELETE (actually an UPDATE in my case, but that's beside the
point). I know this question would be better on new users, but since we
already have all the context established, I'd be grateful to know how to
get the ctids into an array variable in pl/pgsql.

-- 
Peter Headland
Architect
Actuate Corporation


-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, August 11, 2009 12:01
To: Peter Headland
Cc: pgsql-general@postgresql.org; Pavel Stehule
Subject: Re: Race hazard deleting using CTID? 

"Peter Headland"  writes:
> My question is, does this code contain a race hazard, because the list
from the SELECT might get changed by another session before the DELETE
uses it?

>   delete from del where ctid = any(array(select ctid from del limit
10))

Well, the CTID of a row you can see can't be changed by another
transaction while your transaction is still live.  However, if someone
else does modify/delete one of those rows concurrently, it will fail the
outer WHERE check and thus silently not be deleted.  Net effect is that
you might delete fewer than 10 rows.  Not sure if you'd consider that a
race hazard or not.

> If so, am I correct to think that adding FOR UPDATE to create the
version below would eliminate the hazard?

>   delete from del where ctid = any(array(select ctid from del limit 10
for update))

If you'd bothered to try that before asking the list, you'd know the
system won't take it --- FOR UPDATE is only supported at top level.
You could probably do something equivalent using a plpgsql loop, or
pulling the CTIDs back to the client side.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Race hazard deleting using CTID?

2009-08-11 Thread Peter Headland
I believe that the following statement was originally suggested by Tom Lane; I 
got it from Pavel Stehule's PostgreSQL Tips page.
My question is, does this code contain a race hazard, because the list from the 
SELECT might get changed by another session before the DELETE uses it?

  delete from del where ctid = any(array(select ctid from del limit 10))

If so, am I correct to think that adding FOR UPDATE to create the version below 
would eliminate the hazard?

  delete from del where ctid = any(array(select ctid from del limit 10 for 
update))

-- 
Peter Headland
Architect
Actuate Corporation


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Does PERFORM hold a lock?

2009-08-11 Thread Peter Headland
If I do the following in a pl/pgsql function, will an update lock be
held until the end of the transaction?

  PERFORM c FROM t FOR UPDATE;

-- 
Peter Headland
Architect
Actuate Corporation