Re: [HACKERS] UTF8 with BOM support in psql

2009-11-21 Thread Peter Eisentraut
On Mon, 2009-11-16 at 22:37 +0200, Peter Eisentraut wrote: > On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote: > > Sure. Client encoding is declared in body of a file, but BOM is > > in head of the file. So, we should always ignore BOM sequence > > at the file head no matter what client en

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-18 Thread Tom Lane
Peter Eisentraut writes: > This is certainly a workaround, just like piping the file through a > suitable sed expression would be, but conceptually, the client encoding > is a property of the file and should therefore be marked in the file. In a perfect world things would be like that, but the wo

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-18 Thread Peter Eisentraut
On Wed, 2009-11-18 at 08:52 -0500, Andrew Dunstan wrote: > 4) set the client encoding before the file is read in any of the ways > that have already been discussed and then allow psql to eat the BOM. This is certainly a workaround, just like piping the file through a suitable sed expression would

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-18 Thread Andrew Dunstan
Peter Eisentraut wrote: But now we're back to the original problem. Certain editors insert BOMs at the beginning of the file. And that is by any definition before the embedded client encoding declaration. I think the only ways to solve this are: 1) Ignore the BOM if a client encoding declar

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-18 Thread Peter Eisentraut
On Tue, 2009-11-17 at 23:22 -0500, Andrew Dunstan wrote: > Itagaki Takahiro wrote: > > I don't want user to check the encoding of scripts before executing -- > > it is far from fail-safe. > That's what we require in all other cases. Why should UTF8 be special? But now we're bac

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-18 Thread Peter Eisentraut
On Wed, 2009-11-18 at 12:52 +0900, Itagaki Takahiro wrote: > Peter Eisentraut wrote: > > Together, that should cover a lot of cases. Not perfect, but far from > > useless. > For Japanese users on Windows, the client encoding is always set to SJIS > because of the restriction of cmd.exe. Bu

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Itagaki Takahiro
Andrew Dunstan wrote: > Itagaki Takahiro wrote: > > I don't want user to check the encoding of scripts before executing -- > > it is far from fail-safe. > That's what we require in all other cases. Why should UTF8 be special? No. I didn't think about UTF-8 or BOM at that point. I assumed w

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Andrew Dunstan
Itagaki Takahiro wrote: > I don't want user to check the encoding of scripts before executing -- > it is far from fail-safe. That's what we require in all other cases. Why should UTF8 be special? If I have a script in Latin1 and Postgres thinks it's UTF8 it will probably explode. Same for

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Itagaki Takahiro
Andrew Dunstan wrote: > Itagaki Takahiro wrote: > > Multi-byte scripts > > without encoding are always dangerous whether BOM is present or not. > > I'd say we can always throw an error when we find queries that contain > > multi-byte characters if no prior encoding declaration. > > You will br

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Itagaki Takahiro
Peter Eisentraut wrote: > Together, that should cover a lot of cases. Not perfect, but far from > useless. For Japanese users on Windows, the client encoding is always set to SJIS because of the restriction of cmd.exe. But the script file can be written in UTF8 with BOM. I don't think we shou

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Tom Lane
Andrew Dunstan writes: > Well, it might be a good idea to provide at least some support in libpq. > Making each client do it from scratch seems a bit inefficient. Encoding conversion seems far outside libpq's charter, and as for "from scratch" there are other libraries for that.

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Andrew Dunstan
Tom Lane wrote: > Andrew Dunstan writes: > > Peter Eisentraut wrote: > > > Well, someone could implement UTF-16 or UTF-whatever as client encoding. > > > But I have not heard of any concrete proposals about that. > > Doesn't the nul byte problem make that seriously hard? > Just about imp

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Tom Lane
Andrew Dunstan writes: > Peter Eisentraut wrote: >> Well, someone could implement UTF-16 or UTF-whatever as client encoding. >> But I have not heard of any concrete proposals about that. > Doesn't the nul byte problem make that seriously hard? Just about impossible. It would require a protocol
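
The nul byte problem raised here can be seen in a few lines. This is a minimal Python sketch, assuming (for illustration only, the thread does not spell it out) that the query text passes through code that treats it as a C-style nul-terminated string. The helper `c_strlen` is hypothetical and only mimics what strlen() would report.

```python
def c_strlen(buf: bytes) -> int:
    """Length a C-style strlen() would report: bytes before the first nul."""
    nul = buf.find(b"\x00")
    return len(buf) if nul == -1 else nul


query = "SELECT 1;"
utf8 = query.encode("utf-8")
utf16 = query.encode("utf-16-le")  # little-endian, no BOM

# UTF-8 keeps ASCII text free of nul bytes, so the whole query survives.
utf8_len = c_strlen(utf8)

# UTF-16 stores 'S' as 53 00, so a nul-terminated reader stops after one byte.
utf16_len = c_strlen(utf16)
```

Under that assumption the UTF-16 form of even a pure-ASCII query is cut off after its first byte, which is why a nul-terminated path cannot carry it unchanged.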

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Chuck McDevitt
> -Original Message- > From: Andrew Dunstan [mailto:and...@dunslane.net] > Sent: Tuesday, November 17, 2009 9:15 AM > To: Peter Eisentraut > Cc: Chuck McDevitt; Itagaki Takahiro; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] UTF8 with BOM support in psql > >

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Andrew Dunstan
Peter Eisentraut wrote: > On Tue, 2009-11-17 at 00:59 -0800, Chuck McDevitt wrote: > > Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8, > > so psql and PostgreSQL understand it? > > (BTW, that would actually be nice on Windows, where UTF-16 is common). > Well, someone could impl

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Chuck McDevitt
> -Original Message- > From: Peter Eisentraut [mailto:pete...@gmx.net] > Sent: Tuesday, November 17, 2009 9:05 AM > To: Chuck McDevitt > Cc: Itagaki Takahiro; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] UTF8 with BOM support in psql > > On tis, 2009-11-

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Peter Eisentraut
On Tue, 2009-11-17 at 00:59 -0800, Chuck McDevitt wrote: > Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8, > so psql and PostgreSQL understand it? > (BTW, that would actually be nice on Windows, where UTF-16 is common). Well, someone could implement UTF-16 or UTF-whatever as

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Peter Eisentraut
On Tue, 2009-11-17 at 09:31 +0900, Itagaki Takahiro wrote: > Peter Eisentraut wrote: > > OK, I think the consensus here is: > > - Eat BOM at beginning of file (as you implemented) > > - Only when client encoding is UTF-8 --> please fix that > Are they AND condition? If so, this patch will be

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Tom Lane
Peter Eisentraut writes: > I think I could support using the presence of the BOM as a fall-back > indicator of encoding in absence of any other declaration. It seems to > me, however, that the description above ignores the existence of > encodings other than SQL_ASCII and UTF8. Yeah. This entir

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Andrew Dunstan
Itagaki Takahiro wrote: > Multi-byte scripts > without encoding are always dangerous whether BOM is present or not. > I'd say we can always throw an error when we find queries that contain > multi-byte characters if no prior encoding declaration. You will break a gazillion scripts that today wo

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-17 Thread Chuck McDevitt
> I don't know what the best solution is here. The BOM encoded as UTF-8 > is valid data in other encodings. Of course, there is your point that > such data cannot be at the start of an SQL command. Is the UTF-8 BOM (EF BB BF) actually valid data in any other multi-byte encoding (other t
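
One way to probe the claim that the BOM bytes are valid data elsewhere is to decode them under a few encodings. The sketch below is a minimal Python illustration; `decode_or_none` is a hypothetical helper, and the two non-UTF-8 encodings checked are single-byte ones, so it does not answer the question about other multi-byte encodings.

```python
BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def decode_or_none(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Python's plain "utf-8" codec keeps the BOM as the single character U+FEFF.
as_utf8 = decode_or_none(BOM, "utf-8")

# In these single-byte encodings the same three bytes are three ordinary characters.
as_latin1 = decode_or_none(BOM, "latin-1")
as_cp1252 = decode_or_none(BOM, "cp1252")
```

The same helper can be pointed at whatever multi-byte codec names a given Python build provides; the results for those are deliberately not asserted here.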

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Itagaki Takahiro
Peter Eisentraut wrote: > I think I could support using the presence of the BOM as a fall-back > indicator of encoding in absence of any other declaration. What is the difference between the fall-back and <> ? I read this discussion as saying that we cannot accept any automatic encoding detection (properly spea

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Peter Eisentraut
On Tue, 2009-11-17 at 14:19 +0900, Itagaki Takahiro wrote: > The attached patch is a new proposal for the feature. > When we find a BOM at the beginning of a file, set "expected_encoding" to UTF8. > Before every execution of a query, if pset.encoding is not UTF8, we check the > query string not to contain any

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Itagaki Takahiro
Tom Lane wrote: > Itagaki Takahiro writes: > > If encoding setting is reverted, > >> "Eat BOM at beginning of file and <>" > > will be much safer. > > This isn't going to happen, so please stop wasting our time arguing > about it. Ok, sorry. But I still cannot accept this restriction. >> - O

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Tom Lane
Itagaki Takahiro writes: > If encoding setting is reverted, >> "Eat BOM at beginning of file and <>" > will be much safer. This isn't going to happen, so please stop wasting our time arguing about it. regards, tom lane

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Itagaki Takahiro
Tom Lane wrote: > Andrew Dunstan writes: > > if you need to, using PGOPTIONS or psql > > "dbname=mydb options='-c client_encoding=utf8'". > > It could also be set in ~/.psqlrc, which would probably be the most > convenient method for regular users of UTF8 files who need to talk > to non-UTF8

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Tom Lane
Andrew Dunstan writes: > As for when it can be set, unless I'm mistaken you should be able to set > it before any file is opened, if you need to, using PGOPTIONS or psql > "dbname=mydb options='-c client_encoding=utf8'". Of course, if the > server encoding is utf8 then, in the absence of it bei
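
The two ways of setting the client encoding before any file is opened, as quoted above, can be written out as follows. This is a minimal Python sketch that only builds the environment and argument lists and does not run psql; the function name `psql_invocation`, the script name `script.sql`, and the choice to pass the script with `-f` are assumptions for illustration.

```python
import os


def psql_invocation(script_path, dbname="mydb", encoding="utf8", base_env=None):
    """Build (env, argv_env_style, argv_conninfo_style) without running anything."""
    env = dict(os.environ if base_env is None else base_env)

    # Style 1: PGOPTIONS in the environment, plain database name on the command line.
    env["PGOPTIONS"] = f"-c client_encoding={encoding}"
    argv_env_style = ["psql", "-f", script_path, dbname]

    # Style 2: the same option embedded in a conninfo string, as quoted in the thread.
    conninfo = f"dbname={dbname} options='-c client_encoding={encoding}'"
    argv_conninfo_style = ["psql", "-f", script_path, conninfo]

    return env, argv_env_style, argv_conninfo_style


env, argv_env, argv_conninfo = psql_invocation("script.sql", base_env={})
```

Either argument list could then be handed to `subprocess.run`, with `env=env` for the first style.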

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Andrew Dunstan
Itagaki Takahiro wrote: > Peter Eisentraut wrote: > > OK, I think the consensus here is: > > - Eat BOM at beginning of file (as you implemented) > > - Only when client encoding is UTF-8 --> please fix that > Are they AND condition? If so, this patch will be useless. > Please remember \encoding or SE

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Tom Lane
Itagaki Takahiro writes: > Please remember \encoding or SET client_encoding appear > *after* BOM at beginning of file. I'll agree if the condition is > "Eat BOM at beginning of file and <>", As has been stated multiple times, that will not get accepted, because it will *break* files in other enc

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Itagaki Takahiro
Peter Eisentraut wrote: > OK, I think the consensus here is: > - Eat BOM at beginning of file (as you implemented) > - Only when client encoding is UTF-8 --> please fix that Are they AND condition? If so, this patch will be useless. Please remember \encoding or SET client_encoding appear *after

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Tom Lane
Peter Eisentraut writes: > I'm not sure if replacing a BOM by three spaces is a good way to > implement "eating", because it might throw off a column indicator > somewhere, say, but I couldn't reproduce a problem. Note that the > U+FEFF character is defined as *zero-width* non-breaking space. S
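
The column concern can be made concrete by comparing where the first keyword lands under each approach. This is a minimal Python sketch on a made-up one-line script; it counts character positions only and says nothing about how psql itself reports columns.

```python
BOM = b"\xef\xbb\xbf"
line = BOM + b"SELECT 1;"

# Option A: overwrite the three BOM bytes with three spaces (byte length unchanged).
replaced = b"   " + line[len(BOM):]

# Option B: drop the BOM bytes entirely.
removed = line[len(BOM):]

# Character column of the first keyword in each variant.
col_original = line.decode("utf-8").index("SELECT")      # BOM is one character
col_replaced = replaced.decode("utf-8").index("SELECT")  # three visible spaces
col_removed = removed.decode("utf-8").index("SELECT")    # nothing in front
```

Replacing with spaces preserves byte offsets but puts three visible columns where the original had a single zero-width character; removing the bytes shifts every byte offset by three instead.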

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-16 Thread Peter Eisentraut
On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote: > Sure. Client encoding is declared in body of a file, but BOM is > in head of the file. So, we should always ignore BOM sequence > at the file head no matter what client encoding is used. > > The attached patch replaces BOM with white spac

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-14 Thread Andrew Dunstan
Peter Eisentraut wrote: > On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote: > > Client encoding is declared in body of a file, but BOM is > > in head of the file. So, we should always ignore BOM sequence > > at the file head no matter what client encoding is used. > > > > The attached patch replaces BOM

Re: [HACKERS] UTF8 with BOM support in psql

2009-11-14 Thread Peter Eisentraut
On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote: > Client encoding is declared in body of a file, but BOM is > in head of the file. So, we should always ignore BOM sequence > at the file head no matter what client encoding is used. > > The attached patch replaces BOM with white spaces, bu

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-24 Thread Peter Eisentraut
On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote: > So, we should always ignore BOM sequence > at the file head no matter what client encoding is used. I think we can't do that. That byte sequence might be valid user data in other encodings.

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-21 Thread Andrew Dunstan
Peter Eisentraut wrote: > On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote: > > The attached patch replaces BOM with white spaces, but it does not > > change client encoding automatically. I think we can always ignore > > client encoding at the replacement because SQL command cannot start > > with BO

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-21 Thread Peter Eisentraut
On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote: > The attached patch replaces BOM with white spaces, but it does not > change client encoding automatically. I think we can always ignore > client encoding at the replacement because SQL command cannot start > with BOM sequence. If we don't

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-20 Thread Itagaki Takahiro
David Christensen wrote: > Is that only when the default client encoding is set to UTF8 > (PGCLIENTENCODING, whatever), or will it be coded to work with the > following: > > $ psql -f > Where is: > > SET CLIENT ENCODING 'utf8'; Sure. Client encoding is declared in body of a file, but BO

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-20 Thread David Christensen
On Oct 20, 2009, at 10:51 AM, Tom Lane wrote: > Andrew Dunstan writes: > > What I think we might sensibly do is to eat the leading BOM of an SQL > > file iff the client encoding is UTF8, and otherwise treat it as just > > bytes in whatever the encoding is. > That seems relatively non-risky. Is that only

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-20 Thread Magnus Hagander
2009/10/20 Tom Lane : > Andrew Dunstan writes: >> What I think we might sensibly do is to eat the leading BOM of an SQL >> file iff the client encoding is UTF8, and otherwise treat it as just >> bytes in whatever the encoding is. > > That seems relatively non-risky. +1. >> Should we also do the

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-20 Thread Kevin Grittner
Andrew Dunstan wrote: > What I think we might sensibly do is to eat the leading BOM of an > SQL file iff the client encoding is UTF8, and otherwise treat it as > just bytes in whatever the encoding is. Only at the beginning of the file or stream? What happens when people concatenate files? W

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-20 Thread Tom Lane
Andrew Dunstan writes: > What I think we might sensibly do is to eat the leading BOM of an SQL > file iff the client encoding is UTF8, and otherwise treat it as just > bytes in whatever the encoding is. That seems relatively non-risky. > Should we also do the same for files passed via \copy? W
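
The rule quoted above (eat the leading BOM iff the client encoding is UTF8, otherwise leave the bytes alone) can be sketched as follows. This is a minimal Python illustration, not the psql patch; the function names and the loose matching of encoding spellings in `is_utf8` are assumptions.

```python
BOM = b"\xef\xbb\xbf"


def is_utf8(client_encoding: str) -> bool:
    """Treat 'UTF8', 'utf-8', 'utf_8' and similar spellings as the same encoding."""
    return client_encoding.replace("-", "").replace("_", "").upper() == "UTF8"


def eat_leading_bom(data: bytes, client_encoding: str) -> bytes:
    """Drop a BOM only at the very start of the data, and only under UTF8."""
    if is_utf8(client_encoding) and data.startswith(BOM):
        return data[len(BOM):]
    # Any other encoding: the three bytes are ordinary data and are kept.
    return data
```

A BOM that appears later in the data, for example after files have been concatenated, is left untouched by this sketch.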

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-20 Thread Andrew Dunstan
Tom Lane wrote: > Bruce Momjian writes: > > Seems there is community support for accepting BOM: > > http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php > That discussion has approximately nothing to do with the much-more-invasive > change that Itagaki-san is suggesting. In

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-20 Thread Tom Lane
Bruce Momjian writes: > Seems there is community support for accepting BOM: > http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php That discussion has approximately nothing to do with the much-more-invasive change that Itagaki-san is suggesting. In particular I think an automa

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-20 Thread Peter Eisentraut
On Tue, 2009-10-20 at 14:41 +0900, Itagaki Takahiro wrote: > UTF8-encoded text files with a BOM (Byte Order Mark) are commonly > used on Windows, though the BOM was originally designed for UTF16 text. > However, psql cannot read that format even if we set the client encoding > to UTF8. Is it worth supportin

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-19 Thread Itagaki Takahiro
Bruce Momjian wrote: > Itagaki Takahiro wrote: > > When psql opens a file with -f or \i, it checks first 3 bytes of the > > file. If they are BOM, discard the 3 bytes and change client encoding > > to UTF8 automatically. > > Seems there is community support for accepting BOM: > http://arc
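
The original proposal quoted here (check the first 3 bytes, discard a BOM, and switch the client encoding to UTF8 automatically) can be sketched like this. It is a minimal Python illustration of the proposed behaviour only; `read_script_head` is a hypothetical name, and this automatic switch is the part other posters in the thread object to.

```python
BOM = b"\xef\xbb\xbf"


def read_script_head(data: bytes, client_encoding: str):
    """Return (script_bytes, client_encoding) after the proposed BOM check."""
    if data[:3] == BOM:
        # Proposed behaviour: discard the 3 bytes and force UTF8.
        return data[3:], "UTF8"
    # No BOM: leave both the data and the current encoding untouched.
    return data, client_encoding


body, encoding = read_script_head(BOM + b"SELECT 1;", "SJIS")
```

Unlike a rule that only strips the BOM when the encoding is already UTF8, this variant overrides whatever encoding the session had.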

Re: [HACKERS] UTF8 with BOM support in psql

2009-10-19 Thread Bruce Momjian
Itagaki Takahiro wrote: > UTF8-encoded text files with a BOM (Byte Order Mark) are commonly > used on Windows, though the BOM was originally designed for UTF16 text. > However, psql cannot read that format even if we set the client encoding > to UTF8. Is it worth supporting that format in psql? > > When p

[HACKERS] UTF8 with BOM support in psql

2009-10-19 Thread Itagaki Takahiro
UTF8-encoded text files with a BOM (Byte Order Mark) are commonly used on Windows, though the BOM was originally designed for UTF16 text. However, psql cannot read that format even if we set the client encoding to UTF8. Is it worth supporting that format in psql? When psql opens a file with -f or \i, it c