Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
What I think we'd need to have a complete solution is
convert(text, name) returns bytea
-- convert from DB encoding to arbitrary encoding
convert(bytea, name, name) returns bytea
-- convert between any two
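As a rough sketch (Python as a stand-in, not PostgreSQL code), the three proposed signatures map onto the str/bytes split, where str plays the role of text (always in the one known database encoding) and bytes plays the role of bytea:

```python
# Sketch of the proposed convert() family. Assumptions: str ~ text in
# the database encoding, bytes ~ bytea; DB_ENCODING is hypothetical.
DB_ENCODING = "utf-8"

def convert_to(s: str, dest_encoding: str) -> bytes:
    """convert(text, name) -> bytea: DB encoding to arbitrary encoding."""
    return s.encode(dest_encoding)

def convert_between(b: bytes, src: str, dest: str) -> bytes:
    """convert(bytea, name, name) -> bytea: between any two encodings."""
    return b.decode(src).encode(dest)

def convert_from(b: bytes, src_encoding: str) -> str:
    """convert(bytea, name) -> text: arbitrary encoding into the DB
    encoding, validating the bytes on the way in."""
    return b.decode(src_encoding)
```

The point of the scheme is that only the last form produces text, and it validates its input, so no path creates a text value that is invalid in the database encoding.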

Andrew Dunstan [EMAIL PROTECTED] writes:
What's bothering me here though is that in the two argument forms, if
the first argument is text the second argument is the destination
encoding, but if the first argument is a bytea the second argument is
the source encoding. That strikes me as
Tom Lane wrote:
Anyway, on the strength of that, these functions are definitely
best named to stay away from the spec syntax, so +1 for your
proposal above.
OK, I have committed this and the other functional changes that
should close the encoding holes.
Catalog version
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
No. We have a function overloading system, we should use it.
In general I agree with you.
What's bothering me here though is that in the two argument forms, if the
first argument is text the second argument is the destination
Tom Lane wrote:
What I think we'd need to have a complete solution is
convert(text, name) returns bytea
-- convert from DB encoding to arbitrary encoding
convert(bytea, name, name) returns bytea
-- convert between any two encodings
convert(bytea, name) returns text
Tom Lane wrote:
I think really the technically cleanest solution would be to make
convert() return bytea instead of text; then we'd not have to put
restrictions on what encoding or locale it's working inside of.
However, it's not clear to me whether there are valid usages that
that would
Andrew Dunstan [EMAIL PROTECTED] writes:
Are you wanting this done for 8.3? If so, by whom? :-)
[ shrug... ] I'm not the one who's worried about closing all the holes
leading to encoding problems.
regards, tom lane
Andrew Dunstan [EMAIL PROTECTED] writes:
I can certainly have a go at it. Are we still talking about Oct 1 for a
possible beta?
Yeah, there's still a little time left --- HOT will take at least a few
more days.
regards, tom lane
On Tue, Sep 11, 2007 at 11:27:50AM +0900, Tatsuo Ishii wrote:
SELECT * FROM japanese_table ORDER BY convert(japanese_text using
utf8_to_euc_jp);
Without using convert(), he will get random order of data. This is
because Kanji characters are in random order in UTF-8, while Kanji
characters
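The ordering problem is easy to reproduce outside PostgreSQL: the same two kanji sort one way as UTF-8 byte strings and the opposite way as EUC-JP byte strings, because Unicode code point order and JIS order disagree. A small check (Python stand-in):

```python
# 一 (U+4E00, "one") precedes 亜 (U+4E9C) in Unicode code point order,
# but 亜 is JIS kuten 16-01 (EUC-JP 0xB0A1) and so sorts first in EUC-JP.
ichi, a = "一", "亜"

utf8_order = sorted([ichi, a], key=lambda c: c.encode("utf-8"))
eucjp_order = sorted([ichi, a], key=lambda c: c.encode("euc_jp"))
# utf8_order and eucjp_order come out reversed relative to each other.
```

This is why a byte-wise ORDER BY gives a sensible result only after converting to the encoding whose byte order matches the desired collation.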
On Tue, 2007-09-11 at 14:50 +0900, Tatsuo Ishii wrote:
On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
Please show me concrete examples how I could introduce a
vulnerability
using this kind of convert() usage.
Try the sequence below. Then, try to dump and then reload the
Try the sequence below. Then, try to dump and then reload the database.
When you try to reload it, you will get an error:
ERROR: invalid byte sequence for encoding UTF8: 0xbd
I know this could be a problem (like chr() with invalid byte pattern).
And that's enough of a problem already. We
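The reload failure comes from bytes that are simply not valid UTF-8: once convert() has smuggled them into a text column, the dump contains them, and the strict validation on reload rejects them. The specific byte from the error is easy to check (Python stand-in for the server's verifier):

```python
# 0xBD is a UTF-8 continuation byte with no lead byte; a strict
# decoder rejects it, which is what the reload error reports.
try:
    b"\xbd".decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
```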
Andrew Dunstan wrote:
Instead of the code point, I'd prefer the actual encoding of
the character as argument to chr() and return value of ascii().
And frankly, I don't know how to do it sanely anyway. A character
encoding has a fixed byte pattern, but a given byte pattern doesn't have a
On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
The reason we have a problem here is that we've been choosing
convenience over safety in encoding-related issues. I wonder if we must
stoop to having a strict_encoding_checks GUC variable to satisfy
everyone.
That would be satisfactory to
Jeff Davis [EMAIL PROTECTED] writes:
On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
It might work the way you are expecting if the database uses SQL_ASCII
encoding and C locale --- and I'd be fine with allowing convert() only
when the database encoding is SQL_ASCII.
I prefer this option.
Alvaro Herrera [EMAIL PROTECTED] writes:
Tom Lane wrote:
I think really the technically cleanest solution would be to make
convert() return bytea instead of text; then we'd not have to put
restrictions on what encoding or locale it's working inside of.
However, it's not clear to me whether
Tatsuo Ishii [EMAIL PROTECTED] writes:
If we make convert() operate on bytea and return bytea, as Tom
suggested, would that solve your use case?
The problem is, the above use case is just one of what I can think of.
Another use case is, something like this:
SELECT
However ISTM we would also need something like
length(bytea, name) returns int
-- counts the number of characters assuming that the bytea is in
-- the given encoding
Hmm, I wonder if counting chars is consistent regardless of the
encoding the string is in. To me it sounds
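On the consistency question: counting characters in a bytea is only well defined once you say which encoding the bytes are in, and the same byte string can have different lengths under different interpretations. A sketch of the proposed length(bytea, name) (Python stand-in):

```python
def length_in(raw: bytes, encoding: str) -> int:
    """Number of characters, assuming `raw` is in `encoding` --
    the proposed length(bytea, name)."""
    return len(raw.decode(encoding))

# Three bytes that are one character in UTF-8 but three in Latin-1.
raw = "亜".encode("utf-8")  # b'\xe4\xba\x9c'
```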
Tom Lane wrote:
- for chr() under UTF8, it seems to be generally agreed
that the argument should represent the codepoint and the
function should return the correspondingly encoded character.
If so, possibly the argument should be a bigint to
accommodate the full range of possible code points.
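For scale: Unicode code points stop at U+10FFFF, which fits in 21 bits, so even 4-byte UTF-8 never needs an argument wider than a 32-bit integer (a quick check in Python):

```python
MAX_CODEPOINT = 0x10FFFF  # highest valid Unicode code point

# 4-byte UTF-8 covers exactly the 21-bit code point range, so a
# plain int argument suffices; bigint is unnecessary.
assert MAX_CODEPOINT < 2 ** 21
assert len(chr(MAX_CODEPOINT).encode("utf-8")) == 4
```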
Tom Lane [EMAIL PROTECTED] writes:
Jeff Davis [EMAIL PROTECTED] writes:
I think the concern is when they use only one slash, like:
E'\377\000\377'::bytea
which, as I mentioned before, is not correct anyway.
Wait, why would this be wrong? How would you enter the three-byte bytea
consisting of those three bytes described above?
Either as
E'\\377\\000\\377'
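The difference between the two forms: with single backslashes the string-literal scanner turns \000 into a real NUL byte before the bytea input function ever runs, and C string handling stops at that NUL; with doubled backslashes the literal text \377\000\377 reaches byteain intact and is decoded there. Illustrated with Python bytes as a stand-in:

```python
# What the scanner produces for E'\377\000\377': three raw bytes,
# one of which is NUL.
scanned = b"\xff\x00\xff"

# A C string stops at the first NUL, so only one byte survives.
as_cstring = scanned.split(b"\x00")[0]

# With doubled backslashes, the scanner passes the 12-character text
# '\377\000\377' through unchanged, and byteain decodes it itself.
literal_for_byteain = r"\377\000\377"
```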
Albe Laurenz wrote:
I'd like to repeat my suggestion for chr() and ascii().
Instead of the code point, I'd prefer the actual encoding of
the character as argument to chr() and return value of ascii().
[snip]
Of course, if it is generally perceived that the code point
is more useful
Andrew Dunstan [EMAIL PROTECTED] writes:
Perhaps we're talking at cross purposes.
The problem with doing encoding validation in scan.l is that it lacks
context. Null bytes are only the tip of the bytea iceberg, since any
arbitrary sequence of bytes can be valid for a bytea.
If you think
Andrew Dunstan [EMAIL PROTECTED] writes:
The reason we are prepared to make an exception for Unicode is precisely
because the code point maps to an encoding pattern independently of
architecture, ISTM.
Right --- there is a well-defined standard for the numerical value of
each character in
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
Those should be checked already --- if not, the right fix is still to
fix it there, not in per-datatype code. I think we are OK though,
eg see need_transcoding logic in copy.c.
Well, a little experimentation shows that we currently
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error? If we're trying to get rid of embedded-null
problems, seems it must.
regards, tom lane
On Tue, Sep 11, 2007 at 12:30:51AM +0900, Tatsuo Ishii wrote:
I don't understand the whole discussion.
Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent
Unicode code point as UCS-4? Then you have to specify the endianness
anyway. (see the UCS-4
Tom Lane wrote:
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error? If we're trying to get rid of embedded-null
problems, seems it must.
I think it should, yes.
cheers
andrew
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error?
I think it should, yes.
OK. Looking back, there was also some mention of changing chr's
argument to bigint, but I'd counsel against doing
Tom Lane wrote:
OK. Looking back, there was also some mention of changing chr's
argument to bigint, but I'd counsel against doing that. We should not
need it since we only support 4-byte UTF8, hence code points only up to
21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points,
On Mon, Sep 10, 2007 at 11:48:29AM -0400, Tom Lane wrote:
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error? If we're trying to get rid of embedded-null
problems, seems it must.
It is pointed out on wikipedia that Java sometimes uses the byte pair C0 80
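For context: Java's "modified UTF-8" encodes NUL as the two-byte overlong sequence C0 80 precisely so that embedded NULs never appear in the byte stream, but strict UTF-8 validators reject overlong sequences (a quick check in Python):

```python
# C0 80 is an overlong encoding of U+0000: accepted by Java's
# "modified UTF-8", rejected by strict UTF-8 decoders.
def strict_utf8_ok(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

So the C0 80 trick cannot be used to sneak a NUL past a strict server-side check; it would fail validation like any other invalid sequence.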
Tatsuo Ishii [EMAIL PROTECTED] writes:
If you regard the unicode code point as simply a number, why not
regard the multibyte characters as a number too?
Because there's a standard specifying the Unicode code points *as
numbers*. The mapping from those numbers to UTF8 strings (and other
Tatsuo Ishii wrote:
If you regard the unicode code point as simply a number, why not
regard the multibyte characters as a number too? I mean, since 0xC2A9
= 49833, select chr(49833) should work fine no?
No. The number corresponding to a given byte pattern depends on the
endianness of
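The distinction can be made concrete: 0xC2A9 is the UTF-8 byte pattern for the copyright sign, not its code point. Read as a number, that byte pattern depends on byte order, while the code point U+00A9 is a single number fixed by the standard (Python illustration):

```python
copyright_sign = "\u00a9"  # code point 169 (U+00A9), fixed by Unicode
utf8_bytes = copyright_sign.encode("utf-8")  # the bytes C2 A9

# The same two bytes yield different numbers under different byte orders:
big = int.from_bytes(utf8_bytes, "big")        # 0xC2A9 = 49833
little = int.from_bytes(utf8_bytes, "little")  # 0xA9C2 = 43458

# chr() keyed on the code point is unambiguous on every architecture:
assert ord(copyright_sign) == 0xA9
```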
Tatsuo Ishii wrote:
BTW, it strikes me that there is another hole that we need to plug in
this area, and that's the convert() function. Being able to create
a value of type text that is not in the database encoding is simply
broken. Perhaps we could make it work on bytea instead (providing
a
Andrew Dunstan [EMAIL PROTECTED] writes:
I'm not sure we are going to be able to catch every path by which
invalid data can get into the database in one release. I suspect we
might need two or three goes at this. (I'm just wondering if the
routines that return cstrings are a possible
On Tue, 2007-09-11 at 11:53 +0900, Tatsuo Ishii wrote:
Isn't the collation a locale issue, not an encoding issue? Is there a
ja_JP.UTF-8 that defines the proper order?
I don't think it helps. The point is, he needs different language's
collation, while PostgreSQL allows only one
On Sun, Sep 09, 2007 at 12:02:28AM -0400, Andrew Dunstan wrote:
- what do we need to do to make the verification code more efficient? I
think we need to address the correctness issue first, but doing so
should certainly make us want to improve the verification code. For
example, I'm
Tom Lane wrote:
A possible answer is to add a verifymbstr to the string literal
converter anytime it has processed a numeric backslash-escape in the
string. Open questions for that are (1) does it have negative effects
for bytea, and if so is there any hope of working around it? (2) how
can
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Is that going to cover data coming in via COPY? and parameters for
prepared statements?
Those should be checked already --- if not, the right fix is still to
fix it there, not in per-datatype code. I think we are OK though,
On Sun, 2007-09-09 at 10:51 -0400, Tom Lane wrote:
A possible answer is to add a verifymbstr to the string literal
converter anytime it has processed a numeric backslash-escape in the
string. Open questions for that are (1) does it have negative effects
for bytea, and if so is there any hope
Andrew Dunstan [EMAIL PROTECTED] writes:
Well, a little experimentation shows that we currently are not OK:
This experiment is inadequately described.
What is the type of the column involved?
regards, tom lane
Jeff Davis [EMAIL PROTECTED] writes:
Currently, you can pass a bytea literal as either: E'\377\377\377' or
E'\\377\\377\\377'.
The first strategy (single backslash) is not correct, because if you do
E'\377\000\377', the embedded null character counts as the end of the
cstring, even though
On Sun, 2007-09-09 at 17:09 -0400, Tom Lane wrote:
Jeff Davis [EMAIL PROTECTED] writes:
Currently, you can pass a bytea literal as either: E'\377\377\377' or
E'\\377\\377\\377'.
The first strategy (single backslash) is not correct, because if you do
E'\377\000\377', the embedded null
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Well, a little experimentation shows that we currently are not OK:
This experiment is inadequately described.
What is the type of the column involved?
Sorry. It's text.
cheers
andrew
Jeff Davis [EMAIL PROTECTED] writes:
Would stringTypeDatum() in parse_type.c be a good place to put the
pg_verifymbstr()?
Probably not, in its current form, since it hasn't got any idea where
the char *string came from; moreover it is not in any better position
than the typinput function to
Tom Lane wrote:
In the short run it might be best to do it in scan.l after all.
I have not come up with a way of doing that and handling the bytea case.
If you have I'm all ears. And then I am still worried about COPY.
cheers
andrew
On Sun, 2007-09-09 at 23:22 -0400, Tom Lane wrote:
In the short run it might be best to do it in scan.l after all. A few
minutes' thought about what it'd take to delay the decisions till later
yields a depressingly large number of changes; and we do not have time
to be developing
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
In the short run it might be best to do it in scan.l after all.
I have not come up with a way of doing that and handling the bytea case.
AFAICS we have no realistic choice other than to reject \0 in SQL
literals; to do otherwise
Jeff Davis [EMAIL PROTECTED] writes:
If it's done in the scanner it should still accept things like:
E'\\377\\000\\377'::bytea
right?
Right, that will work, because the transformed literal is '\377\000\377'
(no strange characters there, just what it says) and that has not got
any encoding
I have been looking at fixing the issue of accepting strings that are
not valid in the database encoding. It appears from previous discussion
that we need to add a call to pg_verifymbstr() to the relevant input
routines and ensure that the chr() function returns a valid string. That
leaves
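The shape of the proposed fix can be sketched as follows (a hypothetical Python stand-in for pg_verifymbstr(); the real function lives in the server's multibyte support code): validate incoming bytes against the database encoding at the input boundary, before they ever become a text value.

```python
def verify_mb_string(raw: bytes, db_encoding: str = "utf-8") -> bytes:
    """Reject byte strings that are not valid in the database encoding,
    mirroring what a pg_verifymbstr() call in an input routine does.
    The db_encoding default is illustrative only."""
    try:
        raw.decode(db_encoding)
    except UnicodeDecodeError as err:
        raise ValueError(f"invalid byte sequence for encoding: {err}") from None
    return raw
```

Called from every text-producing input routine (and applied to chr()'s result), this ensures invalid sequences are rejected at entry rather than discovered at dump/reload time.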