Re: [HACKERS] [BUGS] BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

2014-02-19 Thread Tom Lane
I wrote:
 The minimum-refactoring solution to this would be to tweak
 pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but
 the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing.

 I'm not sure if this would break anything we need to have work,
 though.  Thoughts?  Do we want to back-patch such a change?

 I looked through all the callers of pg_do_encoding_conversion(), and
 AFAICS this change is a good idea.  There are a whole bunch of places
 that use pg_do_encoding_conversion() to convert from the database encoding
 to encoding X (most usually UTF8), and right now if you do that in a
 SQL_ASCII database you have no assurance whatever that what is produced
 is actually valid in encoding X.  I think we need to close that loophole.

The more I looked into mbutils.c, the less happy I got.  The attached
proposed patch takes care of the missing-verification hole in
pg_do_encoding_conversion() and pg_server_to_any(), and also gets rid
of what I believe to be obsolete provisions in pg_do_encoding_conversion
to work if called outside a transaction --- if you consider it working
to completely fail to honor its API contract.  That should no longer be
necessary now that we perform client-server encoding conversions via
perform_default_encoding_conversion rather than here.

For testing I inserted an Assert(IsTransactionState()); at the top of
pg_do_encoding_conversion(), and couldn't trigger it, so I'm fairly sure
this change is OK.  Still, back-patching it might not be a good thing.
On the other hand, if there is still any code path that can get here
outside a transaction, we probably ought to find out rather than allow
completely bogus data to be returned.

I also made some more-cosmetic improvements, notably removing a bunch
of Asserts that are certainly dead code because the relevant variables
are never NULL.

I've not done anything yet about simplifying unnecessary calls of
pg_do_encoding_conversion into pg_server_to_any/pg_any_to_server.

How much of this is back-patch material, do you think?

regards, tom lane

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 08440e9..7f43cae 100644
*** a/src/backend/utils/mb/mbutils.c
--- b/src/backend/utils/mb/mbutils.c
***
*** 1,10 
! /*
!  * This file contains public functions for conversion between
!  * client encoding and server (database) encoding.
   *
!  * Tatsuo Ishii
   *
!  * src/backend/utils/mb/mbutils.c
   */
  #include postgres.h
  
--- 1,36 
! /*-
   *
!  * mbutils.c
!  *	  This file contains functions for encoding conversion.
   *
!  * The string-conversion functions in this file share some API quirks.
!  * Note the following:
!  *
!  * The functions return a palloc'd, null-terminated string if conversion
!  * is required.  However, if no conversion is performed, the given source
!  * string pointer is returned as-is.
!  *
!  * Although the presence of a length argument means that callers can pass
!  * non-null-terminated strings, care is required because the same string
!  * will be passed back if no conversion occurs.  Such callers *must* check
!  * whether result == src and handle that case differently.
!  *
!  * If the source and destination encodings are the same, the source string
!  * is returned without any verification; it's assumed to be valid data.
!  * If that might not be the case, the caller is responsible for validating
!  * the string using a separate call to pg_verify_mbstr().  Whenever the
!  * source and destination encodings are different, the functions ensure that
!  * the result is validly encoded according to the destination encoding.
!  *
!  *
!  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
!  * Portions Copyright (c) 1994, Regents of the University of California
!  *
!  *
!  * IDENTIFICATION
!  *	  src/backend/utils/mb/mbutils.c
!  *
!  *-
   */
  #include postgres.h
  
*** InitializeClientEncoding(void)
*** 290,296 
  int
  pg_get_client_encoding(void)
  {
- 	Assert(ClientEncoding);
  	return ClientEncoding-encoding;
  }
  
--- 316,321 
*** pg_get_client_encoding(void)
*** 300,328 
  const char *
  pg_get_client_encoding_name(void)
  {
- 	Assert(ClientEncoding);
  	return ClientEncoding-name;
  }
  
  /*
!  * Apply encoding conversion on src and return it. The encoding
!  * conversion function is chosen from the pg_conversion system catalog
!  * marked as default. If it is not found in the schema search path,
!  * it's taken from pg_catalog schema. If it even is not in the schema,
!  * warn and return src.
!  *
!  * If conversion occurs, a palloc'd null-terminated string is returned.
!  * In the case of no conversion, src is returned.
!  *
!  * CAUTION: although the presence of a length argument means that callers
! 

Re: [HACKERS] [BUGS] BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

2014-02-18 Thread Tom Lane
I wrote:
 dig...@126.com writes:
 select t, t::bytea from convert_from('\xeec1', 'sql_ascii') as g(t);
 [ fails to check that string is valid in database encoding ]

 Hm, yeah.  Normal input to the database goes through pg_any_to_server(),
 which will apply a validation step if the source encoding is SQL_ASCII
 and the destination encoding is something else.  However, pg_convert and
 some other places call pg_do_encoding_conversion() directly, and that
 function will just quietly do nothing if either encoding is SQL_ASCII.

 The minimum-refactoring solution to this would be to tweak
 pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but
 the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing.

 I'm not sure if this would break anything we need to have work,
 though.  Thoughts?  Do we want to back-patch such a change?

I looked through all the callers of pg_do_encoding_conversion(), and
AFAICS this change is a good idea.  There are a whole bunch of places
that use pg_do_encoding_conversion() to convert from the database encoding
to encoding X (most usually UTF8), and right now if you do that in a
SQL_ASCII database you have no assurance whatever that what is produced
is actually valid in encoding X.  I think we need to close that loophole.

I found one place --- utf_u2e() in plperl_helpers.h --- that is aware of
the lack of checking and forces a pg_verify_mbstr call for itself; but
it apparently is concerned about whether the source data is actually utf8
in the first place, which I think is not really
pg_do_encoding_conversion's bailiwick.  I'm okay with
pg_do_encoding_conversion being a no-op if src_encoding == dest_encoding.

Barring objections, I will fix and back-patch this.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

2014-02-18 Thread Tom Lane
I wrote:
 I looked through all the callers of pg_do_encoding_conversion(), and
 AFAICS this change is a good idea.  There are a whole bunch of places
 that use pg_do_encoding_conversion() to convert from the database encoding
 to encoding X (most usually UTF8), and right now if you do that in a
 SQL_ASCII database you have no assurance whatever that what is produced
 is actually valid in encoding X.  I think we need to close that loophole.

BTW, a second look confirms that all but one or two of the callers of
pg_do_encoding_conversion() are in fact converting either to or from
the database encoding.  That probably means they ought to be using
pg_any_to_server or pg_server_to_any instead, so that they can take
advantage of the pre-cached default conversion in the not-unlikely
situation that the other encoding is the current client_encoding.

This would imply that we ought to put a validate-if-source-is-SQL_ASCII
step into pg_server_to_any() as well, since that function currently
short-circuits calling pg_do_encoding_conversion() in that case.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

2014-02-13 Thread Tom Lane
dig...@126.com writes:
 select t, t::bytea from convert_from('\xeec1', 'sql_ascii') as g(t);
 [ fails to check that string is valid in database encoding ]

Hm, yeah.  Normal input to the database goes through pg_any_to_server(),
which will apply a validation step if the source encoding is SQL_ASCII
and the destination encoding is something else.  However, pg_convert and
some other places call pg_do_encoding_conversion() directly, and that
function will just quietly do nothing if either encoding is SQL_ASCII.

The minimum-refactoring solution to this would be to tweak
pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but
the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing.

I'm not sure if this would break anything we need to have work,
though.  Thoughts?  Do we want to back-patch such a change?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers