I wrote:
>> The minimum-refactoring solution to this would be to tweak
>> pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but
>> the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing.
>> I'm not sure if this would break anything we need to have work,
>> though. Thoughts? Do we want to back-patch such a change?
> I looked through all the callers of pg_do_encoding_conversion(), and
> AFAICS this change is a good idea. There are a whole bunch of places
> that use pg_do_encoding_conversion() to convert from the database encoding
> to encoding X (most usually UTF8), and right now if you do that in a
> SQL_ASCII database you have no assurance whatever that what is produced
> is actually valid in encoding X. I think we need to close that loophole.
The more I looked into mbutils.c, the less happy I got. The attached
proposed patch takes care of the missing-verification hole in
pg_do_encoding_conversion() and pg_server_to_any(), and also gets rid
of what I believe to be obsolete provisions in pg_do_encoding_conversion
to "work" if called outside a transaction --- if you consider it working
to completely fail to honor its API contract. That should no longer be
necessary now that we perform client<->server encoding conversions via
perform_default_encoding_conversion rather than here.
For testing I inserted an "Assert(IsTransactionState());" at the top of
pg_do_encoding_conversion(), and couldn't trigger it, so I'm fairly sure
this change is OK. Still, back-patching it might not be a good thing.
On the other hand, if there is still any code path that can get here
outside a transaction, we probably ought to find out rather than allow
completely bogus data to be returned.
I also made some more-cosmetic improvements, notably removing a bunch
of Asserts that are certainly dead code because the relevant variables
are never NULL.
I've not done anything yet about simplifying unnecessary calls of
pg_do_encoding_conversion into pg_server_to_any/pg_any_to_server.
How much of this is back-patch material, do you think?
regards, tom lane
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 08440e9..7f43cae 100644
*** a/src/backend/utils/mb/mbutils.c
--- b/src/backend/utils/mb/mbutils.c
***************
*** 1,10 ****
! /*
! * This file contains public functions for conversion between
! * client encoding and server (database) encoding.
*
! * Tatsuo Ishii
*
! * src/backend/utils/mb/mbutils.c
*/
#include "postgres.h"
--- 1,36 ----
! /*-------------------------------------------------------------------------
*
! * mbutils.c
! * This file contains functions for encoding conversion.
*
! * The string-conversion functions in this file share some API quirks.
! * Note the following:
! *
! * The functions return a palloc'd, null-terminated string if conversion
! * is required. However, if no conversion is performed, the given source
! * string pointer is returned as-is.
! *
! * Although the presence of a length argument means that callers can pass
! * non-null-terminated strings, care is required because the same string
! * will be passed back if no conversion occurs. Such callers *must* check
! * whether result == src and handle that case differently.
! *
! * If the source and destination encodings are the same, the source string
! * is returned without any verification; it's assumed to be valid data.
! * If that might not be the case, the caller is responsible for validating
! * the string using a separate call to pg_verify_mbstr(). Whenever the
! * source and destination encodings are different, the functions ensure that
! * the result is validly encoded according to the destination encoding.
! *
! *
! * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
! * Portions Copyright (c) 1994, Regents of the University of California
! *
! *
! * IDENTIFICATION
! * src/backend/utils/mb/mbutils.c
! *
! *-------------------------------------------------------------------------
*/
#include "postgres.h"
*************** InitializeClientEncoding(void)
*** 290,296 ****
int
pg_get_client_encoding(void)
{
- Assert(ClientEncoding);
return ClientEncoding->encoding;
}
--- 316,321 ----
*************** pg_get_client_encoding(void)
*** 300,328 ****
const char *
pg_get_client_encoding_name(void)
{
- Assert(ClientEncoding);
return ClientEncoding->name;
}
/*
! * Apply encoding conversion on src and return it. The encoding
! * conversion function is chosen from the pg_conversion system catalog
! * marked as "default". If it is not found in the schema search path,
! * it's taken from pg_catalog schema. If it even is not in the schema,
! * warn and return src.
! *
! * If conversion occurs, a palloc'd null-terminated string is returned.
! * In the case of no conversion, src is returned.
! *
! * CAUTION: although the presence of a length argument means that callers
! * can pass non-null-terminated strings, care is required because the same
! * string will be passed back if no conversion occurs. Such callers *must*
! * check whether result == src and handle that case differently.
*
! * Note: we try to avoid raising error, since that could get us into
! * infinite recursion when this function is invoked during error message
! * sending. It should be OK to raise error for overlength strings though,
! * since the recursion will come with a shorter message.
*/
unsigned char *
pg_do_encoding_conversion(unsigned char *src, int len,
--- 325,337 ----
const char *
pg_get_client_encoding_name(void)
{
return ClientEncoding->name;
}
/*
! * Convert src string to another encoding (general case).
*
! * See the notes about string conversion functions at the top of this file.
*/
unsigned char *
pg_do_encoding_conversion(unsigned char *src, int len,
*************** pg_do_encoding_conversion(unsigned char
*** 331,369 ****
unsigned char *result;
Oid proc;
! if (!IsTransactionState())
! return src;
if (src_encoding == dest_encoding)
! return src;
! if (src_encoding == PG_SQL_ASCII || dest_encoding == PG_SQL_ASCII)
! return src;
! if (len <= 0)
return src;
proc = FindDefaultConversionProc(src_encoding, dest_encoding);
if (!OidIsValid(proc))
! {
! ereport(LOG,
(errcode(ERRCODE_UNDEFINED_FUNCTION),
errmsg("default conversion function for encoding \"%s\" to \"%s\" does not exist",
pg_encoding_to_char(src_encoding),
pg_encoding_to_char(dest_encoding))));
- return src;
- }
-
- /*
- * XXX we should avoid throwing errors in OidFunctionCall. Otherwise we
- * are going into infinite loop! So we have to make sure that the
- * function exists before calling OidFunctionCall.
- */
- if (!SearchSysCacheExists1(PROCOID, ObjectIdGetDatum(proc)))
- {
- elog(LOG, "cache lookup failed for function %u", proc);
- return src;
- }
/*
* Allocate space for conversion result, being wary of integer overflow
--- 340,371 ----
unsigned char *result;
Oid proc;
! if (len <= 0)
! return src; /* empty string is always valid */
if (src_encoding == dest_encoding)
! return src; /* no conversion required, assume valid */
! if (dest_encoding == PG_SQL_ASCII)
! return src; /* any string is valid in SQL_ASCII */
! if (src_encoding == PG_SQL_ASCII)
! {
! /* No conversion is possible, but we must validate the result */
! (void) pg_verify_mbstr(dest_encoding, (const char *) src, len, false);
return src;
+ }
+
+ if (!IsTransactionState()) /* shouldn't happen */
+ elog(ERROR, "cannot perform encoding conversion outside a transaction");
proc = FindDefaultConversionProc(src_encoding, dest_encoding);
if (!OidIsValid(proc))
! ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_FUNCTION),
errmsg("default conversion function for encoding \"%s\" to \"%s\" does not exist",
pg_encoding_to_char(src_encoding),
pg_encoding_to_char(dest_encoding))));
/*
* Allocate space for conversion result, being wary of integer overflow
*************** pg_do_encoding_conversion(unsigned char
*** 387,393 ****
}
/*
! * Convert string using encoding_name. The source
* encoding is the DB encoding.
*
* BYTEA convert_to(TEXT string, NAME encoding_name) */
--- 389,395 ----
}
/*
! * Convert string to encoding encoding_name. The source
* encoding is the DB encoding.
*
* BYTEA convert_to(TEXT string, NAME encoding_name) */
*************** pg_convert_to(PG_FUNCTION_ARGS)
*** 412,418 ****
}
/*
! * Convert string using encoding_name. The destination
* encoding is the DB encoding.
*
* TEXT convert_from(BYTEA string, NAME encoding_name) */
--- 414,420 ----
}
/*
! * Convert string from encoding encoding_name. The destination
* encoding is the DB encoding.
*
* TEXT convert_from(BYTEA string, NAME encoding_name) */
*************** pg_convert_from(PG_FUNCTION_ARGS)
*** 439,445 ****
}
/*
! * Convert string using encoding_names.
*
* BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name)
*/
--- 441,447 ----
}
/*
! * Convert string between two arbitrary encodings.
*
* BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name)
*/
*************** pg_convert(PG_FUNCTION_ARGS)
*** 472,479 ****
src_str = VARDATA_ANY(string);
pg_verify_mbstr_len(src_encoding, src_str, len, false);
! dest_str = (char *) pg_do_encoding_conversion(
! (unsigned char *) src_str, len, src_encoding, dest_encoding);
if (dest_str != src_str)
len = strlen(dest_str);
--- 474,486 ----
src_str = VARDATA_ANY(string);
pg_verify_mbstr_len(src_encoding, src_str, len, false);
! /* perform conversion */
! dest_str = (char *) pg_do_encoding_conversion((unsigned char *) src_str,
! len,
! src_encoding,
! dest_encoding);
!
! /* update len if conversion actually happened */
if (dest_str != src_str)
len = strlen(dest_str);
*************** pg_convert(PG_FUNCTION_ARGS)
*** 503,512 ****
Datum
length_in_encoding(PG_FUNCTION_ARGS)
{
! bytea *string = PG_GETARG_BYTEA_P(0);
char *src_encoding_name = NameStr(*PG_GETARG_NAME(1));
int src_encoding = pg_char_to_encoding(src_encoding_name);
! int len = VARSIZE(string) - VARHDRSZ;
int retval;
if (src_encoding < 0)
--- 510,520 ----
Datum
length_in_encoding(PG_FUNCTION_ARGS)
{
! bytea *string = PG_GETARG_BYTEA_PP(0);
char *src_encoding_name = NameStr(*PG_GETARG_NAME(1));
int src_encoding = pg_char_to_encoding(src_encoding_name);
! const char *src_str;
! int len;
int retval;
if (src_encoding < 0)
*************** length_in_encoding(PG_FUNCTION_ARGS)
*** 515,525 ****
errmsg("invalid encoding name \"%s\"",
src_encoding_name)));
! retval = pg_verify_mbstr_len(src_encoding, VARDATA(string), len, false);
! PG_RETURN_INT32(retval);
}
Datum
pg_encoding_max_length_sql(PG_FUNCTION_ARGS)
{
--- 523,541 ----
errmsg("invalid encoding name \"%s\"",
src_encoding_name)));
! len = VARSIZE_ANY_EXHDR(string);
! src_str = VARDATA_ANY(string);
+ retval = pg_verify_mbstr_len(src_encoding, src_str, len, false);
+
+ PG_RETURN_INT32(retval);
}
+ /*
+ * Get maximum multibyte character length in the specified encoding.
+ *
+ * Note encoding is specified numerically, not by name as above.
+ */
Datum
pg_encoding_max_length_sql(PG_FUNCTION_ARGS)
{
*************** pg_encoding_max_length_sql(PG_FUNCTION_A
*** 532,558 ****
}
/*
! * convert client encoding to server encoding.
*/
char *
pg_client_to_server(const char *s, int len)
{
- Assert(ClientEncoding);
-
return pg_any_to_server(s, len, ClientEncoding->encoding);
}
/*
! * convert any encoding to server encoding.
*/
char *
pg_any_to_server(const char *s, int len, int encoding)
{
- Assert(DatabaseEncoding);
- Assert(ClientEncoding);
-
if (len <= 0)
! return (char *) s;
if (encoding == DatabaseEncoding->encoding ||
encoding == PG_SQL_ASCII)
--- 548,578 ----
}
/*
! * Convert client encoding to server encoding.
! *
! * See the notes about string conversion functions at the top of this file.
*/
char *
pg_client_to_server(const char *s, int len)
{
return pg_any_to_server(s, len, ClientEncoding->encoding);
}
/*
! * Convert any encoding to server encoding.
! *
! * See the notes about string conversion functions at the top of this file.
! *
! * Unlike the other string conversion functions, this will apply validation
! * even if encoding == DatabaseEncoding->encoding. This is because this is
! * used to process data coming in from outside the database, and we never
! * want to just assume validity.
*/
char *
pg_any_to_server(const char *s, int len, int encoding)
{
if (len <= 0)
! return (char *) s; /* empty string is always valid */
if (encoding == DatabaseEncoding->encoding ||
encoding == PG_SQL_ASCII)
*************** pg_any_to_server(const char *s, int len,
*** 594,639 ****
return (char *) s;
}
! if (ClientEncoding->encoding == encoding)
return perform_default_encoding_conversion(s, len, true);
! else
! return (char *) pg_do_encoding_conversion(
! (unsigned char *) s, len, encoding, DatabaseEncoding->encoding);
}
/*
! * convert server encoding to client encoding.
*/
char *
pg_server_to_client(const char *s, int len)
{
- Assert(ClientEncoding);
-
return pg_server_to_any(s, len, ClientEncoding->encoding);
}
/*
! * convert server encoding to any encoding.
*/
char *
pg_server_to_any(const char *s, int len, int encoding)
{
- Assert(DatabaseEncoding);
- Assert(ClientEncoding);
-
if (len <= 0)
! return (char *) s;
if (encoding == DatabaseEncoding->encoding ||
! encoding == PG_SQL_ASCII ||
! DatabaseEncoding->encoding == PG_SQL_ASCII)
return (char *) s; /* assume data is valid */
! if (ClientEncoding->encoding == encoding)
return perform_default_encoding_conversion(s, len, false);
! else
! return (char *) pg_do_encoding_conversion(
! (unsigned char *) s, len, DatabaseEncoding->encoding, encoding);
}
/*
--- 614,672 ----
return (char *) s;
}
! /* Fast path if we can use cached conversion function */
! if (encoding == ClientEncoding->encoding)
return perform_default_encoding_conversion(s, len, true);
!
! /* General case ... will not work outside transactions */
! return (char *) pg_do_encoding_conversion((unsigned char *) s,
! len,
! encoding,
! DatabaseEncoding->encoding);
}
/*
! * Convert server encoding to client encoding.
! *
! * See the notes about string conversion functions at the top of this file.
*/
char *
pg_server_to_client(const char *s, int len)
{
return pg_server_to_any(s, len, ClientEncoding->encoding);
}
/*
! * Convert server encoding to any encoding.
! *
! * See the notes about string conversion functions at the top of this file.
*/
char *
pg_server_to_any(const char *s, int len, int encoding)
{
if (len <= 0)
! return (char *) s; /* empty string is always valid */
if (encoding == DatabaseEncoding->encoding ||
! encoding == PG_SQL_ASCII)
return (char *) s; /* assume data is valid */
! if (DatabaseEncoding->encoding == PG_SQL_ASCII)
! {
! /* No conversion is possible, but we must validate the result */
! (void) pg_verify_mbstr(encoding, s, len, false);
! return (char *) s;
! }
!
! /* Fast path if we can use cached conversion function */
! if (encoding == ClientEncoding->encoding)
return perform_default_encoding_conversion(s, len, false);
!
! /* General case ... will not work outside transactions */
! return (char *) pg_do_encoding_conversion((unsigned char *) s,
! len,
! DatabaseEncoding->encoding,
! encoding);
}
/*
*************** pg_server_to_any(const char *s, int len,
*** 643,649 ****
* SetClientEncoding(), no conversion is performed.
*/
static char *
! perform_default_encoding_conversion(const char *src, int len, bool is_client_to_server)
{
char *result;
int src_encoding,
--- 676,683 ----
* SetClientEncoding(), no conversion is performed.
*/
static char *
! perform_default_encoding_conversion(const char *src, int len,
! bool is_client_to_server)
{
char *result;
int src_encoding,
*************** raw_pg_bind_textdomain_codeset(const cha
*** 931,941 ****
* On most platforms, gettext defaults to the codeset implied by LC_CTYPE.
* When that matches the database encoding, we don't need to do anything. In
* CREATE DATABASE, we enforce or trust that the locale's codeset matches the
! * database encoding, except for the C locale. (On Windows, we also permit a
* discrepancy under the UTF8 encoding.) For the C locale, explicitly bind
* gettext to the right codeset.
*
! * On Windows, gettext defaults to the Windows ANSI code page. This is a
* convenient departure for software that passes the strings to Windows ANSI
* APIs, but we don't do that. Compel gettext to use database encoding or,
* failing that, the LC_CTYPE encoding as it would on other platforms.
--- 965,975 ----
* On most platforms, gettext defaults to the codeset implied by LC_CTYPE.
* When that matches the database encoding, we don't need to do anything. In
* CREATE DATABASE, we enforce or trust that the locale's codeset matches the
! * database encoding, except for the C locale. (On Windows, we also permit a
* discrepancy under the UTF8 encoding.) For the C locale, explicitly bind
* gettext to the right codeset.
*
! * On Windows, gettext defaults to the Windows ANSI code page. This is a
* convenient departure for software that passes the strings to Windows ANSI
* APIs, but we don't do that. Compel gettext to use database encoding or,
* failing that, the LC_CTYPE encoding as it would on other platforms.
*************** pg_bind_textdomain_codeset(const char *d
*** 980,1007 ****
int
GetDatabaseEncoding(void)
{
- Assert(DatabaseEncoding);
return DatabaseEncoding->encoding;
}
const char *
GetDatabaseEncodingName(void)
{
- Assert(DatabaseEncoding);
return DatabaseEncoding->name;
}
Datum
getdatabaseencoding(PG_FUNCTION_ARGS)
{
- Assert(DatabaseEncoding);
return DirectFunctionCall1(namein, CStringGetDatum(DatabaseEncoding->name));
}
Datum
pg_client_encoding(PG_FUNCTION_ARGS)
{
- Assert(ClientEncoding);
return DirectFunctionCall1(namein, CStringGetDatum(ClientEncoding->name));
}
--- 1014,1037 ----
*************** pg_client_encoding(PG_FUNCTION_ARGS)
*** 1014,1020 ****
int
GetMessageEncoding(void)
{
- Assert(MessageEncoding);
return MessageEncoding->encoding;
}
--- 1044,1049 ----
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers