Re: [HACKERS] bytea vs. pg_dump

Peter Eisentraut Sat, 11 Jul 2009 03:41:56 -0700

On Wednesday 08 July 2009 01:07:08 Tom Lane wrote:
> Peter Eisentraut <pete...@gmx.net> writes:
> > Here is a first cut at a new hex bytea input and output format.  Example:
> > ...
> > SET bytea_output_hex = true;
> >
> > Should the configuration parameter be a boolean or an enum, opening
> > possibilities for other formats?
>
> Enum.  If we do this then it seems entirely fair that someone might
> want other settings someday.  Also, it seems silly to pick a format
> partly on the grounds that it's expansible, and then not make the
> control GUC expansible.  Perhaps
>
>       SET bytea_output = [ hex | traditional ]


OK, here is an updated patch.  It has the setting as enum, completed 
documentation, and libpq support.  I'll add it to the commit fest in the hope 
that someone else can look it over in detail.

I'm attaching two versions of the patch.  One it made with the -w option, 
which leads to less differences.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2572d78..fece041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3703,6 +3703,23 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
      <title>Statement Behavior</title>
      <variablelist>
 
+     <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
+      <term><varname>bytea_output</varname> (<type>enum</type>)</term>
+      <indexterm>
+       <primary><varname>bytea_output</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+	Sets the output format for values of type <type>bytea</type>.
+	Valid values are <literal>hex</literal> (the default)
+	and <literal>escape</literal> (the traditional PostgreSQL
+	format).  The <xref linkend="datatype-binary"> for more
+	information.  Note that the <type>bytea</type> type always
+	accepts both formats on input.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-search-path" xreflabel="search_path">
       <term><varname>search_path</varname> (<type>string</type>)</term>
       <indexterm>
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index c944d8f..bdead3e 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL$ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -->
 
  <chapter id="datatype">
   <title id="datatype-title">Data Types</title>
@@ -1189,6 +1189,66 @@ SELECT b, char_length(b) FROM test2;
    </para>
 
    <para>
+    The <type>bytea</type> type supports two external formats for
+    input and output: the <quote>escape</quote> format that is
+    particular to PostgreSQL, and the <quote>hex</quote> format.  Both
+    of these are always accepted on input.  The output format depends
+    on the configuration parameter <xref linkend="guc-bytea-output">;
+    the default is hex.  (Note that the hex format was introduced in
+    PostgreSQL 8.5; so earlier version and some tools don't understand
+    it.)
+   </para>
+
+   <para>
+        The <acronym>SQL</acronym> standard defines a different binary
+        string type, called <type>BLOB</type> or <type>BINARY LARGE
+        OBJECT</type>.  The input format is different from
+        <type>bytea</type>, but the provided functions and operators are
+        mostly the same.
+   </para>
+
+  <sect2>
+   <title>The Hex Format</title>
+
+   <para>
+    The hex format encodes the binary data as 2 hexadecimal digits per
+    byte, highest significant nibble first.  The entire string ist
+    preceded by the sequence <literal>\x</literal> (to distinguish it
+    from the bytea format).  In SQL literals, the backslash may need
+    to be escaped, but it is one logical backslash as far as the
+    <type>bytea</type> type is concerned.  The hex format is compatible with a wide
+    range of external applications and protocols, and it tends to be
+    faster than the traditional bytea format, so its use is
+    somewhat preferrable.
+   </para>
+
+   <para>
+    Example:
+<programlisting>
+SELECT E'\\xDEADBEEF';
+</programlisting>
+   </para>
+  </sect2>
+
+  <sect2>
+   <title>The Escape Format</title>
+
+   <para>
+    The <quote>escape</quote> format is the traditional
+    PostgreSQL-specific format for the <type>bytea</type> type.  It
+    takes the approach of representing a binary string as a sequence
+    of ASCII characters and escaping those bytes that cannot be
+    represented as an ASCII character by a special escape sequence.
+    If, from the point of view of the application, representing bytes
+    as characters makes sense, then this representation can be
+    convenient, but in practice it is usually confusing becauses it
+    fuzzes up the distinction between binary strings and characters
+    strings, and the particular escape mechanism that was chosen is
+    also somewhat unwieldy.  So this format should probably not be
+    used for most new applications.
+   </para>
+
+   <para>
     When entering <type>bytea</type> values, octets of certain
     values <emphasis>must</emphasis> be escaped (but all octet
     values <emphasis>can</emphasis> be escaped) when used as part
@@ -1341,14 +1401,7 @@ SELECT b, char_length(b) FROM test2;
     have to escape line feeds and carriage returns if your interface
     automatically translates these.
    </para>
-
-   <para>
-        The <acronym>SQL</acronym> standard defines a different binary
-        string type, called <type>BLOB</type> or <type>BINARY LARGE
-        OBJECT</type>.  The input format is different from
-        <type>bytea</type>, but the provided functions and operators are
-        mostly the same.
-   </para>
+  </sect2>
  </sect1>
 
 
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index eed799a..b8a3cef 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -122,8 +122,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-static unsigned
-hex_encode(const char *src, unsigned len, char *dst)
+size_t
+hex_encode(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -152,8 +152,8 @@ get_hex(char c)
 	return (char) res;
 }
 
-static unsigned
-hex_decode(const char *src, unsigned len, char *dst)
+size_t
+hex_decode(const char *src, size_t len, char *dst)
 {
 	const char *s,
 			   *srcend;
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 4cf3966..7b80bc9 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -62,6 +62,8 @@ static text *text_substring(Datum str,
 			   bool length_not_specified);
 static void appendStringInfoText(StringInfo str, const text *t);
 
+int bytea_output;
+
 
 /*****************************************************************************
  *	 CONVERSION ROUTINES EXPORTED FOR USE BY C CODE							 *
@@ -189,6 +191,18 @@ byteain(PG_FUNCTION_ARGS)
 	int			byte;
 	bytea	   *result;
 
+	if (inputText[0] == '\\' && inputText[1] == 'x')
+	{
+		size_t len = strlen(inputText);
+
+		byte = (len - 2)/2 + VARHDRSZ;
+		result = palloc(byte);
+		SET_VARSIZE(result, byte);
+		hex_decode(inputText + 2, len - 2, VARDATA(result));
+
+		PG_RETURN_BYTEA_P(result);
+	}
+
 	for (byte = 0, tp = inputText; *tp != '\0'; byte++)
 	{
 		if (tp[0] != '\\')
@@ -266,47 +280,61 @@ Datum
 byteaout(PG_FUNCTION_ARGS)
 {
 	bytea	   *vlena = PG_GETARG_BYTEA_PP(0);
-	char	   *result;
+	char	   *result = NULL;
 	char	   *vp;
-	char	   *rp;
-	int			val;			/* holds unprintable chars */
+	char	   *rp = NULL;
 	int			i;
-	int			len;
 
-	len = 1;					/* empty string has 1 char */
-	vp = VARDATA_ANY(vlena);
-	for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
+	if (bytea_output == BYTEA_OUTPUT_HEX)
 	{
-		if (*vp == '\\')
-			len += 2;
-		else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
-			len += 4;
-		else
-			len++;
+		rp = result = palloc(VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1);
+		*rp++ = '\\';
+		*rp++ = 'x';
+		rp += hex_encode(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena), rp);
 	}
-	rp = result = (char *) palloc(len);
-	vp = VARDATA_ANY(vlena);
-	for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
+	else if (bytea_output == BYTEA_OUTPUT_ESCAPE)
 	{
-		if (*vp == '\\')
+		int			len;
+
+		len = 1;					/* empty string has 1 char */
+		vp = VARDATA_ANY(vlena);
+		for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
 		{
-			*rp++ = '\\';
-			*rp++ = '\\';
+			if (*vp == '\\')
+				len += 2;
+			else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
+				len += 4;
+			else
+				len++;
 		}
-		else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
+		rp = result = (char *) palloc(len);
+		vp = VARDATA_ANY(vlena);
+		for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
 		{
-			val = *vp;
-			rp[0] = '\\';
-			rp[3] = DIG(val & 07);
-			val >>= 3;
-			rp[2] = DIG(val & 07);
-			val >>= 3;
-			rp[1] = DIG(val & 03);
-			rp += 4;
+			if (*vp == '\\')
+			{
+				*rp++ = '\\';
+				*rp++ = '\\';
+			}
+			else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
+			{
+				int			val;			/* holds unprintable chars */
+
+				val = *vp;
+				rp[0] = '\\';
+				rp[3] = DIG(val & 07);
+				val >>= 3;
+				rp[2] = DIG(val & 07);
+				val >>= 3;
+				rp[1] = DIG(val & 03);
+				rp += 4;
+			}
+			else
+				*rp++ = *vp;
 		}
-		else
-			*rp++ = *vp;
 	}
+	else
+		elog(ERROR, "unrecognized bytea_output setting: %d", bytea_output);
 	*rp = '\0';
 	PG_RETURN_CSTRING(result);
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4f06725..19b1f46 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -180,6 +180,12 @@ static char *config_enum_get_options(struct config_enum *record,
  * NOTE! Option values may not contain double quotes!
  */
 
+static const struct config_enum_entry bytea_output_options[] = {
+	{"escape", BYTEA_OUTPUT_ESCAPE, false},
+	{"hex", BYTEA_OUTPUT_HEX, false},
+	{NULL, 0, false}
+};
+
 /*
  * We have different sets for client and server message level options because
  * they sort slightly different (see "log" level)
@@ -2533,6 +2539,15 @@ static struct config_enum ConfigureNamesEnum[] =
 	},
 
 	{
+		{"bytea_output", PGC_USERSET, CLIENT_CONN_OTHER,
+			gettext_noop("Sets the bytea output format to the hex format."),
+			NULL
+		},
+		&bytea_output,
+		BYTEA_OUTPUT_HEX, bytea_output_options, NULL, NULL
+	},
+
+	{
 		{"client_min_messages", PGC_USERSET, LOGGING_WHEN,
 			gettext_noop("Sets the message levels that are sent to the client."),
 			gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index c1b9393..0c93a7e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -704,6 +704,13 @@ extern Datum unknownout(PG_FUNCTION_ARGS);
 extern Datum unknownrecv(PG_FUNCTION_ARGS);
 extern Datum unknownsend(PG_FUNCTION_ARGS);
 
+typedef enum {
+	BYTEA_OUTPUT_ESCAPE,
+	BYTEA_OUTPUT_HEX
+} ByteaOutputType;
+
+extern int bytea_output;		/* ByteaOutputType, but int for guc enum */
+
 extern Datum byteain(PG_FUNCTION_ARGS);
 extern Datum byteaout(PG_FUNCTION_ARGS);
 extern Datum bytearecv(PG_FUNCTION_ARGS);
@@ -728,6 +735,9 @@ extern Datum bytea_substr(PG_FUNCTION_ARGS);
 extern Datum bytea_substr_no_len(PG_FUNCTION_ARGS);
 extern Datum pg_column_size(PG_FUNCTION_ARGS);
 
+extern size_t hex_encode(const char *src, size_t len, char *dst);
+extern size_t hex_decode(const char *src, size_t len, char *dst);
+
 /* version.c */
 extern Datum pgsql_version(PG_FUNCTION_ARGS);
 
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index c1934ad..66d1114 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -3166,6 +3166,29 @@ PQescapeBytea(const unsigned char *from, size_t from_length, size_t *to_length)
 }
 
 
+static const int8 hexlookup[128] = {
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, -1, -1, -1, -1, -1,
+	-1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+};
+
+static char
+get_hex(char c)
+{
+	int			res = -1;
+
+	if (c > 0 && c < 127)
+		res = hexlookup[(unsigned char) c];
+
+	return (char) res;
+}
+
+
 #define ISFIRSTOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '3')
 #define ISOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '7')
 #define OCTVAL(CH) ((CH) - '0')
@@ -3197,6 +3220,37 @@ PQunescapeBytea(const unsigned char *strtext, size_t *retbuflen)
 
 	strtextlen = strlen((const char *) strtext);
 
+	if (strtext[0] == '\\' && strtext[1] == 'x')
+	{
+		const unsigned char *s;
+		unsigned char	*p;
+
+		*retbuflen = (strtextlen - 2)/2;
+		/* Avoid unportable malloc(0) */
+		buffer = malloc(*retbuflen > 0 ? *retbuflen : 1);
+		if (buffer == NULL)
+			return NULL;
+
+		s = strtext + 2;
+		p = buffer;
+		while (*s)
+		{
+			char	v1,	v2;
+
+			v1 = get_hex(*s++) << 4;
+			if (!*s)
+			{
+				*retbuflen = -1;
+				return NULL;
+			}
+
+			v2 = get_hex(*s++);
+			*p++ = v1 | v2;
+		}
+
+		return buffer;
+	}
+	else {
 	/*
 	 * Length of input is max length of output, but add one to avoid
 	 * unportable malloc(0) if input is zero-length.
@@ -3257,4 +3311,5 @@ PQunescapeBytea(const unsigned char *strtext, size_t *retbuflen)
 
 	*retbuflen = buflen;
 	return tmpbuf;
+	}
 }
diff --git a/src/test/regress/expected/conversion.out b/src/test/regress/expected/conversion.out
index e80e1a4..8f08706 100644
--- a/src/test/regress/expected/conversion.out
+++ b/src/test/regress/expected/conversion.out
@@ -1,3 +1,4 @@
+SET bytea_output TO escape;
 --
 -- create user defined conversion
 --
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index be8eb91..0315341 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -97,6 +97,87 @@ LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
                ^
 DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
 RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+   bytea    
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR:  invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+               ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR:  invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+               ^
+SELECT E'\\xDe00BeEf'::bytea;
+   bytea    
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+       bytea        
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+       bytea        
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+       bytea        
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+       bytea        
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+               ^
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+      bytea       
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+      bytea       
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+  bytea   
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+    bytea    
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+  bytea   
+----------
+ DeSdBeEf
+(1 row)
+
 --
 -- test conversions between various string types
 -- E021-10 implicit casting among the character data types
diff --git a/src/test/regress/input/largeobject.source b/src/test/regress/input/largeobject.source
index 46ba926..5bfba18 100644
--- a/src/test/regress/input/largeobject.source
+++ b/src/test/regress/input/largeobject.source
@@ -2,6 +2,8 @@
 -- Test large object support
 --
 
+SET bytea_output TO escape;
+
 -- Load a file
 CREATE TABLE lotest_stash_values (loid oid, fd integer);
 -- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject.source b/src/test/regress/output/largeobject.source
index 9d69f6c..5ff2e3b 100644
--- a/src/test/regress/output/largeobject.source
+++ b/src/test/regress/output/largeobject.source
@@ -1,6 +1,7 @@
 --
 -- Test large object support
 --
+SET bytea_output TO escape;
 -- Load a file
 CREATE TABLE lotest_stash_values (loid oid, fd integer);
 -- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject_1.source b/src/test/regress/output/largeobject_1.source
index 1fbc29c..0ece752 100644
--- a/src/test/regress/output/largeobject_1.source
+++ b/src/test/regress/output/largeobject_1.source
@@ -1,6 +1,7 @@
 --
 -- Test large object support
 --
+SET bytea_output_hex TO false;
 -- Load a file
 CREATE TABLE lotest_stash_values (loid oid, fd integer);
 -- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/sql/conversion.sql b/src/test/regress/sql/conversion.sql
index 99a9178..c4bb583 100644
--- a/src/test/regress/sql/conversion.sql
+++ b/src/test/regress/sql/conversion.sql
@@ -1,3 +1,5 @@
+SET bytea_output TO escape;
+
 --
 -- create user defined conversion
 --
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index a28c75a..3e9cfa3 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -43,6 +43,25 @@ SELECT U&'wrong: +0061' UESCAPE '+';
 
 RESET standard_conforming_strings;
 
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDeAdBeE'::bytea;
+SELECT E'\\xDeAdBeEx'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\123dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+SELECT E'De\\678dBeEf'::bytea;
+
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+
 --
 -- test conversions between various string types
 -- E021-10 implicit casting among the character data types

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2572d78..fece041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3703,6 +3703,23 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
      <title>Statement Behavior</title>
      <variablelist>
 
+     <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
+      <term><varname>bytea_output</varname> (<type>enum</type>)</term>
+      <indexterm>
+       <primary><varname>bytea_output</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+	Sets the output format for values of type <type>bytea</type>.
+	Valid values are <literal>hex</literal> (the default)
+	and <literal>escape</literal> (the traditional PostgreSQL
+	format).  The <xref linkend="datatype-binary"> for more
+	information.  Note that the <type>bytea</type> type always
+	accepts both formats on input.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-search-path" xreflabel="search_path">
       <term><varname>search_path</varname> (<type>string</type>)</term>
       <indexterm>
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index c944d8f..bdead3e 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL$ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -->
 
  <chapter id="datatype">
   <title id="datatype-title">Data Types</title>
@@ -1189,6 +1189,66 @@ SELECT b, char_length(b) FROM test2;
    </para>
 
    <para>
+    The <type>bytea</type> type supports two external formats for
+    input and output: the <quote>escape</quote> format that is
+    particular to PostgreSQL, and the <quote>hex</quote> format.  Both
+    of these are always accepted on input.  The output format depends
+    on the configuration parameter <xref linkend="guc-bytea-output">;
+    the default is hex.  (Note that the hex format was introduced in
+    PostgreSQL 8.5; so earlier version and some tools don't understand
+    it.)
+   </para>
+
+   <para>
+        The <acronym>SQL</acronym> standard defines a different binary
+        string type, called <type>BLOB</type> or <type>BINARY LARGE
+        OBJECT</type>.  The input format is different from
+        <type>bytea</type>, but the provided functions and operators are
+        mostly the same.
+   </para>
+
+  <sect2>
+   <title>The Hex Format</title>
+
+   <para>
+    The hex format encodes the binary data as 2 hexadecimal digits per
+    byte, highest significant nibble first.  The entire string ist
+    preceded by the sequence <literal>\x</literal> (to distinguish it
+    from the bytea format).  In SQL literals, the backslash may need
+    to be escaped, but it is one logical backslash as far as the
+    <type>bytea</type> type is concerned.  The hex format is compatible with a wide
+    range of external applications and protocols, and it tends to be
+    faster than the traditional bytea format, so its use is
+    somewhat preferrable.
+   </para>
+
+   <para>
+    Example:
+<programlisting>
+SELECT E'\\xDEADBEEF';
+</programlisting>
+   </para>
+  </sect2>
+
+  <sect2>
+   <title>The Escape Format</title>
+
+   <para>
+    The <quote>escape</quote> format is the traditional
+    PostgreSQL-specific format for the <type>bytea</type> type.  It
+    takes the approach of representing a binary string as a sequence
+    of ASCII characters and escaping those bytes that cannot be
+    represented as an ASCII character by a special escape sequence.
+    If, from the point of view of the application, representing bytes
+    as characters makes sense, then this representation can be
+    convenient, but in practice it is usually confusing becauses it
+    fuzzes up the distinction between binary strings and characters
+    strings, and the particular escape mechanism that was chosen is
+    also somewhat unwieldy.  So this format should probably not be
+    used for most new applications.
+   </para>
+
+   <para>
     When entering <type>bytea</type> values, octets of certain
     values <emphasis>must</emphasis> be escaped (but all octet
     values <emphasis>can</emphasis> be escaped) when used as part
@@ -1341,14 +1401,7 @@ SELECT b, char_length(b) FROM test2;
     have to escape line feeds and carriage returns if your interface
     automatically translates these.
    </para>
-
-   <para>
-        The <acronym>SQL</acronym> standard defines a different binary
-        string type, called <type>BLOB</type> or <type>BINARY LARGE
-        OBJECT</type>.  The input format is different from
-        <type>bytea</type>, but the provided functions and operators are
-        mostly the same.
-   </para>
+  </sect2>
  </sect1>
 
 
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index eed799a..b8a3cef 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -122,8 +122,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-static unsigned
-hex_encode(const char *src, unsigned len, char *dst)
+size_t
+hex_encode(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -152,8 +152,8 @@ get_hex(char c)
 	return (char) res;
 }
 
-static unsigned
-hex_decode(const char *src, unsigned len, char *dst)
+size_t
+hex_decode(const char *src, size_t len, char *dst)
 {
 	const char *s,
 			   *srcend;
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 4cf3966..7b80bc9 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -62,6 +62,8 @@ static text *text_substring(Datum str,
 			   bool length_not_specified);
 static void appendStringInfoText(StringInfo str, const text *t);
 
+int bytea_output;
+
 
 /*****************************************************************************
  *	 CONVERSION ROUTINES EXPORTED FOR USE BY C CODE							 *
@@ -189,6 +191,18 @@ byteain(PG_FUNCTION_ARGS)
 	int			byte;
 	bytea	   *result;
 
+	if (inputText[0] == '\\' && inputText[1] == 'x')
+	{
+		size_t len = strlen(inputText);
+
+		byte = (len - 2)/2 + VARHDRSZ;
+		result = palloc(byte);
+		SET_VARSIZE(result, byte);
+		hex_decode(inputText + 2, len - 2, VARDATA(result));
+
+		PG_RETURN_BYTEA_P(result);
+	}
+
 	for (byte = 0, tp = inputText; *tp != '\0'; byte++)
 	{
 		if (tp[0] != '\\')
@@ -266,11 +280,20 @@ Datum
 byteaout(PG_FUNCTION_ARGS)
 {
 	bytea	   *vlena = PG_GETARG_BYTEA_PP(0);
-	char	   *result;
+	char	   *result = NULL;
 	char	   *vp;
-	char	   *rp;
-	int			val;			/* holds unprintable chars */
+	char	   *rp = NULL;
 	int			i;
+
+	if (bytea_output == BYTEA_OUTPUT_HEX)
+	{
+		rp = result = palloc(VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1);
+		*rp++ = '\\';
+		*rp++ = 'x';
+		rp += hex_encode(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena), rp);
+	}
+	else if (bytea_output == BYTEA_OUTPUT_ESCAPE)
+	{
 	int			len;
 
 	len = 1;					/* empty string has 1 char */
@@ -295,6 +318,8 @@ byteaout(PG_FUNCTION_ARGS)
 		}
 		else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
 		{
+				int			val;			/* holds unprintable chars */
+
 			val = *vp;
 			rp[0] = '\\';
 			rp[3] = DIG(val & 07);
@@ -307,6 +332,9 @@ byteaout(PG_FUNCTION_ARGS)
 		else
 			*rp++ = *vp;
 	}
+	}
+	else
+		elog(ERROR, "unrecognized bytea_output setting: %d", bytea_output);
 	*rp = '\0';
 	PG_RETURN_CSTRING(result);
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4f06725..19b1f46 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -180,6 +180,12 @@ static char *config_enum_get_options(struct config_enum *record,
  * NOTE! Option values may not contain double quotes!
  */
 
+static const struct config_enum_entry bytea_output_options[] = {
+	{"escape", BYTEA_OUTPUT_ESCAPE, false},
+	{"hex", BYTEA_OUTPUT_HEX, false},
+	{NULL, 0, false}
+};
+
 /*
  * We have different sets for client and server message level options because
  * they sort slightly different (see "log" level)
@@ -2533,6 +2539,15 @@ static struct config_enum ConfigureNamesEnum[] =
 	},
 
 	{
+		{"bytea_output", PGC_USERSET, CLIENT_CONN_OTHER,
+			gettext_noop("Sets the bytea output format to the hex format."),
+			NULL
+		},
+		&bytea_output,
+		BYTEA_OUTPUT_HEX, bytea_output_options, NULL, NULL
+	},
+
+	{
 		{"client_min_messages", PGC_USERSET, LOGGING_WHEN,
 			gettext_noop("Sets the message levels that are sent to the client."),
 			gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index c1b9393..0c93a7e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -704,6 +704,13 @@ extern Datum unknownout(PG_FUNCTION_ARGS);
 extern Datum unknownrecv(PG_FUNCTION_ARGS);
 extern Datum unknownsend(PG_FUNCTION_ARGS);
 
+typedef enum {
+	BYTEA_OUTPUT_ESCAPE,
+	BYTEA_OUTPUT_HEX
+} ByteaOutputType;
+
+extern int bytea_output;		/* ByteaOutputType, but int for guc enum */
+
 extern Datum byteain(PG_FUNCTION_ARGS);
 extern Datum byteaout(PG_FUNCTION_ARGS);
 extern Datum bytearecv(PG_FUNCTION_ARGS);
@@ -728,6 +735,9 @@ extern Datum bytea_substr(PG_FUNCTION_ARGS);
 extern Datum bytea_substr_no_len(PG_FUNCTION_ARGS);
 extern Datum pg_column_size(PG_FUNCTION_ARGS);
 
+extern size_t hex_encode(const char *src, size_t len, char *dst);
+extern size_t hex_decode(const char *src, size_t len, char *dst);
+
 /* version.c */
 extern Datum pgsql_version(PG_FUNCTION_ARGS);
 
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index c1934ad..66d1114 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -3166,6 +3166,29 @@ PQescapeBytea(const unsigned char *from, size_t from_length, size_t *to_length)
 }
 
 
+static const int8 hexlookup[128] = {
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, -1, -1, -1, -1, -1,
+	-1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+};
+
+static char
+get_hex(char c)
+{
+	int			res = -1;
+
+	if (c > 0 && c < 127)
+		res = hexlookup[(unsigned char) c];
+
+	return (char) res;
+}
+
+
 #define ISFIRSTOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '3')
 #define ISOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '7')
 #define OCTVAL(CH) ((CH) - '0')
@@ -3197,6 +3220,37 @@ PQunescapeBytea(const unsigned char *strtext, size_t *retbuflen)
 
 	strtextlen = strlen((const char *) strtext);
 
+	if (strtext[0] == '\\' && strtext[1] == 'x')
+	{
+		const unsigned char *s;
+		unsigned char	*p;
+
+		*retbuflen = (strtextlen - 2)/2;
+		/* Avoid unportable malloc(0) */
+		buffer = malloc(*retbuflen > 0 ? *retbuflen : 1);
+		if (buffer == NULL)
+			return NULL;
+
+		s = strtext + 2;
+		p = buffer;
+		while (*s)
+		{
+			char	v1,	v2;
+
+			v1 = get_hex(*s++) << 4;
+			if (!*s)
+			{
+				*retbuflen = -1;
+				return NULL;
+			}
+
+			v2 = get_hex(*s++);
+			*p++ = v1 | v2;
+		}
+
+		return buffer;
+	}
+	else {
 	/*
 	 * Length of input is max length of output, but add one to avoid
 	 * unportable malloc(0) if input is zero-length.
@@ -3258,3 +3312,4 @@ PQunescapeBytea(const unsigned char *strtext, size_t *retbuflen)
 	*retbuflen = buflen;
 	return tmpbuf;
 }
+}
diff --git a/src/test/regress/expected/conversion.out b/src/test/regress/expected/conversion.out
index e80e1a4..8f08706 100644
--- a/src/test/regress/expected/conversion.out
+++ b/src/test/regress/expected/conversion.out
@@ -1,3 +1,4 @@
+SET bytea_output TO escape;
 --
 -- create user defined conversion
 --
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index be8eb91..0315341 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -97,6 +97,87 @@ LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
                ^
 DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
 RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+   bytea    
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR:  invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+               ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR:  invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+               ^
+SELECT E'\\xDe00BeEf'::bytea;
+   bytea    
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+       bytea        
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+       bytea        
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+       bytea        
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+       bytea        
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+               ^
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+      bytea       
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+      bytea       
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+  bytea   
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+    bytea    
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+  bytea   
+----------
+ DeSdBeEf
+(1 row)
+
 --
 -- test conversions between various string types
 -- E021-10 implicit casting among the character data types
diff --git a/src/test/regress/input/largeobject.source b/src/test/regress/input/largeobject.source
index 46ba926..5bfba18 100644
--- a/src/test/regress/input/largeobject.source
+++ b/src/test/regress/input/largeobject.source
@@ -2,6 +2,8 @@
 -- Test large object support
 --
 
+SET bytea_output TO escape;
+
 -- Load a file
 CREATE TABLE lotest_stash_values (loid oid, fd integer);
 -- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject.source b/src/test/regress/output/largeobject.source
index 9d69f6c..5ff2e3b 100644
--- a/src/test/regress/output/largeobject.source
+++ b/src/test/regress/output/largeobject.source
@@ -1,6 +1,7 @@
 --
 -- Test large object support
 --
+SET bytea_output TO escape;
 -- Load a file
 CREATE TABLE lotest_stash_values (loid oid, fd integer);
 -- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject_1.source b/src/test/regress/output/largeobject_1.source
index 1fbc29c..0ece752 100644
--- a/src/test/regress/output/largeobject_1.source
+++ b/src/test/regress/output/largeobject_1.source
@@ -1,6 +1,7 @@
 --
 -- Test large object support
 --
+SET bytea_output_hex TO false;
 -- Load a file
 CREATE TABLE lotest_stash_values (loid oid, fd integer);
 -- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/sql/conversion.sql b/src/test/regress/sql/conversion.sql
index 99a9178..c4bb583 100644
--- a/src/test/regress/sql/conversion.sql
+++ b/src/test/regress/sql/conversion.sql
@@ -1,3 +1,5 @@
+SET bytea_output TO escape;
+
 --
 -- create user defined conversion
 --
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index a28c75a..3e9cfa3 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -43,6 +43,25 @@ SELECT U&'wrong: +0061' UESCAPE '+';
 
 RESET standard_conforming_strings;
 
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDeAdBeE'::bytea;
+SELECT E'\\xDeAdBeEx'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\123dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+SELECT E'De\\678dBeEf'::bytea;
+
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+
 --
 -- test conversions between various string types
 -- E021-10 implicit casting among the character data types

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] bytea vs. pg_dump

Reply via email to