Thank you Tom for your review.

Le mer. 29 sept. 2021 à 21:36, Tom Lane <t...@sss.pgh.pa.us> a écrit :

> Jean-Christophe Arnu <jca...@gmail.com> writes:
> > [ empty_string_in_tsvector_v4.patch ]
>
> I looked through this patch a bit.  I don't agree with adding
> these new error conditions to tsvector_setweight_by_filter and
> tsvector_delete_arr.  Those don't prevent bad lexemes from being
> added to tsvectors, so AFAICS they can have no effect other than
> breaking existing applications.  In fact, tsvector_delete_arr is
> one thing you could use to fix existing bad tsvectors, so making
> it throw an error seems actually counterproductive.
>

Agreed.
The new patch included here does not change tsvector_setweight_by_filter()
anymore. Tests and docs are also upgraded.
This patch is not ready yet.


> (By the same token, I think there's a good argument for
> tsvector_delete_arr to just ignore nulls, not throw an error.
> That's a somewhat orthogonal issue, though.)
>

Nulls are now ignored in tsvector_delete_arr().


> What I'm wondering about more than that is whether array_to_tsvector
> is the only place that can inject an empty lexeme ... don't we have
> anything else that can add lexemes without going through the parser?
>

I crawled the docs [1] in order to check each tsvector output from
functions and
operators :

 * The only fonctions left that may lead to empty strings seem
    both json_to_tsvector() and jsonb_to_tsvector(). Both functions use
parsetext
    (in ts_parse.c) that seems to behave correctly and don't create "empty
string".
 * concat operator "||' allows to compute a ts_vector containing "empty
string" if
   one of its operands contains itself an empty string tsvector. This seems
perfectly
   correct from the operator point of view... Should we change this
behaviour to
   filter out empty strings ?

I also wonder if we should not also consider changing COPY FROM  behaviour
on empty string lexemes.
Current version is just crashing on empty string lexemes. Should
we allow them  anyway as COPY FROM input (it seems weird not to be able
to re-import dumped data) or "skipping them" just like array_to_tsvector()
does in the patched version (that may lead to database content changes) or
finally should we not change COPY behaviour ?

I admit this is a tricky bunch of questions I'm too rookie to answer.

[1] https://www.postgresql.org/docs/14/functions-textsearch.html

-- 
Jean-Christophe Arnu
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbe..74e25db03c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12896,7 +12896,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
        </para>
        <para>
         Converts an array of lexemes to a <type>tsvector</type>.
-        The given strings are used as-is without further processing.
+        The given strings are used as-is. Some checks are performed
+        on array elements. Empty strings and <literal>NULL</literal> values
+        will cause an error to be raised.
        </para>
        <para>
         <literal>array_to_tsvector('{fat,cat,rat}'::text[])</literal>
@@ -13079,6 +13081,8 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         Assigns the specified <parameter>weight</parameter> to elements
         of the <parameter>vector</parameter> that are listed
         in <parameter>lexemes</parameter>.
+        Some checks are performed on <parameter>lexemes</parameter>.
+        <literal>NULL</literal> values will cause an error to be raised.
        </para>
        <para>
         <literal>setweight('fat:2,4 cat:3 rat:5,6B'::tsvector, 'A', '{cat,rat}')</literal>
@@ -13256,6 +13260,8 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         Removes any occurrences of the lexemes
         in <parameter>lexemes</parameter>
         from the <parameter>vector</parameter>.
+        <literal>NULL</literal> values in <parameter>lexemes</parameter>
+        will be ignored.
        </para>
        <para>
         <literal>ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])</literal>
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..e50d8a84e2 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -602,10 +602,9 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 		int			lex_len,
 					lex_pos;
 
+		// Ignore null values
 		if (nulls[i])
-			ereport(ERROR,
-					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
-					 errmsg("lexeme array may not contain nulls")));
+			continue;
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
@@ -761,13 +760,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, &dlexemes, &nulls, &nitems);
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 					 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..016e901995 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -85,6 +85,10 @@ SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
  'a':3A,4B 'b':2A 'ba':1237
 (1 row)
 
+SELECT $$'' '1' '2'$$::tsvector;
+ERROR:  syntax error in tsvector: "'' '1' '2'"
+LINE 1: SELECT $$'' '1' '2'$$::tsvector;
+               ^
 --Base tsquery test
 SELECT '1'::tsquery;
  tsquery 
@@ -1259,7 +1263,17 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 (1 row)
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
-ERROR:  lexeme array may not contain nulls
+        ts_delete         
+--------------------------
+ 'base' 'hidden' 'strike'
+(1 row)
+
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+        ts_delete         
+--------------------------
+ 'base' 'hidden' 'strike'
+(1 row)
+
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
                    unnest                    
 ---------------------------------------------
@@ -1330,6 +1344,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1375,6 +1391,12 @@ SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
+            setweight            
+---------------------------------
+ 'a' 'asd' 'w':5,6,12B,13A 'zxc'
+(1 row)
+
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
                           ts_filter                          
 -------------------------------------------------------------
diff --git a/src/test/regress/sql/tstypes.sql b/src/test/regress/sql/tstypes.sql
index 30c8c702f0..79a7777407 100644
--- a/src/test/regress/sql/tstypes.sql
+++ b/src/test/regress/sql/tstypes.sql
@@ -17,6 +17,7 @@ SELECT $$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector;
 SELECT tsvectorin(tsvectorout($$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector));
 SELECT '''w'':4A,3B,2C,1D,5 a:8';
 SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
+SELECT $$'' '1' '2'$$::tsvector;
 
 --Base tsquery test
 SELECT '1'::tsquery;
@@ -240,6 +241,7 @@ SELECT ts_delete('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3':
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
 
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
 SELECT unnest('base hidden rebel spaceship strike'::tsvector);
@@ -252,6 +254,7 @@ SELECT tsvector_to_array('base hidden rebel spaceship strike'::tsvector);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
 
@@ -262,6 +265,7 @@ SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
 
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
 SELECT ts_filter('base hidden rebel spaceship strike'::tsvector, '{a}');

Reply via email to