Re: FDW pushdown of non-collated functions

2023-10-10 Thread Jean-Christophe Arnu
Hi Ashutosh,

Le ven. 6 oct. 2023 à 14:16, Ashutosh Bapat 
a écrit :

> Hi Jean-Christophe,
>
> On Fri, Sep 8, 2023 at 11:30 PM Jean-Christophe Arnu 
> wrote:
> >
> > Maybe we could add another condition to the first if statement in order
> to allow a “no-collation” function to be pushed down even if they have
> “collatable” parameters. I’m not sure about the possible regressions of
> behaviour of this change, but it
> seems to work fine with date_trunc() and date_part() (which suffers
> the same problem).
>
> That may not work since the output of the function may be dependent
> upon the collation on the inputs.
>
> There were similar discussions earlier. E.g.
>
> https://www.postgresql.org/message-id/flat/CACowWR1ARWyRepRxGfijMcsw%2BH84Dj8x2o9N3kvz%3Dz1p%2B6b45Q%40mail.gmail.com
> .
>
> Reading Tom's first reply there you may work around this by declaring
> the collation explicitly.
>

Thanks for your reply. I did not catch  these messages in the archive.
Thanks for spotting them.


> Briefly reading Tom's reply, the problem seems to be trusting whether
> the default collation locally and on the foreign server respectively
> is same or not. May be a simple fix is to declare a foreign server
> level option declaring that the default collation on the foreign
> server is same as the local server may be a way to move forward. But
> given that the problem remains unsolved for 7 years at least, may be
> such a simple fix is not enough.
>

I studied postgres_fdw source code a bit and the problem is not as easy to
solve : one could set an option telling the default remote collation is
aligned with local per "server" but nothing guaranties that the parameter
collation is known on the «remote» side.


>
> Another solution would be to attach another attribute to a function
> indicating whether the output of that function depends upon the input
> collations or not. Doing that just for FDW may not be acceptable
> though.
>

Yes, definitely. I thought

Anyway, you're right, after 7 years, this is a really difficult problem to
solve and there's no straightforward solution (to my eyes).
Thanks again for your kind explanations
Regards

-- 
Jean-Christophe Arnu


Re: FDW pushdown of non-collated functions

2023-10-05 Thread Jean-Christophe Arnu
Dear Hackers,

I figured out this email was sent at release time. The worst time to ask
for thoughts on a subject IMHO. Anyway, I hope this email will pop the
topic over the stack!
Thank you!

Le ven. 8 sept. 2023 à 16:41, Jean-Christophe Arnu  a
écrit :

> Dear hackers,
>
> I recently found a weird behaviour involving FDW (postgres_fdw) and
> planning.
>
> Here’s a simplified use-case:
>
> Given a remote table (say on server2) with the following definition:
>
> CREATE TABLE t1(
>   ts timestamp without time zone,
>   x  bigint,
>   x2 text
> );
> --Then populate t1 table:INSERT INTO t1
>SELECT
> current_timestamp - 1000*random()*'1 day'::interval
> ,x
> ,''||x
>FROM
> generate_series(1,10) as x;
>
>
> This table is imported in a specific schema on server1 (we do not use
> use_remote_estimate) also with t1 name in a specific schema:
>
> On server1:
>
> CREATE SERVER server2
>FOREIGN DATA WRAPPER  postgres_fdw
>OPTIONS (
>   host '127.0.0.1',
>   port '9002',
>   dbname 'postgres',
>   use_remote_estimate 'false'
>);
> CREATE USER MAPPING FOR jc
>SERVER server2
>OPTIONS (user 'jc');
> CREATE SCHEMA remote;
>
> IMPORT FOREIGN SCHEMA public
>FROM SERVER server2
>INTO remote ;
>
> On a classic PostgreSQL 15 version the following query using date_trunc()
> is executed and results in the following plan:
>
> jc=# explain (verbose,analyze) select date_trunc('day',ts), count(1) from 
> remote.t1 group by date_trunc('day',ts) order by 1;
> QUERY PLAN
>  
> ---
>  Sort  (cost=216.14..216.64 rows=200 width=16) (actual time=116.699..116.727 
> rows=1001 loops=1)
>Output: (date_trunc('day'::text, ts)), (count(1))
>Sort Key: (date_trunc('day'::text, t1.ts))
>Sort Method: quicksort  Memory: 79kB
>->  HashAggregate  (cost=206.00..208.50 rows=200 width=16) (actual 
> time=116.452..116.532 rows=1001 loops=1)
>  Output: (date_trunc('day'::text, ts)), count(1)
>  Group Key: date_trunc('day'::text, t1.ts)
>  Batches: 1  Memory Usage: 209kB
>  ->  Foreign Scan on remote.t1  (cost=100.00..193.20 rows=2560 
> width=8) (actual time=0.384..106.225 rows=10 loops=1)
>Output: date_trunc('day'::text, ts)
>Remote SQL: SELECT ts FROM public.t1
>  Planning Time: 0.077 ms
>  Execution Time: 117.028 ms
>
>
> Whereas the same query with date_bin()
>
> jc=# explain (verbose,analyze) select date_bin('1day',ts,'2023-01-01'), 
> count(1) from remote.t1 group by 1 order by 1;
>   
>   QUERY PLAN  
>   
> 
> --
>  Foreign Scan  (cost=113.44..164.17 rows=200 width=16) (actual 
> time=11.297..16.312 rows=1001 loops=1)
>Output: (date_bin('1 day'::interval, ts, '2023-01-01 00:00:00'::timestamp 
> without time zone)), (count(1))
>Relations: Aggregate on (remote.t1)
>Remote SQL: SELECT date_bin('1 day'::interval, ts, '2023-01-01 
> 00:00:00'::timestamp without time zone), count(1) FROM public.t1 GROUP BY 1 
> ORDER BY date_bin('1 day'::interval, ts, '2023-01-01 00:00:00'::timestamp 
> without time zone) ASC NULLS LAST
>  Planning Time: 0.114 ms
>  Execution Time: 16.599 ms
>
>
>
> With date_bin() the whole expression is pushed down to the remote server,
> whereas with date_trunc() it’s not.
>
> I dived into the code and live debugged. It turns out that decisions to
> pushdown or not a whole query depends on many factors like volatility and
> collation. In the date_trunc() case, the problem is all about collation (
> date_trunc() on timestamp without time zone). And decision is made in the
> foreign_expr_walker() in deparse.c (
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/postgres_fdw/deparse.c;h=efaf387890e3f85c419748ec3af5d1e9696c9c4c;hb=86648dcdaec67b83cec20a9d25b45ec089a7c624#l468
> )
>
> First the function is tested as shippabl

FDW pushdown of non-collated functions

2023-09-08 Thread Jean-Christophe Arnu
fe->inputcollid is evaluated to 0 (InvalidOid) thus skips the else
   statement and continues the control flow in the function.

For date_bin(), all arguments are “non-collatable” arguments (timestamp
without time zone and interval).

So the situation is that date_trunc() is a “non-collatable” function
failing to be pushed down whereas it may be a good idea to do so.

Maybe we could add another condition to the first if statement in order to
allow a “no-collation” function to be pushed down even if they have
“collatable” parameters. I’m not sure about the possible regressions of
behaviour of this change, but it seems to work fine with date_trunc() and
date_part() (which suffers the same problem).

Here’s the following change

/*
* If function's input collation is not derived from a foreign
* Var, it can't be sent to remote.
*/if (fe->inputcollid == InvalidOid ||
fe->funccollid == InvalidOid)
  /* OK, inputs are all noncollatable */ ;else if (inner_cxt.state !=
FDW_COLLATE_SAFE ||
 fe->inputcollid != inner_cxt.collation)
 return false;

I don’t presume this patch is free from side effects or fits all use-cases.

A patch (tiny) is attached to this email. This patch works against
master/head at the time of writing.
Thank you for any thoughts.

-- 
Jean-Christophe Arnu
diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 09d6dd60dd..16b938ebb2 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -568,7 +568,8 @@ foreign_expr_walker(Node *node,
  * If function's input collation is not derived from a foreign
  * Var, it can't be sent to remote.
  */
-if (fe->inputcollid == InvalidOid)
+if (fe->inputcollid == InvalidOid ||
+	 fe->funccollid == InvalidOid)
 	 /* OK, inputs are all noncollatable */ ;
 else if (inner_cxt.state != FDW_COLLATE_SAFE ||
 		 fe->inputcollid != inner_cxt.collation)


Re: Empty string in lexeme for tsvector

2021-09-30 Thread Jean-Christophe Arnu
Thank you Tom for your review.

Le mer. 29 sept. 2021 à 21:36, Tom Lane  a écrit :

> Jean-Christophe Arnu  writes:
> > [ empty_string_in_tsvector_v4.patch ]
>
> I looked through this patch a bit.  I don't agree with adding
> these new error conditions to tsvector_setweight_by_filter and
> tsvector_delete_arr.  Those don't prevent bad lexemes from being
> added to tsvectors, so AFAICS they can have no effect other than
> breaking existing applications.  In fact, tsvector_delete_arr is
> one thing you could use to fix existing bad tsvectors, so making
> it throw an error seems actually counterproductive.
>

Agreed.
The new patch included here does not change tsvector_setweight_by_filter()
anymore. Tests and docs are also upgraded.
This patch is not ready yet.


> (By the same token, I think there's a good argument for
> tsvector_delete_arr to just ignore nulls, not throw an error.
> That's a somewhat orthogonal issue, though.)
>

Nulls are now ignored in tsvector_delete_arr().


> What I'm wondering about more than that is whether array_to_tsvector
> is the only place that can inject an empty lexeme ... don't we have
> anything else that can add lexemes without going through the parser?
>

I crawled the docs [1] in order to check each tsvector output from
functions and
operators :

 * The only fonctions left that may lead to empty strings seem
both json_to_tsvector() and jsonb_to_tsvector(). Both functions use
parsetext
(in ts_parse.c) that seems to behave correctly and don't create "empty
string".
 * concat operator "||' allows to compute a ts_vector containing "empty
string" if
   one of its operands contains itself an empty string tsvector. This seems
perfectly
   correct from the operator point of view... Should we change this
behaviour to
   filter out empty strings ?

I also wonder if we should not also consider changing COPY FROM  behaviour
on empty string lexemes.
Current version is just crashing on empty string lexemes. Should
we allow them  anyway as COPY FROM input (it seems weird not to be able
to re-import dumped data) or "skipping them" just like array_to_tsvector()
does in the patched version (that may lead to database content changes) or
finally should we not change COPY behaviour ?

I admit this is a tricky bunch of questions I'm too rookie to answer.

[1] https://www.postgresql.org/docs/14/functions-textsearch.html

-- 
Jean-Christophe Arnu
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbe..74e25db03c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12896,7 +12896,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple


 Converts an array of lexemes to a tsvector.
-The given strings are used as-is without further processing.
+The given strings are used as-is. Some checks are performed
+on array elements. Empty strings and NULL values
+will cause an error to be raised.


 array_to_tsvector('{fat,cat,rat}'::text[])
@@ -13079,6 +13081,8 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 Assigns the specified weight to elements
 of the vector that are listed
 in lexemes.
+Some checks are performed on lexemes.
+NULL values will cause an error to be raised.


 setweight('fat:2,4 cat:3 rat:5,6B'::tsvector, 'A', '{cat,rat}')
@@ -13256,6 +13260,8 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 Removes any occurrences of the lexemes
 in lexemes
 from the vector.
+NULL values in lexemes
+will be ignored.


 ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..e50d8a84e2 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -602,10 +602,9 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 		int			lex_len,
 	lex_pos;
 
+		// Ignore null values
 		if (nulls[i])
-			ereport(ERROR,
-	(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
-	 errmsg("lexeme array may not contain nulls")));
+			continue;
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
@@ -761,13 +760,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, , , );
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 	(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 	 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(

Re: Empty string in lexeme for tsvector

2021-09-27 Thread Jean-Christophe Arnu
Le dim. 26 sept. 2021 à 22:41, Jean-Christophe Arnu  a
écrit :

>
>
> Le dim. 26 sept. 2021 à 15:55, Artur Zakirov  a écrit :
>
>> Nice catch! The patch looks good to me.
>> Can you also add a more general test case:
>>
>> =# SELECT $$'' '1' '2'$$::tsvector;
>> ERROR:  syntax error in tsvector: "'' '1' '2'"
>> LINE 1: SELECT $$'' '1' '2'$$::tsvector;
>>
>>
> Thank you, Artur for spotting this test.
> It is now included into this patch.
>
>
>
Two more things :

  * I updated the documentation for array_to_tsvector(), ts_delete() and
setweight() functions (so here's a new patch);
  * I should mention François Ferry from Logilab who first reported the
backup/restore problem that led to this patch.

I think this should be ok, now the doc is up to date.

Kind regards.
-- 
Jean-Christophe Arnu
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbe..01f0b870ca 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12896,7 +12896,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple


 Converts an array of lexemes to a tsvector.
-The given strings are used as-is without further processing.
+The given strings are used as-is. Some checks are performed
+on array elements. Empty strings and NULL values
+will cause an error to be raised.


 array_to_tsvector('{fat,cat,rat}'::text[])
@@ -13079,6 +13081,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 Assigns the specified weight to elements
 of the vector that are listed
 in lexemes.
+Some checks are performed on lexemes.
+Empty strings and NULL values
+will cause an error to be raised.


 setweight('fat:2,4 cat:3 rat:5,6B'::tsvector, 'A', '{cat,rat}')
@@ -13256,6 +13261,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 Removes any occurrences of the lexemes
 in lexemes
 from the vector.
+Some checks are performed on lexemes.
+Empty strings and NULL values
+will cause an error to be raised.


 ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..00f80ffcbc 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -329,6 +329,12 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsout, lex, lex_len);
 
 		if (lex_pos >= 0 && (j = POSDATALEN(tsout, entry + lex_pos)) != 0)
@@ -609,6 +615,12 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsin, lex, lex_len);
 
 		if (lex_pos >= 0)
@@ -761,13 +773,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, , , );
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 	(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 	 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..f8bf9c6051 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -85,6 +85,10 @@ SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
  'a':3A,4B 'b':2A 'ba':1237
 (1 row)
 
+SELECT $$'' '1' '2'$$::tsvector;
+ERROR:  syntax error in tsvector: "'' '1' '2'"
+LINE 1: SELECT $$'' '1' '2'$$::tsvector;
+   ^
 --Base tsquery test
 SELECT '1'::tsquery;
  tsquery 
@@ -1260,6 +1264,8 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceshi

Re: Empty string in lexeme for tsvector

2021-09-26 Thread Jean-Christophe Arnu
Le dim. 26 sept. 2021 à 15:55, Artur Zakirov  a écrit :

> Nice catch! The patch looks good to me.
> Can you also add a more general test case:
>
> =# SELECT $$'' '1' '2'$$::tsvector;
> ERROR:  syntax error in tsvector: "'' '1' '2'"
> LINE 1: SELECT $$'' '1' '2'$$::tsvector;
>
>
Thank you, Artur for spotting this test.
It is now included into this patch.

-- 
Jean-Christophe Arnu
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..00f80ffcbc 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -329,6 +329,12 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsout, lex, lex_len);
 
 		if (lex_pos >= 0 && (j = POSDATALEN(tsout, entry + lex_pos)) != 0)
@@ -609,6 +615,12 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsin, lex, lex_len);
 
 		if (lex_pos >= 0)
@@ -761,13 +773,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, , , );
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 	(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 	 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..f8bf9c6051 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -85,6 +85,10 @@ SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
  'a':3A,4B 'b':2A 'ba':1237
 (1 row)
 
+SELECT $$'' '1' '2'$$::tsvector;
+ERROR:  syntax error in tsvector: "'' '1' '2'"
+LINE 1: SELECT $$'' '1' '2'$$::tsvector;
+   ^
 --Base tsquery test
 SELECT '1'::tsquery;
  tsquery 
@@ -1260,6 +1264,8 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
unnest
 -
@@ -1330,6 +1336,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1375,6 +1383,8 @@ SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
   ts_filter  
 -
diff --git a/src/test/regress/sql/tstypes.sql b/src/test/regress/sql/tstypes.sql
index 30c8c702f0..79a407 100644
--- a/src/test/regress/sql/tstypes.sql
+++ b/src/test/regress/sql/tstypes.sql
@@ -17,6 +17,7 @@ SELECT $$'\\as' ab\c ab\\c AB\\\c abc$$::tsvector;
 SELECT tsvectorin(tsvectorout($$'\\as' ab\c ab\\c AB\\\c abc$$::tsvector));
 SELECT '''w'':4A,3B,2C,1D,5 a:8';
 SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
+SELECT $$'' '1' '2'$$::tsvector;
 
 --Base tsquery test
 SELECT '1'::tsquery;
@@ -240,6 +241,7 @@ SELECT ts_delete('base:7 hidden:6

Re: Empty string in lexeme for tsvector

2021-09-24 Thread Jean-Christophe Arnu
Le ven. 24 sept. 2021 à 13:03, Ranier Vilela  a écrit :

>
> Comments are more than welcome!
>>
> 1. Would be better to add this test-and-error before tsvector_bsearch call.
>
> + if (lex_len == 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
> + errmsg("lexeme array may not contain empty strings")));
> +
>
> If lex_len is equal to zero, better get out soon.
>
> 2. The second test-and-error can use lex_len, just like the first test,
> I don't see the point in recalculating the size of lex_len if that's
> already done.
>
> + if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
> + errmsg("lexeme array may not contain empty strings")));
> +
>

Hello Ranier,
Thank you for your comments.
Here's a new patch file taking your comments into account.

I was just wondering if empty string eviction is done in the right place.
As you rightfully commented, lex_len is calculated later (once again for a
right purpose) and my code checks for empty strings as soon as possible.
To me, it seems to be the right thing to do (prevent further processing on
lexemes
as soon as possible) but I might omit something.

Regards


Jean-Christophe Arnu
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..00f80ffcbc 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -329,6 +329,12 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsout, lex, lex_len);
 
 		if (lex_pos >= 0 && (j = POSDATALEN(tsout, entry + lex_pos)) != 0)
@@ -609,6 +615,12 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsin, lex, lex_len);
 
 		if (lex_pos >= 0)
@@ -761,13 +773,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, , , );
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 	(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 	 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..a258179ef0 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -1260,6 +1260,8 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
unnest
 -
@@ -1330,6 +1332,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1375,6 +1379,8 @@ SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
   ts_filter  
 ---

Empty string in lexeme for tsvector

2021-09-24 Thread Jean-Christophe Arnu
Hello Hackers,

This is my second proposal for a patch, so I hope not to make "rookie"
mistakes.

This proposal patch is based on a simple use case :

If one creates a table this way
CREATE TABLE tst_table AS (SELECT array_to_tsvector(ARRAY['','abc','def']));

the table content is :
 array_to_tsvector
---
 '' 'abc' 'def'
(1 row)

First it can be strange to have an empty string for tsvector lexeme but
anyway, keep going to the real point.

Once dumped, this table dump contains that empty string that can't be
restored.
tsvector_parse (./utils/adt/tsvector_parser.c) raises an error.

Thus it is not possible for data to be restored this way.

There are two ways to consider this : is it alright to have empty strings
in lexemes ?
   * If so, empty strings should be correctly parsed by tsvector_parser.
   * If not, one should prevent empty strings from being stored into
tsvectors.

Since "empty strings" seems not to be a valid lexeme, I undertook to change
some functions dealing with tsvector to check whether string arguments are
empty. This might be the wrong path as I'm not familiar with tsvector
usage... (OTOH, I can provide a fix patch for tsvector_parser() if I'm
wrong).

This involved changing the way functions like array_to_tsvector(),
ts_delete() and setweight() behave. As for NULL values, empty string values
are checked and an error is raised for such a value. It appears to me that
ERRCODE_ZERO_LENGTH_CHARACTER_STRING (2200F) matched this behaviour but I
may be wrong.

Since this patch changes the way functions behave, consider it as a simple
try to overcome a strange situation we've noticed on a specific use case.

This included patch manages that checks for empty strings on the pointed
out functions. It comes with modified regression tests. Patch applies along
head/master and 14_RC1.

Comments are more than welcome!
Thank you,

-- 
Jean-Christophe Arnu
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..5a33e1bb10 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -331,6 +331,11 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
 		lex_pos = tsvector_bsearch(tsout, lex, lex_len);
 
+		if (lex_len == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
+
 		if (lex_pos >= 0 && (j = POSDATALEN(tsout, entry + lex_pos)) != 0)
 		{
 			WordEntryPos *p = POSDATAPTR(tsout, entry + lex_pos);
@@ -607,6 +612,11 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 	(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 	 errmsg("lexeme array may not contain nulls")));
 
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
+
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
 		lex_pos = tsvector_bsearch(tsin, lex, lex_len);
@@ -761,13 +771,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, , , );
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 	(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 	 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+	 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..a258179ef0 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -1260,6 +1260,8 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
unnest
 -
@@ -1330,6 +1332,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contai

Re: PITR on DROP DATABASE, deleting of the database directory despite the recovery_target_time set before.

2019-12-13 Thread Jean-Christophe Arnu
Hello,

Le mar. 19 nov. 2019 à 16:15, Nicolas Lutic  a écrit :

>
> We are aware that this part is tricky and will have little effects on
> normal operations, as best practices are to use xid_target or lsn_target.
>
> I'm working with Nicolas and we made some further testing. If we use xid
target with inclusive to  false at the next xid after the insert, we end up
with the same DELETE/DROP directory behaviour which is quite confusing. One
have to choose the xid-1 value with inclusive behaviour to lake it work.

I assume this is the right first thing to document the behaviour. And give
some examples on this.

Maybe we could add some documentation in the xlog explanation and a warning
in the recovery_target_time and xid in guc doc ?

If there are better places in the docs let us know.

Thanks


-- 
Jean-Christophe Arnu


Re: wal_dump output on CREATE DATABASE

2018-11-20 Thread Jean-Christophe Arnu
Le mar. 20 nov. 2018 à 13:34, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> a écrit :

> On 16/11/2018 17:28, Jean-Christophe Arnu wrote:
> > On the other hand, there's a consensus not to go further than the
> > initial patch.
>
> I have committed your original patch.
>
> Great, thank you Peter. And thank you all for that interesting discussion.
Regards

-- 
Jean-Christophe Arnu


Re: wal_dump output on CREATE DATABASE

2018-11-16 Thread Jean-Christophe Arnu
Le ven. 16 nov. 2018 à 14:32, Tomas Vondra  a
écrit :

> >
> > appendStringInfo(buf, "xid %u (db/rel) %u/%u ",
> >   xlrec->locks[i].xid, xlrec->locks[i].dbOid,
> >   xlrec->locks[i].relOid);
> >
> >
> > As I said, I don't know whether it's relevant to perform these changes
> > or not.
>
> Maybe, I'm not against doing that. But if we do that, I don't think we
> need to add the "(db/rel)" bit - we don't do that elsewhere, so why
> here?


Yes, you spotted a bad copy/paste from me ; "(db/rel)" should bit should
not be there. Sorry.
Why that error so ? I admit I tried some changes on a local git branch of
mine with another format helping the user to understand what ids really
were.  (each line containing ids had "(ts/db/rel)" or "(db/rel)" inside.
On  COMMIT messages, there also was only one "(db/rels)" bit.
This was just an answer to Peter proposal on a different format. I wanted
to keep A/B/C-like format and help DBA to understand what it was using a
prefix string like (ts/db/rel).
But as discussions were going on, I figured out it might be a bad idea so I
did not submit.


> And if we adopt the same format, should that also include the
> tablespace? Although, maybe for locks that doesn't make much sense.
>

I agree that, in a way, it may be a good idea to use that A/B/C format
every where each time we use ids. It would make sense for the dba to know
what kind of ids are involved if we have partial ones (A/B or B/C).
I don't know the code enough to say whether it would be difficult to
retrieve each time the tablespace or not.
There's also lines with base/db/rel ... So there's some different format.
Anyway, I handle all of them  (I hope so) in my processing script. So
keeping the things just as they are (out of the initial COPY patch that
should be applied *to me*) is no problem for me.

On the other hand, there's a consensus not to go further than the initial
patch.


-- 
Jean-Christophe Arnu


Re: wal_dump output on CREATE DATABASE

2018-11-16 Thread Jean-Christophe Arnu
Le jeu. 15 nov. 2018 à 19:44, Robert Haas  a écrit :

> On Tue, Nov 13, 2018 at 3:40 PM Tomas Vondra
>  wrote:
> > People reading pg_waldump output quickly learn to read the A/B/C format
> > and what those fields mean. Breaking that into ts=A db=B relfilenode=C
> > does not make that particularly clearer or easier to read. I'd say it'd
> > also makes it harder to parse, and it increases the size of the output
> > (both in terms of line length and data size).
>
> I agree.
>

First, thank you all for your reviews.

I also agree that the A/B/C format is right (and it may be a good thing to
document it, maybe by adding some changes in the
doc/src/sgml/ref/pg_waldump.sgml file to this patch).

To reply to Andres, I agree we should not change things for a target format
that would not fit clearly defined syntax. In that way, I agree with Tomas
on the fact that people reading
pg_waldump output are quickly familiar with the A/B/C notation.

My first use case was to decode the ids with a processing script to
identify each id in A/B/C or pg_waldump output with a "human readable"
item. For this, my processing script connects the cluster and tries resolve
the ids with simple queries (and building a local cache for this). Then it
replaces each looked up id item with its corresponding text.
In some cases, this could be useful for DBA to find more easily when a
specific relation was modified (searching for DELETE BTW). But that's only
my use case and my little script.

Going back to the code :

As I can figure by crawling the source tree (and discovering it) there are
messages with :
  * A/B/C notation which seems to be the one we should adopt ( meaning
ts/db/refilenode )
some are only
  * A/B for the COPY message we discussed later

On the other hand, and I don't know if it's relevant, I've pointed some
examples such as XLOG_RELMAP_UPDATE in relmapdesc.c which could benefit of
that "notation" :

appendStringInfo(buf, "database %u tablespace %u size %u",
 xlrec->dbid, xlrec->tsid, xlrec->nbytes);

could be written like this :

appendStringInfo(buf, "%u/%u size %u",
 xlrec->tsid, xlrec->dbid, xlrec->nbytes);

In that case ts and db should also be switched. In that case the message
would only by B/C which is confusing, but we have other place where "base/"
is put in prefix.

The same transform may be also applied to standbydesc.c in standby_desc()
function.

appendStringInfo(buf, "xid %u db %u rel %u ",
 xlrec->locks[i].xid, xlrec->locks[i].dbOid,
 xlrec->locks[i].relOid);

may be  changed to

appendStringInfo(buf, "xid %u (db/rel) %u/%u ",
 xlrec->locks[i].xid, xlrec->locks[i].dbOid,
 xlrec->locks[i].relOid);


As I said, I don't know whether it's relevant to perform these changes or
not.
If the A/B/C notation is to be generalized, it would be worth document it
in the SGML file.
If not, the first patch provided should be enough.

Regards

-- 
Jean-Christophe Arnu


Re: wal_dump output on CREATE DATABASE

2018-11-13 Thread Jean-Christophe Arnu
Le lun. 5 nov. 2018 à 15:37, Jean-Christophe Arnu  a
écrit :

>
>
> Le dim. 4 nov. 2018 à 18:01, Jean-Christophe Arnu  a
> écrit :
>
>> Le ven. 2 nov. 2018 à 08:37, Peter Eisentraut <
>> peter.eisentr...@2ndquadrant.com> a écrit :
>>
>>> On 26/10/2018 15:53, Jean-Christophe Arnu wrote:
>>> > Exemple on CREATE DATABASE (without defining a template database) :
>>> > rmgr: Databaselen (rec/tot): 42/42, tx:568, lsn:
>>> > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1/1663 to 16384/1663
>>> >
>>> > It comes out (to me) it may be more consistent to use the same template
>>> > than the other operations in pg_waldump.
>>> > I propose to swap the parameters positions for the copy dir operation
>>> > output.
>>> >
>>> > You'll find a patch file included which does the switching job :
>>> > rmgr: Databaselen (rec/tot): 42/42, tx:568, lsn:
>>> > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1663/1 to 1663/16384
>>>
>>> I see your point.  I suspect this was mainly implemented that way
>>> because that's how the fields occur in the underlying
>>> xl_dbase_create_rec structure.  (But that structure also has the target
>>> location first, so it's not entirely consistent that way either.)  We
>>> could switch the fields around in the output, but I wonder whether we
>>> couldn't make the whole thing a bit more readable like this:
>>>
>>> desc: CREATE copy dir ts=1663 db=1 to ts=1663 db=16384
>>>
>>> or maybe like this
>>>
>>> desc: CREATE copy dir (ts/db) 1663/1 to 1663/16384
>>>
>>
>>
>> Thank you Peter for your review and proposal .
>> I like the last one which gives the fields semantics in a concise way.
>> Pushing the idea a bit farther we could think of applying that
>> description to any other operation that involves the ts/db/oid fields. What
>> do you think ?
>>
>> Example in nbtdesc.c we could add the "(ts/db/oid)" information to help
>> the DBA to identify the objects:
>> case XLOG_BTREE_REUSE_PAGE:
>> {
>> xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
>>
>> appendStringInfo(buf, "rel (ts/db/rel) %u/%u/%u;
>> latestRemovedXid %u",
>>  xlrec->node.spcNode, xlrec->node.dbNode,
>>  xlrec->node.relNode,
>> xlrec->latestRemovedXid);
>> break;
>> }
>>
>> This may help DBAs to better identify the objects related to the
>> messages.
>>
>> Any thought/comments/suggestions ?
>>
>> ~~~ Side story
>> Well to be clear, my first proposal is born while i was writting a side
>> script to textually identify which objects were involved into pg_waldump
>> operations (if that objects already exists at script run time). I'm
>> wondering whether it could be interesting to incllde such a "textual
>> decoding" into pg_waldump or not (as a command line switch). Anyway, my
>> script would be available as proof of concept first. We have time to
>> discuss that last point in another thread.
>> ~~~
>>
>> Thank you
>>
>>>
> I've dug a little more in that way and spotted different locations in the
> code where such semantics might be useful too.
> Sometimes just like in XLOG_HEAP_TRUNCATE (heapdesc.c), there's a bunch of
> rels and descendants that are held by a single db. Adding the database id
> may be useful in that case. Decoding tablespace for each object may be
> interesting  (but not mandatory IMHO) for each rel. That information does
> not seem to be included in the structure, but as newcomer I assume there's
> a convenient way to retrieve that information easily.
>
> Another operation that might be eligible to end user message improvements
> is the one handled by xact_desc_commit() function (xactdesc.c) . Each time
> a COMMIT (XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED)  occurs the
> folllowing snippets is output :
>
>  desc: COMMIT 2018-11-05 15:11:03.087546 CET; rels: base/16384/16385
> base/16384/16388 base/16384/16390;
>
> in that case, file path is output using relpathperm macro which ends up
> callin gthe GetRelationPath() function.  In that last function, we can have
> the dbNode, spcNode (tablespace) and relNode Oid (include/common/relpath.h )
>
> The question is now to know if it would be interesting to have a
> consistent way to represent all objects and their hi

Re: wal_dump output on CREATE DATABASE

2018-11-05 Thread Jean-Christophe Arnu
Le dim. 4 nov. 2018 à 18:01, Jean-Christophe Arnu  a
écrit :

> Le ven. 2 nov. 2018 à 08:37, Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> a écrit :
>
>> On 26/10/2018 15:53, Jean-Christophe Arnu wrote:
>> > Exemple on CREATE DATABASE (without defining a template database) :
>> > rmgr: Databaselen (rec/tot): 42/42, tx:568, lsn:
>> > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1/1663 to 16384/1663
>> >
>> > It comes out (to me) it may be more consistent to use the same template
>> > than the other operations in pg_waldump.
>> > I propose to swap the parameters positions for the copy dir operation
>> > output.
>> >
>> > You'll find a patch file included which does the switching job :
>> > rmgr: Databaselen (rec/tot): 42/42, tx:568, lsn:
>> > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1663/1 to 1663/16384
>>
>> I see your point.  I suspect this was mainly implemented that way
>> because that's how the fields occur in the underlying
>> xl_dbase_create_rec structure.  (But that structure also has the target
>> location first, so it's not entirely consistent that way either.)  We
>> could switch the fields around in the output, but I wonder whether we
>> couldn't make the whole thing a bit more readable like this:
>>
>> desc: CREATE copy dir ts=1663 db=1 to ts=1663 db=16384
>>
>> or maybe like this
>>
>> desc: CREATE copy dir (ts/db) 1663/1 to 1663/16384
>>
>
>
> Thank you Peter for your review and proposal .
> I like the last one which gives the fields semantics in a concise way.
> Pushing the idea a bit farther we could think of applying that description
> to any other operation that involves the ts/db/oid fields. What do you
> think ?
>
> Example in nbtdesc.c we could add the "(ts/db/oid)" information to help
> the DBA to identify the objects:
> case XLOG_BTREE_REUSE_PAGE:
> {
> xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
>
> appendStringInfo(buf, "rel (ts/db/rel) %u/%u/%u;
> latestRemovedXid %u",
>  xlrec->node.spcNode, xlrec->node.dbNode,
>  xlrec->node.relNode,
> xlrec->latestRemovedXid);
> break;
> }
>
> This may help DBAs to better identify the objects related to the messages.
>
> Any thought/comments/suggestions ?
>
> ~~~ Side story
> Well to be clear, my first proposal is born while i was writting a side
> script to textually identify which objects were involved into pg_waldump
> operations (if that objects already exists at script run time). I'm
> wondering whether it could be interesting to incllde such a "textual
> decoding" into pg_waldump or not (as a command line switch). Anyway, my
> script would be available as proof of concept first. We have time to
> discuss that last point in another thread.
> ~~~
>
> Thank you
>
>>
I've dug a little more in that way and spotted different locations in the
code where such semantics might be useful too.
Sometimes just like in XLOG_HEAP_TRUNCATE (heapdesc.c), there's a bunch of
rels and descendants that are held by a single db. Adding the database id
may be useful in that case. Decoding tablespace for each object may be
interesting  (but not mandatory IMHO) for each rel. That information does
not seem to be included in the structure, but as newcomer I assume there's
a convenient way to retrieve that information easily.

Another operation that might be eligible to end user message improvements
is the one handled by xact_desc_commit() function (xactdesc.c) . Each time
a COMMIT (XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED)  occurs the
folllowing snippets is output :

 desc: COMMIT 2018-11-05 15:11:03.087546 CET; rels: base/16384/16385
base/16384/16388 base/16384/16390;

in that case, file path is output using relpathperm macro which ends up
callin gthe GetRelationPath() function.  In that last function, we can have
the dbNode, spcNode (tablespace) and relNode Oid (include/common/relpath.h )

The question is now to know if it would be interesting to have a consistent
way to represent all objects and their hierarchy :
 BTW, we could have db/ts/rel or ts/db/rel ?
What about the "base/ts/rel" from XLOG_XACT_COMMIT/COMMIT_PREPARED ? Why is
it different from the other representation (it must serve a purpose I
assume) ?
How do we represent multiples rels such as XLOG_HEAP_TRUNCATE ? My proposal
(db/rels) dboid/ reloid1 reloid2 reloid3 ... reloidN  (as TRUNCATE only
deals with one DB, but no tablespace is defined, this may be another point
?)

Well, if you could give me some directions, I would appreciate !

As ever, any thought, comments are more than welcomed.

-- 
Jean-Christophe Arnu


Re: wal_dump output on CREATE DATABASE

2018-11-04 Thread Jean-Christophe Arnu
Le ven. 2 nov. 2018 à 08:37, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> a écrit :

> On 26/10/2018 15:53, Jean-Christophe Arnu wrote:
> > Exemple on CREATE DATABASE (without defining a template database) :
> > rmgr: Databaselen (rec/tot): 42/42, tx:568, lsn:
> > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1/1663 to 16384/1663
> >
> > It comes out (to me) it may be more consistent to use the same template
> > than the other operations in pg_waldump.
> > I propose to swap the parameters positions for the copy dir operation
> > output.
> >
> > You'll find a patch file included which does the switching job :
> > rmgr: Databaselen (rec/tot): 42/42, tx:568, lsn:
> > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1663/1 to 1663/16384
>
> I see your point.  I suspect this was mainly implemented that way
> because that's how the fields occur in the underlying
> xl_dbase_create_rec structure.  (But that structure also has the target
> location first, so it's not entirely consistent that way either.)  We
> could switch the fields around in the output, but I wonder whether we
> couldn't make the whole thing a bit more readable like this:
>
> desc: CREATE copy dir ts=1663 db=1 to ts=1663 db=16384
>
> or maybe like this
>
> desc: CREATE copy dir (ts/db) 1663/1 to 1663/16384
>


Thank you Peter for your review and proposal .
I like the last one which gives the fields semantics in a concise way.
Pushing the idea a bit farther we could think of applying that description
to any other operation that involves the ts/db/oid fields. What do you
think ?

Example in nbtdesc.c we could add the "(ts/db/oid)" information to help the
DBA to identify the objects:
case XLOG_BTREE_REUSE_PAGE:
{
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;

appendStringInfo(buf, "rel (ts/db/rel) %u/%u/%u;
latestRemovedXid %u",
 xlrec->node.spcNode, xlrec->node.dbNode,
 xlrec->node.relNode,
xlrec->latestRemovedXid);
break;
}

This may help DBAs to better identify the objects related to the messages.

Any thought/comments/suggestions ?

~~~ Side story
Well to be clear, my first proposal is born while i was writting a side
script to textually identify which objects were involved into pg_waldump
operations (if that objects already exists at script run time). I'm
wondering whether it could be interesting to incllde such a "textual
decoding" into pg_waldump or not (as a command line switch). Anyway, my
script would be available as proof of concept first. We have time to
discuss that last point in another thread.
~~~

Thank you

>


wal_dump output on CREATE DATABASE

2018-10-26 Thread Jean-Christophe Arnu
Dear hackers,

This is my first try to post on that list to propose a workaround on an
issue I noticed while using pg_waldump. Each  time an object is referenced
by "oids" following  output template :
tablespace oid/database oid/relfilenodeid

That template seems to be used for each operation but the *copy dir*
operation. The latter is performed (at least) when CREATE DATABASE
statement is called with the template: oid/tablespace oid

Exemple on CREATE DATABASE (without defining a template database) :
rmgr: Databaselen (rec/tot): 42/42, tx:568, lsn:
0/01865790, prev 0/01865720, desc: CREATE copy dir 1/1663 to 16384/1663


It comes out (to me) it may be more consistent to use the same template
than the other operations in pg_waldump.
I propose to swap the parameters positions for the copy dir operation
output.

You'll find a patch file included which does the switching job :
rmgr: Databaselen (rec/tot): 42/42, tx:568, lsn:
0/01865790, prev 0/01865720, desc: CREATE copy dir 1663/1 to 1663/16384



   - Project name ; PostgreSQL
   - File : copy_dir_message_switch_parameters_v0.patch
   - Description : Swaps copy dir output parameters
   - This patch is not WIP (but may be discussed)
   - Patch applied against master branch
   - Compiles and tests successfully
   - No platform specific code
   - No need of regression test
   - Not a new feature
   - No performance impact

Any comment or advice are more than welcome.

If I didn't do the things the right way, I would appreciate some help from
a kind mentor.

I'll put the patch for the next commitfest.

Thank you,
-- 
Jean-Christophe Arnu
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 39e26d7ed4..7ee120f1d9 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -29,15 +29,15 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-		 xlrec->src_db_id, xlrec->src_tablespace_id,
-		 xlrec->db_id, xlrec->tablespace_id);
+		 xlrec->src_tablespace_id, xlrec->src_db_id,
+		 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
 
 		appendStringInfo(buf, "dir %u/%u",
-		 xlrec->db_id, xlrec->tablespace_id);
+		 xlrec->tablespace_id, xlrec->db_id);
 	}
 }