Re: [HACKERS] Block level parallel vacuum

2019-12-01 Thread Sergei Kornilov
Hi

> I think the advantage of the current approach is that once the parallel 
> workers are launched, the leader can process indexes that don't support 
> parallelism.  So, both types of indexes can be processed at the same time.

In lazy_parallel_vacuum_or_cleanup_indexes I see:

/*
 * Join as a parallel worker. The leader process alone does that in
 * case where no workers launched.
 */
if (lps->leaderparticipates || lps->pcxt->nworkers_launched == 0)
	vacuum_or_cleanup_indexes_worker(Irel, nindexes, stats, lps->lvshared,
									 vacrelstats->dead_tuples);

/*
 * Here, the indexes that had been skipped during parallel index vacuuming
 * are remaining. If there are such indexes the leader process does vacuum
 * or cleanup them one by one.
 */
vacuum_or_cleanup_skipped_indexes(vacrelstats, Irel, nindexes, stats, lps);

So the parallel leader will first process parallel indexes along with the
parallel workers and skip the non-parallel ones. Only after reaching the end
of the index list will the parallel leader process the non-parallel indexes
one by one. Assuming equal index processing times, the parallel leader will
process (count of parallel indexes)/(nworkers+1) + all non-parallel indexes,
while the parallel workers will each process only (count of parallel
indexes)/(nworkers+1).  Am I wrong here?

regards, Sergei




Re: pg_upgrade fails with non-standard ACL

2019-12-01 Thread Grigory Smolkin

Hello!

On 11/29/19 11:07 AM, Artur Zakirov wrote:

If Anastasia doesn't mind I'd like to send a new version of the patch.

On 2019/11/28 12:29, Artur Zakirov wrote:

On 2019/11/27 13:22, Michael Paquier wrote:

Yeah, the actual take is if we want to make the frontend code more
complicated with a large set of SQL queries to check that each object
ACL is modified, which adds an additional maintenance cost in
pg_upgrade.  Or if we want to keep the frontend simple and have more
backend facility to ease cross-catalog lookups for ACLs. Perhaps we
could also live with just checking after the ACLs of functions in the
short term and perhaps it covers most of the cases users would care
about..  That's tricky to conclude about and I am not sure which path
is better in the long-term, but at least it's worth discussing all
possible candidates IMO so as we make sure to not miss anything.


I checked what objects changed their signatures between master and 
9.6. I just ran pg_describe_object() for grantable object types, 
saved the output into a file and diffed the outputs. It seems that 
only functions and table columns changed their signatures. The list of 
functions is big, and here is the list of columns:


table pg_attrdef column adsrc
table pg_class column relhasoids
table pg_class column relhaspkey
table pg_constraint column consrc
table pg_proc column proisagg
table pg_proc column proiswindow
table pg_proc column protransform

As a result I think in pg_upgrade we could just check function and 
column signatures. It might simplify the patch. And if something 
changes in the future we could fix pg_upgrade too.
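As a rough sketch (not the exact query I used), the comparison amounts to
running something like this on both clusters and diffing the two outputs:

  -- functions only; other grantable object types are handled similarly
  SELECT pg_describe_object('pg_proc'::regclass, oid, 0) AS signature
  FROM pg_proc
  ORDER BY 1;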

New version of the patch differs from the previous one:
- it doesn't generate a script to revoke conflicting permissions (but 
the patch can be fixed easily)
- it generates a file, incompatible_objects_for_acl.txt, to report which 
objects changed their signatures
- it uses the pg_shdepend and pg_depend catalogs to get objects with custom 
privileges

- it uses pg_describe_object() to find objects with different signatures

Currently relations, attributes, languages, functions and procedures 
are scanned. According to the documentation, foreign databases, 
foreign-data wrappers, foreign servers, schemas and tablespaces also 
support ACLs. But some of them don't have entries initialized during 
initdb, and others, like schemas and tablespaces, didn't change their 
names. So the patch doesn't scan such objects.
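(The pg_shdepend lookup mentioned above can be sketched roughly as follows;
the query in the patch itself is more involved:

  -- objects that have ACL entries recorded against some role
  SELECT classid::regclass AS catalog, objid, objsubid
  FROM pg_shdepend
  WHERE deptype = 'a';
)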


Grigory, it would be great if you could try the patch. I tested it but I 
could have missed something.


I've tested the patch on upgrade from 9.5 to master and it works now, 
thank you.
But I think that 'incompatible_objects_for_acl.txt' is not a very 
informative way of reporting the source of the problem to the user.
There is no mention of the role names that hold the permissions preventing 
the upgrade, and even if there were, it is still up to the user to conjure 
a script to revoke all those grants, which is not very user-friendly.


I think it should generate a 'catalog_procedures.sql' script, as in the 
previous version, with all the REVOKE statements that must be executed 
for pg_upgrade to work.
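For instance, a single generated statement might look like this (the role
name here is made up):

  REVOKE ALL ON FUNCTION pg_stat_get_activity(integer) FROM app_user;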



--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Do we have a CF manager for November?

2019-12-01 Thread Tom Lane
Michael Paquier  writes:
> I have worked more on the CF.  And here are the final results:
> Committed: 36.
> Moved to next CF: 146.
> Withdrawn: 4.
> Rejected: 6.
> Returned with Feedback: 29.
> Total: 221. 

As always, many thanks for doing this tedious work!

regards, tom lane




Re: Issue about memory order on ARM

2019-12-01 Thread Tom Lane
"=?utf-8?B?55uP5LiA?="  writes:
> The code in GetSnapshotData() that reads the `xid` field of the PGXACT 
> struct has a dependency on the code in GetNewTransactionId() that writes 
> `MyPgXact->xid`. It means that the store of xid should happen before the 
> load of it. In C11, we can use [Release-Acquire 
> ordering](https://en.cppreference.com/w/c/atomic/memory_order#Release-Acquire_ordering)
>  to achieve that. But currently there is no special operation to ensure it 
> (and the [volatile](https://en.cppreference.com/w/c/language/volatile) keyword 
> should not play any role in this situation).
> So it means that when a backend A returns from GetNewTransactionId(), the 
> new value of `MyPgXact->xid` may still sit in the CPU store buffer or a CPU 
> cache line, so it is not yet visible to a backend B calling 
> GetSnapshotData(). So the assumption that 'all top-level XIDs <= 
> latestCompletedXid are either present in the ProcArray, or not running 
> anymore' may not be safe. 

You're ignoring the memory barriers that are implicit in LWLock
acquisition and release; as well as the fact that it's transaction
end, not start, that needs to be interlocked.  Please read the section
"Interlocking Transaction Begin, Transaction End, and Snapshots"
in src/backend/access/transam/README.

regards, tom lane




Re: [HACKERS] Block level parallel vacuum

2019-12-01 Thread Masahiko Sawada
On Sun, 1 Dec 2019 at 11:06, Sergei Kornilov  wrote:
>
> Hi
>
> > I think the advantage of the current approach is that once the parallel 
> > workers are launched, the leader can process indexes that don't support 
> > parallelism.  So, both types of indexes can be processed at the same time.
>
> In lazy_parallel_vacuum_or_cleanup_indexes I see:
>
> /*
>  * Join as a parallel worker. The leader process alone does that in
>  * case where no workers launched.
>  */
> if (lps->leaderparticipates || lps->pcxt->nworkers_launched == 0)
> 	vacuum_or_cleanup_indexes_worker(Irel, nindexes, stats, lps->lvshared,
> 									 vacrelstats->dead_tuples);
>
> /*
>  * Here, the indexes that had been skipped during parallel index vacuuming
>  * are remaining. If there are such indexes the leader process does vacuum
>  * or cleanup them one by one.
>  */
> vacuum_or_cleanup_skipped_indexes(vacrelstats, Irel, nindexes, stats, lps);
>
> So the parallel leader will first process parallel indexes along with the 
> parallel workers and skip the non-parallel ones. Only after reaching the end 
> of the index list will the parallel leader process the non-parallel indexes 
> one by one. Assuming equal index processing times, the parallel leader will 
> process (count of parallel indexes)/(nworkers+1) + all non-parallel indexes, 
> while the parallel workers will each process only (count of parallel 
> indexes)/(nworkers+1).  Am I wrong here?
>

I think I got your point. Your proposal is that it's more efficient if
we make the leader process vacuum the indexes that can be processed only
by the leader process (i.e. indexes not supporting parallel index vacuum)
while the workers are processing indexes supporting parallel index vacuum,
right? That way, we can process indexes in parallel as much as
possible. So maybe we can call vacuum_or_cleanup_skipped_indexes first
and then call vacuum_or_cleanup_indexes_worker. But I'm not sure that
there would be parallel-safe indexes remaining after the leader finished
vacuum_or_cleanup_indexes_worker, as described in your proposal.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Block level parallel vacuum

2019-12-01 Thread Masahiko Sawada
On Sat, 30 Nov 2019 at 04:06, Amit Kapila  wrote:
>
> On Fri, Nov 29, 2019 at 7:11 PM Masahiko Sawada
>  wrote:
> >
> > On Thu, 28 Nov 2019 at 11:57, Amit Kapila  wrote:
> > >
> >
> > > I think it is probably because this part of the code doesn't consider
> > > PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION.  I think if we want we
> > > can change it but I am slightly nervous about the code complexity this
> > > will bring but maybe that is fine.
> >
> > Right. I'll try to change it that way.
> >
>
> I am thinking that as PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION is
> a debugging/testing facility, we should ideally separate it out from
> the main patch.  BTW, I am hacking/reviewing the patch further, so I
> request you to wait a few days before we do anything in this
> regard.

Sure, thank you so much. I'll wait for your review comments.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: bitmaps and correlation

2019-12-01 Thread Justin Pryzby
On Sun, Dec 01, 2019 at 12:34:37PM +0900, Michael Paquier wrote:
> On Sat, Nov 02, 2019 at 03:26:17PM -0500, Justin Pryzby wrote:
> > Attached is a fixed and rebasified patch for cfbot.
> > Included inline for conceptual review.
> 
> Your patch still causes two regression tests to fail per Mr Robot's
> report: join and select.  Could you look at those problems?  I have
> moved the patch to next CF, waiting on author.

The regression failures seem to be due to deliberate, expected plan changes.

I think the patch still needs some discussion, but find attached a rebasified
patch including the regression diffs.
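As a worked illustration of the interpolation added by the patch, using the
default cost parameters (spc_seq_page_cost = 1.0, spc_random_page_cost = 4.0):
a correlation of 0.9 gives cost_per_page2 = 1.0 + (1 - 0.9^2)*(4.0 - 1.0) =
1.57, close to the sequential cost, while a correlation of 0 leaves the full
random cost of 4.0.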

Justin
>From cb2b02d6288bc0bc5432a7e7f5214ffcddec010e Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 1 Jan 2019 16:17:28 -0600
Subject: [PATCH v3] Use correlation statistic in costing bitmap scans..

Same as for an index scan, an uncorrelated bitmap (like modulus) which accesses
a certain number of pages across the entire length of a table should have its
cost estimate heavily weighted by random access, compared to a bitmap scan
which accesses the same number of pages across a small portion of the table.

Note, Tom points out that there are cases where a column could be
tightly-"clumped" without being highly-ordered.  Since we have correlation
already, we use that until such time as someone implements a new statistic for
clumpiness.
---
 src/backend/optimizer/path/costsize.c  | 84 --
 src/backend/optimizer/path/indxpath.c  |  8 ++--
 src/include/nodes/pathnodes.h  |  3 ++
 src/include/optimizer/cost.h   |  2 +-
 src/test/regress/expected/identity.out |  2 +-
 src/test/regress/expected/join.out | 42 ++---
 src/test/regress/expected/select.out   | 16 +++
 src/test/regress/sql/identity.sql  |  2 +-
 8 files changed, 110 insertions(+), 49 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593..aaac29a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -549,11 +549,12 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
 
 	/*
 	 * Save amcostestimate's results for possible use in bitmap scan planning.
-	 * We don't bother to save indexStartupCost or indexCorrelation, because a
-	 * bitmap scan doesn't care about either.
+	 * We don't bother to save indexStartupCost, because a
+	 * bitmap scan doesn't care.
 	 */
 	path->indextotalcost = indexTotalCost;
 	path->indexselectivity = indexSelectivity;
+	path->indexCorrelation = indexCorrelation;
 
 	/* all costs for touching index itself included here */
 	startup_cost += indexStartupCost;
@@ -986,12 +987,33 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	 * appropriate to charge spc_seq_page_cost apiece.  The effect is
 	 * nonlinear, too. For lack of a better idea, interpolate like this to
 	 * determine the cost per page.
+	 * Note this works at PAGE granularity, so even if we read 1% of a
+	 * table's tuples, if we have to read nearly every page, it should be
+	 * considered sequential.
 	 */
-	if (pages_fetched >= 2.0)
+	if (pages_fetched >= 2.0) {
+		double correlation, cost_per_page2;
 		cost_per_page = spc_random_page_cost -
 			(spc_random_page_cost - spc_seq_page_cost)
 			* sqrt(pages_fetched / T);
-	else
+
+		// XXX: interpolate based on correlation from seq_page_cost to rand_page_cost?
+		// a highly correlated bitmap scan 1) likely reads fewer pages; and,
+		// 2) at higher "density" (more sequential).
+		// double correlation = get_indexpath_correlation(root, bitmapqual);
+		correlation = ((IndexPath *)bitmapqual)->indexCorrelation;
+		cost_per_page2 = spc_seq_page_cost +
+			(1-correlation*correlation)*(spc_random_page_cost - spc_seq_page_cost);  // XXX: *sqrt(pages_fetched/T) ?
+		// There are two variables: fraction of pages(T) and correlation.
+		// If T is high, giving sequential reads, we want low cost_per_page regardless of correlation.
+		// If correlation is high, we (probably) want low cost per page.
+		// ...the exception is if someone does an =ANY() query on a list of non-consecutive values.
+		// Something like start_time=ANY('2017-01-01', '2017-02-01',...)
+		// which reads small number of rows from pages all across the length of the table.
+		// But index scan doesn't seem to do address that at all, so leave it alone for now.
+		cost_per_page=Min(cost_per_page, cost_per_page2);
+		// cost_per_page=sqrt(cost_per_page*cost_per_page + cost_per_page2*cost_per_page2);
+	} else
 		cost_per_page = spc_random_page_cost;
 
 	run_cost += pages_fetched * cost_per_page;
@@ -1035,15 +1057,16 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 
 /*
  * cost_bitmap_tree_node
- *		Extract cost and selectivity from a bitmap tree node (index/and/or)
+ *		Extract cost, selectivity, and correlation from a bitmap tree node (index/and/or)
  */
 void
-cost_bitmap_tree_node(Path *path, Cost *cost, Sel

Re: [HACKERS] Block level parallel vacuum

2019-12-01 Thread Masahiko Sawada
On Sat, 30 Nov 2019 at 22:11, Mahendra Singh  wrote:
>
> On Sat, 30 Nov 2019 at 19:18, Sergei Kornilov  wrote:
>>
>> Hello
>>
>> Is it possible to change the order of index processing by the parallel 
>> leader? In the v35 patchset I see the following order:
>> - start parallel processes
>> - the leader and parallel workers process the index list and possibly skip 
>> some entries
>> - after that the parallel leader rechecks the index list and processes the 
>> skipped indexes
>> - WaitForParallelWorkersToFinish
>> - WaitForParallelWorkersToFinish
>>
>> I think it would be better to:
>> - start parallel processes
>> - the parallel leader goes through the index list and processes only the 
>> indexes with skip_parallel_index_vacuum = true
>> - parallel workers process the indexes with skip_parallel_index_vacuum = false
>> - the parallel leader then joins the processing of the remaining 
>> parallel-safe indexes
>> - WaitForParallelWorkersToFinish
>>
>> This would give a shorter running time and a better load balance across the 
>> leader and workers in the case of a few non-parallel and a few parallel indexes.
>> (if the current order is expected and required for some reason, we need a 
>> comment in the code)
>>
>> Also a few notes on vacuumdb:
>> It seems we need a version check at least in vacuum_one_database and 
>> prepare_vacuum_command, similar to the SKIP_LOCKED or DISABLE_PAGE_SKIPPING 
>> features.
>> Discussion question: will the difference between the --parallel and --jobs 
>> parameters be confusing? We need more description for these options.
>
>
> While testing with different server configuration settings, I am getting an 
> error (ERROR:  no unpinned buffers available) in parallel vacuum, but normal 
> vacuum is working fine.
>
> Test Setup:
> max_worker_processes = 40
> autovacuum = off
> shared_buffers = 128kB
> max_parallel_workers = 40
> max_parallel_maintenance_workers = 40
> vacuum_cost_limit = 2000
> vacuum_cost_delay = 10
>
> Table description: the table has 16 indexes (14 btree, 1 hash, 1 BRIN) and 
> 1,000,000 tuples in total, and I am deleting all the tuples before firing 
> the vacuum command.
> Run the attached .sql file (test_16_indexes.sql):
> $ ./psql postgres
> postgres=# \i test_16_indexes.sql
>
> Re-start the server and do vacuum.
> Case 1) normal vacuum:
> postgres=# vacuum test ;
> VACUUM
> Time: 115174.470 ms (01:55.174)
>
> Case 2) parallel vacuum using 10 parallel workers:
> postgres=# vacuum (parallel 10)test ;
> ERROR:  no unpinned buffers available
> CONTEXT:  parallel worker
> postgres=#
>
> This error is coming from the 128kB shared buffer setting. I think, since I 
> launched 10 parallel workers and all are working in parallel, the shared 
> buffers are too small and I am getting this error.
>

Thank you for testing!

> Is this expected behavior with a small shared buffer size, or should we try 
> to come up with a solution for this?  Please let me know your thoughts.

I think it's normal behavior when the shared buffer is not big enough.
Since 10 processes in total were processing different pages at the
same time, and you set a small value for shared_buffers, the shared
buffer gets full easily, and you got the proper error. So I think in
this case we should consider either increasing the shared buffer size
or decreasing the parallel degree. I guess you could get this error
even when vacuuming 10 different tables concurrently instead.
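(For scale: assuming the default 8kB block size, shared_buffers = 128kB is
only 16 buffers, while the leader plus 10 workers can easily need more than
16 pages pinned at the same time.)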

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




surprisingly expensive join planning query

2019-12-01 Thread Tomas Vondra

Hi,

while evaluating one of the CF patches (the incremental sort one, but
that's mostly irrelevant), I ran into a strange issue with join planning
for a fairly simple query. I needed to assess how the patch affects query
planning for different GUCs, so I ran a group of queries and stashed the
results into a table with this structure

  CREATE TABLE plans (
query text,
index text,
option text,
plan text,
type text,
force boolean,
parallel boolean);

Essentially all the columns are 'dimensions' with the exception of the
'plan' column storing the explain plan generated.

The results (~60k rows / 30MB) are available here:

  https://drive.google.com/open?id=1Q4oR1KtaAil87lbMo-xUvvw_0wf_zDx-

  copy plans from '/tmp/results-100M.data';

To evaluate results, I needed to see which GUCs result in a different
plan compared to the master, so I did a query like this:

  with
master AS (select * from plans where option = 
''),
create_ordered_paths_parallel  AS (select * from plans where option = 
'devel_create_ordered_paths_parallel'),
create_partial_grouping_paths_2 AS (select * from plans where option = 
'devel_create_partial_grouping_paths_2'),
create_partial_grouping_paths  AS (select * from plans where option = 
'devel_create_partial_grouping_paths'),
standard_join_search   AS (select * from plans where option = 
'devel_standard_join_search'),
add_paths_to_grouping_rel  AS (select * from plans where option = 
'devel_add_paths_to_grouping_rel'),
gather_grouping_paths  AS (select * from plans where option = 
'devel_gather_grouping_paths'),
create_ordered_paths   AS (select * from plans where option = 
'devel_create_ordered_paths'),
add_paths_to_grouping_rel_parallel AS (select * from plans where option = 
'devel_add_paths_to_grouping_rel_parallel'),
set_rel_pathlist   AS (select * from plans where option = 
'devel_set_rel_pathlist'),
apply_scanjoin_target_to_paths AS (select * from plans where option = 
'devel_apply_scanjoin_target_to_paths')
  select
master.query,
master.index,
master.type,
master.force,
master.parallel,
md5(master.plan),
(CASE WHEN (master.plan = r1.plan) THEN NULL ELSE 'DIFF' END) guc1,
(CASE WHEN (master.plan = r2.plan) THEN NULL ELSE 'DIFF' END) guc2,
(CASE WHEN (master.plan = r3.plan) THEN NULL ELSE 'DIFF' END) guc3,
(CASE WHEN (master.plan = r4.plan) THEN NULL ELSE 'DIFF' END) guc4,
(CASE WHEN (master.plan = r5.plan) THEN NULL ELSE 'DIFF' END) guc5,
(CASE WHEN (master.plan = r6.plan) THEN NULL ELSE 'DIFF' END) guc6,
(CASE WHEN (master.plan = r7.plan) THEN NULL ELSE 'DIFF' END) guc7,
(CASE WHEN (master.plan = r8.plan) THEN NULL ELSE 'DIFF' END) guc8,
(CASE WHEN (master.plan = r9.plan) THEN NULL ELSE 'DIFF' END) guc9,
(CASE WHEN (master.plan = r10.plan) THEN NULL ELSE 'DIFF' END) guc10
  from
master
  join create_ordered_paths_parallel r1 using (query, index, type, force, 
parallel)
  join create_partial_grouping_paths r2 using (query, index, type, force, 
parallel)
  join create_partial_grouping_paths_2 r3 using (query, index, type, force, 
parallel)
  join standard_join_search r4 using (query, index, type, force, parallel)
  join add_paths_to_grouping_rel r5 using (query, index, type, force, 
parallel)
  join gather_grouping_paths r6 using (query, index, type, force, parallel)
  join create_ordered_paths r7 using (query, index, type, force, parallel)
  join add_paths_to_grouping_rel_parallel r8 using (query, index, type, 
force, parallel)
  join set_rel_pathlist r9 using (query, index, type, force, parallel)
  join apply_scanjoin_target_to_paths r10 using (query, index, type, force, 
parallel);

This however causes pretty serious issues during planning. Firstly, it
consumes insane amounts of memory, to the extent that on my machine it
crashes due to OOM.

If I lower the join_collapse_limit to 1, it works just fine, but once I
increase it too much, the memory consumption and planning time go
through the roof and it eventually crashes.

I did a bit of investigation, and after instrumenting aset.c a bit I got
a statistic like this:

     size | alloc count | alloc sum | free count | free sum |    diff
    ------+-------------+-----------+------------+----------+-----------
       64 |     5606157 | 358794048 |        118 |     7552 | 358786496

i.e. there are a lot of 64B chunks allocated, but almost none of them are
freed, resulting in a ~350MB leak. There are various other sizes with a
lot of allocated chunks, but nowhere close to this.

It seems most of this comes from find_mergeclauses_for_outer_pathkeys()
which builds matched_restrictinfos and then just leaves it allocated.
After pfreeing this (see attached patch), the memory usage gets way down
and the query completes. I'm sure there are other things we could pfree
to r

Re: [HACKERS] Block level parallel vacuum

2019-12-01 Thread Sergei Kornilov
Hi

> I think I got your point. Your proposal is that it's more efficient if
> we make the leader process vacuum the indexes that can be processed only
> by the leader process (i.e. indexes not supporting parallel index vacuum)
> while the workers are processing indexes supporting parallel index vacuum,
> right? That way, we can process indexes in parallel as much as
> possible.

Right

> So maybe we can call vacuum_or_cleanup_skipped_indexes first
> and then call vacuum_or_cleanup_indexes_worker. But I'm not sure that
> there would be parallel-safe indexes remaining after the leader finished
> vacuum_or_cleanup_indexes_worker, as described in your proposal.

I meant that after processing the missed indexes (those not supporting
parallel index vacuum), the leader can start processing indexes that do
support parallel index vacuum, along with the parallel workers.
Specifically, call vacuum_or_cleanup_skipped_indexes after starting the
parallel workers but before vacuum_or_cleanup_indexes_worker, or something
with a similar effect.
If we have 0 missed indexes, parallel vacuum will run as in the current
implementation, with leader participation.

Sorry for my unclear English...

regards, Sergei




Re: surprisingly expensive join planning query

2019-12-01 Thread Tom Lane
Tomas Vondra  writes:
> It seems most of this comes from find_mergeclauses_for_outer_pathkeys()
> which builds matched_restrictinfos and then just leaves it allocated.
> After pfreeing this (see attached patch), the memory usage gets way down
> and the query completes.

Interesting.  The memory leak was probably much less bad before
commit 1cff1b95a, since in the old List implementation this code
would have leaked only a list header.  It makes sense to me to
add the list_free.

Alternatively, it'd be possible to get rid of the separate List
altogether, and just add the rinfo's to "mergeclauses" immediately.
The functionality of the separate list could be replaced by a
bool variable remembering whether we found any matches in this
pass through the loop.  I think the code would be a little less
clear that way, but this report makes it clear that it's a
performance bottleneck, so maybe we should just change it.

regards, tom lane




Re: surprisingly expensive join planning query

2019-12-01 Thread Tomas Vondra

On Sun, Dec 01, 2019 at 01:27:04PM -0500, Tom Lane wrote:

Tomas Vondra  writes:

It seems most of this comes from find_mergeclauses_for_outer_pathkeys()
which builds matched_restrictinfos and then just leaves it allocated.
After pfreeing this (see attached patch), the memory usage gets way down
and the query completes.


Interesting.  The memory leak was probably much less bad before
commit 1cff1b95a, since in the old List implementation this code
would have leaked only a list header.  It makes sense to me to
add the list_free.



I forgot to mention that I tried older releases, back to 9.5 (I suspected it
might be related to parallel queries), and I get OOM crashes there too.
I can't say if the memory is leaking slower or faster, though.

I tried fixing 9.5: a simple pfree(matched_restrictinfos) triggers some
sort of list_concat error for me, which seemed a bit weird TBH.


Alternatively, it'd be possible to get rid of the separate List
altogether, and just add the rinfo's to "mergeclauses" immediately.
The functionality of the separate list could be replaced by a
bool variable remembering whether we found any matches in this
pass through the loop.  I think the code would be a little less
clear that way, but this report makes it clear that it's a
performance bottleneck, so maybe we should just change it.



Yes, that might be an option. And it works even on 9.5 for me (per the
attached patch). I don't think it's much less clear compared to just
doing an explicit free at the end.

It does fix cases up to join_collapse_limit = 10, but with 11 it
still OOM-crashes. That definitely depends on available memory, of
course.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 928bbae8bd..a1d3f1dd3c 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -980,8 +980,8 @@ find_mergeclauses_for_outer_pathkeys(PlannerInfo *root,
{
PathKey*pathkey = (PathKey *) lfirst(i);
EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
-   List   *matched_restrictinfos = NIL;
ListCell   *j;
+   boolfound = false;
 
/*--
 * A mergejoin clause matches a pathkey if it has the same EC.
@@ -1027,8 +1027,12 @@ find_mergeclauses_for_outer_pathkeys(PlannerInfo *root,
 
clause_ec = rinfo->outer_is_left ?
rinfo->left_ec : rinfo->right_ec;
-   if (clause_ec == pathkey_ec)
-   matched_restrictinfos = lappend(matched_restrictinfos, rinfo);
+
+   if (clause_ec != pathkey_ec)
+   continue;
+
+   mergeclauses = lappend(mergeclauses, rinfo);
+   found = true;
}
 
/*
@@ -1036,14 +1040,8 @@ find_mergeclauses_for_outer_pathkeys(PlannerInfo *root,
 * sort-key positions in the pathkeys are useless.  (But we can still
 * mergejoin if we found at least one mergeclause.)
 */
-   if (matched_restrictinfos == NIL)
+   if (!found)
break;
-
-   /*
-* If we did find usable mergeclause(s) for this sort-key position,
-* add them to result list.
-*/
-   mergeclauses = list_concat(mergeclauses, matched_restrictinfos);
}
 
return mergeclauses;


Re: Using multiple extended statistics for estimates

2019-12-01 Thread Tomas Vondra

On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:


Are you planning to submit a revised patch for this?



Yes, I'll submit a rebased version of this patch shortly. It got broken
by the recent fix in choose_best_statistics, so it shouldn't take long
to update the patch. I do have a couple more related patches in the
queue, so I want to submit them all at once.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: surprisingly expensive join planning query

2019-12-01 Thread Tom Lane
Tomas Vondra  writes:
> On Sun, Dec 01, 2019 at 01:27:04PM -0500, Tom Lane wrote:
>> Alternatively, it'd be possible to get rid of the separate List
>> altogether, and just add the rinfo's to "mergeclauses" immediately.
>> The functionality of the separate list could be replaced by a
>> bool variable remembering whether we found any matches in this
>> pass through the loop.  I think the code would be a little less
>> clear that way, but this report makes it clear that it's a
>> performance bottleneck, so maybe we should just change it.

> Yes, that might be an option. And it works even on 9.5 for me (per the
> attached patch). I don't think it's much less clear compared to just
> doing an explicit free at the end.

LGTM.

regards, tom lane




Re: Bogus EXPLAIN results with column aliases for mismatched partitions

2019-12-01 Thread Tom Lane
I wrote:
> * variant-a's diffs in expected/postgres_fdw.out indicate that
> postgres_fdw is doing something weird with the table aliases it selects to
> print in the "Relations:" output.  I think this is an independent bug ---
> and it surely *is* a bug, because the aliases printed by HEAD don't agree
> with the table aliases used for Vars of those relations.  But I haven't
> run it to ground yet.  (Notice that variant-b fixes those discrepancies in
> the opposite direction...)

I checked that, and indeed postgres_fdw is doing something randomly
different from what ruleutils does.  In set_rtable_names(), the
first priority is rte->alias->aliasname, and if that's not set
then (for a RELATION RTE) you get the result of get_rel_name().
postgres_fdw is taking rte->eref->aliasname as being the alias,
which is usually the same string, but "usually" doesn't cut it.
So we should make it look at rte->alias instead.

Now, there is another thing that set_rtable_names() is doing that
postgres_fdw doesn't do, which is to unique-ify the chosen alias
names by adding "_NN" if the querytree contains multiple uses of
the same table alias.  I don't see any practical way for postgres_fdw
to match that behavior, since it lacks global information about the
query.  If we wanted to fix it, what we'd likely need to do is
postpone creation of the relation_name strings until EXPLAIN,
providing some way for postgres_fdw to ask ruleutils.c what alias
it'd assigned to a particular RTE.  This seems like it wouldn't be
terribly painful for base relations but it'd be a mess for joins
and aggregates, so I'm not eager to do something like that.
In practice the presence or absence of "_NN" might not be too
confusing --- it's certainly not as bad as the inconsistency being
shown now.
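(As a minimal illustration of the renaming, using the ft4/ft5 foreign tables
from the regression tests: in a query such as

  SELECT * FROM ft4 t
  WHERE EXISTS (SELECT 1 FROM ft5 t WHERE t.c1 > 0);

both RTEs carry the alias "t", so ruleutils would print them as "t" and
"t_1", while postgres_fdw's "Relations:" line has no way to know about that
renaming.)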

In short then I propose the attached fix for this issue.

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 14180ee..5f7aa31 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8507,7 +8507,7 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
 
  Foreign Scan
Output: t1.a, ftprt2_p1.b, ftprt2_p1.c
-   Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2)
+   Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1)
Remote SQL: SELECT r5.a, r6.b, r6.c FROM (public.fprt1_p1 r5 LEFT JOIN public.fprt2_p1 r6 ON (((r5.a = r6.b)) AND ((r5.b = r6.a)) AND ((r6.a < 10 WHERE ((r5.a < 10)) ORDER BY r5.a ASC NULLS LAST, r6.b ASC NULLS LAST, r6.c ASC NULLS LAST
 (4 rows)
 
@@ -8689,17 +8689,17 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
 SET enable_partitionwise_aggregate TO true;
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
-  QUERY PLAN  
---
+ QUERY PLAN  
+-
  Sort
Sort Key: fpagg_tab_p1.a
->  Append
  ->  Foreign Scan
-   Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
+   Relations: Aggregate on (public.fpagg_tab_p1)
  ->  Foreign Scan
-   Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab)
+   Relations: Aggregate on (public.fpagg_tab_p2)
  ->  Foreign Scan
-   Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab)
+   Relations: Aggregate on (public.fpagg_tab_p3)
 (9 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fa14296..1bed907 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -576,7 +576,6 @@ postgresGetForeignRelSize(PlannerInfo *root,
 	RangeTblEntry *rte = planner_rt_fetch(baserel->relid, root);
 	const char *namespace;
 	const char *relname;
-	const char *refname;
 
 	/*
 	 * We use PgFdwRelationInfo to pass various information to subsequent
@@ -727,13 +726,12 @@ postgresGetForeignRelSize(PlannerInfo *root,
 	fpinfo->relation_name = makeStringInfo();
 	namespace = get_namespace_name(get_rel_namespace(foreigntableid));
 	relname = get_rel_name(foreigntableid);
-	refname = rte->eref->aliasname;
 	appendStringInfo(fpinfo->relation_name, "%s.%s",
 	 quote_identifier(namespace),
 	 quote_identifier(relname));
-	if (*refname && strcmp(refname, relname) != 0

Re: Bogus EXPLAIN results with column aliases for mismatched partitions

2019-12-01 Thread Tom Lane
I wrote:
> Now, there is another thing that set_rtable_names() is doing that
> postgres_fdw doesn't do, which is to unique-ify the chosen alias
> names by adding "_NN" if the querytree contains multiple uses of
> the same table alias.  I don't see any practical way for postgres_fdw
> to match that behavior, since it lacks global information about the
> query.  If we wanted to fix it, what we'd likely need to do is
> postpone creation of the relation_name strings until EXPLAIN,
> providing some way for postgres_fdw to ask ruleutils.c what alias
> it'd assigned to a particular RTE.

Hmmm ... so actually, that isn't *quite* as bad as I thought:
ExplainState does already expose that information, so we just need
to rearrange how postgres_fdw does things.  Here's a revised proposed
patch, which exposes (and fixes) several additional test cases where
the Relations: string was previously visibly out of sync with what
ruleutils was using for Var names.

BTW, the existing code always schema-qualifies the relation names,
on the rather lame grounds that it's producing the string without
knowing whether EXPLAIN VERBOSE will be specified.  In this code,
the verbose flag is available so it would be trivial to make the
output conform to EXPLAIN's normal policy.  I didn't change that
here because there'd be a bunch more test output diffs of no
intellectual interest.  Should we change it, or leave well enough
alone?

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 14180ee..d276545 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1348,7 +1348,7 @@ SELECT t1.c1, ss.a, ss.b FROM (SELECT c1 FROM ft4 WHERE c1 between 50 and 60) t1
 
  Foreign Scan
Output: ft4.c1, ft4_1.c1, ft5.c1
-   Relations: (public.ft4) FULL JOIN ((public.ft4) FULL JOIN (public.ft5))
+   Relations: (public.ft4) FULL JOIN ((public.ft4 ft4_1) FULL JOIN (public.ft5))
Remote SQL: SELECT s4.c1, s10.c1, s10.c2 FROM ((SELECT c1 FROM "S 1"."T 3" WHERE ((c1 >= 50)) AND ((c1 <= 60))) s4(c1) FULL JOIN (SELECT s8.c1, s9.c1 FROM ((SELECT c1 FROM "S 1"."T 3" WHERE ((c1 >= 50)) AND ((c1 <= 60))) s8(c1) FULL JOIN (SELECT c1 FROM "S 1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))) s9(c1) ON (((s8.c1 = s9.c1 WHERE (((s8.c1 IS NULL) OR (s8.c1 IS NOT NULL s10(c1, c2) ON (((s4.c1 = s10.c1 ORDER BY s4.c1 ASC NULLS LAST, s10.c1 ASC NULLS LAST, s10.c2 ASC NULLS LAST
 (4 rows)
 
@@ -2084,7 +2084,7 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2
  Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1"
->  Foreign Scan
  Output: t1_1.c1, t2_1.c1
- Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+ Relations: (public.ft1 t1_1) INNER JOIN (public.ft2 t2_1)
  Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1"
 (20 rows)
 
@@ -3230,7 +3230,7 @@ select count(*), x.b from ft1, (select c2 a, sum(c1) b from ft1 group by c2) x w
Output: x.b, x.a
->  Foreign Scan
  Output: ft1_1.c2, (sum(ft1_1.c1))
- Relations: Aggregate on (public.ft1)
+ Relations: Aggregate on (public.ft1 ft1_1)
  Remote SQL: SELECT c2, sum("C 1") FROM "S 1"."T 1" GROUP BY 1
 (21 rows)
 
@@ -8480,15 +8480,15 @@ ANALYZE fprt2_p2;
 -- inner join three tables
 EXPLAIN (COSTS OFF)
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
- QUERY PLAN 
-
+QUERY PLAN
+--
  Sort
Sort Key: t1.

Re: Allow 'sslkey' and 'sslcert' in postgres_fdw user mappings

2019-12-01 Thread Andrew Dunstan


On 11/30/19 8:48 PM, Michael Paquier wrote:
> On Thu, Oct 31, 2019 at 07:54:41PM -0400, Andrew Dunstan wrote:
>> This patch achieves $SUBJECT and also provides some testing of the
>> sslpassword setting.
> The patch does not apply anymore, so a rebase is needed.  As it has
> not been reviewed, I am moving it to next CF, waiting on author.



That's OK. This patch is dependent, as it always has been, on the patch
to allow passwordless user mappings for postgres_fdw. I hope to commit
that soon, but I'd prefer some signoff from prominent hackers, as I
don't want to go too far down this road and then encounter a bunch of
objections.


cheers


andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Implementing Incremental View Maintenance

2019-12-01 Thread Tatsuo Ishii
Michael,

> A rebase looks to be necessary, Mr Robot complains that the patch does
> not apply cleanly.  As the thread is active recently, I have moved the
> patch to next CF, waiting on author.

Thank you for taking care of this patch. Hoshiai-san, can you please
rebase the patch?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Yet another vectorized engine

2019-12-01 Thread Hubert Zhang
Hi Konstantin,
Thanks for your reply.

On Fri, Nov 29, 2019 at 12:09 AM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> On 28.11.2019 12:23, Hubert Zhang wrote:
>
> We just want to introduce another POC for vectorized execution engine
> https://github.com/zhangh43/vectorize_engine and want to get some
> feedback on the idea.
>
> But I do not completely understand why you are proposing to implement it
> as an extension.
> Yes, custom nodes make it possible to provide vectorized execution without
> affecting Postgres core.
> But for efficient integration of zedstore and a vectorized executor we need
> to extend the table-AM (VectorTupleTableSlot and corresponding scan functions).
> Certainly it is easier to contribute a vectorized executor as an extension,
> but sooner or later I think it should be added to Postgres core.
>
> As far as I understand you already have some prototype implementation
> (otherwise how did you get the performance results)?
> If so, are you planning to publish it, or do you think that the executor
> should be developed from scratch?
>

The prototype extension is at https://github.com/zhangh43/vectorize_engine.
I agree a vectorized executor should be added to Postgres core someday.
But it is such a huge feature that it requires changes not only to the
extended table-AM you mentioned but also to every executor node, such as the
Agg, Join and Sort nodes. What's more, the expression evaluation functions
and the aggregates' transition functions, combine functions, etc. all need
vectorized versions. Hence, we are implementing it as an extension first,
and if it becomes popular in the community and stable, we can merge it into
Postgres core whenever we want.

We do want to get some feedback from the community about CustomScan.
CustomScan is just an abstraction layer. It's typically used to support
user-defined scan nodes, but some other PG extensions (e.g. PG-Strom) have
already used it as a general custom node, e.g. Agg, Join etc. Since a
vectorized engine needs to support vectorized processing in all executor
nodes, following the above idea, our choice is to use CustomScan.


>  Some of my concerns, based on VOPS experience:
>

> 1. The vertical (columnar) model is preferable for some kinds of queries,
> but there are some classes of queries for which it is less efficient.
> Moreover, data is usually imported into the database in row format.
> Inserting it into a columnar store record-by-record is very inefficient.
> So you need some kind of bulk loader which can buffer input data before
> loading it into the columnar store.
> Actually this problem is more related to the data model than to the
> vectorized executor. But what I want to express here is that it may be
> better to have both representations (horizontal and vertical)
> and let the optimizer choose the most efficient one for a particular query.
>
>
Yes, in general, row format is better for OLTP queries and column format is
better for OLAP queries.
As for the storage type (or data model), I think the DBA should choose
whether to use a row or column store for a specific table.
As for the executor, it's a good idea to let the optimizer choose based on
cost. That is a long-term goal; for now our extension falls back to the
original row executor for the Insert, Update and IndexScan cases in a rough
way. We want our extension to be enhanced gradually.
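(To illustrate the per-table choice: assuming a columnar table access method
such as zedstore were installed, the table-AM syntax available since
PostgreSQL 12 would let the DBA write something like

  -- hypothetical: assumes a columnar AM named "zedstore" is available
  CREATE TABLE sales_fact (item_id int, amount numeric) USING zedstore;
  CREATE TABLE orders (order_id int, item_id int) USING heap;  -- row store

and let the planner and executor treat each table accordingly.)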


> 2. A columnar store and vectorized executor are most efficient for queries
> like "select sum(x) from T where ...".
> Unfortunately such simple queries are rarely used in real life. Usually
> analytic queries contain group-by and joins.
> And here the vertical model is not always optimal (you have to reconstruct
> rows from columns to perform a join or grouping).
> To provide efficient execution of queries you may need to create multiple
> different projections of the same data (sorted by different subsets of
> attributes).
> This is why Vertica (one of the most popular columnar store DBMSes)
> supports projections.
> The same can be done in VOPS: using the create_projection function you can
> specify which attributes should be scalar (grouping attributes) and which
> vectorized.
> In this case you can perform grouping and joins using the standard Postgres
> executor, while performing vectorized operations for filtering and
> accumulating aggregates.
>

> This is why Q1 is 20 times faster in VOPS, and not 2 times as in your
> prototype.
> So I think that a columnar store should make it possible to maintain several
> projections of a table, and the optimizer should be able to automatically
> choose one of them for a particular query.
> Definitely, synchronization of projections is a challenging problem.
> Fortunately OLAP usually does not require the most recent data.

Projection in Vertica is useful. I tested, and VOPS is really faster. It
would be nice if you could contribute it to PG core. Our extension aims to
avoid changing any Postgres code, the user's SQL, or existing tables.
We will continue to optimize our vectorized implementation. Vectorized
hashagg needs a vectorized hash table implementation, e.g. calcul

Re: Implementing Incremental View Maintenance

2019-12-01 Thread Tatsuo Ishii
>> One thing pending in this development line is how to catalogue aggregate
>> functions that can be used in incrementally-maintainable views.
>> I saw a brief mention somewhere that the devels knew it needed to be
>> done, but I don't see in the thread that they got around to doing it.
>> Did you guys have any thoughts on how it can be represented in catalogs?
>> It seems sine-qua-non ...
> 
> Yes, this is a pending issue. Currently, supported aggregate functions are
> identified by their names; that is, we support aggregate functions named
> "count", "sum", "avg", "min", or "max". As mentioned before, this is not
> robust because there might be user-defined aggregates with these names,
> although all built-in aggregates can be used in IVM.
> 
> In our implementation, the new aggregate values are calculated using "+" and
> "-" operations for sum and count, "/" for avg, and ">=" / "<=" for min/max.
> Therefore, if there is a user-defined aggregate on a user-defined type which
> doesn't support these operators, errors will be raised. Obviously, this is a
> problem.  Even if these operators are defined, the semantics of user-defined
> aggregate functions might not match the way the views are maintained, and
> the result might be incorrect.
> 
> I think there are at least three options to prevent these problems.
> 
> In the first option, we support only built-in aggregates which we know we
> can handle correctly. Supported aggregates can be identified by their OIDs.
> User-defined aggregates are not supported. I think this is the simplest and
> easiest way.

I think this is enough for the first cut of IVM. So +1.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Yet another vectorized engine

2019-12-01 Thread Hubert Zhang
On Sun, Dec 1, 2019 at 10:05 AM Michael Paquier  wrote:

> On Thu, Nov 28, 2019 at 05:23:59PM +0800, Hubert Zhang wrote:
> > Note that the vectorized executor engine is based on PG9.6 now, but it
> > could be ported to master / zedstore with some effort.  We would
> appreciate
> > some feedback before moving further in that direction.
>
> There has been no feedback yet, unfortunately.  The patch does not
> apply anymore, so a rebase is necessary.  For now I am moving the
> patch to next CF, waiting on author.
> --
> Michael
>

Thanks, we'll rebase and resubmit the patch.
-- 
Thanks

Hubert Zhang


Re: Implementing Incremental View Maintenance

2019-12-01 Thread Tatsuo Ishii
> On Fri, Nov 29, 2019 at 06:19:54PM +0900, Yugo Nagata wrote:
>> Sorry, an unfinished line was left... Please ignore this.
> 
> A rebase looks to be necessary, Mr Robot complains that the patch does
> not apply cleanly.

Is this because the patch has a ".gz" suffix?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp





RE: extension patch of CREATE OR REPLACE TRIGGER

2019-12-01 Thread osumi.takami...@fujitsu.com
Dear Michael san

> The latest patch includes calls to heap_open(), causing its compilation to 
> fail.
> Could you please send a rebased version of the patch?  I have moved the entry 
> to
> next CF, waiting on author.
Thanks. I've fixed what you pointed out.
Also, I'm waiting for any other feedback.

Regards,
Osumi Takamichi


CREATE_OR_REPLACE_TRIGGER_v05.patch
Description: CREATE_OR_REPLACE_TRIGGER_v05.patch


Re: Implementing Incremental View Maintenance

2019-12-01 Thread Takuma Hoshiai
On Mon, 02 Dec 2019 10:01:18 +0900 (JST)
Tatsuo Ishii  wrote:

> Michael,
> 
> > A rebase looks to be necessary, Mr Robot complains that the patch does
> > not apply cleanly.  As the thread is active recently, I have moved the
> > patch to next CF, waiting on author.
> 
> Thank you for taking care of this patch. Hoshiai-san, can you please
> rebase the patch?


Sure,
I have re-created the patch. It combines 'IVM_8.patch' and
'create_materialized_view.patch', and I checked that it applies to the
latest master.

> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 

Best Regards,

-- 
Takuma Hoshiai 


IVM_v9.patch.gz
Description: Binary data


Re: fe-utils - share query cancellation code

2019-12-01 Thread Michael Paquier
On Fri, Nov 29, 2019 at 08:44:25AM +0100, Fabien COELHO wrote:
> I do not have a Windows host, so I can only do blind tests. I just moved the
> declaration out of !WIN32 scope in attached v7, which might solve the
> resolution error. All other issues pointed out above seem fixed in the v6
> you sent.

Committed the patch after splitting things into two commits (the actual
refactoring and the pgbench changes) and after testing things from both
Linux and a Windows console.  While polishing the code, I have
found the upthread argument of Ibrar quite appealing because there are
use cases where a callback can be interesting on Windows, like simply
being able to log the cancel event to a different source.  So I have
removed the callback restriction and the assertion, making the
callback of psql a no-op on Windows.  A second thing is that two large
comments originally in psql had better be moved to cancel.c because
the logic with libpq cancel routines applies only there.
--
Michael


signature.asc
Description: PGP signature


Re: Bogus EXPLAIN results with column aliases for mismatched partitions

2019-12-01 Thread Amit Langote
On Sun, Dec 1, 2019 at 11:40 AM Tom Lane  wrote:
> then the EXPLAIN output produced by HEAD looks like:
>
>   QUERY PLAN
> ---
>  Sort
>Output: p.x, p.b
>Sort Key: p.x
>->  Append
>  ->  Seq Scan on public.part_p1 p
>Output: p.x, p.b
>  ->  Seq Scan on public.part_rev p_1
>Output: p_1.a, p_1.x
>  ->  Seq Scan on public.part_p2_p1 p_2
>Output: p_2.x, p_2.b
> (10 rows)
>
> wherein the "x" alias for column "a" has been applied to part_rev.b.
> That's wrong and confusing.

Agreed.

> The reason this happens is that expand_single_inheritance_child()
> just clones the parent RTE's alias node without any thought for
> the possibility that the columns don't match one-to-one.  It's
> an ancient bug that affects traditional inheritance as well as
> partitioning.
>
> I experimented with fixing this by making expand_single_inheritance_child
> generate a correctly-adjusted child alias node, which seems reasonable
> since it takes pains to adjust the rest of the child RTE for the different
> column layout.  It turns out to be slightly tedious to do that without
> causing a lot of regression test diffs: if we add an alias node where
> there was none before, that affects ruleutils.c's selection of table
> aliases not just column aliases.  The "variant-a" patch below mostly
> succeeds in avoiding test diffs, but it adds a fair amount of complication
> to inherit.c.  The "variant-b" patch below uses a simpler way of setting
> up the child aliases, which results in a whole lot of test diffs: all
> children of a parent named "x" will now show in EXPLAIN with aliases like
> "x_1", "x_2", etc.  (That happens already if you wrote an explicit table
> alias "x", but not if you didn't.)  While my initial reaction was that
> that was an unacceptable amount of churn, the idea gets more appealing the
> more I think about it.  It means that tables you did not name in the query
> will be shown with aliases that clearly identify their connection to
> something you did name.  So despite the added churn, I'm kind of attracted
> to the variant-b approach.  (Note that the discussion in [1] is almost
> certainly going to result in some changes to ruleutils.c's alias-selection
> behavior anyway, so I don't think staying exactly compatible with v12 is
> worth much here.)
>
> On the other hand, if we went with variant-a it might be plausible to
> back-patch this fix.  But given the fact that this is a mostly cosmetic
> problem, and we've not had field complaints, I don't feel a big need
> to fix it in the back branches.

I tend to agree that "variant b" is more appealing, for the reason that
it makes the connection between the child RTEs and the table named in the
query from which they originate more explicit.

> * To make computing the modified column alias list cheap, I made
> make_inh_translation_list() compute a reverse-mapping array showing the
> parent column associated with each child column.  I'm more than a little
> bit tempted to add that array to the AppendRelInfo struct, instead of
> passing it back separately and then discarding it.  We could use the info
> to simplify and speed up the reverse-mapping logic added by commit
> 553d2ec27, and I bet there will be other use-cases later.  But I've not
> done that in these patches.

+1

Thanks,
Amit




Failure in TAP tests of pg_ctl on Windows with parallel instance set

2019-12-01 Thread Michael Paquier
Hi all,

I have run the TAP tests with an instance of Postgres locally set up at
port 5432 on Windows, only to notice that 001_start_stop.pl fails various
tests because the test tries to use the default port for the node
initialized with pg_ctl.  The problem can be fixed easily by
assigning a random port number to that instance.

It could potentially become a problem if TAP tests running in parallel
on Windows initialize their nodes with conflicting ports, but that's
actually already a problem now for all the tests, as all nodes listen
on 127.0.0.1 in this case.  This cannot happen on *nix simply because
we use a unique unix domain socket path, so even if ports conflict
things still work.

Attached is a patch to fix this issue, that I would like to
back-patch down to 9.4 where the issue can show up.

Any objections?
--
Michael
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index e5d46a6f25..6a1619e171 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -22,8 +22,10 @@ command_ok([ 'pg_ctl', 'initdb', '-D', "$tempdir/data", '-o', '-N' ],
 	'pg_ctl initdb');
 command_ok([ $ENV{PG_REGRESS}, '--config-auth', "$tempdir/data" ],
 	'configure authentication');
+my $node_port = get_free_port();
 open my $conf, '>>', "$tempdir/data/postgresql.conf";
 print $conf "fsync = off\n";
+print $conf "port = $node_port\n";
 print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
   if defined $ENV{TEMP_CONFIG};
 


signature.asc
Description: PGP signature


Re: Issue about memory order on ARM

2019-12-01 Thread 盏一
Sorry to bother you; I now see that there is no problem here.

The model for reading and writing of PGXACT::xid and 
ShmemVariableCache->latestCompletedXid can be simplified as follows:

  backend A                  backend B                     backend C
  wlock(XidGenLock);         wlock(XidGenLock);            rlock(ProcArrayLock);
  write APgXact->xid;        write BPgXact->xid;           read latestCompletedXid;
  unlock(XidGenLock);        unlock(XidGenLock);           read APgXact->xid;
                             ...                           read BPgXact->xid;
                             wlock(ProcArrayLock);         unlock(ProcArrayLock);
                             write latestCompletedXid;
                             unlock(ProcArrayLock);

My previous concern was that C might not be able to see the value of
APgXact->xid written by A because there was no explicit acquire-release
operation in between. But now I find that there already are some
acquire-release operations here. Because of the `unlock(XidGenLock)` in A,
the `wlock(XidGenLock)` in B, and the rules introduced in [Inter-thread
happens-before](https://en.cppreference.com/w/cpp/atomic/memory_order), we
know that the `write APgXact->xid` in A inter-thread happens before the
`write BPgXact->xid` in B. And `write BPgXact->xid` is sequenced before
`write latestCompletedXid` in B, according to the rules introduced in
[Sequenced-before rules](https://en.cppreference.com/w/cpp/language/eval_order).
Similarly, the `write latestCompletedXid` in B inter-thread happens before
the `read latestCompletedXid` in C. So the `write APgXact->xid` in A
inter-thread happens before the `read APgXact->xid` in C, and C can see the
value of APgXact->xid written by A.

Missing data_sync_elevel() for some calls of pg_fsync()?

2019-12-01 Thread Michael Paquier
Hi all,

I was just looking at some callers of pg_fsync(), and noticed that some
code paths don't use data_sync_elevel().  For some code paths it's
actually better to never PANIC (say the backup_label file, logical
decoding snapshots, the lock file where FATAL/LOG are used now, etc.).
However I have spotted three code paths where this is not done and I
think that's not fine:
- 2PC file generated at checkpoint time.
- WAL segment initialization.
- Temporary state file for a replication slot save, which may cause
ERRORs at checkpoint time.
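
For reference, data_sync_elevel() essentially promotes the caller's
error level to PANIC unless the data_sync_retry GUC is enabled, roughly
along these lines (a simplified sketch of the logic in fd.c, not a
verbatim copy):

int
data_sync_elevel(int elevel)
{
	/* cannot safely retry a failed data-file sync, so escalate */
	return data_sync_retry ? elevel : PANIC;
}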

Any thoughts?
--
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 529976885f..0cae748348 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1661,7 +1661,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 (errcode_for_file_access(),
  errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5f0ee50092..986d95243c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3315,7 +3315,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 
 		close(fd);
 		errno = save_errno;
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 (errcode_for_file_access(),
  errmsg("could not fsync file \"%s\": %m", tmppath)));
 	}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 21ae8531b3..68f7ba1247 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1307,7 +1307,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
 		errno = save_errno;
-		ereport(elevel,
+		ereport(data_sync_elevel(elevel),
 (errcode_for_file_access(),
  errmsg("could not fsync file \"%s\": %m",
 		tmppath)));




Re: pgbench -i progress output on terminal

2019-12-01 Thread Amit Langote
Hi Fabien,

On Sun, Dec 1, 2019 at 4:33 AM Fabien COELHO  wrote:
> Patch applies, compiles, works for me. No further comments.
>
> I switched the patch as ready.

Thanks a lot.

Regards,
Amit




Re: Missing data_sync_elevel() for some calls of pg_fsync()?

2019-12-01 Thread Thomas Munro
On Mon, Dec 2, 2019 at 5:58 PM Michael Paquier  wrote:
> I was just looking at some callers of pg_fsync(), to notice that some
> code paths don't use data_sync_elevel().  For some code paths, that's
> actually better to never PANIC (say backup_label file, logical
> decoding snapshot, lock file where FATAL/LOG are used now, etc.).
> However I have spotted three code paths where this is not done and I
> think that's not fine:
> - 2PC file generated at checkpoint time.
> - WAL segment initialization.
> - Temporary state file for a replication slot save, which may cause
> ERRORs at checkpoint time.

One of the distinctions I had in mind when reviewing/working on the
PANIC stuff was this:

1.  Some code opens a file, writes some stuff to it, fsyncs it, and
then closes it, and if that fails and it ever retries, it'll redo all
of those steps again.  We know that some kernels might have thrown
away the data, but we don't care about the copy in the kernel's cache
because we'll write it out again next time around.

2.  Some code, primarily buffer pool write-back code, writes data out
to the file, then throws away the only copy we have of it other than
the WAL by using the buffer for some other block, and then later
(usually in the checkpointer) fsyncs it.  One way to look at it is
that if the fsync fails, the only place left to get that data (which
may represent committed transactions) is the WAL, and the only way to
get it is crash recovery.  Another way to look at it is that if we
didn't PANIC, the checkpointer would try again, but it's easily
demonstrable that if it tries again, certain kernels will do nothing
and then tell you that it succeeded, so we need to prevent that or our
checkpoint would be a data-eating lie.

I didn't look closely at your patch, but I think those are category 1,
no?  Admittedly there is an argument that we should panic in those
cases too, because otherwise we're second guessing how the kernel
works, and I did make a similar argument for why sync_file_range()
failure is panic-worthy.
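
A rough sketch of the category-1 pattern, with illustrative names
(write_whole_file() is not a real PostgreSQL function): every attempt
regenerates and rewrites the whole file before fsyncing it, so a
failure can simply be reported and the caller redoes all the steps on
retry.

#include <fcntl.h>
#include <unistd.h>

static int
write_whole_file(const char *path, const void *buf, size_t len)
{
	int			fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);

	if (fd < 0)
		return -1;
	/* on any failure the caller retries the whole open/write/fsync dance */
	if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}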




Re: Missing data_sync_elevel() for some calls of pg_fsync()?

2019-12-01 Thread Craig Ringer
On Mon, 2 Dec 2019 at 13:43, Thomas Munro  wrote:

>
> 1.  Some code opens a file, writes some stuff to it, fsyncs it, and
> then closes it, and if that fails and it ever retries, it'll redo all
> of those steps again.  We know that some kernels might have thrown
> away the data, but we don't care about the copy in the kernel's cache
> because we'll write it out again next time around.
>

Can we trust the kernel to report EIO or ENOSPC only for writeback of
the actual file we're fsync()ing, though? And not for other buffers it
happened to flush while servicing our fsync() request, failed to write,
and complained about?

I'm not confident I want to assume that.
-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Add a GUC variable that control logical replication

2019-12-01 Thread Craig Ringer
On Thu, 28 Nov 2019 at 11:53, Michael Paquier  wrote:

> On Wed, Nov 06, 2019 at 10:01:43PM +0800, Quan Zongliang wrote:
> > What the user needs is a single replication link that selectively
> > skips some transactions, and this choice should only affect the
> > sessions doing bulk deletes.  The operations of other sessions are
> > not affected and continue to produce replication messages.
> > For example, session 1 wants to bulk-delete 1 million old rows from
> > the T1 table, which can be done without replication.  At the same
> > time, session 2 deletes 10 records from T1, which is expected to be
> > passed on through replication.
> > Therefore, the two decoders cannot meet this requirement.  It is
> > also inappropriate to temporarily disable subscriptions, because
> > that skips all transactions for a certain period of time.
>
> Hmm.  The patch discussed on this thread does not have much support
> from Peter and Craig, so I am marking it as RwF.
>
>
Yeah. I'm not against it as such. But I'd like either to see it work by
exposing the ability to use DoNotReplicateId from SQL, or, if that's not
satisfactory, to replace that mechanism with the newly added one and
emulate DoNotReplicateId for backward compatibility.

I don't want two orthogonal ways to say "don't consider this for logical
replication".
-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: pgbench -i progress output on terminal

2019-12-01 Thread Michael Paquier
On Mon, Dec 02, 2019 at 02:30:47PM +0900, Amit Langote wrote:
> On Sun, Dec 1, 2019 at 4:33 AM Fabien COELHO  wrote:
>> Patch applies, compiles, works for me. No further comments.
>>
>> I switched the patch as ready.
> 
> Thanks a lot.

One issue with the patch as proposed is that its style differs from
what pg_rewind and pg_basebackup do in the same cases, but who
cares :)

By the way, the first patch sent on this thread had a bug when
redirecting stderr to a log file: it printed a newline on every
iteration of the loop over naccounts, whereas a log line should be
printed only every 100 rows or 100k rows, depending on whether quiet
mode is used, so the log file grew with mostly empty lines.  v3 gets
this right, since the line-ending character is appended only when a log
entry is actually printed.
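
For reference, the technique under discussion looks roughly like this
(a sketch with illustrative names, not the actual pgbench code):

#include <stdio.h>
#include <unistd.h>

static void
print_progress(long done, long total)
{
	/*
	 * On a terminal, end with '\r' so the next report overwrites the line
	 * in place; when redirected, end with '\n' so the log file gets one
	 * complete line per report.
	 */
	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';

	fprintf(stderr, "%ld of %ld tuples (%d%%) done%c",
			done, total, (int) (done * 100 / total), eol);
}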

Another question I have: why do this only for the data initialization
phase?  Wouldn't it make sense to be consistent with the other tools
that have --progress, and do the same dance in pgbench's
printProgressReport()?

NB: Note as well that pgindent complains about one thing, a newline
before the call to isatty().
--
Michael




Re: Control your disk usage in PG: Introduction to Disk Quota Extension

2019-12-01 Thread Haozhou Wang
Hi Michael,

Thank you very much for your email. I rebased the code against the newest
master; the updated patch is attached.

Thank you very much!

Regards,
Haozhou

On Sun, Dec 1, 2019 at 11:20 AM Michael Paquier  wrote:

> On Fri, Sep 27, 2019 at 11:30:08AM +0800, Haozhou Wang wrote:
> > I rebased this patch on the newest master branch and attached the new
> > version as disk_quota_hooks_v5.patch.
>
> This again needs a rebase, so I have switched it as waiting on
> author.
> --
> Michael
>


-- 
Regards,
Haozhou


disk_quota_hooks_v6.patch
Description: Binary data


Re: Implementing Incremental View Maintenance

2019-12-01 Thread Yugo Nagata
On Mon, 02 Dec 2019 10:36:36 +0900 (JST)
Tatsuo Ishii  wrote:

> >> One thing pending in this development line is how to catalogue aggregate
> >> functions that can be used in incrementally-maintainable views.
> >> I saw a brief mention somewhere that the devels knew it needed to be
> >> done, but I don't see in the thread that they got around to doing it.
> >> Did you guys have any thoughts on how it can be represented in catalogs?
> >> It seems sine-qua-non ...
> > 
> > Yes, this is a pending issue. Currently, supported aggregate functions
> > are identified by name; that is, we support aggregate functions named
> > "count", "sum", "avg", "min", or "max". As mentioned before, this is
> > not robust, because there might be user-defined aggregates with these
> > names, even though all built-in aggregates can be used in IVM.
> > 
> > In our implementation, the new aggregate values are calculated using
> > "+" and "-" operations for sum and count, "/" for avg, and ">=" / "<="
> > for min/max.  Therefore, if there is a user-defined aggregate on a
> > user-defined type which doesn't support these operators, an error will
> > be raised. Obviously, this is a problem.  Even if these operators are
> > defined, the semantics of a user-defined aggregate function might not
> > match the way the view is maintained, and the result might be
> > incorrect.
> > 
> > I think there are at least three options to prevent these problems.
> > 
> > In the first option, we support only the built-in aggregates which we
> > know we can handle correctly. Supported aggregates can be identified
> > by their OIDs; user-defined aggregates are not supported. I think this
> > is the simplest and easiest way.
> 
> I think this is enough for the first cut of IVM. So +1.

If there are no objections, I will add the aggregate function check
this way. Thanks.
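
For illustration, a standalone sketch of such an OID-based check; the
function name and structure are hypothetical, and the OID list is left
as a placeholder rather than inventing values:

#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;		/* stands in for PostgreSQL's Oid */

static bool
aggregate_supported_by_ivm(Oid aggfnoid)
{
	/*
	 * Allowlist of the built-in count/sum/avg/min/max aggregates; checking
	 * the pg_proc OID rather than the name rejects user-defined aggregates
	 * that merely reuse those names.  Placeholder entry only.
	 */
	static const Oid supported[] = {0};
	size_t		i;

	for (i = 0; i < sizeof(supported) / sizeof(supported[0]); i++)
	{
		if (supported[i] != 0 && aggfnoid == supported[i])
			return true;
	}
	return false;
}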

Regards,
Yugo Nagata

-- 
Yugo Nagata 




Re: Using XLogFileNameP in critical section

2019-12-01 Thread Michael Paquier
On Fri, Nov 29, 2019 at 06:44:58PM +0100, Masahiko Sawada wrote:
> I encountered a case where an assertion failure is reported instead of
> a proper PANIC message when fsyncing WAL fails. The cause is that there
> are multiple places where we call the XLogFileNameP() function, which
> palloc()s, inside a critical section, for example in XLogWrite().
> 
> TRAP: FailedAssertion("CritSectionCount == 0 ||
> (context)->allowInCritSection", File: "mcxt.c", Line: 956)
> 
> As far as I can see there are five places we need to fix.  I've
> attached a patch.

+msg = "could not fdatasync file \"%s\": %m";
Missing some translations, no?

You are missing a couple of code paths in walreceiver.c, in
XLogWalRcvWrite(), where XLogFileNameP() is used for a PANIC.  This
brings me to the following points:
1) If you look closely, all the callers of XLogFileNameP() are used
for the generation of error strings.
2) I am ready to bet that we'll have the same discussion in the future
because somebody will make the same mistake for a new code path.

I think that we had better just get rid of XLogFileNameP() (on HEAD)
and change those code paths so that they use a buffer of size
MAXFNAMELEN, with XLogFileName() generating the file name.  This
actually leads to some simplifications; see for example XLogWalRcvWrite.
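
The replacement pattern would look roughly like this (a sketch assuming
the usual backend context, where ThisTimeLineID, the segment number and
wal_segment_size are in scope):

	/*
	 * Instead of XLogFileNameP(), which palloc()s and is therefore unsafe
	 * in a critical section, format the name into a stack buffer.
	 */
	char		xlogfname[MAXFNAMELEN];

	XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);
	ereport(PANIC,
			(errcode_for_file_access(),
			 errmsg("could not fsync file \"%s\": %m", xlogfname)));

This keeps the error path palloc-free, which is what a critical section
requires.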
--
Michael

