Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-10 Thread Craig Ringer
On Thu, 10 Oct 2019 at 23:45, Andres Freund  wrote:

>
> Unless schema qualified you can't rely on that even without search_path
> changing. Consider an object in schema b existing, with a search_path of
> a,b. Even without the search path changing, somebody could create a type
> in a, "masking" the one in b.


True. I remember dealing with (or trying to deal with) that mess in BDR's
DDL replication.

I guess the question becomes "what is required to permit apps, clients
or drivers to safely and efficiently cache mappings of object names to
server side metadata".

E.g. if I have a non-schema-qualified relation name what server side help
is needed to make it possible to safely cache its column names, column
order, and column types? Or for a given function name, cache whether it's a
function or procedure, its IN, INOUT and OUT parameter names, positions and
types, etc?
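
For example, the kind of lookup a driver might want to cache for a bare
relation name is roughly this (just a sketch; "mytable" stands in for
whatever name the client saw, and the 'mytable'::regclass resolution is
itself search_path-dependent, which is exactly the problem):

SELECT a.attnum, a.attname, a.atttypid::regtype AS column_type
FROM pg_catalog.pg_attribute a
WHERE a.attrelid = 'mytable'::regclass   -- name resolution depends on search_path
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;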

You sensibly point out that being notified of search_path changes is not
sufficient. It might still be useful.

We have comprehensive server-side cache invalidation support. I wonder if
it's reasonable to harness that. Let clients request that we accumulate
invalidations and deliver them on the wire to the client at the end of each
statement, before ReadyForQuery, along with a backend-local invalidation
epoch that increments each time we send invalidations to the client.

The problem is that the client would have to send the invalidation epoch
along with each query so the server could tell it to flush its cache and
retry if there were invalidations since the end of the last query.

None of that is likely to be pretty given that we've been unable to agree
on any way to extend the protocol at client request.

Anyone have any ideas for "near enough is good enough" approaches or
methods that'd work with some client assistance / extension support / etc?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: BRIN index which is much faster never chosen by planner

2019-10-10 Thread Michael Lewis
On Thu, Oct 10, 2019 at 6:22 PM David Rowley 
wrote:

> The planner might be able to get a better estimate on the number of
> matching rows if the now() - interval '10 days' expression was
> replaced with 'now'::timestamptz - interval '10 days'. However, care
> would need to be taken to ensure the plan is never prepared since
> 'now' is evaluated during parse. The same care must be taken when
> creating views, functions, stored procedures and the like.
>
> The planner will just estimate the selectivity of now() - interval '10
> days' by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
> thinks it'll get 1/3rd of the table.  Using 'now' will allow the
> planner to look up actual statistics on that column which will likely
> give a much better estimate, which by the looks of it, likely will
> result in one of those BRIN indexes being used.
>

This surprised me a bit, and would have significant implications. I tested
a few different tables in our system and get the same row count estimate
with either WHERE condition. Perhaps I am missing a critical piece of what
you said.

explain
select * from charges where posted_on > now() - interval '10 days';

explain
select * from charges where posted_on > 'now'::timestamptz  - interval '10
days';


Re: maintenance_work_mem used by Vacuum

2019-10-10 Thread Masahiko Sawada
On Thu, Oct 10, 2019 at 6:38 PM Amit Kapila  wrote:
>
> On Thu, Oct 10, 2019 at 2:10 PM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 10, 2019 at 3:36 PM Amit Kapila  wrote:
> > >
> > > On Thu, Oct 10, 2019 at 9:58 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Wed, Oct 9, 2019 at 7:12 PM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > I think the current situation is not good but if we try to cap it to
> > > > > maintenance_work_mem + gin_*_work_mem then also I don't think it will
> > > > > make the situation much better.  However, I think the idea you
> > > > > proposed up-thread[1] is better.  At least the  maintenance_work_mem
> > > > > will be the top limit what the auto vacuum worker can use.
> > > > >
> > > >
> > > > I'm concerned that there are other index AMs that could consume more
> > > > memory like GIN. In principle we can vacuum third party index AMs and
> > > > will be able to even parallel vacuum them. I  expect that
> > > > maintenance_work_mem is the top limit of memory usage of maintenance
> > > > command but actually it's hard to set the limit to memory usage of
> > > > bulkdelete and cleanup by the core. So I thought that since GIN is
> > > > one of the index AMs it can have a new parameter to make its job
> > > > faster. If we have that parameter it might not make the current
> > > > situation much better but user will be able to set a lower value to
> > > > that parameter to not use the memory much while keeping the number of
> > > > index vacuums.
> > > >
> > >
> > > I can understand your concern why dividing maintenance_work_mem for
> > > vacuuming heap and cleaning up the index might be tricky especially
> > > because of third party indexes, but introducing a new GUC isn't free
> > > either.  I think that should be the last resort and we need buy-in
> > > from more people for that.  Did you consider using work_mem for this?
> >
> > Yeah that seems to work too. But I wonder if it could be a similar
> > story to gin_pending_list_limit. I mean that previously we used to use
> >  work_mem as the maximum size of the GIN pending list. But we concluded
> > that it was not appropriate to control both by one GUC so we
> > introduced gin_pending_list_limit and the storage parameter at commit
> > 263865a4
> >
>
> It seems you want to say about commit id
> a1b395b6a26ae80cde17fdfd2def8d351872f399.

Yeah thanks.

>  I wonder why they have not
> changed it to gin_pending_list_limit (at that time
> pending_list_cleanup_size) in that commit itself?  I think if we want
> to use gin_pending_list_limit in this context then we can replace both
> work_mem and maintenance_work_mem with gin_pending_list_limit.

Hmm, as far as I can see from the discussion, no one mentioned
maintenance_work_mem. Perhaps it was just an oversight? I also didn't know
that.

I also think we can replace at least the work_mem used for cleanup of the
pending list with gin_pending_list_limit. In the following comment in
ginfast.c,

/*
 * Force pending list cleanup when it becomes too long. And,
 * ginInsertCleanup could take significant amount of time, so we prefer to
 * call it when it can do all the work in a single collection cycle. In
 * non-vacuum mode, it shouldn't require maintenance_work_mem, so fire it
 * while pending list is still small enough to fit into
 * gin_pending_list_limit.
 *
 * ginInsertCleanup() should not be called inside our CRIT_SECTION.
 */
cleanupSize = GinGetPendingListCleanupSize(index);
if (metadata->nPendingPages * GIN_PAGE_FREESIZE > cleanupSize * 1024L)
needCleanup = true;

ISTM the gin_pending_list_limit in the above comment corresponds to
the following code in ginfast.c,

/*
 * We are called from regular insert and if we see concurrent cleanup
 * just exit in hope that concurrent process will clean up pending
 * list.
 */
if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
return;
workMemory = work_mem;

If work_mem is smaller than gin_pending_list_limit, the pending list
cleanup would behave against the intention of the above comment, which
prefers to do all the work in a single collection cycle while the pending
list is still small enough to fit into gin_pending_list_limit.
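
For example, nothing stops the two knobs from being set like this (a sketch
only, with a made-up index name), in which case an insert-triggered cleanup
runs with less memory than the pending list it is expected to drain in one
cycle:

ALTER INDEX some_gin_index SET (gin_pending_list_limit = 8192);  -- 8MB pending list for this index
SET work_mem = '4MB';  -- cleanup triggered from regular inserts is capped by work_mem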

Regards,

--
Masahiko Sawada




Re: BRIN index which is much faster never chosen by planner

2019-10-10 Thread David Rowley
On Fri, 11 Oct 2019 at 12:13, Tomas Vondra  wrote:
> The index scan is estimated to return 157328135 rows, i.e. about 50% of
> the table (apparently it's ~10x more than the actual number).

Don't pay too much attention to the actual row counts from bitmap
index scans of brin indexes. The value is entirely made up, per:

/*
* XXX We have an approximation of the number of *pages* that our scan
* returns, but we don't have a precise idea of the number of heap tuples
* involved.
*/
return totalpages * 10;

in bringetbitmap().

(Ideally EXPLAIN would be written in such a way that it didn't even
show the actual rows for node types that don't return rows. However,
I'm sure that would break many explain parsing tools)

The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.

The planner will just estimate the selectivity of now() - interval '10
days' by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
thinks it'll get 1/3rd of the table.  Using 'now' will allow the
planner to look up actual statistics on that column which will likely
give a much better estimate, which by the looks of it, likely will
result in one of those BRIN indexes being used.
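
For example (a sketch using the table and column names from upthread; this
form must not end up in a view, function or prepared statement, for the
reason above):

EXPLAIN
SELECT *
FROM log_table l
WHERE l.rec_insert_time >= 'now'::timestamptz - interval '10 days';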

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: BTP_DELETED leaf still in tree

2019-10-10 Thread Daniel Wood
> On October 10, 2019 at 1:18 PM Peter Geoghegan  wrote:
> 
> 
> On Thu, Oct 10, 2019 at 12:48 PM Daniel Wood  wrote:
> > Update query stuck in a loop.  Looping in _bt_moveright().
> 
> You didn't say which PostgreSQL versions were involved, and if the
> database was ever upgraded using pg_upgrade. Those details could
> matter.

PG_VERSION says 10.  I suspect we are running 10.9.  I have no idea if 
pg_upgrade was ever done.

> > ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright
> >
> > Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED 
> > Leaf.
> 
> This should hardly ever happen -- it is barely possible for an index
> scan to land on a BTP_DELETED leaf page (or a half-dead page) when
> following a downlink in its parent. Recall that nbtree uses Lehman &
> Yao's design, so _bt_search() does not "couple" buffer locks on the
> way down. It would probably be impossible to observe this happening
> without carefully setting breakpoints in multiple sessions.
> 
> If this happens reliably for you, which it sounds like, then you can
> already assume that the index is corrupt.
> 
> > btpo_next is also DELETED but not in the tree.
> >
> > btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key 
> > value.
> >
> > Thus creating an endless loop in moveright.
> 
> Offhand, these other details sound normal. The side links are still
> needed in fully deleted (BTP_DELETED) pages. And, moving right and
> finding lesser key values (not greater key values) is normal with
> deleted pages, since page deletion makes the keyspace move right, not
> left (moving the keyspace left is how the source Lanin & Shasha paper
> does it, though).
> 
> Actually, I take it back -- the looping part is not normal. The
> btpo_next->btpo_next page has no business linking back to the
> original/first deleted page you mentioned. That's just odd.

btpo_next->btpo_next does NOT link directly back to the 1st deleted page.  It 
simply links to some in-use page which is 50 or so leaf pages back in the tree. 
 Eventually we do reach the two deleted pages again.  Only the first one is in 
the 'tree'.

> Can you provide me with a dump of the page images? The easiest way of
> getting a page dump is described here:

Customer data.  Looks like meaningless customer data (5 digit key values).  But 
too much paperwork. :-)

The hard part for me to understand isn't just why the DELETED leaf node is
still referenced in the mid tree node.
It is that the step which sets BTP_DELETED should also have linked its left and
right siblings together.  But this hasn't been done.

Could the page have already been dirty, but because of "target !=
leafblkno", we didn't stamp a new LSN on it?  Could this allow us to write the
DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and 
XLOG_BTREE_UNLINK_PAGE being flushed?  Of course, I don't understand the 
"target != leafblkno".

In any case, thanks.

> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump
> 
> If I had to guess, I'd guess that this was due to a generic storage problem.
> 
> -- 
> Peter Geoghegan




Re: BRIN index which is much faster never chosen by planner

2019-10-10 Thread Tomas Vondra

On Thu, Oct 10, 2019 at 04:58:11PM -0500, Jeremy Finzel wrote:


...

Notice it chooses the smallest BRIN index with 1000 pages per range, and
this is far faster than the seq scan.

I do believe the estimate is actually way off.  Just a plain EXPLAIN of the
latter estimates 10x more rows than actual:
WindowAgg  (cost=24354689.19..24354705.07 rows=706 width=120)
  ->  Sort  (cost=24354689.19..24354690.95 rows=706 width=104)
        Sort Key: unique_cases.source, unique_cases.rec_insert_time
        ->  Subquery Scan on unique_cases  (cost=24354641.66..24354655.78 rows=706 width=104)
              ->  Unique  (cost=24354641.66..24354648.72 rows=706 width=124)
                    ->  Sort  (cost=24354641.66..24354643.42 rows=706 width=124)
                          Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                          ->  Nested Loop  (cost=2385.42..24354608.25 rows=706 width=124)
                                ->  Bitmap Heap Scan on log_table l  (cost=2383.82..24352408.26 rows=706 width=99)
                                      Recheck Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                      Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                      ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2383.64 rows=156577455 width=0)
                                            Index Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                ->  Bitmap Heap Scan on small_join_table filter  (cost=1.60..3.12 rows=1 width=8)
                                      Recheck Cond: (category = (l.category)::text)
                                      ->  Bitmap Index Scan on small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0)
                                            Index Cond: (category = (l.category)::text)
(17 rows)


Here is EXPLAIN only of the default chosen plan:
WindowAgg  (cost=24437857.18..24437873.07 rows=706 width=120)
  ->  Sort  (cost=24437857.18..24437858.95 rows=706 width=104)
        Sort Key: unique_cases.source, unique_cases.rec_insert_time
        ->  Subquery Scan on unique_cases  (cost=24437809.66..24437823.78 rows=706 width=104)
              ->  Unique  (cost=24437809.66..24437816.72 rows=706 width=124)
                    ->  Sort  (cost=24437809.66..24437811.42 rows=706 width=124)
                          Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                          ->  Nested Loop  (cost=0.00..24437776.25 rows=706 width=124)
                                Join Filter: ((l.category)::text = filter.category)
                                ->  Seq Scan on log_table l  (cost=0.00..24420483.80 rows=706 width=99)
                                      Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10 days'::interval)))
                                ->  Materialize  (cost=0.00..33.98 rows=1399 width=8)
                                      ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8)
(13 rows)



Any insight into this is much appreciated.  This is just one example of
many similar issues I have been finding with BRIN indexes scaling
predictably with insert-only workloads.



It's quite an interesting planning issue. The cost estimates are:

->  Seq Scan on foo.log_table l
(cost=0.00..24420483.80 rows=707 width=99) (actual

while for the bitmap heap scan it looks like this:

->  Bitmap Heap Scan on foo.log_table l
(cost=2391.71..24360848.29 rows=735 width=99) (actual

So the planner actually thinks the bitmap heap scan is a tad *cheaper*
but picks the seq scan anyway. This is likely because we don't really
compare the exact costs, but we do fuzzy comparison - the plan has to be
at least 1% cheaper to dominate the existing plan. This allows us to
save some work when replacing the paths.

In this case the difference is only about 0.2%, so we keep the seqscan
path. The real question is why the planner came to this cost, when it
got pretty good row estimates etc.

Looking at the cost_bitmap_heap_scan() I think the costing issue comes
mostly from this bit:

 /*
  * For small numbers of pages we should charge spc_random_page_cost
  * apiece, while if nearly all the table's pages are being read, it's more
  * appropriate to charge spc_seq_page_cost apiece.  The effect is
  * nonlinear, too. For lack of a better idea, interpolate like this to
  * determine the cost per page.
  */
 if (pages_fetched >= 2.0)
 cost_per_page = spc_random_page_cost -
 (spc_random_page_cost - spc_seq_page_cost)
 * sqrt(pages_fetched / T);
 else
 cost_per_page = spc_random_page_cost;
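
Just to illustrate that interpolation (a sketch assuming the default
random_page_cost = 4 and seq_page_cost = 1, with the bitmap expected to
touch about half of the table's pages):

SELECT 4.0 - (4.0 - 1.0) * sqrt(0.5) AS cost_per_page;  -- roughly 1.88, still well above seq_page_cost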

The index scan is estimated to return 157328135 rows, i.e. about 50% of
the table (apparently it's ~10x more than the actual number). This is
produced by compute_bitmap_pages() which also computes pages_fetched,
and I guess that's going to be 

Re: BRIN index which is much faster never chosen by planner

2019-10-10 Thread Michael Lewis
Since the optimizer is choosing a seq scan over an index scan when it seems
like it has good row estimates in both cases, to me that may mean the cost of
scanning the index is expected to be high. Is this workload on SSD? Has the
random_page_cost config been decreased from the default of 4 (compared with a
cost of 1 unit for a sequential scan)?

Your buffer hits aren't great. What is shared_buffers set to? How much ram
on this cluster?

With this table being insert-only, one assumes correlation is very high on
the data in this column as shown in pg_stats, but have you confirmed?
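
Something like this would show it (a sketch, using the schema/table/column
names from earlier in the thread):

SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'foo'
  AND tablename = 'log_table'
  AND attname = 'rec_insert_time';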

To me, distinct ON is often a bad code smell and probably can be re-written
to be much more efficient with GROUP BY, lateral & order by, or some other
tool. Same with the window function. It is a powerful tool, but sometimes
not the right one.

Is "source" a function that is called on field1? What is it doing/how is it
defined?


Re: stress test for parallel workers

2019-10-10 Thread Andrew Dunstan


On 10/10/19 5:34 PM, Tom Lane wrote:
> I wrote:
 Yeah, I've been wondering whether pg_ctl could fork off a subprocess
 that would fork the postmaster, wait for the postmaster to exit, and then
 report the exit status.
>> [ pushed at 6a5084eed ]
>> Given wobbegong's recent failure rate, I don't think we'll have to wait
>> long.
> Indeed, we didn't:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong&dt=2019-10-10%2020%3A54%3A46
>
> The tail end of the system log looks like
>
> 2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster 
> exited during a parallel transaction
> 2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: 
> session time: 0:00:02.896 user=fedora database=regression host=[local]
> /bin/sh: line 1: 14168 Segmentation fault  (core dumped) 
> "/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres"
>  -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
> postmaster exit status is 139
>
> So that's definitive proof that the postmaster is suffering a SIGSEGV.
> Unfortunately, we weren't blessed with a stack trace, even though
> wobbegong is running a buildfarm client version that is new enough
> to try to collect one.  However, seeing that wobbegong is running
> a pretty-recent Fedora release, the odds are that systemd-coredump
> has commandeered the core dump and squirreled it someplace where
> we can't find it.



At least on F29 I have set /proc/sys/kernel/core_pattern and it works.



>
> Much as one could wish otherwise, systemd doesn't seem likely to
> either go away or scale back its invasiveness; so I'm afraid we
> are probably going to need to teach the buildfarm client how to
> negotiate with systemd-coredump at some point.  I don't much want
> to do that right this minute, though.
>
> A nearer-term solution would be to reproduce this manually and
> dig into the core.  Mark, are you in a position to give somebody
> ssh access to wobbegong's host, or another similarly-configured VM?



I have given Mark my SSH key. That doesn't mean others interested shouldn't.


>
> (While at it, it'd be nice to investigate the infinite_recurse
> failures we've been seeing on all those ppc64 critters ...)
>
>   



Yeah.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





BRIN index which is much faster never chosen by planner

2019-10-10 Thread Jeremy Finzel
Good Afternoon,

I posted about this on another thread, but the topic was not precisely
planner issues, so I wanted to post it here.

I am running Postgres 11.5.  I have a table that is insert-only and has 312
million rows.  It is also pruned continuously to only the past year.  The size
of the table is 223 GB with indexes, 140GB without.  One of the fields is
rec_insert_time timestamptz.  Here are all potentially relevant table stats:

schemaname  | foo
relname | log_table
n_tup_ins   | 86850506
n_tup_upd   | 0
n_tup_del   | 68916115
n_tup_hot_upd   | 0
n_live_tup  | 312810691
n_dead_tup  | 9405132
n_mod_since_analyze | 11452
last_vacuum | 2019-09-20 09:41:43.78383-05
last_autovacuum | 2019-10-04 13:56:16.810355-05
last_analyze| 2019-10-10 09:34:26.807695-05
last_autoanalyze|
vacuum_count| 2
autovacuum_count| 1
analyze_count   | 13
autoanalyze_count   | 0
total_relation_size | 223 GB
relation_size   | 139 GB
table_size  | 140 GB

I have a simple query looking at past 10 days based on rec_insert_time, and
it will not choose the BRIN index even with several configurations.  Here
are all my relevant indexes (I intentionally do not have a btree on
rec_insert_time because I believe BRIN *should* fit better here):

"log_table_brand_id_product_rec_insert_time_idx" btree (brand_id, product,
rec_insert_time)
"log_table_rec_insert_time_idx" brin (rec_insert_time)
"log_table_rec_insert_time_idx1" brin (rec_insert_time) WITH
(pages_per_range='64')
"rec_insert_time_brin_1000" brin (rec_insert_time) WITH
(pages_per_range='1000')

And here is the SQL:
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= now() - interval '10 days'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases
;

This query will choose a seq scan on log_table every time in spite of these
BRIN indexes on rec_insert_time.
Michael Lewis had suggested I check
default_statistics_target for this column.  I raised it to 5000 for this
column and still it's choosing a seq scan.
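
(For completeness, raising the per-column target amounts to something like
the following sketch, which then needs a fresh ANALYZE to take effect:)

ALTER TABLE log_table ALTER COLUMN rec_insert_time SET STATISTICS 5000;
ANALYZE log_table;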

Here is default chosen plan (takes 2 minutes 12 seconds):
 WindowAgg  (cost=24437881.80..24437897.70 rows=707 width=120) (actual time=132173.173..132173.222 rows=53 loops=1)
   Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
   Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
   ->  Sort  (cost=24437881.80..24437883.56 rows=707 width=104) (actual time=132173.146..132173.149 rows=53 loops=1)
         Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
         ->  Subquery Scan on unique_cases  (cost=24437834.20..24437848.34 rows=707 width=104) (actual time=132172.950..132173.062 rows=53 loops=1)
               Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
               Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
               ->  Unique  (cost=24437834.20..24437841.27 rows=707 width=124) (actual time=132172.946..132173.048 rows=53 loops=1)
                     Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                     Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                     ->  Sort  (cost=24437834.20..24437835.96 rows=707 width=124) (actual time=132172.939..132172.962 rows=466 loops=1)
                           Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 90kB
                           Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                           ->  Nested Loop  (cost=0.00..24437800.73 rows=707 width=124) (actual time=4096.253..132171.425 rows=466 loops=1)
                                 Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                 Inner Unique: true
                                 Join Filter: 

Re: stress test for parallel workers

2019-10-10 Thread Mark Wong
On Thu, Oct 10, 2019 at 05:34:51PM -0400, Tom Lane wrote:
> A nearer-term solution would be to reproduce this manually and
> dig into the core.  Mark, are you in a position to give somebody
> ssh access to wobbegong's host, or another similarly-configured VM?
> 
> (While at it, it'd be nice to investigate the infinite_recurse
> failures we've been seeing on all those ppc64 critters ...)

Yeah, whoever would like access, just send me your ssh key and login
you'd like to use, and I'll get you set up.

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: stress test for parallel workers

2019-10-10 Thread Tom Lane
I wrote:
>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>> report the exit status.

> [ pushed at 6a5084eed ]
> Given wobbegong's recent failure rate, I don't think we'll have to wait
> long.

Indeed, we didn't:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong&dt=2019-10-10%2020%3A54%3A46

The tail end of the system log looks like

2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster 
exited during a parallel transaction
2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: 
session time: 0:00:02.896 user=fedora database=regression host=[local]
/bin/sh: line 1: 14168 Segmentation fault  (core dumped) 
"/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres"
 -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
postmaster exit status is 139

So that's definitive proof that the postmaster is suffering a SIGSEGV.
Unfortunately, we weren't blessed with a stack trace, even though
wobbegong is running a buildfarm client version that is new enough
to try to collect one.  However, seeing that wobbegong is running
a pretty-recent Fedora release, the odds are that systemd-coredump
has commandeered the core dump and squirreled it someplace where
we can't find it.

Much as one could wish otherwise, systemd doesn't seem likely to
either go away or scale back its invasiveness; so I'm afraid we
are probably going to need to teach the buildfarm client how to
negotiate with systemd-coredump at some point.  I don't much want
to do that right this minute, though.

A nearer-term solution would be to reproduce this manually and
dig into the core.  Mark, are you in a position to give somebody
ssh access to wobbegong's host, or another similarly-configured VM?

(While at it, it'd be nice to investigate the infinite_recurse
failures we've been seeing on all those ppc64 critters ...)

regards, tom lane




Re: BTP_DELETED leaf still in tree

2019-10-10 Thread Peter Geoghegan
On Thu, Oct 10, 2019 at 1:18 PM Peter Geoghegan  wrote:
> You didn't say which PostgreSQL versions were involved, and if the
> database was ever upgraded using pg_upgrade. Those details could
> matter.

In case you weren't aware, contrib/amcheck should make detecting and
diagnosing these kinds of problems a lot easier. You should prefer to
use the bt_index_parent_check() variant (which will block writes and
VACUUM, but not reads). It will verify that sibling pointers are in
agreement with each other, and that leaf pages contain keys that are
covered by the relevant separator keys from the parent level.

If you happen to be using v11, then you might also want to use the
heapallindexed option -- that will verify that the heap and index are
in agreement. If the issue is on v12, the new "rootdescend" option can
detect very subtle cross-level transitive consistency problems. (This
is only available in v12 because that was the version that made all
entries in the index unique by using heap TID as a unique-ifier.)
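
A minimal usage sketch (the index name is hypothetical; the second argument
is heapallindexed, and the third, rootdescend, only exists in v12):

CREATE EXTENSION IF NOT EXISTS amcheck;

-- v11 and later: verify parent/sibling relationships plus heap/index agreement
SELECT bt_index_parent_check('some_btree_index', true);

-- v12: additionally re-find each leaf tuple by descending from the root
SELECT bt_index_parent_check('some_btree_index', true, true);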

-- 
Peter Geoghegan




Re: BTP_DELETED leaf still in tree

2019-10-10 Thread Peter Geoghegan
On Thu, Oct 10, 2019 at 12:48 PM Daniel Wood  wrote:
> Update query stuck in a loop.  Looping in _bt_moveright().

You didn't say which PostgreSQL versions were involved, and if the
database was ever upgraded using pg_upgrade. Those details could
matter.

> ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright
>
> Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED Leaf.

This should hardly ever happen -- it is barely possible for an index
scan to land on a BTP_DELETED leaf page (or a half-dead page) when
following a downlink in its parent. Recall that nbtree uses Lehman &
Yao's design, so _bt_search() does not "couple" buffer locks on the
way down. It would probably be impossible to observe this happening
without carefully setting breakpoints in multiple sessions.

If this happens reliably for you, which it sounds like, then you can
already assume that the index is corrupt.

> btpo_next is also DELETED but not in the tree.
>
> btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key value.
>
> Thus creating an endless loop in moveright.

Offhand, these other details sound normal. The side links are still
needed in fully deleted (BTP_DELETED) pages. And, moving right and
finding lesser key values (not greater key values) is normal with
deleted pages, since page deletion makes the keyspace move right, not
left (moving the keyspace left is how the source Lanin & Shasha paper
does it, though).

Actually, I take it back -- the looping part is not normal. The
btpo_next->btpo_next page has no business linking back to the
original/first deleted page you mentioned. That's just odd.

Can you provide me with a dump of the page images? The easiest way of
getting a page dump is described here:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

If I had to guess, I'd guess that this was due to a generic storage problem.

-- 
Peter Geoghegan




Re: generating catcache control data

2019-10-10 Thread Tom Lane
... BTW, one other issue with changing this, at least if we want to
precompute tupdescs for all system catalogs used in catcaches, is that
that would put a very big crimp in doing runtime changes to catalogs.
While we'll probably never support changes in the physical layouts
of catalog rows, there is interest in being able to change some
auxiliary pg_attribute fields, e.g. attstattarget [1].  So we'd need
to be sure that the compiled-in tupdescs are only used to disassemble
catalog tuples, and not for other purposes.

Of course this issue arises already for the bootstrap catalogs, so
maybe it's been dealt with sufficiently.  But it's something to keep
an eye on.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/8b00ea5e-28a7-88ba-e848-21528b632354%402ndquadrant.com




Re: generating catcache control data

2019-10-10 Thread Tom Lane
John Naylor  writes:
> While digging through the archives, I found a thread from a couple
> years back about syscache performance. There was an idea [1] to
> generate the cache control data at compile time. That would to remove
> the need to perform database access to complete cache initialization,
> as well as the need to check in various places whether initialization
> has happened.

Right.

> 1. Generate the current syscache cacheinfo[] array and cacheid enum by
> adding a couple arguments to the declarations for system indexes, as
> in:
> #define DECLARE_UNIQUE_INDEX(name,oid,oid_macro,cacheid,num_buckets,decl)
> extern int no_such_variable

I do not like attaching this data to the DECLARE_UNIQUE_INDEX macros.
It's really no business of the indexes' whether they are associated
with a syscache.  It's *certainly* no business of theirs how many
buckets such a cache should start off with.

I'd be inclined to make a separate file that's specifically concerned
with declaring syscaches, and put all the required data there.

> Relname, and relisshared are straightforward. For eq/hash functions,
> we could add metadata attributes to pg_type.dat for the relevant
> types. Tuple descriptors would get their attrs from schemapg.h.

I don't see a need to hard-wire more information than we do today, and
I'd prefer not to because it adds to the burden of creating new syscaches.
Assuming that the plan is for genbki.pl or some similar script to generate
the constants, it could look up all the appropriate data from the initial
contents for pg_opclass and friends.  That is, basically what we want here
is for a constant-creation script to perform the same lookups that're now
done during backend startup.

> Is this something worth doing?

Hard to tell.  It'd take a few cycles out of backend startup, which
seems like a worthy goal; but I don't know if it'd save enough to be
worth the trouble.  Probably can't tell for sure without doing most
of the work :-(.

Perhaps you could break it up by building a hand-made copy of the
constants and then removing the runtime initialization code.  This'd
be enough to get data on the performance change.  Only if that looked
promising would you need to write the Perl script to compute the
constants.

regards, tom lane




Re: pgsql: Remove pqsignal() from libpq's official exports list.

2019-10-10 Thread Tom Lane
Stephen Frost  writes:
> Yes, this is absolutely the right answer, we shouldn't be removing
> symbols without an SONAME bump.  If we don't want to bump the SONAME,
> then don't remove the symbol.

OK, done.

regards, tom lane




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-10 Thread Andres Freund
Hi,

On 2019-10-09 16:29:07 -0400, Dave Cramer wrote:
> I've added functionality into libpq to be able to set this STARTUP
> parameter as well as changed it to _pq_.report.
> Still need to document this and figure out how to test it.


> From 85de9f48f80a3bfd9e8bdd4f1ba6b177b1ff9749 Mon Sep 17 00:00:00 2001
> From: Dave Cramer 
> Date: Thu, 11 Jul 2019 08:20:14 -0400
> Subject: [PATCH] Add a STARTUP packet option to set GUC_REPORT on GUC's that
>  currently do not have that option set. There is a facility to add protocol
>  options using _pq_. The new option name is report and takes a
>  comma delimited string of GUC names which will have GUC_REPORT set. Add
>  functionality into libpq to accept this new option key

I don't think it's good to only be able to change this at connection
time. Especially for poolers this ought to be configurable at any
time. I do think startup message support makes sense (especially to
avoid race conditions around to-be-reported gucs changing soon after
connecting), don't get me wrong, I just don't think it's sufficient.

> @@ -2094,6 +2094,7 @@ retry1:
>       * zeroing extra byte above.
>       */
>      port->guc_options = NIL;
> +    port->guc_report = NIL;
>  
>      while (offset < len)
>      {
> @@ -2138,13 +2139,34 @@ retry1:
>              }
>              else if (strncmp(nameptr, "_pq_.", 5) == 0)
>              {
> -                /*
> -                 * Any option beginning with _pq_. is reserved for use as a
> -                 * protocol-level option, but at present no such options are
> -                 * defined.
> -                 */
> -                unrecognized_protocol_options =
> -                    lappend(unrecognized_protocol_options, pstrdup(nameptr));
> +                if (strncasecmp(nameptr + 5, "report", 6) == 0)
> +                {
> +                    char sep[3] = " ,";
> +
> +                    /* avoid scribbling on valptr */
> +                    char *temp_val = pstrdup(valptr);
> +
> +                    /* strtok is going to scribble on temp_val */
> +                    char *freeptr = temp_val;
> +                    char *guc_report = strtok(temp_val, sep);
> +                    while (guc_report)
> +                    {
> +                        port->guc_report = lappend(port->guc_report,
> +                                                   pstrdup(guc_report));
> +                        guc_report = strtok(NULL, sep);
> +                    }
> +                    pfree(freeptr);
> +                }

I don't think it's good to open-code this inside
ProcessStartupPacket(). Should be moved into its own function. I'd
probably even move all of the option handling out of
ProcessStartupPacket() before expanding it further.

I don't like the use of strtok, nor the type of parsing done
here. Perhaps we could just use SplitGUCList()?


> +                else
> +                {
> +                    /*
> +                     * Any option beginning with _pq_. is reserved for use as a
> +                     * protocol-level option, but at present no such options are
> +                     * defined.
> +                     */
> +                    unrecognized_protocol_options =
> +                        lappend(unrecognized_protocol_options, pstrdup(nameptr));
> +                }
>              }

You can't just move a comment explaining what _pq_ is into the else,
especially not without adjusting the contents.





> +/*
> + * Set the option to be GUC_REPORT
> + */
> +
> +bool
> +SetConfigReport(const char *name, bool missing_ok)
> +{
> +    struct config_generic *record;
>  
> +    record = find_option(name, false, WARNING);
> +    if (record == NULL)
> +    {
> +        if (missing_ok)
> +            return 0;
> +        ereport(ERROR,
> +                (errcode(ERRCODE_UNDEFINED_OBJECT),
> +                 errmsg("unrecognized configuration parameter \"%s\"",
> +                        name)));
> +    }
> +    record->flags |= GUC_REPORT;
> +
> +    return 0;
> +}

This way we lose track of which GUCs originally were marked as REPORT,
which strikes me as bad. We'd imo want to be able to debug this by
querying pg_settings.


Greetings,

Andres Freund




Re: configure fails for perl check on CentOS8

2019-10-10 Thread Tom Lane
Andrew Dunstan  writes:
> On 10/10/19 1:46 AM, Kyotaro Horiguchi wrote:
>> Hello, While I'm moving to CentOS8 environment, I got stuck at
>> ./configure with the following error.
>> configure: error: libperl library is required for Perl
>> It complains that it needs -fPIC.
>> Configure uses only $Config{ccflags}, but it seems that
>> $Config{cccdlflags} is also required. The attached patch makes
>> ./configure succeed. (configure itself is excluded in the patch.)

> ./configure --with-perl
> is working for me on Centos8 (double checked after a `dnf update`)

Yeah, I'm quite suspicious of this too.  Although we don't seem to have
any buildfarm members covering exactly RHEL8/CentOS8, we have enough
coverage of different Fedora releases to make it hard to believe that
we missed any changes in Red Hat's packaging of Perl.

Is this error perhaps occurring with a non-vendor Perl installation?
What's the exact error message?  config.log might contain some useful
clues, too.

regards, tom lane




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-10 Thread Andres Freund
Hi,

On 2019-10-08 23:57:24 +0800, Craig Ringer wrote:
> In other places I've (ab)used GUC_REPORT to push information back to the
> client as a workaround for the lack of protocol extensibility / custom
> messages. It's a little less ugly than abusing NOTICE messages. I'd prefer
> a real way to add optional/extension messages, but in the absence of that
> extension-defined GUCs may have good reasons to want to report multiple
> changes within a single statement/function/etc.

FWIW, custom messages strike me as a terrible idea leading to a lot of
poorly thought out extensions that various drivers have to implement,
without there being an authoritative source of what needs to be
implemented from a practical point of view.


> BTW, a good argument for the client wanting to be notified of search_path
> changes might be for clients that want to cache name=>oid mappings for
> types, relations, etc. The JDBC driver would be able to benefit from that
> if it could trust that the same name would map to the same type (for a
> top-level statement) in future, but currently it can't.

Unless schema qualified you can't rely on that even without search_path
changing. Consider an object in schema b existing, with a search_path of
a,b. Even without the search path changing, somebody could create a type
in a, "masking" the one in b.  I'm somewhat convinced that search_path
has no role in this type of caching, unless you want to do it
wrong. There's a separate issue of needing cache invalidation for such
caches to cover some edge cases, but that's unrelated to search_path
imo.
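
To spell out the masking scenario (a minimal sketch with made-up schema and
type names):

CREATE SCHEMA a;
CREATE SCHEMA b;
CREATE TYPE b.mytype AS (x int);

SET search_path = a, b;
SELECT 'mytype'::regtype::oid;   -- resolves to b.mytype

CREATE TYPE a.mytype AS (y text);
SELECT 'mytype'::regtype::oid;   -- now resolves to a.mytype, search_path unchanged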

Greetings,

Andres Freund




Re: pg_dump compatibility level / use create view instead of create table/rule

2019-10-10 Thread Tom Lane
Alex Williams  writes:
> [ gripes about pg_dump printing REPLICA IDENTITY NOTHING for a view ]

I spent a little bit of time trying to reproduce this, and indeed I can,
in versions before v10.

regression=# create table mytab (f1 int primary key, f2 text);
CREATE TABLE
regression=# create view myview as select * from mytab group by f1;
CREATE VIEW

This situation is problematic for pg_dump because validity of the
view depends on the existence of mytab's primary key constraint,
and we don't create primary keys till late in the restore process.
So it has to break myview into two parts, one to emit during normal
table/view creation and one to emit after index creation.

With 9.5's pg_dump, what comes out is:

--
-- Name: myview; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.myview (
f1 integer,
f2 text
);

ALTER TABLE ONLY public.myview REPLICA IDENTITY NOTHING;


ALTER TABLE public.myview OWNER TO postgres;

and then later:

--
-- Name: myview _RETURN; Type: RULE; Schema: public; Owner: postgres
--

CREATE RULE "_RETURN" AS
ON SELECT TO public.myview DO INSTEAD  SELECT mytab.f1,
mytab.f2
   FROM public.mytab
  GROUP BY mytab.f1;

The reason we get "REPLICA IDENTITY NOTHING" is that a view's relreplident
is set to 'n' not 'd', which might not have been a great choice.  But why
does pg_dump print anything --- it knows perfectly well that it should not
emit REPLICA IDENTITY for relkinds that don't have storage?  The answer
emerges from looking at the code that breaks the dependency loop:

/* pretend view is a plain table and dump it that way */
viewinfo->relkind = 'r';/* RELKIND_RELATION */

After that, pg_dump *doesn't* know it's a view, which also explains
why the comment says TABLE not VIEW.
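
The catalog state is easy enough to check against the example above (a
sketch):

SELECT relname, relkind, relreplident
FROM pg_class
WHERE relname IN ('mytab', 'myview');
-- mytab  : relkind 'r', relreplident 'd' (default)
-- myview : relkind 'v', relreplident 'n' (nothing)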

This is fixed in v10 and up thanks to d8c05aff5.  I was hesitant to
back-patch that at the time, but now that it's survived in the field
for a couple years, I think a good case could be made for doing so.
After a bit of looking around, the main argument I can find against
it is that emitting 'CREATE OR REPLACE VIEW' in a dropStmt will
break pg_restore versions preceding this commit:

Author: Tom Lane 
Branch: master Release: REL_10_BR [ac888986f] 2016-11-17 14:59:13 -0500
Branch: REL9_6_STABLE Release: REL9_6_2 [0eaa5118a] 2016-11-17 14:59:19 -0500
Branch: REL9_5_STABLE Release: REL9_5_6 [a7864037d] 2016-11-17 14:59:23 -0500
Branch: REL9_4_STABLE Release: REL9_4_11 [e69b532be] 2016-11-17 14:59:26 -0500

Improve pg_dump/pg_restore --create --if-exists logic.

Teach it not to complain if the dropStmt attached to an archive entry
is actually spelled CREATE OR REPLACE VIEW, since that will happen due to
an upcoming bug fix.  Also, if it doesn't recognize a dropStmt, have it
print a WARNING and then emit the dropStmt unmodified.  That seems like a
much saner behavior than Assert'ing or dumping core due to a null-pointer
dereference, which is what would happen before :-(.

Back-patch to 9.4 where this option was introduced.

AFAIR, we have not had complaints about back-rev pg_restore failing
on archives made by v10 and up; but perhaps it's more likely that
someone would try to use, say, 9.5.5 pg_restore with a dump made by
9.5.20 pg_dump.

An alternative that just responds to Alex's issue without fixing the
other problems d8c05aff5 fixed is to hack the dependency-loop code
like this:

/* pretend view is a plain table and dump it that way */
viewinfo->relkind = 'r';/* RELKIND_RELATION */
+   viewinfo->relreplident = 'd';   /* REPLICA_IDENTITY_DEFAULT */

That's mighty ugly but it doesn't seem to carry any particular
risk.

Thoughts?

regards, tom lane




Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-10 Thread Stephen Frost
Greetings,

* Craig Ringer (cr...@2ndquadrant.com) wrote:
> On Wed, 9 Oct 2019 at 22:30, Stephen Frost  wrote:
> > - All decryption happens in a given backend when it's sending data to
> >   the client
> 
> That is not what I think of as TDE. But upon review, it looks like I'm
> wrong, and the usual usage of TDE is for server-side-only encryption
> at-rest.

Yes, that's typically what TDE is, at least in the relational DBMS
world.

> But when I'm asked about TDE, people are generally actually asking for data
> that's encrypted at rest and in transit, where the client driver is
> responsible for data encryption/decryption transparently to the
> application. The server is expected to be able to mark columns as
> encrypted, so it can report the column's true datatype while storing a
> bytea-like encrypted value for it instead. In this case the server does not
> know the column encryption/decryption key at all, and it cannot perform any
> operations on the data except for input and output.

This is definitely also a thing though I'm not sure what it's called,
exactly.  Having everything happen on the client side is also,
certainly, a better solution as it removes the risk of root on the
database server being able to gain access to the data.  This is also
what I recommend in a lot of situations- have the client side
application handle the encryption/decryption, working with a vaulting
solution ideally, but it'd definitely be neat to add this as a
capability to PG.

> Some people ask for indexable encrypted columns, but I tend to explain to
> them how impractical and inefficient that is. You can support hash indexes
> if you don't salt the encrypted data, but that greatly weakens the
> encryption by allowing attackers to use dictionary attacks and other brute
> force techniques efficiently. And you can't support b-tree > and < without
> very complex encryption schemes (
> https://en.wikipedia.org/wiki/Homomorphic_encryption).

I'm not sure why you wouldn't salt the hash..?  That's pretty important,
imv, and, of course, you have to store the salt but that shouldn't be
that big of a deal, I wouldn't think.  Agreed that you can't support
b-tree (even with complex encryption schemes..., I've read some papers
about how just  is enough to be able to glean a good bit of info
from, not super relevant to the overall discussion here so I won't go
hunt them down right now, but if there's interest, I can try to do so).

> I see quite a lot of demand for this column level driver-assisted
> encryption. I think it'd actually be quite simple for the PostgreSQL server
> to provide support for it too, since most of the work is done by the
> driver. But I won't go into the design here since this thread appears to be
> about encryption at rest only, fully server-side.

Yes, that's what this thread is about, but I very much like the idea of
driver-assisted encryption on the client side and would love it if
someone had time to work on it.

Thanks,

Stephen




Re: Remove size limitations of vacuums dead_tuples array

2019-10-10 Thread Tomas Vondra

On Wed, Oct 09, 2019 at 03:58:11PM +0300, Ants Aasma wrote:

When dealing with a case where a 2TB table had 3 billion dead tuples I
discovered that vacuum currently can't make use of more than 1GB of
maintenance_work_mem - 179M tuples. This caused excessive amounts of index
scanning even though there was plenty of memory available.

I didn't see any good reason for having this limit, so here is a patch that
makes use of MemoryContextAllocHuge, and converts the array indexing to use
size_t to lift a second limit at 12GB.

One potential problem with allowing larger arrays is that bsearch might no
longer be the best way of determining if a ctid was marked dead. It might
pay off to convert the dead tuples array to a hash table to avoid O(n log
n) runtime when scanning indexes. I haven't done any profiling yet to see
how big of a problem this is.

Second issue I noticed is that the dead_tuples array is always allocated
max allowed size, unless the table can't possibly have that many tuples. It
may make sense to allocate it based on estimated number of dead tuples and
resize if needed.



There already was an attempt to make this improvement, see [1]. There was
a fairly long discussion about how to best do that (using another data
structure, not just a simple array). It kinda died about a year ago, but
I suppose there's a lot of relevant info in that thread.

[1] 
https://www.postgresql.org/message-id/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Compressed pluggable storage experiments

2019-10-10 Thread Ildar Musin
Hi hackers,

I've been experimenting with the pluggable storage API recently and just
feel like I can share my first experience. First of all it's great to
have this API and that now the community has the opportunity to implement
alternative storage engines. There are a few applications that come to
mind and a compressed storage is one of them.

Recently I've been working on a simple append-only compressed storage
[1]. My first idea was to just store data into compressed 1mb blocks
in a continuous file and keep a separate file for block offsets (similar
to Knizhnik's CFS proposal). But then I realized that I wouldn't be
able to use most of postgres' infrastructure like WAL-logging and also
wouldn't be able to implement some of the functions of TableAmRoutine
(like bitmap scan or analyze). So I had to adjust the extension to
utilize standard postgres 8kb blocks: compressed 1mb blocks are split
into chunks and distributed among 8kb blocks. Current page layout
looks like this:

┌───┐
│ metapage  │
└───┘
┌───┐  ┐
│  block 1  │  │
├...┤  │ compressed 1mb block
│  block k  │  │
└───┘  ┘
┌───┐  ┐
│ block k+1 │  │
├...┤  │ another compressed 1mb block
│  block m  │  │
└───┘  ┘

Inside compressed blocks there are regular postgres heap tuples.

The following is the list of things I stumbled upon while implementing
the storage. Since the API has just come out there are not many examples of
pluggable storages and even fewer as external extensions (I managed to
find only blackhole_am by Michael Paquier, which doesn't do much). So
many things I had to figure out by myself. Hopefully some of those
issues have a solution that I just can't see.

1. Unlike the FDW API, in the pluggable storage API there are no routines like
"begin modify table" and "end modify table" and there is no shared
state between insert/update/delete calls. In context of compressed
storage that means that there is no exact moment when we can finalize
writes (compress, split into chunks etc). We can set a callback at the
end of transaction, but in this case we'll have to keep latest
modifications for every table in memory until the end of transaction.
As for shared state we also can maintain some kind of key-value data
structure with per-relation shared state. But that again requires memory.
Because of this currently I only implemented COPY semantics.

2. It looks like I cannot implement custom storage options. E.g. for
compressed storage it makes sense to implement different compression
methods (lz4, zstd etc.) and corresponding options (like compression
level). But as far as I can see, storage options (like fillfactor etc.) are
hardcoded and are not extensible. A possible solution is to use GUCs,
which would work but is not extremely convenient.

3. A somewhat surprising limitation is that in order to use bitmap scans the
maximum number of tuples per page must not exceed 291, due to the
MAX_TUPLES_PER_PAGE macro in tidbitmap.c, which is calculated based on
the 8kb page size. In the case of a 1mb page this restriction feels really
limiting.

4. In order to use WAL-logging each page must start with a standard 24
byte PageHeaderData even if it is needless for storage itself. Not a
big deal though. Another (actually documented) WAL-related limitation
is that only generic WAL can be used within an extension. So unless
inserts are made in bulk it's going to require a lot of disk space to
accommodate logs and wide bandwidth for replication.

The pg_cryogen extension is still in development so if other issues arise
I'll post them here. At this point the extension already supports
inserts via COPY, index and bitmap scans, vacuum (only freezing), and
analyze. It uses lz4 compression and currently I'm working on adding
different compression methods. I'm also willing to work on the
aforementioned issues in the API if the community verifies them as valid.


[1] https://github.com/adjust/pg_cryogen

Thanks,
Ildar


Re: configure fails for perl check on CentOS8

2019-10-10 Thread Andrew Dunstan


On 10/10/19 1:46 AM, Kyotaro Horiguchi wrote:
> Hello, While I'm moving to CentOS8 environment, I got stuck at
> ./configure with the following error.
>
> configure: error: libperl library is required for Perl
>
> It complains that it needs -fPIC.
>
> Configure uses only $Config{ccflags}, but it seems that
> $Config{cccdlflags} is also required. The attached patch makes
> ./configure succeed. (configure itself is excluded in the patch.)
>


./configure --with-perl


is working for me on Centos8 (double checked after a `dnf update`)


cheers


andrew


-- 

Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: adding partitioned tables to publications

2019-10-10 Thread Rafia Sabih
On Thu, 10 Oct 2019 at 08:29, Amit Langote  wrote:

> On Mon, Oct 7, 2019 at 9:55 AM Amit Langote 
> wrote:
> > One cannot currently add partitioned tables to a publication.
> >
> > create table p (a int, b int) partition by hash (a);
> > create table p1 partition of p for values with (modulus 3, remainder 0);
> > create table p2 partition of p for values with (modulus 3, remainder 1);
> > create table p3 partition of p for values with (modulus 3, remainder 2);
> >
> > create publication publish_p for table p;
> > ERROR:  "p" is a partitioned table
> > DETAIL:  Adding partitioned tables to publications is not supported.
> > HINT:  You can add the table partitions individually.
> >
> > One can do this instead:
> >
> > create publication publish_p1 for table p1;
> > create publication publish_p2 for table p2;
> > create publication publish_p3 for table p3;
> >
> > but maybe that's too much code to maintain for users.
> >
> > I propose that we make this command:
> >
> > create publication publish_p for table p;
> >
> > automatically add all the partitions to the publication.  Also, any
> > future partitions should also be automatically added to the
> > publication.  So, publishing a partitioned table automatically
> > publishes all of its existing and future partitions.  Attached patch
> > implements that.
> >
> > What doesn't change with this patch is that the partitions on the
> > subscription side still have to match one-to-one with the partitions
> > on the publication side, because the changes are still replicated as
> > being made to the individual partitions, not as the changes to the
> > root partitioned table.  It might be useful to implement that
> > functionality on the publication side, because it allows users to
> > define the replication target any way they need to, but this patch
> > doesn't implement that.
>
> Added this to the next CF: https://commitfest.postgresql.org/25/2301/

Hi Amit,

Lately I was exploring the logical replication feature of PostgreSQL and I
found this addition to its scope for partitioned tables a useful
one.

In order to understand the working of your patch a bit more, I performed an
experiment wherein I created a partitioned table with several children and
a default partition at the publisher side, and normal tables with the same
names as the parent, children, and default partition at the subscriber
side. Next I established the logical replication connection and, to my
surprise, the data was successfully replicated from the partitioned table
to the normal tables; then this error filled the logs,
LOG:  logical replication table synchronization worker for subscription
"my_subscription", table "parent" has started
ERROR:  table "public.parent" not found on publisher

Here "parent" is the name of the partitioned table on the publisher side, and
it is present as a normal table on the subscriber side as well. The error is
understandable: the worker is trying to find a normal table of the same name
on the publisher but couldn't find one. Maybe it should not worry about that
now either, if it does not at replication time.
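
For reference, the setup I used was roughly along these lines (the partition
bounds, publication name and connection string below are just illustrative):

-- publisher (wal_level = logical)
create table parent (a int, b int) partition by range (a);
create table child1 partition of parent for values from (1) to (100);
create table child2 partition of parent for values from (100) to (200);
create table child_default partition of parent default;
create publication my_publication for table parent;

-- subscriber: same names, but plain (non-partitioned) tables
create table parent (a int, b int);
create table child1 (a int, b int);
create table child2 (a int, b int);
create table child_default (a int, b int);
create subscription my_subscription
    connection 'host=localhost port=5432 dbname=postgres'
    publication my_publication;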

Please let me know if this is something expected, because in my opinion it
is not desirable; there should be some check of the table type for
replication. This wasn't important till now, maybe, because only normal
tables were to be replicated, but with the extension of the scope of
logical replication to more objects such checks would be helpful.

On a separate note, I was thinking that for partitioned tables, wouldn't it be
cleaner to have something like this: you create only the partitioned table at
the subscriber, and then when logical replication starts it creates the child
tables accordingly. Or would that be too much for the future...?

-- 
Regards,
Rafia Sabih


Re: abort-time portal cleanup

2019-10-10 Thread Amit Kapila
On Wed, Oct 9, 2019 at 6:56 PM Robert Haas  wrote:
>
> On Tue, Oct 8, 2019 at 2:10 PM Andres Freund  wrote:
> > On 2019-10-07 12:14:52 -0400, Robert Haas wrote:
> > > > - if (portal->status == PORTAL_READY)
> > > > - MarkPortalFailed(portal);
> > > >
> > > > Why it is safe to remove this check?  It has been explained in commit
> > > > 7981c342 why we need that check.  I don't see any explanation in email
> > > > or patch which justifies this code removal.  Is it because you removed
> > > > PortalCleanup?  If so, that is still called from PortalDrop?
> > >
> > > All MarkPortalFailed() does is change the status to PORTAL_FAILED and
> > > call the cleanup hook. PortalDrop() calls the cleanup hook, and we
> > > don't need to change the status if we're removing it completely.
> >
> > Note that currently PortalCleanup() behaves differently depending on
> > whether the portal is set to failed or not...
> >

Yeah, this is the reason I mentioned it.

> Urk, yeah, I forgot about that.  I think that's a wretched hack that
> somebody ought to untangle at some point, but maybe for purposes of
> this patch it makes more sense to just put the MarkPortalFailed call
> back.
>

+1.

> It's unclear to me why there's a special case here specifically for
> PORTAL_READY.  Like, why is PORTAL_NEW or PORTAL_DEFINED or
> PORTAL_DONE any different?
>

If we read the commit message of commit 7981c34279 [1], which introduced
this, we might get some clue.  It is quite possible that we need the
same handling for PORTAL_NEW, PORTAL_DEFINED, etc., but it seems we
just hit the problem mentioned in commit 7981c34279 for the PORTAL_READY
state.  I think, as per that commit, if we don't mark it failed, then
things can go wrong with auto_explain.

[1] -
commit 7981c34279fbddc254cfccb9a2eec4b35e692a12
Author: Tom Lane 
Date:   Thu Feb 18 03:06:46 2010 +

Force READY portals into FAILED state when a transaction or
subtransaction is aborted, if they were created within the failed
xact.  This  prevents ExecutorEnd from being run on them, which is a
good idea because they may contain references to tables or other
objects that no longer exist.  In particular this is hazardous when
auto_explain is active, but it's really rather surprising that nobody
has seen an issue with this before.  I'm back-patching this to 8.4,
since that's the first version that contains auto_explain or an
ExecutorEnd hook, but I wonder whether we shouldn't back-patch
further.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-10-10 Thread Dilip Kumar
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar  wrote:
>
> I have attempted to test the performance of (Stream + Spill) vs
> (Stream + BGW pool) and I can see the similar gain what Alexey had
> shown[1].
>
> In addition to this, I have rebased the latest patchset [2] without
> the two-phase logical decoding patch set.
>
> Test results:
> I have repeated the same test as Alexey [1] for 1kk and 3kk data, and
> here are my results:
> Stream + Spill
> N   time on master(sec)   Total xact time (sec)
> 1kk   6   21
> 3kk 18   55
>
> Stream + BGW pool
> N  time on master(sec)  Total xact time (sec)
> 1kk  6  13
> 3kk19  35
>
> Patch details:
> All the patches are the same as posted on [2] except
> 1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
> removed the handling of error which is specific for 2PC

Here [1], I mentioned that I had removed the 2PC changes from
this [0006] patch, but I mistakenly attached the original patch itself
instead of the modified version. So I am attaching the modified version of
only this patch; the other patches are the same.

> 2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC
> 3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
> patch to handle concurrent abort error for the in-progress transaction
> and also add handling for the sub transaction's abort.
> 4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
> Alexey's patch

[1] 
https://www.postgresql.org/message-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0006-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch
Description: Binary data


Logical replication dead but synching

2019-10-10 Thread Jehan-Guillaume de Rorthais
Hello,

While giving assistance to a customer with their broker procedure, I found a
scenario where the subscription is failing but the tables are synced anyway.
Here is a bash script to reproduce it with versions 10, 11 and 12 (make sure
to set PATH correctly):

  # env
  PUB=/tmp/pub
  SUB=/tmp/sub
  unset PGPORT PGHOST PGDATABASE PGDATA
  export PGUSER=postgres

  # cleanup
  kill %1
  pg_ctl -w -s -D "$PUB" -m immediate stop; echo $?
  pg_ctl -w -s -D "$SUB" -m immediate stop; echo $?
  rm -r "$PUB" "$SUB"

  # cluster
  initdb -U postgres -N "$PUB" &>/dev/null; echo $?
  initdb -U postgres -N "$SUB" &>/dev/null; echo $?
  echo "wal_level=logical" >> "$PUB"/postgresql.conf
  echo "port=5433" >> "$SUB"/postgresql.conf
  pg_ctl -w -s -D $PUB -l "$PUB"-"$(date +%FT%T)".log start; echo $?
  pg_ctl -w -s -D $SUB -l "$SUB"-"$(date +%FT%T)".log start; echo $?
  pgbench -p 5432 -qi 
  pg_dump -p 5432 -s | psql -qXp 5433

  # fake activity
  pgbench -p 5432 -T 60 -c 2 &

  # replication setup
  cat

Re: [HACKERS] proposal: schema variables

2019-10-10 Thread Pavel Stehule
Hi

minor change - replace heap_tuple_fetch_attr by detoast_external_attr.

Regards

Pavel

pá 4. 10. 2019 v 6:12 odesílatel Pavel Stehule 
napsal:

> Hi
>
> so 10. 8. 2019 v 9:10 odesílatel Pavel Stehule 
> napsal:
>
>> Hi
>>
>> just rebase
>>
>
> fresh rebase
>
> Regards
>
> Pavel
>
>
>> Regards
>>
>> Pavel
>>
>


schema_variables-20191010.patch.gz
Description: application/gzip


Re: maintenance_work_mem used by Vacuum

2019-10-10 Thread Amit Kapila
On Thu, Oct 10, 2019 at 2:10 PM Masahiko Sawada  wrote:
>
> On Thu, Oct 10, 2019 at 3:36 PM Amit Kapila  wrote:
> >
> > On Thu, Oct 10, 2019 at 9:58 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Oct 9, 2019 at 7:12 PM Dilip Kumar  wrote:
> > > >
> > > > I think the current situation is not good but if we try to cap it to
> > > > maintenance_work_mem + gin_*_work_mem then also I don't think it will
> > > > make the situation much better.  However, I think the idea you
> > > > proposed up-thread[1] is better.  At least the  maintenance_work_mem
> > > > will be the top limit what the auto vacuum worker can use.
> > > >
> > >
> > > I'm concerned that there are other index AMs that could consume more
> > > memory like GIN. In principle we can vacuum third party index AMs and
> > > will be able to even parallel vacuum them. I  expect that
> > > maintenance_work_mem is the top limit of memory usage of maintenance
> > > command but actually it's hard to set the limit to memory usage of
> > > bulkdelete and cleanup by the core. So I thought that since GIN is the
> > > one of the index AM it can have a new parameter to make its job
> > > faster. If we have that parameter it might not make the current
> > > situation much better but user will be able to set a lower value to
> > > that parameter to not use the memory much while keeping the number of
> > > index vacuums.
> > >
> >
> > I can understand your concern why dividing maintenance_work_mem for
> > vacuuming heap and cleaning up the index might be tricky especially
> > because of third party indexes, but introducing new Guc isn't free
> > either.  I think that should be the last resort and we need buy-in
> > from more people for that.  Did you consider using work_mem for this?
>
> Yeah, that seems like it could work too. But I wonder if it could be a
> similar story to gin_pending_list_limit. I mean that previously we used
> work_mem as the maximum size of the GIN pending list, but we concluded
> that it was not appropriate to control both things with one GUC, so we
> introduced gin_pending_list_limit and the storage parameter at commit
> 263865a4
>

It seems you mean commit
a1b395b6a26ae80cde17fdfd2def8d351872f399.  I wonder why they did not
change it to gin_pending_list_limit (at that time,
pending_list_cleanup_size) in that commit itself?  I think if we want
to use gin_pending_list_limit in this context, then we can replace both
work_mem and maintenance_work_mem with gin_pending_list_limit.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pgbench - extend initialization phase control

2019-10-10 Thread Fabien COELHO



Attached v2 is a rebase after ce8f9467.


Here is rebase v3.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index e3a0abb4c7..e9f43f3b26 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -167,7 +167,7 @@ pgbench  options  d
 init_steps specifies the
 initialization steps to be performed, using one character per step.
 Each step is invoked in the specified order.
-The default is dtgvp.
+The default is dt(g)vp.
 The available steps are:
 
 
@@ -193,12 +193,31 @@ pgbench  options  d
   
  
  
-  g (Generate data)
+  ( (begin transaction)
+  
+   
+Emit a BEGIN.
+   
+  
+ 
+ 
+  g or G (Generate data, client or server side)
   

 Generate data and load it into the standard tables,
 replacing any data already present.

+   
+When data is generated server side, there is no progress output.
+   
+  
+ 
+ 
+  ) (commit transaction)
+  
+   
+Emit a COMMIT.
+   
   
  
  
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e72ad0036e..50b8139f50 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -130,7 +130,13 @@ static int	pthread_join(pthread_t th, void **thread_return);
 /
  * some configurable parameters */
 
-#define DEFAULT_INIT_STEPS "dtgvp"	/* default -I setting */
+/*
+ * we do all data generation in one transaction to enable the backend's
+ * data-loading optimizations
+ */
+#define DEFAULT_INIT_STEPS "dt(g)vp"	/* default -I setting */
+
+#define ALL_INIT_STEPS "dtgGvpf()"		/* all possible steps */
 
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
@@ -626,7 +632,7 @@ usage(void)
 		   "  %s [OPTION]... [DBNAME]\n"
 		   "\nInitialization options:\n"
 		   "  -i, --initialize invokes initialization mode\n"
-		   "  -I, --init-steps=[dtgvpf]+ (default \"dtgvp\")\n"
+		   "  -I, --init-steps=[" ALL_INIT_STEPS "]+ (default \"" DEFAULT_INIT_STEPS "\")\n"
 		   "   run selected initialization steps\n"
 		   "  -F, --fillfactor=NUM set fill factor\n"
 		   "  -n, --no-vacuum  do not run VACUUM during initialization\n"
@@ -3802,10 +3808,23 @@ append_fillfactor(char *opts, int len)
 }
 
 /*
- * Fill the standard tables with some data
+ * truncate away any old data, in one command in case there are foreign keys
  */
 static void
-initGenerateData(PGconn *con)
+initTruncateTables(PGconn *con)
+{
+	executeStatement(con, "truncate table "
+	 "pgbench_accounts, "
+	 "pgbench_branches, "
+	 "pgbench_history, "
+	 "pgbench_tellers");
+}
+
+/*
+ * Fill the standard tables with some data, from the client side
+ */
+static void
+initGenerateDataClientSide(PGconn *con)
 {
 	char		sql[256];
 	PGresult   *res;
@@ -3819,23 +3838,9 @@ initGenerateData(PGconn *con)
 remaining_sec;
 	int			log_interval = 1;
 
-	fprintf(stderr, "generating data...\n");
+	fprintf(stderr, "generating data client side...\n");
 
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, "begin");
-
-	/*
-	 * truncate away any old data, in one command in case there are foreign
-	 * keys
-	 */
-	executeStatement(con, "truncate table "
-	 "pgbench_accounts, "
-	 "pgbench_branches, "
-	 "pgbench_history, "
-	 "pgbench_tellers");
+	initTruncateTables(con);
 
 	/*
 	 * fill branches, tellers, accounts in that order in case foreign keys
@@ -3935,8 +3940,37 @@ initGenerateData(PGconn *con)
 		fprintf(stderr, "PQendcopy failed\n");
 		exit(1);
 	}
+}
 
-	executeStatement(con, "commit");
+/*
+ * Fill the standard tables with some data, from the server side
+ */
+static void
+initGenerateDataServerSide(PGconn *con)
+{
+	char		sql[256];
+
+	fprintf(stderr, "generating data server side...\n");
+
+	initTruncateTables(con);
+
+	snprintf(sql, sizeof(sql),
+			 "insert into pgbench_branches(bid,bbalance,filler) "
+			 "select bid, 0, '' "
+			 "from generate_series(1, %d) as bid", scale);
+	executeStatement(con, sql);
+
+	snprintf(sql, sizeof(sql),
+			 "insert into pgbench_tellers(tid,bid,tbalance,filler) "
+			 "select tid, (tid - 1) / %d + 1, 0, '' "
+			 "from generate_series(1, %d) as tid", ntellers, scale * ntellers);
+	executeStatement(con, sql);
+
+	snprintf(sql, sizeof(sql),
+			 "insert into pgbench_accounts(aid,bid,abalance,filler) "
+			 "select aid, (aid - 1) / %d + 1, 0, '' "
+			 "from generate_series(1, %d) as aid", naccounts, scale * naccounts);
+	executeStatement(con, sql);
 }
 
 /*
@@ -4020,6 +4054,7 @@ static void
 

Re: maintenance_work_mem used by Vacuum

2019-10-10 Thread Masahiko Sawada
On Thu, Oct 10, 2019 at 3:36 PM Amit Kapila  wrote:
>
> On Thu, Oct 10, 2019 at 9:58 AM Masahiko Sawada  wrote:
> >
> > On Wed, Oct 9, 2019 at 7:12 PM Dilip Kumar  wrote:
> > >
> > > I think the current situation is not good but if we try to cap it to
> > > maintenance_work_mem + gin_*_work_mem then also I don't think it will
> > > make the situation much better.  However, I think the idea you
> > > proposed up-thread[1] is better.  At least the  maintenance_work_mem
> > > will be the top limit what the auto vacuum worker can use.
> > >
> >
> > I'm concerned that there are other index AMs that could consume more
> > memory like GIN. In principle we can vacuum third party index AMs and
> > will be able to even parallel vacuum them. I  expect that
> > maintenance_work_mem is the top limit of memory usage of maintenance
> > command but actually it's hard to set the limit to memory usage of
> > bulkdelete and cleanup by the core. So I thought that since GIN is the
> > one of the index AM it can have a new parameter to make its job
> > faster. If we have that parameter it might not make the current
> > situation much better but user will be able to set a lower value to
> > that parameter to not use the memory much while keeping the number of
> > index vacuums.
> >
>
> I can understand your concern why dividing maintenance_work_mem for
> vacuuming heap and cleaning up the index might be tricky especially
> because of third party indexes, but introducing new Guc isn't free
> either.  I think that should be the last resort and we need buy-in
> from more people for that.  Did you consider using work_mem for this?

Yeah, that seems like it could work too. But I wonder if it could be a
similar story to gin_pending_list_limit. I mean that previously we used
work_mem as the maximum size of the GIN pending list, but we concluded
that it was not appropriate to control both things with one GUC, so we
introduced gin_pending_list_limit and the storage parameter at commit
263865a4 (originally as pending_list_cleanup_size, renamed to
gin_pending_list_limit at commit c291503b1). I feel that story is
quite similar to this issue.
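
To spell that out, we ended up with both knobs, e.g. (index and table names,
and the values in kB, are just examples):

-- instance- or session-wide default
SET gin_pending_list_limit = 4096;

-- or per index, via the storage parameter
CREATE INDEX gin_idx ON tbl USING gin (col) WITH (gin_pending_list_limit = 1024);
ALTER INDEX gin_idx SET (gin_pending_list_limit = 256);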

Regards,

--
Masahiko Sawada




generating catcache control data

2019-10-10 Thread John Naylor
Hi,

While digging through the archives, I found a thread from a couple
years back about syscache performance. There was an idea [1] to
generate the cache control data at compile time. That would remove
the need to perform database access to complete cache initialization,
as well as the need to check in various places whether initialization
has happened.

If this were done, catcache.c:InitCatCachePhase2() and
catcache.c:CatalogCacheInitializeCache() would disappear, and
syscache.c:InitCatalogCachePhase2() could be replaced by code that
simply opens the relations when writing new init files. Another
possibility this opens up is making the SysCacheRelationOid and
SysCacheSupportingRelOid arrays constant data as well.


Here's a basic design sketch:

1. Generate the current syscache cacheinfo[] array and cacheid enum by
adding a couple arguments to the declarations for system indexes, as
in:

#define DECLARE_UNIQUE_INDEX(name,oid,oid_macro,cacheid,num_buckets,decl)
extern int no_such_variable

DECLARE_UNIQUE_INDEX(pg_amop_opr_fam_index, 2654,
AccessMethodOperatorIndexId, AMOPOPID, 64, on pg_amop using
btree(amopopr oid_ops, amoppurpose char_ops, amopfamily oid_ops));

DECLARE_UNIQUE_INDEX(pg_amop_oid_index, 2756,
AccessMethodOperatorOidIndexId, -, 0, on pg_amop using btree(oid
oid_ops));

...and add in data we already know how to parse from the catalog
headers. Note that the last example has '-' and '0' to mean "no
cache". (The index oid macro is superfluous there, but kept for
consistency.)

2. Expand the cacheinfo[] element structs with the rest of the constant data:

Relname, and relisshared are straightforward. For eq/hash functions,
we could add metadata attributes to pg_type.dat for the relevant
types. Tuple descriptors would get their attrs from schemapg.h.

3. Simplify cat/syscache.c


Is this something worth doing?

[1] https://www.postgresql.org/message-id/1295.1507918074%40sss.pgh.pa.us


-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: dropping column prevented due to inherited index

2019-10-10 Thread Amit Langote
On Thu, Oct 10, 2019 at 4:53 PM Michael Paquier  wrote:
> On Thu, Oct 10, 2019 at 02:56:32PM +0900, Amit Langote wrote:
> > /* Initialize addrs on the first invocation. */
>
> I would add "recursive" here, to give:
> /* Initialize addrs on the first recursive invocation. */

Actually, the code initializes it on the first call (recursing is
false) and asserts that it must have been already initialized in a
recursive (recursing is true) call.

> > +/*
> > + * The recursion is ending, hence perform the actual column deletions.
> > + */
> >
> > Maybe:
> >
> > /* All columns to be dropped must now be in addrs, so drop. */
>
> I think that it would be better to clarify as well that this stands
> after all the child relations have been checked, so what about that?
> "The resursive lookup for inherited child relations is done.  All the
> child relations have been scanned and the object addresses of all the
> columns to-be-dropped are registered in addrs, so drop."

Okay, sure.  Maybe it's better to write the comment inside the if
block, because if recursing is true, we don't drop yet.

Thoughts on the suggestion to expand the test case?
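
(If I understand the original report correctly, the kind of case to cover is
roughly the following; just a sketch with made-up names, not the exact test
being proposed:

create table parent (a int, b int) partition by range (a);
create table child partition of parent for values from (1) to (100);
create index on parent (b);          -- also creates an index on child
alter table parent drop column b;    -- must drop the column, and hence the
                                     -- dependent indexes, on both relations together

plus perhaps a multi-level inheritance variant of the same thing.)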

Thanks,
Amit




Re: [PATCH] Do not use StdRdOptions in Access Methods

2019-10-10 Thread Amit Langote
Hello Nikolay,

I read comments that Tomas left at:
https://www.postgresql.org/message-id/20190727173841.7ypzo4xuzizvijge%40development

I'd like to join Michael in reiterating one point from Tomas' review.
I think the patch can go further in trying to make the code in this
area more maintainable.

For example, even without this patch, the following stanza is repeated
in many places:

options = parseRelOptions(reloptions, validate, foo_relopt_kind,
                          &numoptions);
rdopts = allocateReloptStruct(sizeof(FooOptions), options, numoptions);
fillRelOptions((void *) rdopts, sizeof(FooOptions), options, numoptions,
   validate, foo_relopt_tab, lengthof(foo_relopt_tab));
return (bytea *) rdopts;

and this patch adds a few more instances, as it's adding more Options structs.

I think it wouldn't be hard to encapsulate the above stanza in a new
public function in reloptions.c and call it from the various places
that now have the above code.

Thanks,
Amit




Re: use of the term "verifier" with SCRAM

2019-10-10 Thread Michael Paquier
On Thu, Oct 10, 2019 at 09:08:37AM +0200, Peter Eisentraut wrote:
> Here is my proposed patch to adjust this.

Looks fine to me reading through.  I think that you are right to not
change the descriptions in build_server_final_message(), as that's
described similarly in RFC 5802.  By renaming scram_build_verifier()
to scram_build_secret() you are going to break one of my in-house
extensions.  I am using it to register for a user SCRAM veri^D^D^D^D
secrets with custom iteration and salt length :)
--
Michael


signature.asc
Description: PGP signature


Re: dropping column prevented due to inherited index

2019-10-10 Thread Michael Paquier
On Thu, Oct 10, 2019 at 02:56:32PM +0900, Amit Langote wrote:
> /*
>  * Drops column 'colName' from relation 'rel' and returns the address of the
>  * dropped column.  The column is also dropped (or marked as no longer
>  * inherited from relation) from the relation's inheritance children, if any.
>  *
>  * In the recursive invocations for inheritance child relations, instead of
>  * dropping the column directly (if to be dropped at all), its object address
>  * is added to 'addrs', which must be non-NULL in such invocations.
>  * All columns are dropped at the same time after all the children have been
>  * checked recursively.
>  */

Sounds fine to me.

> + * Initialize the location of addresses which will be deleted on a
> + * recursive lookup on children relations.  The structure to store all 
> the
> + * object addresses to delete is initialized once when the recursive
> + * lookup begins.
> + */
> +Assert(!recursing || addrs != NULL);
> +if (!recursing)
> +addrs = new_object_addresses();
> 
> Maybe this could just say:
> 
> /* Initialize addrs on the first invocation. */

I would add "recursive" here, to give:
/* Initialize addrs on the first recursive invocation. */

> +/*
> + * The recursion is ending, hence perform the actual column deletions.
> + */
> 
> Maybe:
> 
> /* All columns to be dropped must now be in addrs, so drop. */

I think that it would be better to clarify as well that this stands
after all the child relations have been checked, so what about that?
"The resursive lookup for inherited child relations is done.  All the
child relations have been scanned and the object addresses of all the
columns to-be-dropped are registered in addrs, so drop."
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Block level parallel vacuum

2019-10-10 Thread Masahiko Sawada
On Thu, Oct 10, 2019 at 2:19 PM Amit Kapila  wrote:
>
> On Fri, Oct 4, 2019 at 4:18 PM Amit Kapila  wrote:
> >
> > On Wed, Oct 2, 2019 at 7:29 PM Masahiko Sawada  
> > wrote:
> >>
>
> Few more comments:

Thank you for reviewing the patch!

> -
> 1.  Currently parallel vacuum is allowed for temporary relations,
> which is wrong.  It leads to the error below:
>
> postgres=# create temporary table tmp_t1(c1 int, c2 char(10));
> CREATE TABLE
> postgres=# create index idx_tmp_t1 on tmp_t1(c1);
> CREATE INDEX
> postgres=# create index idx1_tmp_t1 on tmp_t1(c2);
> CREATE INDEX
> postgres=# insert into tmp_t1 values(generate_series(1,10000),'');
> INSERT 0 10000
> postgres=# delete from tmp_t1 where c1 > 5000;
> DELETE 5000
> postgres=# vacuum (parallel 2) tmp_t1;
> ERROR:  cannot access temporary tables during a parallel operation
> CONTEXT:  parallel worker
>
> The parallel vacuum shouldn't be allowed for temporary relations.

Fixed.

>
> 2.
> --- a/doc/src/sgml/ref/vacuum.sgml
> +++ b/doc/src/sgml/ref/vacuum.sgml
> @@ -34,6 +34,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [
> boolean ]
>  INDEX_CLEANUP [ <replaceable class="parameter">boolean</replaceable> ]
>  TRUNCATE [ <replaceable class="parameter">boolean</replaceable> ]
> +PARALLEL [ <replaceable class="parameter">integer</replaceable> ]
>
> Now, if the user gives a command like VACUUM (ANALYZE, PARALLEL)
> <table_name>; it is not very obvious that the parallel option will be
> used only for vacuum purposes and not for analyze.  I think we can add
> a note in the docs to mention this explicitly.  This can avoid any
> confusion.

Agreed.
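
To be concrete, the note would be about a case like this (table name is just
an example):

VACUUM (ANALYZE, PARALLEL 2) pgbench_accounts;

where the PARALLEL option applies only to the vacuum phase; the ANALYZE part
still runs without parallel workers.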

Attached is the latest version of the patch set, although the memory usage
problem is still under discussion. I'll update the patches according to the
result of that discussion.

Regards,

--
Masahiko Sawada
From 941fd0848e95e6bc48b51e86f1456e6da800c77e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 23 Jan 2019 16:07:53 +0900
Subject: [PATCH v28 2/2] Add --parallel, -P option to vacuumdb command

---
 doc/src/sgml/ref/vacuumdb.sgml| 16 +++
 src/bin/scripts/t/100_vacuumdb.pl | 10 ++-
 src/bin/scripts/vacuumdb.c| 48 ++-
 3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 47d93456f8..f6ac0c6e5a 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -226,6 +226,22 @@ PostgreSQL documentation
   
  
 
+ 
+  -P workers
+  --parallel=workers
+  
+   
+Execute parallel vacuum with PostgreSQL's
+workers background workers.
+   
+   
+This option will require background workers, so make sure your
+ setting is more
+than one.
+   
+  
+ 
+
  
   -q
   --quiet
diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl
index b685b35282..8fe80719e8 100644
--- a/src/bin/scripts/t/100_vacuumdb.pl
+++ b/src/bin/scripts/t/100_vacuumdb.pl
@@ -3,7 +3,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 44;
+use Test::More tests => 48;
 
 program_help_ok('vacuumdb');
 program_version_ok('vacuumdb');
@@ -48,6 +48,14 @@ $node->issues_sql_like(
 $node->command_fails(
 	[ 'vacuumdb', '--analyze-only', '--disable-page-skipping', 'postgres' ],
 	'--analyze-only and --disable-page-skipping specified together');
+$node->issues_sql_like(
+	[ 'vacuumdb', '-P2', 'postgres' ],
+	qr/statement: VACUUM \(PARALLEL 2\).*;/,
+	'vacuumdb -P2');
+$node->issues_sql_like(
+	[ 'vacuumdb', '-P', 'postgres' ],
+	qr/statement: VACUUM \(PARALLEL\).*;/,
+	'vacuumdb -P');
 $node->command_ok([qw(vacuumdb -Z --table=pg_am dbname=template1)],
 	'vacuumdb with connection string');
 
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 2c7219239f..63bf66a70b 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -34,6 +34,8 @@ typedef struct vacuumingOptions
 	bool		skip_locked;
 	int			min_xid_age;
 	int			min_mxid_age;
+	int			parallel_workers;	/* -1 disables, 0 for choosing based on the
+	 * number of indexes */
 } vacuumingOptions;
 
 
@@ -86,6 +88,7 @@ main(int argc, char *argv[])
 		{"full", no_argument, NULL, 'f'},
 		{"verbose", no_argument, NULL, 'v'},
 		{"jobs", required_argument, NULL, 'j'},
+		{"parallel", optional_argument, NULL, 'P'},
 		{"maintenance-db", required_argument, NULL, 2},
 		{"analyze-in-stages", no_argument, NULL, 3},
 		{"disable-page-skipping", no_argument, NULL, 4},
@@ -115,6 +118,7 @@ main(int argc, char *argv[])
 
 	/* initialize options to all false */
	memset(&vacopts, 0, sizeof(vacopts));
+	vacopts.parallel_workers = -1;
 
 	pg_logging_init(argv[0]);
 	progname = get_progname(argv[0]);
@@ -122,7 +126,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "vacuumdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeqd:zZFat:fvj:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:P::U:wWeqd:zZFat:fvj:", 

Re: use of the term "verifier" with SCRAM

2019-10-10 Thread Peter Eisentraut
On 2019-08-14 10:41, Heikki Linnakangas wrote:
>> RFC 5803 is titled "Lightweight Directory Access Protocol (LDAP) Schema
>> for Storing Salted Challenge Response Authentication Mechanism (SCRAM)
>> Secrets".  Following that, I think calling the contents of rolpassword a
>> "secret" or a "stored secret" would be better.

> But I agree that "secret", as used in RFC5803 is better.

Here is my proposed patch to adjust this.
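
(For reference, the value in question is what gets stored in
pg_authid.rolpassword, e.g. (role name is just an example):

SELECT rolname, rolpassword FROM pg_authid WHERE rolname = 'alice';

which for a SCRAM role returns something of the form
SCRAM-SHA-256$<iteration count>:<salt>$<StoredKey>:<ServerKey>,
i.e. the stored secret that this renaming is about.)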

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 008f0302bc9821efba84d54dc0f46e62f6047075 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 9 Oct 2019 10:59:58 +0200
Subject: [PATCH] Fix use of term "verifier"

Within the context of SCRAM, "verifier" has a specific meaning in the
protocol, per RFCs.  The existing code used "verifier" differently, to
mean whatever is or would be stored in pg_auth.rolpassword.

Fix this by using the term "secret" for this, following RFC 5803.
---
 src/backend/libpq/auth-scram.c| 104 +++---
 src/backend/libpq/auth.c  |   2 +-
 src/backend/libpq/crypt.c |   8 +-
 src/common/scram-common.c |   4 +-
 src/include/common/scram-common.h |   6 +-
 src/include/libpq/crypt.h |   2 +-
 src/include/libpq/scram.h |   8 +-
 src/interfaces/libpq/fe-auth-scram.c  |   6 +-
 src/interfaces/libpq/fe-auth.c|   2 +-
 src/interfaces/libpq/fe-auth.h|   2 +-
 src/test/authentication/t/001_password.pl |   2 +-
 src/test/regress/expected/password.out|  12 +--
 src/test/regress/sql/password.sql |  12 +--
 13 files changed, 85 insertions(+), 85 deletions(-)

diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index 68792cb45e..8ecb17bae6 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -64,10 +64,10 @@
  * Don't reveal user information to an unauthenticated client.  We don't
  * want an attacker to be able to probe whether a particular username is
  * valid.  In SCRAM, the server has to read the salt and iteration count
- * from the user's password verifier, and send it to the client.  To avoid
+ * from the user's stored secret, and send it to the client.  To avoid
  * revealing whether a user exists, when the client tries to authenticate
  * with a username that doesn't exist, or doesn't have a valid SCRAM
- * verifier in pg_authid, we create a fake salt and iteration count
+ * secret in pg_authid, we create a fake salt and iteration count
  * on-the-fly, and proceed with the authentication with that.  In the end,
  * we'll reject the attempt, as if an incorrect password was given.  When
  * we are performing a "mock" authentication, the 'doomed' flag in
@@ -161,7 +161,7 @@ static char *build_server_first_message(scram_state *state);
 static char *build_server_final_message(scram_state *state);
 static bool verify_client_proof(scram_state *state);
 static bool verify_final_nonce(scram_state *state);
-static void mock_scram_verifier(const char *username, int *iterations,
+static void mock_scram_secret(const char *username, int *iterations,
char **salt, 
uint8 *stored_key, uint8 *server_key);
 static bool is_scram_printable(char *p);
 static char *sanitize_char(char c);
@@ -202,13 +202,13 @@ pg_be_scram_get_mechanisms(Port *port, StringInfo buf)
  *
  * Initialize a new SCRAM authentication exchange status tracker.  This
  * needs to be called before doing any exchange.  It will be filled later
- * after the beginning of the exchange with verifier data.
+ * after the beginning of the exchange with authentication information.
  *
  * 'selected_mech' identifies the SASL mechanism that the client selected.
  * It should be one of the mechanisms that we support, as returned by
  * pg_be_scram_get_mechanisms().
  *
- * 'shadow_pass' is the role's password verifier, from pg_authid.rolpassword.
+ * 'shadow_pass' is the role's stored secret, from pg_authid.rolpassword.
  * The username was provided by the client in the startup message, and is
  * available in port->user_name.  If 'shadow_pass' is NULL, we still perform
  * an authentication exchange, but it will fail, as if an incorrect password
@@ -220,7 +220,7 @@ pg_be_scram_init(Port *port,
 const char *shadow_pass)
 {
scram_state *state;
-	bool		got_verifier;
+	bool		got_secret;
 
state = (scram_state *) palloc0(sizeof(scram_state));
state->port = port;
@@ -248,7 +248,7 @@ pg_be_scram_init(Port *port,
 errmsg("client selected an invalid SASL 
authentication mechanism")));
 
/*
-* Parse the stored password verifier.
+* Parse the stored secret.
 */
if (shadow_pass)
{
@@ -256,30 +256,30 @@ 

Re: [PATCH] use separate PartitionedRelOptions structure to store partitioned table options

2019-10-10 Thread Amit Langote
Hello,

On Sun, Oct 6, 2019 at 9:48 PM Nikolay Shaplov  wrote:
> This message is follow up to the "Get rid of the StdRdOptions" patch thread:
> https://www.postgresql.org/message-id/2620882.s52SJui4ql@x200m
>
> I've split patch into even smaller parts and commitfest want each patch in
> separate thread. So it is new thread.
>
> The idea of this patch is the following: if you read the code, partitioned
> tables do not have any options (you will not find RELOPT_KIND_PARTITIONED in
> boolRelOpts, intRelOpts, realRelOpts, stringRelOpts and enumRelOpts in
> reloptions.c), but they use StdRdOptions to store them (these non-existent options).
>
> If partitioned tables are to have their own option set (even if it is empty now),
> it would be better to save it in a separate structure, as is done for
> views, rather than adding them to StdRdOptions, as the current code hints.
>
> So in this patch I am creating a structure that will store partitioned table
> options (it is empty for now, as there are no options for this relation kind),
> and all the other code that will use this structure as soon as the first option
> arrives.
>
> I think it is a bad idea to suggest that whoever adds an option add it to
> StdRdOptions; we already have a big mess there. Better if they add it to a
> new, empty structure.

I tend to agree that this improves readability of the reloptions code a bit.

Some comments on the patch:

How about naming the function partitioned_table_reloptions() instead
of partitioned_reloptions()?

+ * Option parser for partitioned relations

Since partitioned_reloptions only caters to partitioned "tables",
maybe use "tables" here instead of "relations".

+  /*
+   * Since there is no options for patitioned table for now, we just do
+   * validation to report incorrect option error and leave.
+   */

Fix typo and minor rewording:

"Since there are no options for partitioned tables..."

+switch ((int)relkind)

Needs a space between (int) and relkind, but I don't think the (int)
cast is really necessary.  I don't see it in other places.

+ * Binary representation of relation options for Partitioned relations.

"for partitioned tables".

Speaking of partitioned relations vs tables, I see that you didn't
touch partitioned indexes (RELKIND_PARTITIONED_INDEX relations).  Is
that because we leave option handling to the index AMs?

Thanks,
Amit




Re: maintenance_work_mem used by Vacuum

2019-10-10 Thread Amit Kapila
On Thu, Oct 10, 2019 at 9:58 AM Masahiko Sawada  wrote:
>
> On Wed, Oct 9, 2019 at 7:12 PM Dilip Kumar  wrote:
> >
> > I think the current situation is not good but if we try to cap it to
> > maintenance_work_mem + gin_*_work_mem then also I don't think it will
> > make the situation much better.  However, I think the idea you
> > proposed up-thread[1] is better.  At least the  maintenance_work_mem
> > will be the top limit what the auto vacuum worker can use.
> >
>
> I'm concerned that there are other index AMs that could consume more
> memory like GIN. In principle we can vacuum third party index AMs and
> will be able to even parallel vacuum them. I  expect that
> maintenance_work_mem is the top limit of memory usage of maintenance
> command but actually it's hard to set the limit to memory usage of
> bulkdelete and cleanup by the core. So I thought that since GIN is the
> one of the index AM it can have a new parameter to make its job
> faster. If we have that parameter it might not make the current
> situation much better but user will be able to set a lower value to
> that parameter to not use the memory much while keeping the number of
> index vacuums.
>

I can understand your concern why dividing maintenance_work_mem for
vacuuming heap and cleaning up the index might be tricky especially
because of third party indexes, but introducing new Guc isn't free
either.  I think that should be the last resort and we need buy-in
from more people for that.  Did you consider using work_mem for this?
And even if we want to go with a new GUC, maybe it is better to have
some generic name like maintenance_index_work_mem, or something along
those lines, so that it can be used for other index cleanups as well if
required.

Tom, Teodor, do you have any opinion on this matter?  This has been
introduced by commit:

commit ff301d6e690bb5581502ea3d8591a1600fd87acc
Author: Tom Lane 
Date:   Tue Mar 24 20:17:18 2009 +

Implement "fastupdate" support for GIN indexes, in which we try to
accumulate multiple index entries in a holding area before adding them
to the main index structure.  This helps because bulk insert is
(usually) significantly faster than retail insert for GIN.
..
..
Teodor Sigaev

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: adding partitioned tables to publications

2019-10-10 Thread Amit Langote
On Mon, Oct 7, 2019 at 9:55 AM Amit Langote  wrote:
> One cannot currently add partitioned tables to a publication.
>
> create table p (a int, b int) partition by hash (a);
> create table p1 partition of p for values with (modulus 3, remainder 0);
> create table p2 partition of p for values with (modulus 3, remainder 1);
> create table p3 partition of p for values with (modulus 3, remainder 2);
>
> create publication publish_p for table p;
> ERROR:  "p" is a partitioned table
> DETAIL:  Adding partitioned tables to publications is not supported.
> HINT:  You can add the table partitions individually.
>
> One can do this instead:
>
> create publication publish_p1 for table p1;
> create publication publish_p2 for table p2;
> create publication publish_p3 for table p3;
>
> but maybe that's too much code to maintain for users.
>
> I propose that we make this command:
>
> create publication publish_p for table p;
>
> automatically add all the partitions to the publication.  Also, any
> future partitions should also be automatically added to the
> publication.  So, publishing a partitioned table automatically
> publishes all of its existing and future partitions.  Attached patch
> implements that.
>
> What doesn't change with this patch is that the partitions on the
> subscription side still have to match one-to-one with the partitions
> on the publication side, because the changes are still replicated as
> being made to the individual partitions, not as the changes to the
> root partitioned table.  It might be useful to implement that
> functionality on the publication side, because it allows users to
> define the replication target any way they need to, but this patch
> doesn't implement that.

Added this to the next CF: https://commitfest.postgresql.org/25/2301/

Thanks,
Amit