Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 07:25:46AM +, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:mich...@paquier.xyz]
>> The WAL receiver approach also has a drawback.  If WAL is streamed at full
>> speed, then the primary sends data with a maximum of 6 WAL pages.
>> When beginning streaming with a new segment, the WAL sent stops at a
>> page boundary.  But if you stop once in the middle of a page then you need
>> to zero-fill the page until the current segment is finished streaming.  So
>> if the workload generates spiky WAL then the WAL receiver could issue a lot
>> of extra lseek() calls with the patch applied, while all the writes would
>> be sequential on HEAD, so that's not good performance-wise IMO.
> 
> Does even the non-cascading standby stop in the middle of a page?  I
> thought the master always sends whole WAL blocks without stopping in the
> middle of a page.

You even have problems on normal standbys.  I have a small script which
is able to reproduce that if you want (it needs a small rewrite as it is
adapted to my test framework); it introduces a garbage set of WAL
segments on a stopped standby.  With the small monitoring patch I
mentioned upthread you can then see the XLOG reader finding garbage data
as well before validating the record header.  With any fixes on the WAL
receiver, your first patch included, the garbage read goes away,
and the XLOG reader complains about a record with an incorrect length
(invalid record length at XX/YYY: wanted 24, got 0) instead of complaints
from the header validation part.  One key point is to cleanly stop the
primary, as it forces the standby's WAL receiver to write to its WAL
segment in the middle of a page.

>> Another idea I am thinking about would be to zero-fill the segments when
>> recycled instead of being just renamed when doing InstallXLogFileSegment().
>> This would also have the advantage of making the segments ahead more
>> compressible, which is a gain for custom backups, and the WAL receiver does
>> not need any tweaks as it would write the data on a clean file.  Zero-filling
>> the segments is done only when a new segment is created (see XLogFileInit).
> 
> Yes, I was (and am) inclined to take this approach; this is easy and
> clean, but not good for performance...  I hope there's something which
> justifies this approach. 

InstallXLogFileSegment uses a plain durable_link_or_rename() to recycle
the past segment which syncs the old segment before the rename anyway,
so the I/O effort will be there, no?

This was mentioned back in 2001 by the way, but this did not count much
for the case discussed here:
https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us
The issue here is that the streaming case makes it easier to hit the
problem, as it more easily opens access to not-completely-written WAL
pages depending on the message frequency during replication.  At the
same time, we are discussing a very low-probability issue.  Note
that if the XLOG reader is bumping into this problem, then at the next
WAL receiver wake up, recovery would begin from the beginning of the
last segment, and if the primary has produced some more WAL then the
standby would be able to actually avoid the random junk.  It is also
possible to bypass the problem by manually zeroing the areas in
question, or to actually wait for the standby to generate more WAL so
that the garbage is overwritten automatically.  And you really need to be
very, very unlucky to have random garbage able to bypass the header
validation checks.
--
Michael




RE: [bug fix] pg_rewind takes long time because it mistakenly copies data files

2018-02-26 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Your patch is able to fix that.  I have also checked that after diverging
> the promoted server with more data and inserting data on the old primary
> then the correct set of blocks from the tablespace is fetched as well by
> pg_rewind.  This patch also behaves correctly when creating a new relation
> on the promoted server as it copies the whole relation. In short your patch
> looks good to me.

How quick, I was surprised.  Thank you so much!  I'd be glad if you could be 
the reviewer in CF:

https://commitfest.postgresql.org/17/1542/


> Creating a test case for this patch would be nice, however this needs a
> bit of work so as the tablespace map can be used with pg_basebackup and
> PostgresNode.pm (or use raw pg_basebackup commands in pg_rewind tests):
> - PostgresNode::init_from_backup needs to be able to understand extra
> options given by caller for pg_basebackup.
> - RewindTest::create_standby should be extended with extra arguments given
> for previous extension.
> :(

That sounds difficult, from the way you put it... but this may be a good 
opportunity to look into the TAP tests.

Regards
Takayuki Tsunakawa





Re: [HACKERS] Pluggable storage

2018-02-26 Thread Alexander Korotkov
On Fri, Feb 23, 2018 at 2:20 AM, Robert Haas  wrote:

> On Fri, Feb 16, 2018 at 5:56 AM, Alexander Korotkov
>  wrote:
> > BTW, EnterpriseDB announces the zheap table access method (heap with undo
> > log) [2].  I think this is great, and I'm looking forward to zheap being
> > published on the mailing lists.  But I'm concerned about its compatibility
> > with the pluggable table access methods API.  Does zheap use the table AM
> > API from this thread?  Or does it just override the current heap and needs
> > to be adapted to use the table AM API?  Or does it implement its own API?
>
> Right now it just hacks the code.  The plan is to adapt it to whatever
> API we settle on in this thread.


Great, thank you for the clarification.  I'm looking forward to reviewing zheap :)

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: VACUUM FULL name is very confusing to some people (or to most non expert people)

2018-02-26 Thread Lætitia Avrot
>
> Although VACUUM and VACUUM FULL are different, the result is the same
> (depending on the detail level) - the data files are optimized for further
> processing. You should see VACUUM as a family of commands that do some
> data file optimizations: VACUUM, VACUUM FULL, VACUUM FREEZE, VACUUM
> ANALYZE, ... Personally I don't think we need to implement a new synonym
> command for this case.
>

Here's how I understand what you wrote: "Each and every vacuum operation
is a different flavour of file optimization, so it's legitimate to use
similar names". I agree for VACUUM ANALYZE and VACUUM FREEZE, which can be
seen as options to do more things than a simple VACUUM.

But I disagree for VACUUM FULL, which isn't an option to do one more thing
than VACUUM does. VACUUM FULL is a completely different process.

Let's take an example:
On a production server with an average production load, if you're already
running a VACUUM, you can change it to a VACUUM ANALYZE without many risks.
But I wouldn't dare try a VACUUM FULL without pg_repack.


> Why can't you just tell your students - "VACUUM FULL is like SHRINK in
> SQL Server"?
>

I do explain that to my students but I'm not sure they memorize it, because
they do have a lot to memorize in a training session.

I keep meeting customers to whom I have to explain that a simple VACUUM
doesn't rebuild indexes. Am I the only one facing that problem?


> Regards
>
> Pavel
>

Regards

Lætitia


Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Masahiko Sawada
On Tue, Feb 20, 2018 at 5:04 PM, Masahiko Sawada  wrote:
> On Fri, Feb 16, 2018 at 5:00 AM, Claudio Freire  
> wrote:
>> On Thu, Feb 15, 2018 at 4:47 PM, Claudio Freire  
>> wrote:
>>> On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada  
>>> wrote:
>>
> The final FSM vacuum pass isn't partial, to finally correct all those
> small inconsistencies.

 Yes, but the purpose of this patch is to prevent table bloating from
 concurrent modifications?
>>>
>>> Yes, by *marking* unmarked space. If the slot is above the threshold,
>>> it means there's already marked free space on that branch, and search
>>> will go into it already, so no pressing need to refine the mark. The
>>> space already marked can serve while vacuum makes further progress.
>>> Ie: it can wait.
>>
>> Although... I think I see what you mean.
>>
>> If you have this:
>>
>> 255
>> .   0
>> .   .   0
>> .   .   255
>> .   0
>> .   .   64
>> .   .   64
>> .   0
>> .   .   0
>> .   .   64
>>
>>
>> A partial vacuum won't even recurse beyond the root node, so it won't
>> expose the free space 2 levels down.
>
> Yeah, that's what I meant. I think this might happen somewhat easily
> if a table has fillfactor < 100. For example, updating tuples with
> bigger values on pages that are almost packed except for the
> fillfactor space.
>
>>
>> This could be arrived at by having an almost fully packed table that
>> contains some free space near the end, and it gets deletions near the
>> beginning. Vacuum will mark the leaf levels at the beginning of the
>> table, and we'd like for that free space to be discoverable to avoid
>> packing more rows near the end where they prevent truncation.
>>
>> The patch as it is now will only vacuum that part of the FSM when the
>> root value drops, which usually requires all the free space on that
>> region of the heap to be exhausted.
>>
>> With the current recursive FSM vacuum method, however, that's
>> unavoidable. We can't detect that there's some free space below the
>> current level to be exposed without recursing into the tree, and
>> recursing is exactly the kind of work we want to prevent with tree
>> pruning.
>>
>> In all my trials, however, I did not run into that case. It must not
>> be easy to get into that situation, because the root node already has
>> ~4000 slots in it. Each of those 4000 slots should be in that
>> situation to keep the space at the end of the table as the sole
>> discoverable free space, which seems like a rather contorted worst
>> case.
>
> Okay, I guess that this patch cannot help in the case where the
> free space of pages shown in the FSM gets decreased by vacuum, because the
> upper-level node doesn't reflect the value on the lower page. However,
> it doesn't happen often as you said, and even when it happens the behavior
> is the same as today. So I also think it would not become a problem.
>
> I'll review v4 patch.
>
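
(Before the review comments: here is a tiny standalone model of the scenario
quoted above -- purely illustrative, and much smaller and simpler than the
real three-level FSM -- showing how a threshold-pruned FSM vacuum can leave
freshly recorded leaf free space invisible while the root still advertises
plenty of space, so new rows keep landing at the end of the table.)

#include <stdio.h>

#define LEAVES 4
#define NODES  (2 * LEAVES - 1)		/* node i has children 2i+1 and 2i+2 */

static int fsm[NODES];

/* Full FSM vacuum: recompute every inner node as the max of its children. */
static int vacuum_full(int node)
{
	if (node >= LEAVES - 1)
		return fsm[node];
	int l = vacuum_full(2 * node + 1);
	int r = vacuum_full(2 * node + 2);
	return fsm[node] = (l > r) ? l : r;
}

/* Pruned vacuum: skip any branch already advertising >= threshold. */
static int vacuum_pruned(int node, int threshold)
{
	if (node >= LEAVES - 1 || fsm[node] >= threshold)
		return fsm[node];
	int l = vacuum_pruned(2 * node + 1, threshold);
	int r = vacuum_pruned(2 * node + 2, threshold);
	return fsm[node] = (l > r) ? l : r;
}

/* Search: descend into the first child advertising >= want; -1 if none. */
static int search(int node, int want)
{
	if (node >= LEAVES - 1)
		return (fsm[node] >= want) ? node - (LEAVES - 1) : -1;
	for (int c = 2 * node + 1; c <= 2 * node + 2; c++)
		if (fsm[c] >= want)
			return search(c, want);
	return -1;
}

int main(void)
{
	/*
	 * Pages 0 and 1 (start of the table) just had tuples deleted: their
	 * leaf slots were set to 64, but the inner node above them is still 0.
	 * Page 2 is full; page 3 (end of the table) has plenty of room, so the
	 * root already advertises 255.
	 */
	fsm[3] = 64;  fsm[4] = 64;  fsm[5] = 0;  fsm[6] = 255;
	fsm[1] = 0;   /* stale: the leaf updates were never bubbled up */
	fsm[2] = 255;
	fsm[0] = 255;

	printf("request for 64 before any FSM vacuum -> page %d\n", search(0, 64));
	vacuum_pruned(0, 64);
	printf("after pruned FSM vacuum (threshold 64) -> page %d\n", search(0, 64));
	vacuum_full(0);
	printf("after full FSM vacuum -> page %d\n", search(0, 64));
	return 0;
}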

Here is review comment for v4 patch.

@@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 * We don't insert a vacuum delay point here, because we have an
 * exclusive lock on the table which we want to hold for as short a
 * time as possible.  We still need to check for interrupts however.
+* We might have to acquire the autovacuum lock, however, but that
+* shouldn't pose a deadlock risk.
 */
CHECK_FOR_INTERRUPTS();

Is this change necessary?


+   /*
+    * If there are no indexes then we should periodically vacuum the FSM
+    * on huge relations to make free space visible early.
+    */
+   if (nindexes == 0 &&
+       (vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages)
+   {
+       /* Vacuum the Free Space Map */
+       FreeSpaceMapVacuum(onerel, max_freespace);
+       vacuumed_pages_at_fsm_vac = vacuumed_pages;
+       max_freespace = 0;
+   }

I think this code block should be executed before we check whether the page
is new or empty and then do 'continue'. Otherwise we cannot
reach this code if the table has a lot of new or empty pages.
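
Roughly this ordering is what I mean (just an illustrative skeleton of the
scan loop, not the actual lazy_scan_heap() code; the identifiers are the
ones from your patch):

for (blkno = 0; blkno < nblocks; blkno++)
{
	/* ... read and pin the page ... */

	/* do the periodic FSM vacuum check first, ... */
	if (nindexes == 0 &&
		(vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages)
	{
		FreeSpaceMapVacuum(onerel, max_freespace);
		vacuumed_pages_at_fsm_vac = vacuumed_pages;
		max_freespace = 0;
	}

	/* ... so that this early exit cannot starve it */
	if (PageIsNew(page) || PageIsEmpty(page))
	{
		/* record the page's free space, then skip the per-tuple work */
		continue;
	}

	/* ... pruning, dead tuple collection, etc. follow here ... */
}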


@@ -785,7 +789,7 @@ fsm_search(Relation rel, uint8 min_cat)
  * Recursive guts of FreeSpaceMapVacuum
  */
 static uint8
-fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
+fsm_vacuum_page(Relation rel, FSMAddress addr, uint8 threshold, bool *eof_p)
 {
	Buffer		buf;
	Page		page;

I think a comment for 'threshold' is needed. Maybe for
FreeSpaceMapVacuum as well?


@@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
  * If minValue > 0, the updated page is also searched for a page with at
  * least minValue of free space. If one is found, its slot number is
  * returned, -1 otherwise.
+ *
+ * If minValue == 0, the value at the root node is returned.

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 05:08:49PM +0900, Michael Paquier wrote:
> This was mentioned back in 2001 by the way, but this did not count much
> for the case discussed here:
> https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us
> The issue here is that the streaming case makes it easier to hit the
> problem, as it more easily opens access to not-completely-written WAL
> pages depending on the message frequency during replication.  At the
> same time, we are discussing a very low-probability issue.  Note
> that if the XLOG reader is bumping into this problem, then at the next
> WAL receiver wake up, recovery would begin from the beginning of the
> last segment, and if the primary has produced some more WAL then the
> standby would be able to actually avoid the random junk.  It is also
> possible to bypass the problem by manually zeroing the areas in
> question, or to actually wait for the standby to generate more WAL so
> that the garbage is overwritten automatically.  And you really need to be
> very, very unlucky to have random garbage able to bypass the header
> validation checks.

By the way, while I have my mind on it: another strategy would be
to just make the checks in XLogReadRecord() a bit smarter if the whole
record header is not on the page.  If we check at least for
AllocSizeIsValid(total_len) then this code would not fail on an
allocation as your user reported.  Still this misses the case where a
record size is lower than 1GB but invalid, so you would call
allocate_recordbuf for nothing :(
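
Something like the following, roughly (an illustrative fragment only, not a
tested patch; the exact placement inside XLogReadRecord() and the error text
are just assumptions):

/* Reject lengths that could never be a valid memory allocation request. */
if (!AllocSizeIsValid(total_len))
{
	report_invalid_record(state,
						  "record length %u at %X/%X is not a valid allocation size",
						  total_len,
						  (uint32) (RecPtr >> 32), (uint32) RecPtr);
	goto err;
}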

At least this extra check is costless, and avoids all kinds of hard
failures.
--
Michael




Re: [bug fix] pg_rewind takes long time because it mistakenly copies data files

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 08:13:02AM +, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:mich...@paquier.xyz]
>> Your patch is able to fix that.  I have also checked that after diverging
>> the promoted server with more data and inserting data on the old primary
>> then the correct set of blocks from the tablespace is fetched as well by
>> pg_rewind.  This patch also behaves correctly when creating a new relation
>> on the promoted server as it copies the whole relation. In short your patch
>> looks good to me.
> 
> How quick, I was surprised.  Thank you so much!  I'd be glad if you could be 
> the reviewer in CF:
> 
> https://commitfest.postgresql.org/17/1542/

Sure.

>> Creating a test case for this patch would be nice, however this needs a
>> bit of work so as the tablespace map can be used with pg_basebackup and
>> PostgresNode.pm (or use raw pg_basebackup commands in pg_rewind tests):
>> - PostgresNode::init_from_backup needs to be able to understand extra
>> options given by caller for pg_basebackup.
>> - RewindTest::create_standby should be extended with extra arguments given
>> for previous extension.
>> :(
> 
> That sounds difficult, from the way you put it... but this may be a
> good opportunity to look into the TAP tests. 

Anything like that would be work only for HEAD I think as that's a bit
of refactoring.  And indeed it could give you a good introduction to the
TAP facility.
--
Michael




Using JSONB directly from application

2018-02-26 Thread Anthony Communier
Hello,

It would be nice if applications connected to a PostgreSQL database could
send and receive JSONB in binary. It could save some useless text
conversion. All the work could be done on the application side, which is
often more scalable than the database itself.

Regards,

Anthony Communier


Re: csv format for psql

2018-02-26 Thread Fabien COELHO


Hello Daniel,


This patch implements csv as an output format in psql
(\pset format csv). It's quite similar to the unaligned format,
except that it applies CSV quoting rules (obviously!) and that
it prints no footer and no title.
As with unaligned, a header with column names is output unless
tuples_only is on. It also supports the fieldsep/fieldsep_zero
and recordsep/recordsep_zero settings.


Patch applies cleanly and compiles. "make check" ok, although there is
no specific test for this feature...

The documentation should mention the other CSV options (COPY, \copy, ...) 
and explain how they compare to this one. Maybe a specific paragraph about 
how to do CSV? I understand "\pset format csv" as applying to all
output, as opposed to per-command options.


Given that this is somehow already available, I'm wondering why there is 
no code sharing.


I find it annoying that setting csv keeps the default '|' separator, where
ISTM that it should be by default "," (as in COMMA separated value:-).
However it would not be a good idea to change other variables when setting
one, obviously.

Maybe some \csv command could set the format to csv, fieldsep to ",", 
tuples_only to on, recordsep to '\n'? Not sure whether it would be 
acceptable, though, and how to turn it off once turned on... Probably an 
average (aka not good) idea:-)


The format adds column headers, however they are not escaped properly:

  psql> \pset format csv
  psql> \pset fieldsep ,
  psql> SELECT 1 AS "hello, world", 2 AS ;
hello, world,"
1,2

Also it does not seem to work properly in expanded mode, both for the
column and values:

  psql> \x
  psql> SELECT 1 AS "bla""", E'\n,"' AS foo;
bla",1
foo,
,"

There MUST be some tests, especially with ugly stuff (escapes, newlines,
double quotes, various types, expanded or not, field seps, strange 
column names...).




Most of times, the need for CSV is covered by \copy or COPY with
the CSV option, but there are some cases where it would be more
practical to have it as an output format in psql.

* \copy does not interpolate psql variables and is a single-line
command, so making a query fit these constraints can be cumbersome.
It can be worked around by defining a temporary view and
\copy from that view, but that doesn't work in a read-only context
such as when connected to a standby.

* the server-side COPY TO STDOUT can also be used from psql,
typically with psql -c "COPY (query) TO STDOUT CSV" > file.csv,
but that's too simple to extract multiple result sets per script.
COPY is also more rigid than psql in the options to delimit
fields and records.

* copy with csv can't help for the output of meta-commands
such as \gx, \crosstabview, \l, \d ... whereas a CSV format within psql
does work with these.


--
Fabien.



Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-26 Thread Jeevan Chalke
Hi Robert,

On Fri, Feb 23, 2018 at 2:53 AM, Robert Haas  wrote:

> On Thu, Feb 8, 2018 at 8:05 AM, Jeevan Chalke
>  wrote:
> > In this attached version, I have rebased my changes over new design of
> > partially_grouped_rel. The preparatory changes of adding
> > partially_grouped_rel are in 0001.
>
> I spent today hacking in 0001; results attached.  The big change from
> your version is that this now uses generate_gather_paths() to add
> Gather/Gather Merge nodes (except in the case where we sort by group
> pathkeys and then Gather Merge) rather than keeping all of the bespoke
> code.  That turned out to be a bit less elegant than I would have liked
> -- I had to add an override_rows argument to generate_gather_paths to make
> it work.  But overall I think this is still a big improvement, since
> it lets us share code instead of duplicating it.  Also, it potentially
> lets us add partially-aggregated but non-parallel paths into
> partially_grouped_rel->pathlist and that should Just Work; they will
> get the Finalize Aggregate step but not the Gather.  With your
> arrangement that wouldn't work.
>
> Please review/test.
>

I have reviewed and tested the patch and here are a couple of points:

 /*
- * If the input rel belongs to a single FDW, so does the grouped rel.
+ * If the input rel belongs to a single FDW, so does the grouped rel.  Same
+ * for the partially_grouped_rel.
  */
 grouped_rel->serverid = input_rel->serverid;
 grouped_rel->userid = input_rel->userid;
 grouped_rel->useridiscurrent = input_rel->useridiscurrent;
 grouped_rel->fdwroutine = input_rel->fdwroutine;
+partially_grouped_rel->serverid = input_rel->serverid;
+partially_grouped_rel->userid = input_rel->userid;
+partially_grouped_rel->useridiscurrent = input_rel->useridiscurrent;
+partially_grouped_rel->fdwroutine = input_rel->fdwroutine;

In my earlier mail where I posted a patch for these partially grouped
rel changes, I forgot to raise a question about this.
I was unclear about the above changes and thus passed grouped_rel whenever
we wanted to fetch the relevant details while working on partially_grouped_rel.

One idea I thought about is to memcpy the struct once we have set all
required fields for grouped_rel so that we don't have to do similar stuff
for partially_grouped_rel.

---

+ * Insert a Sort node, if required.  But there's no point in
+ * sorting anything but the cheapest path.
  */
-if (root->group_pathkeys)
+if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+{
+if (path != linitial(partially_grouped_rel->pathlist))
+continue;

Paths in the pathlist are added by add_path(). Though the pathlist is
sorted with the cheapest total path first, we generally use
RelOptInfo->cheapest_total_path instead of the first entry, unlike with
partial paths. But here you use the first entry, as in the partial paths case.
Would it be better to use the cheapest total path from partially_grouped_rel?
This will require calling set_cheapest on partially_grouped_rel before we call
this function.

Attached is a top-up patch doing this, along with a few indentation fixes.

Rest of the changes look good to me.

Once this gets in, I will re-base my other patches accordingly.

And, thanks for committing 0006.

Thanks


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1c792a0..f6b0208 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2453,7 +2453,8 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
  * we must do something.)
  */
 void
-generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+	  bool override_rows)
 {
 	Path	   *cheapest_partial_path;
 	Path	   *simple_gather_path;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e4f9bd4..e8f6cc5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6101,7 +6101,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			 */
 			if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
 			{
-if (path != linitial(partially_grouped_rel->pathlist))
+if (path != partially_grouped_rel->cheapest_total_path)
 	continue;
 path = (Path *) create_sort_path(root,
  grouped_rel,
@@ -6186,9 +6186,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		 */
 		if (partially_grouped_rel->pathlist)
 		{
-			Path	   *path;
-
-			path = (Path *) linitial(partially_grouped_rel->pathlist);
+			Path	   *

Optimizing nested ConvertRowtypeExpr execution

2018-02-26 Thread Ashutosh Bapat
Hi,
In a multi-level partitioned table, a parent whole-row reference gets
translated into a nested ConvertRowtypeExpr with the child whole-row
reference as the leaf. During execution, the child whole-row
reference gets translated through all the intermediate parents' whole-row
references, ultimately represented as the parent's whole-row reference.
AFAIU, the intermediate translations are unnecessary. The leaf child
whole-row can be directly translated into the top parent's whole-row
reference. Here's a WIP patch which does that by eliminating
intermediate ConvertRowtypeExprs during ExecInitExprRec().

I tested the performance with two-level partitions, and 1M rows, on my
laptop selecting just the whole-row expression. I saw ~20% improvement
in the execution time. Please see the attached test and its output
with and without patch.

For a two-level inheritance hierarchy, this patch doesn't show any
performance difference since the plan-time hierarchy is flattened into a
single level.

Instead of the approach that the patch takes, we might modify
adjust_appendrel_attrs() not to produce nested ConvertRowtypeExprs in
the first place. With that we may get rid of code which handles nested
ConvertRowtypeExprs like is_converted_whole_row_reference(). But I
haven't tried that approach yet.

Suggestions/comments welcome.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index db5fcaf..ec896a4 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1364,6 +1364,37 @@ ExecInitExprRec(Expr *node, ExprState *state,
 		case T_ConvertRowtypeExpr:
 			{
 ConvertRowtypeExpr *convert = (ConvertRowtypeExpr *) node;
+ConvertRowtypeExpr *tmp_cre = convert;
+bool		nested_cre = false;
+
+/*
+ * If this is a nested ConvertRowtypeExpr resulting from a
+ * multi-level partition/inheritance hierarchy, its leaf node
+ * will be a whole-row expression. We can convert the leaf
+ * whole-row directly into the topmost parent, without
+ * converting it into the intermediate parent row types.
+ */
+while (IsA(tmp_cre->arg, ConvertRowtypeExpr))
+{
+	tmp_cre = castNode(ConvertRowtypeExpr, tmp_cre->arg);
+	nested_cre = true;
+}
+
+if (nested_cre && IsA(tmp_cre->arg, Var) &&
+	castNode(Var, tmp_cre->arg)->varattno == 0)
+{
+	ConvertRowtypeExpr *new_convert;
+
+	/*
+	 * XXX: Instead of modifying the expression directly, we
+	 * save a modified copy in the execution tree. May be it's
+	 * safe to modify the expression directly, not sure.
+	 */
+	new_convert = makeNode(ConvertRowtypeExpr);
+	memcpy(new_convert, convert, sizeof(ConvertRowtypeExpr));
+	new_convert->arg = tmp_cre->arg;
+	convert = new_convert;
+}
 
 /* evaluate argument into step's result area */
 ExecInitExprRec(convert->arg, state, resv, resnull);


nest_cre.out
Description: Binary data
\set num_rows 100
drop table t1;
drop table p cascade;
create table t1 (a int, b varchar, c timestamp) partition by range(a);
create table t1p1 (b varchar, c timestamp, a int) partition by range(a);
alter table t1 attach partition t1p1 for values from (0) to (100);
create table t1p1p1(c timestamp, a int, b varchar);
alter table t1p1 attach partition t1p1p1 for values from (0) to (50);
insert into t1 select abs(random()) * 49, i, now() from generate_series(1,  :num_rows) i;
explain analyze select t1 from t1;
   QUERY PLAN   

 Append  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.079..683.430 rows=100 loops=1)
   ->  Seq Scan on t1p1p1  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.077..616.439 rows=100 loops=1)
 Planning time: 0.193 ms
 Execution time: 717.929 ms
(4 rows)

explain analyze select t1 from t1;
   QUERY PLAN   

 Append  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.017..607.063 rows=100 loops=1)
   ->  Seq Scan on t1p1p1  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.016..541.619 rows=100 loops=1)
 Planning time: 0.115 ms
 Execution time: 640.628 ms
(4 rows)

explain analyze select t1 from t1;
   QUERY PLAN   

 Append  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.016..605.972 rows=100 loops=1)

Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-02-26 Thread Etsuro Fujita

(2018/02/23 16:38), Amit Langote wrote:

On Thu, Feb 22, 2018 at 8:49 PM, Etsuro Fujita
  wrote:

This would introduce an asymmetry (we can move tuples from plain partitions
to foreign partitions, but the reverse is not true), but I am thinking that
it would be probably okay to document about that.



About just documenting the asymmetry you mentioned that's caused by
the fact that we don't enforce constraints on foreign tables, I
started wondering if we shouldn't change our stance on the matter wrt
"partition" constraints?


I'm not sure that it's a good idea to make an exception in that case. 
Another concern is triggers on the remote side; those might change the 
row so that the partition constraint of the containing partition is no 
longer satisfied.



But, admittedly, that's a topic for a
different thread.


OK, I'll leave that for another patch.

Will post a new version.  Thanks for the comments!

Best regards,
Etsuro Fujita



Re: GSOC 2018 ideas

2018-02-26 Thread Aleksander Alekseev
Hello Charles,

> I saw that PostgreSQL was selected for GSoC 2018 and I'm pretty interested
> in the idea of Thrift data type support that you proposed. So, I want to
> prepare a proposal based on this idea.

Glad you are interested in this project!

> Can I have more detailed information on what documents or code I
> need to understand?

I would recommend the following documents and code:

* Source code of pg_protobuf
  https://github.com/afiskon/pg_protobuf
* "Writing Postgres Extensions" tutorial series by Manuel Kniep
  http://big-elephants.com/2015-10/writing-postgres-extensions-part-i/
* "So you want to make an extension?" talk by Keith Fiske
  http://slides.keithf4.com/extension_dev/#/
* Apache Thrift official website
  https://thrift.apache.org/
* Also a great explanation of the Thrift format can be found in the
  book "Designing Data-Intensive Applications" by Martin Kleppmann
  http://dataintensive.net/

> Also, if this idea is allocated to another student (or in other words,
> you prefer some other student to work on it), do let me know, so that I can
> pick some other project in PostgreSQL. Any comments or suggestions are
> welcome!

To the best of my knowledge there are currently no other students interested
in this particular work.

-- 
Best regards,
Aleksander Alekseev




Re: VACUUM FULL name is very confusing to some people (or to most non expert people)

2018-02-26 Thread Pavel Stehule
2018-02-26 9:56 GMT+01:00 Lætitia Avrot :

>> Although VACUUM and VACUUM FULL are different, the result is the same
>> (depending on the detail level) - the data files are optimized for further
>> processing. You should see VACUUM as a family of commands that do some
>> data file optimizations: VACUUM, VACUUM FULL, VACUUM FREEZE, VACUUM
>> ANALYZE, ... Personally I don't think we need to implement a new synonym
>> command for this case.
>>
>
> Here's how I understand what you wrote: "Each and every vacuum operation
> is a different flavour of file optimization, so it's legitimate to use
> similar names". I agree for VACUUM ANALYZE and VACUUM FREEZE, which can be
> seen as options to do more things than a simple VACUUM.
>
> But I disagree for VACUUM FULL, which isn't an option to do one more thing
> than VACUUM does. VACUUM FULL is a completely different process.
>
> Let's take an example:
> On a production server with an average production load, if you're already
> running a VACUUM, you can change it to a VACUUM ANALYZE without many risks.
> But I wouldn't dare try a VACUUM FULL without pg_repack.
>

VACUUM and VACUUM ANALYZE do the same VACUUM work.

VACUUM FREEZE is different too.


>
>
>> Why can't you just tell your students - "VACUUM FULL is like SHRINK in
>> SQL Server"?
>>
>
> I do explain that to my students but I'm not sure they memorize it,
> because they do have a lot to memorize in a training session.
>

Maybe the core of this issue is the fact that VACUUM FULL and VACUUM FREEZE
are statements based on two-keyword commands, while lazy VACUUM uses just one
keyword. From this perspective the lazy VACUUM "should be" renamed, because it
is inconsistent - and some databases use names like OPTIMIZE TABLE (which
sounds much more optimistic :)). I have taught PostgreSQL for more than ten
years and I have not had a problem with this topic - the hard part is
explaining VACUUM FREEZE - but VACUUM and VACUUM FULL can be explained simply
(with some reduction of detail). VACUUM reduces MVCC costs, VACUUM FULL
reduces bloat. ANALYZE is orthogonal - it calculates column statistics.


>
> I keep meeting customers to whom I have to explain that a simple VACUUM
> doesn't rebuild indexes. Am I the only one facing that problem?
>

A simple VACUUM (lazy VACUUM) does tasks that don't need aggressive locks -
rebuilding indexes needs strong locks.

I agree it is not entirely clean, because VACUUM FULL shares some work with
REINDEX, but introducing a new command changes nothing.



>
>
>> Regards
>>
>> Pavel
>>
>
> Regards
>
> Lætitia
>


Re: Use of term hostaddrs for multiple hostaddr values

2018-02-26 Thread Vasundhar Boddapati
I have gone through the comments; both look straightforward to go ahead with.

Re: [doc fix] Correct calculation of vm.nr_hugepages

2018-02-26 Thread Vasundhar Boddapati
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:tested, passed

It works. Can be Merged.

Re: VACUUM FULL name is very confusing to some people (or to most non expert people)

2018-02-26 Thread David G. Johnston
On Sun, Feb 25, 2018 at 10:51 AM, Lætitia Avrot 
wrote:

> Hi all,
>
> For most beginners (and even a lot of advanced users)  there is a strong
> confusion between simple VACUUM and VACUUM FULL. They think "full" is
> simply an option to the maintenance operation vacuum while it's not. It's a
> complete different  operation.
>
> I have a hard time explaining it when I teach PostgreSQL Administration
> (even if I stress the matter) and I constantly meet customer that are wrong
> about it.
>

I don't think adding an actual alias to the system is worthwhile but if
thinking of "VACUUM FULL" as "DEFRAGMENT" helps people remember by all
means go for it.

​Maybe I'm being naive but by the time people learn MVCC and the different
pros, cons, and ways to manage the resultant bloat, understanding the
difference between VACUUM and VACUUM FULL will be straightforward, just
attaching different labels to different behaviors.  Whether FULL is an
option or part of its own command doesn't seem that important - you'd still
have to learn why the option exists and the pros and cons of using it
instead of just the basic option.  If the response is "this is all too
complicated, I'll pick the one that sounds like it's more thorough" then we
are back to understanding MVCC.  One would still have to decide between
VACUUM and DEFRAGMENT and quite possibly decide to perform both... at least
with VACUUM FULL it's made obvious that both commands are related.

David J.


Re: Using JSONB directly from application

2018-02-26 Thread Craig Ringer
On 26 February 2018 at 04:05, Anthony Communier  wrote:

> Hello,
>
> It would be nice if applications connected to a PostgreSQL database could
> send and receive JSONB in binary. It could save some useless text
> conversion. All the work could be done on the application side, which is
> often more scalable than the database itself.
>


To support this, you'd need to extract PostgreSQL's jsonb support into a C
library that could be used independently of backend server infrastructure
like 'palloc' and memory contexts, ereport(), etc. Or write a parallel
implementation.

It's one of the reasons some people questioned the wisdom of doing jsonb as
a bespoke format not using protobufs or whatever. I'm not one of them,
since I wasn't the one doing the work, and I also know how hard it can be
to neatly fit general purpose library code into the DB server where we
expect it to do little things like not abort() on malloc() failure.

If you're interested in writing such a library, I suggest proposing a broad
design for how you intend to do it here. I suspect that duplicating enough
server backend infrastructure to make the existing jsonb implementation
friendly to frontend code would be frustrating, but maybe it's not as bad
as I think. Certainly if you did such a thing, many people would thank you,
because the inability to use ereport() and elog(), PG_TRY, the List API,
etc, in FRONTEND code is a constant if minor source of irritation in
PostgreSQL development.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Patch: Pass IndexInfo correctly to aminsert for indexes on TOAST

2018-02-26 Thread Robert Haas
On Thu, Feb 22, 2018 at 9:13 PM, Ahuja, Nitin  wrote:
> Patch Description
> Currently, while saving the TOAST entry the call to index access method
> insert (aminsert) passed IndexInfo attribute as NULL. IndexInfo parameter
> was introduced in Postgres 10 for the aminsert method to allow caching data
> across aminsert calls within a single SQL statement. The IndexInfo is passed
> correctly for regular tuple insertion but not for TOAST entries. This
> currently prohibits aminsert method to be able to use IndexInfo for TOAST
> entries.
>
> This patch correctly passes the built IndexInfo to aminsert for TOAST
> entries.

If you don't offer any justification for the patch it's unlikely to be
accepted.  BuildIndexInfo() isn't free, so unless it hurts something
to skip calling it, we should keep skipping it.

Also, your patch is (a) in the body of the email instead of attached,
(b) in HTML format, and (c) reversed -- if it were properly formatted,
the original hunks would be first and the revised ones second.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: remove pg_class.relhaspkey

2018-02-26 Thread Robert Haas
On Mon, Feb 26, 2018 at 12:36 AM, Michael Paquier  wrote:
> I would be of the opinion to drop them.

+1.  On this point, I am in agreement with the gentleman who wrote
http://postgr.es/m/7903.1310671...@sss.pgh.pa.us

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Contention preventing locking

2018-02-26 Thread Amit Kapila
On Thu, Feb 15, 2018 at 9:30 PM, Konstantin Knizhnik
 wrote:
> Hi,
>
> PostgreSQL performance degrades significantly in case of high contention.
> You can look at the attached YCSB results (ycsb-zipf-pool.png) to estimate
> the level of this degradation.
>
> Postgres acquires two kinds of heavy weight locks during an update: it locks
> the TID of the updated tuple and the XID of the transaction that created this
> version.
> In the debugger I see the following picture: if several transactions are
> trying to update the same record, then first of all they compete for the
> SnapshotDirty.xmax transaction lock in EvalPlanQualFetch.
> Then in heap_tuple_update they are trying to lock the TID of the updated
> tuple: heap_acquire_tuplock in heapam.c
>

There is no function named heap_tuple_update.  Do you mean to say heap_update?

> After that, transactions wait for completion of the transaction that updated
> the tuple: XactLockTableWait in heapam.c
>
> So in heap_acquire_tuplock all competing transactions are waiting for the TID
> of the updated version. When the transaction which changed this tuple is
> committed, one of the competitors will be granted this lock and proceed,
> creating a new version of the tuple. Then all other competitors will be
> awakened and ... find out that the locked tuple is not the last version any
> more. Then they will locate the new version and try to lock it... The more
> competitors we have, the more attempts they all have to perform (quadratic
> complexity).
>

The attempts are controlled by marking the tuple as locked by me after
waiting on SnapshotDirty.xmax.  So, the backend which marks the tuple
as locked must get the tuple to update and I think it is ensured in
code that only one backend will be allowed to do so.  I can see some
serialization in this logic, but I think we should first try to get
the theory behind the contention problem you are seeing before trying
to find the solution for it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Contention preventing locking

2018-02-26 Thread Amit Kapila
On Tue, Feb 20, 2018 at 10:34 PM, Konstantin Knizhnik
 wrote:
>
>
> On 20.02.2018 19:39, Simon Riggs wrote:
>>
>> On 20 February 2018 at 16:07, Konstantin Knizhnik
>>  wrote:
>>>
>>>
>>> On 20.02.2018 14:26, Simon Riggs wrote:

 Try locking the root tid rather than the TID, that is at least unique
 per page for a chain of tuples, just harder to locate.

>>> As far as I understand, it is necessary to traverse the whole page to
>>> locate
>>> root tuple, isn't it?
>>> If so, then I expect it to be too expensive operation. Scanning the whole
>>> page on tuple update seems to be not an acceptable solution.
>>
>> Probably.
>>
>> It occurs to me that you can lock the root tid in index_fetch_heap().
>> I hear other DBMS lock via the index.
>>
>> However, anything you do with tuple locking could interact badly with
>> heap_update and the various lock modes, so be careful.
>>
>> You also have contention for heap_page_prune_opt() and with SELECTs to
>> consider, so I think you need to look at both page and tuple locking.
>>
>
> So, if I correctly understand the primary goal of setting tuple lock in
> heapam.c is to avoid contention caused
> by concurrent release of all waiters.
> But my transaction lock chaining patch eliminates this problem in other way.
> So what about combining your patch (do not lock Snapshot.xmax) + with my
> xlock patch and ... completely eliminate tuple lock in heapam?
> In this case update of tuple will require obtaining just one heavy weight
> lock.
>
> I made such experiment and didn't find any synchronization problems with my
> pgrw test.
> Performance is almost the same as with vanilla+xlock patch:
>
> https://docs.google.com/spreadsheets/d/1QOYfUehy8U3sdasMjGnPGQJY8JiRfZmlS64YRBM0YTo/edit?usp=sharing
>
> I wonder why instead of chaining transaction locks (which can be done quite
> easily) approach with extra tuple lock was chosen?

Can you please explain, how it can be done easily without extra tuple
locks?  I have tried to read your patch but due to lack of comments,
it is not clear what you are trying to achieve.  As far as I can see
you are changing the locktag passed to LockAcquireExtended by the
first waiter for the transaction.  How will it achieve the serial
waiting protocol (queue the waiters for tuple) for a particular tuple
being updated?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada  wrote:
> Here is review comment for v4 patch.
>
> @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel,
> LVRelStats *vacrelstats)
>  * We don't insert a vacuum delay point here, because we have 
> an
>  * exclusive lock on the table which we want to hold
> for as short a
>  * time as possible.  We still need to check for
> interrupts however.
> +* We might have to acquire the autovacuum lock,
> however, but that
> +* shouldn't pose a deadlock risk.
>  */
> CHECK_FOR_INTERRUPTS();
>
> Is this change necessary?

I don't recall doing that change. Maybe a rebase gone wrong.

> 
> +   /*
> +* If there are no indexes then we should periodically
> vacuum the FSM
> +* on huge relations to make free space visible early.
> +*/
> +   if (nindexes == 0 &&
> +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
> vacuum_fsm_every_pages)
> +   {
> +   /* Vacuum the Free Space Map */
> +   FreeSpaceMapVacuum(onerel, max_freespace);
> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
> +   max_freespace = 0;
> +   }
>
> I think this code block should be executed before we check whether the page
> is new or empty and then do 'continue'. Otherwise we cannot
> reach this code if the table has a lot of new or empty pages.

In order for the counter (vacuumed_pages) to increase, there have to
be plenty of opportunities for this code to run, and I specifically
wanted to avoid vacuuming the FSM too often for those cases
particularly (when Vacuum scans lots of pages but does no writes).

> 
> @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
>   * If minValue > 0, the updated page is also searched for a page with at
>   * least minValue of free space. If one is found, its slot number is
>   * returned, -1 otherwise.
> + *
> + * If minValue == 0, the value at the root node is returned.
>   */
>  static int
>  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr,
> uint16 slot,
>
> addr.level == FSM_BOTTOM_LEVEL,
>true);
> }
> +   else
> +   newslot = fsm_get_avail(page, 0);
>
> I think it's not a good idea to make fsm_set_and_search return either a
> slot number or a category value according to the argument. Category
> values are actually uint8 but this function returns it as int.

Should be fine, a uint8 can be contained inside an int on all platforms.

> Also we
> can call fsm_set_and_search with minValue = 0 at
> RecordAndGetPageWithFreeSpace() when search_cat is 0. With your patch,
> fsm_set_and_search() then returns the root value but it's not correct.

I guess I can add another function to do that.
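
Perhaps something along these lines (the function name and exact placement
are hypothetical, just modelled on GetRecordedFreeSpace(), to keep the two
behaviours separate):

/* Return the value stored in the root node of the FSM page at 'addr'. */
static uint8
fsm_get_root_avail(Relation rel, FSMAddress addr)
{
	Buffer		buf = fsm_readbuf(rel, addr, false);
	uint8		val = 0;

	if (BufferIsValid(buf))
	{
		val = fsm_get_max_avail(BufferGetPage(buf));
		ReleaseBuffer(buf);
	}
	return val;
}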

> Also, is this change necessary for this patch? ISTM this change is
> used for the change in fsm_update_recursive() as follows but I guess
> this change can be a separate patch.
>
> +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
> +
> +   /*
> +* Bubble up, not the value we just set, but the one now in the root
> +* node of the just-updated page, which is the page's highest value.
> +*/

I can try to separate them I guess.



Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-26 Thread Robert Haas
On Mon, Feb 26, 2018 at 6:38 AM, Jeevan Chalke
 wrote:
> One idea I thought about is to memcpy the struct once we have set all
> required fields for grouped_rel so that we don't have to do similar stuff
> for partially_grouped_rel.

I think that would be a poor idea.  We want to copy a few specific
fields, not everything, and copying those fields is cheap, because
they are just simple assignment statements.  I think memcpy()'ing the
whole structure would be using a sledgehammer to solve a problem for
which a scalpel is more suited.

> Paths in the pathlist are added by add_path(). Though the pathlist is
> sorted with the cheapest total path first, we generally use
> RelOptInfo->cheapest_total_path instead of the first entry, unlike with
> partial paths. But here you use the first entry, as in the partial paths
> case. Would it be better to use the cheapest total path from
> partially_grouped_rel? This will require calling set_cheapest on
> partially_grouped_rel before we call this function.

Hmm, I guess that seems like a reasonable approach, although I am not
sure it matters much either way.

> Attached top-up patch doing this along with few indentation fixes.

I don't see much point to the change in generate_gather_paths -- that
line is only 77 characters long.

Committed after incorporating your other fixes and updating the
optimizer README.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire  wrote:
>> 
>> +   /*
>> +* If there are no indexes then we should periodically
>> vacuum the FSM
>> +* on huge relations to make free space visible early.
>> +*/
>> +   if (nindexes == 0 &&
>> +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
>> vacuum_fsm_every_pages)
>> +   {
>> +   /* Vacuum the Free Space Map */
>> +   FreeSpaceMapVacuum(onerel, max_freespace);
>> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
>> +   max_freespace = 0;
>> +   }
>>
>> I think this code block should be executed before we check whether the page
>> is new or empty and then do 'continue'. Otherwise we cannot
>> reach this code if the table has a lot of new or empty pages.
>
> In order for the counter (vacuumed_pages) to increase, there have to
> be plenty of opportunities for this code to run, and I specifically
> wanted to avoid vacuuming the FSM too often for those cases
> particularly (when Vacuum scans lots of pages but does no writes).

Wait, I see what you mean. Entirely empty pages.

Ok, I can move it.



Re: Contention preventing locking

2018-02-26 Thread Konstantin Knizhnik



On 26.02.2018 17:00, Amit Kapila wrote:

On Thu, Feb 15, 2018 at 9:30 PM, Konstantin Knizhnik
 wrote:

Hi,

PostgreSQL performance degrades significantly in case of high contention.
You can look at the attached YCSB results (ycsb-zipf-pool.png) to estimate
the level of this degradation.

Postgres acquires two kinds of heavy weight locks during an update: it locks
the TID of the updated tuple and the XID of the transaction that created this
version.
In the debugger I see the following picture: if several transactions are
trying to update the same record, then first of all they compete for the
SnapshotDirty.xmax transaction lock in EvalPlanQualFetch.
Then in heap_tuple_update they are trying to lock the TID of the updated
tuple: heap_acquire_tuplock in heapam.c


There is no function named heap_tuple_update.  Do you mean to say heap_update?


Yes, sorry. I mean heap_update.




After that, transactions wait for completion of the transaction that updated
the tuple: XactLockTableWait in heapam.c

So in heap_acquire_tuplock all competing transactions are waiting for the TID
of the updated version. When the transaction which changed this tuple is
committed, one of the competitors will be granted this lock and proceed,
creating a new version of the tuple. Then all other competitors will be
awakened and ... find out that the locked tuple is not the last version any
more. Then they will locate the new version and try to lock it... The more
competitors we have, the more attempts they all have to perform (quadratic
complexity).


The attempts are controlled by marking the tuple as locked by me after
waiting on SnapshotDirty.xmax.  So, the backend which marks the tuple
as locked must get the tuple to update and I think it is ensured in
code that only one backend will be allowed to do so.  I can see some
serialization in this logic, but I think we should first try to get
the theory behind the contention problem you are seeing before trying
to find the solution for it.



Sorry, I cannot say that I completely understand the logic of locking
updated tuples in Postgres, although I have spent some time inspecting
this code.
It seems strange and quite inefficient to me that updating one
tuple requires obtaining three heavy-weight locks.

Did you read this thread:

https://www.postgresql.org/message-id/CANP8%2BjKoMAyXvMO2hUqFzHX8fYSFc82q9MEse2zAEOC8vxwDfA%40mail.gmail.com


In any case, the summary of all my recent experiments with locking is the
following:


1. My proposal with transaction lock chaining can reduce the performance
degradation with an increasing number of competing transactions.
But degradation still takes place. Moreover, my recent experiments show
that lock chaining can cause deadlocks. It seems to me that I know the
reason for such deadlocks, but right now I have no idea how to fix them.
2. Replacing tuple TID locks with a tuple PK lock has a positive effect
only if the lock is held until the end of the transaction. Certainly we can
not lock all updated tuples (it would cause lock hash overflow).
But we can keep just some small number of tuple locks until the end of the
transaction. Unfortunately that is not the only problem with this solution:
it is relatively simple to implement in the case of an integer primary key,
but handling all kinds of PKs, including compound keys, requires more effort.
Also there can be no PK at all...
3. An alternative to the PK lock is a root tuple lock (the head of the hot
update chain). Unfortunately there is no efficient way (at least I do not
know one) of locating the root tuple.
Scanning the whole page seems to be too inefficient. If we perform an index
scan, then we actually start with the root tuple, so it is possible to
somehow propagate it.
But it doesn't work in the case of a heap scan. And I suspect that, as in
the case of the PK lock, it can increase performance only if the lock is
held until the end of the transaction.


So the problem with lock contention really exists. It can be quite easily
reproduced on a machine with a large number of cores, with pgbench using a
zipfian distribution,
with the YCSB benchmark, or with my pgrw benchmark... But how to
fix it is unclear.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Contention preventing locking

2018-02-26 Thread Konstantin Knizhnik



On 26.02.2018 17:20, Amit Kapila wrote:

On Tue, Feb 20, 2018 at 10:34 PM, Konstantin Knizhnik
 wrote:


On 20.02.2018 19:39, Simon Riggs wrote:

On 20 February 2018 at 16:07, Konstantin Knizhnik
 wrote:


On 20.02.2018 14:26, Simon Riggs wrote:

Try locking the root tid rather than the TID, that is at least unique
per page for a chain of tuples, just harder to locate.


As far as I understand, it is necessary to traverse the whole page to
locate
root tuple, isn't it?
If so, then I expect it to be too expensive operation. Scanning the whole
page on tuple update seems to be not an acceptable solution.

Probably.

It occurs to me that you can lock the root tid in index_fetch_heap().
I hear other DBMS lock via the index.

However, anything you do with tuple locking could interact badly with
heap_update and the various lock modes, so be careful.

You also have contention for heap_page_prune_opt() and with SELECTs to
consider, so I think you need to look at both page and tuple locking.


So, if I correctly understand the primary goal of setting tuple lock in
heapam.c is to avoid contention caused
by concurrent release of all waiters.
But my transaction lock chaining patch eliminates this problem in other way.
So what about combining your patch (do not lock Snapshot.xmax) + with my
xlock patch and ... completely eliminate tuple lock in heapam?
In this case update of tuple will require obtaining just one heavy weight
lock.

I made such experiment and didn't find any synchronization problems with my
pgrw test.
Performance is almost the same as with vanilla+xlock patch:

https://docs.google.com/spreadsheets/d/1QOYfUehy8U3sdasMjGnPGQJY8JiRfZmlS64YRBM0YTo/edit?usp=sharing

I wonder why instead of chaining transaction locks (which can be done quite
easily) approach with extra tuple lock was chosen?

Can you please explain, how it can be done easily without extra tuple
locks?  I have tried to read your patch but due to lack of comments,
it is not clear what you are trying to achieve.  As far as I can see
you are changing the locktag passed to LockAcquireExtended by the
first waiter for the transaction.  How will it achieve the serial
waiting protocol (queue the waiters for tuple) for a particular tuple
being updated?

The idea of transaction lock chaining was very simple. I explained
it in the first mail in this thread.

Assume that transaction T1 has updated tuple R1.
Transaction T2 also tries to update this tuple and so waits for T1's XID.
If then yet another transaction T3 also tries to update R1, then it
should wait for T2, not for T1.
In this case the commit of transaction T1 will awaken just one transaction
and not all the other transactions trying to update this tuple.
So there will be no races between multiple transactions, which consume a
lot of system resources.
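
Just to illustrate the scale of the difference, here is a toy wakeup-count
model of the two behaviours described above (it ignores SnapshotDirty.xmax
waits, tuple locks and every other detail of the real protocol):

#include <stdio.h>

/*
 * "Herd": every commit wakes all remaining waiters, one wins and the rest
 * re-queue on the winner.  "Chained": every commit wakes exactly one waiter.
 */
static long herd_wakeups(int n)
{
	long	wakeups = 0;

	for (int remaining = n - 1; remaining > 0; remaining--)
		wakeups += remaining;
	return wakeups;
}

static long chained_wakeups(int n)
{
	return n - 1;
}

int main(void)
{
	const int sizes[] = {2, 8, 32, 128};

	for (int i = 0; i < 4; i++)
		printf("%3d competing transactions: herd=%5ld wakeups, chained=%3ld\n",
			   sizes[i], herd_wakeups(sizes[i]), chained_wakeups(sizes[i]));
	return 0;
}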


Unfortunately the way I have implemented this idea - releasing the lock on
the original XID and trying to obtain a lock on the XID of the last waiting
transaction - can lead to deadlocks.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Sample values for pg_stat_statements

2018-02-26 Thread Julian Markwort
Hi Vik,

this is my review of your patch. I hope I've ticked all the necessary
boxes.


Submission review:
Patch has context, applies cleanly, make and make check run
successfully, patch contains tests for the added functionality.
The patch doesn't seem to contain any documentation regarding the new
columns.

I think the patch could be shorter due to some not really necessary
whitespace changes, e.g. lines 414 ff. in the pgss.1.patch file.
Modifying the first test for '!entry' to '!entry || need_params' in
line 628 and adding another test for '!entry' later in the file leads
to many unnecessarily changed lines, because they are simply indented
one step further (prominently noticeable with comments, e.g. lines 672-
678 and 733-739).
I'd like to see the check for 'need_params' have its own block,
leaving the existing block largely intact.
This could happen after the original 'if(!entry)' block, at which
point you can be sure that an entry already exists, so there is no need
for the duplicated code that stores params and consts in the query text
file and their references in the entry.


Usability review:
The patch fulfills its goal. The new columns consist of arrays of text
as well as an array of regtypes. Unfortunately, this makes the
pg_stat_statements view even more cluttered and confusing. (The view
was already very cluttered before your patch; the best solution is
probably not to 'SELECT * FROM pg_stat_statements;' at all...)

Regarding the security implications that I can think of, this patch
behaves in a similar fashion to the rest of pg_stat_statements, showing
the consts, params, and param_types only to users with proper access
rights and only if the showtext flag is set.
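
To make the new columns concrete for readers following along, they can be
inspected with a query like this (a sketch; consts, params and param_types
are the columns added by the patch, the rest is the existing view):

    SELECT query, consts, params, param_types
      FROM pg_stat_statements
     ORDER BY total_time DESC
     LIMIT 5;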


Feature test:
The feature works as advertised and does not seem to lead to any assert
failures or memory management errors.
Manual testing indicates that data is properly persisted through
database shutdowns and restarts.




If the intended purpose is to have some basic idea of the kinds of
values that are used with certain statements, I'd like to suggest that
you take a look at my own patch in the current commitfest, which
implements the tracking of good and bad plans in pg_stat_statements. My
approach not only shows the values that were used when the statement
was executed for the first time (with regard to the lifetime of the
pg_stat_statements tracked data), but also shows values from possibly
more recent executions of the statement, and offers the possibility to
identify values leading to very good or very poor performance.

Maybe we can combine our efforts; here is one idea that came to my
mind:
- Store the parameters of a statement if the execution led to a new
slower or faster plan.
- Provide a function that allows users to take the jumbled query
expression and have the database explain it, based on the parameters
that were recorded previously.

Kind regards
Julian Markwort



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire  wrote:
> On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada  
> wrote:
>> Here is review comment for v4 patch.
>>
>> @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel,
>> LVRelStats *vacrelstats)
>>  * We don't insert a vacuum delay point here, because we 
>> have an
>>  * exclusive lock on the table which we want to hold
>> for as short a
>>  * time as possible.  We still need to check for
>> interrupts however.
>> +* We might have to acquire the autovacuum lock,
>> however, but that
>> +* shouldn't pose a deadlock risk.
>>  */
>> CHECK_FOR_INTERRUPTS();
>>
>> Is this change necessary?
>
> I don't recall doing that change. Maybe a rebase gone wrong.
>
>> 
>> +   /*
>> +* If there are no indexes then we should periodically
>> vacuum the FSM
>> +* on huge relations to make free space visible early.
>> +*/
>> +   if (nindexes == 0 &&
>> +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
>> vacuum_fsm_every_pages)
>> +   {
>> +   /* Vacuum the Free Space Map */
>> +   FreeSpaceMapVacuum(onerel, max_freespace);
>> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
>> +   max_freespace = 0;
>> +   }
>>
>> I think this code block should be executed before we check if the page
>> is whether new or empty and then do 'continue'. Otherwise we cannot
>> reach this code if the table has a lot of new or empty pages.
>
> In order for the counter (vacuumed_pages) to increase, there have to
> be plenty of opportunities for this code to run, and I specifically
> wanted to avoid vacuuming the FSM too often for those cases
> particularly (when Vacuum scans lots of pages but does no writes).
>
>> 
>> @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
>>   * If minValue > 0, the updated page is also searched for a page with at
>>   * least minValue of free space. If one is found, its slot number is
>>   * returned, -1 otherwise.
>> + *
>> + * If minValue == 0, the value at the root node is returned.
>>   */
>>  static int
>>  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
>> @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr,
>> uint16 slot,
>>
>> addr.level == FSM_BOTTOM_LEVEL,
>>true);
>> }
>> +   else
>> +   newslot = fsm_get_avail(page, 0);
>>
>> I think it's not good idea to make fsm_set_and_search return either a
>> slot number or a category value according to the argument. Category
>> values is actually uint8 but this function returns it as int.
>
> Should be fine, uint8 can be contained inside an int in all platforms.
>
>> Also we
>> can call fsm_set_and_search with minValue = 0 at
>> RecordAndGetPageWithFreeSpace() when search_cat is 0. With you patch,
>> fsm_set_and_search() then returns the root value but it's not correct.
>
> I guess I can add another function to do that.
>
>> Also, is this change necessary for this patch? ISTM this change is
>> used for the change in fsm_update_recursive() as follows but I guess
>> this change can be a separate patch.
>>
>> +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
>> +
>> +   /*
>> +* Bubble up, not the value we just set, but the one now in the root
>> +* node of the just-updated page, which is the page's highest value.
>> +*/
>
> I can try to separate them I guess.

Patch set with the requested changes attached.

Patch 2 didn't change AFAIK; it's just rebased on top of the new patch 1.
From f9c5c0e0c8a79d241796a9b26d305c2b18bb665d Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/3] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is highest. This allows some
partial work to be made visible without incurring quadratic cost.

In any c

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 1:32 PM, Claudio Freire  wrote:
> On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire  
> wrote:
>> On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada  
>> wrote:
>>> Here is review comment for v4 patch.
>>>
>>> @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel,
>>> LVRelStats *vacrelstats)
>>>  * We don't insert a vacuum delay point here, because we 
>>> have an
>>>  * exclusive lock on the table which we want to hold
>>> for as short a
>>>  * time as possible.  We still need to check for
>>> interrupts however.
>>> +* We might have to acquire the autovacuum lock,
>>> however, but that
>>> +* shouldn't pose a deadlock risk.
>>>  */
>>> CHECK_FOR_INTERRUPTS();
>>>
>>> Is this change necessary?
>>
>> I don't recall doing that change. Maybe a rebase gone wrong.
>>
>>> 
>>> +   /*
>>> +* If there are no indexes then we should periodically
>>> vacuum the FSM
>>> +* on huge relations to make free space visible early.
>>> +*/
>>> +   if (nindexes == 0 &&
>>> +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
>>> vacuum_fsm_every_pages)
>>> +   {
>>> +   /* Vacuum the Free Space Map */
>>> +   FreeSpaceMapVacuum(onerel, max_freespace);
>>> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
>>> +   max_freespace = 0;
>>> +   }
>>>
>>> I think this code block should be executed before we check if the page
>>> is whether new or empty and then do 'continue'. Otherwise we cannot
>>> reach this code if the table has a lot of new or empty pages.
>>
>> In order for the counter (vacuumed_pages) to increase, there have to
>> be plenty of opportunities for this code to run, and I specifically
>> wanted to avoid vacuuming the FSM too often for those cases
>> particularly (when Vacuum scans lots of pages but does no writes).
>>
>>> 
>>> @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
>>>   * If minValue > 0, the updated page is also searched for a page with at
>>>   * least minValue of free space. If one is found, its slot number is
>>>   * returned, -1 otherwise.
>>> + *
>>> + * If minValue == 0, the value at the root node is returned.
>>>   */
>>>  static int
>>>  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
>>> @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr,
>>> uint16 slot,
>>>
>>> addr.level == FSM_BOTTOM_LEVEL,
>>>true);
>>> }
>>> +   else
>>> +   newslot = fsm_get_avail(page, 0);
>>>
>>> I think it's not good idea to make fsm_set_and_search return either a
>>> slot number or a category value according to the argument. Category
>>> values is actually uint8 but this function returns it as int.
>>
>> Should be fine, uint8 can be contained inside an int in all platforms.
>>
>>> Also we
>>> can call fsm_set_and_search with minValue = 0 at
>>> RecordAndGetPageWithFreeSpace() when search_cat is 0. With you patch,
>>> fsm_set_and_search() then returns the root value but it's not correct.
>>
>> I guess I can add another function to do that.
>>
>>> Also, is this change necessary for this patch? ISTM this change is
>>> used for the change in fsm_update_recursive() as follows but I guess
>>> this change can be a separate patch.
>>>
>>> +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
>>> +
>>> +   /*
>>> +* Bubble up, not the value we just set, but the one now in the root
>>> +* node of the just-updated page, which is the page's highest value.
>>> +*/
>>
>> I can try to separate them I guess.
>
> Patch set with the requested changes attached.
>
> 2 didn't change AFAIK, it's just rebased on top of the new 1.

Oops. Forgot about the comment changes requested. Fixed those in v6, attached.
From 3bde81cd9a0474e65c487899ddd290ad896b5feb Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/3] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When t

jsonlog logging only some messages?

2018-02-26 Thread Greg Stark
I tried loading the jsonlog module from
https://github.com/michaelpq/pg_plugins into Postgres 9.6.

However, it seems it resulted in logs only for session log messages, but
not for any background worker log messages. We have log_checkpoints set,
but there were no log messages in the json log about checkpoints, nor
were there any from autovacuum.

Also, I have log_destination set to stderr,csvlog and, as I understand
it, the jsonlog module effectively overrides the "stderr" output, which
goes to the file named *.log (which I find super confusing,
incidentally). But I was expecting to get the csv file as well. We
didn't; we only got the *.log file with the (partial) json logs. Is
there something I'm missing here, or is this unexpected?
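
For completeness, this is roughly the configuration in play (a sketch,
with illustrative values):

    # postgresql.conf
    shared_preload_libraries = 'jsonlog'
    logging_collector = on
    log_destination = 'stderr,csvlog'
    log_checkpoints = on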

-- 
greg



Re: [HACKERS] path toward faster partition pruning

2018-02-26 Thread Robert Haas
On Sun, Feb 25, 2018 at 11:10 PM, Amit Langote
 wrote:
> I think I'm convinced that partopcintype OIDs can be used where I thought
> parttypid ones were necessary.  The pruning patch uses the respective OID
> from this array when extracting the datum from an OpExpr to be compared
> with the partition bound datums.  It's sensible, I now think, to require
> the extracted datum to be of partition opclass declared input type, rather
> than the type of the partition key involved.  So, I removed the parttypid
> that I'd added to PartitionSchemeData.
>
> Updated the comments to make clear the distinction between and purpose of
> having both parttypcoll and partcollation.  Also expanded the comment
> about partsupfunc a bit.

I don't think this fundamentally fixes the problem, although it does
narrow it.  By requiring partcollation to match across every relation
with the same PartitionScheme, you're making partition-wise join fail
to work in some cases where it previously did.  Construct a test case
where parttypcoll matches and partcollation doesn't; then, without the
patch, the two relations will have the same PartitionScheme and thus
be eligible for a partition-wise join, but with the patch, they will
have different PartitionSchemes and thus won't.
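
For instance, something along these lines (an untested sketch; partitions
and bounds are omitted, and the collations are chosen only to make the
catalog fields differ):

    CREATE TABLE t1 (a text COLLATE "C") PARTITION BY RANGE (a);
    CREATE TABLE t2 (a text COLLATE "C") PARTITION BY RANGE (a COLLATE "POSIX");
    -- parttypcoll is the same for t1 and t2, but partcollation differs,
    -- so with the patch they no longer share a PartitionScheme.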

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Precision loss casting float to numeric

2018-02-26 Thread Tom Lane
Chapman Flack  writes:
> The 0002-*.patch is a proof-of-concept patching float4_numeric and
> float8_numeric in the trivial way (just using FLT_DECIMAL_DIG and
> DBL_DECIMAL_DIG in place of FLT_DIG and DBL_DIG). It makes the new
> regression test pass. (It will only work under a compiler that has
> __FLT_DECIMAL_DIG__ and __DBL_DECIMAL_DIG__ available, and I used
> those internal versions to avoid mucking with build tooling to change
> the target C standard, which I assume wouldn't be welcome anyway.

Nope.  TBH, I'd think about just using "DBL_DIG + 3", given our existing
coding around extra_float_digits in places like pg_dump and postgres_fdw.
The knowledge that you need 2 or 3 extra digits is already well embedded.

Conceivably you could do it like

#ifndef DBL_DECIMAL_DIG
#ifdef __DBL_DECIMAL_DIG__
#define DBL_DECIMAL_DIG __DBL_DECIMAL_DIG__
#else
#define DBL_DECIMAL_DIG (DBL_DIG + 3)
#endif
#endif

but I'm not exactly seeing how that buys us anything.

The bigger question here is whether people actually want this behavioral
change.  I think there's probably a bigger chance of complaints that
"casting 1.1::float8 to numeric now produces some weird,
incorrectly-rounded result" than that we make anyone happier.

I have a vague idea that at some point in the past we discussed making
this conversion use extra_float_digits, which'd allow satisfying both
camps, at the nontrivial price that the conversion would have to be
considered stable not immutable.  We didn't pull the trigger, if this
memory is real at all, presumably because of the mutability issue.

Another idea would be to leave the cast alone and introduce a named
function that does the "exact" conversion.  Possibly that makes nobody
happy, but at least both the cast and the function could be immutable.
It'd dodge backwards-compatibility objections, too.

regards, tom lane



Re: SSL passphrase prompt external command

2018-02-26 Thread Peter Eisentraut
On 2/26/18 01:32, Daniel Gustafsson wrote:
> +replaced by a prompt string.  (Write %% for a
> +literal %.)  Note that the prompt string will
> 
> I might be thick, but I don’t see where the %% is handled?

Ah, I had broken that in my submitted patch.  New patch attached.

> Also, AFAICT a string
> ending with %\0 will print a literal % without requiring %% (which may be a
> perfectly fine case to allow, depending on how strict we want to be with the
> format).

That is intentional.  I made the behavior match that of archive_command,
restore_command, etc.  It's not quite how printf works, but we might as
well stick with what we already have.
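
For illustration, a setting could end up looking something like this (the
helper program is of course made up):

    ssl_passphrase_command = '/usr/local/bin/fetch-ssl-passphrase "%p"'

%p is replaced by the prompt string, and a literal % has to be written as
%%, just as with archive_command and friends.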

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From af787163a2535ee552e9645941d3844afbeb04d4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 26 Feb 2018 13:28:38 -0500
Subject: [PATCH v2] Add ssl_passphrase_command setting

This allows specifying an external command for prompting for or
otherwise obtaining passphrases for SSL key files.  This is useful
because in many cases there is no TTY easily available during service
startup.
---
 doc/src/sgml/config.sgml  |  45 ++
 src/backend/libpq/Makefile|   2 +-
 src/backend/libpq/be-secure-common.c  | 125 ++
 src/backend/libpq/be-secure-openssl.c |  58 +---
 src/backend/libpq/be-secure.c |   1 +
 src/backend/utils/misc/guc.c  |  10 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/libpq/libpq.h |   7 ++
 src/test/ssl/Makefile |   5 ++
 src/test/ssl/README   |   3 +
 src/test/ssl/ServerSetup.pm   |   2 +-
 src/test/ssl/ssl/server-password.key  |  18 
 src/test/ssl/t/001_ssltests.pl|  35 ++--
 src/tools/msvc/Mkvcbuild.pm   |   1 +
 14 files changed, 292 insertions(+), 21 deletions(-)
 create mode 100644 src/backend/libpq/be-secure-common.c
 create mode 100644 src/test/ssl/ssl/server-password.key

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 00fc364c0a..3ae9eb5b3e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1312,6 +1312,51 @@ SSL

   
  
+
+ 
+  ssl_passphrase_command (string)
+  
+   ssl_passphrase_command configuration 
parameter
+  
+  
+  
+   
+Sets an external command to be invoked when a passphrase for
+decrypting an SSL file such as a private key needs to be obtained.  By
+default, this parameter is empty, which means the built-in prompting
+mechanism is used.
+   
+   
+The command must print the passphrase to the standard output and exit
+with code 0.  In the parameter value, %p is
+replaced by a prompt string.  (Write %% for a
+literal %.)  Note that the prompt string will
+probably contain whitespace, so be sure to quote adequately.  A single
+newline is stripped from the end of the output if present.
+   
+   
+The command does not actually have to prompt the user for a
+passphrase.  It can read it from a file, obtain it from a keychain
+facility, or similar.  It is up to the user to make sure the chosen
+mechanism is adequately secure.
+   
+   
+The passphrase command will also be called during a configuration
+reload if a key file needs a passphrase.  If the setting starts with
+-, then it will be ignored during a reload and the
+SSL configuration will not be reloaded if a passphrase is needed.
+(The - will not be part of the actual command
+called.)  The latter setting is appropriate for a command that
+requires a TTY for prompting, which might not be available when the
+server is running.  The default behavior with an empty setting is also
+not to prompt during a reload.
+   
+   
+This parameter can only be set in the 
postgresql.conf
+file or on the server command line.
+   
+  
+ 
 
 

diff --git a/src/backend/libpq/Makefile b/src/backend/libpq/Makefile
index 7fa2b02743..3dbec23e30 100644
--- a/src/backend/libpq/Makefile
+++ b/src/backend/libpq/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 # be-fsstubs is here for historical reasons, probably belongs elsewhere
 
-OBJS = be-fsstubs.o be-secure.o auth.o crypt.o hba.o ifaddr.o pqcomm.o \
+OBJS = be-fsstubs.o be-secure.o be-secure-common.o auth.o crypt.o hba.o 
ifaddr.o pqcomm.o \
pqformat.o pqmq.o pqsignal.o auth-scram.o
 
 ifeq ($(with_openssl),yes)
diff --git a/src/backend/libpq/be-secure-common.c 
b/src/backend/libpq/be-secure-common.c
new file mode 100644
index 00..4ff85c81d4

2018-03 CFM

2018-02-26 Thread David Steele
Hackers,

Just a few days left until the last Commitfest for the PG11 release begins!

I'm planning to fill the CFM role, unless there are objections.

Regards,
-- 
-David
da...@pgmasters.net



Re: prokind column (was Re: [HACKERS] SQL procedures)

2018-02-26 Thread Peter Eisentraut
On 2/24/18 14:08, Tom Lane wrote:
> I took a quick look through this and noted a few small problems; the
> only one worth mentioning here is that you've broken psql tab completion
> for functions and aggregates when talking to a pre-v11 server.
> I don't think that's acceptable; however, since tab-complete.c has no
> existing provisions for adjusting its queries based on server version,
> it's not really clear what to do to fix it.  I'm sure that's soluble
> (and I think I recall a recent thread bumping up against the same
> problem for another change), but it needs a bit more sweat.

The patches proposed in the thread "PATCH: psql tab completion for
SELECT" appear to add support for version-dependent tab completion, so
this could be resolved "later".

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Precision loss casting float to numeric

2018-02-26 Thread Robert Haas
On Mon, Feb 26, 2018 at 1:29 PM, Tom Lane  wrote:
> The bigger question here is whether people actually want this behavioral
> change.  I think there's probably a bigger chance of complaints that
> "casting 1.1::float8 to numeric now produces some weird,
> incorrectly-rounded result" than that we make anyone happier.

Yeah, I worry about that, too.

Of course, as you know, I also have a deep and abiding skepticism
about IEEE binary floats as a concept.  Anomalies are unavoidable; we
can choose exactly which set users experience, but we can't eliminate
them, because the underlying storage format is poorly adapted to the
behavior people actually want.  It's too bad that IEEE decimal floats
weren't standardized until 2008.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: prokind column (was Re: [HACKERS] SQL procedures)

2018-02-26 Thread Tom Lane
Peter Eisentraut  writes:
> On 2/24/18 14:08, Tom Lane wrote:
>> I took a quick look through this and noted a few small problems; the
>> only one worth mentioning here is that you've broken psql tab completion
>> for functions and aggregates when talking to a pre-v11 server.
>> I don't think that's acceptable; however, since tab-complete.c has no
>> existing provisions for adjusting its queries based on server version,
>> it's not really clear what to do to fix it.  I'm sure that's soluble
>> (and I think I recall a recent thread bumping up against the same
>> problem for another change), but it needs a bit more sweat.

> The patches proposed in the thread "PATCH: psql tab completion for
> SELECT" appear to add support for version-dependent tab completion, so
> this could be resolved "later".

I'm not sure that other patch will get in; AFAICS it's incomplete and
rather controversial.  But I guess we could put this issue on the
open-items list so we don't forget.

Anyway, once the release dust settles, I'll try to do a proper review
of this patch.  It'd be good if we could get it in this week before
the CF starts, so that any affected patches could be rebased.

regards, tom lane



Re: 2018-03 CFM

2018-02-26 Thread Tom Lane
David Steele  writes:
> I'm planning to fill the CFM role, unless there are objections.

Thanks for volunteering!

regards, tom lane



Re: jsonpath

2018-02-26 Thread Oleg Bartunov
On Mon, Feb 26, 2018 at 6:34 PM, Nikita Glukhov  wrote:
> Attached 10th version of the jsonpath patches.
>
> 1. Fixed error handling in arithmetic operators.
>
>Now run-time errors in arithmetic operators are caught (added
>PG_TRY/PG_CATCH around the operators' function calls) and converted into
>Unknown values in predicates, as required by the standard:
>
>=# SELECT jsonb '[1,0,2]' @* '$[*] ? (1 / @ > 0)';
> ?column?
>--
> 1
> 2
>(2 rows)
>
> 2. Fixed grammar for parenthesized expressions.
>
> 3. Refactored GIN support for jsonpath operators.
>
> 4. Added one more operator json[b] @# jsonpath returning singleton json[b]
> with
>automatic conditional wrapping of sequences with more than one element
> into
>arrays:
>
>=# SELECT jsonb '[1,2,3,4,5]' @# '$[*] ? (@ > 2)';
> ?column?
>---
> [3, 4, 5]
>(1 row)
>
>=# SELECT jsonb '[1,2,3,4,5]' @# '$[*] ? (@ > 4)';
> ?column?
>--
> 5
>(1 row)
>
>=# SELECT jsonb '[1,2,3,4,5]' @# '$[*] ? (@ > 5)';
> ?column?
>--
> (null)
>(1 row)
>
>Existing set-returning operator json[b] @* jsonpath is also very useful
> but
>can't be used in functional indices like new operator @#.
>
>Note that conditional wrapping of @# differs from the wrapping in
>JSON_QUERY(... WITH [ARRAY] WRAPPER), where only singleton objects and
>arrays are not wrapped.  Unconditional wrapping can be emulated with our
>array construction feature (see below).
>
> 5. Changed time zone behavior in .datetime() item method.
>
>In the previous version of the patch timestamptz SQL/JSON items were
>serialized into JSON string items using session time zone.  This behavior
>did not allow jsonpath operators to be marked as immutable, and therefore
>they could not be used in functional indices.  Also, when the time zone
> was
>not specified in the input string, but TZM or TZH format fields were
> present
>in the format string, session time zone was used as a default for
>timestamptz items.
>
>To make jsonpath operators immutable we decided to save input time zone
> for
>timestamptz items and disallow automatic time zone assignment.  Also
>additional parameter was added to .datetime() for default time zone
>specification:
>
>=# SET timezone = '+03';
>SET
>
>=# SELECT jsonb '"10-03-2017 12:34:56"' @*
>'$.datetime("DD-MM- HH24:MI:SS TZH")';
>ERROR:  Invalid argument for SQL/JSON datetime function
>
>=# SELECT jsonb '"10-03-2017 12:34:56"' @*
>'$.datetime("DD-MM- HH24:MI:SS TZH", "+05")';
>  ?column?
>-
> "2017-03-10T12:34:56+05:00"
>(1 row)
>
>=# SELECT jsonb '"10-03-2017 12:34:56 +05"' @*
>'$.datetime("DD-MM- HH24:MI:SS TZH")';
>  ?column?
>-
> "2017-03-10T12:34:56+05:00"
>(1 row)
>
>Please note that our .datetime() behavior is not standard now: by the
>standard, input and format strings must match exactly, i.e. they both
> should
>not contain trailing unmatched elements, so automatic time zone
> assignment
>is impossible.  But it is too restrictive for PostgreSQL users, so we
> decided
>to preserve usual PostgreSQL behavior here:
>
>=# SELECT jsonb '"10-03-2017"' @* '$.datetime("DD-MM- HH24:MI:SS")';
>   ?column?
>---
> "2017-03-10T00:00:00"
>(1 row)

I think someday we should consider adding support for SQL standard
conforming datetime.  Since it breaks the current Postgres behaviour, we
would need a 'standard_conforming_datetime' GUC.


>
>
>Also PostgreSQL is able to automatically recognize format of the input
>string for the specified datetime type, but we can only bring this
> behavior
>into jsonpath by introducing separate item methods .date(), .time(),
>.timetz(), .timestamp() and .timestamptz().   Also we can use here our
>unfinished feature that gives us the ability to work with PostgreSQL types in
>jsonpath using cast operator :: (see sqljson_ext branch in our github
> repo):
>
>=# SELECT jsonb '"10/03/2017 12:34"' @* '$::timestamptz';
>  ?column?
>-
> "2017-03-10T12:34:00+03:00"
>(1 row)

Another note: we decided to preserve the TZ in JSON_QUERY and follow the
standard Postgres behaviour in JSON_VALUE, since JSON_QUERY returns a JSON
object while JSON_VALUE returns an SQL value.

SELECT JSON_QUERY(jsonb '"2018-02-21 17:01:23 +05"',
'$.datetime("-MM-DD HH24:MI:SS TZH")');
 json_query
-
 "2018-02-21T17:01:23+05:00"
(1 row)

show timezone;
 TimeZone
--
 W-SU
(1 row)

SELECT JSON_VALUE(jsonb '"2018-02-21 17:01:23 +05"',
'$.datetime("-MM-DD HH24:MI:SS TZH")');
   json_value

 2018-02-21 15:01:23+03
(1 row)
>
>
>
> A br

Re: Precision loss casting float to numeric

2018-02-26 Thread Chapman Flack
On 02/26/2018 01:29 PM, Tom Lane wrote:

> I think there's probably a bigger chance of complaints that
> "casting 1.1::float8 to numeric now produces some weird,
> incorrectly-rounded result" than that we make anyone happier.

Arguably, though, that's a moment that can be used to explain
exactly what the correctly-rounded value of 1.1::float8 is and
why, and then people both know something new and understand more
precisely what's happening to their data, and can apply round()
to it in exactly the places they want, if they want.

In contrast, the current fact that 1.1::float8 looks like 1.1
when cast to numeric puts a superficial smile on people's faces,
while they haven't really been asked how they feel about losing
five, six, or thirty-eight bits of precision when casting one data
type into another of supposedly greater precision and back. I think
the typical assumption is that, sure, you may lose precision
if you cast to a *less* precise type, but the other direction's
assumed value-preserving.
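
To spell that out with a concrete case (this is my understanding of the
current DBL_DIG-digit conversion):

    SELECT (0.1::float8 + 0.2::float8)::numeric;        -- shows 0.3
    SELECT (0.1::float8 + 0.2::float8)::numeric::float8
         = (0.1::float8 + 0.2::float8);                 -- false: the round
                                                        -- trip comes back as
                                                        -- a different double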

I can see the concern about changing behavior for code that may
exist already. I would never have thought of making the behavior
of a cast sensitive to extra_float_digits (in fact, I hadn't
known about extra_float_digits; it's described in the "locale
and formatting" section, which I never dreamed of consulting
for a cast between internal value representations; am I weird
in that?).

I wonder if an alternative to making a cast that can't be immutable,
because it looks at a GUC, could be to offer a choice of cast
functions: if you need the other behavior, create a schema, do a
CREATE CAST in it, and put it on your search path ahead of pg_catalog.
Kind of ugly, but that can happen dealing with kind of ugly situations.

-Chap



Re: invalid memory alloc request size error with commit 4b93f579

2018-02-26 Thread Tom Lane
Rushabh Lathia  writes:
> In ExecBRUpdateTriggers(), we need to add a check that if trigtuple is same
> as newtuple, then we don't need to free the trigtuple.

Hm.  Seems like this is a very old bug: it's always been legal for a
trigger to return the "old" tuple if it felt like it, even if plpgsql
didn't happen to exercise that case.

Because of that angle, I'm not really happy with using plpgsql as
part of the test case.  The bug ought to be repaired in the back
branches too, but this test will prove little in the back branches.
Moreover, if somebody were to rejigger plpgsql again, the test might
stop proving anything at all.

I wonder whether it is worth creating a C trigger function
(probably in regress.c) specifically to exercise this situation.
If not, I'm inclined not to bother with adding a test case.

regards, tom lane



Re: Precision loss casting float to numeric

2018-02-26 Thread Tom Lane
Chapman Flack  writes:
> I wonder if an alternative to making a cast that can't be immutable,
> because it looks at a GUC, could be to offer a choice of cast
> functions: if you need the other behavior, create a schema, do a
> CREATE CAST in it, and put it on your search path ahead of pg_catalog.

Nope, because casts aren't schema-local, since they don't have names.
There can be only one cast between given source and target types.

regards, tom lane



Re: Precision loss casting float to numeric

2018-02-26 Thread Bear Giles
On Mon, Feb 26, 2018 at 11:29 AM, Tom Lane  wrote:

> Chapman Flack  writes:
> > The 0002-*.patch is a proof-of-concept patching float4_numeric and
> > float8_numeric in the trivial way (just using FLT_DECIMAL_DIG and
> > DBL_DECIMAL_DIG in place of FLT_DIG and DBL_DIG). It makes the new
> > regression test pass. (It will only work under a compiler that has
> > __FLT_DECIMAL_DIG__ and __DBL_DECIMAL_DIG__ available, and I used
> > those internal versions to avoid mucking with build tooling to change
> > the target C standard, which I assume wouldn't be welcome anyway.
>
> Nope.  TBH, I'd think about just using "DBL_DIG + 3", given our existing
> coding around extra_float_digits in places like pg_dump and postgres_fdw.
> The knowledge that you need 2 or 3 extra digits is already well embedded.
>
> Conceivably you could do it like
>
> #ifndef DBL_DECIMAL_DIG
> #ifdef __DBL_DECIMAL_DIG__
> #define DBL_DECIMAL_DIG __DBL_DECIMAL_DIG__
> #else
> #define DBL_DECIMAL_DIG (DBL_DIG + 3)
> #endif
> #endif
>
> but I'm not exactly seeing how that buys us anything.
>
> The bigger question here is whether people actually want this behavioral
> change.  I think there's probably a bigger chance of complaints that
> "casting 1.1::float8 to numeric now produces some weird,
> incorrectly-rounded result" than that we make anyone happier.
>
> I have a vague idea that at some point in the past we discussed making
> this conversion use extra_float_digits, which'd allow satisfying both
> camps, at the nontrivial price that the conversion would have to be
> considered stable not immutable.  We didn't pull the trigger, if this
> memory is real at all, presumably because of the mutability issue.
>
> Another idea would be to leave the cast alone and introduce a named
> function that does the "exact" conversion.  Possibly that makes nobody
> happy, but at least both the cast and the function could be immutable.
> It'd dodge backwards-compatibility objections, too.
>
> regards, tom lane
>

Working for a company that has enterprise customers, this can't be
overemphasized.  Never require users to do something extra just to keep
getting the same results they have always gotten.  It doesn't matter if
those results are "wrong".

I would vote for a GUC: if you want the best effort to match the IEEE
spec, you execute 'SET use_ieee_numbers' and you get the extra digits and
rounding behavior; if not, you get the existing behavior.

Bear


Re: invalid memory alloc request size error with commit 4b93f579

2018-02-26 Thread Tom Lane
I wrote:
> I wonder whether it is worth creating a C trigger function
> (probably in regress.c) specifically to exercise this situation.

Actually, that doesn't seem too bad at all.  I propose applying
and back-patching the attached.

BTW, I noticed while doing this that the adjacent "funny_dup17"
trigger is dead code, and has been since commit 1547ee01 of
1999-09-29.  I'm inclined to rip it out, because anyone looking
at regress.c would naturally assume that anything in there is
being exercised.

regards, tom lane

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index fffc009..fbd176b 100644
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
*** ExecBRUpdateTriggers(EState *estate, EPQ
*** 2815,2821 
  			return NULL;		/* "do nothing" */
  		}
  	}
! 	if (trigtuple != fdw_trigtuple)
  		heap_freetuple(trigtuple);
  
  	if (newtuple != slottuple)
--- 2815,2821 
  			return NULL;		/* "do nothing" */
  		}
  	}
! 	if (trigtuple != fdw_trigtuple && trigtuple != newtuple)
  		heap_freetuple(trigtuple);
  
  	if (newtuple != slottuple)
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index ce8fa21..4771069 100644
*** a/src/test/regress/expected/triggers.out
--- b/src/test/regress/expected/triggers.out
*** DROP TABLE fkeys2;
*** 153,158 
--- 153,184 
  -- select count(*) from dup17 where x = 13;
  --
  -- DROP TABLE dup17;
+ -- Check behavior when trigger returns unmodified trigtuple
+ create table trigtest (f1 int, f2 text);
+ create trigger trigger_return_old
+ 	before insert or delete or update on trigtest
+ 	for each row execute procedure trigger_return_old();
+ insert into trigtest values(1, 'foo');
+ select * from trigtest;
+  f1 | f2  
+ +-
+   1 | foo
+ (1 row)
+ 
+ update trigtest set f2 = f2 || 'bar';
+ select * from trigtest;
+  f1 | f2  
+ +-
+   1 | foo
+ (1 row)
+ 
+ delete from trigtest;
+ select * from trigtest;
+  f1 | f2 
+ +
+ (0 rows)
+ 
+ drop table trigtest;
  create sequence ttdummy_seq increment 10 start 0 minvalue 0;
  create table tttest (
  	price_id	int4,
diff --git a/src/test/regress/input/create_function_1.source b/src/test/regress/input/create_function_1.source
index cde78eb..6b6bcd8 100644
*** a/src/test/regress/input/create_function_1.source
--- b/src/test/regress/input/create_function_1.source
*** CREATE FUNCTION funny_dup17 ()
*** 42,47 
--- 42,52 
  AS '@libdir@/regress@DLSUFFIX@'
  LANGUAGE C;
  
+ CREATE FUNCTION trigger_return_old ()
+ RETURNS trigger
+ AS '@libdir@/regress@DLSUFFIX@'
+ LANGUAGE C;
+ 
  CREATE FUNCTION ttdummy ()
  RETURNS trigger
  AS '@libdir@/regress@DLSUFFIX@'
diff --git a/src/test/regress/output/create_function_1.source b/src/test/regress/output/create_function_1.source
index ab601be..9154b4a 100644
*** a/src/test/regress/output/create_function_1.source
--- b/src/test/regress/output/create_function_1.source
*** CREATE FUNCTION funny_dup17 ()
*** 39,44 
--- 39,48 
  RETURNS trigger
  AS '@libdir@/regress@DLSUFFIX@'
  LANGUAGE C;
+ CREATE FUNCTION trigger_return_old ()
+ RETURNS trigger
+ AS '@libdir@/regress@DLSUFFIX@'
+ LANGUAGE C;
  CREATE FUNCTION ttdummy ()
  RETURNS trigger
  AS '@libdir@/regress@DLSUFFIX@'
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 13e7207..521b181 100644
*** a/src/test/regress/regress.c
--- b/src/test/regress/regress.c
*** funny_dup17(PG_FUNCTION_ARGS)
*** 444,449 
--- 444,465 
  	return PointerGetDatum(tuple);
  }
  
+ PG_FUNCTION_INFO_V1(trigger_return_old);
+ 
+ Datum
+ trigger_return_old(PG_FUNCTION_ARGS)
+ {
+ 	TriggerData *trigdata = (TriggerData *) fcinfo->context;
+ 	HeapTuple	tuple;
+ 
+ 	if (!CALLED_AS_TRIGGER(fcinfo))
+ 		elog(ERROR, "trigger_return_old: not fired by trigger manager");
+ 
+ 	tuple = trigdata->tg_trigtuple;
+ 
+ 	return PointerGetDatum(tuple);
+ }
+ 
  #define TTDUMMY_INFINITY	99
  
  static SPIPlanPtr splan = NULL;
diff --git a/src/test/regress/sql/triggers.sql b/src/test/regress/sql/triggers.sql
index ae8349c..2bad469 100644
*** a/src/test/regress/sql/triggers.sql
--- b/src/test/regress/sql/triggers.sql
*** DROP TABLE fkeys2;
*** 138,143 
--- 138,159 
  --
  -- DROP TABLE dup17;
  
+ -- Check behavior when trigger returns unmodified trigtuple
+ create table trigtest (f1 int, f2 text);
+ 
+ create trigger trigger_return_old
+ 	before insert or delete or update on trigtest
+ 	for each row execute procedure trigger_return_old();
+ 
+ insert into trigtest values(1, 'foo');
+ select * from trigtest;
+ update trigtest set f2 = f2 || 'bar';
+ select * from trigtest;
+ delete from trigtest;
+ select * from trigtest;
+ 
+ drop table trigtest;
+ 
  create sequence ttdummy_seq increment 10 start 0 mi

Re: RTLD_GLOBAL (& JIT inlining)

2018-02-26 Thread Andres Freund
On 2018-02-23 11:05:23 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I think using RTLD_LOCAL on most machines would be a much better
> > idea. I've not found proper explanations why GLOBAL is used. We started
> > using it ages ago, with [2], but that commit contains no explanation,
> > and a quick search didn't show up anything either. Peter?
> 
> https://www.postgresql.org/message-id/7142.122...@sss.pgh.pa.us
> 
> My position is the same as then: I'm happy to remove it if it doesn't
> break things anywhere ... but it seems like it would cause problems for
> plpython, unless their behavior has changed since 2001 which is
> surely possible.

It hasn't, and it's not just plpython, at least plperl is affected as
well. I'd guess most libraries that have their own dynamic loading
support are affected.

So RTLD_LOCAL is out of the question, but I think we can get a good bit
of the benefit by either specifying -Wl,-Bsymbolic at shlib build time,
or RTLD_DEEPBIND at dlopen() time.  Either leads to the opened shared
library effectively being put at the beginning of the search path,
therefore avoiding the issue that an earlier-loaded shared library or
symbols from the main binary can accidentally override things in the
shared library itself.  Which incidentally also makes loading a bit
faster.

A bit of googling suggests -Bsymbolic is likely to be more portable.

A quick test confirms that under linux our tests pass with it, after
patching Makefile.shlib.
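
For reference, the dlopen() variant would amount to something like this
(a sketch only, not the actual dfmgr.c change; RTLD_DEEPBIND is a GNU
extension):

    /* sketch: load a shared library with its own symbols bound first */
    #define _GNU_SOURCE
    #include <dlfcn.h>

    static void *
    load_library_deepbind(const char *filename)
    {
        /* the library's own definitions win over earlier-loaded ones */
        return dlopen(filename, RTLD_NOW | RTLD_GLOBAL | RTLD_DEEPBIND);
    }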

Greetings,

Andres Freund



Re: RTLD_GLOBAL (& JIT inlining)

2018-02-26 Thread Ants Aasma
On Mon, Feb 26, 2018 at 11:28 PM, Andres Freund  wrote:
> So RTLD_LOCAL is out of the question, but I think we can get a good bit
> of the benefit by either specifying -Wl,-Bsymbolic at shlib build time,
> or RTLD_DEEPBIND at dlopen() time.  Either leads to the opened shared
> library effectively being put at the beginning of the search path,
> therefore avoiding the issue that an earlier loaded shared library or
> symbols from the main binary can accidentally overwrite things in the
> shared library itself. Which incidentally also makes loading a bit
> faster.

I think this would also fix oracle_fdw crashing when postgres is
compiled with --with-ldap. At least RTLD_DEEPBIND helped. [1]

[1] 
https://www.postgresql.org/message-id/CA%2BCSw_tPDYgnzCYW0S4oU0mTUoUhZ9pc7MRBPXVD-3Zbiwni9w%40mail.gmail.com

Ants Aasma



TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Patrick Krecker
It appears this was fixed back in 2014 with 750c5ee. I propose for it
to be removed from the TODO list.

Thanks,
Patrick



Re: [HACKERS] GSoC 2017: weekly progress reports (week 4) and patch for hash index

2018-02-26 Thread Thomas Munro
On Tue, Feb 13, 2018 at 7:47 PM, Thomas Munro
 wrote:
> One way to avoid all that might be to use pseudo page numbers that
> don't suffer from splits.   I don't know how you'd choose the
> constant, but it could be something like pseudo page number = hash
> value % 1024.  In other words, you'd use the full hash value for the
> 'tuple' part of the predicate lock tag, and a shorter hash value for
> the 'page' part of the predicate lock tag, so you'd never have to
> handle split, and promotion from fine grained 'tuple' (= hash value)
> locks to coarse grained 'page' = (short hash value) locks would still
> work automatically when space runs out.

Thinking about how to tune that got me thinking about a simple middle
way we could perhaps consider...

What if we just always locked pseudo page numbers using hash_value %
max_predicate_locks_per_relation (= effectively 31 by default)?  Then
you'd have lower collision rates on small hash indexes, you'd never
have to deal with page splits, and you'd never exceed
max_predicate_locks_per_relation so you'd never escalate to relation
level locks on busy systems.  On the downside, you'd have e.g. ~3%
chance of collision instead of a 1/hash_maxbucket chance of collision,
so it gets a bit worse for large indexes on systems that are not busy
enough to exceed max_predicate_locks_per_relation.  You win some, you
lose some...
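
In code it would boil down to something like this (just a sketch; the
function name and the way the limit is obtained are made up):

    #include "postgres.h"
    #include "storage/predicate.h"

    static void
    PredicateLockHashValue(Relation rel, uint32 hashvalue,
                           int limit, Snapshot snapshot)
    {
        /*
         * Collapse the hash value onto a bounded set of pseudo page
         * numbers, so bucket splits never need to transfer locks.
         */
        BlockNumber pseudo_page = (BlockNumber) (hashvalue % (uint32) limit);

        PredicateLockPage(rel, pseudo_page, snapshot);
    }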

-- 
Thomas Munro
http://www.enterprisedb.com



Re: TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Tatsuo Ishii
> It appears this was fixed back in 2014 with 750c5ee. I propose for it
> to be removed from the TODO list.

Yes, I confirmed it by using Ubuntu + libedit. I have not tested it on
Mac OS X yet though.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Use of term hostaddrs for multiple hostaddr values

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 01:22:01PM +, Vasundhar Boddapati wrote:
> I have gone through the comments, both look straight forward to go

Thanks for the input.  Please note that this has been pushed already as
the following commit:
commit: 5c15a54e851ecdd2b53e6d6a84f8ec0802ffc3cb
author: Magnus Hagander 
date: Sun, 21 Jan 2018 13:41:52 +0100
Fix wording of "hostaddrs"
--
Michael




Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Masahiko Sawada
On Tue, Feb 27, 2018 at 1:44 AM, Claudio Freire  wrote:
> On Mon, Feb 26, 2018 at 1:32 PM, Claudio Freire  
> wrote:
>> On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire  
>> wrote:
>>> On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada  
>>> wrote:
 Here is review comment for v4 patch.

 @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel,
 LVRelStats *vacrelstats)
  * We don't insert a vacuum delay point here, because we 
 have an
  * exclusive lock on the table which we want to hold
 for as short a
  * time as possible.  We still need to check for
 interrupts however.
 +* We might have to acquire the autovacuum lock,
 however, but that
 +* shouldn't pose a deadlock risk.
  */
 CHECK_FOR_INTERRUPTS();

 Is this change necessary?
>>>
>>> I don't recall doing that change. Maybe a rebase gone wrong.
>>>
 
 +   /*
 +* If there are no indexes then we should periodically
 vacuum the FSM
 +* on huge relations to make free space visible early.
 +*/
 +   if (nindexes == 0 &&
 +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
 vacuum_fsm_every_pages)
 +   {
 +   /* Vacuum the Free Space Map */
 +   FreeSpaceMapVacuum(onerel, max_freespace);
 +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
 +   max_freespace = 0;
 +   }

 I think this code block should be executed before we check if the page
 is whether new or empty and then do 'continue'. Otherwise we cannot
 reach this code if the table has a lot of new or empty pages.
>>>
>>> In order for the counter (vacuumed_pages) to increase, there have to
>>> be plenty of opportunities for this code to run, and I specifically
>>> wanted to avoid vacuuming the FSM too often for those cases
>>> particularly (when Vacuum scans lots of pages but does no writes).
>>>
 
 @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
   * If minValue > 0, the updated page is also searched for a page with at
   * least minValue of free space. If one is found, its slot number is
   * returned, -1 otherwise.
 + *
 + * If minValue == 0, the value at the root node is returned.
   */
  static int
  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr,
 uint16 slot,

 addr.level == FSM_BOTTOM_LEVEL,
true);
 }
 +   else
 +   newslot = fsm_get_avail(page, 0);

 I think it's not good idea to make fsm_set_and_search return either a
 slot number or a category value according to the argument. Category
 values is actually uint8 but this function returns it as int.
>>>
>>> Should be fine, uint8 can be contained inside an int in all platforms.
>>>
 Also we
 can call fsm_set_and_search with minValue = 0 at
 RecordAndGetPageWithFreeSpace() when search_cat is 0. With you patch,
 fsm_set_and_search() then returns the root value but it's not correct.
>>>
>>> I guess I can add another function to do that.
>>>
 Also, is this change necessary for this patch? ISTM this change is
 used for the change in fsm_update_recursive() as follows but I guess
 this change can be a separate patch.

 +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
 +
 +   /*
 +* Bubble up, not the value we just set, but the one now in the 
 root
 +* node of the just-updated page, which is the page's highest 
 value.
 +*/
>>>
>>> I can try to separate them I guess.
>>
>> Patch set with the requested changes attached.
>>
>> 2 didn't change AFAIK, it's just rebased on top of the new 1.
>
> Oops. Forgot about the comment changes requested. Fixed those in v6, attached.

Thank you for updating patches!

0001 patch looks good to me except for the following unnecessary empty lines.

+* If there are no indexes then we should periodically
vacuum the FSM
+* on huge relations to make free space visible early.
+*/
+   else if (nindexes == 0 && fsm_updated_pages >
vacuum_fsm_every_pages)
+   {
+   /* Vacuum the Free Space Map */
+   FreeSpaceMapVacuum(onerel, max_freespace);
+   fsm_updated_pages = 0;
+   max_freespace = 0;
+   }
+
+
+   /*

@@ -913,10 +967,14 @@ lazy_scan_heap(Relation onerel, int options,
LVRelStats *vacrelstats,

vmbuf

Re: invalid memory alloc request size error with commit 4b93f579

2018-02-26 Thread Daniel Gustafsson
> On 27 Feb 2018, at 05:25, Tom Lane  wrote:
> 
> I wrote:
>> I wonder whether it is worth creating a C trigger function
>> (probably in regress.c) specifically to exercise this situation.
> 
> Actually, that doesn't seem too bad at all.  I propose applying
> and back-patching the attached.

LGTM

> BTW, I noticed while doing this that the adjacent "funny_dup17"
> trigger is dead code, and has been since commit 1547ee01 of
> 1999-09-29.  I'm inclined to rip it out, because anyone looking
> at regress.c would naturally assume that anything in there is
> being exercised.

+1, yes please.  regress_dist_ptpath() and regress_path_dist() in regress.c
also seem to be dead, and have been so for..  quite some time.

cheers ./daniel




Re: TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Patrick Krecker
On Mon, Feb 26, 2018 at 4:54 PM, Tatsuo Ishii  wrote:
>> It appears this was fixed back in 2014 with 750c5ee. I propose for it
>> to be removed from the TODO list.
>
> Yes, I confirmed it by using Ubuntu + libedit. I have not tested it on
> Mac OS X yet though.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp

I tested on macOS 10.12.6.

Patrick



Re: 2018-03 CFM

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 02:03:56PM -0500, Tom Lane wrote:
> David Steele  writes:
> > I'm planning to fill the CFM role, unless there are objections.
> 
> Thanks for volunteering!

+1.  I have also added a new commit fest entry marked as future, which
will likely be used for 12 development:
https://commitfest.postgresql.org/18/
2018-09 is a tentative name, please don't take that as a decided date as
this will be discussed in Ottawa during the dev meeting.

This matters for hackers as once a CF is marked as in-progress, there is
no way to register a patch in it.  And in the case of this March's CF,
there would be no places where such future patches could go.
--
Michael




Re: TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Tatsuo Ishii
> On Mon, Feb 26, 2018 at 4:54 PM, Tatsuo Ishii  wrote:
>>> It appears this was fixed back in 2014 with 750c5ee. I propose for it
>>> to be removed from the TODO list.
>>
>> Yes, I confirmed it by using Ubuntu + libedit. I have not tested it on
>> Mac OS X yet though.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
> 
> I tested on macOS 10.12.6.

OK, let's mark it as "Done" (rather than removing it), if there's no
objection.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Unexpected behavior with transition tables in update statement trigger

2018-02-26 Thread Thomas Munro
On Tue, Feb 27, 2018 at 4:18 AM, Tom Kazimiers  wrote:
> On Mon, Feb 26, 2018 at 11:15:44PM +1300, Thomas Munro wrote:
>> On Sat, Feb 24, 2018 at 4:47 PM, Tom Kazimiers 
>> wrote:
>> Thanks for the reproducer.  Yeah, that seems to be a bug.
>> nodeNamedTuplestorescan.c allocates a new read pointer for each
>> separate scan of the named tuplestore, but it doesn't call
>> tuplestore_select_read_pointer() so that the two scans that appear in
>> your UNION ALL plan are sharing the same read pointer.  At first
>> glance the attached seems to fix the problem, but I'll need to look
>> more carefully tomorrow.
>
> Thanks very much for investigating this. I can confirm that applying your
> patch results in the tuples I expected in both my test trigger and my actual
> trigger function.

Thanks for testing.

> It would be great if this or a similar fix would make it into the next
> official release.

Here's a new version with tuplestore_select_read_pointer() added in
another place where it was lacking, and a commit message.  Moving to
-hackers, where patches go.

Here's a shorter repro.  On master it prints:

NOTICE:  count = 1
NOTICE:  count union = 1

With the patch the second number is 2, as it should be.

CREATE TABLE test (i int);
INSERT INTO test VALUES (1);

CREATE OR REPLACE FUNCTION my_trigger_fun() RETURNS trigger
LANGUAGE plpgsql AS
$$
  BEGIN
 RAISE NOTICE 'count = %', (SELECT COUNT(*) FROM new_test);
 RAISE NOTICE 'count union = %', (SELECT COUNT(*)
  FROM (SELECT * FROM new_test UNION ALL SELECT * FROM new_test) ss);
 RETURN NULL;
 END;
$$;

CREATE TRIGGER my_trigger AFTER UPDATE ON test
REFERENCING NEW TABLE AS new_test OLD TABLE as old_test
FOR EACH STATEMENT EXECUTE PROCEDURE my_trigger_fun();

UPDATE test SET i = i;

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Fix-tuplestore-read-pointer-confusion-in-nodeNamedtu.patch
Description: Binary data


Re: invalid memory alloc request size error with commit 4b93f579

2018-02-26 Thread Rushabh Lathia
Thanks Tom.

On Tue, Feb 27, 2018 at 2:55 AM, Tom Lane  wrote:

> I wrote:
> > I wonder whether it is worth creating a C trigger function
> > (probably in regress.c) specifically to exercise this situation.
>
> Actually, that doesn't seem too bad at all.  I propose applying
> and back-patching the attached.
>
>
Patch looks good to me.


> BTW, I noticed while doing this that the adjacent "funny_dup17"
> trigger is dead code, and has been since commit 1547ee01 of
> 1999-09-29.  I'm inclined to rip it out, because anyone looking
> at regress.c would naturally assume that anything in there is
> being exercised.
>
> regards, tom lane
>
>


-- 
Rushabh Lathia


Re: jsonlog logging only some messages?

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 05:38:56PM +, Greg Stark wrote:
> I tried loading the jsonlog module from
> https://github.com/michaelpq/pg_plugins into Postgres 9.6.
> 
> However it seems it resulted in logs only for session log messages but
> not any background worker log messages. We have log_checkpoints set
> but there were no log messages in the json log about checkpoints. Nor
> were there any from autovacuum.

Hm.  I have just loaded jsonlog on a 9.6 and 10 instance where
log_checkpoints is enabled with this background worker which logs a
simple string every 10s:
https://github.com/michaelpq/pg_plugins/tree/master/hello_world

Both checkpoint logs and the logs of the bgworker are showing up for me.

> Also I have log_destination set to stderr,cvslog and as I understand
> it the jsonlog module effectively overrides the "stderr" output which
> goes to the file named *.log (which I find super confusing,
> incidentally). But I was expecting to get the csv file as well. We
> didn't, we only got the *.log file with the (partial) json logs. Is
> there something I'm missing here or is this unexpected?

Yeah, that's unfortunately expected...  jsonlog forces
edata->output_to_server to false, which has the disadvantage of disabling
the extra log outputs, but also the advantage of preventing duplicate
entries in stderr.  One way that I can see to solve this would be to
patch the backend and replace output_to_server with a filter which
allows a plugin to choose which extra output types are allowed.  In
this case you don't want stderr later on, but would still want csvlog to
happen, instead of having an all-or-nothing switch.  I haven't tested it,
but it could also be possible to have jsonlog force Log_destination to
csvlog only after it generates its entry, so that stderr is not
duplicated but csvlog still gets through.  I am not sure how you would
reset the parameter afterwards though, so you may perhaps want to invoke
an extra plugin which outputs to csvlog as jsonlog cascades through
things.
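
To make the pattern clearer, what the hook does is roughly this
(simplified from memory, not a paste of jsonlog):

    #include "postgres.h"
    #include "utils/elog.h"

    static void
    jsonlog_emit(ErrorData *edata)
    {
        /* ... format edata as JSON and write it to the log file ... */

        /*
         * This is the part in question: once this flag is cleared, the
         * stderr and csvlog destinations no longer see the entry.
         */
        edata->output_to_server = false;
    }

    /* in _PG_init(): emit_log_hook = jsonlog_emit; */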
--
Michael




Re: TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Tom Lane
Tatsuo Ishii  writes:
> Ok, let's mark it as "Done.", (rather than removing it) if there's no
> objection.

I think that, back when we were actively maintaining the TODO list,
the idea was that items got marked as "Done" when committed and then
removed from the list after the fix was released.  In this case, since
the fix went out years ago, we might as well just remove the entry
immediately.

regards, tom lane



Re: TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Tatsuo Ishii
> I think that, back when we were actively maintaining the TODO list,
> the idea was that items got marked as "Done" when committed and then
> removed from the list after the fix was released.  In this case, since
> the fix went out years ago, we might as well just remove the entry
> immediately.

I would prefer to mark it done and then remove the item, just to leave
an editing history.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: prokind column (was Re: [HACKERS] SQL procedures)

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 02:03:19PM -0500, Tom Lane wrote:
> I'm not sure that other patch will get in; AFAICS it's incomplete and
> rather controversial.  But I guess we could put this issue on the
> open-items list so we don't forget.

+1.  We already know that we want to do a switch to prokind anyway, while
the other patch is still pending (I don't think much of it myself; I would
just recommend users to use a psql version matching the server's instead of
putting an extra maintenance load on psql for years to come).  So I would
recommend pushing forward with this one and fixing what we know we have to
fix, then requesting a rebase.  We gain nothing by leaving things in a
semi-broken state.  I'll be happy to help those folks rebase and/or write
the patch to update psql's tab completion for prokind as needed.

> Anyway, once the release dust settles, I'll try to do a proper review
> of this patch.  It'd be good if we could get it in this week before
> the CF starts, so that any affected patches could be rebased.

Here is some input from me.  I also don't like that prorettype is used to
determine whether the object is a procedure or a function.  The patch series
is very sensitive to changes in pg_proc.h, but it still applies correctly
when using bc1adc65 as the base commit.

The original commit adding support for procedures had this diff in
clauses.c:
@@ -4401,6 +4401,7 @@ inline_function(Oid funcid, Oid result_type, Oid 
result_collid,
if (funcform->prolang != SQLlanguageId ||
funcform->prosecdef ||
funcform->proretset ||
+   funcform->prorettype == InvalidOid ||
funcform->prorettype == RECORDOID ||

Perhaps this should be replaced with a check on prokind?
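
That is, the hunk above could become something like this (an untested
fragment, assuming the PROKIND_FUNCTION constant that the patch introduces):

 	if (funcform->prolang != SQLlanguageId ||
 		funcform->prosecdef ||
 		funcform->proretset ||
-		funcform->prorettype == InvalidOid ||
+		funcform->prokind != PROKIND_FUNCTION ||
 		funcform->prorettype == RECORDOID ||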

It seems to me that the same applies to fmgr_sql_validator().  In
information_schema.sql, type_udt_catalog uses a similar comparison, so this
should have a comment about why using prorettype makes more sense there.
In short, for all those things, it is fine to use prorettype when working
directly on the result type, like what is done in plperl and plpython.
For all the code paths working on the object type, I think that using
prokind should be the way to go.

getProcedureTypeDescription() should use an enum instead, complaining
with elog(ERROR) if the type found is something else?

I think that get_func_isagg should be dropped and replaced by
get_func_prokind.
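
For reference, such a helper could look much like the existing lsyscache.c
accessors (a sketch assuming the patch's prokind column and its
one-character kind codes; not actual committed code):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_proc.h"
#include "utils/syscache.h"

/*
 * get_func_prokind
 *	  Given the OID of a pg_proc entry, return its prokind character.
 */
char
get_func_prokind(Oid funcid)
{
	HeapTuple	tp;
	char		result;

	tp = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
	if (!HeapTupleIsValid(tp))
		elog(ERROR, "cache lookup failed for function %u", funcid);
	result = ((Form_pg_proc) GETSTRUCT(tp))->prokind;
	ReleaseSysCache(tp);
	return result;
}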
--
Michael


signature.asc
Description: PGP signature


Re: invalid memory alloc request size error with commit 4b93f579

2018-02-26 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 27 Feb 2018, at 05:25, Tom Lane  wrote:
>> BTW, I noticed while doing this that the adjacent "funny_dup17"
>> trigger is dead code, and has been since commit 1547ee01 of
>> 1999-09-29.  I'm inclined to rip it out, because anyone looking
>> at regress.c would naturally assume that anything in there is
>> being exercised.

> +1, yes please.  regress_dist_ptpath() and regress_path_dist() in regress.c
> also seem to be dead, and have been so for..  quite some time.

Yeah.  Looking at
https://coverage.postgresql.org/src/test/regress/regress.c.gcov.html
it's evident that none of these functions are actually exercised
in the regression tests:

regress_dist_ptpath unreferenced anywhere
regress_path_dist   unreferenced anywhere
poly2path   unreferenced anywhere
widget_in   used in type definition, but no input ever happens
widget_out  used in type definition, but no output ever happens
pt_in_widgetused for operator that evidently isn't called
boxarea SQL function is created, but used nowhere
funny_dup17 SQL function is created, but used nowhere
int44in used in type definition, but no input ever happens
int44outused in type definition, but no output ever happens
test_fdw_handlerused by dummy FDW tests

I'm inclined to just remove regress_dist_ptpath, regress_path_dist,
poly2path, boxarea, and funny_dup17.  The others might better be dealt
with by making some actual use of them, since those type and operator
creation commands seem to have some test value of their own.

I notice BTW that int44in and int44out are not inverses, ie int44out
produces a string that int44in can't read :-(.  We'd definitely have to
fix that if we wanted to make any real use of the type.

regards, tom lane



Re: TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Tom Lane
Tatsuo Ishii  writes:
> I would prefer to mark it done then remove the item just for leaving
> an editing history.

Sure, although leaving a commit message with a pointer to the fix in git
would document this better.

regards, tom lane



Re: TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Tatsuo Ishii
> Sure, although leaving a commit message with a pointer to the fix in git
> would document this better.

Right. We should always do that.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] path toward faster partition pruning

2018-02-26 Thread Amit Langote
On 2018/02/27 3:27, Robert Haas wrote:
> On Sun, Feb 25, 2018 at 11:10 PM, Amit Langote
>  wrote:
>> I think I'm convinced that partopcintype OIDs can be used where I thought
>> parttypid ones were necessary.  The pruning patch uses the respective OID
>> from this array when extracting the datum from an OpExpr to be compared
>> with the partition bound datums.  It's sensible, I now think, to require
>> the extracted datum to be of partition opclass declared input type, rather
>> than the type of the partition key involved.  So, I removed the parttypid
>> that I'd added to PartitionSchemeData.
>>
>> Updated the comments to make clear the distinction between and purpose of
>> having both parttypcoll and partcollation.  Also expanded the comment
>> about partsupfunc a bit.
> 
> I don't think this fundamentally fixes the problem, although it does
> narrow it.  By requiring partcollation to match across every relation
> with the same PartitionScheme, you're making partition-wise join fail
> to work in some cases where it previously did.  Construct a test case
> where parttypcoll matches and partcollation doesn't; then, without the
> patch, the two relations will have the same PartitionScheme and thus
> be eligible for a partition-wise join, but with the patch, they will
> have different PartitionSchemes and thus won't.

I may be confused, but shouldn't two tables partitioned on the same column
(of the same collatable type), but using different collations for
partitioning, end up with different PartitionSchemes?  Different
partitioning collations would mean that the same data may end up in
different partitions of the respective tables.

create table p (a text) partition by range (a collate "en_US");
create table p1 partition of p for values from ('a') to ('m');
create table p2 partition of p for values from ('m') to ('z ');

create table q (a text) partition by range (a collate "C");
create table q1 partition of q for values from ('a') to ('m');
create table q2 partition of q for values from ('m') to ('z ');

insert into p values ('A');
INSERT 0 1

insert into q values ('A');
ERROR:  no partition of relation "q" found for row
DETAIL:  Partition key of the failing row contains (a) = (A).

You may say that partition bounds might have to be different too in this
case and hence partition-wise join won't occur anyway, but I'm wondering
if the mismatch of partcollation itself isn't enough to conclude that?

Thanks,
Amit




RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-26 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> By the way, while I have my mind on it: another strategy would be to just
> make the checks in XLogReadRecord() a bit smarter if the whole record
> header is not on the page.  If we check at least for
> AllocSizeIsValid(total_len) then this code would not fail on an
> allocation as your user reported.  Still, this misses the case where a
> record size is lower than 1GB but invalid, so you would call
> allocate_recordbuf for nothing :(

That was my first thought, but I gave it up.  As you say, XLogReadRecord()
could allocate up to 1 GB of memory for garbage.  That allocation can fail
due to memory shortage, which prevents recovery from proceeding.


Regards
Takayuki Tsunakawa







Re: Kerberos test suite

2018-02-26 Thread Michael Paquier
On Wed, Feb 14, 2018 at 09:27:04AM -0500, Peter Eisentraut wrote:
> Here is a patch with a test suite for the Kerberos/GSSAPI authentication
> functionality.  It's very similar in principle to the recently added
> LDAP tests, and similar caveats apply.
> 
> You will need the client and server parts of a krb5 package
> installation, possibly named krb5-workstation and krb5-server, or
> perhaps krb5-user and krb5-kdc.

Thanks.  Could you document that in the README, please?  krb5-user and
krb5-kdc are the Debian split.  For darwin, are you using MacPorts or
Homebrew?  I would assume the latter, and it would be nice to spell that
out in the README as well.  On Debian you also need to install
krb5-admin-server, as it includes kadmin.local, which the test needs.
Once I understood that, I was able to run the tests.

> (If it appears to hang for you in the "setting up Kerberos" step, you
> might need more entropy/wait a while.  That problem appears to be
> limited to some virtual machine setups, but the specifics are not
> clear.)

That's one of those "move your mouse" or "type randomly on your keyboard"
situations to generate more entropy for the installation setup?

You have forgotten to update ALWAYS_SUBDIRS in src/test/Makefile.

+my ($stdout, $krb5_version);
+IPC::Run::run [ 'krb5-config', '--version' ], '>', \$stdout or die
"could not execute krb5-config";
+$stdout =~ m/Kerberos 5 release ([0-9]+\.[0-9]+)/ or die "could not get
Kerberos version";
+$krb5_version = $1;
Time for a new routine command_log which executes the command, then
returns stdout and stderr to the caller?

+system_or_bail 'echo secret1 | kinit test1';
Using IPC::Run stuff would be better here.
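
For example (untested; this assumes MIT kinit reads the password from stdin,
which is what the shell pipeline above relies on):

my ($kinit_out, $kinit_err);
IPC::Run::run [ 'kinit', 'test1' ],
	'<', \"secret1\n",
	'>', \$kinit_out,
	'2>', \$kinit_err
  or die "could not run kinit: $kinit_err";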

@@ -1153,6 +1152,11 @@ sub psql
$params{on_error_stop} = 1 unless defined $params{on_error_stop};
$params{on_error_die}  = 0 unless defined $params{on_error_die};

+   $connstr .= ' host=localhost' if defined $params{tcpip};
+
+   my @psql_params =
+ ('psql', '-XAtq', '-d', $connstr, '-f', '-');
This bit I don't like.  Wouldn't it be enough to abuse extra_params and use
a custom connection string?  The last value wins in a psql command.
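
For instance (untested, relying on PostgresNode's existing extra_params and
connstr support), a test could force a TCP connection like this without
patching psql() at all:

# The trailing -d wins over the connection string psql() builds internally.
my $ret = $node->psql(
	'postgres', 'SELECT 1',
	extra_params => [ '-d', $node->connstr('postgres') . ' host=localhost' ]);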
--
Michael


signature.asc
Description: PGP signature


Re: TODO item for broken \s with libedit seems fixed

2018-02-26 Thread Tatsuo Ishii
Patrick,

>> Sure, although leaving a commit message with a pointer to the fix in git
>> would document this better.
> 
> Right. We should always do that.

Would you like to do this yourself? Or shall I do this?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: invalid memory alloc request size error with commit 4b93f579

2018-02-26 Thread Daniel Gustafsson
> On 27 Feb 2018, at 11:10, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>>> On 27 Feb 2018, at 05:25, Tom Lane  wrote:
>>> BTW, I noticed while doing this that the adjacent "funny_dup17"
>>> trigger is dead code, and has been since commit 1547ee01 of
>>> 1999-09-29.  I'm inclined to rip it out, because anyone looking
>>> at regress.c would naturally assume that anything in there is
>>> being exercised.
> 
>> +1, yes please.  regress_dist_ptpath() and regress_path_dist() in regress.c
>> also seem to be dead, and have been so for..  quite some time.
> 
> Yeah.  Looking at
> https://coverage.postgresql.org/src/test/regress/regress.c.gcov.html
> it's evident that none of these functions are actually exercised
> in the regression tests:

Aha, that was a more clever way of figuring it out than what I did.

> I'm inclined to just remove regress_dist_ptpath, regress_path_dist,
> poly2path, boxarea, and funny_dup17.  The others might better be dealt
> with by making some actual use of them, since those type and operator
> creation commands seem to have some test value of their own.

Agreed.

> I notice BTW that int44in and int44out are not inverses, ie int44out
> produces a string that int44in can't read :-(.  We'd definitely have to
> fix that if we wanted to make any real use of the type.

That's not nice given that the names imply they are inverses, but I agree
that it's not something that needs to be changed given its current use case.
That probably warrants a comment in regress.c and/or the create_type test
suite though.

cheers ./daniel


Re: Kerberos test suite

2018-02-26 Thread Thomas Munro
On Thu, Feb 15, 2018 at 3:27 AM, Peter Eisentraut
 wrote:
> Here is a patch with a test suite for the Kerberos/GSSAPI authentication
> functionality.  It's very similar in principle to the recently added
> LDAP tests, and similar caveats apply.

+not run by default.  You might need to adjust some paths in the test
+file to have it find MIT Kerberos in a place that hadn't been thought
+of yet.

FWIW it passes for me if I add this:

+elsif ($^O eq 'freebsd')
+{
+   $krb5_bin_dir = '/usr/local/bin';
+   $krb5_sbin_dir = '/usr/local/sbin';
+}

One thing that surprised me is that your test runs my system's
installed psql command out of my $PATH, not the one under test.  Is
that expected?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-02-26 Thread Masahiko Sawada
On Wed, Feb 21, 2018 at 6:07 AM, Robert Haas  wrote:
> On Tue, Feb 13, 2018 at 5:42 AM, Masahiko Sawada  
> wrote:
>>> The fdw-transactions section of the documentation seems to imply that
>>> henceforth every FDW must call FdwXactRegisterForeignServer, which I
>>> think is an unacceptable API break.
>>>
>>> It doesn't seem advisable to make this behavior depend on
>>> max_prepared_foreign_transactions.  I think that it should be an
>>> server-level option: use 2PC for this server, or not?  FDWs that don't
>>> have 2PC default to "no"; but others can be set to "yes" if the user
>>> wishes.  But we shouldn't force that decision to be on a cluster-wide
>>> basis.
>>
>> Since I've added a new option two_phase_commit to postgres_fdw we need
>> to ask FDW whether the foreign server is 2PC-capable server or not in
>> order to register the foreign server information. That's why the patch
>> asked FDW to call  FdwXactRegisterForeignServer. However we can
>> register a foreign server information automatically by executor (e.g.
>> at  BeginDirectModify and at BeginForeignModify) if a foreign server
>> itself has that information. We can add two_phase_commit_enabled
>> column to pg_foreign_server system catalog and that column is set to
>> true if the foriegn server is 2PC-capable (i.g. has enough functions)
>> and user want to use it.
>
> I don't see why this would need a new catalog column.

I might be missing your point.  As for API breakage, this patch doesn't
break any existing FDWs.  All the new APIs I proposed are dedicated to 2PC.
In other words, FDWs that work today can continue working after this patch
gets committed, but if FDWs want to support atomic commit then they should
be updated to use the new APIs.  The reason why calling
FdwXactRegisterForeignServer is necessary is that the core code controls
the foreign servers involved in the transaction, but whether each foreign
server uses the 2PC option (two_phase_commit) is known only to the FDW
code.  We can eliminate the need to call FdwXactRegisterForeignServer by
moving the 2PC option from the FDW level to the server level, so that FDWs
are not forced to call the registration function.  If we did that, the user
could use the 2PC option as a server-level option.

>
>>> But that brings up another issue: why is MyFdwConnections named that
>>> way and why does it have those semantics?  That is, why do we really
>>> need a list of every FDW connection? I think we really only need the
>>> ones that are 2PC-capable writes.  If a non-2PC-capable foreign server
>>> is involved in the transaction, then we don't really to keep it in a
>>> list.  We just need to flag the transaction as not locally prepareable
>>> i.e. clear TwoPhaseReady.  I think that this could all be figured out
>>> in a much less onerous way: if we ever perform a modification of a
>>> foreign table, have nodeModifyTable.c either mark the transaction
>>> non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign
>>> server is not 2PC capable, or otherwise add the appropriate
>>> information to MyFdwConnections, which can then be renamed to indicate
>>> that it contains only information about preparable stuff.  Then you
>>> don't need each individual FDW to be responsible about calling some
>>> new function; the core code just does the right thing automatically.
>>
>> I could not get this comment. Did you mean that the foreign
>> transaction on not 2PC-capable foreign server should be end in the
>> same way as before (i.g. by XactCallback)?
>>
>> Currently, because there is not FDW API to end foreign transaction,
>> almost FDWs use XactCallbacks to end the transaction. But after
>> introduced new FDW APIs, I think it's better to use FDW APIs to end
>> transactions rather than using XactCallbacks. Otherwise we end up with
>> having FDW APIs for 2PC (prepare and resolve) and XactCallback for
>> ending the transaction, which would be hard to understand. So I've
>> changed the foreign transaction management so that core code
>> explicitly asks FDW to end/prepare a foreign transaction instead of
>> ending it by individual FDWs. After introduced new FDW APIs, core code
>> can have the information of all foreign servers involved with the
>> transaction and call each APIs at appropriate timing.
>
> Well, it's one thing to introduce a new API.  It's another thing to
> require existing FDWs to be updated to use it.  There are a lot of
> existing FDWs out there, and I think that it is needlessly unfriendly
> to force them all to be updated for v11 (or whenever this gets
> committed) even if we think the new API is clearly better.  FDWs that
> work today should continue working after this patch is committed.

Agreed.

> Separately, I think there's a question of whether the new API is in
> fact better -- I'm not sure I have a completely well-formed opinion
> about that yet.

I think one API should do one job.  In terms of keeping the API simple, the
current four APIs are not bad.  AFAICS other DB server that suppo

Scenario using pg_rewind

2018-02-26 Thread Kato, Sho
Hi,

I have a question about how to use pg_rewind.
The manual says that bringing an old master server back online after
failover is the typical case.

manual:
  A typical scenario is to bring an old master server back online after 
failover, as a standby that follows the new master.

I came up with another scenario, where the user uses a synchronized old
master as a new master.
If so, pg_rewind needs to copy WAL from the source server.
Is pg_rewind meant to support using the old master as a new master?

If pg_rewind is used only for standby server installation,
pg_rewind does not need to get WAL from the source server, does it?

regards,

--
Kato Sho






Re: Two small patches for the isolationtester lexer

2018-02-26 Thread Daniel Gustafsson
> On 22 Feb 2018, at 05:12, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> On 21 Feb 2018, at 21:41, Tom Lane  wrote:
>>> I can't think of one; but I wonder if it's worth working a bit harder and
>>> removing the fixed limit altogether, probably by using a PQExpBuffer.
>>> If you've hit 1024 today, somebody will bump up against 2048 tomorrow.
> 
>> The thought did cross my mind, but I opted for the simple hack first.  I can
>> take a stab at using a PQExpBuffer to see where that leads.
> 
> Another idea is just to teach addlitchar to realloc the buffer bigger
> when necessary.

I think this is the best approach for the task; the attached patch changes
the static allocation to instead realloc when required.  Having an upper
limit on the allocation seemed like a good idea to me, but perhaps it's
overthinking and complicating things for no good reason.
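
For readers without the attachment, the growth strategy amounts to something
like this (an untested sketch; the buffer variable names are assumptions, not
the actual isolationtester identifiers):

#include <stdio.h>
#include <stdlib.h>

static char *litbuf = NULL;
static size_t litbufsize = 0;
static size_t litbufpos = 0;

static void
addlitchar(char c)
{
	/* Grow the buffer geometrically instead of failing at a fixed limit. */
	if (litbufpos + 1 >= litbufsize)
	{
		litbufsize = (litbufsize == 0) ? 1024 : litbufsize * 2;
		litbuf = realloc(litbuf, litbufsize);
		if (litbuf == NULL)
		{
			fprintf(stderr, "out of memory\n");
			exit(1);
		}
	}
	litbuf[litbufpos++] = c;
	litbuf[litbufpos] = '\0';	/* keep the literal NUL-terminated */
}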

cheers ./daniel



0001-Grow-isolation-spec-scanner-buffer-on-large-SQL-step.patch
Description: Binary data


Re: Two small patches for the isolationtester lexer

2018-02-26 Thread Daniel Gustafsson
> On 22 Feb 2018, at 05:10, Tom Lane  wrote:
> 
> I wrote;
>> Daniel Gustafsson  writes:
>>> I also (again) forgot about the # comments not being allowed inside setup 
>>> and
>>> teardown blocks, so patch 0002 proposes adding support for these as the
>>> documentation implies.
> 
>> Hmm, not sure this is a good idea.  # is a valid SQL operator name, so
>> doing this would create some risk of breaking legal queries.
> 
> Actually, looking closer, this would also trigger on '#' used inside a
> SQL literal, which seems to move the problem cases into the "pretty
> likely" category instead of the "far-fetched" one.  So I'd only be OK
> with it if we made the lexer smart enough to distinguish inside-a-SQL-
> literal from not.  That might be a good thing anyway, since it would
> allow us to not choke on literals containing '}', but it'd be a great
> deal more work.  You might be able to steal code from the psql lexer
> though.

I agree, patch 0002 was broken and the correct fix is a much bigger project -
one too big for me to tackle right now (but hopefully at some point in the near
future).  Thanks for the review of it though!

cheers ./daniel


RE: [bug fix] pg_rewind takes long time because it mistakenly copies data files

2018-02-26 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Your patch is able to fix that.  I have also checked that after diverging
> the promoted server with more data and inserting data on the old primary
> then the correct set of blocks from the tablespace is fetched as well by
> pg_rewind.  This patch also behaves correctly when creating a new relation
> on the promoted server as it copies the whole relation. In short your patch
> looks good to me.

That was quick; I was surprised.  Thank you so much!  I'd be glad if you
could be the reviewer in the CF:

https://commitfest.postgresql.org/17/1542/


> Creating a test case for this patch would be nice, however this needs a
> bit of work so as the tablespace map can be used with pg_basebackup and
> PostgresNode.pm (or use raw pg_basebackup commands in pg_rewind tests):
> - PostgresNode::init_from_backup needs to be able to understand extra
> options given by caller for pg_basebackup.
> - RewindTest::create_standby should be extended with extra arguments given
> for previous extension.
> :(

That sounds difficult, the way you put it...  but this may be a good
opportunity to look into the TAP tests.

Regards
Takayuki Tsunakawa





Re: [HACKERS] Pluggable storage

2018-02-26 Thread Alexander Korotkov
On Fri, Feb 23, 2018 at 2:20 AM, Robert Haas  wrote:

> On Fri, Feb 16, 2018 at 5:56 AM, Alexander Korotkov
>  wrote:
> > BTW, EnterpriseDB announces zheap table access method (heap with undo
> log)
> > [2].  I think this is great, and I'm looking forward for publishing
> zheap in
> > mailing lists.  But I'm concerning about its compatibility with pluggable
> > table access methods API.  Does zheap use table AM API from this
> thread?  Or
> > does it just override current heap and needs to be adopted to use table
> AM
> > API?  Or does it implements own API?
>
> Right now it just hacks the code.  The plan is to adapt it to whatever
> API we settle on in this thread.


Great, thank you for the clarification.  I'm looking forward to reviewing zheap :)

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: VACUUM FULL name is very confusing to some people (or to most non expert people)

2018-02-26 Thread Lætitia Avrot
>
> Although VACUUM and VACUUM FULL is different, then result is same (depends
> on detail level) - the data files are optimized for other processing. You
> should to see a VACUUM like family of commands that does some data files
> optimizations. VACUUM, VACUUM FULL, VACUUM FREEZE, VACUUM ANALYZE, ...
> Personally I don't think so we need to implement new synonym command for
> this case.
>

Here's how I understand what you wrote: "Each and every vacuum operation is
a different flavour of data file optimization, so it's legitimate to use
similar names".  I agree for VACUUM ANALYZE and VACUUM FREEZE, which can be
seen as options that do more than a simple VACUUM.

But I disagree for VACUUM FULL, which isn't an option to do one more thing
than VACUUM does.  VACUUM FULL is a completely different process.

Let's take an example:
On a production server with an average production load, if you're already
running a VACUUM, you can change it to a VACUUM ANALYZE without much risk.
But I wouldn't dare try a VACUUM FULL without pg_repack.


> Why you you cannot to say your students - "VACUUM FULL is like SHRINK in
> SQL Server"?
>

I do explain that to my students, but I'm not sure they memorize it, because
they have a lot to memorize in a training session.

I keep meeting customers to whom I have to explain that a simple VACUUM
doesn't rebuild indexes.  Am I the only one facing that problem?


> Regards
>
> Pavel
>

Regards

Lætitia


Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Masahiko Sawada
On Tue, Feb 20, 2018 at 5:04 PM, Masahiko Sawada  wrote:
> On Fri, Feb 16, 2018 at 5:00 AM, Claudio Freire  
> wrote:
>> On Thu, Feb 15, 2018 at 4:47 PM, Claudio Freire  
>> wrote:
>>> On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada  
>>> wrote:
>>
> The final FSM vacuum pass isn't partial, to finally correct all those
> small inconsistencies.

 Yes, but the purpose of this patch is to prevent table bloating from
 concurrent modifications?
>>>
>>> Yes, by *marking* unmarked space. If the slot is above the threshold,
>>> it means there's already marked free space on that branch, and search
>>> will go into it already, so no pressing need to refine the mark. The
>>> space already marked can serve while vacuum makes further progress.
>>> Ie: it can wait.
>>
>> Although... I think I see what you mean.
>>
>> If you have this:
>>
>> 255
>> .   0
>> .   .   0
>> .   .   255
>> .   0
>> .   .   64
>> .   .   64
>> .   0
>> .   .   0
>> .   .   64
>>
>>
>> A partial vacuum won't even recurse beyond the root node, so it won't
>> expose the free space 2 levels down.
>
> Yeah, that's what I meant. I think this might be able to happen
> slightly easily if a tables has fillfactor < 100. For example,
> updating tuples on pages that are almost packed except for fillfactor
> with the more bigger value
>
>>
>> This could be arrived at by having an almost fully packed table that
>> contains some free space near the end, and it gets deletions near the
>> beginning. Vacuum will mark the leaf levels at the beginning of the
>> table, and we'd like for that free space to be discoverable to avoid
>> packing more rows near the end where they prevent truncation.
>>
>> The patch as it is now will only vacuum that part of the FSM when the
>> root value drops, which usually requires all the free space on that
>> region of the heap to be exhausted.
>>
>> With the current recursive FSM vacuum method, however, that's
>> unavoidable. We can't detect that there's some free space below the
>> current level to be exposed without recursing into the tree, and
>> recursing is exactly the kind of work we want to prevent with tree
>> pruning.
>>
>> In all my trials, however, I did not run into that case. It must not
>> be easy to get into that situation, because the root node already has
>> ~4000 slots in it. Each of those 4000 slots should be in that
>> situation to keep the space at the end of the table as the sole
>> discoverable free space, which seems like a rather contorted worst
>> case.
>
> Okay, I guess that this patch cannot help in the case where the
> freespace of pages shown on fsm gets decreased by vacuum because the
> upper-level node doesn't reflect the value on the lower page. However
> it doesn't happen often as you said and it's the same as it is even
> though it happens. So I also think it would not become a problem.
>
> I'll review v4 patch.
>

Here are my review comments for the v4 patch.

@@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 		 * We don't insert a vacuum delay point here, because we have an
 		 * exclusive lock on the table which we want to hold for as short a
 		 * time as possible.  We still need to check for interrupts however.
+		 * We might have to acquire the autovacuum lock, however, but that
+		 * shouldn't pose a deadlock risk.
 		 */
 		CHECK_FOR_INTERRUPTS();

Is this change necessary?


+		/*
+		 * If there are no indexes then we should periodically vacuum the FSM
+		 * on huge relations to make free space visible early.
+		 */
+		if (nindexes == 0 &&
+			(vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages)
+		{
+			/* Vacuum the Free Space Map */
+			FreeSpaceMapVacuum(onerel, max_freespace);
+			vacuumed_pages_at_fsm_vac = vacuumed_pages;
+			max_freespace = 0;
+		}

I think this code block should be executed before we check whether the page
is new or empty and then 'continue'.  Otherwise we cannot reach this code if
the table has a lot of new or empty pages.


@@ -785,7 +789,7 @@ fsm_search(Relation rel, uint8 min_cat)
  * Recursive guts of FreeSpaceMapVacuum
  */
 static uint8
-fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
+fsm_vacuum_page(Relation rel, FSMAddress addr, uint8 threshold, bool *eof_p)
 {
Buffer  buf;
Pagepage;

I think a comment for 'threshold' is needed.  Maybe for FreeSpaceMapVacuum
as well?


@@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
  * If minValue > 0, the updated page is also searched for a page with at
  * least minValue of free space. If one is found, its slot number is
  * returned, -1 otherwise.
+ *
+ * If minValue == 0, the value a

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 05:08:49PM +0900, Michael Paquier wrote:
> This was mentioned back in 2001 by the way, but this did not count much
> for the case discussed here:
> https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us
> The issue here is that the streaming case makes it easier to hit the
> problem as it opens more easily access to not-completely written WAL
> pages depending on the message frequency during replication.  At the
> same time, we are discussing about a very low-probability issue.  Note
> that if the XLOG reader is bumping into this problem, then at the next
> WAL receiver wake up, recovery would begin from the beginning of the
> last segment, and if the primary has produced some more WAL then the
> standby would be able to actually avoid the random junk.  It is also
> possible to bypass the problem by zeroing manually the areas in
> question, or to actually wait for the standby to generate more WAL so as
> the garbage is overwritten automatically.  And you really need to be
> very, very unlucky to have random garbage able to bypass the header
> validation checks.

By the way, while I have my mind on it: another strategy would be to just
make the checks in XLogReadRecord() a bit smarter if the whole record
header is not on the page.  If we check at least for
AllocSizeIsValid(total_len) then this code would not fail on an allocation
as your user reported.  Still, this misses the case where a record size is
lower than 1GB but invalid, so you would call allocate_recordbuf for
nothing :(

At least this extra check is costless, and avoids all kinds of hard
failures.
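
To make that concrete, the extra check could look roughly like this inside
XLogReadRecord(), once total_len has been read from the record header and
before allocate_recordbuf() is reached (an untested fragment, not an actual
patch):

	/*
	 * Reject obviously-bogus lengths before trying to allocate a record
	 * buffer for them; AllocSizeIsValid() caps the value at what palloc
	 * could ever satisfy (a bit under 1GB).
	 */
	if (!AllocSizeIsValid(total_len))
	{
		report_invalid_record(state,
							  "record length %u at %X/%X is not a valid allocation size",
							  (uint32) total_len,
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		goto err;
	}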
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] pg_rewind takes long time because it mistakenly copies data files

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 08:13:02AM +, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:mich...@paquier.xyz]
>> Your patch is able to fix that.  I have also checked that after diverging
>> the promoted server with more data and inserting data on the old primary
>> then the correct set of blocks from the tablespace is fetched as well by
>> pg_rewind.  This patch also behaves correctly when creating a new relation
>> on the promoted server as it copies the whole relation. In short your patch
>> looks good to me.
> 
> How quick, I was surprised.  Thank you so much!  I'd be glad if you could be 
> the reviewer in CF:
> 
> https://commitfest.postgresql.org/17/1542/

Sure.

>> Creating a test case for this patch would be nice, however this needs a
>> bit of work so as the tablespace map can be used with pg_basebackup and
>> PostgresNode.pm (or use raw pg_basebackup commands in pg_rewind tests):
>> - PostgresNode::init_from_backup needs to be able to understand extra
>> options given by caller for pg_basebackup.
>> - RewindTest::create_standby should be extended with extra arguments given
>> for previous extension.
>> :(
> 
> That sounds difficult from your way of saying... but this may be a
> good opportunity to look into the TAP tests. 

Anything like that would be HEAD-only work I think, as it requires a bit of
refactoring.  And indeed it could give you a good introduction to the TAP
facility.
--
Michael


signature.asc
Description: PGP signature


Using JSONB directly from application

2018-02-26 Thread Anthony Communier
Hello,

It would be nice if applications connected to a PostgreSQL database could
send and receive JSONB in binary form.  It would save some useless text
conversion.  All the work could be done on the application side, which is
often more scalable than the database itself.

Regards,

Anthony Communier


Re: csv format for psql

2018-02-26 Thread Fabien COELHO


Hello Daniel,


This patch implements csv as an output format in psql
(\pset format csv). It's quite similar to the unaligned format,
except that it applies CSV quoting rules (obviously!) and that
it prints no footer and no title.
As with unaligned, a header with column names is output unless
tuples_only is on. It also supports the fieldsep/fieldsep_zero
and recordsep/recordsep_zero settings.


Patch applies cleanly and compiles. "make check" ok, although there is
no specific test for this feature...

The documentation should mention the other CSV options (COPY, \copy, ...)
and explain how they compare to this one.  Maybe a specific paragraph about
how to do CSV?  I understand "\pset format csv" as applying to all output,
as opposed to per-command options.


Given that this is somehow already available, I'm wondering why there is 
no code sharing.


I find it annoying that setting csv keeps the default '|' separator, whereas
ISTM that it should default to "," (as in COMMA separated value :-).
However it would not be a good idea to change other variables when setting
one, obviously.

Maybe some \csv command could set the format to csv, fieldsep to ",",
tuples_only to on, and recordsep to '\n'?  Not sure whether that would be
acceptable, though, nor how to turn it off once turned on...  Probably an
average (aka not good) idea :-)


The format adds column headers, however they are not escaped properly:

  psql> \pset format csv
  psql> \pset fieldsep ,
  psql> SELECT 1 AS "hello, world", 2 AS ;
hello, world,"
1,2

Also it does not seem to work properly in expanded mode, both for the
column and values:

  psql> \x
  psql> SELECT 1 AS "bla""", E'\n,"' AS foo;
bla",1
foo,
,"

There MUST be some tests, especially with ugly stuff (escapes, newlines,
double quotes, various types, expanded or not, field seps, strange 
column names...).
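
For the record, the quoting rule such tests need to exercise is small; here
is a standalone sketch of it (RFC 4180 style, not the patch's actual code):

#include <stdio.h>
#include <string.h>

/*
 * Print one CSV field: quote it when it contains the separator, a double
 * quote, a CR or an LF, and double any embedded double quotes.
 */
static void
csv_print_field(FILE *fout, const char *value, char sep)
{
	if (strchr(value, sep) || strpbrk(value, "\"\r\n"))
	{
		fputc('"', fout);
		for (const char *p = value; *p; p++)
		{
			if (*p == '"')
				fputc('"', fout);
			fputc(*p, fout);
		}
		fputc('"', fout);
	}
	else
		fputs(value, fout);
}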




Most of times, the need for CSV is covered by \copy or COPY with
the CSV option, but there are some cases where it would be more
practical to have it as an output format in psql.

* \copy does not interpolate psql variables and is a single-line
command, so making a query fit these contraints can be cumbersome.
It can be got around by defining a temporary view and
\copy from that view, but that doesn't work in a read-only context
such as when connected to a standby.

* the server-side COPY TO STDOUT can also be used from psql,
typically with psql -c "COPY (query) TO STDOUT CSV" > file.csv,
but that's too simple to extract multiple result sets per script.
COPY is also more rigid than psql in the options to delimit
fields and records.

* copy with csv can't help for the output of meta-commands
such as \gx, \crosstabview, \l, \d ... whereas a CSV format within psql
does work with these.


--
Fabien.



Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-26 Thread Jeevan Chalke
Hi Robert,

On Fri, Feb 23, 2018 at 2:53 AM, Robert Haas  wrote:

> On Thu, Feb 8, 2018 at 8:05 AM, Jeevan Chalke
>  wrote:
> > In this attached version, I have rebased my changes over new design of
> > partially_grouped_rel. The preparatory changes of adding
> > partially_grouped_rel are in 0001.
>
> I spent today hacking in 0001; results attached.  The big change from
> your version is that this now uses generate_gather_paths() to add
> Gather/Gather Merge nodes (except in the case where we sort by group
> pathkeys and then Gather Merge) rather than keeping all of the bespoke
> code.  That turned up to be a bit less elegant than I would have liked
> -- I had to an override_rows argument to generate_gather_paths to make
> it work.  But overall I think this is still a big improvement, since
> it lets us share code instead of duplicating it.  Also, it potentially
> lets us add partially-aggregated but non-parallel paths into
> partially_grouped_rel->pathlist and that should Just Work; they will
> get the Finalize Aggregate step but not the Gather.  With your
> arrangement that wouldn't work.
>
> Please review/test.
>

I have reviewed and tested the patch, and here are a couple of points:

 /*
- * If the input rel belongs to a single FDW, so does the grouped rel.
- * If the input rel belongs to a single FDW, so does the grouped rel.
+ * If the input rel belongs to a single FDW, so does the grouped rel.  Same
+ * for the partially_grouped_rel.
  */
 grouped_rel->serverid = input_rel->serverid;
 grouped_rel->userid = input_rel->userid;
 grouped_rel->useridiscurrent = input_rel->useridiscurrent;
 grouped_rel->fdwroutine = input_rel->fdwroutine;
+partially_grouped_rel->serverid = input_rel->serverid;
+partially_grouped_rel->userid = input_rel->userid;
+partially_grouped_rel->useridiscurrent = input_rel->useridiscurrent;
+partially_grouped_rel->fdwroutine = input_rel->fdwroutine;

In my earlier mail, where I posted a patch for these partially_grouped_rel
changes, I forgot to ask my question about this.
I was unclear about the above changes and thus passed grouped_rel whenever
we wanted to work on partially_grouped_rel to fetch the relevant details.

One idea I thought about is to memcpy the struct once we have set all the
required fields for grouped_rel, so that we don't have to do similar work
for partially_grouped_rel.

---

+ * Insert a Sort node, if required.  But there's no point in
+ * sorting anything but the cheapest path.
  */
-if (root->group_pathkeys)
+if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+{
+if (path != linitial(partially_grouped_rel->pathlist))
+continue;

Paths in pathlist are added by add_path().  Though the pathlist is kept
sorted with the cheapest total path first, we generally use
RelOptInfo->cheapest_total_path instead of the first entry, unlike for
partial paths.  But here you use the first entry, as in the partial paths
case.  Would it be better to use the cheapest total path from
partially_grouped_rel?  This would require calling set_cheapest on
partially_grouped_rel before we call this function.

Attached is a top-up patch doing this, along with a few indentation fixes.

Rest of the changes look good to me.

Once this gets in, I will re-base my other patches accordingly.

And, thanks for committing 0006.

Thanks


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1c792a0..f6b0208 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2453,7 +2453,8 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
  * we must do something.)
  */
 void
-generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+	  bool override_rows)
 {
 	Path	   *cheapest_partial_path;
 	Path	   *simple_gather_path;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e4f9bd4..e8f6cc5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6101,7 +6101,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			 */
 			if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
 			{
-if (path != linitial(partially_grouped_rel->pathlist))
+if (path != partially_grouped_rel->cheapest_total_path)
 	continue;
 path = (Path *) create_sort_path(root,
  grouped_rel,
@@ -6186,9 +6186,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		 */
 		if (partially_grouped_rel->pathlist)
 		{
-			Path	   *path;
-
-			path = (Path *) linitial(partially_grouped_rel->pathlist);
+			Path	   *

Optimizing nested ConvertRowtypeExpr execution

2018-02-26 Thread Ashutosh Bapat
Hi,
In a multi-level partitioned table, a parent whole-row reference gets
translated into a nested ConvertRowtypeExpr with the child whole-row
reference as the leaf.  During execution, the child whole-row reference
gets translated through all the intermediate parents' whole-row references,
ultimately being represented as the parent's whole-row reference.  AFAIU,
the intermediate translations are unnecessary.  The leaf child whole-row
reference can be directly translated into the top parent's whole-row
reference.  Here's a WIP patch which does that by eliminating intermediate
ConvertRowtypeExprs during ExecInitExprRec().

I tested the performance with two-level partitions, and 1M rows, on my
laptop selecting just the whole-row expression. I saw ~20% improvement
in the execution time. Please see the attached test and its output
with and without patch.

For a two-level inheritance hierarchy, this patch doesn't show any
performance difference, since the plan-time hierarchy is flattened into a
single level.

Instead of the approach that the patch takes, we might modify
adjust_appendrel_attrs() not to produce nested ConvertRowtypeExprs in
the first place. With that we may get rid of code which handles nested
ConvertRowtypeExprs like is_converted_whole_row_reference(). But I
haven't tried that approach yet.

Suggestions/comments welcome.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index db5fcaf..ec896a4 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1364,6 +1364,37 @@ ExecInitExprRec(Expr *node, ExprState *state,
 		case T_ConvertRowtypeExpr:
 			{
 ConvertRowtypeExpr *convert = (ConvertRowtypeExpr *) node;
+ConvertRowtypeExpr *tmp_cre = convert;
+bool		nested_cre = false;
+
+/*
+ * If this is a nested ConvertRowtypeExpr resulting from a
+ * multi-level partition/inheritance hierarchy, its leaf node
+ * will be a whole-row expression. We can convert the leaf
+ * whole-row directly into the topmost parent, without
+ * converting it into the intermediate parent row types.
+ */
+while (IsA(tmp_cre->arg, ConvertRowtypeExpr))
+{
+	tmp_cre = castNode(ConvertRowtypeExpr, tmp_cre->arg);
+	nested_cre = true;
+}
+
+if (nested_cre && IsA(tmp_cre->arg, Var) &&
+	castNode(Var, tmp_cre->arg)->varattno == 0)
+{
+	ConvertRowtypeExpr *new_convert;
+
+	/*
+	 * XXX: Instead of modifying the expression directly, we
+	 * save a modified copy in the execution tree. May be it's
+	 * safe to modify the expression directly, not sure.
+	 */
+	new_convert = makeNode(ConvertRowtypeExpr);
+	memcpy(new_convert, convert, sizeof(ConvertRowtypeExpr));
+	new_convert->arg = tmp_cre->arg;
+	convert = new_convert;
+}
 
 /* evaluate argument into step's result area */
 ExecInitExprRec(convert->arg, state, resv, resnull);


nest_cre.out
Description: Binary data
\set num_rows 100
drop table t1;
drop table p cascade;
create table t1 (a int, b varchar, c timestamp) partition by range(a);
create table t1p1 (b varchar, c timestamp, a int) partition by range(a);
alter table t1 attach partition t1p1 for values from (0) to (100);
create table t1p1p1(c timestamp, a int, b varchar);
alter table t1p1 attach partition t1p1p1 for values from (0) to (50);
insert into t1 select abs(random()) * 49, i, now() from generate_series(1,  :num_rows) i;
explain analyze select t1 from t1;
   QUERY PLAN   

 Append  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.079..683.430 rows=100 loops=1)
   ->  Seq Scan on t1p1p1  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.077..616.439 rows=100 loops=1)
 Planning time: 0.193 ms
 Execution time: 717.929 ms
(4 rows)

explain analyze select t1 from t1;
   QUERY PLAN   

 Append  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.017..607.063 rows=100 loops=1)
   ->  Seq Scan on t1p1p1  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.016..541.619 rows=100 loops=1)
 Planning time: 0.115 ms
 Execution time: 640.628 ms
(4 rows)

explain analyze select t1 from t1;
   QUERY PLAN   

 Append  (cost=0.00..13565.97 rows=719697 width=32) (actual time=0.016..605.972 rows=100 loops=1)

Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-02-26 Thread Etsuro Fujita

(2018/02/23 16:38), Amit Langote wrote:

On Thu, Feb 22, 2018 at 8:49 PM, Etsuro Fujita
  wrote:

This would introduce an asymmetry (we can move tuples from plain partitions
to foreign partitions, but the reverse is not true), but I am thinking that
it would be probably okay to document about that.



About just documenting the asymmetry you mentioned that's caused by
the fact that we don't enforce constraints on foreign tables, I
started wondering if we shouldn't change our stance on the matter wrt
"partition" constraints?


I'm not sure that it's a good idea to make an exception in that case. 
Another concern is triggers on the remote side; those might change the 
row so that the partition constraint of the containing partition is no 
longer satisfied.



But, admittedly, that's a topic for a
different thread.


OK, I'll leave that for another patch.

Will post a new version.  Thanks for the comments!

Best regards,
Etsuro Fujita



Re: GSOC 2018 ideas

2018-02-26 Thread Aleksander Alekseev
Hello Charles,

> I saw PostgreSQL is selected in GSOC 2018 and pretty interested in the
> ideas of thrift data types support that proposed by you. So, I want to
> prepare for a proposal based on this idea.

Glad you are interested in this project!

> Can I have more detailed information of what documents or code that I
> need to understand?

I would recommend the following documents and code:

* Source code of pg_protobuf
  https://github.com/afiskon/pg_protobuf
* "Writing Postgres Extensions" tutorial series by Manuel Kniep
  http://big-elephants.com/2015-10/writing-postgres-extensions-part-i/
* "So you want to make an extension?" talk by Keith Fiske
  http://slides.keithf4.com/extension_dev/#/
* Apache Thrift official website
  https://thrift.apache.org/
* Also a great explanation of the Thrift format can be found in the
  book "Designing Data-Intensive Applications" by Martin Kleppmann
  http://dataintensive.net/

> Also, if this idea is allocated to other student (or in other worlds,
> you prefer some student to work on it), do let me know, so that I can
> pick some other project in PostgreSQL. Any comments or suggestions are
> welcomed!

To the best of my knowledge, there are currently no other students
interested in this particular work.

-- 
Best regards,
Aleksander Alekseev


signature.asc
Description: PGP signature


Re: VACUUM FULL name is very confusing to some people (or to most non expert people)

2018-02-26 Thread Pavel Stehule
2018-02-26 9:56 GMT+01:00 Lætitia Avrot :

> Although VACUUM and VACUUM FULL is different, then result is same (depends
>> on detail level) - the data files are optimized for other processing. You
>> should to see a VACUUM like family of commands that does some data files
>> optimizations. VACUUM, VACUUM FULL, VACUUM FREEZE, VACUUM ANALYZE, ...
>> Personally I don't think so we need to implement new synonym command for
>> this case.
>>
>
> Here's how I understand what you wrote : "Each and every vacuum operations
> are different flavours for files optimization so it's legitimate to use
> similar names". I agree for VACUUM ANALYZE and VACUUM FREEZE that can be
> seen as options to do more things than a simple VACUUM.
>
> But I disagree for VACUUM FULL that isn't an option to do one more thing
> than VACUUM does. VACUUM FULL is a complete different process.
>
> Let's take an example:
> In a production server with average production load, if you're already
> running a VACUUM, you can change it to a VACUUM ANALYZE without many risks.
> But I wouldn't dare try a VACUUM FULL without pg_repack.
>

VACUUM and VACUUM ANALYZE do the same VACUUM work.

VACUUM FREEZE is different, too.


>
>
>> Why you you cannot to say your students - "VACUUM FULL is like SHRINK in
>> SQL Server"?
>>
>
> I do explain that to my students but I'm not sure they memorize it,
> because they do have a lot to memorize in a training session.
>

Maybe the core of this issue is the fact that VACUUM FULL and VACUUM FREEZE
are statements based on two keywords, while lazy VACUUM uses just one
keyword.  From this perspective the lazy VACUUM "should be" renamed, because
it is inconsistent - and some databases use names like OPTIMIZE TABLE (which
sounds much more optimistic :)).  I have taught PostgreSQL for more than ten
years, and I have not had a problem with this topic - the hard part is
explaining VACUUM FREEZE - but VACUUM and VACUUM FULL can be explained
simply (with some detail reduction).  VACUUM reduces MVCC costs, VACUUM FULL
reduces bloat.  ANALYZE is orthogonal - it calculates column statistics.


>
> I keep meeting customers to who I have to explain that a simple VACUUM
> doesn't rebuild indexes. Am I the only one facing that problem ?
>

A simple (lazy) VACUUM does tasks that don't need aggressive locks;
rebuilding indexes needs strong locks.

I agree it is not entirely clean, because VACUUM FULL shares some work with
REINDEX, but introducing a new command changes nothing.



>
>
>> Regards
>>
>> Pavel
>>
>
> Regards
>
> Lætitia
>


Re: Use of term hostaddrs for multiple hostaddr values

2018-02-26 Thread Vasundhar Boddapati
I have gone through the comments; both look straightforward to go ahead with.

Re: [doc fix] Correct calculation of vm.nr_hugepages

2018-02-26 Thread Vasundhar Boddapati
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:tested, passed

It works. Can be Merged.

Re: VACUUM FULL name is very confusing to some people (or to most non expert people)

2018-02-26 Thread David G. Johnston
On Sun, Feb 25, 2018 at 10:51 AM, Lætitia Avrot 
wrote:

> Hi all,
>
> For most beginners (and even a lot of advanced users)  there is a strong
> confusion between simple VACUUM and VACUUM FULL. They think "full" is
> simply an option to the maintenance operation vacuum while it's not. It's a
> complete different  operation.
>
> I have a hard time explaining it when I teach PostgreSQL Administration
> (even if I stress the matter) and I constantly meet customer that are wrong
> about it.
>

I don't think adding an actual alias to the system is worthwhile but if
thinking of "VACUUM FULL" as "DEFRAGMENT" helps people remember by all
means go for it.

Maybe I'm being naive, but by the time people learn MVCC and the different
pros, cons, and ways to manage the resultant bloat, understanding the
difference between VACUUM and VACUUM FULL will be straightforward: just
attaching different labels to different behaviors.  Whether FULL is an
option or part of its own command doesn't seem that important - you'd still
have to learn why the option exists and the pros and cons of using it
instead of just the basic option.  If the response is "this is all too
complicated, I'll pick the one that sounds like it's more thorough" then we
are back to understanding MVCC.  One would still have to decide between
VACUUM and DEFRAGMENT and quite possibly decide to perform both... at least
with VACUUM FULL it's made obvious that both commands are related.

David J.


Re: Using JSONB directly from application

2018-02-26 Thread Craig Ringer
On 26 February 2018 at 04:05, Anthony Communier  wrote:

> Hello,
>
> It would be nice if applications connected to a PostgreSQL database could
> send and receive JSONB in binary. It would save some useless text
> conversion. All the work could be done on the application side, which is
> often more scalable than the database itself.
>


To support this, you'd need to extract PostgreSQL's jsonb support into a C
library that could be used independently of backend server infrastructure
like 'palloc' and memory contexts, ereport(), etc. Or write a parallel
implementation.

It's one of the reasons some people questioned the wisdom of doing jsonb as
a bespoke format not using protobufs or whatever. I'm not one of them,
since I wasn't the one doing the work, and I also know how hard it can be
to neatly fit general purpose library code into the DB server where we
expect it to do little things like not abort() on malloc() failure.

If you're interested in writing such a library, I suggest proposing a broad
design for how you intend to do it here. I suspect that duplicating enough
server backend infrastructure to make the existing jsonb implementation
friendly to frontend code would be frustrating, but maybe it's not as bad
as I think. Certainly if you did such a thing, many people would thank you,
because the inability to use ereport() and elog(), PG_TRY, the List API,
etc, in FRONTEND code is a constant if minor source of irritation in
PostgreSQL development.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Patch: Pass IndexInfo correctly to aminsert for indexes on TOAST

2018-02-26 Thread Robert Haas
On Thu, Feb 22, 2018 at 9:13 PM, Ahuja, Nitin  wrote:
> Patch Description
> Currently, while saving a TOAST entry, the call to the index access
> method's insert function (aminsert) passes the IndexInfo argument as NULL.
> The IndexInfo parameter was introduced in Postgres 10 to allow aminsert to
> cache data across calls within a single SQL statement. IndexInfo is passed
> correctly for regular tuple insertion but not for TOAST entries, which
> currently prevents aminsert from using IndexInfo for TOAST entries.
>
> This patch correctly passes the built IndexInfo to aminsert for TOAST
> entries.

If you don't offer any justification for the patch it's unlikely to be
accepted.  BuildIndexInfo() isn't free, so unless it hurts something
to skip calling it, we should keep skipping it.

Also, your patch is (a) in the body of the email instead of attached,
(b) in HTML format, and (c) reversed -- if it were properly formatted,
the original hunks would be first and the revised ones second.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: remove pg_class.relhaspkey

2018-02-26 Thread Robert Haas
On Mon, Feb 26, 2018 at 12:36 AM, Michael Paquier  wrote:
> I would be of the opinion to drop them.

+1.  On this point, I am in agreement with the gentleman who wrote
http://postgr.es/m/7903.1310671...@sss.pgh.pa.us
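
For clients that want this information, the catalogs are the authoritative
source rather than the relhaspkey hint anyway - a rough sketch, with
'mytable' as a placeholder name:

SELECT c.relname   AS table_name,
       con.conname AS pk_name
FROM pg_constraint con
JOIN pg_class c ON c.oid = con.conrelid
WHERE con.contype = 'p'
  AND c.relname = 'mytable';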

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Contention preventing locking

2018-02-26 Thread Amit Kapila
On Thu, Feb 15, 2018 at 9:30 PM, Konstantin Knizhnik
 wrote:
> Hi,
>
> PostgreSQL performance degrades significantly in case of high contention.
> You can look at the attached YCSB results (ycsb-zipf-pool.png) to estimate
> the level of this degradation.
>
> Postgres acquires two kinds of heavyweight locks during an update: it locks
> the TID of the updated tuple and the XID of the transaction that created
> this version.
> In the debugger I see the following picture: if several transactions are
> trying to update the same record, then first of all they compete for the
> SnapshotDirty.xmax transaction lock in EvalPlanQualFetch.
> Then in heap_tuple_update they try to lock the TID of the updated tuple:
> heap_acquire_tuplock in heapam.c
>

There is no function named heap_tuple_update.  Do you mean to say heap_update?

> After that, transactions wait for completion of the transaction that
> updated the tuple: XactLockTableWait in heapam.c
>
> So in heap_acquire_tuplock all competing transactions wait on the TID of
> the updated version. When the transaction which changed this tuple commits,
> one of the competitors is granted this lock and proceeds, creating a new
> version of the tuple. Then all the other competitors are awakened and ...
> find out that the locked tuple is not the latest version any more.
> They then locate the new version and try to lock it... The more competitors
> we have, the more attempts they all have to perform (quadratic complexity).
>

The attempts are controlled by marking the tuple as locked by me after
waiting on SnapshotDirty.xmax.  So, the backend which marks the tuple
as locked must get the tuple to update and I think it is ensured in
code that only one backend will be allowed to do so.  I can see some
serialization in this logic, but I think we should first try to understand
the theory behind the contention problem you are seeing before trying
to find a solution for it.
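
For context, a minimal way to reproduce this class of hot-row contention
(a toy schema, not the YCSB setup from upthread):

CREATE TABLE hot_row (id int PRIMARY KEY, value bigint NOT NULL);
INSERT INTO hot_row VALUES (1, 0);

-- Run from many concurrent sessions (e.g. as a custom pgbench script);
-- every backend ends up in the tuple-lock / XactLockTableWait dance on
-- the same row:
UPDATE hot_row SET value = value + 1 WHERE id = 1;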

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Contention preventing locking

2018-02-26 Thread Amit Kapila
On Tue, Feb 20, 2018 at 10:34 PM, Konstantin Knizhnik
 wrote:
>
>
> On 20.02.2018 19:39, Simon Riggs wrote:
>>
>> On 20 February 2018 at 16:07, Konstantin Knizhnik
>>  wrote:
>>>
>>>
>>> On 20.02.2018 14:26, Simon Riggs wrote:

 Try locking the root tid rather than the TID, that is at least unique
 per page for a chain of tuples, just harder to locate.

>>> As far as I understand, it is necessary to traverse the whole page to
>>> locate the root tuple, isn't it?
>>> If so, then I expect it to be too expensive an operation. Scanning the
>>> whole page on tuple update does not seem to be an acceptable solution.
>>
>> Probably.
>>
>> It occurs to me that you can lock the root tid in index_fetch_heap().
>> I hear other DBMS lock via the index.
>>
>> However, anything you do with tuple locking could interact badly with
>> heap_update and the various lock modes, so be careful.
>>
>> You also have contention for heap_page_prune_opt() and with SELECTs to
>> consider, so I think you need to look at both page and tuple locking.
>>
>
> So, if I understand correctly, the primary goal of setting the tuple lock
> in heapam.c is to avoid contention caused by the concurrent release of all
> waiters.
> But my transaction lock chaining patch eliminates this problem in another
> way.
> So what about combining your patch (do not lock Snapshot.xmax) with my
> xlock patch and ... completely eliminating the tuple lock in heapam?
> In this case an update of a tuple will require obtaining just one
> heavyweight lock.
>
> I made such an experiment and didn't find any synchronization problems
> with my pgrw test.
> Performance is almost the same as with vanilla+xlock patch:
>
> https://docs.google.com/spreadsheets/d/1QOYfUehy8U3sdasMjGnPGQJY8JiRfZmlS64YRBM0YTo/edit?usp=sharing
>
> I wonder why the approach with an extra tuple lock was chosen instead of
> chaining transaction locks (which can be done quite easily)?

Can you please explain how it can be done easily without extra tuple
locks?  I have tried to read your patch, but due to the lack of comments
it is not clear what you are trying to achieve.  As far as I can see,
you are changing the locktag passed to LockAcquireExtended by the
first waiter for the transaction.  How will that achieve the serial
waiting protocol (queueing the waiters for the tuple) for a particular
tuple being updated?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada  wrote:
> Here are review comments for the v4 patch.
>
> @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
>      * We don't insert a vacuum delay point here, because we have an
>      * exclusive lock on the table which we want to hold for as short a
>      * time as possible.  We still need to check for interrupts however.
> +    * We might have to acquire the autovacuum lock, however, but that
> +    * shouldn't pose a deadlock risk.
>      */
>     CHECK_FOR_INTERRUPTS();
>
> Is this change necessary?

I don't recall doing that change. Maybe a rebase gone wrong.

> 
> +   /*
> +    * If there are no indexes then we should periodically vacuum the FSM
> +    * on huge relations to make free space visible early.
> +    */
> +   if (nindexes == 0 &&
> +       (vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages)
> +   {
> +       /* Vacuum the Free Space Map */
> +       FreeSpaceMapVacuum(onerel, max_freespace);
> +       vacuumed_pages_at_fsm_vac = vacuumed_pages;
> +       max_freespace = 0;
> +   }
>
> I think this code block should be executed before we check whether the
> page is new or empty and then do 'continue'. Otherwise we cannot
> reach this code if the table has a lot of new or empty pages.

In order for the counter (vacuumed_pages) to increase, there have to
be plenty of opportunities for this code to run, and I specifically
wanted to avoid vacuuming the FSM too often for those cases
particularly (when Vacuum scans lots of pages but does no writes).
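
As an aside, what the FSM currently advertises can be inspected with the
pg_freespacemap contrib module (the table name below is just an example):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Per-block free space as recorded at the FSM leaf level; the point of
-- this patch is to propagate those values up the FSM tree sooner, so
-- that backends searching for free space can actually find them.
SELECT blkno, avail
FROM pg_freespace('some_big_table')
WHERE avail > 0
ORDER BY avail DESC
LIMIT 10;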

> 
> @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
>   * If minValue > 0, the updated page is also searched for a page with at
>   * least minValue of free space. If one is found, its slot number is
>   * returned, -1 otherwise.
> + *
> + * If minValue == 0, the value at the root node is returned.
>   */
>  static int
>  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr,
> uint16 slot,
>
> addr.level == FSM_BOTTOM_LEVEL,
>true);
> }
> +   else
> +   newslot = fsm_get_avail(page, 0);
>
> I think it's not a good idea to make fsm_set_and_search return either a
> slot number or a category value depending on the argument. Category
> values are actually uint8 but this function returns them as int.

Should be fine; a uint8 can be contained inside an int on all platforms.

> Also, we can call fsm_set_and_search with minValue = 0 at
> RecordAndGetPageWithFreeSpace() when search_cat is 0. With your patch,
> fsm_set_and_search() then returns the root value, which is not correct.

I guess I can add another function to do that.

> Also, is this change necessary for this patch? ISTM this change is
> used for the change in fsm_update_recursive() as follows but I guess
> this change can be a separate patch.
>
> +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
> +
> +   /*
> +* Bubble up, not the value we just set, but the one now in the root
> +* node of the just-updated page, which is the page's highest value.
> +*/

I can try to separate them I guess.



Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-26 Thread Robert Haas
On Mon, Feb 26, 2018 at 6:38 AM, Jeevan Chalke
 wrote:
> One idea I thought about is to memcpy the struct once we have set all
> required fields for grouped_rel so that we don't have to do similar stuff
> for partially_grouped_rel.

I think that would be a poor idea.  We want to copy a few specific
fields, not everything, and copying those fields is cheap, because
they are just simple assignment statements.  I think memcpy()'ing the
whole structure would be using a sledgehammer to solve a problem for
which a scalpel is more suited.

> Paths in pathlist are added by add_path(). Though the pathlist is sorted
> with the cheapest total path first, we generally use
> RelOptInfo->cheapest_total_path instead of the first entry, unlike with
> partial paths. But here you use the first entry, as in the partial paths
> case. Would it be better to use the cheapest total path from
> partially_grouped_rel? This would require calling set_cheapest on
> partially_grouped_rel before we call this function.

Hmm, I guess that seems like a reasonable approach, although I am not
sure it matters much either way.

> Attached is a top-up patch doing this, along with a few indentation fixes.

I don't see much point to the change in generate_gather_paths -- that
line is only 77 characters long.

Committed after incorporating your other fixes and updating the
optimizer README.
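
For anyone following along: once the full feature lands in a release, its
effect can be seen roughly like this (the GUC is
enable_partitionwise_aggregate in PostgreSQL 11; the table and column names
here are just assumptions):

SET enable_partitionwise_aggregate = on;

EXPLAIN (COSTS OFF)
SELECT tenant_id, count(*)
FROM measurements            -- assumed to be partitioned by tenant_id
GROUP BY tenant_id;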

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire  wrote:
>> 
>> +   /*
>> +    * If there are no indexes then we should periodically vacuum the FSM
>> +    * on huge relations to make free space visible early.
>> +    */
>> +   if (nindexes == 0 &&
>> +       (vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages)
>> +   {
>> +       /* Vacuum the Free Space Map */
>> +       FreeSpaceMapVacuum(onerel, max_freespace);
>> +       vacuumed_pages_at_fsm_vac = vacuumed_pages;
>> +       max_freespace = 0;
>> +   }
>>
>> I think this code block should be executed before we check whether the
>> page is new or empty and then do 'continue'. Otherwise we cannot
>> reach this code if the table has a lot of new or empty pages.
>
> In order for the counter (vacuumed_pages) to increase, there have to
> be plenty of opportunities for this code to run, and I specifically
> wanted to avoid vacuuming the FSM too often for those cases
> particularly (when Vacuum scans lots of pages but does no writes).

Wait, I see what you mean. Entirely empty pages.

Ok, I can move it.


