Re: Passing CopyMultiInsertInfo structure to CopyMultiInsertInfoNextFreeSlot()

2019-05-16 Thread Ashutosh Sharma
On Fri, May 17, 2019 at 3:10 AM Alvaro Herrera 
wrote:

> On 2019-May-14, Michael Paquier wrote:
>
> > On Tue, May 14, 2019 at 01:19:30PM +1200, David Rowley wrote:
> > > When I wrote the code I admit that I was probably wearing my
> > > object-orientated programming hat. I had in mind that the whole
> > > function series would have made a good class.  Passing the
> > > CopyMultiInsertInfo was sort of the non-OOP equivalent to having
> > > this/Me/self available, as it would be for any instance method of a
> > > class.  Back to reality, this isn't OOP, so I was wearing the wrong
> > > hat.  I think the unused parameter should likely be removed.  It's
> > > probably not doing a great deal of harm since the function is static
> > > inline and the compiler shouldn't be producing any code for the unused
> > > param, but for the sake of preventing confusion, it should be removed.
> > > Ashutosh had to ask about it, so it wasn't immediately clear what the
> > > purpose of it was. Since there's none, be gone with it, I say.
> >
> > Sounds fair to me.  This has been introduced by 86b8504, so let's see
> > what's Andres take.
>
> If this were up to me, I'd leave the function signature alone, and just add
> (void) miinfo;  /* unused parameter */
> to the function code.  It seems perfectly reasonable to have that
> function argument, and a little weird not to have it.
>
>
I think we should only do that if there are any circumstances under
which 'miinfo' might be used; otherwise it would be good to remove it unless
there is some specific reason for having it. As an example, please refer to
the following code in printTableAddHeader() or printTableAddCell().

#ifndef ENABLE_NLS
(void) translate;   /* unused parameter */
#endif

The function argument *translate* has been marked as unused, but only for
non-NLS builds, which means it will be used in an NLS-enabled build. But
I do not see any such requirement in our case. Please let me know if I am
missing anything here.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


Re: behaviour change - default_tablespace + partition table

2019-05-16 Thread Amit Langote
Agree that this behavior change seems unintentional.

On 2019/05/17 12:40, Rushabh Lathia wrote:
> Looking at the commit changes, it seems like in the case when no other
> tablespace is specified, we default the tablespace to the parent partitioned
> table's.
> 
> else if (stmt->partbound)
> {
> /*
>  * For partitions, when no other tablespace is specified, we default
>  * the tablespace to the parent partitioned table's.
>  */
> Assert(list_length(inheritOids) == 1);
> tablespaceId = get_rel_tablespace(linitial_oid(inheritOids));
> }
> 
> But here it doesn't consider the default_tablespace if the parent
> partitioned table's tablespace is InvalidOid (which was the case
> before this commit).
> 
> PFA patch to fix the same.

+
+   if (!OidIsValid(tablespaceId))
+   tablespaceId = GetDefaultTablespace(stmt->relation->relpersistence, partitioned);
}
else
tablespaceId = GetDefaultTablespace(stmt->relation->relpersistence,

Why not change it like this instead:

@@ -681,7 +681,8 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 Assert(list_length(inheritOids) == 1);
 tablespaceId = get_rel_tablespace(linitial_oid(inheritOids));
 }
-else
+
+if (!OidIsValid(tablespaceId))
 tablespaceId = GetDefaultTablespace(stmt->relation->relpersistence,
 partitioned);

Thanks,
Amit





Re: jsonpath

2019-05-16 Thread Alexander Korotkov
A couple of patches improving the jsonpath docs are attached.  The first one
documents the nesting level filter for the .** accessor.  The second adds to
the documentation of jsonpath array subscripting the usage of expressions and
multiple slice ranges.  I'm going to push both patches if no
objections.
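
For readers unfamiliar with the accessor, here is a minimal illustration of
the nesting level filter (an illustration of mine, not taken from the patches):

  -- .**{level} restricts the recursive wildcard accessor to one nesting
  -- depth; level 0 is the current item, so {2} selects the grandchild
  SELECT jsonb_path_query('{"a": {"b": {"c": 1}}}'::jsonb, '$.**{2}');
  -- expected to return {"c": 1}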

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-doc-jsonpath-nesting-level-filter.patch
Description: Binary data


0002-doc-jsonpath-subscription-syntax-improve.patch
Description: Binary data


Re: Adding a test for speculative insert abort case

2019-05-16 Thread Melanie Plageman
On Thu, May 16, 2019 at 2:03 PM Andres Freund  wrote:

> Hi,
>
> On 2019-05-16 13:59:47 -0700, Melanie Plageman wrote:
> > On Wed, May 15, 2019 at 10:32 PM Ashwin Agrawal 
> wrote:
> >
> > >
> > > The second index would help to hold the session after inserting the
> tuple
> > > in unique index but before completing the speculative insert. Hence,
> helps
> > > to create the condition easily. I believe order of index insertion is
> > > helping here that unique index is inserted and then non-unique index is
> > > inserted too.
> > >
> > >
> > Oh, cool. I didn't know that execution order would be guaranteed for
> which
> > index
> > to insert into first.
>
> It's not *strictly* speaking *always* well defined. The list of indexes
> is sorted by the oid of the index - so once created, it's
> consistent. But when the oid assignment wraps around, it'd be the other
> way around. But I think it's ok to disregard that - it'll never happen
> in regression tests run against a new cluster, and you'd have to run
> tests against an installed cluster for a *LONG* time for a *tiny* window
> where the wraparound would happen precisely between the creation of the
> two indexes.
>
> Makes sense?
>

Yep, thanks.


> I guess we could make that case a tiny bit easier to diagnose in the
> extremely unlikely case it happens by having a step that outputs
> SELECT 'index_a'::regclass::int8 < 'index_b'::regclass::int8;
>
>
Good idea.
I squashed the changes I suggested in previous emails, Ashwin's patch, my
suggested updates to that patch, and the index order check all into one
updated
patch attached.

-- 
Melanie Plageman


0003-Add-test-to-validate-speculative-wait-is-performed.patch
Description: Binary data


behaviour change - default_tablespace + partition table

2019-05-16 Thread Rushabh Lathia
Hi,

Consider the below test:

create tablespace mytbs location '/home/rushabh/mywork/workspace/pg/';
create table test ( a int , b int ) partition by list (a);

set default_tablespace to mytbs;
create table test_p1 partition of test for values in (1);

In the above test, after setting default_tablespace I am creating
a partition of the table and expecting it to get created under the "mytbs"
tablespace.

But that is not the case:

postgres@66247=#select relname, reltablespace from pg_class where relname =
'test_p1';
 relname | reltablespace
-+---
 test_p1 | 0
(1 row)
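
For clarity, reltablespace = 0 here means the partition ended up in the
database's default tablespace rather than in "mytbs".  A small query (not part
of the original report) that shows this by name:

  -- reltablespace = 0 means "database default", so spcname comes back
  -- NULL here instead of 'mytbs'
  select c.relname, t.spcname
  from pg_class c
  left join pg_tablespace t on t.oid = c.reltablespace
  where c.relname = 'test_p1';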

I noticed this behaviour change for default_tablespace with partitioned tables
with the below commit.

commit 87259588d0ab0b8e742e30596afa7ae25caadb18
Author: Alvaro Herrera 
Date:   Thu Apr 25 10:20:23 2019 -0400

Fix tablespace inheritance for partitioned rels

Commit ca4103025dfe left a few loose ends.  The most important one
(broken pg_dump output) is already fixed by virtue of commit
3b23552ad8bb, but some things remained:

I don't think that new behaviour is intended, and if it is an intended change
then pg_dump needs to be fixed as well - otherwise, let's say we:
1) create the above test on v11
2) take a dump
3) restore on v12

we will end up with the partition in the wrong tablespace.

Looking at the commit changes, it seems like in the case when no other
tablespace is specified, we default the tablespace to the parent partitioned
table's.

else if (stmt->partbound)
{
/*
 * For partitions, when no other tablespace is specified, we default
 * the tablespace to the parent partitioned table's.
 */
Assert(list_length(inheritOids) == 1);
tablespaceId = get_rel_tablespace(linitial_oid(inheritOids));
}

But here it doesn't consider the default_tablespace if the parent
partitioned table's tablespace is InvalidOid (which was the case
before this commit).

PFA patch to fix the same.

Thanks,

-- 
Rushabh Lathia
www.EnterpriseDB.com
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bfcf947..0cc9c0b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -680,6 +680,9 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		 */
 		Assert(list_length(inheritOids) == 1);
 		tablespaceId = get_rel_tablespace(linitial_oid(inheritOids));
+
+		if (!OidIsValid(tablespaceId))
+			tablespaceId = GetDefaultTablespace(stmt->relation->relpersistence, partitioned);
 	}
 	else
 		tablespaceId = GetDefaultTablespace(stmt->relation->relpersistence,


Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-05-16 Thread Andres Freund
Hi,

On 2019-05-08 13:06:36 +0900, Kyotaro HORIGUCHI wrote:
> This behavior is introduced by 69c3936a14 (in v11).  At that time
> FunctionCallInfoData is palloc0'ed and has fixed length members
> arg[6] and argnull[7]. So nulls[1] is always false even if nargs
> = 1 so the issue had not been revealed.

> After introducing a9c35cf85c (in v12) the same check is done on
> FunctionCallInfoData that has NullableDatum args[] of required
> number of elements. In that case args[1] is out of palloc'ed
> memory so this issue has been revealed.

> On a second look, it seems to me that the right thing to do here
> is setting numInputs instead of numArguments to numTransInputs in
> the combining step.

Yea, to me this just seems a consequence of the wrong
numTransInputs. Arguably this is a bug going back to 9.6, where
combining aggregates were introduced. It's just that numTransInputs
isn't used anywhere for combining aggregates before 11.

Its documentation says:

/*
 * Number of aggregated input columns to pass to the transfn.  This
 * includes the ORDER BY columns for ordered-set aggs, but not for plain
 * aggs.  (This doesn't count the transition state value!)
 */
int numTransInputs;

which IMO is violated by having it set to the plain aggregate's value,
rather than the combine func.

While I agree that fixing numTransInputs is the right way, I'm not
convinced the way you did it is the right approach. I'm somewhat
inclined to think that it's wrong that ExecInitAgg() calls
build_pertrans_for_aggref() with a numArguments that's not actually
right? Alternatively I think we should just move the numTransInputs
computation into the existing branch around DO_AGGSPLIT_COMBINE.

It seems pretty clear that this needs to be fixed for v11, it seems too
fragile to rely on trans_fcinfo->argnull[2] being zero initialized.

I'm less sure about fixing it for 9.6/10. There's no use of
numTransInputs for combining back then.

David, I assume you didn't adjust numTransInputs plainly because it
wasn't needed / you didn't notice? Do you have a preference for a fix?



Independent of these changes, some of the code around partial, ordered
set and polymorphic aggregates really makes it hard to understand things:

/* Detect how many arguments to pass to the finalfn */
if (aggform->aggfinalextra)
peragg->numFinalArgs = numArguments + 1;
else
peragg->numFinalArgs = numDirectArgs + 1;

What on earth is that supposed to mean? Sure, the +1 is obvious, but why
the different sources for arguments are needed isn't - especially
because numArguments was just calculated with the actual aggregate
inputs. Nor is aggfinalextra's documentation particularly elucidating:
/* true to pass extra dummy arguments to aggfinalfn */
bool        aggfinalextra BKI_DEFAULT(f);

especially not why aggfinalextra means we have to ignore direct
args. Presumably because aggfinalextra just emulates what direct args
does for ordered set args, but we allow both to be set.

Similarly

/* Detect how many arguments to pass to the transfn */
if (AGGKIND_IS_ORDERED_SET(aggref->aggkind))
pertrans->numTransInputs = numInputs;
else
pertrans->numTransInputs = numArguments;

is hard to understand, without additional comments. One can, looking
around, infer that it's because ordered set aggs need sort columns
included. But that should just have been mentioned.

And to make sense of build_aggregate_transfn_expr()'s treatment of
direct args, one has to know that direct args are only possible for
ordered set aggregates. Which IMO is not obvious in nodeAgg.c.
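
As a reference point, a concrete ordered-set aggregate call (an illustration,
not taken from the code under discussion) shows where direct arguments and the
extra sort-column input come from:

  -- 0.5 is the direct argument, evaluated once per group rather than per row;
  -- x is the aggregated input, which also supplies the ORDER BY sort column,
  -- so numTransInputs has to count the sort columns for ordered-set aggs
  SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
  FROM generate_series(1, 100) AS t(x);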

...

I feel this code has become quite creaky in the last few years.

Greetings,

Andres Freund




Re: vacuumdb and new VACUUM options

2019-05-16 Thread Michael Paquier
On Wed, May 15, 2019 at 03:44:22PM +0900, Masahiko Sawada wrote:
> I've attached new version patch that takes the way to let the backend
> parser do all work.

I was wondering how the error handling gets by not having the parsing
on the frontend, and it could be worse:
$ vacuumdb --index-cleanup=popo
vacuumdb: vacuuming database "postgres"
vacuumdb: error: vacuuming of table "pg_catalog.pg_proc" in database
"postgres" failed: ERROR:  index_cleanup requires a Boolean value
$ vacuumdb --truncate=popo
vacuumdb: vacuuming database "postgres"
vacuumdb: error: vacuuming of table "pg_catalog.pg_proc" in database
"postgres" failed: ERROR:  truncate requires a Boolean value

For TRUNCATE, we actually get to the same error, and INDEX_CLEANUP
just differs with the separator between the two terms.  I think that we
could live with that for simplicity's sake.  Perhaps others have
different opinions though.

+   if (vacopts.index_cleanup != NULL)
Checking directly for NULL-ness here is inconsistent with the previous
callers.

+$node->issues_sql_like(
+   [ 'vacuumdb', '--index-cleanup=true', 'postgres' ],
+   qr/statement: VACUUM \(INDEX_CLEANUP true\).*;/,
+   'vacuumdb --index-cleanup=true')
We should have a failure test here, instead of testing the same boolean
parameter two times with opposite values, to make sure that we still
complain on invalid input values.

+ Specify that VACUUM should attempt to remove
+ index entries pointing to dead tuples. If this option is not specified
+ the behavior depends on vacuum_index_cleanup option
+ for the table to be vacuumed.
The descriptions of the other commands do not mention VACUUM directly, and
have a more straightforward description of the goal of the option.
So the first sentence could be reworked as follows:
Removes index entries pointing to dead tuples.

And the second for --truncate:
Truncates any empty pages at the end of the relation.

My 2c.
--
Michael


signature.asc
Description: PGP signature


Re: pglz performance

2019-05-16 Thread Michael Paquier
On Thu, May 16, 2019 at 10:13:22PM +0500, Andrey Borodin wrote:
> Meanwhile I made some enhancements to test suit:
> Intel Server
> NOTICE:  0: Decompressor pglz_decompress_hacked result 10.346763
> NOTICE:  0: Decompressor pglz_decompress_hacked8 result 11.192078
> NOTICE:  0: Decompressor pglz_decompress_hacked16 result 11.957727
> NOTICE:  0: Decompressor pglz_decompress_vanilla result 14.262256
> 
> ARM Server
> NOTICE:  Decompressor pglz_decompress_hacked result 12.98
> NOTICE:  Decompressor pglz_decompress_hacked8 result 13.004935
> NOTICE:  Decompressor pglz_decompress_hacked16 result 13.043015
> NOTICE:  Decompressor pglz_decompress_vanilla result 18.239242
> 
> Power9 Server
> NOTICE:  Decompressor pglz_decompress_hacked result 10.992974
> NOTICE:  Decompressor pglz_decompress_hacked8 result 11.747443
> NOTICE:  Decompressor pglz_decompress_hacked16 result 11.026342
> NOTICE:  Decompressor pglz_decompress_vanilla result 16.375315
> 
> Intel laptop
> NOTICE:  Decompressor pglz_decompress_hacked result 9.445808
> NOTICE:  Decompressor pglz_decompress_hacked8 result 9.105360
> NOTICE:  Decompressor pglz_decompress_hacked16 result 9.621833
> NOTICE:  Decompressor pglz_decompress_vanilla result 10.661968
> 
> From these results pglz_decompress_hacked looks best.

That's nice.

From the numbers you are presenting here, all of them are much better
than the original, and there is not much difference between any of the
patched versions.  Having a 20%~30% improvement with a patch is very
nice.

After that comes the simplicity and the future maintainability of what
is proposed.  I am not much into accepting a patch which has a 1%~2%
impact on some hardware and makes pglz much more complex and harder
to understand.  But I am really eager to see a patch with at least a
10% improvement which remains simple, even more if it simplifies the
logic used in pglz.
--
Michael


signature.asc
Description: PGP signature


Re: Fixing order of resowner cleanup in 12, for Windows

2019-05-16 Thread Thomas Munro
On Thu, May 9, 2019 at 10:23 PM Thomas Munro  wrote:
> The reason the patch didn't solve the problem is that
> AtEOXact_Parallel() calls DestroyParallelContext().  So DSM segments
> that happen to belong to ParallelContext objects are already gone by
> the time resowner.c gets involved.

This was listed as an open item for PostgreSQL 12, but I'm going to
move it to "older bugs".  I want to fix it, but now that I understand
what's wrong, it's a slightly bigger design issue than I'm game to try
to fix right now.

This means that 12, like 11, will be capable of leaking empty
temporary directories on Windows whenever an error is raised in the
middle of parallel CREATE INDEX or multi-batch Parallel Hash Join.
The directories are eventually unlinked at restart, and at least
the (potentially large) files inside the directory are unlinked on
abort.  I think we can live with that for a bit longer.


--
Thomas Munro
https://enterprisedb.com




Re: VACUUM fails to parse 0 and 1 as boolean value

2019-05-16 Thread Michael Paquier
On Thu, May 16, 2019 at 03:29:36PM -0400, Robert Haas wrote:
> I'm not sure how much value it really has to define
> opt_boolean_or_string_or_numeric.  It saves 1 line of code in each of
> 3 places, but costs 6 lines of code to have it.
> 
> Perhaps we could try to unify at a higher level.  Like can we merge
> vac_analyze_option_list with explain_option_list?

var_value also has similar semantics, and it uses makeAConst().  At
this point of the game, I'd like to think that it would be just better
to leave all the refactoring behind us on HEAD, to apply the first
patch presented on this thread, and to work more on that for v13 once
it opens for business, if there is a patch to discuss.
--
Michael


signature.asc
Description: PGP signature


Re: VACUUM fails to parse 0 and 1 as boolean value

2019-05-16 Thread Kyotaro HORIGUCHI
Mmm. It went out before it was complete.

At Fri, 17 May 2019 10:21:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190517.102121.72558057.horiguchi.kyot...@lab.ntt.co.jp>
> We now have several syntax elements seemingly the same but behave
> different way.
> 
> At Thu, 16 May 2019 15:29:36 -0400, Robert Haas  wrote 
> in 
> > On Thu, May 16, 2019 at 2:56 PM Fujii Masao  wrote:
> > > Yes. Thanks for the comment!
> > > Attached is the updated version of the patch.
> > > It adds such common rule.
> > 
> > I'm not sure how much value it really has to define
> > opt_boolean_or_string_or_numeric.  It saves 1 line of code in each of
> > 3 places, but costs 6 lines of code to have it.
> 
> ANALYZE (options) doesn't accept 1/0 but only accepts true/false
> or on/off. Why are we going to make VACUUM behave differently?

The patch changes the behavior of ANALYZE as well. Please ignore
this.

> And the documentation for ANALYZE doesn't mention the change.
> 
> I think we don't need to support 1/0 as boolean here (it's
> unnatural) and the documentation of VACUUM/ANALYZE should be
> fixed.
> 
> > Perhaps we could try to unify at a higher level.  Like can we merge
> > vac_analyze_option_list with explain_option_list?
> 
> Also REINDEX (VERBOSE) doesn't accept an explicit argument as of
> now. (reindex_option_list)
> 
> I'm not sure about FDW/SERVER/CREATE USER MAPPING but perhaps
> it's different from this.
> 
> COPY .. WITH (options) doesn't accept 1/0 as boolean.
> 
> copy_generic_opt_arg:
>   opt_boolean_or_string  { $$ = (Node *) makeString($1); }
>   | NumericOnly  { $$ = (Node *) $1; }
>   | '*'  { $$ = (Node *) makeNode(A_Star); }
>   | '(' copy_generic_opt_arg_list ')'{ $$ = (Node *) $2; }
>   | /* EMPTY */  { $$ = NULL; }
> ;

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: VACUUM fails to parse 0 and 1 as boolean value

2019-05-16 Thread Kyotaro HORIGUCHI
We now have several syntax elements seemingly the same but behave
different way.

At Thu, 16 May 2019 15:29:36 -0400, Robert Haas  wrote 
in 
> On Thu, May 16, 2019 at 2:56 PM Fujii Masao  wrote:
> > Yes. Thanks for the comment!
> > Attached is the updated version of the patch.
> > It adds such common rule.
> 
> I'm not sure how much value it really has to define
> opt_boolean_or_string_or_numeric.  It saves 1 line of code in each of
> 3 places, but costs 6 lines of code to have it.

ANALYZE (options) doesn't accept 1/0 but only accepts true/false
or on/off. Why are we going to make VACUUM behave differently?

And the documentation for ANALYZE doesn't mention the change.

I think we don't need to support 1/0 as boolean here (it's
unnatural) and the documentation of VACUUM/ANALYZE should be
fixed.

> Perhaps we could try to unify at a higher level.  Like can we merge
> vac_analyze_option_list with explain_option_list?

Also REINDEX (VERBOSE) doesn't accept an explicit argument as of
now. (reindex_option_list)

I'm not sure about FDW/SERVER/CREATE USER MAPPING but perhaps
it's different from this.

COPY .. WITH (options) doesn't accept 1/0 as boolean.

copy_generic_opt_arg:
  opt_boolean_or_string  { $$ = (Node *) makeString($1); }
  | NumericOnly  { $$ = (Node *) $1; }
  | '*'  { $$ = (Node *) makeNode(A_Star); }
  | '(' copy_generic_opt_arg_list ')'{ $$ = (Node *) $2; }
  | /* EMPTY */  { $$ = NULL; }
;

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Re: SQL/JSON: functions

2019-05-16 Thread Alexander Korotkov
Hi!

On Tue, May 14, 2019 at 3:54 AM Andrew Alsup  wrote:
> array slicing [0:], [:1], and [0:1] do not work:$.c1.c2[0:].b3
>
> # select json_path_query('{"a1": 123, "b1": "xxx", "c1": {"a2": 456,
> "b2": "yyy", "c2": [{"a3": 777, "b3": "7z"},{"a3": 888, "b3":
> "8z"}]}}'::json, '$.c1.c2[0:].b3'::jsonpath);
> 2019-05-13 20:47:48.740 EDT [21856] ERROR:  bad jsonpath representation
> at character 147
> 2019-05-13 20:47:48.740 EDT [21856] DETAIL:  syntax error, unexpected
> ':', expecting ',' or ']' at or near ":"
> 2019-05-13 20:47:48.740 EDT [21856] STATEMENT:  select
> json_path_query('{"a1": 123, "b1": "xxx", "c1": {"a2": 456, "b2": "yyy",
> "c2": [{"a3": 777, "b3": "7z"},{"a3": 888, "b3": "8z"}]}}'::json,
> '$.c1.c2[0:].b3'::jsonpath);
> ERROR:  bad jsonpath representation
> LINE 1: ...7, "b3": "7z"},{"a3": 888, "b3": "8z"}]}}'::json, '$.c1.c2[0...
>   ^
> DETAIL:  syntax error, unexpected ':', expecting ',' or ']' at or near ":"

There is no colon syntax for array slicing in jsonpath.  Instead of
"[0:]" one should write "[0 to last]".  See the documentation
https://www.postgresql.org/docs/devel/datatype-json.html#TYPE-JSONPATH-ACCESSORS
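
For example, the failing query above can be rewritten with the "to" slice
syntax; a minimal sketch using the committed jsonb variant of the function:

  SELECT jsonb_path_query(
    '{"a1": 123, "b1": "xxx",
      "c1": {"a2": 456, "b2": "yyy",
             "c2": [{"a3": 777, "b3": "7z"}, {"a3": 888, "b3": "8z"}]}}'::jsonb,
    '$.c1.c2[0 to last].b3'::jsonpath);
  -- expected to return "7z" and "8z", one row per match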

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-05-16 Thread Thomas Munro
On Fri, May 17, 2019 at 11:46 AM Tomas Vondra
 wrote:
> On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:
> >Thomas Munro  writes:
> >> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
> >>  wrote:
> >>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
> >>> if we decide adding batches would be pointless, increasing the memory
> >>> budget is the only thing we can do anyway.
> >
> >> But that's not OK, we need to fix THAT.
> >
> >I don't think it's necessarily a good idea to suppose that we MUST
> >fit in work_mem come what may.  It's likely impossible to guarantee
> >that in all cases.  Even if we can, a query that runs for eons will
> >help nobody.
>
> I kinda agree with Thomas - arbitrarily increasing work_mem is something
> we should not do unless absolutely necessary. If the query is slow, it's
> up to the user to bump the value up, if deemed appropriate.

+1

I think we can guarantee that we can fit in work_mem with only one
exception: we have to allow work_mem to be exceeded when we otherwise
couldn't fit a single tuple.

Then the worst possible case with the looping algorithm is that we
degrade to loading just one inner tuple at a time into the hash table,
at which point we effectively have a nested loop join (except (1) it's
flipped around: for each tuple on the inner side, we scan the outer
side; and (2) we can handle full outer joins).  In any reasonable case
you'll have a decent amount of tuples at a time, so you won't have to
loop too many times so it's not really quadratic in the number of
tuples.  The realisation that it's a nested loop join in the extreme
case is probably why the MySQL people called it 'block nested loop
join' (and as far as I can tell from quick googling, it might be their
*primary* strategy for hash joins that don't fit in memory, not just a
secondary strategy after Grace fails, but I might be wrong about
that).  Unlike plain old single-tuple nested loop join, it works in
arbitrary sized blocks (the hash table).  What we would call a regular
hash join, they call a BNL that just happens to have only one loop.  I
think Grace is probably a better primary strategy, but loops are a
good fallback.

The reason I kept mentioning sort-merge in earlier threads is because
it'd be better in the worst cases.  Unfortunately it would be worse in
the best case (smallish numbers of loops) and I suspect many real
world cases.  It's hard to decide, so perhaps we should be happy that
sort-merge can't be considered currently because the join conditions
may not be merge-joinable.


--
Thomas Munro
https://enterprisedb.com




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-05-16 Thread Tomas Vondra

On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:

Thomas Munro  writes:

On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
 wrote:

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.



But that's not OK, we need to fix THAT.


I don't think it's necessarily a good idea to suppose that we MUST
fit in work_mem come what may.  It's likely impossible to guarantee
that in all cases.  Even if we can, a query that runs for eons will
help nobody.



I kinda agree with Thomas - arbitrarily increasing work_mem is something
we should not do unless absolutely necessary. If the query is slow, it's
up to the user to bump the value up, if deemed appropriate.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Parallel Foreign Scans - need advice

2019-05-16 Thread Thomas Munro
On Fri, May 17, 2019 at 12:45 AM Korry Douglas  wrote:
> It sounds like there is no reliable way to get the information that I’m 
> looking for, is that right?

Correct.  And if there were, it could only be used to write bugs.  Let
me see if I can demonstrate...  I'll use the file_fdw patch from the
link I gave before, and I'll add an elog(LOG) message to show when
fileIterateForeignScan() runs.

$ echo 1 > /tmp/t2

postgres=# create table t1 as select generate_series(1, 1000000)::int i;
SELECT 1000000
postgres=# create server files foreign data wrapper file_fdw;
CREATE SERVER
postgres=# create foreign table t2 (n int) server files
  options (filename '/tmp/t2', format 'csv');
CREATE FOREIGN TABLE

The relevant EXPLAIN output is harder to understand if the parallel
leader participates, but it changes nothing important, so I'll turn
that off first, and then see how it is run:

postgres=# set parallel_leader_participation = off;
SET
postgres=# explain (analyze, verbose) select count(*) from (select *
from t1 union all select * from t2) ss;

QUERY PLAN
---
 Finalize Aggregate  (cost=14176.32..14176.33 rows=1 width=8) (actual
time=234.023..234.023 rows=1 loops=1)
   Output: count(*)
   ->  Gather  (cost=14176.10..14176.31 rows=2 width=8) (actual
time=233.840..235.079 rows=2 loops=1)
 Output: (PARTIAL count(*))
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate  (cost=13176.10..13176.11 rows=1
width=8) (actual time=223.550..223.555 rows=1 loops=2)
   Output: PARTIAL count(*)
   Worker 0: actual time=223.432..223.443 rows=1 loops=1
   Worker 1: actual time=223.667..223.668 rows=1 loops=1
   ->  Parallel Append  (cost=0.00..11926.10 rows=500000
width=0) (actual time=0.087..166.669 rows=500000 loops=2)
 Worker 0: actual time=0.083..166.366 rows=499687 loops=1
 Worker 1: actual time=0.092..166.972 rows=500314 loops=1
 ->  Parallel Seq Scan on public.t1
(cost=0.00..9425.00 rows=500000 width=0) (actual time=0.106..103.384
rows=500000 loops=2)
   Worker 0: actual time=0.123..103.106
rows=499686 loops=1
   Worker 1: actual time=0.089..103.662
rows=500314 loops=1
 ->  Parallel Foreign Scan on public.t2
(cost=0.00..1.10 rows=1 width=0) (actual time=0.079..0.096 rows=1
loops=1)
   Foreign File: /tmp/numbers
   Foreign File Size: 2 b
   Worker 0: actual time=0.079..0.096 rows=1 loops=1
 Planning Time: 0.219 ms
 Execution Time: 235.262 ms
(22 rows)

You can see the that Parallel Foreign Scan was only actually run by
one worker.  So if you were somehow expecting both of them to show up
in order to produce the correct results, you have a bug.  The reason
that happened is because Parallal Append sent one worker to chew on
t1, and another to chew on t2, but the scan of t2 was finished very
quickly, so that worker then went to help out with t1.  And for
further proof of that, here's what I see in my server log (note only
ever called twice, and in the same process):

2019-05-17 10:51:42.248 NZST [52158] LOG:  fileIterateForeignScan
2019-05-17 10:51:42.248 NZST [52158] STATEMENT:  explain analyze
select count(*) from (select * from t1 union all select * from t2) ss;
2019-05-17 10:51:42.249 NZST [52158] LOG:  fileIterateForeignScan
2019-05-17 10:51:42.249 NZST [52158] STATEMENT:  explain analyze
select count(*) from (select * from t1 union all select * from t2) ss;

Therefore you can't allocate the work up front based on expected
number of workers, even if it works in simple examples.  Your node
isn't necessarily the only node in the plan, and higher up nodes get
to decide when, if at all, you run, in each worker.

-- 
Thomas Munro
https://enterprisedb.com




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-05-16 Thread Tom Lane
Thomas Munro  writes:
> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
>  wrote:
>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>> if we decide adding batches would be pointless, increasing the memory
>> budget is the only thing we can do anyway.

> But that's not OK, we need to fix THAT.

I don't think it's necessarily a good idea to suppose that we MUST
fit in work_mem come what may.  It's likely impossible to guarantee
that in all cases.  Even if we can, a query that runs for eons will
help nobody.

regards, tom lane




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-05-16 Thread Tomas Vondra

On Fri, May 17, 2019 at 10:21:56AM +1200, Thomas Munro wrote:

On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
 wrote:

I think this is a step in the right direction, but as I said on the other
thread(s), I think we should not disable growth forever and recheck once
in a while. Otherwise we'll end up in sad situation with non-uniform data
> sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
this less strict logic, using 95% as a threshold (instead of 100%).

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.


But that's not OK, we need to fix THAT.



I agree increasing the budget is not ideal, although at the moment it's
the only thing we can do. If we can improve that, great.


The problem however is that we only really look at a single bit - it may
be that doubling the batches would not help, but doing it twice would
actually reduce the memory usage. For example, assume there are 2 distinct
values in the batch, with hash values (in binary)


Yes, that's a good point, and not a case that we should ignore.  But
if we had a decent fall-back strategy that respected work_mem, we
wouldn't care so much if we get it wrong in a corner case.  I'm
arguing that we should use Grace partitioning as our primary
partitioning strategy, but fall back to looping (or possibly
sort-merging) for the current batch if Grace doesn't seem to be
working.  You'll always be able to find cases where if you'd just
tried one more round, Grace would work, but that seems acceptable to
me, because getting it wrong doesn't melt your computer, it just
probably takes longer.  Or maybe it doesn't.  How much longer would it
take to loop twice?  Erm, twice as long, and each loop makes actual
progress, unlike extra speculative Grace partition expansions which
apply not just to the current batch but all batches, might not
actually work, and you *have* to abandon at some point.  The more I
think about it, the more I think that a loop-based escape valve, though
unpalatably quadratic, is probably OK because we're in a sink-or-swim
situation at this point, and our budget is work_mem, not work_time.



True.


I'm concerned that we're trying to find ways to treat the symptoms,
allowing us to exceed work_mem but maybe not so much, instead of
focusing on the fundamental problem, which is that we don't yet have
an algorithm that is guaranteed to respect work_mem.



Yes, that's a good point.


Admittedly I don't have a patch, just a bunch of handwaving.  One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant.  You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do.  I'll
probably start another thread about that.



That pesky parallelism ;-)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-05-16 Thread Thomas Munro
On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
 wrote:
> I think this is a step in the right direction, but as I said on the other
> thread(s), I think we should not disable growth forever and recheck once
> in a while. Otherwise we'll end up in sad situation with non-uniform data
> sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
> this less strict logic, using 95% as a threshold (instead of 100%).
>
> I kinda like the idea with increasing the spaceAllowed value. Essentially,
> if we decide adding batches would be pointless, increasing the memory
> budget is the only thing we can do anyway.

But that's not OK, we need to fix THAT.

> The problem however is that we only really look at a single bit - it may
> be that doubling the batches would not help, but doing it twice would
> actually reduce the memory usage. For example, assume there are 2 distinct
> values in the batch, with hash values (in binary)

Yes, that's a good point, and not a case that we should ignore.  But
if we had a decent fall-back strategy that respected work_mem, we
wouldn't care so much if we get it wrong in a corner case.  I'm
arguing that we should use Grace partitioning as our primary
partitioning strategy, but fall back to looping (or possibly
sort-merging) for the current batch if Grace doesn't seem to be
working.  You'll always be able to find cases where if you'd just
tried one more round, Grace would work, but that seems acceptable to
me, because getting it wrong doesn't melt your computer, it just
probably takes longer.  Or maybe it doesn't.  How much longer would it
take to loop twice?  Erm, twice as long, and each loop makes actual
progress, unlike extra speculative Grace partition expansions which
apply not just to the current batch but all batches, might not
actually work, and you *have* to abandon at some point.  The more I
think about it, the more I think that a loop-based escape valve, though
unpalatably quadratic, is probably OK because we're in a sink-or-swim
situation at this point, and our budget is work_mem, not work_time.

I'm concerned that we're trying to find ways to treat the symptoms,
allowing us to exceed work_mem but maybe not so much, instead of
focusing on the fundamental problem, which is that we don't yet have
an algorithm that is guaranteed to respect work_mem.

Admittedly I don't have a patch, just a bunch of handwaving.  One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant.  You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do.  I'll
probably start another thread about that.

-- 
Thomas Munro
https://enterprisedb.com




Re: PSA: New intel MDS vulnerability mitigations cause measurable slowdown

2019-05-16 Thread Thomas Munro
On Fri, May 17, 2019 at 9:42 AM Thomas Munro  wrote:
> pgbench -M prepared read-only transactions per second, 16 client threads:

(That second "16 client threads" line should read "32 client threads".)

-- 
Thomas Munro
https://enterprisedb.com




Re: PSA: New intel MDS vulnerability mitigations cause measurable slowdown

2019-05-16 Thread Thomas Munro
On Fri, May 17, 2019 at 5:26 AM Chapman Flack  wrote:
> On 5/16/19 12:24 PM, Albert Cervera i Areny wrote:
> > Missatge de Thomas Munro  del dia dj., 16 de
> > maig 2019 a les 13:09:
> >> With all three mitigations activated, my little dev machine has gone
> >> from being able to do ~11.8 million baseline syscalls per second to
> >
> > Did you mean "1.8"?
>
> Not in what I thought I saw:
>
> >> ~1.6 million, or ~1.4 million ...
> >>
>>   PTI   IBRS  MDS=off  MDS=VERW MDS=AVX
>>   ===   ====  =======  ======== =======
>>   off   off   11798658  4764159  3274043
>                ^^^^^^^^
>>   off   on     2652564  1941606  1655356
>>   on    off    4973053  2932906  2339779
>>   on    on     1988527  1556922  1378798
>                          ^^^^^^^  ^^^^^^^

Right.  Actually it's worse than that -- after I posted I realised
that I had some debug stuff enabled in my kernel that was slowing
things down a bit, so I reran the tests overnight with a production
kernel and here is what I see this morning.  It's actually ~17.8
million syscalls/sec -> ~1.7 million syscalls/sec, if you go from all
mitigations off to all mitigations on, or -> ~3.2 million for just PTI
+ MDS.  And the loss of TPS is ~5% for the case I was most interested
in, just turning on MDS=VERW if you already had PTI on and IBRS off.

Raw getuid() syscalls per second:

  PTI   IBRS  MDS=off  MDS=VERW MDS=AVX
  ===   ====  =======  ======== =======
  off   off   17771744  5372032  3575035
  off   on     3060923  2166527  1817052
  on    off    5622591  3150883  2463934
  on    on     2213190  1687748  1475605

pgbench read-only transactions per second, 1 client thread:

  PTI   IBRS  MDS=off  MDS=VERW MDS=AVX
  ===   ====  =======  ======== =======
  off   off    22414    22103    21571
  off   on     21298    20817    20418
  on    off    22473    22080    21550
  on    on     21286    20850    20386

pgbench -M prepared read-only transactions per second, 1 client thread:

  PTI   IBRS  MDS=off  MDS=VERW MDS=AVX
  ===   ====  =======  ======== =======
  off   off    43508    42476    41123
  off   on     40729    39483    38555
  on    off    44110    42989    42012
  on    on     41143    39990    38798

pgbench -M prepared read-only transactions per second, 4 client threads:

  PTI   IBRS  MDS=off  MDS=VERW MDS=AVX
  ===   ====  =======  ======== =======
  off   off   100735    97689    96662
  off   on     80142    77804    77064
  on    off   100540    97010    95827
  on    on     79492    76976    76226

pgbench -M prepared read-only transactions per second, 16 client threads:

  PTI   IBRS  MDS=off  MDS=VERW MDS=AVX
  ===   ====  =======  ======== =======
  off   off 161015   152978   152556
  off   on  145605   139438   139179
  onoff 155359   147691   146987
  onon  140976   134978   134177

pgbench -M prepared read-only transactions per second, 16 client threads:

  PTI   IBRS  MDS=off  MDS=VERW MDS=AVX
  ===   ====  =======  ======== =======
  off   off 157986   150132   149436
  off   on  142618   136220   135901
  onoff 153482   146214   145839
  onon  138650   133074   132142

-- 
Thomas Munro
https://enterprisedb.com




Re: Passing CopyMultiInsertInfo structure to CopyMultiInsertInfoNextFreeSlot()

2019-05-16 Thread Alvaro Herrera
On 2019-May-14, Michael Paquier wrote:

> On Tue, May 14, 2019 at 01:19:30PM +1200, David Rowley wrote:
> > When I wrote the code I admit that I was probably wearing my
> > object-orientated programming hat. I had in mind that the whole
> > function series would have made a good class.  Passing the
> > CopyMultiInsertInfo was sort of the non-OOP equivalent to having
> > this/Me/self available, as it would be for any instance method of a
> > class.  Back to reality, this isn't OOP, so I was wearing the wrong
> > hat.  I think the unused parameter should likely be removed.  It's
> > probably not doing a great deal of harm since the function is static
> > inline and the compiler shouldn't be producing any code for the unused
> > param, but for the sake of preventing confusion, it should be removed.
> > Ashutosh had to ask about it, so it wasn't immediately clear what the
> > purpose of it was. Since there's none, be gone with it, I say.
> 
> Sounds fair to me.  This has been introduced by 86b8504, so let's see
> what's Andres take.

If this were up to me, I'd leave the function signature alone, and just add
(void) miinfo;  /* unused parameter */
to the function code.  It seems perfectly reasonable to have that
function argument, and a little weird not to have it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Adding a test for speculative insert abort case

2019-05-16 Thread Andres Freund
Hi,

On 2019-05-16 13:59:47 -0700, Melanie Plageman wrote:
> On Wed, May 15, 2019 at 10:32 PM Ashwin Agrawal  wrote:
> 
> >
> > The second index would help to hold the session after inserting the tuple
> > in unique index but before completing the speculative insert. Hence, helps
> > to create the condition easily. I believe order of index insertion is
> > helping here that unique index is inserted and then non-unique index is
> > inserted too.
> >
> >
> Oh, cool. I didn't know that execution order would be guaranteed for which
> index
> to insert into first.

It's not *strictly* speaking *always* well defined. The list of indexes
is sorted by the oid of the index - so once created, it's
consistent. But when the oid assignment wraps around, it'd be the other
way around. But I think it's ok to disregard that - it'll never happen
in regression tests run against a new cluster, and you'd have to run
tests against an installed cluster for a *LONG* time for a *tiny* window
where the wraparound would happen precisely between the creation of the
two indexes.

Makes sense?

I guess we could make that case a tiny bit easier to diagnose in the
extremely unlikely case it happens by having a step that outputs
SELECT 'index_a'::regclass::int8 < 'index_b'::regclass::int8;

Greetings,

Andres Freund




Re: Adding a test for speculative insert abort case

2019-05-16 Thread Melanie Plageman
On Wed, May 15, 2019 at 10:32 PM Ashwin Agrawal  wrote:

>
> The second index would help to hold the session after inserting the tuple
> in unique index but before completing the speculative insert. Hence, helps
> to create the condition easily. I believe order of index insertion is
> helping here that unique index is inserted and then non-unique index is
> inserted too.
>
>
Oh, cool. I didn't know that execution order would be guaranteed for which
index
to insert into first.


> Attaching patch with the test using the idea Andres mentioned and it works
> to excercise the speculative wait.
>
>
It looks good.
I thought it would be helpful to mention why you have s1 create the
non-unique
index after the permutation has begun. You don't want this index to
influence
the behavior of the other permutations--this part makes sense. However, why
have
s1 do it instead of the controller?

I added a couple suggested changes to the comments in the permutation in the
version in the patch I attached. Note that I did not update the answer
files.
(These suggested changes to comments are in distinct from and would be in
addition to the suggestions I had for the wording of the comments overall
in the
above email I sent).

-- 
Melanie Plageman


0002-Add-test-to-validate-speculative-wait-is-performed.patch
Description: Binary data


Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-16 Thread Peter Geoghegan
On Thu, May 16, 2019 at 1:05 PM Peter Geoghegan  wrote:
> Actually, now that I look back at how page deletion worked 5+ years
> ago, I realize that I have this slightly wrong: the leaf level check
> is not sufficient to figure out if the parent's right sibling is
> pending deletion (which is represented explicitly as a half-dead
> internal page prior to 9.4). All the same, I'm going to push ahead
> with this patch. Bugfix commit efada2b8e92 was always about a bug in
> 9.4 -- it had nothing to do with 9.3.

I meant bugfix commit 8da31837803 (commit efada2b8e92 was the commit
that had the bug in question).


-- 
Peter Geoghegan




Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-16 Thread Peter Geoghegan
On Mon, May 13, 2019 at 9:09 PM Peter Geoghegan  wrote:
> Even when that happens, the index is already considered corrupt by
> VACUUM, so the same VACUUM process that could in theory be adversely
> affected by removing the half-dead internal page check will complain
> about the page when it gets to it later on -- the user will be told to
> REINDEX. And even then, we will never actually get to apply the check
> that I propose to remove, since we're already checking the leaf page
> sibling of the leaf-level target -- the leaf-level test that was added
> by efada2b8e92 was clearly necessary. But it was also sufficient (no
> equivalent internal half-dead right sibling test is needed): a 9.3-era
> half-dead internal page cannot have more than one child, which must be
> undergoing deletion as well.

Actually, now that I look back at how page deletion worked 5+ years
ago, I realize that I have this slightly wrong: the leaf level check
is not sufficient to figure out if the parent's right sibling is
pending deletion (which is represented explicitly as a half-dead
internal page prior to 9.4). All the same, I'm going to push ahead
with this patch. Bugfix commit efada2b8e92 was always about a bug in
9.4 -- it had nothing to do with 9.3. And, in the unlikely event that
there is a problem on a pg_upgrade'd 9.3 -> 12 database that happens
to have half-dead internal pages, we'll still get a useful, correct
error from VACUUM one way or another. It might be slightly less
friendly as error messages about corruption go, but that seems
acceptable to me.

-- 
Peter Geoghegan




Re: VACUUM fails to parse 0 and 1 as boolean value

2019-05-16 Thread Robert Haas
On Thu, May 16, 2019 at 2:56 PM Fujii Masao  wrote:
> Yes. Thanks for the comment!
> Attached is the updated version of the patch.
> It adds such common rule.

I'm not sure how much value it really has to define
opt_boolean_or_string_or_numeric.  It saves 1 line of code in each of
3 places, but costs 6 lines of code to have it.

Perhaps we could try to unify at a higher level.  Like can we merge
vac_analyze_option_list with explain_option_list?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: VACUUM fails to parse 0 and 1 as boolean value

2019-05-16 Thread Fujii Masao
On Wed, May 15, 2019 at 2:52 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-05-15 02:45:21 +0900, Fujii Masao wrote:
> > VACUUM fails to parse 0 and 1 as boolean value
> >
> > The document for VACUUM explains
> >
> > boolean
> > Specifies whether the selected option should be turned on or off.
> > You can write TRUE, ON, or 1 to enable the option, and FALSE, OFF,
> > or 0 to disable it.
> >
> > But VACUUM fails to parse 0 and 1 as boolean value as follows.
> >
> > =# VACUUM (INDEX_CLEANUP 1);
> > ERROR:  syntax error at or near "1" at character 23
> > STATEMENT:  VACUUM (INDEX_CLEANUP 1);
> >
> > This looks a bug. The cause of this is a lack of NumericOnly clause
> > for vac_analyze_option_arg in gram.y. The attached patch
> > adds such NumericOnly. The bug exists only in 12dev.
> >
> > Barring any objection, I will commit the patch.
>
> Might be worth having a common rule for such options, so we don't
> duplicate the knowledge between different places.

Yes. Thanks for the comment!
Attached is the updated version of the patch.
It adds such common rule.

Regards,

-- 
Fujii Masao


vacuum_numeric_as_boolean_v2.patch
Description: Binary data


Re: PSA: New intel MDS vulnerability mitigations cause measurable slowdown

2019-05-16 Thread Chapman Flack
On 5/16/19 12:24 PM, Albert Cervera i Areny wrote:
> Missatge de Thomas Munro  del dia dj., 16 de
> maig 2019 a les 13:09:
>> With all three mitigations activated, my little dev machine has gone
>> from being able to do ~11.8 million baseline syscalls per second to
> 
> Did you mean "1.8"?

Not in what I thought I saw:

>> ~1.6 million, or ~1.4 million ...
>>
>>   PTI   IBRS  MDS=off  MDS=VERW MDS=AVX
>>   ===   ====  =======  ======== =======
>>   off   off   11798658  4764159  3274043
                 ^^^^^^^^
>>   off   on     2652564  1941606  1655356
>>   on    off    4973053  2932906  2339779
>>   on    on     1988527  1556922  1378798
                           ^^^^^^^  ^^^^^^^

-Chap




Re: pglz performance

2019-05-16 Thread Andrey Borodin


> On 15 May 2019, at 15:06, Andrey Borodin  wrote:
> 
> Owners of AMD and ARM devices are welcome.

Yandex hardware R&D guys gave me an ARM server and a Power9 server. They are
looking for AMD and some new Intel boxes.

Meanwhile I made some enhancements to the test suite:
1. I've added Shakespeare payload: concatenation of works of this prominent 
poet.
2. For each payload compute "sliced time" - time to decompress payload if it 
was sliced by 2Kb pieces or 8Kb pieces.
3. For each decompressor we compute "score": (sum of time to decompress each 
payload, each payload sliced by 2Kb and 8Kb) * 5 times

I've attached full test logs; meanwhile, here are results for different platforms.

Intel Server
NOTICE:  0: Decompressor pglz_decompress_hacked result 10.346763
NOTICE:  0: Decompressor pglz_decompress_hacked8 result 11.192078
NOTICE:  0: Decompressor pglz_decompress_hacked16 result 11.957727
NOTICE:  0: Decompressor pglz_decompress_vanilla result 14.262256

ARM Server
NOTICE:  Decompressor pglz_decompress_hacked result 12.98
NOTICE:  Decompressor pglz_decompress_hacked8 result 13.004935
NOTICE:  Decompressor pglz_decompress_hacked16 result 13.043015
NOTICE:  Decompressor pglz_decompress_vanilla result 18.239242

Power9 Server
NOTICE:  Decompressor pglz_decompress_hacked result 10.992974
NOTICE:  Decompressor pglz_decompress_hacked8 result 11.747443
NOTICE:  Decompressor pglz_decompress_hacked16 result 11.026342
NOTICE:  Decompressor pglz_decompress_vanilla result 16.375315

Intel laptop
NOTICE:  Decompressor pglz_decompress_hacked result 9.445808
NOTICE:  Decompressor pglz_decompress_hacked8 result 9.105360
NOTICE:  Decompressor pglz_decompress_hacked16 result 9.621833
NOTICE:  Decompressor pglz_decompress_vanilla result 10.661968

From these results pglz_decompress_hacked looks best.

Best regards, Andrey Borodin.

Intel Server

pgload01f/postgres M # select test_pglz();
NOTICE:  0: Time to decompress one byte in ns:
LOCATION:  test_pglz, test_pglz.c:249
NOTICE:  0: Payload 00010001
LOCATION:  test_pglz, test_pglz.c:252
NOTICE:  0: Decompressor pglz_decompress_hacked result 0.142841
LOCATION:  test_pglz, test_pglz.c:255
NOTICE:  0: Decompressor pglz_decompress_hacked8 result 0.136788
LOCATION:  test_pglz, test_pglz.c:255
NOTICE:  0: Decompressor pglz_decompress_hacked16 result 0.145459
LOCATION:  test_pglz, test_pglz.c:255
NOTICE:  0: Decompressor pglz_decompress_vanilla result 0.207186
LOCATION:  test_pglz, test_pglz.c:255
NOTICE:  0: Payload 00010001 sliced by 2Kb
LOCATION:  test_pglz, test_pglz.c:259
NOTICE:  0: Decompressor pglz_decompress_hacked result 0.747311
LOCATION:  test_pglz, test_pglz.c:262
NOTICE:  0: Decompressor pglz_decompress_hacked8 result 0.780535
LOCATION:  test_pglz, test_pglz.c:262
NOTICE:  0: Decompressor pglz_decompress_hacked16 result 0.826859
LOCATION:  test_pglz, test_pglz.c:262
NOTICE:  0: Decompressor pglz_decompress_vanilla result 1.074147
LOCATION:  test_pglz, test_pglz.c:262
NOTICE:  0: Payload 00010001 sliced by 8Kb
LOCATION:  test_pglz, test_pglz.c:266
NOTICE:  0: Decompressor pglz_decompress_hacked result 0.680256
LOCATION:  test_pglz, test_pglz.c:269
NOTICE:  0: Decompressor pglz_decompress_hacked8 result 0.746620
LOCATION:  test_pglz, test_pglz.c:269
NOTICE:  0: Decompressor pglz_decompress_hacked16 result 0.759602
LOCATION:  test_pglz, test_pglz.c:269
NOTICE:  0: Decompressor pglz_decompress_vanilla result 1.030290
LOCATION:  test_pglz, test_pglz.c:269
NOTICE:  0: Payload 00010006
LOCATION:  test_pglz, test_pglz.c:252
NOTICE:  0: Decompressor pglz_decompress_hacked result 0.040214
LOCATION:  test_pglz, test_pglz.c:255
NOTICE:  0: Decompressor pglz_decompress_hacked8 result 0.042281
LOCATION:  test_pglz, test_pglz.c:255
NOTICE:  0: Decompressor pglz_decompress_hacked16 result 0.042684
LOCATION:  test_pglz, test_pglz.c:255
NOTICE:  0: Decompressor pglz_decompress_vanilla result 0.123906
LOCATION:  test_pglz, test_pglz.c:255
NOTICE:  0: Payload 00010006 sliced by 2Kb
LOCATION:  test_pglz, test_pglz.c:259
NOTICE:  0: Decompressor pglz_decompress_hacked result 0.368333
LOCATION:  test_pglz, test_pglz.c:262
NOTICE:  0: Decompressor pglz_decompress_hacked8 result 0.367808
LOCATION:  test_pglz, test_pglz.c:262
NOTICE:  0: Decompressor pglz_decompress_hacked16 result 0.367856
LOCATION:  test_pglz, test_pglz.c:262
NOTICE:  0: Decompressor pglz_decompress_vanilla result 0.742352
LOCATION:  test_pglz, test_pglz.c:262
NOTICE:  0: Payload 00010006 sliced by 8Kb
LOCATION:  test_pglz, test_pglz.c:266
NOTICE:  0: Decompressor pglz_decompress_hacked result 0.276363
LOCATION:  test_pglz, test_pglz.c:269
NOTICE:  0: Decompressor pglz_decompress_hacked8 result 0.281572
LOCATION:  test_pglz, test_pglz.c:269
NOTICE:  0: Decompressor pglz_decompress_hacked16 result 

Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-05-16 Thread Tomas Vondra

On Thu, May 16, 2019 at 01:22:31PM +1200, Thomas Munro wrote:

...

Here's a quick hack to show that a 95% cut-off fixes those examples.
I don't really know how to choose the number, but I suspect it should
be much closer to 100 than 50.  I think this is the easiest of three
fundamental problems that need to be solved in this area.  The others
are: accounting for per-partition overheads as Tomas pointed out, and
providing an actual fallback strategy that respects work_mem when
extreme skew is detected OR per-partition overheads dominate.  I plan
to experiment with nested loop hash join (or whatever you want to call
it: the thing where you join every arbitrary fragment of the hash
table against the outer batch, and somehow deal with outer match
flags) when time permits.



I think this is a step in the right direction, but as I said on the other
thread(s), I think we should not disable growth forever and recheck once
in a while. Otherwise we'll end up in sad situation with non-uniform data
sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
this less strict logic, using 95% as a threshold (instead of 100%).

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.

The problem however is that we only really look at a single bit - it may
be that doubling the batches would not help, but doing it twice would
actually reduce the memory usage. For example, assume there are 2 distinct
values in the batch, with hash values (in binary)

 101010000
 101010111

and assume we are currently in batch 101010. Clearly, splitting batches is going to do nothing
until we get to the 000 vs. 111 parts.

At first I thought this is rather unlikely and we can ignore it, but I'm
not really sure about that - it may actually be pretty likely. We may get
to the 101010 bucket with a sufficiently large data set, and then there's a
~50% probability that the next bit is the same (assuming two distinct
values). So this may be quite an issue, I think.

regards


[1] 
https://www.postgresql.org/message-id/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: PSA: New intel MDS vulnerability mitigations cause measurable slowdown

2019-05-16 Thread Albert Cervera i Areny
Message from Thomas Munro  on Thu, 16 May
2019 at 13:09:
>
> On Wed, May 15, 2019 at 1:13 PM Andres Freund  wrote:
> > > I've run a quick pgbench benchmark:
> > >
> > > *Without* disabling SMT, for readonly pgbench, I'm seeing regressions
> > > between 7-11%, depending on the size of shared_buffers (and some runtime
> > > variations).  That's just on my laptop, with an i7-6820HQ / Haswell CPU.
> > > I'd be surprised if there weren't adversarial loads with bigger
> > > slowdowns - what gets more expensive with the mitigations is syscalls.
>
> This stuff landed in my FreeBSD 13.0-CURRENT kernel, so I was curious
> to measure it with and without the earlier mitigations.  On my humble
> i7-8550U laptop with the new 1.22 microcode installed, with my usual
> settings of PTI=on and IBRS=off, so far MDS=VERW gives me ~1.5% loss
> of TPS with a single client, up to 4.3% loss of TPS for 16 clients,
> but it didn't go higher when I tried 32 clients.  This was a tiny
> scale 10 database, though in a quick test it didn't look like it was
> worse with scale 100.
>
> With all three mitigations activated, my little dev machine has gone
> from being able to do ~11.8 million baseline syscalls per second to

Did you mean "1.8"?

> ~1.6 million, or ~1.4 million with the AVX variant of the mitigation.
>
> Raw getuid() syscalls per second:
>
>   PTI   IBRS  MDS=off   MDS=VERW  MDS=AVX
>   ===   ====  ========  ========  ========
>   off   off   11798658  4764159   3274043
>   off   on    2652564   1941606   1655356
>   on    off   4973053   2932906   2339779
>   on    on    1988527   1556922   1378798
>
> pgbench read-only transactions per second, 1 client thread:
>
>   PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
>   ===   ====  =======  ========  =======
>   off   off   19393    18949     18615
>   off   on    17946    17586     17323
>   on    off   19381    19015     18696
>   on    on    18045    17709     17418
>
> pgbench -M prepared read-only transactions per second, 1 client thread:
>
>   PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
>   ===   ====  =======  ========  =======
>   off   off   35020    34049     33200
>   off   on    31658    30902     30229
>   on    off   35445    34353     33415
>   on    on    32415    31599     30712
>
> pgbench -M prepared read-only transactions per second, 4 client threads:
>
>   PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
>   ===   ====  =======  ========  =======
>   off   off   79515    76898     76465
>   off   on    63608    62220     61952
>   on    off   77863    75431     74847
>   on    on    62709    60790     60575
>
> pgbench -M prepared read-only transactions per second, 16 client threads:
>
>   PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
>   ===   ====  =======  ========  =======
>   off   off   125984   121164    120468
>   off   on    112884   108346    107984
>   on    off   121032   116156    115462
>   on    on    108889   104636    104027
>
> time gmake -s check:
>
>   PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
>   ===   ====  =======  ========  =======
>   off   off   16.78    16.85     17.03
>   off   on    18.19    18.81     19.08
>   on    off   16.67    16.86     17.33
>   on    on    18.58    18.83     18.99
>
> --
> Thomas Munro
> https://enterprisedb.com
>
>


-- 
Albert Cervera i Areny
http://www.NaN-tic.com
Tel. 93 553 18 03




How do we support FULL JOIN on PostGIS types?

2019-05-16 Thread Komяpa
Hi!

Greetings from OSGeo Code sprint in Minneapolis :)

We're trying to do a FULL JOIN on equality of geometry and can't figure out
why it doesn't work.

Here's a reproducer; it works on bytea but not on PostGIS geometry, throwing
out

ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable
join conditions

https://trac.osgeo.org/postgis/ticket/4394
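
For reference, the failing case boils down to something like this sketch (the
table and column names here are made up; the real script is attached to the
ticket above):

CREATE TABLE a (g geometry);
CREATE TABLE b (g geometry);

-- works when the join column is bytea, but with geometry it fails with
-- "FULL JOIN is only supported with merge-joinable or hash-joinable
--  join conditions"
SELECT * FROM a FULL JOIN b ON a.g = b.g;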

We already have a btree opclass with equality:
https://github.com/postgis/postgis/blob/svn-trunk/postgis/postgis.sql.in#L420


We also have a hash equality opclass:
https://github.com/postgis/postgis/blob/svn-trunk/postgis/postgis.sql.in#L440


Reading through the Postgres documentation, I can't figure out what else we
should do for this join to work. How do we make it work?

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: dropdb --force

2019-05-16 Thread Anthony Nowocien
Also works fine according to my testing. Documentation is also clear.
Thanks for this useful patch.

RE: psql - add SHOW_ALL_RESULTS option

2019-05-16 Thread Fabien COELHO


Re-bonjour Daniel,


This attached version does:
 - ensure that warnings appear just before its
 - add the entry in psql's help
 - redefine the function boundary so that timing is cleaner
 - include somewhat improved tests


\errverbose seems to no longer work with the patch:


Here is a v3 which fixes \errverbose, hopefully.

The feature is still an option which is not enabled by default.
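
For illustration, with the variable enabled a combined query would behave
along these lines (hand-written example, not pasted from an actual run):

test=> \set SHOW_ALL_RESULTS on
test=> SELECT 1 AS one \; SELECT 2 AS two;
 one
-----
   1
(1 row)

 two
-----
   2
(1 row)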

--
Fabien.
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index b86764003d..de2b56cf88 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3918,6 +3918,17 @@ bar
         </listitem>
       </varlistentry>
 
+      <varlistentry>
+        <term><varname>SHOW_ALL_RESULTS</varname></term>
+        <listitem>
+        <para>
+        When this variable is set to <literal>on</literal>, all results of a combined
+        (<literal>\;</literal>) query are shown instead of just the last one.
+        Default is <literal>off</literal>.
+        </para>
+        </listitem>
+      </varlistentry>
+
       <varlistentry>
         <term><varname>SHOW_CONTEXT</varname></term>
         <listitem>
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 9579e10630..d35ab91995 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -472,6 +472,16 @@ ResetCancelConn(void)
 #endif
 }
 
+static void
+ShowErrorMessage(const PGresult *result)
+{
+	const char *error = PQerrorMessage(pset.db);
+
+	if (strlen(error))
+		pg_log_info("%s", error);
+
+	CheckConnection();
+}
 
 /*
  * AcceptResult
@@ -482,7 +492,7 @@ ResetCancelConn(void)
  * Returns true for valid result, false for error state.
  */
 static bool
-AcceptResult(const PGresult *result)
+AcceptResult(const PGresult *result, bool show_error)
 {
 	bool		OK;
 
@@ -513,15 +523,8 @@ AcceptResult(const PGresult *result)
 break;
 		}
 
-	if (!OK)
-	{
-		const char *error = PQerrorMessage(pset.db);
-
-		if (strlen(error))
-			pg_log_info("%s", error);
-
-		CheckConnection();
-	}
+	if (!OK && show_error)
+		ShowErrorMessage(result);
 
 	return OK;
 }
@@ -701,7 +704,7 @@ PSQLexec(const char *query)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		res = NULL;
@@ -743,7 +746,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		return 0;
@@ -999,199 +1002,113 @@ loop_exit:
 	return success;
 }
 
-
 /*
- * ProcessResult: utility function for use by SendQuery() only
- *
- * When our command string contained a COPY FROM STDIN or COPY TO STDOUT,
- * PQexec() has stopped at the PGresult associated with the first such
- * command.  In that event, we'll marshal data for the COPY and then cycle
- * through any subsequent PGresult objects.
- *
- * When the command string contained no such COPY command, this function
- * degenerates to an AcceptResult() call.
- *
- * Changes its argument to point to the last PGresult of the command string,
- * or NULL if that result was for a COPY TO STDOUT.  (Returning NULL prevents
- * the command status from being printed, which we want in that case so that
- * the status line doesn't get taken as part of the COPY data.)
- *
- * Returns true on complete success, false otherwise.  Possible failure modes
- * include purely client-side problems; check the transaction status for the
- * server-side opinion.
+ * Marshal the COPY data.  Either subroutine will get the
+ * connection out of its COPY state, then call PQresultStatus()
+ * once and report any error.
+ *
+ * For COPY OUT, direct the output to pset.copyStream if it's set,
+ * otherwise to pset.gfname if it's set, otherwise to queryFout.
+ * For COPY IN, use pset.copyStream as data source if it's set,
+ * otherwise cur_cmd_source.
+ *
+ * Update result if further processing is necessary.  (Returning NULL
+ * prevents the command status from being printed, which we want in that
+ * case so that the status line doesn't get taken as part of the COPY data.)
  */
 static bool
-ProcessResult(PGresult **results)
+HandleCopyResult(PGresult **result)
 {
 	bool		success = true;
-	bool		first_cycle = true;
+	FILE	   *copystream;
+	PGresult   *copy_result;
+	ExecStatusType result_status = PQresultStatus(*result);
 
-	for (;;)
+	Assert(result_status == PGRES_COPY_OUT ||
+		   result_status == PGRES_COPY_IN);
+
+	SetCancelConn();
+	if (result_status == PGRES_COPY_OUT)
 	{
-		ExecStatusType result_status;
-		bool		is_copy;
-		PGresult   *next_result;
+		bool		need_close = false;
+		bool		is_pipe = false;
 
-		if (!AcceptResult(*results))
+		if (pset.copyStream)
 		{
-			/*
-			 * Failure at this point is always a server-side failure or a
-			 * failure to submit the command string.  Either way, we're
-			 * finished with this command string.
-			 */
-			success = false;
-			break;
+			/* invoked by \copy */
+			copystream = pset.copyStream;
 		}
-
-		result_status = PQresultStatus(*results);
-		switch (result_status)
+		else if (pset.gfname)
 		{
-			case PGRES_EMPTY_QUERY:
-			case PGRES_COMMAND_OK:
-			case PGRES_TUPLES_OK:
-is_copy = false;
-break;
-
-			case 

Re: Multivariate MCV stats can leak data to unprivileged users

2019-05-16 Thread Tomas Vondra

On Thu, May 16, 2019 at 02:28:03PM +0100, Dean Rasheed wrote:

On Mon, 13 May 2019 at 23:36, Tomas Vondra  wrote:


Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats.
It would:

(1) translate the schema / relation / attribute names

  I don't see why translating column indexes to names would be fiddly.
  It's a matter of simple unnest + join, no? Or what issues you see?



Yeah, you're right. I thought it would be harder than that. One minor
thing -- I think we should have an explicit ORDER BY where we collect
the column names, to guarantee that they're listed in the right order.



Good point.


(2) include values for ndistinct / dependencies, if built

  Those don't include any actual values, so this should be OK. You might
  make the argument that even this does leak a bit of information (e.g.
  when you can see values in one column, and you know there's a strong
  functional dependence, it tells you something about the other column
  which you may not see). But I think we kinda already leak information
  about that through estimates, so maybe that's not an issue.



Hmm. For normal statistics, if the user has no permissions on the
table, they don't get to see any of these kinds of statistics, not
even things like n_distinct. I think we should do the same here --
i.e., if the user has no permissions on the table, don't let them see
anything. Such a user will not be able to run EXPLAIN on queries
against the table, so they won't get to see any estimates, and I don't
think they should get to see any extended statistics either.



OK, I hadn't realized we don't show that even for normal stats.


(3) include MCV list only when user has access to all columns

  Essentially, if the user is missing 'select' privilege on at least one
  of the columns, there'll be NULL. Otherwise the MCV value.



OK, that seems reasonable, except as I said above, I think that should
apply to all statistics, not just the MCV lists.


The attached patch adds pg_stats_ext doing this. I don't claim it's the
best possible query backing the view, so any improvements are welcome.


I've been thinking we might somehow filter the statistics values, e.g. by
not showing values for attributes the user has no 'select' privilege on,
but that seems like a can of worms. It'd lead to MCV items that can't be
distinguished because the only difference was the removed attribute, and
so on. So I propose we simply show/hide the whole MCV list.



Agreed.


Likewise, I don't think it makes sense to expand the MCV list in this
view - that works for the single-dimensional case, because then the
list is expanded into two arrays (values + frequencies), which are easy
to process further. But for multivariate MCV lists that's much more
complex - we don't know how many attributes are there, for example.

So I suggest we just show the pg_mcv_list value as is, and leave it up
to the user to call the pg_mcv_list_items() function if needed.



I think expanding the MCV lists is actually quite useful because then
you can see arrays of values, nulls, frequencies and base frequencies
in a reasonably readable form (it certainly looks better than a binary
dump), without needing to join to a function call, which is a bit
ugly, and unmemorable.



Hmmm, ok. I think my main worry here is that it may or may not work for
more complex types of extended stats that are likely to come in the
future. Although, maybe it can be made to work even for that.


The results from the attached look quite reasonable at first glance.
It contains a few other changes as well:

1). It exposes the schema, name and owner of the statistics object as
well via the view, for completeness.

2). It changes a few column names -- traditionally these views strip
off the column name prefix from the underlying table, so I've
attempted to be consistent with other similar views.

3). I added array-valued columns for each of the MCV list components,
which makes it more like pg_stats.

4). I moved all the permission checks to the top-level WHERE clause,
so a user needs to have select permissions on all the columns
mentioned by the statistics, and the table mustn't have RLS in effect,
otherwise the user won't see the row for that statistics object.

5). Some columns from pg_statistic_ext have to be made visible for
psql \d to work. Basically, it needs to be able to query for the
existence of extended statistics, but it doesn't need to see the
actual statistical data. Of course, we could change psql to use the
view, but this way gives us better backwards compatibility with older
clients.

This is still going to break compatibility of any user code looking at
stxndistinct or stxdependencies from pg_statistic_ext, but at least it
doesn't break old versions of psql.

Note: doc and test updates still to do.



Thanks. I'm travelling today/tomorrow, but I'll do my best to fill in the
missing bits ASAP.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multivariate MCV stats can leak data to unprivileged users

2019-05-16 Thread Dean Rasheed
On Mon, 13 May 2019 at 23:36, Tomas Vondra  wrote:
>
> Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats.
> It would:
>
> (1) translate the schema / relation / attribute names
>
>   I don't see why translating column indexes to names would be fiddly.
>   It's a matter of simple unnest + join, no? Or what issues you see?
>

Yeah, you're right. I thought it would be harder than that. One minor
thing -- I think we should have an explicit ORDER BY where we collect
the column names, to guarantee that they're listed in the right order.

> (2) include values for ndistinct / dependencies, if built
>
>   Those don't include any actual values, so this should be OK. You might
>   make the argument that even this does leak a bit of information (e.g.
>   when you can see values in one column, and you know there's a strong
>   functional dependence, it tells you something about the other column
>   which you may not see). But I think we kinda already leak information
>   about that through estimates, so maybe that's not an issue.
>

Hmm. For normal statistics, if the user has no permissions on the
table, they don't get to see any of these kinds of statistics, not
even things like n_distinct. I think we should do the same here --
i.e., if the user has no permissions on the table, don't let them see
anything. Such a user will not be able to run EXPLAIN on queries
against the table, so they won't get to see any estimates, and I don't
think they should get to see any extended statistics either.

> (3) include MCV list only when user has access to all columns
>
>   Essentially, if the user is missing 'select' privilege on at least one
>   of the columns, there'll be NULL. Otherwise the MCV value.
>

OK, that seems reasonable, except as I said above, I think that should
apply to all statistics, not just the MCV lists.
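
For concreteness, the kind of per-column check I have in mind for the view's
top-level WHERE clause is roughly the sketch below (not necessarily the exact
SQL in the attached patch):

-- List only the statistics objects whose table has no RLS in effect and
-- whose covered columns are all readable by the current user.
SELECT s.oid, s.stxname
  FROM pg_statistic_ext s
 WHERE NOT EXISTS (SELECT 1
                     FROM unnest(s.stxkeys) k
                     JOIN pg_attribute a
                       ON a.attrelid = s.stxrelid AND a.attnum = k
                    WHERE NOT has_column_privilege(s.stxrelid, a.attnum, 'select'))
   AND NOT row_security_active(s.stxrelid);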

> The attached patch adds pg_stats_ext doing this. I don't claim it's the
> best possible query backing the view, so any improvements are welcome.
>
>
> I've been thinking we might somehow filter the statistics values, e.g. by
> not showing values for attributes the user has no 'select' privilege on,
> but that seems like a can of worms. It'd lead to MCV items that can't be
> distinguished because the only difference was the removed attribute, and
> so on. So I propose we simply show/hide the whole MCV list.
>

Agreed.

> Likewise, I don't think it makes sense to expand the MCV list in this
> view - that works for the single-dimensional case, because then the
> list is expanded into two arrays (values + frequencies), which are easy
> to process further. But for multivariate MCV lists that's much more
> complex - we don't know how many attributes are there, for example.
>
> So I suggest we just show the pg_mcv_list value as is, and leave it up
> to the user to call the pg_mcv_list_items() function if needed.
>

I think expanding the MCV lists is actually quite useful because then
you can see arrays of values, nulls, frequencies and base frequencies
in a reasonably readable form (it certainly looks better than a binary
dump), without needing to join to a function call, which is a bit
ugly, and unmemorable.
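
(For context, the sort of query the view saves people from writing is roughly
the following - a sketch assuming the function and column names currently in
the tree, pg_mcv_list_items() and pg_statistic_ext.stxmcv, and an example
statistics object named 'stts':)

SELECT s.stxname, m.index, m.values, m.nulls, m.frequency, m.base_frequency
  FROM pg_statistic_ext s,
       pg_mcv_list_items(s.stxmcv) m
 WHERE s.stxname = 'stts';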

The results from the attached look quite reasonable at first glance.
It contains a few other changes as well:

1). It exposes the schema, name and owner of the statistics object as
well via the view, for completeness.

2). It changes a few column names -- traditionally these views strip
off the column name prefix from the underlying table, so I've
attempted to be consistent with other similar views.

3). I added array-valued columns for each of the MCV list components,
which makes it more like pg_stats.

4). I moved all the permission checks to the top-level WHERE clause,
so a user needs to have select permissions on all the columns
mentioned by the statistics, and the table mustn't have RLS in effect,
otherwise the user won't see the row for that statistics object.

5). Some columns from pg_statistic_ext have to be made visible for
psql \d to work. Basically, it needs to be able to query for the
existence of extended statistics, but it doesn't need to see the
actual statistical data. Of course, we could change psql to use the
view, but this way gives us better backwards compatibility with older
clients.

This is still going to break compatibility of any user code looking at
stxndistinct or stxdependencies from pg_statistic_ext, but at least it
doesn't break old versions of psql.

Note: doc and test updates still to do.

Regards,
Dean
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
new file mode 100644
index 566100d..3227d71
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -253,6 +253,46 @@ CREATE VIEW pg_stats WITH (security_barr
 
 REVOKE ALL on pg_statistic FROM public;
 
+CREATE VIEW pg_stats_ext WITH (security_barrier) AS
+SELECT cn.nspname AS schemaname,
+   c.relname AS tablename,
+ 

Re: Parallel Foreign Scans - need advice

2019-05-16 Thread Korry Douglas


> That's only a superficial problem.  You don't even know if or when the
> workers that are launched will all finish up running your particular
> node, because (for example) they might be sent to different children
> of a Parallel Append node above you (AFAICS there is no way for a
> participant to indicate "I've finished all the work allocated to me,
> but I happen to know that some other worker #3 is needed here" -- as
> soon as any participant reports that it has executed the plan to
> completion, pa_finished[] will prevent new workers from picking that
> node to execute).  Suppose we made a rule that *every* worker must
> visit *every* partial child of a Parallel Append and run it to
> completion (and any similar node in the future must do the same)...
> then I think there is still a higher level design problem: if you do
> allocate work up front rather than on demand, then work could be
> unevenly distributed, and parallel query would be weakened.

What I really need (for the scheme I’m using at the moment) is to know how many 
workers will be used to execute my particular Plan.  I understand that some 
workers will naturally end up idle while the last (busy) worker finishes up.  
I’m dividing the workload (the number of row groups to scan) by the number of 
workers to get an even distribution.   

I’m willing to pay that price (at least, I haven’t seen a problem so far… 
famous last words)

I do plan to switch over to get-next-chunk allocator as you mentioned below, 
but I’d like to get the minimized-seek mechanism working first.
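
For what it's worth, the get-next-chunk shape I have in mind is roughly the
sketch below (invented names; it assumes the shared state lives in the DSM
segment set up by the FDW's parallel-scan callbacks, e.g.
InitializeDSMForeignScan()):

#include "postgres.h"
#include "port/atomics.h"

/* Hypothetical shared state for handing out row groups on demand. */
typedef struct ScanChunkState
{
	pg_atomic_uint64 next_rowgroup;	/* next row group to hand out */
	uint64		nrowgroups;			/* total row groups in the file */
} ScanChunkState;

/* Called by the leader and workers alike; each claim is one row group. */
static bool
claim_next_rowgroup(ScanChunkState *cstate, uint64 *rowgroup)
{
	uint64		rg = pg_atomic_fetch_add_u64(&cstate->next_rowgroup, 1);

	if (rg >= cstate->nrowgroups)
		return false;			/* no more work for this participant */

	*rowgroup = rg;
	return true;
}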

It sounds like there is no reliable way to get the information that I’m looking 
for, is that right?

> So I think you ideally need a simple get-next-chunk work allocator
> (like Parallel Seq Scan and like the file_fdw patch I posted[1]), or a
> pass-the-baton work allocator when there is a dependency between
> chunks (like Parallel Index Scan for btrees), or a more complicated
> multi-phase system that counts participants arriving and joining in
> (like Parallel Hash) so that participants can coordinate and wait for
> each other in controlled circumstances.

I haven’t looked at Parallel Hash - will try to understand that next.

> If this compressed data doesn't have natural chunks designed for this
> purpose (like, say, ORC stripes), perhaps you could have a dedicated
> workers streaming data (compressed? decompressed?) into shared memory,
> and parallel query participants coordinating to consume chunks of
> that?


I’ll give that some thought.  Thanks for the ideas.

— Korry





Re: PSA: New intel MDS vulnerability mitigations cause measurable slowdown

2019-05-16 Thread Thomas Munro
On Wed, May 15, 2019 at 1:13 PM Andres Freund  wrote:
> > I've run a quick pgbench benchmark:
> >
> > *Without* disabling SMT, for readonly pgbench, I'm seeing regressions
> > between 7-11%, depending on the size of shared_buffers (and some runtime
> > variations).  That's just on my laptop, with an i7-6820HQ / Haswell CPU.
> > I'd be surprised if there weren't adversarial loads with bigger
> > slowdowns - what gets more expensive with the mitigations is syscalls.

This stuff landed in my FreeBSD 13.0-CURRENT kernel, so I was curious
to measure it with and without the earlier mitigations.  On my humble
i7-8550U laptop with the new 1.22 microcode installed, with my usual
settings of PTI=on and IBRS=off, so far MDS=VERW gives me ~1.5% loss
of TPS with a single client, up to 4.3% loss of TPS for 16 clients,
but it didn't go higher when I tried 32 clients.  This was a tiny
scale 10 database, though in a quick test it didn't look like it was
worse with scale 100.

With all three mitigations activated, my little dev machine has gone
from being able to do ~11.8 million baseline syscalls per second to
~1.6 million, or ~1.4 million with the AVX variant of the mitigation.

Raw getuid() syscalls per second:

  PTI   IBRS  MDS=off   MDS=VERW  MDS=AVX
  ===   ====  ========  ========  ========
  off   off   11798658  4764159   3274043
  off   on    2652564   1941606   1655356
  on    off   4973053   2932906   2339779
  on    on    1988527   1556922   1378798

pgbench read-only transactions per second, 1 client thread:

  PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
  ===   ====  =======  ========  =======
  off   off   19393    18949     18615
  off   on    17946    17586     17323
  on    off   19381    19015     18696
  on    on    18045    17709     17418

pgbench -M prepared read-only transactions per second, 1 client thread:

  PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
  ===   ====  =======  ========  =======
  off   off   35020    34049     33200
  off   on    31658    30902     30229
  on    off   35445    34353     33415
  on    on    32415    31599     30712

pgbench -M prepared read-only transactions per second, 4 client threads:

  PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
  ===   ====  =======  ========  =======
  off   off   79515    76898     76465
  off   on    63608    62220     61952
  on    off   77863    75431     74847
  on    on    62709    60790     60575

pgbench -M prepared read-only transactions per second, 16 client threads:

  PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
  ===   ====  =======  ========  =======
  off   off   125984   121164    120468
  off   on    112884   108346    107984
  on    off   121032   116156    115462
  on    on    108889   104636    104027

time gmake -s check:

  PTI   IBRS  MDS=off  MDS=VERW  MDS=AVX
  ===   ====  =======  ========  =======
  off   off   16.78    16.85     17.03
  off   on    18.19    18.81     19.08
  on    off   16.67    16.86     17.33
  on    on    18.58    18.83     18.99

-- 
Thomas Munro
https://enterprisedb.com




Re: Replace hashtable growEnable flag

2019-05-16 Thread Hubert Zhang
Thanks Tomas.
I will follow up on this problem in your thread. This thread can be closed.

On Thu, May 16, 2019 at 3:58 AM Tomas Vondra 
wrote:

> On Wed, May 15, 2019 at 06:19:38PM +0800, Hubert Zhang wrote:
> >Hi all,
> >
> >When we build a hash table for a hash join node etc., we split tuples into
> >different hash buckets. Since the tuples cannot all be held in memory,
> >Postgres splits each bucket into batches; only the current batch of a bucket
> >is in memory while the other batches are written to disk.
> >
> >During ExecHashTableInsert(), if the memory cost exceeds the operator's
> >allowed limit (hashtable->spaceAllowed), batches will be split on the fly by
> >calling ExecHashIncreaseNumBatches().
> >
> >In the past, if data is distributed unevenly, the batch split may fail (all
> >the tuples fall into one split batch and the other batch is empty). Then
> >Postgres will set hashtable->growEnable to false and never expand the batch
> >number any more.
> >
> >If tuples become diverse in the future, splitting the batch is still
> >valuable and could avoid the current batch becoming too big and finally
> >hitting OOM.
> >
> >To fix this, we introduce a penalty on hashtable->spaceAllowed, which is
> >the threshold to determine whether to increase batch number.
> >If the batch split failed, we increase the penalty instead of just turning
> >off the growEnable flag.
> >
> >Any comments?
> >
>
> There's already another thread discussing various issues with how hashjoin
> increases the number of batches, including various issues with how/when we
> disable adding more batches.
>
> https://commitfest.postgresql.org/23/2109/
>
> In general I think you're right something like this is necessary, but I
> think we may need to rethink growEnable a bit more.
>
> For example, the way you implemented it, after reaching the increased
> limit, we just increase the number of batches just like today, and then
> decide whether it actually helped. But that means we double the number of
> BufFile entries, which uses more and more memory (because each is 8kB and
> we need 1 per batch). I think in this case (after increasing the limit) we
> should check whether increasing batches makes sense or not. And only do it
> if it helps. Otherwise we'll double the amount of memory for BufFile(s)
> and also the work_mem. That's not a good idea.
>
> But as I said, there are other issues discussed on the other thread. For
> example we only disable the growth when all rows fall into the same batch.
> But that's overly strict.
>
>
> regards
>
> --
> Tomas Vondra
> http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>

-- 
Thanks

Hubert Zhang


RE: psql - add SHOW_ALL_RESULTS option

2019-05-16 Thread Fabien COELHO



Bonjour Daniel,


IMHO this new setting should be on by default: few people know about \; so
it would not change anything for most, and I do not see why those who use it
would not be interested in the results of all the queries they asked for.

I agree with your opinion.


Ok. I did not yet change the default in the attached version, though.


I'd go further and suggest that there shouldn't be a variable
controlling this. All results that come in should be processed, period.
It's not just about \; If the ability of CALL to produce multiple
resultsets gets implemented (it was posted as a POC during v11
development), this will be needed too.


I do agree, but I'm afraid that if there is no opt-out it could be seen as 
a regression by some.



This attached version does:
 - ensure that warnings appear just before its
 - add the entry in psql's help
 - redefine the function boundary so that timing is cleaner
 - include somewhat improved tests


\errverbose seems to no longer work with the patch:

test=> select 1/0;
psql: ERROR:  division by zero

test=> \errverbose
There is no previous error.

as opposed to this output with PG11:

test=> \errverbose
ERROR:  22012: division by zero
LOCATION:  int4div, int.c:820


Thanks for the catch. I'll investigate.

\errverbose has probably no regression tests because its output includes 
these ever-changing line numbers; hence `make check` cannot be used to 
find this regression.


What is not tested does not work:-( The TAP infrastructure for psql 
included in some patch (https://commitfest.postgresql.org/23/2100/ I 
guess) would help testing such slightly varying features which cannot be 
tested with a hardcoded reference text.


--
Fabien.




Re: New EXPLAIN option: ALL

2019-05-16 Thread Michael Paquier
On Wed, May 15, 2019 at 10:46:39AM -0400, Tom Lane wrote:
> Andres Freund  writes:
>> On May 15, 2019 7:20:34 AM PDT, David Fetter  wrote:
>>> Indeed. I think we should move our regression tests to TAP and
>>> dispense with this.
> 
>> -inconceivably much
> 
> Yeah, that's not happening.

+1 to the we-shall-not-move part.
--
Michael




Re: Pluggable Storage - Andres's take

2019-05-16 Thread Ashwin Agrawal
On Wed, May 15, 2019 at 11:54 AM Andres Freund  wrote:

> Hi,
>
> On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
>
> > > 3) nodeTidscan, skipping over too large tids
> > >I think this should just be moved into the AMs, there's no need to
> > >have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> >   /*
> >* We silently discard any TIDs that are out of range at the time
> of scan
> >* start.  (Since we hold at least AccessShareLock on the table,
> it won't
> >* be possible for someone to truncate away the blocks we intend to
> >* visit.)
> >*/
> >   nblocks =
> RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
> >
> >
> > The obvious answer would be to just move that check into the
> > table_fetch_row_version implementation (currently just calling
> > heap_fetch()) - but that doesn't seem OK from a performance POV, because
> > we'd then determine the relation size once for each tid, rather than
> > once per tidscan.  And it'd also check in cases where we know the tid is
> > supposed to be valid (e.g. fetching trigger tuples and such).
> >
> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter.  But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
>
> Attached is a prototype of a variation of this. I added a
> table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
> callback / wrapper. Currently it just takes a "plain" scan, but we could
> add a separate table_beginscan variant too.
>
> For heap that just means we can just use HeapScanDesc's rs_nblock to
> filter out invalid tids, and we only need to call
> RelationGetNumberOfBlocks() once, rather than every
> table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
> improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
> CURRENT OF) - which previously computed the relation size once per
> tuple.
>
> Needs a bit of polishing, but I think this is the right direction?
>

At a high level this looks good to me. I will look into the full details
tomorrow. This aligns with the high-level thought I had, but implemented in a
much better way: consult the AM on whether to perform the optimization or not.
So now, using the new table_tuple_tid_valid() callback, the AM can either
implement some way to validate the TID and optimize the scan, or, if it has
no way to check based on the scan descriptor, decide to always return true
and let table_fetch_row_version() handle things.
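
For my own understanding, the heap side of this presumably ends up looking
something like the sketch below (approximate names based on the description
above, not the actual patch):

/*
 * Rough sketch of a heap-style table_tuple_tid_valid() implementation:
 * reuse the block count already cached in the scan descriptor instead of
 * calling RelationGetNumberOfBlocks() for every TID.
 */
static bool
heapam_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
{
	HeapScanDesc scan = (HeapScanDesc) sscan;

	return ItemPointerIsValid(tid) &&
		   ItemPointerGetBlockNumber(tid) < scan->rs_nblocks;
}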