Re: [HACKERS] path toward faster partition pruning

2018-02-19 Thread Rajkumar Raghuwanshi
On Tue, Feb 20, 2018 at 12:34 PM, Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:

> Attached updated version.
>

Hi Amit,

I have applied the v29 patch set on head and got "ERROR:  operator 1209 is not
a member of opfamily 1994" with the test case below. Please take a look.

CREATE TABLE part (c1 INT4, c2 TEXT, c3 INT4) PARTITION BY LIST (c2);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES IN('ABC');
CREATE TABLE part_p2 PARTITION OF part FOR VALUES IN('DEF');
CREATE TABLE part_p3 PARTITION OF part FOR VALUES IN('GHI');
CREATE TABLE part_p4 PARTITION OF part FOR VALUES IN('JKL');

INSERT INTO part VALUES (100,'ABC',10);
INSERT INTO part VALUES (110,'DEF',20);
INSERT INTO part VALUES (120,'GHI',10);
INSERT INTO part VALUES (130,'JKL',100);

explain (costs off) SELECT * FROM part WHERE c2 LIKE '%ABC%';
*ERROR:  operator 1209 is not a member of opfamily 1994*
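
(As a side note for anyone reproducing this: the operator and opfamily OIDs in
the error message can be decoded with catalog lookups such as the following;
this is only a lookup aid, not part of the report.)

SELECT oid, oprname, oprleft::regtype, oprright::regtype
FROM pg_operator WHERE oid = 1209;

SELECT f.opfname, a.amname
FROM pg_opfamily f JOIN pg_am a ON a.oid = f.opfmethod
WHERE f.oid = 1994;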

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: ALTER TABLE ADD COLUMN fast default

2018-02-19 Thread Andrew Dunstan
On Tue, Feb 20, 2018 at 5:03 PM, Andrew Dunstan
 wrote:
> http://www.publix.com/myaccount/verify?validationCode=889fd4cb-6dbb-4f93-9d4f-c701410cd8a2


Wow, my c&p was working overtime. Good thing this won't do anyone any good.

cheers

andrew


-- 
Andrew Dunstan                 https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-02-19 Thread Andrey Borodin
Hi!

I've played with the patch. I observe that in some expected scenarios it reduces
buffer reads significantly.
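
For context, a query roughly of this shape is where the reduction shows up in the
"Buffers" counts (a sketch; t and a are placeholder names, with an index assumed
on a and an OFFSET large enough to skip many rows):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t ORDER BY a OFFSET 1000 LIMIT 10;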

> On 14 Feb 2018, at 00:04, Michail Nikolaev  
> wrote:
> Patch updated + rebased on master. check-world is passing.

Minor spots:
There is some trailing whitespace at line ends.
> Offset cannot be optimized because parallel execution
I'd replace with 
> Offset cannot be optimized [in spite of | due to] parallel execution

A more important point: nodeindexonlyscan.c and nodeindexscan.c now share more 
similar code and comments. I do not think that is strictly necessary to avoid, 
but we certainly have to answer the question:
Is it possible to refactor the code to avoid the duplication?
Currently the patch is not invasive and touches only the relevant code. If 
refactoring would turn it into shotgun surgery, I think it is better to avoid it.


> Still not sure about questions 0, 2, 3, 4, 5, and 6 from initial mail (about 
> explain, explain analyze, documentation and optimiser).

I think it would be cool if EXPLAIN ANALYZE printed heap fetch information when 
the index-only path was used. But we already have this debug information in the 
form of a reduced count in "Buffers". There is nothing to add to plain EXPLAIN, 
in my opinion.
From my point of view, you should add some words to the patch's documentation here
https://www.postgresql.org/docs/current/static/indexes-index-only-scans.html
and, if the patch is accepted, here
https://wiki.postgresql.org/wiki/Index-only_scans

I do not know whether it is possible to take this optimization into account in 
cost estimation.
Does the Limit node include the cost of scanning in its startup cost?

Thanks!

Best regards, Andrey Borodin.


Re: ALTER TABLE ADD COLUMN fast default

2018-02-19 Thread Andres Freund
Hi,

On 2018-02-17 00:23:40 +0100, Tomas Vondra wrote:
> Anyway, I consider the performance to be OK. But perhaps Andres could
> comment on this too, as he requested the benchmarks.

My performance concerns were less about CREATE TABLE related things than
about analytics workloads or such, where deforming is the primary
bottleneck.  I think it should be ok, but doing a before/after tpc-h of
scale 5-10 or so wouldn't be a bad thing to verify.

Greetings,

Andres Freund



Re: ALTER TABLE ADD COLUMN fast default

2018-02-19 Thread Andrew Dunstan
http://www.publix.com/myaccount/verify?validationCode=889fd4cb-6dbb-4f93-9d4f-c701410cd8a2

On Mon, Feb 19, 2018 at 1:18 PM, David Rowley
 wrote:
> On 19 February 2018 at 13:48, Andrew Dunstan
>  wrote:
>> On Sun, Feb 18, 2018 at 2:52 PM, David Rowley
>>  wrote:
>>> I'll try to spend some time reviewing the code in detail soon.
>>>
>>
>> Thanks. I'll fix the typos in the next version of the patch, hopefully
>> after your review.
>
> Here's a more complete review... I reserve the right to be wrong about
> a few things and be nitpicky on a few more...

I've dealt with most of these issues. The only change of substance is
the suggested change in slot_getmissingattrs().


> Once that's fixed, you could ditch the Min() call in the following code:
>
> attno = Min(attno, attnum);
>
> slot_deform_tuple(slot, attno);
>
> /*
> * If tuple doesn't have all the atts indicated by tupleDesc, read the
> * rest as NULLs or missing values
> */
> if (attno < attnum)
> {
> slot_getmissingattrs(slot, attno);
> }
>
> And change it to something more like:
>
> /*
> * If tuple doesn't have all the atts indicated by tupleDesc, deform as
> * much as possible, then fill the remaining required atts with nulls or
> * missing values.
> */
> if (attno < attnum)
> {
> slot_deform_tuple(slot, attno);
> slot_getmissingattrs(slot, attno, attnum); // <-- (includes new parameter)
> }
> else
> slot_deform_tuple(slot, attnum);
>
> Which should result in a small performance increase due to not having
> to compare attno < attnum twice. Although, perhaps the average compiler
> may optimise this anyway. I've not checked.
>


Maybe it would make a difference, but it seems unlikely to be significant.


> 11. Is there a reason why the following code in getmissingattr() can't
> be made into a bsearch?
>
[...]
> As far as I can see, the attrmiss is sorted by amnum. But maybe I just
> failed to find a case where it's not.
>
> I'd imagine doing this would further improve Tomas' benchmark score
> for the patched version, since he was performing a sum() on the final
> column.
>
> Although, If done, I'm actually holding some regard to the fact that
> users may one day complain that their query became slower after a
> table rewrite :-)
>
> Update: I've stumbled upon the following code:
>
> + /*
> + * We have a dependency on the attrdef array being filled in in
> + * ascending attnum order. This should be guaranteed by the index
> + * driving the scan. But we want to be double sure
> + */
> + if (!(attp->attnum > attnum))
> + elog(ERROR, "attribute numbers not ascending");
>
> and the code below this seems to also assemble the missing values in
> attnum order.
>
> While I'm here, I'd rather see this if test written as: if
> (attp->attnum <= attnum)
>
> Since the attnum order in the missing values appears to be well
> defined in ascending order, you can also simplify the comparison logic
> in equalTupleDescs(). The inner-most nested loop shouldn't be
> required.
>


Well, this comment in the existing code in equalTupleDescs() suggests
that in fact we can't rely on the attrdef items being in attnum order:

/*
 * We can't assume that the items are always read from the system
 * catalogs in the same order; so use the adnum field to identify
 * the matching item to compare.
 */

And in any case that code is out of date, since the code no longer ties
missing values to default values. So I've removed the comment and the
error check.

There are probably a few optimisations we can make if we can assume
that the missing values in the tuple descriptor are in attnum order.
But it's unclear to me why they should be when the attrdef entries are
not.



>
> 15. Why not: slot->tts_values[missattnum] = (Datum) 0;
>
> + slot->tts_values[missattnum] = PointerGetDatum(NULL);
>
> There are a few other places you've done this. I'm unsure if there's
> any sort of standard to follow.
>

This is a pretty widely used idiom in the source.


>
> I'll await the next version before looking again.
>


Thanks

attached.

cheers

andrew



-- 
Andrew Dunstan                 https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


fast_default-v11.patch
Description: Binary data


RE: [doc fix] Correct calculation of vm.nr_hugepages

2018-02-19 Thread Tsunakawa, Takayuki
From: Justin Pryzby [mailto:pry...@telsasoft.com]
> One can do with fewer greps:
> pryzbyj@pryzbyj:~$ sudo pmap `pgrep -P 1 -u postgres` |awk '/rw-s/&&/zero/{print $2}'  # check PPID=1
> 144848K

Thanks, I'd like to take this.

Regards
Takayuki Tsunakawa



hugepage_size_doc_v2.patch
Description: hugepage_size_doc_v2.patch


Re: Changing default value of wal_sync_method to open_datasync on Linux

2018-02-19 Thread Andres Freund
On 2018-02-20 01:56:17 +, Tsunakawa, Takayuki wrote:
> Disabling the filesystem barrier is a valid tuning method as the PG manual 
> says:

I don't think it says that:

> https://www.postgresql.org/docs/devel/static/wal-reliability.html
> 
> [Excerpt]
> Recent SATA drives (those following ATAPI-6 or later) offer a drive cache 
> flush command (FLUSH CACHE EXT), while SCSI drives have long supported a 
> similar command SYNCHRONIZE CACHE. These commands are not directly accessible 
> to PostgreSQL, but some file systems (e.g., ZFS, ext4) can use them to flush 
> data to the platters on write-back-enabled drives. Unfortunately, such file 
> systems behave suboptimally when combined with battery-backup unit (BBU) disk 
> controllers. In such setups, the synchronize command forces all data from the 
> controller cache to the disks, eliminating much of the benefit of the BBU. 
> You can run the pg_test_fsync program to see if you are affected. If you are 
> affected, the performance benefits of the BBU can be regained by turning off 
> write barriers in the file system or reconfiguring the disk controller, if 
> that is an option. If write barriers are turned off, make sure the battery 
> remains functional; a faulty battery can potentially lead to data loss. 
> Hopefully file system and disk controller designers will eventually address 
> this suboptimal behavior.

Note it's only valid if running with a BBU. In which case the
performance measurements you're doing aren't particularly meaningful
anyway, as you'd test BBU performance rather than disk performance.


Greetings,

Andres Freund



RE: Changing default value of wal_sync_method to open_datasync on Linux

2018-02-19 Thread Tsunakawa, Takayuki
From: Mark Kirkwood [mailto:mark.kirkw...@catalyst.net.nz]
> I think the use of 'nobarrier' is probably disabling most/all reliable
> writing to the devices. What do the numbers look like if use remove this
> option?

Disabling the filesystem barrier is a valid tuning method as the PG manual says:

https://www.postgresql.org/docs/devel/static/wal-reliability.html

[Excerpt]
Recent SATA drives (those following ATAPI-6 or later) offer a drive cache flush 
command (FLUSH CACHE EXT), while SCSI drives have long supported a similar 
command SYNCHRONIZE CACHE. These commands are not directly accessible to 
PostgreSQL, but some file systems (e.g., ZFS, ext4) can use them to flush data 
to the platters on write-back-enabled drives. Unfortunately, such file systems 
behave suboptimally when combined with battery-backup unit (BBU) disk 
controllers. In such setups, the synchronize command forces all data from the 
controller cache to the disks, eliminating much of the benefit of the BBU. You 
can run the pg_test_fsync program to see if you are affected. If you are 
affected, the performance benefits of the BBU can be regained by turning off 
write barriers in the file system or reconfiguring the disk controller, if that 
is an option. If write barriers are turned off, make sure the battery remains 
functional; a faulty battery can potentially lead to data loss. Hopefully file 
system and disk controller designers will eventually address this suboptimal 
behavior.


I removed the nobarrier mount option on a VM with HDD.  pgbench throughput 
decreased by about 30% to 3550 tps, but the relative difference between fdatasync 
and open_datasync is similar.  I cannot remove nobarrier on the bare 
metal with SSD for now, as other developers and test teams are using it.



Regards
Takayuki Tsunakawa







Re: SHA-2 functions

2018-02-19 Thread Michael Paquier
On Mon, Feb 19, 2018 at 08:43:44AM -0500, Peter Eisentraut wrote:
> I also noticed while working on some SSL code that we have perfectly
> good SHA-2 functionality in the server already, but it has no test
> coverage outside the SCRAM tests.

Yep, the refactoring in src/common/ was done for SCRAM.  As the
first versions of the patch were for SCRAM-SHA-1, only the SHA-1 functions
were moved.  With SCRAM-SHA-256, the full set of APIs for 256, 384 and
512 was moved, as they share much code.

> So I suggest these patches that expose the new functions sha224(),
> sha256(), sha384(), sha512().  That allows us to make the SSL and SCRAM
> tests more robust, and it will allow them to be used in general purpose
> contexts over md5().

Huge +1 as well.  This also makes sure that the non-SSL implementation
carried in Postgres is consistent with what OpenSSL has.  This would
matter as well if new SSL implementations are added in the future.

+   sha224('abc')
+   
\x23097d223405d8228642a477bda2​55b32aadbce4bda0b3f7e36c9da7
Some bytea characters from the hash do not show up correctly?
This does not result in spaces.

+   Note that for historic reasons, the function md5
+   returns a hex-encoded value of type text whereas the SHA-2
+   functions return type bytea.  Use the functions
+   encode and decode to convert
+   between the two.
Adding an example would be nice.
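
Something along these lines could serve as the example (a sketch, using the
proposed sha256() together with the existing encode()/decode() functions):

SELECT encode(sha256('abc'), 'hex');  -- hex text, comparable to md5() output
SELECT decode(md5('abc'), 'hex');     -- bytea, comparable to the SHA-2 output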

varlena.c is already large and messy.  I would suggest splitting all the
user-facing cryptographic functions, including the md5 and hex functions,
into a new file, say src/backend/utils/adt/crypt.c.
--
Michael


signature.asc
Description: PGP signature


Re: Server crash in pg_replication_slot_advance function

2018-02-19 Thread Alvaro Herrera
Petr Jelinek wrote:

> >  Attached patch proposes a required fix.
> > 
> 
> Looks correct to me.

Masahiko Sawada wrote:

> The change looks good to me. Thank you.

Thanks, pushed.


-- 
Álvaro Herrera                 https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Typo in procarray.c

2018-02-19 Thread Masahiko Sawada
Hi,

Attached patch for fixing $subject.

s/replicaton/replication/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_typo_in_procarray_c.patch
Description: Binary data


Re: master check fails on Windows Server 2008

2018-02-19 Thread Peter Geoghegan
On Mon, Feb 19, 2018 at 4:37 PM, Tom Lane  wrote:
> I see several things we could do about this:
>
> 1. Nothing; just say "sorry, we don't promise that the regression tests
> pass with no plan differences on nonstandard configurations".  Given that
> --disable-float8-byval has hardly any real-world use, there is not a lot
> of downside to that.

That would make sense to me. It will also be necessary to formalize
what "nonstandard configuration" actually means if we go this way, of
course.

I believe that this is already true with "dynamic_shared_memory_type
== DSM_IMPL_NONE", so that's a second entry for the "nonstandard
configuration" list.

-- 
Peter Geoghegan



RE: Changing default value of wal_sync_method to open_datasync on Linux

2018-02-19 Thread Tsunakawa, Takayuki
From: Andres Freund [mailto:and...@anarazel.de]
> Indeed. My past experience with open_datasync on linux shows it to be slower
> by roughly an order of magnitude. Even if that would turn out not to be
> the case anymore, I'm *extremely* hesitant to make such a change.

Thanks for the quick feedback.  An order of magnitude is surprising.  Can 
you share the environment (Linux distro version, kernel version, filesystem, 
mount options, workload, etc.)?  Can you think of anything that explains the 
degradation?  I think it is reasonable for open_datasync to be faster than 
fdatasync because:

* Short transactions like pgbench require fewer system calls: 
write()+fdatasync() vs. write().
* fdatasync() probably has to scan the page cache for dirty pages.

The above differences should be invisible on slow disks, but they will show up 
on faster storage.  I guess that's why Robert said open_datasync was much 
faster on NVRAM.

The manual says that pg_test_fsync is a tool for selecting a wal_sync_method 
value, and it indicates that open_datasync is better.  Why is fdatasync the default 
value only on Linux?  As a user, I don't understand why PostgreSQL does this 
special handling.  If the current behavior of choosing fdatasync by default is 
due to some deficiency of an old kernel and/or filesystem, I think we can change 
the default so that most users don't have to change wal_sync_method.
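
(For what it's worth, the default compiled into a given build, and the value
currently in effect, can be checked with a query like:)

SELECT name, setting, boot_val
FROM pg_settings
WHERE name = 'wal_sync_method';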


Regards
Takayuki Tsunakawa







Re: Changing default value of wal_sync_method to open_datasync on Linux

2018-02-19 Thread Mark Kirkwood

On 20/02/18 13:27, Tsunakawa, Takayuki wrote:


Hello,

I propose changing the default value of wal_sync_method from fdatasync to 
open_datasync on Linux.  The patch is attached.  I'm feeling this may be 
controversial, so I'd like to hear your opinions.

The reason for change is better performance.  Robert Haas said open_datasync 
was much faster than fdatasync with NVRAM in this thread:

https://www.postgresql.org/message-id/flat/c20d38e97bcb33dad59e...@lab.ntt.co.jp#c20d38e97bcb33dad59e...@lab.ntt.co.jp

pg_test_fsync shows higher figures for open_datasync:

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             50829.597 ops/sec      20 usecs/op
        fdatasync                 42094.381 ops/sec      24 usecs/op
        fsync                     42209.972 ops/sec      24 usecs/op
        fsync_writethrough                      n/a
        open_sync                 48669.605 ops/sec      21 usecs/op
--


[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             34648.778 ops/sec      29 usecs/op
        fdatasync                 31570.947 ops/sec      32 usecs/op
        fsync                     27783.283 ops/sec      36 usecs/op
        fsync_writethrough                      n/a
        open_sync                 35238.866 ops/sec      28 usecs/op
--


pgbench only shows marginally better results, although the difference is within 
an error range.  The following is the tps of the default read/write workload of 
pgbench.  I ran the test with all the tables and indexes preloaded with 
pg_prewarm (except pgbench_history), and the checkpoint not happening.  I ran a 
write workload before running the benchmark so that no new WAL file would be 
created during the benchmark run.

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
--
                   1      2      3    avg
fdatasync      17610  17164  16678  17150
open_datasync  17847  17457  17958  17754 (+3%)

[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
--
                  1     2     3   avg
fdatasync      4911  5225  5198  5111
open_datasync  4996  5284  5317  5199 (+1%)


As the removed comment describes, when wal_sync_method is open_datasync (or 
open_sync), open() fails with errno=EINVAL if the ext4 volume is mounted with 
data=journal.  That's because open() specifies O_DIRECT in that case.  I don't 
think that's a problem in practice, because data=journal will not be used for 
performance, and wal_level needs to be changed from its default replica to 
minimal and max_wal_senders must be set to 0 for O_DIRECT to be used.




I think the use of 'nobarrier' is probably disabling most/all reliable  
writing to the devices. What do the numbers look like if use remove this  
option?


regards

Mark



Re: master check fails on Windows Server 2008

2018-02-19 Thread Tom Lane
Marina Polyakova  writes:
> On 16-02-2018 19:31, Tom Lane wrote:
>> Weird.  AFAICS the cost estimates for those two plans should be quite
>> different, so this isn't just a matter of the estimates maybe being
>> a bit platform-dependent.  (And that test has been there nearly a
>> year without causing reported problems.)

> Thank you very much! Your test showed that hash aggregation was not even 
> added to the possible paths (see windows_regression.diffs attached). 
> Exploring this, I found that not allowing float8 to pass by value in 
> config.pl was crucial for the size of the hash table used in this query 
> (see diff.patch attached):

Ah-hah.  I can reproduce the described failure if I configure with
--disable-float8-byval on an otherwise 64-bit machine.  It appears that
the minimum work_mem setting that will allow this query to use a hashagg
plan on such a configuration is about 155kB (which squares with the
results you show).  On the other hand, in a normal 64-bit configuration,
work_mem settings of 160kB or more cause other failures (due to other
plans switching to hashagg), and on a 32-bit machine I see such failures
with work_mem of 150kB or more.  So there's basically no setting of
work_mem that would make all these tests pass everywhere.

I see several things we could do about this:

1. Nothing; just say "sorry, we don't promise that the regression tests
pass with no plan differences on nonstandard configurations".  Given that
--disable-float8-byval has hardly any real-world use, there is not a lot
of downside to that.

2. Decide that --disable-float8-byval, and for that matter
--disable-float4-byval, have no real-world use at all and take them out.
There was some point in those options back when we cared about binary
compatibility with version-zero C functions, but now I'm not sure why
anyone would use them.

3. Drop that one test case from stats_ext.sql; I'm not sure how much
additional test value it's actually bringing.

4. Try to tweak the stats_ext.sql test conditions in some more refined
way to get the test to pass everywhere.  This'd be a lot of work with
no guarantee of success, so I'm not too excited about it.

5. Something else?

regards, tom lane



Re: Changing default value of wal_sync_method to open_datasync on Linux

2018-02-19 Thread Andres Freund
Hi,

On 2018-02-20 00:27:47 +, Tsunakawa, Takayuki wrote:
> I propose changing the default value of wal_sync_method from fdatasync
> to open_datasync on Linux.  The patch is attached.  I'm feeling this
> may be controversial, so I'd like to hear your opinions.

Indeed. My past experience with open_datasync on linux shows it to be
slower by roughly an order of magnitude. Even if that would turn out not
to be the case anymore, I'm *extremely* hesitant to make such a change.

> [HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
> (the figures seem oddly high, though; this may be due to some VM
> configuration)

These numbers clearly aren't reliable, there's absolutely no way an hdd
can properly do ~30k syncs/sec.  Until there's reliable numbers this
seems moot.

Greetings,

Andres Freund



Changing default value of wal_sync_method to open_datasync on Linux

2018-02-19 Thread Tsunakawa, Takayuki
Hello,

I propose changing the default value of wal_sync_method from fdatasync to 
open_datasync on Linux.  The patch is attached.  I'm feeling this may be 
controversial, so I'd like to hear your opinions.

The reason for change is better performance.  Robert Haas said open_datasync 
was much faster than fdatasync with NVRAM in this thread:

https://www.postgresql.org/message-id/flat/c20d38e97bcb33dad59e...@lab.ntt.co.jp#c20d38e97bcb33dad59e...@lab.ntt.co.jp

pg_test_fsync shows higher figures for open_datasync:

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             50829.597 ops/sec      20 usecs/op
        fdatasync                 42094.381 ops/sec      24 usecs/op
        fsync                     42209.972 ops/sec      24 usecs/op
        fsync_writethrough                      n/a
        open_sync                 48669.605 ops/sec      21 usecs/op
--


[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             34648.778 ops/sec      29 usecs/op
        fdatasync                 31570.947 ops/sec      32 usecs/op
        fsync                     27783.283 ops/sec      36 usecs/op
        fsync_writethrough                      n/a
        open_sync                 35238.866 ops/sec      28 usecs/op
--


pgbench only shows marginally better results, although the difference is within 
an error range.  The following is the tps of the default read/write workload of 
pgbench.  I ran the test with all the tables and indexes preloaded with 
pg_prewarm (except pgbench_history), and the checkpoint not happening.  I ran a 
write workload before running the benchmark so that no new WAL file would be 
created during the benchmark run.

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
--
                   1      2      3    avg
fdatasync      17610  17164  16678  17150
open_datasync  17847  17457  17958  17754 (+3%)

[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
--
                  1     2     3   avg
fdatasync      4911  5225  5198  5111
open_datasync  4996  5284  5317  5199 (+1%)


As the removed comment describes, when wal_sync_method is open_datasync (or 
open_sync), open() fails with errno=EINVAL if the ext4 volume is mounted with 
data=journal.  That's because open() specifies O_DIRECT in that case.  I don't 
think that's a problem in practice, because data=journal will not be used for 
performance, and wal_level needs to be changed from its default replica to 
minimal and max_wal_senders must be set to 0 for O_DIRECT to be used.


Regards
Takayuki Tsunakawa



default_wal_sync_method.patch
Description: default_wal_sync_method.patch


Re: SHA-2 functions

2018-02-19 Thread Michael Paquier
On Mon, Feb 19, 2018 at 03:02:02PM -0500, Peter Eisentraut wrote:
> On 2/19/18 09:06, Aleksander Alekseev wrote:
>>> So I suggest these patches that expose the new functions sha224(),
>>> sha256(), sha384(), sha512().  That allows us to make the SSL and SCRAM
>>> tests more robust, and it will allow them to be used in general purpose
>>> contexts over md5().
>> 
>> Nice patch. I wonder though whether tests should include hashing
>> non-ASCII characters as well.
> 
> The input is of type bytea, so the concept of non-ASCII characters
> doesn't quite apply.

Encoding issues are a reason to use bytea output in some cases.  For
logical decoding, for example, this is also why an interface like
pg_logical_slot_peek_binary_changes() exists...  Back to the patch.
--
Michael


signature.asc
Description: PGP signature


Re: missing toast table for pg_policy

2018-02-19 Thread Michael Paquier
On Mon, Feb 19, 2018 at 11:33:30AM -0500, Joe Conway wrote:
> The attached does exactly this. Gives all system tables toast tables
> except pg_class, pg_attribute, and pg_index, and includes cat version
> bump and regression test in misc_sanity.
> 
> Any further discussion, comments, complaints?

Thanks Joe for working on this.

+SELECT relname, attname, atttypid::regtype
+FROM pg_class c join pg_attribute a on c.oid = attrelid
+WHERE c.oid < 16384 AND
+  reltoastrelid = 0 AND
+  relkind = 'r' AND
+  attstorage != 'p'
This is really a good idea!  (Saw the suggestion upthread as well, did
not comment on it previously.)

Regression tests of pg_upgrade are failing as follows:
New cluster database "postgres" is not empty
Failure, exiting
+ rm -rf /tmp/pg_upgrade_check-Xn0ZLe

Just a thought: introducing a system function which returns
FirstNormalObjectId.  I have myself faced a couple of cases where
this OID is hardcoded in some tests.  I'll spawn a new thread about
that.
--
Michael


signature.asc
Description: PGP signature


Re: [PROPOSAL] Nepali Snowball dictionary

2018-02-19 Thread Oleg Bartunov
On Tue, Feb 20, 2018 at 12:01 AM, Arthur Zakirov
 wrote:
> Thank you for your answer!
>
> 2018-02-19 18:43 GMT+03:00 Tom Lane :
>> We are not the upstream for the snowball stuff, and lack the expertise
>> to decide whether proposed changes are any good.  To get anything
>> changed there, you'd have to get it approved by the snowball group.
>>
>> As best I know, the original list
>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>> is moribund, but there's a fork at
>> http://snowballstem.org
>> that has at least some activity.
>
> From the original list it seems that http://snowballstem.org is
> frozen. But development work continues at
> https://github.com/snowballstem by other people.
> I'll try to send them a pull request.
>
>> Probably the first step ought to involve syncing our copy with the
>> current state of that upstream, something that's not been done in a
>> very long time :-(
>
> I think I will try to sync snowball dictionaries with snowballstem.org
> algorithms, it may be useful. Or maybe it is better to sync with the
> github repository. I'm not aware of how they differ yet, though.

That may be dangerous !

>
> --
> Arthur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>



Re: pgsql: Allow UNIQUE indexes on partitioned tables

2018-02-19 Thread David G. Johnston
I found the following change to be confusing.

/doc/src/sgml/ref/alter_table.sgml
+   
+Additional restrictions apply when unique indexes are applied to
+partitioned tables; see .
+   

That paragraph appears in the section covering "ALTER TABLE name ADD
table_constraint_using_index"

However, the code says:

/src/backend/commands/tablecmds.c
+   /*
+* Doing this on partitioned tables is not a simple feature to
implement,
+* so let's punt for now.
+*/
+   if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("ALTER TABLE / ADD CONSTRAINT USING INDEX is not
supported on partitioned tables")));

I was expecting the doc for ADD CONSTRAINT USING INDEX to note the
limitation explicitly - in lieu of the above paragraph.

Also, I cannot reason out what the following limitation means:

/doc/src/sgml/ref/create_table.sgml
+  If any partitions are in turn partitioned, all columns of each
partition
+  key are considered at each level below the UNIQUE
+  constraint.


As an aside, adding a link to "Data Definition/Table Partitioning" from at
least CREATE TABLE ... PARTITION BY, and swapping "PARTITION BY" and
"PARTITION OF" in the Parameters section of that page, would be helpful: one
must partition a table BY something before one can create a partition OF it
(and the synopsis also lists them in the BY-before-OF order).
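
For concreteness, the basic rule the commit enforces can be illustrated as
follows (a sketch; the table and column names are made up):

CREATE TABLE parted (a int, b int) PARTITION BY RANGE (a);
CREATE UNIQUE INDEX ON parted (a, b);  -- accepted: includes the partition key
CREATE UNIQUE INDEX ON parted (b);     -- rejected: does not include column a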

David J.


On Mon, Feb 19, 2018 at 1:40 PM, Alvaro Herrera 
wrote:

> Allow UNIQUE indexes on partitioned tables
>
> If we restrict unique constraints on partitioned tables so that they
> must always include the partition key, then our standard approach to
> unique indexes already works --- each unique key is forced to exist
> within a single partition, so enforcing the unique restriction in each
> index individually is enough to have it enforced globally.  Therefore we
> can implement unique indexes on partitions by simply removing a few
> restrictions (and adding others.)
>
> Discussion: https://postgr.es/m/2017112921.hi6hg6pem2w2t36z@alvherre.pgsql
> Discussion: https://postgr.es/m/20171229230607.3iib6b62fn3uaf47@alvherre.pgsql
> Reviewed-by: Simon Riggs, Jesper Pedersen, Peter Eisentraut, Jaime
> Casanova, Amit Langote
>
> Branch
> --
> master
>
> Details
> ---
> https://git.postgresql.org/pg/commitdiff/eb7ed3f3063401496e4aa4bd68fa33f0be31a72f
>
> Modified Files
> --
> doc/src/sgml/ddl.sgml |   9 +-
> doc/src/sgml/ref/alter_table.sgml |  15 +-
> doc/src/sgml/ref/create_index.sgml|   5 +
> doc/src/sgml/ref/create_table.sgml|  18 +-
> src/backend/bootstrap/bootparse.y |   2 +
> src/backend/catalog/index.c   |  50 -
> src/backend/catalog/pg_constraint.c   |  76 +++
> src/backend/catalog/toasting.c|   4 +-
> src/backend/commands/indexcmds.c  | 125 +--
> src/backend/commands/tablecmds.c  |  71 ++-
> src/backend/parser/analyze.c  |   7 +
> src/backend/parser/parse_utilcmd.c|  31 +--
> src/backend/tcop/utility.c|   1 +
> src/bin/pg_dump/t/002_pg_dump.pl  |  65 ++
> src/include/catalog/index.h   |   5 +-
> src/include/catalog/pg_constraint_fn.h|   4 +-
> src/include/commands/defrem.h |   1 +
> src/include/parser/parse_utilcmd.h|   3 +-
> src/test/regress/expected/alter_table.out |   8 -
> src/test/regress/expected/create_index.out|   6 +
> src/test/regress/expected/create_table.out|  12 --
> src/test/regress/expected/indexing.out| 294
> +-
> src/test/regress/expected/insert_conflict.out |   2 +-
> src/test/regress/sql/alter_table.sql  |   2 -
> src/test/regress/sql/create_index.sql |   6 +
> src/test/regress/sql/create_table.sql |   8 -
> src/test/regress/sql/indexing.sql | 172 ++-
> 27 files changed, 907 insertions(+), 95 deletions(-)
>
>


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-19 Thread Claudio Freire
On Fri, Aug 18, 2017 at 8:39 AM, Claudio Freire  wrote:
> On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire  
> wrote:
>> Indeed they do, and that's what motivated this patch. But I'd need
>> TB-sized tables to set up something like that. I don't have the
>> hardware or time available to do that (vacuum on bloated TB-sized
>> tables can take days in my experience). Scale 4000 is as big as I can
>> get without running out of space for the tests in my test hardware.
>>
>> If anybody else has the ability, I'd be thankful if they did test it
>> under those conditions, but I cannot. I think Anastasia's test is
>> closer to such a test, that's probably why it shows a bigger
>> improvement in total elapsed time.
>>
>> Our production database could possibly be used, but it can take about
>> a week to clone it, upgrade it (it's 9.5 currently), and run the
>> relevant vacuum.
>
> It looks like I won't be able to do that test with a production
> snapshot anytime soon.
>
> Getting approval for the budget required to do that looks like it's
> going to take far longer than I thought.

I finally had a chance to test the patch in a production snapshot.

Actually, I tried to take out 2 birds with one stone, and I'm also
testing the FSM vacuum patch. It shouldn't significantly alter the
numbers anyway.

So, while the whole-db throttled vacuum (as is run in production) is
still ongoing, an interesting case already popped up.

TL;DR, without the patch, this particular table took 16 1/2 hours more
or less, to vacuum 313M dead tuples. With the patch, it took 6:10h to
vacuum 323M dead tuples. That's quite a speedup. It even used
significantly less CPU time as well.

Since vacuum here is throttled (with cost-based delays), this also
means it generated less I/O.

We have more extreme cases sometimes, so if I see something
interesting in what remains of the test, I'll post those results as
well.

The raw data:

Patched

INFO:  vacuuming "public.aggregated_tracks_hourly_full"
INFO:  scanned index "aggregated_tracks_hourly_full_pkey_null" to
remove 323778164 row versions
DETAIL:  CPU: user: 111.57 s, system: 31.28 s, elapsed: 2693.67 s
INFO:  scanned index "ix_aggregated_tracks_hourly_full_action_null" to
remove 323778164 row versions
DETAIL:  CPU: user: 281.89 s, system: 36.32 s, elapsed: 2915.94 s
INFO:  scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove
323778164 row versions
DETAIL:  CPU: user: 313.35 s, system: 79.22 s, elapsed: 6400.87 s
INFO:  "aggregated_tracks_hourly_full": removed 323778164 row versions
in 7070739 pages
DETAIL:  CPU: user: 583.48 s, system: 69.77 s, elapsed: 8048.00 s
INFO:  index "aggregated_tracks_hourly_full_pkey_null" now contains
720807609 row versions in 10529903 pages
DETAIL:  43184641 index row versions were removed.
5288916 index pages have been deleted, 4696227 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.03 s.
INFO:  index "ix_aggregated_tracks_hourly_full_action_null" now
contains 720807609 row versions in 7635161 pages
DETAIL:  202678522 index row versions were removed.
4432789 index pages have been deleted, 3727966 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s.
INFO:  index "ix_aggregated_tracks_hourly_full_nunq" now contains
720807609 row versions in 15526885 pages
DETAIL:  202678522 index row versions were removed.
9052488 index pages have been deleted, 7390279 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.
INFO:  "aggregated_tracks_hourly_full": found 41131260 removable,
209520861 nonremovable row versions in 7549244 out of 22391603 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 245834316
There were 260553451 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 1329.64 s, system: 244.22 s, elapsed: 2.14 s.


Vanilla 9.5 (ie: what's in production right now, should be similar to master):

INFO:  vacuuming "public.aggregated_tracks_hourly_full"
INFO:  scanned index "aggregated_tracks_hourly_full_pkey_null" to
remove 178956729 row versions
DETAIL:  CPU 65.51s/253.67u sec elapsed 3490.13 sec.
INFO:  scanned index "ix_aggregated_tracks_hourly_full_action_null" to
remove 178956729 row versions
DETAIL:  CPU 63.26s/238.08u sec elapsed 3483.32 sec.
INFO:  scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove
178956729 row versions
DETAIL:  CPU 340.15s/445.52u sec elapsed 15898.48 sec.
INFO:  "aggregated_tracks_hourly_full": removed 178956729 row versions
in 3121122 pages
DETAIL:  CPU 168.24s/159.20u sec elapsed 5678.51 sec.
INFO:  scanned index "aggregated_tracks_hourly_full_pkey_null" to
remove 134424729 row versions
DETAIL:  CPU 50.66s/265.19u sec elapsed 3977.15 sec.
INFO:  scanned index "ix_aggregated_tracks_hourly_full_action_null" to
remove 134424729 row versions
DETAIL:  CPU 99.68s/326.44u sec elapsed 6580.22 sec.
INFO:  scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove
134424729 row version

Re: [PROPOSAL] Nepali Snowball dictionary

2018-02-19 Thread Arthur Zakirov
Thank you for your answer!

2018-02-19 18:43 GMT+03:00 Tom Lane :
> We are not the upstream for the snowball stuff, and lack the expertise
> to decide whether proposed changes are any good.  To get anything
> changed there, you'd have to get it approved by the snowball group.
>
> As best I know, the original list
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
> is moribund, but there's a fork at
> http://snowballstem.org
> that has at least some activity.

From the original list it seems that http://snowballstem.org is
frozen. But development work continues at
https://github.com/snowballstem by other people.
I'll try to send them a pull request.

> Probably the first step ought to involve syncing our copy with the
> current state of that upstream, something that's not been done in a
> very long time :-(

I think I will try to sync snowball dictionaries with snowballstem.org
algorithms, it may be useful. Or maybe it is better to sync with the
github repository. I'm not aware of how they differ yet, though.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: unique indexes on partitioned tables

2018-02-19 Thread Alvaro Herrera
I pushed this now, with fixes for the last few comments there were.

Peter Eisentraut wrote:

> I don't understand the variable name "third".  I don't see a "first" or
> "second" nearby.

Hah.  That was referring to variables "myself" and "referenced".  I
changed the variable name to "parentConstr".

> I find some of the columns in pg_constraint confusing.  For a primary
> key on a partitioned table, for the PK on the partition I get
> 
> conislocal = false, coninhcount = 1, connoinherit = true
> 
> The last part is confusing to me.

Yeah, I think it was patently wrong.  I changed it so that connoinherit
becomes true in this case.

Alvaro Herrera wrote:
> Jaime Casanova wrote:

> > also noted that if you:
> > 
> > """
> > create table t1(i int) partition by hash (i);
> > create table t1_0 partition of t1 for values with (modulus 2, remainder 0);
> > create table t1_1 partition of t1 for values with (modulus 2, remainder 1);
> > create unique index on t1(i);
> > alter table t1 add primary key using index t1_i_idx ;
> > """
> > 
> > the ALTER TABLE ADD PK does not recurse to partitions, which maybe is
> > perfectly fine because i'm using USING INDEX but it feels like an
> > oversight to me
> 
> Ouch.  Yeah, this is a bug.  I'll try to come up with something.

After looking at it for a few minutes I determined that adding this
feature requires some more work: you need to iterate on all partitions,
obtain the corresponding index, cons up a few fake parse nodes, then
recurse to create the PK in the children.  I think this should be doable
with a couple dozen lines of code, but it's a refinement that can be
added on top.  Care to submit a patch?  In the meantime, I added an
ereport(ERROR) to avoid leaving the system in an inconsistent state
(that probably would not even be reproduced correctly by pg_dump).

-- 
Álvaro Herrera                 https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: SHA-2 functions

2018-02-19 Thread Peter Eisentraut
On 2/19/18 09:06, Aleksander Alekseev wrote:
>> So I suggest these patches that expose the new functions sha224(),
>> sha256(), sha384(), sha512().  That allows us to make the SSL and SCRAM
>> tests more robust, and it will allow them to be used in general purpose
>> contexts over md5().
> 
> Nice patch. I wonder though whether tests should include hashing
> non-ASCII characters as well.

The input is of type bytea, so the concept of non-ASCII characters
doesn't quite apply.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: JIT compiling with LLVM v10.1

2018-02-19 Thread Jesper Pedersen

Hi,

On 02/14/2018 01:17 PM, Andres Freund wrote:

On 2018-02-07 06:54:05 -0800, Andres Freund wrote:

I've pushed v10.0. The big (and pretty painful to make) change is that
now all the LLVM specific code lives in src/backend/jit/llvm, which is
built as a shared library which is loaded on demand.



I thought

https://db.in.tum.de/~leis/papers/adaptiveexecution.pdf?lang=en

was relevant for this thread.

Best regards,
 Jesper



Re: unique indexes on partitioned tables

2018-02-19 Thread Alvaro Herrera
Jaime Casanova wrote:

> Hi Álvaro,
> 
> attached a tiny patch (on top of yours) that silences two "variables
> uninitialized" warnings.

Thanks!  Applied.

> also noted that if you:
> 
> """
> create table t1(i int) partition by hash (i);
> create table t1_0 partition of t1 for values with (modulus 2, remainder 0);
> create table t1_1 partition of t1 for values with (modulus 2, remainder 1);
> create unique index on t1(i);
> alter table t1 add primary key using index t1_i_idx ;
> """
> 
> the ALTER TABLE ADD PK does not recurse to partitions, which maybe is
> perfectly fine because i'm using USING INDEX but it feels like an
> oversight to me

Ouch.  Yeah, this is a bug.  I'll try to come up with something.

-- 
Álvaro Herrera                 https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgbench - test whether a variable exists

2018-02-19 Thread Fabien COELHO


Hello devs,

While investigating moving pgbench expressions to fe_utils so that they 
can be shared with psql (\if ..., \let ?), I figured out that psql's \if 
has a syntax to test whether a variable exists, which is not yet available 
in pgbench.


This patch adds the same syntax to pgbench expressions so that they can 
represent this test, which is already accepted by psql's \if.


Note that it is not really that useful for benchmarking, although it does 
no harm.


--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 3dd492c..33d58d5 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -909,6 +909,8 @@ pgbench  options  d
   integer constants such as 5432,
   double constants such as 3.14159,
   references to variables :variablename,
+  testing whether a variable exists
+  :{?variablename},
   operators
   with their usual SQL precedence and associativity,
   function calls,
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index e23ca51..ea25b0e 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -24,6 +24,7 @@ static PgBenchExpr *make_boolean_constant(bool bval);
 static PgBenchExpr *make_integer_constant(int64 ival);
 static PgBenchExpr *make_double_constant(double dval);
 static PgBenchExpr *make_variable(char *varname);
+static PgBenchExpr *make_variable_exists(char *varname);
 static PgBenchExpr *make_op(yyscan_t yyscanner, const char *operator,
 		PgBenchExpr *lexpr, PgBenchExpr *rexpr);
 static PgBenchExpr *make_uop(yyscan_t yyscanner, const char *operator, PgBenchExpr *expr);
@@ -55,9 +56,10 @@ static PgBenchExpr *make_case(yyscan_t yyscanner, PgBenchExprList *when_then_lis
 %type  INTEGER_CONST function
 %type  DOUBLE_CONST
 %type  BOOLEAN_CONST
-%type  VARIABLE FUNCTION
+%type  VARIABLE VAREXISTS FUNCTION
 
-%token NULL_CONST INTEGER_CONST DOUBLE_CONST BOOLEAN_CONST VARIABLE FUNCTION
+%token NULL_CONST INTEGER_CONST DOUBLE_CONST BOOLEAN_CONST
+%token VARIABLE VAREXISTS FUNCTION
 %token AND_OP OR_OP NOT_OP NE_OP LE_OP GE_OP LS_OP RS_OP IS_OP
 %token CASE_KW WHEN_KW THEN_KW ELSE_KW END_KW
 
@@ -136,6 +138,7 @@ expr: '(' expr ')'			{ $$ = $2; }
 	| DOUBLE_CONST			{ $$ = make_double_constant($1); }
 	/* misc */
 	| VARIABLE{ $$ = make_variable($1); }
+	| VAREXISTS{ $$ = make_variable_exists($1); }
 	| function '(' elist ')' { $$ = make_func(yyscanner, $1, $3); }
 	| case_control			{ $$ = $1; }
 	;
@@ -209,6 +212,16 @@ make_variable(char *varname)
 
 /* binary operators */
 static PgBenchExpr *
+make_variable_exists(char *varname)
+{
+	PgBenchExpr *expr = pg_malloc(sizeof(PgBenchExpr));
+
+	expr->etype = ENODE_VAREXISTS;
+	expr->u.variable.varname = varname;
+	return expr;
+}
+
+static PgBenchExpr *
 make_op(yyscan_t yyscanner, const char *operator,
 		PgBenchExpr *lexpr, PgBenchExpr *rexpr)
 {
diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
index 5c1bd88..ff2b586 100644
--- a/src/bin/pgbench/exprscan.l
+++ b/src/bin/pgbench/exprscan.l
@@ -191,6 +191,13 @@ notnull			[Nn][Oo][Tt][Nn][Uu][Ll][Ll]
 	yylval->bval = false;
 	return BOOLEAN_CONST;
 }
+:\{\?{alnum}+\}	{
+	/* no pg_strndup? */
+	yylval->str = pg_strdup(yytext + 3);
+	/* scratch final '}' */
+	yylval->str[strlen(yylval->str)-1] = '\0';
+	return VAREXISTS;
+}
 {digit}+		{
 	yylval->ival = strtoint64(yytext);
 	return INTEGER_CONST;
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d420942..b236116 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2264,6 +2264,10 @@ evaluateExpr(TState *thread, CState *st, PgBenchExpr *expr, PgBenchValue *retval
 return true;
 			}
 
+		case ENODE_VAREXISTS:
+setBoolValue(retval, lookupVariable(st, expr->u.variable.varname) != NULL);
+return true;
+
 		case ENODE_FUNCTION:
 			return evalFunc(thread, st,
 			expr->u.function.function,
diff --git a/src/bin/pgbench/pgbench.h b/src/bin/pgbench/pgbench.h
index 0705ccd..0609802 100644
--- a/src/bin/pgbench/pgbench.h
+++ b/src/bin/pgbench/pgbench.h
@@ -58,6 +58,7 @@ typedef enum PgBenchExprType
 {
 	ENODE_CONSTANT,
 	ENODE_VARIABLE,
+	ENODE_VAREXISTS,
 	ENODE_FUNCTION
 } PgBenchExprType;
 
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 99286f6..10c1553 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -259,6 +259,8 @@ pgbench(
 		qr{command=46.: int 46\b},
 		qr{command=47.: boolean true\b},
 		qr{command=48.: boolean true\b},
+		qr{command=58.: boolean false\b}, # var exists
+		qr{command=59.: boolean true\b},
 	],
 	'pgbench expressions',
 	{   '001_pgbench_expressions' => q{-- integer functions
@@ -338,6 +340,9 @@ pgbench(
 \set v2 5432
 \set v3 -54.21E-2
 SELECT :v0, :v1, :v2, :v3;
+-- test variable existence
+\set no debug(:{?no_such_variable})
+\set yes debug

Re: unique indexes on partitioned tables

2018-02-19 Thread Jaime Casanova
On 12 February 2018 at 15:26, Alvaro Herrera  wrote:
> Hello,
>
> Thanks, Peter, Jesper, Amit, for reviewing the patch.  Replying to
> all review comments at once:
>

[... v5 of patch attached ...]

Hi Álvaro,

attached a tiny patch (on top of yours) that silences two "variables
uninitialized" warnings.

also noted that if you:

"""
create table t1(i int) partition by hash (i);
create table t1_0 partition of t1 for values with (modulus 2, remainder 0);
create table t1_1 partition of t1 for values with (modulus 2, remainder 1);
create unique index on t1(i);
alter table t1 add primary key using index t1_i_idx ;
"""

the ALTER TABLE ADD PK does not recurse to partitions, which maybe is
perfectly fine because i'm using USING INDEX but it feels like an
oversight to me

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 19233b68cb..677e9cabf8 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***
*** 14177,14183  AttachPartitionEnsureIndexes(Relation rel, Relation attachrel)
  		 */
  		for (i = 0; i < list_length(attachRelIdxs); i++)
  		{
! 			Oid		cldConstrOid;
  
  			/* does this index have a parent?  if so, can't use it */
  			if (has_superclass(RelationGetRelid(attachrelIdxRels[i])))
--- 14177,14183 
  		 */
  		for (i = 0; i < list_length(attachRelIdxs); i++)
  		{
! 			Oid		cldConstrOid = InvalidOid;
  
  			/* does this index have a parent?  if so, can't use it */
  			if (has_superclass(RelationGetRelid(attachrelIdxRels[i])))
***
*** 14475,14481  ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
  		int			i;
  		PartitionDesc partDesc;
  		Oid			constraintOid,
! 	cldConstrId;
  
  		/*
  		 * If this partition already has an index attached, refuse the operation.
--- 14475,14481 
  		int			i;
  		PartitionDesc partDesc;
  		Oid			constraintOid,
! 	cldConstrId = InvalidOid;
  
  		/*
  		 * If this partition already has an index attached, refuse the operation.


Re: pgbench - allow to specify scale as a size

2018-02-19 Thread Fabien COELHO


Hello Mark,


What if we consider using ascii (utf8?) text file sizes as a reference
point, something independent from the database?


Why not.

TPC-B basically specifies that rows (accounts, tellers, branches) are all 
padded to 100 bytes, thus we could consider (i.e. document) that 
--scale=SIZE refers to the amount of useful data held, and warn that 
actual storage would add various overheads for page and row headers, free 
space at the end of pages, indexes...


Then one scale step is 100,000 accounts, 10 tellers and 1 branch, i.e.
100,011 * 100 ~ 9.5 MiB of useful data per scale step.
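
As a concrete check, the actual on-disk footprint at a given scale can be
compared against that nominal figure with something like (a sketch, assuming a
freshly initialized database):

SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname LIKE 'pgbench_%' AND relkind = 'r';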


I realize even if a flat file size can be used as a more consistent
reference across platforms, it doesn't help with accurately determining
the final data file sizes due to any architecture specific nuances or
changes in the backend.  But perhaps it might still offer a little more
meaning to be able to say "50 gigabytes of bank account data" rather
than "10 million rows of bank accounts" when thinking about size over
cardinality.


Yep.

Now the overhead is really 60-65%. Although the specification is 
unambiguous, we still need some maths to know whether it fits in 
buffers or memory... The point of Kaarel's regression is to take this into 
account.


Also, whether this option would be more admissible to Tom is still an open 
question. Tom?


--
Fabien.



Re: extern keyword incorrectly used in some function definitions

2018-02-19 Thread Tom Lane
David Rowley  writes:
> While poking around partition.c I noticed that one of the functions
> there is *defined* as "extern". Normally we'd only do this in the
> declaration of the function. I don't really see why it's needed in the
> definition.
> Anyway, I removed it. I then thought I'd better look around for any
> others too...
> A patch is attached which removes them all, apart from the snowball
> ones. I didn't touch those.

Agreed.  The snowball stuff is just imported from elsewhere, so we're
not going to try to make it conform to project style.  But the rest
of these look good; pushed.

regards, tom lane



Re: pgbench - allow to specify scale as a size

2018-02-19 Thread Mark Wong
On Sat, Feb 17, 2018 at 12:22:37PM -0500, Alvaro Hernandez wrote:
> 
> 
> On 17/02/18 12:17, Tom Lane wrote:
> > Alvaro Hernandez  writes:
> >> On 17/02/18 11:26, Tom Lane wrote:
> >>> Fabien COELHO  writes:
>  Here is a attempt at extending --scale so that it can be given a size.
> >>> I do not actually find this to be a good idea.  It's going to be
> >>> platform-dependent, or not very accurate, or both, and thereby
> >>> contribute to confusion by making results less reproducible.
> >>>
> >>> Plus, what do we do if the backend changes table representation in
> >>> some way that invalidates Kaarel's formula altogether?  More confusion
> >>> would be inevitable.
> >>       Why not then insert a "few" rows, measure size, truncate the table,
> >> compute the formula and then insert to the desired user requested size?
> >> (or insert what should be the minimum, scale 1, measure, and extrapolate
> >> what's missing). It doesn't sound too complicated to me, and targeting a
> >> size is something that I believe it's quite good for user.
> > Then you'd *really* have irreproducible results.
> >
> > regards, tom lane
> 
>      You also have irreproducible results today, according to your 
> criteria. Either you agree on the number of rows but may not agree on 
> the size (today), or you agree on the size but may not agree on the 
> number of rows. Right now you can only pick the former, while I think 
> people would significantly appreciate the latter. If neither is correct, 
> let's at least provide the choice.

What if we consider using ascii (utf8?) text file sizes as a reference
point, something independent from the database?

I realize even if a flat file size can be used as a more consistent
reference across platforms, it doesn't help with accurately determining
the final data file sizes due to any architecture specific nuances or
changes in the backend.  But perhaps it might still offer a little more
meaning to be able to say "50 gigabytes of bank account data" rather
than "10 million rows of bank accounts" when thinking about size over
cardinality.

Regards,
Mark

-- 
Mark Wong                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: missing toast table for pg_policy

2018-02-19 Thread Joe Conway
On 02/18/2018 01:33 PM, Joe Conway wrote:
> On 02/18/2018 11:18 AM, Tom Lane wrote:
>> I'm fairly suspicious of toasting anything that the toast mechanism itself
>> depends on, actually, and that would include at least pg_attribute and
>> pg_index as well as pg_class.  Maybe we could get away with it because
>> there would never be any actual recursion only potential recursion ...
>> but it seems scary.
> 
> Well that is the other approach we could pursue -- instead of justifying
> which system catalogs need toast tables we could create an exclusion
> list of which ones should not have toast tables, with the current
> candidates being pg_class, pg_attribute, and pg_index.

The attached does exactly this. Gives all system tables toast tables
except pg_class, pg_attribute, and pg_index, and includes cat version
bump and regression test in misc_sanity.

Any further discussion, comments, complaints?

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 809749add9..813b3b87cc 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -258,7 +258,19 @@ IsSharedRelation(Oid relationId)
 		relationId == PgDbRoleSettingToastTable ||
 		relationId == PgDbRoleSettingToastIndex ||
 		relationId == PgShseclabelToastTable ||
-		relationId == PgShseclabelToastIndex)
+		relationId == PgShseclabelToastIndex ||
+		relationId == PgAuthidToastTable ||
+		relationId == PgAuthidToastIndex ||
+		relationId == PgDatabaseToastTable ||
+		relationId == PgDatabaseToastIndex ||
+		relationId == PgPlTemplateToastTable ||
+		relationId == PgPlTemplateToastIndex ||
+		relationId == PgReplicationOriginToastTable ||
+		relationId == PgReplicationOriginToastIndex ||
+		relationId == PgSubscriptionToastTable ||
+		relationId == PgSubscriptionToastIndex ||
+		relationId == PgTablespaceToastTable ||
+		relationId == PgTablespaceToastIndex)
 		return true;
 	return false;
 }
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 433d6db4f6..202072c3c7 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201802061
+#define CATALOG_VERSION_NO	201802191
 
 #endif
diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index f6387ae143..4586ac93b2 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -46,25 +46,61 @@ extern void BootstrapToastTable(char *relName,
  */
 
 /* normal catalogs */
+DECLARE_TOAST(pg_aggregate, 4139, 4140);
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
+DECLARE_TOAST(pg_collation, 4141, 4142);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
+DECLARE_TOAST(pg_default_acl, 4143, 4144);
 DECLARE_TOAST(pg_description, 2834, 2835);
+DECLARE_TOAST(pg_event_trigger, 4145, 4146);
+DECLARE_TOAST(pg_extension, 4147, 4148);
+DECLARE_TOAST(pg_foreign_data_wrapper, 4149, 4150);
+DECLARE_TOAST(pg_foreign_server, 4151, 4152);
+DECLARE_TOAST(pg_foreign_table, 4153, 4154);
+DECLARE_TOAST(pg_init_privs, 4155, 4156);
+DECLARE_TOAST(pg_language, 4157, 4158);
+DECLARE_TOAST(pg_largeobject, 4159, 4160);
+DECLARE_TOAST(pg_largeobject_metadata, 4161, 4162);
+DECLARE_TOAST(pg_namespace, 4163, 4164);
+DECLARE_TOAST(pg_partitioned_table, 4165, 4166);
+DECLARE_TOAST(pg_policy, 4167, 4168);
 DECLARE_TOAST(pg_proc, 2836, 2837);
 DECLARE_TOAST(pg_rewrite, 2838, 2839);
 DECLARE_TOAST(pg_seclabel, 3598, 3599);
 DECLARE_TOAST(pg_statistic, 2840, 2841);
 DECLARE_TOAST(pg_statistic_ext, 3439, 3440);
 DECLARE_TOAST(pg_trigger, 2336, 2337);
+DECLARE_TOAST(pg_ts_dict, 4169, 4170);
+DECLARE_TOAST(pg_type, 4171, 4172);
+DECLARE_TOAST(pg_user_mapping, 4173, 4174);
 
 /* shared catalogs */
-DECLARE_TOAST(pg_shdescription, 2846, 2847);
-#define PgShdescriptionToastTable 2846
-#define PgShdescriptionToastIndex 2847
+DECLARE_TOAST(pg_authid, 4175, 4176);
+#define PgAuthidToastTable 4175
+#define PgAuthidToastIndex 4176
+DECLARE_TOAST(pg_database, 4177, 4178);
+#define PgDatabaseToastTable 4177
+#define PgDatabaseToastIndex 4178
 DECLARE_TOAST(pg_db_role_setting, 2966, 2967);
 #define PgDbRoleSettingToastTable 2966
 #define PgDbRoleSettingToastIndex 2967
+DECLARE_TOAST(pg_pltemplate, 4179, 4180);
+#define PgPlTemplateToastTable 4179
+#define PgPlTemplateToastIndex 4180
+DECLARE_TOAST(pg_replication_origin, 4181, 4182);
+#define PgReplicationOriginToastTable 4181
+#define PgReplicationOriginToastIndex 4182
+DECLARE_TOAST(pg_shdescription, 2846, 2847);
+#define PgShdescriptionToastTable 2846
+#define PgShdescriptionToastIndex 2847
 DECLARE_TOAST(pg_shseclabel, 4060, 4061);
 #define PgShseclabelToastTable 4060
 #define PgShseclabelToastIndex 4061
+DECLARE_TOAST(pg_subscription, 4183, 4184);
+#define PgSubscriptionToastTable 4183
+#define PgSubscriptionToastIndex 4184
+DECLARE_TOAST(pg_tablespace, 4185, 4186);

Re: autovacuum: change priority of the vacuumed tables

2018-02-19 Thread Tomas Vondra


On 02/19/2018 03:38 PM, Ildus Kurbangaliev wrote:
> On Fri, 16 Feb 2018 21:48:14 +0900
> Masahiko Sawada  wrote:
> 
>> On Fri, Feb 16, 2018 at 7:50 PM, Ildus Kurbangaliev
>>  wrote:
>>> On Fri, 16 Feb 2018 17:42:34 +0900
>>> Masahiko Sawada  wrote:
>>>  
 On Thu, Feb 15, 2018 at 10:16 PM, Grigory Smolkin
  wrote:  
> On 02/15/2018 09:28 AM, Masahiko Sawada wrote:
>  
>> Hi,
>>
>> On Thu, Feb 8, 2018 at 11:01 PM, Ildus Kurbangaliev
>>  wrote:  
>>>
>>> Hi,
>>>
>>> Attached patch adds 'autovacuum_table_priority' to the current
>>> list of automatic vacuuming settings. It's used in sorting of
>>> vacuumed tables in autovacuum worker before actual vacuum.
>>>
>>> The idea is to give possibility to the users to prioritize
>>> their tables in autovacuum process.
>>>  
>> Hmm, I couldn't understand the benefit of this patch. Would you
>> elaborate it a little more?
>>
>> Multiple autovacuum worker can work on one database. So even if
>> a table that you want to vacuum first is the back of the list
>> and there other worker would pick up it. If the vacuuming the
>> table gets delayed due to some big tables are in front of that
>> table I think you can deal with it by increasing the number of
>> autovacuum workers.
>>
>> Regards,
>>
>> --
>> Masahiko Sawada
>> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
>> NTT Open Source Software Center
>>  
>
> Database can contain thousands of tables and often
> updates/deletes concentrate mostly in only a handful of tables.
> Going through thousands of less bloated tables can take ages.
> Currently autovacuum know nothing about prioritizing it`s work
> with respect to user`s understanding of his data and
> application.  

 Understood. I have a question; please imagine the following case.

 Suppose that there are 1000 tables in a database, and one table of
 them (table-A) has the highest priority while other 999 tables have
 same priority. Almost tables (say 800 tables) including table-A
 need to get vacuumed at some point, so with your patch an AV
 worker listed 800 tables and table-A will be at the head of the
 list. Table-A will get vacuumed first but this AV worker has to
 vacuum other 799 tables even if table-A requires vacuum later
 again.

 If an another AV worker launches during table-A being vacuumed, the
 new AV worker would include table-A but would not process it
 because concurrent AV worker is processing it. So it would vacuum
 other tables instead. Similarly, this AV worker can not get the
 new table list until finish to vacuum all other tables. (Note that
 it might skip some tables if they are already vacuumed by other AV
 worker.) On the other hand, if another new AV worker launches
 after table-A got vacuumed and requires vacuuming again, the new
 AV worker puts the table-A at the head of list. It processes
 table-A first but, again, it has to vacuum other tables before
 getting new table list next time that might include table-A.

 Is this the expected behavior? I'd rather expect postgres to
 vacuum it before other lower priority tables whenever the table
 having the highest priority requires vacuuming, but it wouldn't.  
>>>
>>> Yes, this is the expected behavior. The patch is the way to give the
>>> user at least some control of the sorting, later it could be
>>> extended with something more sophisticated.
>>>  
>>
>> Since user doesn't know that each AV worker processes tables based on
>> its table list that is different from lists that other worker has, I
>> think it's hard for user to understand this parameter. I'd say that
>> user would expect that high priority table can get vacuumed any time.
> 
> Yes, very good point. It could be strange for the user in cases like
> that.
> 
>>
>> I think what you want to solve is to vacuum some tables preferentially
>> if there are many tables requiring vacuuming. Right? If so, I think
>> the prioritizing table only in the list would not solve the
>> fundamental issue. In the example, table-A will still need to wait for
>> other 799 tables to get vacuumed. Table-A will be bloating during
>> vacuuming other tables. To deal with it, I think we need something
>> queue on the shmem per database in order to control the order of
>> tables waiting for vacuuming and need to use it with a smart
>> algorithm. Thoughts?
> 
> Agree, it would require some shared queue for the autovacuum workers if
> we want to prioritize the table across all of them. I will look into
> this, and maybe will come up with something.
> 
> Masahiko, are you working on this too or just interested with the idea?
> 

Hello,

I have a hard time understanding how adding yet another autovacuum
table-level knob makes the DBA's life any easier. Especially when the
behavior is so unreliable

Re: NEXT VALUE FOR sequence

2018-02-19 Thread Tom Lane
Laurenz Albe  writes:
> The SQL standard has the expression "NEXT VALUE FOR asequence" to do
> what we traditionally do with "nextval('asequence')".
> This is an attempt to implement this on top of the recently introduced
> NextValueExpr node.

This has been proposed repeatedly, and rejected repeatedly, because in
fact the standard's semantics for NEXT VALUE FOR are *not* like nextval().
See SQL:2011 4.22.2 "Operations involving sequence generators":

If there are multiple instances of <next value expression>s specifying
the same sequence generator within a single SQL-statement, all those
instances return the same value for a given row processed by that
SQL-statement.

This is not terribly exact --- what is a "processed row" in a join query,
for instance?  But it's certainly not supposed to act like independent
executions of nextval() or NextValueExpr would.  Pending somebody doing
the legwork to produce something that at least arguably conforms to the
spec's semantics, we've left the syntax unimplemented.

regards, tom lane



Re: [PROPOSAL] Nepali Snowball dictionary

2018-02-19 Thread Tom Lane
Arthur Zakirov  writes:
> Is it appropriate to add new snowball dictionaries? I'm not sure about
> policy of including new snowball dictionaries.

We are not the upstream for the snowball stuff, and lack the expertise
to decide whether proposed changes are any good.  To get anything
changed there, you'd have to get it approved by the snowball group.

As best I know, the original list
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
is moribund, but there's a fork at
http://snowballstem.org
that has at least some activity.

Probably the first step ought to involve syncing our copy with the
current state of that upstream, something that's not been done in a
very long time :-(

regards, tom lane



Re: autovacuum: change priority of the vacuumed tables

2018-02-19 Thread Ildus Kurbangaliev
On Fri, 16 Feb 2018 21:48:14 +0900
Masahiko Sawada  wrote:

> On Fri, Feb 16, 2018 at 7:50 PM, Ildus Kurbangaliev
>  wrote:
> > On Fri, 16 Feb 2018 17:42:34 +0900
> > Masahiko Sawada  wrote:
> >  
> >> On Thu, Feb 15, 2018 at 10:16 PM, Grigory Smolkin
> >>  wrote:  
> >> > On 02/15/2018 09:28 AM, Masahiko Sawada wrote:
> >> >  
> >> >> Hi,
> >> >>
> >> >> On Thu, Feb 8, 2018 at 11:01 PM, Ildus Kurbangaliev
> >> >>  wrote:  
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> Attached patch adds 'autovacuum_table_priority' to the current
> >> >>> list of automatic vacuuming settings. It's used in sorting of
> >> >>> vacuumed tables in autovacuum worker before actual vacuum.
> >> >>>
> >> >>> The idea is to give possibility to the users to prioritize
> >> >>> their tables in autovacuum process.
> >> >>>  
> >> >> Hmm, I couldn't understand the benefit of this patch. Would you
> >> >> elaborate it a little more?
> >> >>
> >> >> Multiple autovacuum worker can work on one database. So even if
> >> >> a table that you want to vacuum first is the back of the list
> >> >> and there other worker would pick up it. If the vacuuming the
> >> >> table gets delayed due to some big tables are in front of that
> >> >> table I think you can deal with it by increasing the number of
> >> >> autovacuum workers.
> >> >>
> >> >> Regards,
> >> >>
> >> >> --
> >> >> Masahiko Sawada
> >> >> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> >> >> NTT Open Source Software Center
> >> >>  
> >> >
> >> > Database can contain thousands of tables and often
> >> > updates/deletes concentrate mostly in only a handful of tables.
> >> > Going through thousands of less bloated tables can take ages.
> >> > Currently autovacuum know nothing about prioritizing it`s work
> >> > with respect to user`s understanding of his data and
> >> > application.  
> >>
> >> Understood. I have a question; please imagine the following case.
> >>
> >> Suppose that there are 1000 tables in a database, and one table of
> >> them (table-A) has the highest priority while other 999 tables have
> >> same priority. Almost tables (say 800 tables) including table-A
> >> need to get vacuumed at some point, so with your patch an AV
> >> worker listed 800 tables and table-A will be at the head of the
> >> list. Table-A will get vacuumed first but this AV worker has to
> >> vacuum other 799 tables even if table-A requires vacuum later
> >> again.
> >>
> >> If an another AV worker launches during table-A being vacuumed, the
> >> new AV worker would include table-A but would not process it
> >> because concurrent AV worker is processing it. So it would vacuum
> >> other tables instead. Similarly, this AV worker can not get the
> >> new table list until finish to vacuum all other tables. (Note that
> >> it might skip some tables if they are already vacuumed by other AV
> >> worker.) On the other hand, if another new AV worker launches
> >> after table-A got vacuumed and requires vacuuming again, the new
> >> AV worker puts the table-A at the head of list. It processes
> >> table-A first but, again, it has to vacuum other tables before
> >> getting new table list next time that might include table-A.
> >>
> >> Is this the expected behavior? I'd rather expect postgres to
> >> vacuum it before other lower priority tables whenever the table
> >> having the highest priority requires vacuuming, but it wouldn't.  
> >
> > Yes, this is the expected behavior. The patch is the way to give the
> > user at least some control of the sorting, later it could be
> > extended with something more sophisticated.
> >  
> 
> Since user doesn't know that each AV worker processes tables based on
> its table list that is different from lists that other worker has, I
> think it's hard for user to understand this parameter. I'd say that
> user would expect that high priority table can get vacuumed any time.

Yes, very good point. It could be strange for the user in cases like
that.

> 
> I think what you want to solve is to vacuum some tables preferentially
> if there are many tables requiring vacuuming. Right? If so, I think
> the prioritizing table only in the list would not solve the
> fundamental issue. In the example, table-A will still need to wait for
> other 799 tables to get vacuumed. Table-A will be bloating during
> vacuuming other tables. To deal with it, I think we need something
> queue on the shmem per database in order to control the order of
> tables waiting for vacuuming and need to use it with a smart
> algorithm. Thoughts?

Agreed, it would require some shared queue for the autovacuum workers if
we want to prioritize tables across all of them. I will look into
this, and maybe come up with something.

Masahiko, are you working on this too, or just interested in the idea?

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: CURRENT OF causes an error when IndexOnlyScan is used

2018-02-19 Thread Anastasia Lubennikova

01.02.2018 05:12, Tom Lane:

Yugo Nagata  writes:

I'm sorry the patch attached in the previous mail is broken and
not raises a compile error. I attached the fixed patch.

This patch is almost certainly wrong: you can't assume that the scan-level
state matches the tuple we are currently processing at top level.  Any
sort of delaying action, for instance a sort or materialize node in
between, would break it.

We need to either fix this aspect:


IndexOnlyScan returns a virtual tuple that doesn't have system
column, so we can not get ctid in the same way of other plans.


I'd like to propose a patch that fixes the issue.
We already have a way to return a heap tuple from IndexOnlyScan,
but it was not applied to b-tree for some reason.

The attached patch solves the reported bug.  Moreover, it will come in
handy for the "index with included attributes" feature [1], where we can
store long (and even TOASTed) attributes in the index tuple.

[1] https://commitfest.postgresql.org/17/1350/
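
For reference, a minimal sequence of the kind that triggers the failure on
an unpatched build looks roughly like this (table and cursor names are made
up; enable_seqscan is turned off and the table vacuumed just to encourage
an index-only scan):

CREATE TABLE ios_test (id int PRIMARY KEY);
INSERT INTO ios_test SELECT generate_series(1, 1000);
VACUUM ios_test;                      -- set visibility map bits

BEGIN;
SET LOCAL enable_seqscan = off;
DECLARE c CURSOR FOR SELECT id FROM ios_test WHERE id > 500;
FETCH FROM c;                         -- planned as an Index Only Scan
UPDATE ios_test SET id = id + 10000 WHERE CURRENT OF c;
COMMIT;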

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8158508..db8a55c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -374,6 +374,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->currTuples = so->markTuples = NULL;
 
 	scan->xs_itupdesc = RelationGetDescr(rel);
+	scan->xs_hitupdesc = NULL;
+	scan->xs_hitup = NULL;
 
 	scan->opaque = so;
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 51dca64..dd3e8b2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,6 +25,7 @@
 #include "utils/tqual.h"
 
 
+static HeapTuple _bt_fetch_tuple(IndexScanDesc scandesc);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -38,6 +39,31 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
+/*
+ * Fetch all keys in tuple.
+ * Returns a new HeapTuple containing the originally-indexed data.
+ */
+static HeapTuple
+_bt_fetch_tuple(IndexScanDesc scandesc)
+{
+	Relation index = scandesc->indexRelation;
+	Datum		fetchatt[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	int			i;
+	HeapTuple htuple;
+
+	for (i = 0; i < index->rd_att->natts; i++)
+	{
+		fetchatt[i] = index_getattr(scandesc->xs_itup, i + 1,
+	scandesc->xs_itupdesc, &isnull[i]);
+	}
+
+	htuple = heap_form_tuple(scandesc->xs_hitupdesc, fetchatt, isnull);
+	htuple->t_tableOid = scandesc->heapRelation->rd_id;
+
+
+	return htuple;
+}
 
 /*
  *	_bt_drop_lock_and_maybe_pin()
@@ -1105,8 +1131,32 @@ readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
+
 	if (scan->xs_want_itup)
+	{
+		if (!scan->xs_hitupdesc)
+		{
+			int natts = RelationGetNumberOfAttributes(scan->indexRelation);
+			int attno;
+			/*
+			* The storage type of the index can be different from the original
+			* datatype being indexed, so we cannot just grab the index's tuple
+			* descriptor. Instead, construct a descriptor with the original data
+			* types.
+			*/
+			scan->xs_hitupdesc = CreateTemplateTupleDesc(natts, false);
+			for (attno = 1; attno <= natts; attno++)
+			{
+				TupleDescInitEntry(scan->xs_hitupdesc, attno, NULL,
+								   scan->indexRelation->rd_opcintype[attno - 1],
+								   -1, 0);
+			}
+		}
+
 		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+		scan->xs_hitup = _bt_fetch_tuple(scan);
+		scan->xs_hitup->t_self = currItem->heapTid;
+	}
 
 	return true;
 }
@@ -1155,8 +1205,34 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
+
 	if (scan->xs_want_itup)
+	{
+		if (!scan->xs_hitupdesc)
+		{
+			int natts = RelationGetNumberOfAttributes(scan->indexRelation);
+			int attno;
+			/*
+			* The storage type of the index can be different from the original
+			* datatype being indexed, so we cannot just grab the index's tuple
+			* descriptor. Instead, construct a descriptor with the original data
+			* types.
+			*/
+			scan->xs_hitupdesc = CreateTemplateTupleDesc(natts, false);
+			for (attno = 1; attno <= natts; attno++)
+			{
+				TupleDescInitEntry(scan->xs_hitupdesc, attno, NULL,
+								   scan->indexRelation->rd_opcintype[attno - 1],
+								   -1, 0);
+			}
+		}
+
 		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+		if (scan->xs_hitup)
+			pfree(scan->xs_hitup);
+		scan->xs_hitup = _bt_fetch_tuple(scan);
+		scan->xs_hitup->t_self = currItem->heapTid;
+	}
 
 	return t

Re: Prefix operator for text and spgist support

2018-02-19 Thread Ildus Kurbangaliev
On Mon, 19 Feb 2018 15:06:51 +0300
Arthur Zakirov  wrote:

> Hello,
> 
> On Fri, Feb 02, 2018 at 06:03:27PM +0300, Ildus Kurbangaliev wrote:
> > Hi,
> > 
> > Attached patch introduces prefix operator ^@ for text type. For 'a
> > ^@ b' it returns true if 'a' starts with 'b'. Also there is spgist
> > index support for this operator.
> > 
> > It could be useful as an alternative for LIKE for 'something%'
> > templates. Or even used in LIKE queries instead of ~>=~ and ~<~ in
> > the future. But it would require new strategy for btree.  
> 
> I've looked at the patch. It is applied and tests pass.

Hi, thanks for the review.

> 
> I have a couple comments:
> 
> > +   if (ptype == Pattern_Type_Prefix)
> > +   {
> > +   char*s = TextDatumGetCString(constval);
> > +   prefix = string_to_const(s, vartype);
> > +   pstatus = Pattern_Prefix_Partial;
> > +   rest_selec = 1.0;   /* all */
> > +   pfree(s);
> > +   }
> > +   else
> > +   pstatus = pattern_fixed_prefix(patt, ptype,
> > collation,
> > +
> > &prefix, &rest_selec);  
> 
> I think it is better to put Pattern_Type_Prefix processing into
> pattern_fixed_prefix() as another case entry.

At a brief look it seems better to move this block into the
pattern_fixed_prefix function. But there is also the `vartype` variable,
which is used in the prefix construction, and it would be necessary to
pass this variable too. And since pattern_fixed_prefix is called in 10
other places and vartype is required only for this ptype, it seems better
to just keep this block outside of this function.

> 
> Secondly, it is worth to fix the documentation. At least here [1].
> Maybe there are another places where documentation should be fixed.
> 
> 
> 1 -
> https://www.postgresql.org/docs/current/static/spgist-builtin-opclasses.html
> 

I've added documentation in the current version of the patch.
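
For completeness, intended usage with the patch applied looks like this
(table and index names are made up):

CREATE TABLE words (w text);
CREATE INDEX words_spgist_idx ON words USING spgist (w);

-- both return rows whose w starts with 'post';
-- the operator form can use the spgist index
SELECT * FROM words WHERE w ^@ 'post';
SELECT * FROM words WHERE text_startswith(w, 'post');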

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1e535cf215..403d0193ce 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -2274,6 +2274,21 @@
ph
   
 
+  
+   
+
+ text_startswith
+
+text_startswith(string, prefix)
+   
+   bool
+   
+Returns true if string starts with prefix.
+   
+   text_startswith('alphabet', 'alph')
+   t
+  
+
   

 
@@ -3964,6 +3979,12 @@ cast(-44 as bit(12))   11010100
 ILIKE, respectively.  All of these operators are
 PostgreSQL-specific.

+
+   
+There is also the prefix operator ^@ and corresponding
+text_startswith function which covers cases when only
+searching by beginning of the string is needed.
+   
   
 
 
diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml
index 51bb60c92a..00ef9a92b0 100644
--- a/doc/src/sgml/spgist.sgml
+++ b/doc/src/sgml/spgist.sgml
@@ -179,6 +179,7 @@
~<~
~>=~
~>~
+   ^@
   
  
  
diff --git a/src/backend/access/spgist/spgtextproc.c b/src/backend/access/spgist/spgtextproc.c
index f156b2166e..5beb9e33a1 100644
--- a/src/backend/access/spgist/spgtextproc.c
+++ b/src/backend/access/spgist/spgtextproc.c
@@ -496,7 +496,7 @@ spg_text_inner_consistent(PG_FUNCTION_ARGS)
 			 * well end with a partial multibyte character, so that applying
 			 * any encoding-sensitive test to it would be risky anyhow.)
 			 */
-			if (strategy > 10)
+			if (strategy > 10 && strategy != RTPrefixStrategyNumber)
 			{
 if (collate_is_c)
 	strategy -= 10;
@@ -526,6 +526,10 @@ spg_text_inner_consistent(PG_FUNCTION_ARGS)
 	if (r < 0)
 		res = false;
 	break;
+case RTPrefixStrategyNumber:
+	if (r != 0)
+		res = false;
+	break;
 default:
 	elog(ERROR, "unrecognized strategy number: %d",
 		 in->scankeys[j].sk_strategy);
@@ -601,10 +605,21 @@ spg_text_leaf_consistent(PG_FUNCTION_ARGS)
 	for (j = 0; j < in->nkeys; j++)
 	{
 		StrategyNumber strategy = in->scankeys[j].sk_strategy;
-		text	   *query = DatumGetTextPP(in->scankeys[j].sk_argument);
-		int			queryLen = VARSIZE_ANY_EXHDR(query);
+		text	   *query;
+		int			queryLen;
 		int			r;
 
+		/* for this strategy collation is not important */
+		if (strategy == RTPrefixStrategyNumber)
+		{
+			res = DatumGetBool(DirectFunctionCall2(text_startswith,
+					out->leafValue, in->scankeys[j].sk_argument));
+			goto next;
+		}
+
+		query = DatumGetTextPP(in->scankeys[j].sk_argument);
+		queryLen = VARSIZE_ANY_EXHDR(query);
+
 		if (strategy > 10)
 		{
 			/* Collation-aware comparison */
@@ -655,6 +670,7 @@ spg_text_leaf_consistent(PG_FUNCTION_ARGS)
 break;
 		}
 
+next:
 		if (!res)
 			break;/* no need to consider remaining conditions */
 	}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..be91082b56 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backen

Re: SHA-2 functions

2018-02-19 Thread Joe Conway
On 02/19/2018 08:43 AM, Peter Eisentraut wrote:
> I also noticed while working on some SSL code that we have perfectly
> good SHA-2 functionality in the server already, but it has no test
> coverage outside the SCRAM tests.
> 
> So I suggest these patches that expose the new functions sha224(),
> sha256(), sha384(), sha512().  That allows us to make the SSL and SCRAM
> tests more robust, and it will allow them to be used in general purpose
> contexts over md5().

I didn't look closely at the patch itself, but +1 on the concept.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: Contention preventing locking

2018-02-19 Thread Konstantin Knizhnik



On 16.02.2018 11:59, Michail Nikolaev wrote:

Hello.

Just want to notice - this work also correlates with 
https://www.postgresql.org/message-id/CAEepm%3D18buPTwNWKZMrAXLqja1Tvezw6sgFJKPQ%2BsFFTuwM0bQ%40mail.gmail.com paper.
It provides a more general way to address the issue compared to individual
optimisations (but they could do the work too, of course).




Certainly, a contention-aware lock scheduler is a much more general way of
addressing this problem.
But the amount of effort needed to implement such a scheduler and integrate
it into the Postgres core is hardly comparable.


I did some more tests with a much simpler benchmark than YCSB: it just
executes a specified percentage of single-record updates and selects on a
relatively small table.
I used 50% updates, a table of 100 records, and varied the number of
connections from 10 to 1000 in steps of 10.
This benchmark (pgrw.c) and an updated version of the xlock patch are
attached to this mail.
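
As a rough approximation of what pgrw.c does, in stock pgbench terms (the
table and column names here are invented for illustration):

-- update.sql
\set id random(1, 100)
UPDATE pgrw SET val = val + 1 WHERE id = :id;

-- select.sql
\set id random(1, 100)
SELECT val FROM pgrw WHERE id = :id;

$ pgbench -n -f update.sql@1 -f select.sql@1 -c 100 -j 8 -T 60 bench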


Results are present at the following spreadsheet:

https://docs.google.com/spreadsheets/d/1QOYfUehy8U3sdasMjGnPGQJY8JiRfZmlS64YRBM0YTo/edit?usp=sharing

I repeated the tests several times. Dips on the chart correspond to the
auto-vacuum periods. For some reason they are larger with the xlock
optimization.
But you can notice that the xlock optimization significantly reduces the
degradation speed, although it doesn't completely eliminate this negative
trend.



Thanks,
Michail.


Thu, 15 Feb 2018 at 19:00, Konstantin Knizhnik
<k.knizh...@postgrespro.ru>:


Hi,

PostgreSQL performance degrades significantly in case of high
contention.
You can look at the attached YCSB results (ycsb-zipf-pool.png) to
estimate the level of this degradation.

Postgres acquires two kinds of heavyweight locks during an update: it
locks the TID of the updated tuple and the XID of the transaction that
created this version.
In the debugger I see the following picture: if several transactions are
trying to update the same record, then first of all they compete for the
SnapshotDirty.xmax transaction lock in EvalPlanQualFetch.
Then in heap_tuple_update they try to lock the TID of the updated
tuple: heap_acquire_tuplock in heapam.c
After that transactions wait for completion of the transaction that
updated the tuple: XactLockTableWait in heapam.c

So in heap_acquire_tuplock all competing transactions are waiting for
the TID of the updated version. When the transaction which changed this
tuple is committed, one of the competitors will be granted this lock and
proceed, creating a new version of the tuple. Then all other competitors
will be woken up and ... find out that the locked tuple is not the last
version any more.
Then they will locate the new version and try to lock it... The more
competitors we have, the more attempts they all have to perform
(quadratic complexity).

My idea was that we can avoid such performance degradation if we somehow
queue competing transactions so that they get control one by one,
without doing useless work.
First of all I tried to replace the TID lock with a PK (primary key) lock.
Unlike the TID, the PK of a record is not changed during a hot update. The
second idea is that instead of immediately releasing the lock after the
update we can hold it until the end of the transaction. And this
optimization really does improve performance - it corresponds to the
pg_f1_opt configuration in the attached diagram.
For some workloads it provides up to a two times improvement compared
with vanilla Postgres. But there are many problems with a correct
(non-prototype) implementation of this approach:
handling different types of PK, including compound keys, PK updates, ...

Another approach, which I have recently implemented, is much simpler
and addresses another lock kind: the transaction lock.
The idea of this approach is mostly the same: all competing transactions
are waiting for the transaction which is changing the tuple.
Then one of them is given a chance to proceed, and now all other
transactions are waiting for this transaction, and so on:

T1<-T2,T3,T4,T5,T6,T7,T8,T9
T2<-T3,T4,T5,T6,T7,T8,T9
T3<-T4,T5,T6,T7,T8,T9
...

<- here corresponds to "wait for" dependency between transactions.

If we change this picture to

T1<-T2, T2<-T3, T3<-T4, T4<-T5, T5<-T6, T6<-T7, T7<-T8, T8<-T9

then the number of lock requests can be significantly reduced.
And it can be implemented using a very simple patch (the attached
xlock.patch).
I hope that I have not done something incorrect here.
The effect of this simple patch is even larger: more than 3 times for
50..70 clients.
Please look at the attached xlock.pdf spreadsheet.

Unfortunately the combination of these two approaches gives worse results
than just the single xlock.patch.
Certainly this patch also requires further improvement; for example it
will not work correctly in case of transactions aborting due to deadlock
or s

[PROPOSAL] Nepali Snowball dictionary

2018-02-19 Thread Arthur Zakirov
Hello hackers,

I would like to propose nepali snowball dictionary patch.

Nepali is an inflectional and derivational language, and it can be stemmed.

initdb is also patched, so it can determine the default text search
configuration.

Examples:

=# select ts_lexize('nepali_stem', 'लेख्');
 ts_lexize 
---
 {लेख्}

=# select ts_lexize('nepali_stem', 'लेखछेस्');
 ts_lexize 
---
 {लेख}

=# select ts_lexize('nepali_stem', 'लेखे');
 ts_lexize 
---
 {लेखे}

Authors:
- Oleg Bartunov
- Nepali NLP Group

Is it appropriate to add new snowball dictionaries? I'm not sure about
the policy for including new snowball dictionaries.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..c9c7de52ad 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -3723,6 +3723,7 @@ Parser: "pg_catalog.default"
  pg_catalog | german_stem | snowball stemmer for german language
  pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
  pg_catalog | italian_stem| snowball stemmer for italian language
+ pg_catalog | nepali_stem | snowball stemmer for nepali language
  pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
  pg_catalog | portuguese_stem | snowball stemmer for portuguese language
  pg_catalog | romanian_stem   | snowball stemmer for romanian language
diff --git a/src/backend/snowball/Makefile b/src/backend/snowball/Makefile
index 50cbace41d..c29f4184f2 100644
--- a/src/backend/snowball/Makefile
+++ b/src/backend/snowball/Makefile
@@ -40,6 +40,7 @@ OBJS= $(WIN32RES) dict_snowball.o api.o utilities.o \
stem_UTF_8_german.o \
stem_UTF_8_hungarian.o \
stem_UTF_8_italian.o \
+   stem_UTF_8_nepali.o \
stem_UTF_8_norwegian.o \
stem_UTF_8_porter.o \
stem_UTF_8_portuguese.o \
@@ -62,6 +63,7 @@ LANGUAGES=  \
german  german  \
hungarian   hungarian   \
italian italian \
+   nepali  nepali  \
norwegian   norwegian   \
portuguese  portuguese  \
romanianromanian\
diff --git a/src/backend/snowball/dict_snowball.c 
b/src/backend/snowball/dict_snowball.c
index 78c9f73ef0..d96c849118 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -49,6 +49,7 @@
 #include "snowball/libstemmer/stem_UTF_8_german.h"
 #include "snowball/libstemmer/stem_UTF_8_hungarian.h"
 #include "snowball/libstemmer/stem_UTF_8_italian.h"
+#include "snowball/libstemmer/stem_UTF_8_nepali.h"
 #include "snowball/libstemmer/stem_UTF_8_norwegian.h"
 #include "snowball/libstemmer/stem_UTF_8_porter.h"
 #include "snowball/libstemmer/stem_UTF_8_portuguese.h"
@@ -102,6 +103,7 @@ static const stemmer_module stemmer_modules[] =
{"german", PG_UTF8, german_UTF_8_create_env, german_UTF_8_close_env, 
german_UTF_8_stem},
{"hungarian", PG_UTF8, hungarian_UTF_8_create_env, 
hungarian_UTF_8_close_env, hungarian_UTF_8_stem},
{"italian", PG_UTF8, italian_UTF_8_create_env, italian_UTF_8_close_env, 
italian_UTF_8_stem},
+   {"nepali", PG_UTF8, nepali_UTF_8_create_env, nepali_UTF_8_close_env, 
nepali_UTF_8_stem},
{"norwegian", PG_UTF8, norwegian_UTF_8_create_env, 
norwegian_UTF_8_close_env, norwegian_UTF_8_stem},
{"porter", PG_UTF8, porter_UTF_8_create_env, porter_UTF_8_close_env, 
porter_UTF_8_stem},
{"portuguese", PG_UTF8, portuguese_UTF_8_create_env, 
portuguese_UTF_8_close_env, portuguese_UTF_8_stem},
diff --git a/src/backend/snowball/libstemmer/stem_UTF_8_nepali.c 
b/src/backend/snowball/libstemmer/stem_UTF_8_nepali.c
new file mode 100644
index 00..f4f6e656ad
--- /dev/null
+++ b/src/backend/snowball/libstemmer/stem_UTF_8_nepali.c
@@ -0,0 +1,440 @@
+/* This file was generated automatically by the Snowball to ISO C compiler */
+/* http://snowballstem.org/ */
+
+#include "header.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+extern int nepali_UTF_8_stem(struct SN_env * z);
+#ifdef __cplusplus
+}
+#endif
+static int r_remove_category_2(struct SN_env * z);
+static int r_remove_category_3(struct SN_env * z);
+static int r_check_category_2(struct SN_env * z);
+static int r_remove_category_1(struct SN_env * z);
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+extern struct SN_env * nepali_UTF_8_create_env(void);
+extern void nepali_UTF_8_close_env(struct SN_env * z);
+
+
+#ifdef __cplusplus
+}
+#endif
+static const symbol s_0_0[6] = { 0xE0, 0xA4, 0x95, 0xE0, 0xA5, 0x80 };
+static const symbol s_0_1[9] = { 0xE0, 0xA4, 0xB2, 0xE0, 0xA4, 0xBE, 0xE0, 
0xA4, 0x87 };
+static const symbol s_0_2[6] = { 0xE0, 0xA4, 0xB2, 0xE0, 0xA5, 0x87 };
+static const symbol s_0_3[9] = { 0xE0, 0xA4, 0xB2, 0xE0, 0xA4, 0xBE, 0xE0, 
0xA4, 0x88 };
+static const symbol s_0_4[6] = { 0xE0, 0xA4, 0x95, 0xE0, 0xA5, 0x88 };
+static const sym

Re: SHA-2 functions

2018-02-19 Thread Aleksander Alekseev
Hello Peter,

> So I suggest these patches that expose the new functions sha224(),
> sha256(), sha384(), sha512().  That allows us to make the SSL and SCRAM
> tests more robust, and it will allow them to be used in general purpose
> contexts over md5().

Nice patch. I wonder though whether tests should include hashing
non-ASCII characters as well.
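
For instance, something along these lines, so that multi-byte input gets
covered too (convert_to() pins the bytes to a known encoding; output
omitted since it depends only on the byte sequence):

SELECT encode(sha256(convert_to('résumé', 'UTF8')), 'hex');
SELECT encode(sha512(convert_to('Привет', 'UTF8')), 'hex');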

-- 
Best regards,
Aleksander Alekseev




SHA-2 functions

2018-02-19 Thread Peter Eisentraut
There was a complaint recently about the documentation using the widely
frowned-upon md5() function in an unrelated context as an example hash
function.  This is quite common in many examples, such as hashing row
values to compare them, or hashing datums if they don't fit into an
index.  But there is nothing we can easily replace md5 with, without
going to things like pgcrypto.

I also noticed while working on some SSL code that we have perfectly
good SHA-2 functionality in the server already, but it has no test
coverage outside the SCRAM tests.

So I suggest these patches that expose the new functions sha224(),
sha256(), sha384(), sha512().  That allows us to make the SSL and SCRAM
tests more robust, and it will allow them to be used in general purpose
contexts over md5().
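
With the patch applied the functions can be called directly, and encode()
gives the familiar hex text form where needed:

SELECT encode(sha256('abc'), 'hex');
-- ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad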

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 52edfab8f1175c69ba791139fde26feb9279364a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 6 Feb 2018 21:46:46 -0500
Subject: [PATCH 1/2] Add user-callable SHA-2 functions

Add the user-callable functions sha224, sha256, sha384, sha512.  We
already had these in the C code to support SCRAM, but there was no test
coverage outside of the SCRAM tests.  Adding these as user-callable
functions allows writing some tests.  Also, we have a user-callable md5
function but no more modern alternative, which led to wide use of md5 as
a general-purpose hash function, which leads to occasional complaints
about using md5.

Also mark the existing md5 functions as leak-proof.
---
 doc/src/sgml/func.sgml   |  70 -
 src/backend/utils/adt/varlena.c  | 101 +++
 src/include/catalog/pg_proc.h|  12 +++-
 src/test/regress/expected/opr_sanity.out |   6 ++
 src/test/regress/expected/strings.out|  53 
 src/test/regress/sql/strings.sql |  18 ++
 6 files changed, 257 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1e535cf215..bab828d507 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3640,7 +3640,7 @@ Other Binary String Functions
returning the result in hexadecimal
   
   md5(E'Th\\000omas'::bytea)
-  8ab2d3c9689aaf18 b4958c334c82d8b1
+  
8ab2d3c9689aaf18​b4958c334c82d8b1
  
 
   
@@ -3674,6 +3674,66 @@ Other Binary String Functions
set_byte(E'Th\\000omas'::bytea, 4, 64)
Th\000o@as
   
+
+  
+   
+
+ sha224
+
+sha224(bytea)
+   
+   bytea
+   
+SHA-224 hash
+   
+   sha224('abc')
+   
\x23097d223405d8228642a477bda2​55b32aadbce4bda0b3f7e36c9da7
+  
+
+  
+   
+
+ sha256
+
+sha256(bytea)
+   
+   bytea
+   
+SHA-256 hash
+   
+   sha256('abc')
+   
\xba7816bf8f01cfea414140de5dae2223​b00361a396177a9cb410ff61f20015ad
+  
+
+  
+   
+
+ sha384
+
+sha384(bytea)
+   
+   bytea
+   
+SHA-384 hash
+   
+   sha384('abc')
+   
\xcb00753f45a35e8bb5a03d699ac65007​272c32ab0eded1631a8b605a43ff5bed​8086072ba1e7cc2358baeca134c825a7
+  
+
+  
+   
+
+ sha512
+
+sha512(bytea)
+   
+   bytea
+   
+SHA-512 hash
+   
+   sha512('abc')
+   
\xddaf35a193617abacc417349ae204131​12e6fa4e89a97ea20a964b55d39a​2192992a274fc1a836ba3c23a3feebbd​454d4423643ce80e2a9ac94fa54ca49f
+  
 

   
@@ -3686,6 +3746,14 @@ Other Binary String Functions
the first byte, and bit 15 is the most significant bit of the second byte.
   
 
+  
+   Note that for historic reasons, the function md5
+   returns a hex-encoded value of type text whereas the SHA-2
+   functions return type bytea.  Use the functions
+   encode and decode to convert
+   between the two.
+  
+
   
See also the aggregate function string_agg in
 and the large object functions
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 304cb26691..d1266f26f5 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -23,6 +23,7 @@
 #include "catalog/pg_type.h"
 #include "common/int.h"
 #include "common/md5.h"
+#include "common/sha2.h"
 #include "lib/hyperloglog.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
@@ -4611,6 +4612,106 @@ md5_bytea(PG_FUNCTION_ARGS)
PG_RETURN_TEXT_P(cstring_to_text(hexsum));
 }
 
+/*
+ * SHA-2 variants
+ */
+
+Datum
+sha224_bytea(PG_FUNCTION_ARGS)
+{
+   bytea  *in = PG_GETARG_BYTEA_PP(0);
+   const uint8 *data;
+   size_t  len;
+   pg_sha224_ctx ctx;
+   unsigned char buf[PG_SHA224_DIGEST_LENGTH];
+   bytea  *result;
+
+   len = VARSIZE_ANY_EXHDR(in);
+   data = (unsigned char *) VARDATA_ANY(in);
+
+  

Re: [HACKERS] path toward faster partition pruning

2018-02-19 Thread David Rowley
On 19 February 2018 at 22:19, Amit Langote
 wrote:
> Attached updated patches.  Thanks again!

Thanks for making those changes.  I've made another pass of v28 and
have a few more comments.

The patch is starting to look good, but there are some new changes in
recent versions which still don't look quite right.

1. This does not fully make sense:

/*
 * Remove the indexes of partitions excluded due to each of
 * those partitions' *all* of allowed datums appearing in
 * keys->ne_datums, that is compared to the partition key
 * using <> operator.
 */

 "each of those partitions' *all* of allowed" is not correct.

Maybe better to write:

/*
 * Remove the indexes of any partitions which cannot possibly
 * contain rows matching the clauses due to key->ne_datums containing
 * all datum values which are allowed in the given partition.  This
 * is only possible to do in LIST partitioning as it's the only
 * partitioning strategy which allows the specification of exact values.
 */

2. Mine does not, but some compilers may complain about the result
variable in get_partitions_for_keys() being uninitialised. Probably the
easiest fix would be to just assign it to NULL in the default case.

3. Did you mean to put this Assert() inside the loop?

memset(keyisnull, false, sizeof(keyisnull));
i = -1;
while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
{
keys->n_eqkeys++;
keyisnull[i] = true;
}
Assert(i < partnatts);

i will always be -2 at the end of the loop. Seems like a useless
Assert in its current location.

4. Can you add a comment here to say: "Note: LIST partitioning only
supports a single partition key, therefore this function requires no
looping over the partition keys"

/*
 * get_partitions_for_keys_list
 * Return partitions of a list partitioned table for requested keys
 *
 * This interprets the keys and looks up partitions in the partition bound
 * descriptor using the list partitioning semantics.
 */

5. The following comment contains a bit of duplication to the comment
which comes after it. Maybe the following:

/*
* If the query is looking for null keys, there can only be one such
* partition.  Return the same if one exists.
*/

can be changed to:

/* Handle clauses requesting a NULL valued partition key */

6. The following comment does not quite make sense:

/* Exactly matching datum exists. */

Probably better to write:

/* An exact matching datum exists. */

7. "If it's not equal (<)" I think you mean (>), not (<), in:

 * The bound at minoff is <= minkeys, given the way
 * partition_list_bsearch() works.  If it's not equal (<), then
 * increment minoff to make it point to the datum on the right
 * that necessarily satisfies minkeys.  Also do the same if it is
 * equal but minkeys is exclusive.

However, the comment is a bit clumsy. Maybe the following is better?

/*
 * partition_list_bsearch returning a positive number means that
 * minkeys[0] must be greater than or equal to the smallest datum.
 * If we didn't find an exact matching datum (!is_equal) or if the
 * operator used was non-inclusive (>), then in both of these
 * cases we're not interested in the datum pointed to by minoff,
 * but we may start getting matches in the partition which the
 * next datum belongs to, so point to that one instead.  (This may
 * be beyond the last datum in the array, but we'll detect that
 * later.)
 */

8. The following comment could be improved:

* minkeys is greater than the datums of all non-default partitions,
* meaning there isn't one to return.  Return the default partition if
* one exists.

how about:

* The value of minkeys[0] is greater than all of the datums we have
* partitions for.  The only possible partition that could contain a
* match is the default partition.  Return that, if it exists.

9. The following could also be improved:

* The bound at maxoff is <= maxkeys, given the way
* partition_list_bsearch works.  If the bound at maxoff exactly
* matches maxkey (is_equal), but the maxkey is exclusive, then
* decrement maxoff to point to the bound on the left.

how about:

* partition_list_bsearch returning a positive number means that
* maxkeys[0] must be greater than or equal to the smallest datum.
* If the match found is an equal match, but the operator used is
* non-inclusive of that value (<), then the partition belonging
* to maxoff cannot match, so we'll decrement maxoff to point to
* the partition belonging to the previous datum.  We might end up
* decrementing maxoff down to -1, but we'll handle that later.

10. Can you append "  This may not technically be true for some data
types (e.g. integer types), however, we currently lack any sort of
infrastructure to provide us with proofs that would allow us to do
anything smarter here." to:

* For range queries, always include the default list partition,
* because list partitions divide the key space in a discontinuous
* manner, not all values in the given range will have a partition
* assigned.

11. get_partitions_for_keys_range s

Re: NEXT VALUE FOR sequence

2018-02-19 Thread Daniel Verite
Laurenz Albe wrote:

> The SQL standard has the expression "NEXT VALUE FOR asequence" to do
> what we traditionally do with "nextval('asequence')".

The behavior mandated by the standard is that several invocations
of NEXT VALUE on the same sequence on the same output row
must produce the same value. That is:

CREATE SEQUENCE s;
SELECT NEXT VALUE FOR s, NEXT VALUE FOR s
 UNION
SELECT NEXT VALUE FOR s, NEXT VALUE FOR s

should produce
(1,1)
(2,2)

It makes sense that the value does not depend on
the position of the expression as a column.

The trouble of course is that the equivalent with
nextval() would produce instead
(1,2)
(3,4)


There have been previous discussions on the standard syntax
which said that when it gets into Postgres, it should go with
the standard-conforming semantics.
I guess that would be a much more difficult patch.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Tracking object modification time using event triggers

2018-02-19 Thread Alexander Levsha
Hi all.
I'm the lead developer of pgCodeKeeper, a tool for PostgreSQL database
schema comparison.

In our tool we have a pg_dump-like schema reader for comparing against live
DB instances.
This reader consumes the majority of the time the comparison operation
takes, and we had an idea to speed it up.
To do this we need to be able to track the last modification time of every
DB object, and an extension with event triggers seems like a suitable tool
for this.
The extension we've implemented is available, for anyone interested:
https://github.com/pgcodekeeper/pg_dbo_timestamp/
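
In essence it boils down to an event trigger like the following (heavily
simplified; the table and function names are illustrative, not the actual
extension code):

CREATE TABLE dbo_timestamps (
    classid  oid,
    objid    oid,
    identity text,
    modified timestamptz
);

CREATE FUNCTION track_ddl() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        INSERT INTO dbo_timestamps (classid, objid, identity, modified)
        VALUES (r.classid, r.objid, r.object_identity, now());
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER track_ddl_trigger ON ddl_command_end
    EXECUTE PROCEDURE track_ddl();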

However, we've discovered that event triggers provide almost no data for
GRANT/REVOKE commands; in particular, there's no way to find out which
objects were altered by these commands.
pg_event_trigger_ddl_commands() does provide a pg_ddl_command datum which
seems to contain the object list for GRANT, but it seems totally
inaccessible from plpgsql.

This leads to my question: do we need to dive into writing a C function for
our extension to access pg_ddl_command or some other lower-level
representation? Or can we use something else to solve our task, maybe
avoiding event triggers entirely?

Thanks.
Alexander Levsha


Re: [doc fix] Correct calculation of vm.nr_hugepages

2018-02-19 Thread Justin Pryzby
On Mon, Feb 19, 2018 at 07:05:47AM +, Tsunakawa, Takayuki wrote:
> The current PostgreSQL documentation overestimates the number of huge pages
> (vm.nr_hugepages) because the calculation uses the maximum virtual address
> space.  In practice, huge pages are only used for the anonymous shared memory
> segment.  The attached patch fixes the documentation.

+ pmap 4170 | grep rw-s | grep zero | awk '{print $2}'

One can do with fewer greps:
pryzbyj@pryzbyj:~$ sudo pmap `pgrep -P 1 -u postgres` |awk '/rw-s/&&/zero/{print $2}' # check PPID=1
144848K

or:
pryzbyj@pryzbyj:~$ sudo pmap `pgrep -u postgres |sed q` |awk '/rw-s/&&/zero/{print $2}' # check any backend, not just postmaster parent
144848K

Or (this is even less clean but gives an alternative which continues to use
/proc directly):
pryzbyj@pryzbyj:~$ sudo cat /proc/`pgrep -P 1 -u postgres`/maps |awk --non-decimal-data 'BEGIN{FS="[ -]"} /zero/{a="0x"$1;b="0x"$2;print (b-a)/1024"kB"}'
144848kB
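
Whichever variant is used to get the segment size, nr_hugepages then
follows directly; with the 144848 kB segment above and 2048 kB huge pages:

$ echo $(( (144848 + 2047) / 2048 ))
71

so vm.nr_hugepages = 71 (or a little more for headroom).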

BTW when huge_pages=try, I've been verifying hugepages are in use by running:
pryzbyj@pryzbyj:~$ sudo sh -c 'cat /proc/`pgrep -u postgres -P1`/smaps |grep -c "KernelPageSize: *2048 kB"' || echo NOT FOUND
0
NOT FOUND

Justin



Re: Prefix operator for text and spgist support

2018-02-19 Thread Arthur Zakirov
Hello,

On Fri, Feb 02, 2018 at 06:03:27PM +0300, Ildus Kurbangaliev wrote:
> Hi,
> 
> Attached patch introduces prefix operator ^@ for text type. For 'a ^@ b'
> it returns true if 'a' starts with 'b'. Also there is spgist index
> support for this operator.
> 
> It could be useful as an alternative for LIKE for 'something%'
> templates. Or even used in LIKE queries instead of ~>=~ and ~<~ in the
> future. But it would require new strategy for btree.

I've looked at the patch. It is applied and tests pass.

I have a couple comments:

> + if (ptype == Pattern_Type_Prefix)
> + {
> + char*s = TextDatumGetCString(constval);
> + prefix = string_to_const(s, vartype);
> + pstatus = Pattern_Prefix_Partial;
> + rest_selec = 1.0;   /* all */
> + pfree(s);
> + }
> + else
> + pstatus = pattern_fixed_prefix(patt, ptype, collation,
> +
> &prefix, &rest_selec);

I think it is better to put Pattern_Type_Prefix processing into
pattern_fixed_prefix() as another case entry.

Secondly, it is worth to fix the documentation. At least here [1]. Maybe
there are another places where documentation should be fixed.


1 - https://www.postgresql.org/docs/current/static/spgist-builtin-opclasses.html

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



NEXT VALUE FOR sequence

2018-02-19 Thread Laurenz Albe
The SQL standard has the expression "NEXT VALUE FOR asequence" to do
what we traditionally do with "nextval('asequence')".

This is an attempt to implement this on top of the recently introduced
NextValueExpr node.

If there is no obvious reason why we would not want that, I'll add it
to the next commitfest.

Is this behavior ok:

test=> CREATE SEQUENCE testseq;
test=> PREPARE x AS SELECT NEXT VALUE FOR testseq;
test=> EXECUTE x;
 ?column? 
--
1
(1 row)

test=> DROP SEQUENCE testseq;
DROP SEQUENCE
test=> EXECUTE x;
ERROR:  could not open relation with OID 24836

If not, what could be done about it?

Yours,
Laurenz Albe
From 57e42f0a32be39eea541808979a41ab6feac6b73 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Mon, 19 Feb 2018 12:17:59 +0100
Subject: [PATCH] Add support for NEXT VALUE FOR 

NEXT VALUE FOR is the standard SQL expression for getting the next
value from a sequence, which in PostgreSQL traditionally has been
the task of the "nextval" function.

This completes the implementation of SQL standard feature T176.
---
 doc/src/sgml/func.sgml | 10 
 doc/src/sgml/ref/create_sequence.sgml  |  3 ++-
 doc/src/sgml/syntax.sgml   | 18 ++
 src/backend/commands/tablecmds.c   |  2 ++
 src/backend/executor/execExpr.c|  1 +
 src/backend/executor/execExprInterp.c  |  3 ++-
 src/backend/nodes/copyfuncs.c  |  2 ++
 src/backend/nodes/equalfuncs.c |  2 ++
 src/backend/nodes/outfuncs.c   |  2 ++
 src/backend/nodes/readfuncs.c  |  2 ++
 src/backend/parser/gram.y  |  6 +
 src/backend/parser/parse_expr.c| 43 ++
 src/backend/rewrite/rewriteHandler.c   |  2 ++
 src/backend/utils/adt/ruleutils.c  | 12 +++---
 src/include/executor/execExpr.h|  1 +
 src/include/nodes/primnodes.h  |  7 --
 src/test/regress/expected/sequence.out | 32 -
 src/test/regress/sql/sequence.sql  | 12 +-
 18 files changed, 145 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 487c7ff750..e135a05314 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12095,6 +12095,16 @@ nextval('foo'::text)  foo is looked up at
 This function requires USAGE
 or UPDATE privilege on the sequence.

+
+   
+
+ PostgreSQL also offers the SQL standard compliant expression
+ NEXT VALUE FOR sequence_name
+ which is the same as
+ nextval('sequence_name').
+ See  for details.
+
+   
   
  
 
diff --git a/doc/src/sgml/ref/create_sequence.sgml b/doc/src/sgml/ref/create_sequence.sgml
index 3e0d339c85..2671e80f12 100644
--- a/doc/src/sgml/ref/create_sequence.sgml
+++ b/doc/src/sgml/ref/create_sequence.sgml
@@ -51,7 +51,8 @@ CREATE [ TEMPORARY | TEMP ] SEQUENCE [ IF NOT EXISTS ] 
+   NEXT VALUE FOR
+
+   
+NEXT VALUE FOR
+   
+
+   
+This expression is used to obtain the next value for a sequence.
+
+NEXT VALUE FOR sequence_name
+
+
+It is equivalent to calling the nextval function.
+   
+
+  
+
   
Expression Evaluation Rules
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 89454d8e80..34a1b0684d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5496,6 +5496,8 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 
 			nve->seqid = RangeVarGetRelid(colDef->identitySequence, NoLock, false);
 			nve->typeId = typeOid;
+			/* no need to check permissions on the implicit sequence */
+			nve->checkperms = false;
 
 			defval = (Expr *) nve;
 		}
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index c6eb3ebacf..2f6b536d90 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -2104,6 +2104,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
 scratch.opcode = EEOP_NEXTVALUEEXPR;
 scratch.d.nextvalueexpr.seqid = nve->seqid;
 scratch.d.nextvalueexpr.seqtypid = nve->typeId;
+scratch.d.nextvalueexpr.checkperms = nve->checkperms;
 
 ExprEvalPushStep(state, &scratch);
 break;
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index f646fd9c51..1b067eb491 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2355,7 +2355,8 @@ ExecEvalCurrentOfExpr(ExprState *state, ExprEvalStep *op)
 void
 ExecEvalNextValueExpr(ExprState *state, ExprEvalStep *op)
 {
-	int64		newval = nextval_internal(op->d.nextvalueexpr.seqid, false);
+	int64		newval = nextval_internal(op->d.nextvalueexpr.seqid,
+		  op->d.nextvalueexpr.checkperms);
 
 	switch (op->d.nextvalueexpr.seqtypid)
 	{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index bafe0d1071..aaadc8b54b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/

extern keyword incorrectly used in some function definitions

2018-02-19 Thread David Rowley
While poking around partition.c I noticed that one of the functions
there is *defined* as "extern". Normally we'd only do this in the
declaration of the function. I don't really see why it's needed in the
definition.

Anyway, I removed it. I then thought I'd better look around for any
others too...

A patch is attached which removes them all, apart from the snowball
ones. I didn't touch those.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_extern_func_defs.patch
Description: Binary data


Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2018-02-19 Thread David Rowley
On 19 February 2018 at 18:01, David Rowley  wrote:
> On 19 February 2018 at 15:11, Tomas Vondra  
> wrote:
>> and perhaps we should do s/isproxy/is_proxy/ which seems like the usual
>> naming for boolean variables.
>
> You're right. I'll change that and post an updated patch. I'll also
> reprocess your email again and try to improve some comments to make
> the intent of the code more clear.

I've made a few changes to the patch. "isproxy" is now "is_proxy".
I've made the comments a bit more verbose at the top of the definition
of struct AppendPath. Also shuffled some comments around in allpaths.c

I've attached both a delta patch with just these changes, and also a
completely new patch based on a recent master.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_singleton_appends_2018-02-19_delta.patch
Description: Binary data


remove_singleton_appends_2018-02-19.patch
Description: Binary data