Re: [HACKERS] Improving DISTINCT with LooseScan node

2017-09-18 Thread Dmitriy Sarafannikov

> It seems related to this thread? :
> https://www.postgresql.org/message-id/flat/5037A9C5.4030701%40optionshouse.com#5037a9c5.4030...@optionshouse.com
> 
> And this wiki page : https://wiki.postgresql.org/wiki/Loose_indexscan

Yep. Now I can see two use cases for this feature:
1. DISTINCT queries (today emulated with a recursive CTE, as sketched below).
2. Efficiently scanning a multicolumn index when the first column is omitted
from the qual and has low cardinality.
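
For reference, the workaround described on the wiki page above emulates a loose
index scan with a recursive CTE. A minimal sketch, assuming a table
mytable(id int) with a btree index on id:

-- Each iteration jumps to the next distinct id with one index probe.
WITH RECURSIVE t AS (
    (SELECT id FROM mytable ORDER BY id LIMIT 1)
    UNION ALL
    SELECT (SELECT id FROM mytable WHERE id > t.id ORDER BY id LIMIT 1)
    FROM t
    WHERE t.id IS NOT NULL
)
SELECT id FROM t WHERE id IS NOT NULL;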


> Not an answer to your question, but generally +1 for working on this
> area.  I did some early proto-hacking a while ago, which I haven't had
> time to get back to yet:
> 
> https://www.postgresql.org/message-id/flat/CADLWmXWALK8NPZqdnRQiPnrzAnic7NxYKynrkzO_vxYr8enWww%40mail.gmail.com
>  
> 
> 
> That was based on the idea that a DISTINCT scan using a btree index to
> skip ahead is always going to be using the leading N columns and
> always going to be covered by the index, so I might as well teach
> Index Only Scan how to do it directly rather than making a new node to
> sit on top.  As you can see in that thread I did start thinking about
> using a new node to sit on top and behave a bit like a nested loop for
> the more advanced feature of an Index Skip Scan (trying every value of
> (a) where you had an index on (a, b, c) but had a WHERE clause qual on
> (b, c)), but the only feedback I had so far was from Robert Haas who
> thought that too should probably be pushed into the index scan.

Thank you for the information, I will look at this thread.

> FWIW I'd vote for 'skip' rather than 'loose' as a term to use in this
> family of related features (DISTINCT being the easiest to build).  It
> seems more descriptive than the MySQL term.  (DB2 added this a little
> while ago and went with 'index jump scan', suggesting that we should
> consider 'hop'... (weak humour, 'a hop, skip and a jump' being an
> English idiom meaning a short distance)).

Maybe 'skip' would be better, but won't there be problems with something like
patents? I mean the database whose name begins with a capital «O»: as far as I
know, it has an Index Skip Scan.




[HACKERS] Improving DISTINCT with LooseScan node

2017-09-17 Thread Dmitriy Sarafannikov
Hi hackers,

Everybody knows that we have inefficient execution of queries like
"SELECT DISTINCT id FROM mytable" if the table has very many rows and only
several unique id values. The query plan looks like Unique + IndexScan.

I have tried to implement this feature in a new type of node called LooseScan.
This node must appear in the plan together with an IndexScan or IndexOnlyScan,
just like the Unique node does in this case. But instead of filtering rows with
equal values, LooseScan must retrieve the first row from the IndexScan,
remember it, and return it. On every subsequent call, LooseScan must trigger a
call to index_rescan via ExecReScan with a search value that is > or <
(depending on scan direction) the previously returned value. The total cost of
this path should be something like:

total_cost = n_distinct_values * subpath->startup_cost

What do you think about this idea?

I was able to create the new LooseScan node, but I ran into difficulties with
changing the scan keys. I looked (for example) at the ExecReScanIndexOnlyScan
function, and I find it difficult to understand how to reach the ioss_ScanKeys
through the executor state. Can you help me with this? I also looked at the
Nested Loop node, which I think must change scan keys, but it didn't become
clear to me. The only thought that came to my mind is that maybe I create this
subplan incorrectly: I create it just like create_upper_unique_plan does, and
create the subplan for the IndexScan via create_plan_recurse. Can you tell me
where to look, or are there examples somewhere?

Thanks

Regards,
Dmitriy Sarafannikov
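
For illustration, the plan shape being described; a sketch, assuming a table
mytable(id int) with a btree index on id and only a handful of distinct values
(the exact plan depends on costs):

CREATE TABLE mytable (id int);
INSERT INTO mytable SELECT i % 10 FROM generate_series(1, 1000000) i;
CREATE INDEX ON mytable (id);
VACUUM ANALYZE mytable;

EXPLAIN SELECT DISTINCT id FROM mytable;
--  Unique
--    ->  Index Only Scan using mytable_id_idx on mytable
-- The whole index is read even though there are only 10 distinct values;
-- a LooseScan would instead re-descend the btree once per distinct value.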

Re: [HACKERS] Allow GiST opclasses without compress/decompress functions

2017-09-14 Thread Dmitriy Sarafannikov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

This is a simple and intuitive patch. The code looks clear and well documented.

I think it is a good idea to reduce the overhead of calling no-op
compress/decompress functions. It also eliminates the need to implement the
fetch function in such cases.

I also tried to disable the compress and decompress functions in contrib/cube:
diff --git a/contrib/cube/cube--1.2.sql b/contrib/cube/cube--1.2.sql
index dea2614888..26301b71d7 100644
--- a/contrib/cube/cube--1.2.sql
+++ b/contrib/cube/cube--1.2.sql
@@ -370,8 +370,6 @@ CREATE OPERATOR CLASS gist_cube_ops
 
 	FUNCTION	1	g_cube_consistent (internal, cube, smallint, oid, internal),
 	FUNCTION	2	g_cube_union (internal, internal),
-	FUNCTION	3	g_cube_compress (internal),
-	FUNCTION	4	g_cube_decompress (internal),
 	FUNCTION	5	g_cube_penalty (internal, internal, internal),
 	FUNCTION	6	g_cube_picksplit (internal, internal),
 	FUNCTION	7	g_cube_same (cube, cube, internal),

And it enables Index Only Scan, which is the expected behaviour.
+ -- test explain
+ set enable_seqscan to 'off';
+ set enable_bitmapscan to 'off';
+ select count(*) from test_cube where c && '(3000,1000),(0,0)';
+  count
+ -------
+      5
+ (1 row)
+
+ explain analyze select c from test_cube where c && '(3000,1000),(0,0)';
+                                                            QUERY PLAN
+ ---------------------------------------------------------------------------------------------------------------------------
+  Index Only Scan using test_cube_ix on test_cube  (cost=0.15..56.43 rows=16 width=32) (actual time=0.015..0.018 rows=5 loops=1)
+    Index Cond: (c && '(3000, 1000),(0, 0)'::cube)
+    Heap Fetches: 5
+  Planning time: 0.038 ms
+  Execution time: 0.032 ms
+ (5 rows)
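
As an additional sanity check (not part of the regression output), one can list
which GiST support functions the operator class still registers; a sketch,
assuming the auto-created operator family is named gist_cube_ops:

-- Entries 3 (compress) and 4 (decompress) should be absent after the change.
SELECT amprocnum, amproc::regproc
FROM pg_amproc
WHERE amprocfamily = (SELECT oid FROM pg_opfamily
                      WHERE opfname = 'gist_cube_ops')
ORDER BY amprocnum;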

The new status of this patch is: Ready for Committer



Re: [HACKERS] Allow GiST opclasses without compress/decompress functions

2017-09-11 Thread Dmitriy Sarafannikov
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi Andrew! Thanks for the patch, but 0001-allow-uncompressed-Gist-2.patch
no longer applies on the current master branch.
Could you please rebase it?

Some info from git apply command:
Checking patch doc/src/sgml/gist.sgml...
Checking patch src/backend/access/gist/gist.c...
Checking patch src/backend/access/gist/gistget.c...
Checking patch src/backend/access/gist/gistutil.c...
error: while searching for:
	GISTENTRY  *dep;

	gistentryinit(*e, k, r, pg, o, l);
	dep = (GISTENTRY *)
		DatumGetPointer(FunctionCall1Coll(&giststate->decompressFn[nkey],
										  giststate->supportCollation[nkey],

error: patch failed: src/backend/access/gist/gistutil.c:550
error: while searching for:

			gistentryinit(centry, attdata[i], r, NULL,
						  (OffsetNumber) 0, isleaf);
			cep = (GISTENTRY *)
				DatumGetPointer(FunctionCall1Coll(&giststate->compressFn[i],
												  giststate->supportCollation[i],
												  PointerGetDatum(&centry)));
			compatt[i] = cep->key;
		}
	}

error: patch failed: src/backend/access/gist/gistutil.c:585
Hunk #3 succeeded at 648 (offset -7 lines).
Hunk #4 succeeded at 924 (offset -7 lines).
Hunk #5 succeeded at 944 (offset -7 lines).
Checking patch src/backend/access/gist/gistvalidate.c...
Applied patch doc/src/sgml/gist.sgml cleanly.
Applied patch src/backend/access/gist/gist.c cleanly.
Applied patch src/backend/access/gist/gistget.c cleanly.
Applying patch src/backend/access/gist/gistutil.c with 2 rejects...
Rejected hunk #1.
Rejected hunk #2.
Hunk #3 applied cleanly.
Hunk #4 applied cleanly.
Hunk #5 applied cleanly.
Applied patch src/backend/access/gist/gistvalidate.c cleanly.

The new status of this patch is: Waiting on Author



Re: [HACKERS] Broken hint bits (freeze)

2017-06-08 Thread Dmitriy Sarafannikov

> Why didn't rsync made the copies on master and replica same?

Because rsync was running with the --size-only flag.

> I haven't looked in detail, but it sounds slightly risky proposition
> to manipulate the tuples by writing C functions of the form you have
> in your code.  I would have preferred some way to avoid this problem
> by ensuring that replicas are properly synced (complete data of master
> via WAL) or by disabling autovacuum.

Avoiding this problem is a good way forward. But what should we do with data
that is already corrupted? Can you explain in more detail what you mean?



Re: [HACKERS] Broken hint bits (freeze)

2017-06-02 Thread Dmitriy Sarafannikov
Thanks, everyone.

We found the source of the problem. It was a mistake in the upgrade to 9.6.

We upgraded the replicas with rsync, as described in the documentation:
rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir

We must provide 100% read-only availability of our shard (master + 2 replicas),
so we can't simply stop the master and both replicas, upgrade them one by one,
and start them again. We do it as follows:
close the master from load, stop the master, upgrade it, stop the 1st replica,
upgrade it, start the 1st replica, stop the 2nd replica, upgrade it, start the
2nd replica, start the master, open the master.
But the upgraded replicas died under load without statistics, so we decided to
run ANALYZE on the master before upgrading the replicas; that way the
statistics would be copied to the replicas by rsync.
The upgrade algorithm became as follows:
close the master, stop the master, cut the master off from the replicas
(iptables), upgrade the master, start the master, run ANALYZE, stop the master,
stop the 1st replica, upgrade the 1st replica, start the 1st replica, stop the
2nd replica, upgrade the 2nd replica, start the 2nd replica, start the master,
open the master.

If autovacuum starts vacuuming relations while we are performing the ANALYZE,
the WAL records it generates will never be replayed on the replicas: the next
step is stopping the master with a checkpoint, so a new redo location LSN
(newer than those WAL records) appears in the pg_control file, which is then
copied by rsync to the replicas.
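
As an aside, the redo pointer that ends up in pg_control (and is then shipped
by rsync) can be checked from SQL; a sketch, assuming 9.6's
pg_control_checkpoint() function (the column is named redo_lsn in v10+):

-- Show the checkpoint redo location currently recorded in pg_control.
SELECT redo_location, redo_wal_file FROM pg_control_checkpoint();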

If it was a plain vacuum, we would most likely not see the consequences: it
merely marks tuples as deleted, some of the next new tuples will be placed
there, and thanks to full page writes the replicas will receive the correct
page, identical to the master's.
But if it was a vacuum to prevent wraparound, we get a situation like ours:
tuples on the master are frozen, but on the replicas they are not. And that
will not change as long as nobody updates any tuple on the page.

It's dangerous because, if we perform a switchover to a replica, the
«corrupted» page will be delivered to all replicas after the next update of any
tuple on that page.

We reproduced this case in our test environment and confirmed this assumption.

Starting and stopping the master after running pg_upgrade, but before the
rsync, in order to collect statistics was a bad idea.

We know how to find such «corrupted» tuples, and we want to fix them by
manually freezing the tuples via specially written C functions. The functions
are «copy-pasted» and simplified code from the vacuum functions, with an SQL
interface (see attachment).
Can you look at them? Do you think it is safe to use them for fixing corrupted
pages, or is there a better way that doesn't lose data?
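
For illustration only, a driving query over such functions might look like the
sketch below; the name and signature freeze_tuple(regclass, bigint, smallint)
are assumptions for this sketch, not necessarily the interface of the attached
file:

-- Hypothetical: freeze_tuple(rel, blkno, offnum) is an assumed signature.
SELECT freeze_tuple('mytable'::regclass, s.i, lp)
FROM generate_series(0, pg_relation_size('mytable')/8192 - 1) s(i)
LEFT JOIN LATERAL heap_page_items(get_raw_page('mytable', s.i::int)) ON true
WHERE t_xmin::text::bigint < (SELECT relfrozenxid::text::bigint
                              FROM pg_class WHERE relname = 'mytable')
  AND t_infomask & x'0300'::int < 1;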

Regards,
Dmitriy Sarafannikov



freeze_tuple.c
Description: Binary data



Re: [HACKERS] Broken hint bits (freeze)

2017-05-24 Thread Dmitriy Sarafannikov
We found that this problem also appears on shards with checksums enabled.
This shard is on its 1st timeline, which means there was no switchover after
the upgrade to 9.6.

xdb11f(master)=# select pg_current_xlog_location(), pg_xlogfile_name(pg_current_xlog_location());
 pg_current_xlog_location |     pg_xlogfile_name
--------------------------+--------------------------
 30BA/5966AD38            | 00000001000030BA00000059
(1 row)

xdb11f(master)=# select * from page_header(get_raw_page('mytable', 1787));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1F43/8C432C60 |    -3337 |     5 |   256 |   304 |    8192 |     8192 |       4 |         0
(1 row)

xdb11h(replica)=# select * from page_header(get_raw_page('mytable', 1787));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1B28/45819C28 |   -17617 |     5 |   256 |   304 |    8192 |     8192 |       4 |         0
(1 row)

xdb11e(replica)=# select * from page_header(get_raw_page('mytable', 1787));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1B28/45819C28 |   -17617 |     5 |   256 |   304 |    8192 |     8192 |       4 |         0
(1 row)

The master has the newer page LSN and the freeze bits set.

xdb11f(master)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) from heap_page_items(get_raw_page('mytable', 1787)) where lp = 42;
  t_xmin   |             ?column?
-----------+----------------------------------
 516651778 | 00000000000000000000001100000000
(1 row)

xdb11h(replica)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) from heap_page_items(get_raw_page('mytable', 1787)) where lp = 42;
  t_xmin   |             ?column?
-----------+----------------------------------
 516651778 | 00000000000000000000000000000000
(1 row)

xdb11e(replica)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) from heap_page_items(get_raw_page('mytable', 1787)) where lp = 42;
  t_xmin   |             ?column?
-----------+----------------------------------
 516651778 | 00000000000000000000000000000000
(1 row)

It seems like the replicas did not replay the corresponding WAL records.
Any thoughts?

Regards,
Dmitriy Sarafannikov



[HACKERS] Broken hint bits (freeze)

2017-05-23 Thread Dmitriy Sarafannikov
Hi hackers,

We have some problems on our production systems with hint bits and frozen
tuples. More and more errors like the following began to appear on the master
after a switchover:

ERROR:  58P01: could not access status of transaction 1952523525
DETAIL:  Could not open file "pg_clog/0746": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:896

We investigated the problem with pageinspect and found the tuples that cause it:

xdb311g(master)=# select * from mytable where ctid = '(4,21)';
ERROR:  58P01: could not access status of transaction 1951521353
DETAIL:  Could not open file "pg_clog/0745": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:896

But the same query successfully executed on replica.

We found some difference in hint bits between master and replica:

xdb311g(master)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) FROM heap_page_items(get_raw_page('mytable',4)) where lp=21;
-[ RECORD 1 ]-------------------------------
t_xmin   | 1951521353
?column? | 00000000000000000000000000000000

old master, now replica:
xdb311e(replica)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) FROM heap_page_items(get_raw_page('mytable',4)) where lp=21;
-[ RECORD 1 ]-------------------------------
t_xmin   | 1951521353
?column? | 00000000000000000000001100000000

X'0300' = HEAP_XMIN_FROZEN according to

#define HEAP_XMIN_COMMITTED 0x0100  /* t_xmin committed */
#define HEAP_XMIN_INVALID   0x0200  /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN	(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)

xdb311g(master)=# select relfrozenxid from pg_class where relname = 'mytable';
 relfrozenxid
--------------
   2266835605
(1 row)

This tuple must be frozen, but neither HEAP_XMIN_COMMITTED nor
HEAP_XMIN_INVALID is set on the master.

Another interesting thing is that the LSNs of this page on the master and the
replica are not the same:

xdb311g(master)=# select lsn from page_header(get_raw_page('mytable',4));
      lsn
---------------
 8092/6A26DD08
(1 row)

xdb311e(replica)=# select lsn from page_header(get_raw_page('mytable',4));
      lsn
---------------
 838D/C4A0D280
(1 row)

And the LSN on the replica is greater than the LSN on the master
(838D/C4A0D280 > 8092/6A26DD08). How can this be possible?

We wrote a query which returns the ctids of tuples that must be frozen but are
not actually frozen.

xdb311e(replica)=# select t_ctid from generate_series(0, pg_relation_size('mytable')/8192 - 1) s(i) left join lateral heap_page_items(get_raw_page('mytable', s.i::int)) on true where t_xmin::text::bigint < (select relfrozenxid::text::bigint from pg_class where relname = 'mytable') and t_infomask & X'0300'::int < 1;
  t_ctid
-----------
(400,16)
(2837,71)
(2837,72)
(2837,73)
(2837,75)
(2837,76)
(3042,40)
(4750,80)
(4750,81)
(5214,60)
(5214,65)
(6812,31)
(6912,63)
(7329,8)
(7374,26)
(7374,27)
(16 rows)
The same query on the master returns 317 rows.

Our thoughts:
1) We think that it is related to the switchover.
2) Any WAL-logged modification of this page on the master will replace the page
on the replicas due to full page writes, and then all replicas will have broken
hint bits too. It's dangerous.

Where should we dig further?

RHEL 6, PostgreSQL 9.6.3, wal_log_hints=off, full_page_writes=on, fsync=on,
checksums disabled.
We don't think this is a hardware problem: these databases started on 9.4,
survived two upgrades via pg_upgrade, and no hardware problems were ever
detected. The problem appears in more than one shard.
The size of each shard is around 5 TB, so we can't provide the data.

Regards
Dmitriy Sarafannikov



Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-12 Thread Dmitriy Sarafannikov

> Ok, i agree. Patch is attached.

I have added the patch to the CF.




Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-08 Thread Dmitriy Sarafannikov

> + else \
> + (snapshotdata).xmin = \
> + TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin, \
> + relation); \
> 
> I think we don't need to use TransactionIdLimitedForOldSnapshots() as
> that is required to override xmin for table vacuum/pruning purposes.
> 
>> Maybe we need
>> to use GetOldestXmin()?
> 
> It is a costly call as it needs ProcArrayLock, so I don't think it
> makes sense to use it here.

OK, I agree. Patch is attached.



snapshot_non_vacuumable_v3.patch
Description: Binary data



Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-08 Thread Dmitriy Sarafannikov
> I think we can use RecentGlobalDataXmin for non-catalog relations and
> RecentGlobalXmin for catalog relations (probably a check similar to
> what we have in heap_page_prune_opt).

I took the check from heap_page_prune_opt (maybe this check should become a
separate function?), but it requires initializing the snapshot for a specific
relation. Maybe we need to use GetOldestXmin()?

A new version of the patch is attached.

snapshot_non_vacuumable_v2.patch
Description: Binary data


Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-05 Thread Dmitriy Sarafannikov
Amit, thanks for the comments!

> 1.
> +#define InitNonVacuumableSnapshot(snapshotdata)  \
> + do { \
> + (snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable; \
> + (snapshotdata).xmin = RecentGlobalDataXmin; \
> + } while(0)
> +
> 
> Can you explain and add comments why you think RecentGlobalDataXmin is
> the right to use it here?  As of now, it seems to be only used for
> pruning non-catalog tables.

Can you explain to me what value should be used for xmin here?

> 2.
> +bool
> +HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
> + Buffer buffer)
> +{
> + return HeapTupleSatisfiesVacuum(htup, snapshot->xmin, buffer)
> + != HEAPTUPLE_DEAD;
> +}
> +
> 
> Add comments on top of this function and for the sake of consistency
> update the file header as well (Summary of visibility functions:)

Yes, I will add comments and send a new patch.




Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-04 Thread Dmitriy Sarafannikov

> Maybe we need another type of snapshot that would accept any
> non-vacuumable tuple.  I really don't want SnapshotAny semantics here,
> but a tuple that was live more recently than the xmin horizon seems
> like it's acceptable enough.  HeapTupleSatisfiesVacuum already
> implements the right behavior, but we don't have a Snapshot-style
> interface for it.


I have tried to implement this new type of snapshot that accepts any
non-vacuumable tuple.
We have tried this patch in our load environment, and it smoothed out and
reduced the magnitude of the CPU usage peaks.
But this snapshot does not solve the problem completely.

Patch is attached.



snapshot_non_vacuumable.patch
Description: Binary data



Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-02 Thread Dmitriy Sarafannikov

> If that is the case, then how would using SnapshotAny solve this
> problem.  We get the value from index first and then check its
> visibility in heap, so if time is spent in _bt_checkkeys, why would
> using a different kind of Snapshot solve the problem?

First, scanning the index with SnapshotAny will accept these rows and will not
mark the entries as killed.

Furthermore, killed entries are ignored on a standby anyway, so scanning with
SnapshotAny will always accept the first row from the index:
 /*
  * During recovery we ignore killed tuples and don't bother to kill them
  * either. We do this because the xmin on the primary node could easily be
  * later than the xmin on the standby node, so that what the primary
  * thinks is killed is supposed to be visible on standby. So for correct
  * MVCC for queries during recovery we must ignore these hints and check
  * all tuples. Do *not* set ignore_killed_tuples to true when running in a
  * transaction that was started during recovery. xactStartedInRecovery
  * should not be altered by index AMs.
  */
 scan->kill_prior_tuple = false;
 scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
 scan->ignore_killed_tuples = !scan->xactStartedInRecovery;

We execute these read-only queries on standbys.



Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-04-29 Thread Dmitriy Sarafannikov

> Maybe we need another type of snapshot that would accept any
> non-vacuumable tuple.  I really don't want SnapshotAny semantics here,
> but a tuple that was live more recently than the xmin horizon seems
> like it's acceptable enough.  HeapTupleSatisfiesVacuum already
> implements the right behavior, but we don't have a Snapshot-style
> interface for it.


If I understood correctly, this new type of snapshot would help when there are
long-running transactions which can still see these tuples.
But if there are no long-running transactions, it will behave the same.
Am I right?

And what about not fetching the actual min and max values from indexes whose
columns are not involved in the join?



Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-04-28 Thread Dmitriy Sarafannikov

> What I'm thinking of is the regular indexscan that's done internally
> by get_actual_variable_range, not whatever ends up getting chosen as
> the plan for the user query.  I had supposed that that would kill
> dead index entries as it went, but maybe that's not happening for
> some reason.


Really, this happens as you said. Index entries are marked as dead.
But after that, backends spend CPU time skipping the killed entries
in _bt_checkkeys:

if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
	/* return immediately if there are more tuples on the page */
	if (ScanDirectionIsForward(dir))
	{
		if (offnum < PageGetMaxOffsetNumber(page))
			return NULL;
	}
	else
	{
		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

		if (offnum > P_FIRSTDATAKEY(opaque))
			return NULL;
	}

This is confirmed by perf records and the backtrace reported by Vladimir earlier.
root@pgload01e ~ # perf report | grep -v '^#' | head
56.67%  postgres   postgres[.] _bt_checkkeys
19.27%  postgres   postgres[.] _bt_readpage
 2.09%  postgres   postgres[.] pglz_decompress
 2.03%  postgres   postgres[.] LWLockAttemptLock
 1.61%  postgres   postgres[.] PinBuffer.isra.3
 1.14%  postgres   postgres[.] hash_search_with_hash_value
 0.68%  postgres   postgres[.] LWLockRelease
 0.42%  postgres   postgres[.] AllocSetAlloc
 0.40%  postgres   postgres[.] SearchCatCache
 0.40%  postgres   postgres[.] ReadBuffer_common
root@pgload01e ~ #
It seems like killing dead tuples does not solve this problem.

[HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-04-27 Thread Dmitriy Sarafannikov
…increased Planning time:

# explain (analyze, verbose, timing, buffers) SELECT * FROM polygon_table polygon INNER JOIN second_table second ON (polygon.second_id = second.id) WHERE ST_Intersects(poly, ST_SetSrid(ST_MakePoint(52.3433914, 58.7438431), 4326));
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.56..16.86 rows=1 width=160) (actual time=0.100..0.248 rows=4 loops=1)
   Output: polygon.id, polygon."time", polygon.poly, polygon.second_id, second.id
   Buffers: shared hit=49
   ->  Index Scan using polygon_table_poly_id on public.polygon_table polygon  (cost=0.29..8.55 rows=1 width=156) (actual time=0.081..0.220 rows=4 loops=1)
         Output: polygon.id, polygon."time", polygon.poly, polygon.second_id
         Index Cond: (polygon.poly && '010120E610245DD83FF42B4A4079ED2D40365F4D40'::geometry)
         Filter: _st_intersects(polygon.poly, '010120E610245DD83FF42B4A4079ED2D40365F4D40'::geometry)
         Rows Removed by Filter: 6
         Buffers: shared hit=37
   ->  Index Only Scan using second_table_pkey on public.second_table second  (cost=0.28..8.29 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=4)
         Output: second.id
         Index Cond: (second.id = polygon.second_id)
         Heap Fetches: 4
         Buffers: shared hit=12
 Planning time: 115.122 ms
 Execution time: 0.422 ms
(16 rows)

Time: 116.926 ms

# explain (analyze, verbose, timing, buffers) SELECT * FROM polygon_table polygon INNER JOIN second_table second ON (polygon.second_id = second.id) WHERE ST_Intersects(poly, ST_SetSrid(ST_MakePoint(52.3433914, 58.7438431), 4326));
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.56..16.86 rows=1 width=160) (actual time=0.059..0.373 rows=46 loops=1)
   Output: polygon.id, polygon."time", polygon.poly, polygon.second_id, second.id
   Buffers: shared hit=170
   ->  Index Scan using polygon_table_poly_id on public.polygon_table polygon  (cost=0.29..8.55 rows=1 width=156) (actual time=0.045..0.269 rows=46 loops=1)
         Output: polygon.id, polygon."time", polygon.poly, polygon.second_id
         Index Cond: (polygon.poly && '010120E610245DD83FF42B4A4079ED2D40365F4D40'::geometry)
         Filter: _st_intersects(polygon.poly, '010120E610245DD83FF42B4A4079ED2D40365F4D40'::geometry)
         Rows Removed by Filter: 44
         Buffers: shared hit=32
   ->  Index Only Scan using second_table_pkey on public.second_table second  (cost=0.28..8.29 rows=1 width=4) (actual time=0.001..0.002 rows=1 loops=46)
         Output: second.id
         Index Cond: (second.id = polygon.second_id)
         Heap Fetches: 46
         Buffers: shared hit=138
 Planning time: 6.139 ms
 Execution time: 0.482 ms
(16 rows)

Time: 7.722 ms

Initially, the function used the active snapshot from GetActiveSnapshot(). But in fccebe421d0c410e6378fb281419442c84759213 this behavior was "weakened" to SnapshotDirty (I suppose for a similar reason). Was there a particular reason for allowing the planner to see uncommitted rows, but forbidding it access to dead ones?

A simple patch that uses SnapshotAny is attached. Comments in the code are not changed yet.

data.sql file: https://yadi.sk/d/GtBW6Hhu3HQ4CA

Regards,
Dmitriy Sarafannikov
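
For context, a minimal way to reproduce this kind of planning-time blowup; a
sketch under assumed table names, not the production schema:

-- Many dead tuples at the top of an index make the planner's
-- get_actual_variable_range() step over them when estimating selectivity
-- near the histogram's end, inflating planning time.
CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t SELECT generate_series(1, 1000000);
DELETE FROM t WHERE id > 500000;   -- leaves dead entries at the upper end
EXPLAIN SELECT * FROM t a JOIN t b USING (id) WHERE a.id > 400000;
-- Planning time is high on the first run and drops once the dead index
-- entries are killed or vacuumed away.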

bad_request
Description: Binary data


get_actual_variable_range.patch
Description: Binary data


[HACKERS] Re: [HACKERS] Design for In-Core Logical Replication

2016-07-22 Thread Dmitriy Sarafannikov

>
>CREATE PUBLICATION mypub;
>ALTER PUBLICATION mypub ADD TABLE users, departments;

>CREATE SUBSCRIPTION mysub WITH CONNECTION dbname=foo host=bar 
>user=repuser PUBLICATION mypub;

>    The above will start the replication process which synchronizes the
>    initial table contents of users and
>    departments tables and then starts replicating
>    incremental changes to those tables.

Hi, hackers.

It is very good to have logical replication in core. I also have a proposal:
what if we had the ability to execute custom trigger functions on events on a
particular table? It would also be useful to be able to ignore some tables in a
publication, or to replicate with a WHERE condition. For example, we want to
replicate the table "users" as is (maybe with some WHERE conditions), but on
events on the table "departments" we want to execute the trigger function
departments_event_handler(). And we don't want to handle events on a third
table that was added to the publication.

Something like this:

CREATE PUBLICATION mypub;
ALTER PUBLICATION mypub ADD TABLE users, departments, unnecessary_tbl;

CREATE SUBSCRIPTION mysub WITH CONNECTION dbname=foo host=bar 
user=repuser PUBLICATION mypub;
ALTER SUBSCRIPTION mysub ADD TABLE users REPLICATE TO LOCAL TABLE my_users
    INSERT WHERE new.id_user > 1000
    UPDATE WHERE old.id_user < 1000;   -- we don't want to replicate deletes, for example
ALTER SUBSCRIPTION mysub ADD TABLE departments
    ON INSERT WHEN (new.id_user > 1000)
    EXECUTE PROCEDURE departments_event_handler();   -- just like a trigger


Regards,
Dmitriy Sarafannikov


[HACKERS] Re: [HACKERS] [PERFORM] 9.4 -> 9.5 regression with queries through pgbouncer on RHEL 6

2016-07-15 Thread Dmitriy Sarafannikov


>I don't really see anything suspicious in the profile. This looks more
>like a kernel scheduler issue than a postgres bottleneck one. It seems
>that somehow using nonblocking IO (started in 9.5) causes scheduling
>issues when pgbouncer is also local.
>
>Could you do perf stat -ddd -a sleep 10 or something during both runs? I
>suspect that the context switch ratios will be quite different.

Perf shows that in the 9.5 case context switches occur about half as often.
The perf output is attached.

Regards,
Dmitriy Sarafannikov

perf.stat
Description: Binary data



[HACKERS] Re[2]: [HACKERS] 9.4 -> 9.5 regression with queries through pgbouncer on RHEL 6

2016-07-14 Thread Dmitriy Sarafannikov
>>The results above are not really fair, pgbouncer.ini was a bit different on 
>>Ubuntu host (application_name_add_host was disabled). Here are the right 
>>results with exactly the same configuration:
>>
>>OS PostgreSQL version TPS Avg. latency
>>RHEL 6 9.4 44898 1.425 ms
>>RHEL 6 9.5 26199 2.443 ms
>>RHEL 6 9.6 43027 1.487 ms
>>Ubuntu 14.04 9.4 45971 1.392 ms
>>Ubuntu 14.04 9.5 40282 1.589 ms
>>Ubuntu 14.04 9.6 45410 1.409 ms
>>
>>It can be seen that there is a regression for 9.5 in Ubuntu also, but not so 
>>significant. We first thought that the reason is 
>>38628db8d8caff21eb6cf8d775c0b2d04cf07b9b (Add memory barriers for   
>>PgBackendStatus.st _changecount protocol), but in that case the regression 
>>should also be seen in 9.6 also.
>>
>>There also was a bunch of changes in FE/BE communication (like 
>>387da18874afa17156ee3af63766f17efb53c4b9 or 
>>98a64d0bd713cb89e61bef6432befc4b7b5da59e) and that may answer the question of 
>>regression in 9.5 and normal results in 9.6. Probably the right way to find 
>>the answer is to do bisect. I’ll do it but if some more diagnostics 
>>information can help, feel free to ask about it.
>
>Yep, bisect confirms that the first bad commit in REL9_5_STABLE is 
>387da18874afa17156ee3af63766f17efb53c4b9. Full output is attached.
>And bisect for master branch confirms that the situation became much better 
>after 98a64d0bd713cb89e61bef6432befc4b7b5da59e. Output is also attached.
>
>On Ubuntu performance degradation is ~15% and on RHEL it is ~100%. I don’t 
>know what is the cause for different numbers on RHEL and Ubuntu but certainly 
>there is a regression when pgbouncer is connected to postgres through 
>localhost. When I try to connect pgbouncer to postgres through unix-socket 
>performance is constantly bad on all postgres versions.
>
>Both servers are for testing but I can easily provide you SSH access only to 
>Ubuntu host if necessary. I can also gather more diagnostics if needed.

We have not invented anything better than to backport
98a64d0bd713cb89e61bef6432befc4b7b5da59e from 9.6 to 9.5.
It completely solved the problem. If anyone is interested or runs into this
problem, the patch is attached.


Regards,
Dmitriy Sarafannikov

diff --git a/configure b/configure
index adee368..311340f 100755
--- a/configure
+++ b/configure
@@ -9364,7 +9364,7 @@ fi
 done
 
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
diff --git a/configure.in b/configure.in
index 5025798..66e8fa9 100644
--- a/configure.in
+++ b/configure.in
@@ -1082,7 +1082,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 26d8faa..52b35a2 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -139,16 +139,38 @@ retry:
 	/* In blocking mode, wait until the socket is ready */
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent	event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-			  WL_LATCH_SET | waitfor,
-			  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1);
+
+		/*
+		 * If the postmaster has died, it's not safe to continue running,
+		 * because it is the postmaster's job to kill us if some other backend
+		 * exits uncleanly.  Moreover, we won't run very well in this state;
+		 * helper processes like walwriter and the bgwriter will exit, so
+		 * performance may be 

[HACKERS] Re[2]: [HACKERS] Incorrect error message in InitializeSessionUserId

2016-03-04 Thread Dmitriy Sarafannikov
>On Fri, Mar 4, 2016, 5:23 +03:00 from Michael Paquier <
>michael.paqu...@gmail.com >:
>
>> The patch adds the support of taking the role name from the role tuple
>> instead of using the provided rolename variable, because it is possible
>> that rolename variable is NULL if the connection is from a background
>> worker.
>>
>> The patch is fine, I didn't find any problems, I marked it as ready for
>> committer.
>>
>> IMO this patch may need to backpatch supported branches as it is
>> a bug fix. Committer can decide.
>
>+1 for the backpatch. The current error message should the rolename be
>undefined in this context is misleading for users.
>-- 
>Michael

Thanks for the feedback.

This patch applies cleanly to the 9.5 branch.
In 9.4 and earlier, the InitializeSessionUserId function has a different signature:
void InitializeSessionUserId(const char *rolename)
so it is impossible to pass a role Oid to it.

Therefore the patch is relevant only to the master and 9.5 branches.

Regards,
Dmitriy Sarafannikov


[HACKERS] Incorrect error message in InitializeSessionUserId

2016-03-01 Thread Dmitriy Sarafannikov
Hi all,

I have found an incorrect error message in the InitializeSessionUserId function
when connecting to a database by role Oid (for example, via
BackgroundWorkerInitializeConnectionByOid).
If the role has no permission to log in, you will see an error message like this:
FATAL:  role "(null)" is not permitted to log in

I changed a few lines of code to fix this.
The patch is attached.
I want to add this patch to the commitfest.
Any objections?



Regards,
Dmitriy Sarafannikov

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 603a256..72fddc5 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -474,6 +474,7 @@ InitializeSessionUserId(const char *rolename, Oid roleid)
 {
 	HeapTuple	roleTup;
 	Form_pg_authid rform;
+	char	   *rname;
 
 	/*
 	 * Don't do scans if we're bootstrapping, none of the system catalogs
@@ -485,16 +486,25 @@ InitializeSessionUserId(const char *rolename, Oid roleid)
 	AssertState(!OidIsValid(AuthenticatedUserId));
 
 	if (rolename != NULL)
+	{
 		roleTup = SearchSysCache1(AUTHNAME, PointerGetDatum(rolename));
+		if (!HeapTupleIsValid(roleTup))
+			ereport(FATAL,
+					(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
+					 errmsg("role \"%s\" does not exist", rolename)));
+	}
 	else
+	{
 		roleTup = SearchSysCache1(AUTHOID, ObjectIdGetDatum(roleid));
-	if (!HeapTupleIsValid(roleTup))
-		ereport(FATAL,
-				(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
-				 errmsg("role \"%s\" does not exist", rolename)));
+		if (!HeapTupleIsValid(roleTup))
+			ereport(FATAL,
+					(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
+					 errmsg("role (OID = %u) does not exist", roleid)));
+	}
 
 	rform = (Form_pg_authid) GETSTRUCT(roleTup);
 	roleid = HeapTupleGetOid(roleTup);
+	rname = NameStr(rform->rolname);
 
 	AuthenticatedUserId = roleid;
 	AuthenticatedUserIsSuperuser = rform->rolsuper;
@@ -520,7 +530,7 @@ InitializeSessionUserId(const char *rolename, Oid roleid)
 			ereport(FATAL,
 					(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
 					 errmsg("role \"%s\" is not permitted to log in",
-							rolename)));
+							rname)));
 
 		/*
 		 * Check connection limit for this role.
@@ -538,11 +548,11 @@ InitializeSessionUserId(const char *rolename, Oid roleid)
 			ereport(FATAL,
 					(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 					 errmsg("too many connections for role \"%s\"",
-							rolename)));
+							rname)));
 	}
 
 	/* Record username and superuser status as GUC settings too */
-	SetConfigOption("session_authorization", rolename,
+	SetConfigOption("session_authorization", rname,
 					PGC_BACKEND, PGC_S_OVERRIDE);
 	SetConfigOption("is_superuser",
 					AuthenticatedUserIsSuperuser ? "on" : "off",
