Re: [HACKERS] [POC] hash partitioning
Hi Amul, On 09/28/2017 05:56 AM, amul sul wrote: It does not really do the partition pruning via constraint exclusion, and I don't think anyone is going to use the remainder in the WHERE condition to fetch data; hash partitioning is not meant for that. But I am sure that we could solve this problem using your and Beena's work toward faster partition pruning [1] and Runtime Partition Pruning [2]. Will think on these changes if they are required for the pruning feature. Could you rebase on latest master ? Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] initdb w/ restart
Hi, In the case of initdb /tmp/pgsql followed by pg_ctl -D /tmp/pgsql/ -l /tmp/logfile restart you'll get pg_ctl: PID file "/tmp/pgsql/postmaster.pid" does not exist Is server running? starting server anyway pg_ctl: could not read file "/tmp/pgsql/postmaster.opts" The attached patch changes the message to "trying to start server anyway" to highlight that it is an attempt, not something that is guaranteed to happen. Probably not a good idea to change the logic around pg_ctl.c:688, hence this suggestion. Thoughts ? Best regards, Jesper >From 9e8cdda3173a25f1e14cccd7261877f160d1b0f7 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Thu, 28 Sep 2017 15:31:24 -0400 Subject: [PATCH] Change message for restarting a server from a directory without a PID file. This accounts for the case where a restart happens after an initdb. Author: Jesper Pedersen --- src/bin/pg_ctl/pg_ctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 4e02c4cea1..f5281a36a8 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -965,7 +965,7 @@ do_restart(void) write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file); write_stderr(_("Is server running?\n")); - write_stderr(_("starting server anyway\n")); + write_stderr(_("trying to start server anyway\n")); do_start(); return; } -- 2.13.5 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partitions: \d vs \d+
On 09/28/2017 09:19 AM, Maksim Milyutin wrote: E.g. "No partition constraint" vs. "Partition constraint: satisfies_hash_partition(...)". I also noticed ambiguity in printing "No partition constraint" in non-verbose mode and "Partition constraint:..." in verbose one for partition tables regardless of the type of partition. Attached small patch removes any output about partition constraint in non-verbose mode. Yeah, that could be one way. It should likely be backported to REL_10_STABLE, so the question is if we are too late in the release cycle to change that output. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Partitions: \d vs \d+
Hi,

Using hash partitions I noticed that \d gives

D=# \d T_p63
                 Table "public.T_p63"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
Partition of: T FOR VALUES WITH (modulus 64, remainder 63)
No partition constraint
Indexes:
    "T_p63" btree (X, Y)

whereas \d+ gives

D=# \d+ T_p63
                 Table "public.T_p63"
 Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+------+-----------+----------+---------+---------+--------------+-------------
Partition of: T FOR VALUES WITH (modulus 64, remainder 63)
Partition constraint: satisfies_hash_partition(64, 63, hashint4extended(X, '8816678312871386367'::bigint))
Indexes:
    "T_p63" btree (X, Y)

E.g. "No partition constraint" vs. "Partition constraint: satisfies_hash_partition(...)". Current master (7769fc000) with [1] and [2]. [1] https://commitfest.postgresql.org/14/1059/ [2] https://commitfest.postgresql.org/14/1089/ Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multicolumn hash indexes
On 09/26/2017 08:11 PM, Robert Haas wrote: On Tue, Sep 26, 2017 at 7:18 PM, Tom Lane wrote: Tomasz Ostrowski writes: I've noticed that hash indexes can't currently (in PG10) be multicolumn. Are they technically hard to implement, or did just nobody take up such a feature? It's not simple, particularly not if you wish that the index would support queries specifying conditions for just a subset of the indexed columns (an assumption that's buried pretty deeply in the planner, for one thing). Then you couldn't compute the hash. Whoa, that seems like moving the goalposts. Somebody could design a hash index that was intended to answer such queries, but it's not the one we've got. I think we should just aim to support equality queries on all columns. That seems like a fairly straightforward generalization of what we've already got. This would require that the applications that are using the database know about the index structure in order to pass down the right columns. I would say that in most cases applications don't know about index structures. So, multiple indexes would have to be created (col1), (col1, col2), (col1, col3), ... which isn't ideal. Maybe an initial proof-of-concept could store the hash of the first column (col1) plus the hash of all columns (col1, col2, col3) in the index, and see what requirements / design decisions would appear from that. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
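For illustration, here is a minimal C sketch of the proof-of-concept suggested above: folding per-column hash values into one combined value alongside the first column's own hash. The hash_combine() body is the one from src/include/utils/hashutils.h (also quoted later in this digest); combine_column_hashes() is a hypothetical helper, not part of any posted patch:

#include <stdint.h>

typedef uint32_t uint32;

/* Mixing step from hash_combine() in src/include/utils/hashutils.h */
static inline uint32
hash_combine(uint32 a, uint32 b)
{
    a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
    return a;
}

/*
 * Hypothetical helper: fold the per-column hash values into one value.
 * An index entry could then carry both column_hashes[0] (the first
 * column alone) and this combined value, as suggested above.
 */
static uint32
combine_column_hashes(const uint32 *column_hashes, int ncolumns)
{
    uint32 result = 0;

    for (int i = 0; i < ncolumns; i++)
        result = hash_combine(result, column_hashes[i]);
    return result;
}

With that layout, an equality query on col1 alone could probe on the first-column hash, while equality on all columns could probe on the combined value.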
Re: [HACKERS] [POC] hash partitioning
On 09/27/2017 03:05 AM, amul sul wrote: Attached rebased patch, thanks. While reading through the patch I thought it would be better to keep MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to highlight that these are "keywords" for hash partition. Also updated some of the documentation. Thanks a lot for the patch, included in the attached version. Thank you. Based on [1] I have moved the patch to "Ready for Committer". [1] https://www.postgresql.org/message-id/CA%2BTgmoYsw3pusDen4_A44c7od%2BbEAST0eYo%2BjODtyofR0W2soQ%40mail.gmail.com Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] path toward faster partition pruning
On 09/26/2017 10:33 AM, Robert Haas wrote: On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen wrote: Could you share your thoughts on the usage of PartitionAppendInfo's min_datum_idx / max_datum_idx ? Especially in relation to hash partitions. This brings up something that I've kind of been thinking about. There are sort of four cases when it comes to partition pruning: 1. There is exactly one matching partition. For example, this happens when there is an equality constraint on every partition column. 2. There are multiple matching partitions which are consecutive. For example, there is a single level of range partitioning with no default partition and the single partitioning column is constrained by < > <= or >=. 3. There are multiple matching partitions which are not consecutive. This case is probably rare, but it can happen if there is a default partition, if there are list partitions with multiple bounds that are interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5), and the query allows values >= 4 and <= 5), if the query involves OR conditions, or if there are multiple levels of partitioning (e.g. partition by a, subpartition by b, put a range constraint on a and an equality constraint on b). 4. There are no matching partitions. One of the goals of this algorithm is to be fast. The obvious way to cater to case (3) is to iterate through all partitions and test whether each one works, returning a Bitmapset, but that is O(n). Admittedly, it might be O(n) with a pretty small constant factor, but it still seems like exactly the sort of thing that we want to avoid given the desire to scale to higher partition counts. I propose that we create a structure that looks like this: struct foo { int min_partition; int max_partition; Bitmapset *extra_partitions; }; This indicates that all partitions from min_partition to max_partition need to be scanned, and in addition any partitions in extra_partitions need to be scanned. Assuming that we only consider cases where all partition keys or a leading subset of the partition keys are constrained, we'll generally be able to get by with just setting min_partition and max_partition, but extra_partitions can be used to handle default partitions and interleaved list bounds. For equality on all partitioning columns, we can do a single bsearch of the bounds to identify the target partition at a given partitioning level, and the same thing works for a single range-bound. If there are two range-bounds (< and > or <= and >= or whatever) we need to bsearch twice. The default partition, if any and if matched, must also be included. When there are multiple levels of partitioning things get a bit more complex -- if someone wants to knock out a partition that breaks up the range, we might need to shrink the main range to cover part of it and kick the other indexes out to extra_partitions. But the good thing is that in common cases with only O(lg n) effort we can return O(1) data that describes what will be scanned. In cases where that's not practical we expend more effort but still prune with maximal effectiveness. For OLTP style applications 1) would be the common case, and with hash partitions it would be one equality constraint. So, changing the method signature to use a data type as you described above instead of the explicit min_datum_idx / max_datum_idx output parameters would be more clear. 
One could advocate (*cough*) that the hash partition patch [1] should be merged first in order to find other instances where other CommitFest entries don't account for hash partitions at the moment in their method signatures; Beena noted something similar in [2]. I know that you said otherwise [3], but this is CommitFest 1, so there is time for a revert later, and hash partitions are already useful in internal testing. [1] https://commitfest.postgresql.org/14/1059/ [2] https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com [3] http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
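To make the data type concrete, a minimal C sketch following Robert's "struct foo" description above; the names PartitionPruneResult and prune_single_partition are made up for illustration and are not committed code:

typedef struct Bitmapset Bitmapset;     /* nodes/bitmapset.h in PostgreSQL */

/*
 * A dense range of partition indexes to scan, plus a bitmapset for the
 * non-consecutive leftovers (default partition, interleaved list
 * bounds, OR clauses) -- case 3 in Robert's list.
 */
typedef struct PartitionPruneResult
{
    int         min_partition;      /* first partition index to scan */
    int         max_partition;      /* last partition index to scan */
    Bitmapset  *extra_partitions;   /* non-consecutive extras, or NULL */
} PartitionPruneResult;

/*
 * Case 1 (equality on every partition column, e.g. one hash
 * partition): a single bsearch of the bounds yields one index, and the
 * result stays O(1) in the number of partitions.
 */
static PartitionPruneResult
prune_single_partition(int part_index)
{
    PartitionPruneResult result = {part_index, part_index, NULL};

    return result;
}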
Re: [HACKERS] Improve catcache/syscache performance.
On 09/26/2017 06:41 AM, tushar wrote: On 09/22/2017 11:45 AM, Andres Freund wrote: Here's a variant that cleans up the previous changes a bit, and adds some further improvements: I tested with different pgbench options with master v/s patch and found an improvement. I have applied 001 and 003 patch on PG Head ,patch 0002 was already committed. Virtual Machine configuration - Centos 6.5 x64 / 16 GB RAM / 8 VCPU core processor Scaling factor=30 pgbench -M prepared -T 200 postgres PG Head - tps = 902.225954 (excluding connections establishing). PG HEAD+patch - tps = 1001.896381 (10.97+% vs. head) pgbench -M prepared -T 300 postgres PG Head - tps = 920.108333 (excluding connections establishing). PG HEAD+patch - tps = 1023.89542 (11.19+% vs. head) pgbench -M prepared -T 500 postgres PG Head - tps = 995.178227 (excluding connections establishing) PG HEAD+patch - tps = 1078.3 (+8.34% vs. head) Later I modified the create_many_cols.sql file (previously attached) and instead of only using int , I mixed it with varchar/int4/numeric/float and run pgbench with different time duration pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 300 postgres PG Head - tps = 5540.143877 (excluding connections establishing). PG HEAD+patch - tps = 5679.713493 (2.50+% vs. head) pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 500 postgres PG Head - tps = 5519.212709 (excluding connections establishing). PG HEAD+patch - tps = 5967.059155 (8.11+% vs. head) pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 700 postgres PG Head - tps = 5640.314495(excluding connections establishing). PG HEAD+patch - tps = 6012.223147 (6.59+% vs. head) I'm also seeing a speedup on a 2S/28C/56T/256Gb + 2 x RAID10 SSD machine using -M prepared, -M prepared -N and -M prepared -S scenarios with various scale factors, and custom queries. Small typo in 0002- / commit 791961: diff --git a/src/include/utils/hashutils.h b/src/include/utils/hashutils.h index 35281689e8..366bd0e78b 100644 --- a/src/include/utils/hashutils.h +++ b/src/include/utils/hashutils.h @@ -22,7 +22,7 @@ hash_combine(uint32 a, uint32 b) /* - * Simple inline murmur hash implementation hashing a 32 bit ingeger, for + * Simple inline murmur hash implementation hashing a 32 bit integer, for * performance. */ static inline uint32 Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] path toward faster partition pruning
Hi Amit, On 09/15/2017 04:50 AM, Amit Langote wrote: On 2017/09/15 11:16, Amit Langote wrote: I will post rebased patches later today, although I think the overall design of the patch on the planner side of things is not quite there yet. Of course, your and others' feedback is greatly welcome. Rebased patches attached. Because Dilip complained earlier today about clauses of the form (const op var) not causing partition-pruning, I've added code to commute the clause where it is required. Some other previously mentioned limitations remain -- no handling of OR clauses, no elimination of redundant clauses for given partitioning column, etc. A note about 0001: this patch overlaps with 0003-Canonical-partition-scheme.patch from the partitionwise-join patch series that Ashutosh Bapat posted yesterday [1]. Because I implemented the planner-portion of this patch based on what 0001 builds, I'm posting it here. It might actually turn out that we will review and commit 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply 0001 if you want to play with the later patches. I would certainly like to review 0003-Canonical-partition-scheme.patch myself, but won't be able to immediately (see below). Could you share your thoughts on the usage of PartitionAppendInfo's min_datum_idx / max_datum_idx ? Especially in relation to hash partitions. I'm looking at get_partitions_for_keys. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
On 09/15/2017 02:30 AM, amul sul wrote: Attached rebased patch, thanks. While reading through the patch I thought it would be better to keep MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to highlight that these are "keywords" for hash partition. Also updated some of the documentation. V20 patch passes make check-world, and my testing (typical 64 partitions, and various ATTACH/DETACH scenarios). Thanks for working on this ! Best regards, Jesper >From 189a40a5ca6c7a1bc79b750cbc95584b3061fda5 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Mon, 18 Sep 2017 11:13:54 -0400 Subject: [PATCH] * Documentation updates * Use caps for MODULUS / REMAINDER when CREATE TABLE is so too --- doc/src/sgml/ddl.sgml | 10 - doc/src/sgml/ref/alter_table.sgml | 2 +- src/backend/catalog/partition.c| 8 +++ src/test/regress/expected/alter_table.out | 20 +- src/test/regress/expected/create_table.out | 34 +++--- src/test/regress/sql/alter_table.sql | 20 +- src/test/regress/sql/create_table.sql | 28 7 files changed, 61 insertions(+), 61 deletions(-) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 24b36caad3..e38d8fc0a0 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -2881,11 +2881,11 @@ VALUES ('Albany', NULL, NULL, 'NY'); - The table is partitioned by specifying modulus and remainder for each - partition. Each partition holds rows for which the hash value of - partition keys when divided by specified modulus produces specified - remainder. For more clarification on modulus and remainder please refer - . + The table is partitioned by specifying a modulus and a remainder for each + partition. Each partition will hold the rows for which the hash value of + the partition key divided by the specified modulus will produce the specified + remainder. Refer to + for additional clarification on modulus and remainder. 
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index a6eefb8564..b5fb93edac 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -1424,7 +1424,7 @@ ALTER TABLE cities Attach a partition to hash partitioned table: ALTER TABLE orders -ATTACH PARTITION orders_p4 FOR VALUES WITH (modulus 4, remainder 3); +ATTACH PARTITION orders_p4 FOR VALUES WITH (MODULUS 4, REMAINDER 3); diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 5c11c7ecea..3696b9a711 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -1694,13 +1694,13 @@ make_partition_op_expr(PartitionKey key, int keynum, * CREATE TABLE simple_hash (a int, b char(10)) PARTITION BY HASH (a, b); * * CREATE TABLE p_p1 PARTITION OF simple_hash - * FOR VALUES WITH (modulus 2, remainder 1); + * FOR VALUES WITH (MODULUS 2, REMAINDER 1); * CREATE TABLE p_p2 PARTITION OF simple_hash - * FOR VALUES WITH (modulus 4, remainder 2); + * FOR VALUES WITH (MODULUS 4, REMAINDER 2); * CREATE TABLE p_p3 PARTITION OF simple_hash - * FOR VALUES WITH (modulus 8, remainder 0); + * FOR VALUES WITH (MODULUS 8, REMAINDER 0); * CREATE TABLE p_p4 PARTITION OF simple_hash - * FOR VALUES WITH (modulus 8, remainder 4); + * FOR VALUES WITH (MODULUS 8, REMAINDER 4); * * This function will return one of the following in the form of an * expression: diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out index 124cbe483c..304fb97291 100644 --- a/src/test/regress/expected/alter_table.out +++ b/src/test/regress/expected/alter_table.out @@ -3490,22 +3490,22 @@ CREATE TABLE hash_parted ( a int, b int ) PARTITION BY HASH (a custom_opclass); -CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (modulus 4, remainder 0); +CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 4, REMAINDER 0); CREATE TABLE fail_part (LIKE hpart_1); -ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (modulus 8, remainder 4); +ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 4); ERROR: partition "fail_part" would overlap partition "hpart_1" -ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (modulus 8, remainder 0); +ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 0); ERROR: partition "fail_part" would overlap partition "hpart_1" DROP TABLE fail_part; -- check validation when attaching hash partitions -- check that violating rows are correctly reported CREATE TABLE hpart_2 (LIKE hash_parted); INSERT INTO hpart_2 VALUES (3, 0); -ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (modulus 4, remainder 1
Re: [HACKERS] [POC] hash partitioning
On 09/14/2017 01:52 PM, Robert Haas wrote: On Thu, Sep 14, 2017 at 1:07 PM, Jesper Pedersen wrote: Yeah, it would be nice to have a syntax like ) PARTITION BY HASH (col) WITH (AUTO_CREATE = 64); But then there also needs to be a way to create the 64 associated indexes too for everything to be easy. Well, for that, there's this proposal: http://postgr.es/m/c8fe4f6b-ff46-aae0-89e3-e936a35f0...@postgrespro.ru As several people have rightly pointed out, there's a lot of work to be done on partitioning to get it to where we want it to be. Even in v10, it's got significant benefits, such as much faster bulk-loading, but I don't hear anybody disputing the notion that a lot more work is needed. The good news is that a lot of that work is already in progress; the bad news is that a lot of that work is not done yet. But I think that's OK. We can't solve every problem at once, and I think we're moving things along here at a reasonably brisk pace. That didn't stop me from complaining bitterly to someone just yesterday that we aren't moving faster still, but unfortunately EnterpriseDB has only been able to get 12 developers to do any work at all on partitioning this release cycle, and 3 of those have so far helped only with review and benchmarking. It's a pity we can't do more, but considering how many community projects are 1-person efforts I think it's pretty good. To be clear, I know you're not (or at least I assume you're not) trying to beat me up about this, just raising a concern, and I'm not trying to beat you up either, just let you know that it is definitely on the radar screen but not there yet. Definitely not a complaint about the work being done. I think the scope of Amul's and others' work on hash partition support is where it needs to be. Improvements can always follow in a future release. My point was that it is easy to script the definition of the partitions and their associated indexes, so it is more important to focus on the core functionality with the developer / review resources available. However, it is a little bit difficult to follow the dependencies between the different partition patches, so I may not always provide sane feedback, as seen in [1]. [1] https://www.postgresql.org/message-id/579077fd-8f07-aff7-39bc-b92c855cdb70%40redhat.com Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
On 09/14/2017 12:56 PM, Robert Haas wrote: On Thu, Sep 14, 2017 at 12:54 PM, David Fetter wrote: Should we be pointing the gun away from people's feet by making hash partitions that cover the space automagically when the partitioning scheme[1] is specified? In other words, do we have a good reason to have only some of the hash partitions so defined by default? Sure, we can add some convenience syntax for that, but I'd like to get the basic stuff working before doing that kind of polishing. If nothing else, I assume Keith Fiske's pg_partman will provide a way to magically DTRT about an hour after this goes in. But probably we can do better in core easily enough. Yeah, it would be nice to have a syntax like ) PARTITION BY HASH (col) WITH (AUTO_CREATE = 64); But then there also needs to be a way to create the 64 associated indexes too for everything to be easy. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
Hi, On 09/14/2017 12:05 PM, Robert Haas wrote: On Thu, Sep 14, 2017 at 11:39 AM, Jesper Pedersen wrote: When I do CREATE TABLE mytab ( a integer NOT NULL, b integer NOT NULL, c integer, d integer ) PARTITION BY HASH (b); and create 64 partitions; CREATE TABLE mytab_p00 PARTITION OF mytab FOR VALUES WITH (MODULUS 64, REMAINDER 0); ... CREATE TABLE mytab_p63 PARTITION OF mytab FOR VALUES WITH (MODULUS 64, REMAINDER 63); and associated indexes CREATE INDEX idx_p00 ON mytab_p00 USING btree (b, a); ... CREATE INDEX idx_p63 ON mytab_p63 USING btree (b, a); Populate the database, and do ANALYZE. Given EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT a, b, c, d FROM mytab WHERE b = 42 gives Append -> Index Scan using idx_p00 (cost rows=7) (actual rows=0) ... -> Index Scan using idx_p63 (cost rows=7) (actual rows=0) E.g. all partitions are being scanned. Of course one partition will contain the rows I'm looking for. Yeah, we need Amit Langote's work in http://postgr.es/m/098b9c71-1915-1a2a-8d52-1a7a50ce7...@lab.ntt.co.jp to land and this patch to be adapted to make use of it. I think that's the major thing still standing in the way of this. Concerns were also raised about not having a way to see the hash function, but we fixed that in 81c5e46c490e2426db243eada186995da5bb0ba7 and hopefully this patch has been updated to use a seed (I haven't looked yet). And there was a concern about hash functions not being portable, but the conclusion of that was basically that most people think --load-via-partition-root will be a satisfactory workaround for cases where that becomes a problem (cf. commit 23d7680d04b958de327be96ffdde8f024140d50e). So this is the major remaining issue that I know about. Thanks for the information, Robert ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
Hi Amul, On 09/14/2017 04:58 AM, amul sul wrote: On Wed, Sep 13, 2017 at 7:43 PM, Jesper Pedersen This patch needs a rebase. Thanks for your note. Attached is the patch rebased on the latest master head. Also added error on creating default partition for the hash partitioned table, and updated document & test script for the same. Thanks ! When I do CREATE TABLE mytab ( a integer NOT NULL, b integer NOT NULL, c integer, d integer ) PARTITION BY HASH (b); and create 64 partitions; CREATE TABLE mytab_p00 PARTITION OF mytab FOR VALUES WITH (MODULUS 64, REMAINDER 0); ... CREATE TABLE mytab_p63 PARTITION OF mytab FOR VALUES WITH (MODULUS 64, REMAINDER 63); and associated indexes CREATE INDEX idx_p00 ON mytab_p00 USING btree (b, a); ... CREATE INDEX idx_p63 ON mytab_p63 USING btree (b, a); Populate the database, and do ANALYZE. Given EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT a, b, c, d FROM mytab WHERE b = 42 gives Append -> Index Scan using idx_p00 (cost rows=7) (actual rows=0) ... -> Index Scan using idx_p63 (cost rows=7) (actual rows=0) E.g. all partitions are being scanned. Of course one partition will contain the rows I'm looking for. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
On 09/11/2017 11:01 AM, Jesper Pedersen wrote: Thanks for working on this ! Moved to "Ready for Committer". Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
Hi Amul, On 09/08/2017 08:40 AM, amul sul wrote: Rebased 0002 against this commit & renamed to 0001, PFA. This patch needs a rebase. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] POC: Cache data in GetSnapshotData()
Hi, On 08/29/2017 05:04 AM, Mithun Cy wrote: Test setting: Server configuration: ./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c wal_buffers=256MB & pgbench configuration: scale_factor = 300 ./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S postgres The machine has 64 cores; with this patch I can see the improvement start after 64 clients. I have tested up to 256 clients, which shows a performance improvement of nearly 39% at max [1]. This is the best case for the patch, where the once-computed SnapshotData is reused further. The worst case seems to be small, quick write transactions which frequently invalidate the cached SnapshotData before it is reused by a subsequent GetSnapshotData call. As of now, I tested simple update tests: pgbench -M prepared -N on the same machine with the above server configuration. I do not see much change in the TPS numbers. All TPS numbers are the median of 3 runs:

Clients   TPS-With Patch 05   TPS-Base        %Diff
1         752.461117          755.186777      -0.3%
64        32171.296537        31202.153576    +3.1%
128       41059.660769        40061.929658    +2.49%

I will do some profiling and find out why this case is not costing us some performance due to the caching overhead. I have done a run with this patch on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD machine. Both for -M prepared, and -M prepared -S, I'm not seeing any improvements (1 to 375 clients); i.e. +-1%. Although the -M prepared -S case should improve, I'm not sure that the extra overhead in the -M prepared case is worth the added code complexity. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hi, On 09/08/2017 03:35 PM, Sokolov Yura wrote: I'm seeing -M prepared: Up to 11% improvement -M prepared -S: No improvement, no regression ("noise") -M prepared -N: Up to 12% improvement. For all runs the improvement shows up the closer you get to the number of CPU threads, or above. Although I'm not seeing the same improvements as you on very large client counts there are definitely improvements :) It is expected: - the patch "fixes NUMA": for example, it doesn't give an improvement on 1 socket at all (I've tested it using numactl to bind to 1 socket) - and certainly it gives less improvement on 2 sockets than on 4 sockets (and 28 cores vs 72 cores also makes a difference), - one of the hot points was CLogControlLock, and it was fixed with "Group mode for CLOG updates" [1] I'm planning to re-test that patch. +static inline bool +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode) I'll leave it to the Committer to decide if this method is too big to be "inline". GCC 4.9 doesn't want to inline it without the directive, and the function call is then noticeable in the profile. The attachment contains a version with all suggestions applied except the removal of "inline". Yes, ideally the method will be kept at "inline". Open questions: * spins_per_delay as extern * Calculation of skip_wait_list Currently the calculation of skip_wait_list is mostly empirical (i.e. the best I measured). Ok, good to know. I strongly think that instead of spins_per_delay something dependent on the concrete lock should be used. I tried to store it in the LWLock itself, but it was worse. Yes, LWLock should be kept as small as possible, and cache line aligned due to the cache storms, as shown by perf c2c. Recently I realized it should be stored in an array indexed by tranche, but I didn't implement it yet, and therefore didn't measure. Different constants for the LWLock could have an impact, but the constants would also be dependent on machine setup and workload. Thanks for working on this ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hi, On 07/18/2017 01:20 PM, Sokolov Yura wrote: I'm sending rebased version with couple of one-line tweaks. (less skip_wait_list on shared lock, and don't update spin-stat on aquiring) I have been running this patch on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD setup (1 to 375 clients on logged tables). I'm seeing -M prepared: Up to 11% improvement -M prepared -S: No improvement, no regression ("noise") -M prepared -N: Up to 12% improvement for all runs the improvement shows up the closer you get to the number of CPU threads, or above. Although I'm not seeing the same improvements as you on very large client counts there are definitely improvements :) Some comments: == lwlock.c: - + * This race is avoiding by taking lock for wait list using CAS with a old + * state value, so it could succeed only if lock is still held. Necessary flag + * is set with same CAS. + * + * Inside LWLockConflictsWithVar wait list lock is reused to protect variable + * value. So first it is acquired to check variable value, then flags are + * set with second CAS before queueing to wait list to ensure lock were not + * released yet. * This race is avoided by taking a lock for the wait list using CAS with the old * state value, so it would only succeed if lock is still held. Necessary flag * is set using the same CAS. * * Inside LWLockConflictsWithVar the wait list lock is reused to protect the variable * value. First the lock is acquired to check the variable value, then flags are * set with a second CAS before queuing to the wait list in order to ensure that the lock was not * released yet. +static void +add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus) Add a method description. + /* +* This barrier is never needed for correctness, and it is no-op +* on x86. But probably first iteration of cas loop in +* ProcArrayGroupClearXid will succeed oftener with it. +*/ * "more often" +static inline bool +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode) I'll leave it to the Committer to decide if this method is too big to be "inline". + /* +* We intentionally do not call finish_spin_delay here, cause loop above +* usually finished in queueing into wait list on contention, and doesn't +* reach spins_per_delay (and so, doesn't sleep inside of +* perform_spin_delay). Also, different LWLocks has very different +* contention pattern, and it is wrong to update spin-lock statistic based +* on LWLock contention. +*/ /* * We intentionally do not call finish_spin_delay here, because the loop above * usually finished by queuing into the wait list on contention, and doesn't * reach spins_per_delay thereby doesn't sleep inside of * perform_spin_delay. Also, different LWLocks has very different * contention pattern, and it is wrong to update spin-lock statistic based * on LWLock contention. */ s_lock.c: - + if (status->spins == 0) + /* but we didn't spin either, so ignore */ + return; Use { } for the if, or move the comment out of the nesting for readability. Open questions: --- * spins_per_delay as extern * Calculation of skip_wait_list You could run the patch through pgindent too. Passes make check-world. Status: Waiting on Author Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
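As a rough illustration of the scheme those comments describe (taking the wait-list lock with a CAS against a previously read state, so the CAS can only succeed while the lock is still held), here is a simplified sketch against the pg_atomic_* API. The flag value and the held-test below are placeholders, not the patch's actual state layout:

#include "postgres.h"
#include "port/atomics.h"

#define MY_FLAG_WAITLIST_LOCKED ((uint32) 1 << 28)     /* placeholder bit */
#define MY_LOCK_MASK            ((uint32) 0x00FFFFFF)  /* placeholder holder bits */

/*
 * Try to take the wait-list lock, but only while the main lock is
 * still held.  Because the CAS compares against the state we last
 * read, it fails (and refreshes "old") if the state changed, e.g.
 * because the lock was released in between.
 */
static bool
lock_waitlist_if_still_held(pg_atomic_uint32 *state)
{
    uint32 old = pg_atomic_read_u32(state);

    while ((old & MY_LOCK_MASK) != 0)   /* main lock still held? */
    {
        if (old & MY_FLAG_WAITLIST_LOCKED)
        {
            /* wait list locked by someone else; re-read and retry */
            old = pg_atomic_read_u32(state);
            continue;
        }
        /* on failure the CAS stores the current value into "old" */
        if (pg_atomic_compare_exchange_u32(state, &old,
                                           old | MY_FLAG_WAITLIST_LOCKED))
            return true;
    }
    return false;   /* the lock was released meanwhile */
}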
Re: [HACKERS] Fix performance of generic atomics
Hi Jeff, On 09/05/2017 03:47 PM, Jeff Janes wrote: I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using both logged and unlogged tables. Also ran an internal benchmark which didn't show anything either. What scale factor and client count? How many cores per socket? It looks like Sokolov was just starting to see gains at 200 clients on 72 cores, using -N transaction. I have done a run with scale factor 300, and another with 3000 on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD machine; up to 200 clients. I would consider the runs as "noise" as I'm seeing +-1% for all client counts, so nothing like Yura is seeing in [1] for the higher client counts. I did a run with -N too using scale factor 300, using the settings in [1], but with same result (+-1%). [1] https://www.postgresql.org/message-id/d62d7d9d473d07e172d799d5a57e7...@postgrespro.ru Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Hi, On 05/25/2017 11:12 AM, Sokolov Yura wrote: I agree that the lonely semicolon looks bad. Applied your suggestion for the empty loop body (/* skip */). The patch in the first letter had while(true), but I removed it because I think it is uglier: - `while(true)` was necessary for grouping the read with the `if`, - but now there is a single statement in the loop body and it is the condition for loop exit, so it is clearly just a loop. The optimization is valid because compare_exchange always stores the old value in the `old` variable in the same atomic manner as an atomic read. I have tested this patch on a 2-socket machine, but don't see any performance change in the various runs. However, there is no regression in any case either. As such, I have marked the entry "Ready for Committer". Remember to add a version postfix to your patches such that it is easy to identify which is the latest version. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
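For reference, the loop shape being discussed, applied to a generic fetch-or. This is a simplified sketch of the pattern in src/include/port/atomics/generic.h, written against the public pg_atomic_* API rather than the _impl functions the patch actually touches; the value is read once up front, and every failed compare-exchange already stores the current value into "old", so no separate atomic re-read is needed inside the loop:

#include "postgres.h"
#include "port/atomics.h"

/* Sketch of the discussed loop; my_atomic_fetch_or_u32 is a made-up name. */
static uint32
my_atomic_fetch_or_u32(pg_atomic_uint32 *ptr, uint32 or_)
{
    uint32 old = pg_atomic_read_u32(ptr);

    /* on failure, compare-exchange refreshes "old" atomically */
    while (!pg_atomic_compare_exchange_u32(ptr, &old, old | or_))
        /* skip */ ;

    return old;     /* value before the OR, as fetch-or requires */
}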
Re: [HACKERS] Fix performance of generic atomics
On 09/05/2017 02:24 PM, Tom Lane wrote: Jesper Pedersen writes: I have tested this patch on a 2-socket machine, but don't see any performance change in the various runs. However, there is no regression either in all cases. Hm, so if we can't demonstrate a performance win, it's hard to justify risking touching this code. What test case(s) did you use? I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using both logged and unlogged tables. Also ran an internal benchmark which didn't show anything either. Setting the entry back to "Needs Review" for additional feedback from others. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Scan Mode in Hash Index
On 08/24/2017 01:21 AM, Ashutosh Sharma wrote: Done. Attached are the patches with above changes. Thanks ! Based on the feedback in this thread, I have moved the patch to "Ready for Committer". Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Scan Mode in Hash Index
On 08/23/2017 07:38 AM, Amit Kapila wrote: Thanks for the new version. I again looked at the patches and fixed quite a few comments in the code and ReadMe. You have forgotten to update README for the changes in vacuum patch (0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I don't have anything more to add. If you are okay with changes, then we can move it to Ready For Committer unless someone else has some more comments. Just some minor comments. README: + it's pin till the end of scan) its pin till the end of the scan) +To minimize lock/unlock traffic, hash index scan always searches entire hash To minimize lock/unlock traffic, hash index scan always searches the entire hash hashsearch.c: +static inline void _hash_saveitem(HashScanOpaque so, int itemIndex, + OffsetNumber offnum, IndexTuple itup); There are other instances of "inline" in the code base, so I guess that this is ok. +* Advance to next tuple on current page; or if there's no more, try to Advance to the next tuple on the current page; or if done, try to Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
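As a rough sketch of the page-at-a-time pattern discussed in this thread (collect all matching tuples from a page while the content lock is held, then hand them back one by one while only the pin is kept), with simplified, made-up types; the real patch uses HashScanOpaque and _hash_saveitem() as quoted above:

#include <stdbool.h>
#include <stdint.h>

#define MY_MAX_ITEMS 407    /* hypothetical cap, enough for one page */

typedef struct MyHashScanPos
{
    int      nitems;        /* matches saved from the current page */
    int      itemIndex;     /* next item to hand back */
    uint16_t offsets[MY_MAX_ITEMS]; /* page offsets of the matches */
} MyHashScanPos;

/* Called per matching tuple while the page content lock is held. */
static void
my_save_item(MyHashScanPos *pos, uint16_t offnum)
{
    pos->offsets[pos->nitems++] = offnum;
}

/* Called with only the pin held; no page re-read is needed. */
static bool
my_next_item(MyHashScanPos *pos, uint16_t *offnum)
{
    if (pos->itemIndex >= pos->nitems)
        return false;       /* page exhausted; move to the next page */
    *offnum = pos->offsets[pos->itemIndex++];
    return true;
}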
Re: [HACKERS] dtrace probes
On 04/20/2017 10:30 AM, Jesper Pedersen wrote: I think this fix is harmless and has some value in terms of consistency. One minor suggestion is that you should leave a space after typecasting. - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode); There should be a space like "(int) mode". v2 attached. I managed to attach the same patch again, so here is v3. Best regards, Jesper From 0d964df84950ca90c08ed6dd77a575d4b70ea7db Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 18 Apr 2017 11:44:18 -0400 Subject: [PATCH] Explicit cast LWLockMode, which an enum, to int in order to match the dtrace definition of the lwlock methods. Thereby all call sites will have the same definition instead of a mix between signed and unsigned. Author: Jesper Pedersen --- src/backend/storage/lmgr/lwlock.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 3e13394..c551be2 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -1226,7 +1226,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) #endif LWLockReportWaitStart(lock); - TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), (int) mode); for (;;) { @@ -1248,7 +1248,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) } #endif - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int) mode); LWLockReportWaitEnd(); LOG_LWDEBUG("LWLockAcquire", lock, "awakened"); @@ -1257,7 +1257,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) result = false; } - TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), (int) mode); /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks].lock = lock; @@ -1308,14 +1308,14 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode) RESUME_INTERRUPTS(); LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed"); - TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), (int) mode); } else { /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks].lock = lock; held_lwlocks[num_held_lwlocks++].mode = mode; - TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), (int) mode); } return !mustwait; } @@ -1387,7 +1387,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) #endif LWLockReportWaitStart(lock); - TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), (int) mode); for (;;) { @@ -1405,7 +1405,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) Assert(nwaiters < MAX_BACKENDS); } #endif - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int) mode); LWLockReportWaitEnd(); LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened"); @@ -1435,7 +1435,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) /* Failed to get lock, so release interrupt holdoff */ RESUME_INTERRUPTS(); LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed"); - TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), (int) mode); } else { @@ -1443,7 +1443,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) /* Add lock to list of locks held by this backend */ 
held_lwlocks[num_held_lwlocks].lock = lock; held_lwlocks[num_held_lwlocks++].mode = mode; - TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), (int) mode); } return !mustwait; -- 2.7.4 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dtrace probes
Hi, On 04/20/2017 09:24 AM, Amit Kapila wrote: The lwlock dtrace probes define LWLockMode as int, and the TRACE_POSTGRESQL_LWLOCK methods are called using both a variable and constant definition. This leads to a mix of argument definitions depending on the call site, as seen in probes.txt file. A fix is to explicit cast 'mode' to int such that all call sites will use the argument #2 4 signed bytes definition. Attached patch does this. I think this fix is harmless and has some value in terms of consistency. One minor suggestion is that you should leave a space after typecasting. - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode); There should be a space like "(int) mode". v2 attached. I have verified all dtraces probes for their type, and only the lock__ methods doesn't aligned with its actual types. Do you see any problem with that? Not really, but it would be more clear what the value space of each of the parameters were. Depending on the feedback I can add this patch to the open item list in order to fix it for PostgreSQL 10. Is there any commit in PG-10 which has caused this behavior? If not, then I don't think it should be added to open items of PG-10. It is really a bug fix, so it could even be back patched. Thanks for the feedback ! Best regards, Jesper >From 7175dc8e05ff703229bd6cab6b254587ffc076c8 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 18 Apr 2017 11:44:18 -0400 Subject: [PATCH] Explicit cast LWLockMode, which an enum, to int in order to match the dtrace definition of the lwlock methods. Thereby all call sites will have the same definition instead of a mix between signed and unsigned. Author: Jesper Pedersen --- src/backend/storage/lmgr/lwlock.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 3e13394..abf5fbb 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -1226,7 +1226,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) #endif LWLockReportWaitStart(lock); - TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), (int)mode); for (;;) { @@ -1248,7 +1248,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) } #endif - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode); LWLockReportWaitEnd(); LOG_LWDEBUG("LWLockAcquire", lock, "awakened"); @@ -1257,7 +1257,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) result = false; } - TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), (int)mode); /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks].lock = lock; @@ -1308,14 +1308,14 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode) RESUME_INTERRUPTS(); LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed"); - TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), (int)mode); } else { /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks].lock = lock; held_lwlocks[num_held_lwlocks++].mode = mode; - TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), (int)mode); } return !mustwait; } @@ -1387,7 +1387,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) #endif LWLockReportWaitStart(lock); - 
TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), (int)mode); for (;;) { @@ -1405,7 +1405,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) Assert(nwaiters < MAX_BACKENDS); } #endif - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode); LWLockReportWaitEnd(); LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened"); @@ -1435,7 +1435,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) /* Failed to get lock, so release interrupt holdoff */ RESUME_INTERRUPTS(); LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed"); - TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), (int)mode); } else { @@ -1443,7 +1443,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks].lock = lock; held_lwlocks[num_held_lwlocks++].mode = mode; - TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), (int)mode); }
[HACKERS] dtrace probes
Hi, The lwlock dtrace probes define LWLockMode as int, and the TRACE_POSTGRESQL_LWLOCK methods are called using both a variable and constant definition. This leads to a mix of argument definitions depending on the call site, as seen in probes.txt file. A fix is to explicit cast 'mode' to int such that all call sites will use the argument #2 4 signed bytes definition. Attached patch does this. I have verified all dtraces probes for their type, and only the lock__ methods doesn't aligned with its actual types. However, that would require a change to probes.d, and therefore PostgreSQL 11 material. Depending on the feedback I can add this patch to the open item list in order to fix it for PostgreSQL 10. Best regards, Jesper >From d6f5c119c057c7ff8c84f88bb4122a1ca245a7d4 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 18 Apr 2017 11:44:18 -0400 Subject: [PATCH] Explicit cast LWLockMode, which an enum, to int in order to match the dtrace definition of the lwlock methods. Thereby all call sites will have the same definition instead of a mix between signed and unsigned. Author: Jesper Pedersen --- src/backend/storage/lmgr/lwlock.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 3e13394..abf5fbb 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -1226,7 +1226,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) #endif LWLockReportWaitStart(lock); - TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), (int)mode); for (;;) { @@ -1248,7 +1248,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) } #endif - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode); LWLockReportWaitEnd(); LOG_LWDEBUG("LWLockAcquire", lock, "awakened"); @@ -1257,7 +1257,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode) result = false; } - TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), (int)mode); /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks].lock = lock; @@ -1308,14 +1308,14 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode) RESUME_INTERRUPTS(); LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed"); - TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), (int)mode); } else { /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks].lock = lock; held_lwlocks[num_held_lwlocks++].mode = mode; - TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), (int)mode); } return !mustwait; } @@ -1387,7 +1387,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) #endif LWLockReportWaitStart(lock); - TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), (int)mode); for (;;) { @@ -1405,7 +1405,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) Assert(nwaiters < MAX_BACKENDS); } #endif - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode); LWLockReportWaitEnd(); LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened"); @@ -1435,7 +1435,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) /* Failed to get lock, so release interrupt holdoff */ RESUME_INTERRUPTS(); LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed"); - 
TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), (int)mode); } else { @@ -1443,7 +1443,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks].lock = lock; held_lwlocks[num_held_lwlocks++].mode = mode; - TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode); + TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), (int)mode); } return !mustwait; -- 2.7.4 /usr/local/bin/postgres postgresql:clog__checkpoint__start [sema 0xe2351a] location #1 0x4fc706 argument #1 1 signed bytes @ 0 location #2 0x4fc72d argument #1 1 signed bytes @ 1 /usr/local/bin/postgres postgresql:clog__checkpoint__done [sema 0xe2351c] location #1 0x4fc725 argument #1 1 signed bytes @ 0 location #2 0x4fc74c argument #1 1 signed bytes @ 1 /usr/local/bin/postgres postgresql:multixact__checkpoint__start [sema 0xe23522] location #1 0x500d07 argument #1 1 signed bytes @ 0 location #2 0x500dbc argument #1 1 signed bytes @ 1 /usr/local/bin/postgres postgresql:mu
Re: [HACKERS] bumping HASH_VERSION to 3
On 03/31/2017 02:17 PM, Robert Haas wrote: Starting a new thread about this to get more visibility. Despite the extensive work that has been done on hash indexes this release, we have thus far not made any change to the on-disk format that is not nominally backward-compatible. Commit 293e24e507838733aba4748b514536af2d39d7f2 did make a change for new hash indexes, but included backward-compatibility code so that old indexes would continue to work. However, I'd like to also commit Mithun Cy's patch to expand hash indexes more gradually -- latest version in http://postgr.es/m/cad__oujd-ibxm91zcqziayftwqjxnfqgmv361v9zke83s6i...@mail.gmail.com -- and that's not backward-compatible. It would be possible to write code to convert the old metapage format to the new metapage format introduced by that patch, and it wouldn't be very hard, but I think it would be better to NOT do that, and instead force everybody upgrading to v10 to rebuild all of their hash indexes. If we don't do that, then we'll never know whether instances of hash index corruption reported against v10 or higher are caused by defects in the new code, because there's always the chance that the hash index could have been built on a pre-v10 version, got corrupted because of the lack of WAL-logging, and then been brought up to v10+ via pg_upgrade. Forcing a reindex in v10 kills three birds with one stone: - No old, not logged, possibly corrupt hash indexes floating around after an upgrade to v10. - Can remove the backward-compatibility code added by 293e24e507838733aba4748b514536af2d39d7f2 instead of keeping it around forever. - No need to worry about doing an in-place upgrade of the metapage for the above-mentioned patch. Thoughts? +1 Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Scan Mode in Hash Index
Hi Ashutosh, On 03/29/2017 09:16 PM, Ashutosh Sharma wrote: This patch needs a rebase. Please try applying these patches on top of [1]. I think you should be able to apply it cleanly. Sorry, I think I forgot to mention this point in my earlier mail. [1] - https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com Thanks, that works ! As you have provided a patch for Robert's comments, and no other reviews have been posted, I'm moving this patch to "Ready for Committer" for additional committer feedback. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, On 03/27/2017 09:34 AM, Ashutosh Sharma wrote: Hi, I think you should consider refactoring this so that it doesn't need to use goto. Maybe move the while (offnum <= maxoff) logic into a helper function and have it return itemIndex. If itemIndex == 0, you can call it again. okay, Added a helper function for _hash_readpage(). Please check v4 patch attached with this mail. >> This is not a full review, but I'm out of time for the moment. No worries. I will be ready for your further review comments any time. Thanks for the review. This patch needs a rebase. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] A better way to expand hash indexes.
Hi Mithun, On 03/26/2017 01:56 AM, Mithun Cy wrote: Thanks, Amit for the review. On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila wrote: I think the one-dimensional patch has fewer places to touch, so that looks better to me. However, I think there is still hard coding and assumptions in code which we should try to improve. Great! I will continue with the spares 1-dimensional improvement. I ran some performance scenarios on the patch to see if the increased 'spares' allocation had an impact. I haven't found any regressions in that regard. Attached patch contains some small fixes, mainly to the documentation - on top of v7. Best regards, Jesper >From 5545e48ab7136f17b3d471e0ee679a6db6040865 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Mon, 27 Mar 2017 14:15:00 -0400 Subject: [PATCH] Small fixes --- src/backend/access/hash/README | 50 +++--- src/backend/access/hash/hashpage.c | 26 ++-- src/backend/access/hash/hashutil.c | 24 +- 3 files changed, 50 insertions(+), 50 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 6721ee1..ca46de7 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -58,50 +58,50 @@ rules to support a variable number of overflow pages while not having to move primary bucket pages around after they are created. Primary bucket pages (henceforth just "bucket pages") are allocated in -power-of-2 groups, called "split points" in the code. That means on every new -split point we double the existing number of buckets. And, it seems bad to -allocate huge chunks of bucket pages all at once and we take ages to consume it. -To avoid this exponential growth of index size, we did a trick to breakup -allocation of buckets at splitpoint into 4 equal phases. If 2^x is the total -buckets need to be allocated at a splitpoint (from now on we shall call this -as splitpoint group), then we allocate 1/4th (2^(x - 2)) of total buckets at -each phase of splitpoint group. Next quarter of allocation will only happen if +power-of-2 groups, called "split points" in the code. That means at every new +split point we double the existing number of buckets. Allocating huge chunks +of bucket pages all at once isn't optimal and we will take ages to consume those. +To avoid this exponential growth of index size, we used a trick to break up +allocation of buckets at the split points into 4 equal phases. If 2^x is the total +buckets needed to be allocated at a split point (from now on we shall call this +a split point group), then we allocate 1/4th (2^(x - 2)) of total buckets at +each phase of the split point group. Next quarter of allocation will only happen if buckets of previous phase has been already consumed. Since for buckets number -< 4 we cannot further divide it in to multiple phases, the first splitpoint +< 4 we cannot further divide it into multiple phases, the first split point group 0's allocation is done as follows {1, 1, 1, 1} = 4 buckets in total, the numbers in curly braces indicate number of buckets allocated within each phase -of splitpoint group 0. In next splitpoint group 1 the allocation phases will -be as follow {1, 1, 1, 1} = 8 buckets in total. And, for splitpoint group 2 +of split point group 0. In next split point group 1 the allocation phases will +be as follows {1, 1, 1, 1} = 8 buckets in total. And, for split point group 2 and 3 allocation phase will be {2, 2, 2, 2} = 16 buckets in total and -{4, 4, 4, 4} = 32 buckets in total.
We can see that at each splitpoint group -we double the total number of buckets from previous group but in a incremental -phase. The bucket pages allocated within one phase of a splitpoint group will +{4, 4, 4, 4} = 32 buckets in total. We can see that at each split point group +we double the total number of buckets from previous group but in an incremental +phase. The bucket pages allocated within one phase of a split point group will appear consecutively in the index. This addressing scheme allows the physical location of a bucket page to be computed from the bucket number relatively easily, using only a small amount of control information. If we look at the -function _hash_spareindex for a given bucket number we first compute splitpoint -group it belongs to and then the phase with in it to which the bucket belongs -to. Adding them we get the global splitpoint phase number S to which the +function _hash_spareindex for a given bucket number we first compute the split point +group it belongs to and then the phase to which the bucket belongs +to. Adding them we get the global split point phase number S to which the bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the metapage) with given bucket number to compute its physical address. The hashm_spares[S] can be interpreted as the total number of overflow pages that have been
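To make the phased allocation described above concrete, a small standalone sketch of the arithmetic (phase_alloc() is my own illustrative helper, not the patched _hash_spareindex()):

#include <stdio.h>

/*
 * Buckets allocated by one phase of split point group g: groups 0 and 1
 * use {1, 1, 1, 1}; group g >= 2 uses four phases of 2^(g - 1) buckets.
 */
static unsigned int
phase_alloc(unsigned int group)
{
    return (group <= 1) ? 1 : 1u << (group - 1);
}

int
main(void)
{
    unsigned int total = 0;

    for (unsigned int g = 0; g < 4; g++)
    {
        for (int phase = 0; phase < 4; phase++)
            total += phase_alloc(g);
        printf("group %u: %u buckets in total\n", g, total);
    }
    /* prints 4, 8, 16, 32 -- the total doubles per group, in quarters */
    return 0;
}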
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, On 03/23/2017 02:11 PM, Ashutosh Sharma wrote: On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen wrote: 0001v2: In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else' part, and do the assignment inside if (so->numKilled < MaxIndexTuplesPerPage) instead. Done. Please have a look into the attached v3 patch. No new comments for 0002 and 0003. okay. Thanks. I'll keep the entry in 'Needs Review' if Alexander, or others, want to add their feedback. (Best to post the entire patch series each time) Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, On 03/22/2017 09:32 AM, Ashutosh Sharma wrote: Done. Please refer to the attached v2 version of the patch. Thanks. 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this patch rewrites the hash index scan module to work in page-at-a-time mode. It basically introduces two new functions -- _hash_readpage() and _hash_saveitem(). The former is used to load all the qualifying tuples from a target bucket or overflow page into an items array. The latter one is used by _hash_readpage to save all the qualifying tuples found in a page into an items array. Apart from that, this patch basically cleans up _hash_first(), _hash_next() and hashgettuple(). 0001v2: In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else' part, and do the assignment inside if (so->numKilled < MaxIndexTuplesPerPage) instead. No new comments for 0002 and 0003. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
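The 0001v2 comment above, sketched out for clarity (a hypothetical shape of the hashgettuple() branch; the names follow the patch series and may differ in detail):

/*
 * Sketch of the suggestion: only the kill-tuple path needs currItem, so
 * do the assignment inside the guarded block rather than in both
 * branches of the surrounding if/else.
 */
if (so->numKilled < MaxIndexTuplesPerPage)
{
    HashScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];

    /* remember it so a later pass can mark the entry LP_DEAD */
    so->killedItems[so->numKilled++] = *currItem;
}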
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, On 02/14/2017 12:27 AM, Ashutosh Sharma wrote: Currently, Hash Index scan works tuple-at-a-time, i.e. for every qualifying tuple in a page, it acquires and releases the lock, which eventually increases the lock/unlock traffic. For example, if an index page contains 100 qualified tuples, the current hash index scan has to acquire and release the lock 100 times to read those qualified tuples, which is not good from a performance perspective, and it also impacts concurrency with VACUUM. Considering the above points, I would like to propose a patch that allows hash index scan to work in page-at-a-time mode. In page-at-a-time mode, once a lock is acquired on a target bucket page, the entire page is scanned and all the qualified tuples are saved into the backend's local memory. This reduces the lock/unlock calls for retrieving tuples from a page. Moreover, it also eliminates the problem of re-finding the position of the last returned index tuple and, more importantly, it allows VACUUM to release the lock on the current page before moving to the next page, which eventually improves its concurrency with scans. Attached patch modifies hash index scan code for page-at-a-time mode. For better readability, I have split it into 3 parts. Due to the commits on master these patches apply with hunks. The README should be updated to mention the use of page scan. hash.h needs pg_indent. 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this patch rewrites the hash index scan module to work in page-at-a-time mode. It basically introduces two new functions -- _hash_readpage() and _hash_saveitem(). The former is used to load all the qualifying tuples from a target bucket or overflow page into an items array. The latter one is used by _hash_readpage to save all the qualifying tuples found in a page into an items array. Apart from that, this patch basically cleans up _hash_first(), _hash_next() and hashgettuple(). For _hash_next I don't see this - can you explain ? + * + * On failure exit (no more tuples), we release pin and set + * so->currPos.buf to InvalidBuffer. + * Returns true if any matching items are found else returns false. s/Returns/Return/g 2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch: this patch basically removes the redundant function _hash_step() and some of the unused members of the HashScanOpaqueData structure. Looks good. 3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch: this patch basically improves the locking strategy for VACUUM in hash index. As the new hash index scan works in page-at-a-time, vacuum can release the lock on the previous page before acquiring a lock on the next page, hence, improving hash index concurrency. +* As the new hash index scan work in page at a time mode, Remove 'new'. I have also done the benchmarking of this patch and would like to share the results for the same. Firstly, I have done the benchmarking with non-unique values and I could see a performance improvement of 4-7%. For the detailed results please find the attached file 'results-non-unique values-70ff', and ddl.sql, test.sql are the test scripts used in this experimentation. The details of the non-default GUC params and the pgbench command are mentioned in the result sheet. I also did the benchmarking with unique values at 300 and 1000 scale factors and its results are provided in 'results-unique-values-default-ff'. I'm seeing similar results, and especially with write heavy scenarios.
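For reviewers following along, the control flow under discussion looks roughly like this (a schematic of the approach, not the patch's actual _hash_readpage(); _hash_saveitem() is the patch's helper, the rest are existing APIs):

/*
 * Page-at-a-time reading as described above: lock once per page, copy
 * every qualifying tuple into backend-local memory, unlock, and only
 * then hand tuples back one by one.  Tuple-at-a-time scanning instead
 * pays a lock/unlock cycle per qualifying tuple and blocks VACUUM while
 * a scan position is held.
 */
LockBuffer(buf, BUFFER_LOCK_SHARE);
page = BufferGetPage(buf);
maxoff = PageGetMaxOffsetNumber(page);

for (offnum = FirstOffsetNumber; offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    IndexTuple  itup = (IndexTuple)
        PageGetItem(page, PageGetItemId(page, offnum));

    if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
        _hash_saveitem(so, itemIndex++, offnum, itup);  /* copy to local array */
}

LockBuffer(buf, BUFFER_LOCK_UNLOCK);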
Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
On 02/03/2017 11:41 AM, Jesper Pedersen wrote: contrib/pageinspect actually seems to lack *any* infrastructure for sharing functions across modules. It's time to remedy that. I propose inventing "pageinspect.h" to hold commonly visible declarations, and moving get_page_from_raw() into rawpage.c, which seems like a reasonably natural home for it. (Alternatively, we could invent a pageinspect.c file to hold utility functions, but that might be overkill.) No objections. Attached is v1 of this w/ verify_hash_page() using get_page_from_raw(). Sorry for all the problems. Disregard, as Tom has committed a fix. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi, On 02/03/2017 11:36 AM, Robert Haas wrote: On Fri, Feb 3, 2017 at 10:11 AM, Tom Lane wrote: BTW, the buildfarm is still crashing on ia64 and sparc64 members. I believe this is the same problem addressed in 84ad68d64 for pageinspect's GIN functions, to wit, the payload of a "bytea" is not maxaligned, so machines that care about alignment won't be happy when trying to fetch 64-bit values out of a bytea page image. Clearly, the fix should be to start using the get_page_from_raw() function that Peter introduced in that patch. But it's static in ginfuncs.c, which I thought was a mistake at the time, and it's proven so now. contrib/pageinspect actually seems to lack *any* infrastructure for sharing functions across modules. It's time to remedy that. I propose inventing "pageinspect.h" to hold commonly visible declarations, and moving get_page_from_raw() into rawpage.c, which seems like a reasonably natural home for it. (Alternatively, we could invent a pageinspect.c file to hold utility functions, but that might be overkill.) No objections. Attached is v1 of this w/ verify_hash_page() using get_page_from_raw(). Sorry for all the problems. Best regards, Jesper >From d674c58098580ec59f9e1e47255c68b8497178b1 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 3 Feb 2017 11:28:19 -0500 Subject: [PATCH] Add pageinspect.h --- contrib/pageinspect/brinfuncs.c | 1 + contrib/pageinspect/btreefuncs.c | 1 + contrib/pageinspect/fsmfuncs.c| 1 + contrib/pageinspect/ginfuncs.c| 21 +- contrib/pageinspect/hashfuncs.c | 3 +- contrib/pageinspect/pageinspect.h | 61 +++ contrib/pageinspect/rawpage.c | 25 7 files changed, 92 insertions(+), 21 deletions(-) create mode 100644 contrib/pageinspect/pageinspect.h diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c index 7b877e3..23136d2 100644 --- a/contrib/pageinspect/brinfuncs.c +++ b/contrib/pageinspect/brinfuncs.c @@ -9,6 +9,7 @@ */ #include "postgres.h" +#include "pageinspect.h" #include "access/htup_details.h" #include "access/brin.h" #include "access/brin_internal.h" diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c index 3f09d5f..e63b5fb 100644 --- a/contrib/pageinspect/btreefuncs.c +++ b/contrib/pageinspect/btreefuncs.c @@ -27,6 +27,7 @@ #include "postgres.h" +#include "pageinspect.h" #include "access/nbtree.h" #include "catalog/namespace.h" #include "catalog/pg_am.h" diff --git a/contrib/pageinspect/fsmfuncs.c b/contrib/pageinspect/fsmfuncs.c index 6541646..372de39 100644 --- a/contrib/pageinspect/fsmfuncs.c +++ b/contrib/pageinspect/fsmfuncs.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "pageinspect.h" #include "funcapi.h" #include "lib/stringinfo.h" #include "miscadmin.h" diff --git a/contrib/pageinspect/ginfuncs.c b/contrib/pageinspect/ginfuncs.c index cea77d3..5cfcbcb 100644 --- a/contrib/pageinspect/ginfuncs.c +++ b/contrib/pageinspect/ginfuncs.c @@ -9,6 +9,7 @@ */ #include "postgres.h" +#include "pageinspect.h" #include "access/gin.h" #include "access/gin_private.h" #include "access/htup_details.h" @@ -29,26 +30,6 @@ PG_FUNCTION_INFO_V1(gin_page_opaque_info); PG_FUNCTION_INFO_V1(gin_leafpage_items); -static Page -get_page_from_raw(bytea *raw_page) -{ - int raw_page_size; - Page page; - - raw_page_size = VARSIZE(raw_page) - VARHDRSZ; - if (raw_page_size < BLCKSZ) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("input page too small (%d bytes)", raw_page_size))); - - /* make a copy so that the page is properly aligned for struct access */ - page = 
palloc(raw_page_size); - memcpy(page, VARDATA(raw_page), raw_page_size); - - return page; -} - - Datum gin_metapage_info(PG_FUNCTION_ARGS) { diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c index 83b4698..09bb9ee 100644 --- a/contrib/pageinspect/hashfuncs.c +++ b/contrib/pageinspect/hashfuncs.c @@ -10,6 +10,7 @@ #include "postgres.h" +#include "pageinspect.h" #include "access/hash.h" #include "access/htup_details.h" #include "catalog/pg_type.h" @@ -67,7 +68,7 @@ verify_hash_page(bytea *raw_page, int flags) errdetail("Expected size %d, got %d", BLCKSZ, raw_page_size))); - page = VARDATA(raw_page); + page = get_page_from_raw(raw_page); if (PageIsNew(page)) ereport(ERROR, diff --git a/contrib/pageinspect/pageinspect.h b/contrib/pageinspect/pageinspect.h new file mode 100644 index 000..9ec3aad --- /dev/null +++ b/contrib/pageinspect/pageinspect.h @@ -0,0 +1,61 @@ +/* + * pageinspect.h + * Functions to investigate the content of indexes + * + * Copyright (c) 2017, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/pageinspect.h + */ +#ifndef PAGEINSPECT_H +#define PAGEINSPECT_H + +#include "postgres.h" + +#include "funcapi.h" + +// Raw page functions + +extern Datum get_raw_page(P
Re: [HACKERS] pageinspect: Hash index support
On 02/02/2017 02:28 PM, Jesper Pedersen wrote: On 02/02/2017 02:24 PM, Robert Haas wrote: So, committed. Wow, I wish every patch had this many reviewers. Thanks Robert ! This message should have included a thank you to everybody who provided valuable feedback for this feature, and for that I'm sorry. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
On 02/02/2017 02:24 PM, Robert Haas wrote: So, committed. Wow, I wish every patch had this many reviewers. Thanks Robert ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
On 01/23/2017 02:53 PM, Jesper Pedersen wrote: I have done some more testing with this, and have moved the patch back to 'Needs Review' pending Amit's comments. Moved to "Ready for Committer". Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi Ashutosh, On 01/20/2017 03:24 PM, Jesper Pedersen wrote: Yeah, those are the steps; just with a Skylake laptop. However, I restarted with a fresh master, with WAL v8 and MV v5, and can't reproduce the issue. I have done some more testing with this, and have moved the patch back to 'Needs Review' pending Amit's comments. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi Ashutosh, On 01/20/2017 04:18 AM, Ashutosh Sharma wrote: Okay, thanks for confirming that. I would like to update you that I am not able to reproduce this issue at my end. I suspect that the steps I am following might be slightly different from yours. Could you please have a look at the steps mentioned below and confirm if there is something different that I am doing? Firstly, I am running the test case on the following git commit on HEAD: commit ba61a04bc7fefeee03416d9911eb825c4897c223 Author: Tom Lane Date: Thu Jan 19 19:52:13 2017 -0500 Avoid core dump for empty prepared statement in an aborted transaction. Brown-paper-bag bug in commit ab1f0c822: the old code here coped with null CachedPlanSource.raw_parse_tree, the new code not so much. Per report from Dave Cramer. On top of the above commit, I have applied the WAL v8 patch for hash index and the MV v5 patch. Now, with an --enable-cassert build I am following the steps below: 1) Created a 'test' database 2) psql -d test -f ~/ddl.sql where ddl.sql is, -- ddl.sql -- CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val; CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id); CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val); ANALYZE; -- ddl.sql -- 3) pgbench -M prepared -c 10 -j 10 -T 1800 -f ~/test.sql test where test.sql is, -- test.sql -- \set id random(1,10) \set val random(0,10) BEGIN; UPDATE test SET val = :val WHERE id = :id; COMMIT; -- test.sql -- Machine details are as follows: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 8 Also, it would be great if you could confirm whether you have been getting this issue repeatedly. Thanks. Yeah, those are the steps; just with a Skylake laptop. However, I restarted with a fresh master, with WAL v8 and MV v5, and can't reproduce the issue. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi Ashutosh, On 01/10/2017 08:40 AM, Jesper Pedersen wrote: On 01/10/2017 05:24 AM, Ashutosh Sharma wrote: Thanks for reporting this problem. It is basically coming because i forgot to unpin the bucketbuf in hash_xlog_vacuum_one_page(). Please find the attached v5 patch that fixes the issue. The crash is now fixed, but the --- test.sql --- \set id random(1,10) \set val random(0,10) BEGIN; UPDATE test SET val = :val WHERE id = :id; COMMIT; --- test.sql --- case gives client 6 aborted in command 3 of script 0; ERROR: deadlock detected DETAIL: Process 14608 waits for ShareLock on transaction 1444620; blocked by process 14610. Process 14610 waits for ShareLock on transaction 1444616; blocked by process 14608. HINT: See server log for query details. CONTEXT: while rechecking updated tuple (12,3) in relation "test" ... using pgbench -M prepared -c 10 -j 10 -T 300 -f test.sql test I'm not seeing this deadlock with just the WAL v8 patch applied. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi, On 01/18/2017 04:54 AM, Ashutosh Sharma wrote: Is there a reason for keeping the input arguments for hash_bitmap_info() different from hash_page_items()? Yes, there are two reasons behind it. Firstly, we need metapage to identify the bitmap page that holds the information about the overflow page passed as an input to this function. Secondly, we will have to input overflow block number as an input to this function so as to determine the overflow bit number which can be used further to identify the bitmap page. + /* Read the metapage so we can determine which bitmap page to use */ + metabuf = _hash_getbuf(indexRel, HASH_METAPAGE, HASH_READ, LH_META_PAGE); + metap = HashPageGetMeta(BufferGetPage(metabuf)); + + /* Identify overflow bit number */ + ovflbitno = _hash_ovflblkno_to_bitno(metap, ovflblkno); + + bitmappage = ovflbitno >> BMPG_SHIFT(metap); + bitmapbit = ovflbitno & BMPG_MASK(metap); + + if (bitmappage >= metap->hashm_nmaps) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("invalid overflow bit number %u", ovflbitno))); + + bitmapblkno = metap->hashm_mapp[bitmappage]; As discussed off-list (along with my patch that included some of these fixes), it would require calculation from the user to be able to provide the necessary pages through get_raw_page(). Other ideas on how to improve this are most welcome. Apart from above comments, the attached patch also handles all the comments from Mithun. Please, include a version number for your patch files in the future. Fixed in this version: * verify_hash_page: Display magic in hex, like hash_metapage_info * Update header for hash_page_type Moving the patch back to 'Needs Review'. Best regards, Jesper >From 8e5a68cfaf979bcab88fb9358b88a89bc780a277 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Wed, 18 Jan 2017 08:42:02 -0500 Subject: [PATCH] Add support for hash index in pageinspect contrib module v15 Authors: Ashutosh Sharma and Jesper Pedersen. 
--- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 574 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 76 contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 148 +++ src/backend/access/hash/hashovfl.c| 8 +- src/include/access/hash.h | 1 + 7 files changed, 810 insertions(+), 9 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index 87a28e9..052a8e1 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.5.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" REGRESS = page btree brin gin diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..f157fcc --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,574 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2017, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/htup_details.h" +#include "catalog/pg_am.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "utils/builtins.h" + +PG_FUNCTION_INFO_V1(hash_page_type); +PG_FUNCTION_INFO_V1(hash_page_stats); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_bitmap_info); +PG_FUNCTION_INFO_V1(hash_metapage_info); + +#define IS_HASH(r) ((r)->rd_rel->relam == HASH_AM_OID) + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 free_size; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPage
Re: [HACKERS] Write Ahead Logging for Hash Indexes
On 12/27/2016 01:58 AM, Amit Kapila wrote: After recent commits 7819ba1e and 25216c98, this patch requires a rebase. Attached is the rebased patch. This needs a rebase after commit e898437. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi, On 01/11/2017 03:16 PM, Ashutosh Sharma wrote: I have rephrased it to make it more clear. Rebased, and removed the compile warn in hashfuncs.c Best regards, Jesper >From 8a07230b89b97280f0f1d645145da1fd140969c6 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Thu, 12 Jan 2017 14:00:45 -0500 Subject: [PATCH] Add support for hash index in pageinspect contrib module v13 Authors: Ashutosh Sharma and Jesper Pedersen. --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 548 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 76 contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 148 +++ src/backend/access/hash/hashovfl.c| 52 +-- src/include/access/hash.h | 1 + 7 files changed, 806 insertions(+), 31 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index 87a28e9..d7f3645 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.5.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" REGRESS = page btree brin gin diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..25e9641 --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,548 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2017, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/htup_details.h" +#include "catalog/pg_am.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "utils/builtins.h" + +PG_FUNCTION_INFO_V1(hash_page_type); +PG_FUNCTION_INFO_V1(hash_page_stats); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_bitmap_info); +PG_FUNCTION_INFO_V1(hash_metapage_info); + +#define IS_HASH(r) ((r)->rd_rel->relam == HASH_AM_OID) + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 free_size; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPageStat; + + +/* + * Verify that the given bytea contains a HASH page, or die in the attempt. + * A pointer to the page is returned. 
+ */ +static Page +verify_hash_page(bytea *raw_page, int flags) +{ + Page page; + int raw_page_size; + HashPageOpaque pageopaque; + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (raw_page_size != BLCKSZ) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small"), + errdetail("Expected size %d, got %d", + BLCKSZ, raw_page_size))); + + page = VARDATA(raw_page); + + if (PageIsNew(page)) + ereport(ERROR, +(errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index table contains unexpected zero page"))); + + if (PageGetSpecialSize(page) != MAXALIGN(sizeof(HashPageOpaqueData))) + ereport(ERROR, +(errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index table contains corrupted page"))); + + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + if (pageopaque->hasho_page_id != HASHO_PAGE_ID) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a HASH page"), + errdetail("Expected %08x, got %08x.", + HASHO_PAGE_ID, pageopaque->hasho_page_id))); + + if (!(pageopaque->hasho_flag & flags)) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a hash page of expected type"), + errdetail("Expected hash page type: %08x got: %08x.", +flags, pageopaque->hasho_flag))); + return page; +} + +/* ---
Re: [HACKERS] Retiring from the Core Team
On 01/11/2017 07:29 PM, Josh Berkus wrote: You will have noticed that I haven't been very active for the past year. My new work on Linux containers and Kubernetes has been even more absorbing than I anticipated, and I just haven't had a lot of time for PostgreSQL work. For that reason, as of today, I am stepping down from the PostgreSQL Core Team. Thanks for all your work over the years ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
On 01/10/2017 10:39 AM, Tom Lane wrote: Robert Haas writes: No, you're missing the point. The patch doesn't need to add pageinspect--1.6.sql at all, nor does it remove pageinspect--1.5.sql. It only needs to add pageinspect--1.5--1.6.sql and change the default version to 1.6. No other changes are required. Right, this is a new recommended process since commit 40b449ae8, which see for rationale. Use, eg, commit 11da83a0e as a model for extension update patches. Thanks for the commit ids ! Revised patched attached. Best regards, Jesper >From 2c75e4af17903c22961f23ed5455614340d0dd4e Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 3 Jan 2017 07:49:42 -0500 Subject: [PATCH] Add support for hash index in pageinspect contrib module v11 Authors: Ashutosh Sharma and Jesper Pedersen. --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 558 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 76 contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 149 +++ src/backend/access/hash/hashovfl.c| 52 +-- src/include/access/hash.h | 1 + 7 files changed, 817 insertions(+), 31 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index 87a28e9..d7f3645 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.5.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" REGRESS = page btree brin gin diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..525e780 --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,558 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2017, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/htup_details.h" +#include "catalog/pg_am.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "utils/builtins.h" + +PG_FUNCTION_INFO_V1(hash_page_type); +PG_FUNCTION_INFO_V1(hash_page_stats); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_bitmap_info); +PG_FUNCTION_INFO_V1(hash_metapage_info); + +#define IS_HASH(r) ((r)->rd_rel->relam == HASH_AM_OID) + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 free_size; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPageStat; + + +/* + * Verify that the given bytea contains a HASH page, or die in the attempt. + * A pointer to the page is returned. 
+ */ +static Page +verify_hash_page(bytea *raw_page, int flags) +{ + Page page; + int raw_page_size; + HashPageOpaque pageopaque; + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (raw_page_size != BLCKSZ) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small"), + errdetail("Expected size %d, got %d", + BLCKSZ, raw_page_size))); + + page = VARDATA(raw_page); + + if (PageIsNew(page)) + ereport(ERROR, +(errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index table contains unexpected zero page"))); + + if (PageGetSpecialSize(page) != MAXALIGN(sizeof(HashPageOpaqueData))) + ereport(ERROR, +(errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index table contains corrupted page"))); + + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + if (pageopaque->hasho_page_id != HASHO_PAGE_ID) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a HASH page"), + errdetail("Expected %08x, got %08x.", + HASHO_PAGE_ID, pageopaque-
Re: [HACKERS] Microvacuum support for Hash Index
Hi Ashutosh, On 01/10/2017 05:24 AM, Ashutosh Sharma wrote: Thanks for reporting this problem. It is basically coming because i forgot to unpin the bucketbuf in hash_xlog_vacuum_one_page(). Please find the attached v5 patch that fixes the issue. The crash is now fixed, but the --- test.sql --- \set id random(1,10) \set val random(0,10) BEGIN; UPDATE test SET val = :val WHERE id = :id; COMMIT; --- test.sql --- case gives client 6 aborted in command 3 of script 0; ERROR: deadlock detected DETAIL: Process 14608 waits for ShareLock on transaction 1444620; blocked by process 14610. Process 14610 waits for ShareLock on transaction 1444616; blocked by process 14608. HINT: See server log for query details. CONTEXT: while rechecking updated tuple (12,3) in relation "test" ... using pgbench -M prepared -c 10 -j 10 -T 300 -f test.sql test Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi Ashutosh, On 01/06/2017 12:54 AM, Ashutosh Sharma wrote: using pgbench -M prepared -c 10 -j 10 -T 600 -f test.sql test crashes after a few minutes with TRAP: FailedAssertion("!(LWLockHeldByMeInMode(((LWLock*) (&(bufHdr)->content_lock)), LW_EXCLUSIVE))", File: "bufmgr.c", Line: 3781) Attached v4 patch fixes this assertion failure. Yes, that fixes the problem. However (master / WAL v7 / MV v4), --- ddl.sql --- CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val; CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id); CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val); ANALYZE; --- ddl.sql --- --- test.sql --- \set id random(1,10) \set val random(0,10) BEGIN; DELETE FROM test WHERE id = :id; INSERT INTO test VALUES (:id, :val); COMMIT; --- test.sql --- gives #9 0x0098a83e in elog_finish (elevel=20, fmt=0xb6ea92 "incorrect local pin count: %d") at elog.c:1378 #10 0x007f0b33 in LockBufferForCleanup (buffer=1677) at bufmgr.c:3605 #11 0x00549390 in XLogReadBufferForRedoExtended (record=0x2afced8, block_id=1 '\001', mode=RBM_NORMAL, get_cleanup_lock=1 '\001', buf=0x7ffe3ee27c8c) at xlogutils.c:394 #12 0x004c5026 in hash_xlog_vacuum_one_page (record=0x2afced8) at hash_xlog.c:1109 #13 0x004c5547 in hash_redo (record=0x2afced8) at hash_xlog.c:1214 #14 0x0053a361 in StartupXLOG () at xlog.c:6975 #15 0x007a4ca0 in StartupProcessMain () at startup.c:216 on the slave instance in a master-slave setup. Also, the src/backend/access/README file should be updated with a description of the changes which this patch provides. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi Ashutosh, On 01/04/2017 06:13 AM, Ashutosh Sharma wrote: Attached is the v3 patch rebased on PostgreSQL HEAD and the WAL v7 patch. It also takes care of all the previous comments from Jesper - [1]. With an --enable-cassert build (master / WAL v7 / MV v3) and -- ddl.sql -- CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val; CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id); CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val); ANALYZE; -- ddl.sql -- -- test.sql -- \set id random(1,10) \set val random(0,10) BEGIN; UPDATE test SET val = :val WHERE id = :id; COMMIT; -- test.sql -- using pgbench -M prepared -c 10 -j 10 -T 600 -f test.sql test crashes after a few minutes with TRAP: FailedAssertion("!(LWLockHeldByMeInMode(((LWLock*) (&(bufHdr)->content_lock)), LW_EXCLUSIVE))", File: "bufmgr.c", Line: 3781) BTW, it would be better to rename 'hashkillitems' to '_hash_kill_items' to follow the naming convention in hash.h Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi Ashutosh, On 01/05/2017 07:13 AM, Ashutosh Sharma wrote: * Readded pageinspect--1.6.sql In order to have the latest pageinspect interface in 1 file, as we need something to install from. I think there should be no problem even if we simply add the pageinspect--1.5--1.6.sql file instead of removing 1.5.sql, as with the latest changes to the create extension infrastructure in PostgreSQL it automatically finds a lower version script if unable to find the default version script and then upgrades it to the latest version. Removing --1.5.sql otherwise would give test=# CREATE EXTENSION "pageinspect"; ERROR: extension "pageinspect" has no installation script nor update path for version "1.6" I didn't get this. Do you mean to say that if you add 1.6.sql and do not remove 1.5.sql you get this error when trying to add the pageinspect module? I didn't get any such error. See below, postgres=# create extension pageinspect WITH VERSION '1.6'; CREATE EXTENSION The previous patch was using pageinspect--1.5.sql as a base, and then uses pageinspect--1.5--1.6.sql to upgrade to version 1.6. Removing pageinspect--1.5.sql, and adding pageinspect--1.6.sql with the current interface will use pageinspect--1.6.sql directly, whereas existing installations will use the upgrade scripts. Hence I don't see a reason why we should keep pageinspect--1.5.sql around when we can provide a clear interface description in a pageinspect--1.6.sql. Peter has already written a test case [1] based on your earlier patch for supporting hash index with the pageinspect module. Once the latest patch (v10) becomes stable I will share a separate patch with a test case for hash index. In fact I will try to modify an already existing patch by Peter. Ok, sounds good. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi Ashutosh, On 12/20/2016 05:55 AM, Ashutosh Sharma wrote: 1) It introduces two new functions hash_page_type() and hash_bitmap_info(). hash_page_type basically displays the type of hash page whereas hash_bitmap_info() shows the status of a bit for a particular overflow page in bitmap page of hash index. 2) The functions hash_page_stats() and hash_page_items() basically shows the information about data stored in bucket and overflow pages of hash index. If a metapage or bitmap page is passed as an input to this function it throws an error saying "Expected page type:'' got: ''". 3) It also improves verify_hash_page() function to handle any type of page in hash index. It is now being used as verify_hash_page('raw_page', ) to verify if the page passed to this function is of 'page_type' or not. Here page_type can be LH_META_PAGE, LH_BUCKET_PAGE, LH_OVERFLOW_PAGE, LH_BITMAP_PAGE. Here is an updated patch with some changes. * Rename convert_ovflblkno_to_bitno to _hash_ovflblkno_to_bitno Such that there is only 1 method, which is exposed * Readded pageinspect--1.6.sql In order to have the latest pageinspect interface in 1 file, as we need something to install from. Removing --1.5.sql otherwise would give test=# CREATE EXTENSION "pageinspect"; ERROR: extension "pageinspect" has no installation script nor update path for version "1.6" * Minor documentation changes Look it over, and maybe there is a simple test case we can add for this. Best regards, Jesper >From 46374ba3aba57d290c1940fee4a12ed31c12342f Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 3 Jan 2017 07:49:42 -0500 Subject: [PATCH] Add support for hash index in pageinspect contrib module v10 Authors: Ashutosh Sharma and Jesper Pedersen. --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 558 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 76 contrib/pageinspect/pageinspect--1.5.sql | 279 - contrib/pageinspect/pageinspect--1.6.sql | 351 contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 149 +++ src/backend/access/hash/hashovfl.c| 52 +-- src/include/access/hash.h | 1 + 9 files changed, 1168 insertions(+), 310 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql create mode 100644 contrib/pageinspect/pageinspect--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index 87a28e9..ecf0df0 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" REGRESS = page btree brin gin diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..525e780 --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,558 @@ +/* + * hashfuncs.c + * Functions to 
investigate the content of HASH indexes + * + * Copyright (c) 2017, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/htup_details.h" +#include "catalog/pg_am.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "utils/builtins.h" + +PG_FUNCTION_INFO_V1(hash_page_type); +PG_FUNCTION_INFO_V1(hash_page_stats); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_bitmap_info); +PG_FUNCTION_INFO_V1(hash_metapage_info); + +#define IS_HASH(r) ((r)->rd_rel->relam == HASH_AM_OID) + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 free_size; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hash
Re: [HACKERS] Microvacuum support for Hash Index
On 11/11/2016 12:11 AM, Ashutosh Sharma wrote: Hi Jesper, Some initial comments. _hash_vacuum_one_page: + END_CRIT_SECTION(); + _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK); The _hash_chgbufaccess() needs a comment. You also need a place where you call pfree for so->killedItems - maybe in hashkillitems(). Thanks for reviewing this patch. I would like to update you that this patch has a dependency on a patch for concurrent hash index and WAL log in hash index. So, until these two patches for hash index are stable I won't be able to share a next version of the patch for supporting microvacuum in hash index. This can be rebased on the WAL v7 patch [1]. In addition to the previous comments you need to take commit 7819ba into account. [1] https://www.postgresql.org/message-id/CAA4eK1%2BdmGNTFMnLO4EbOWJDHUq%3D%2Ba2L8T%3D72ifXeh-Kd8HOsg%40mail.gmail.com Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
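For the rebase, it may help to spell out what the killedItems machinery is expected to do; a sketch modeled on btree's _bt_killitems() (a hypothetical adaptation -- hashkillitems() and the LH_PAGE_HAS_DEAD_TUPLES flag belong to the patch under review, and the array itself is palloc'd per scan, hence the pfree comment above):

/*
 * Mark remembered dead entries LP_DEAD so a later insertion can reclaim
 * their space without a full VACUUM (simplified; offsets would need
 * re-checking if the page changed since the scan).
 */
for (i = 0; i < so->numKilled; i++)
{
    ItemId      iid = PageGetItemId(page, so->killedItems[i].indexOffset);

    ItemIdMarkDead(iid);
    killedsomething = true;
}

if (killedsomething)
{
    opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
    MarkBufferDirtyHint(buf, true);
}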
Re: [HACKERS] pageinspect: Hash index support
Hi, On 12/20/2016 05:55 AM, Ashutosh Sharma wrote: Attached is the revised patch. Please have a look and let me know your feedback. I have also changed the status for this patch in the upcoming commitfest to "needs review". Thanks. I was planning to submit a new version next week for CF/January, so I'll review your changes with the previous feedback in mind. Thanks for working on this ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_catversion builtin function
On 12/14/2016 08:52 AM, Robert Haas wrote: But I understand your concern, so "Rejected" is ok under https://commitfest.postgresql.org/12/906/ I have a better reason for rejecting this patch: we already have this feature. rhaas=# select catalog_version_no from pg_control_system(); catalog_version_no 201612081 (1 row) Ah, perfect ! Thanks, Robert Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_catversion builtin function
On 12/13/2016 10:33 AM, Tom Lane wrote: Jesper Pedersen writes: Attached is a new builtin function that exposes the CATALOG_VERSION_NO constant under the pg_catversion() function, e.g. I'm pretty sure that we intentionally didn't expose that, reasoning that users should only care about the user-visible version number. What exactly is the argument for exposing this? I'm using it to get the catalog version from a running instance in order to figure out if a dump/restore is needed for the next daily build -- instead of keeping the catversion.h file around for each installation, with script magic. Test databases are external to PostgreSQL's test suite, and one is quite big, so "cp" is faster than dump/restore :) But I understand your concern, so "Rejected" is ok under https://commitfest.postgresql.org/12/906/ Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On 12/11/2016 11:37 PM, Amit Kapila wrote: On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila wrote: On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes wrote: With above fixes, the test ran successfully for more than a day. There was a small typo in the previous patch which is fixed in attached. I don't think it will impact the test results if you have already started the test with the previous patch, but if not, then it is better to test with attached. A mixed workload (INSERT, DELETE and VACUUM primarily) is successful here too using _v2. Thanks ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_catversion builtin function
Hi Hackers, Attached is a new builtin function that exposes the CATALOG_VERSION_NO constant under the pg_catversion() function, e.g. test=# SELECT pg_catversion(); pg_catversion --- 201612121 (1 row) Although it is mostly useful during the development cycle to verify if a dump/restore is needed, it could have other use-cases. I'm unsure of the OID assignment rules - feel free to point me towards information regarding this. I'll register this patch with the next CF. Best regards, Jesper >From 39d52f5389bf3ef1814c1f201df6531feb2a5c7f Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 13 Dec 2016 08:27:18 -0500 Subject: [PATCH] pg_catversion builtin function --- doc/src/sgml/func.sgml | 6 ++ src/backend/utils/adt/version.c | 10 ++ src/include/catalog/pg_proc.h | 3 +++ src/include/utils/builtins.h | 1 + 4 files changed, 20 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 0f9c9bf..6fc78ab 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -15497,6 +15497,12 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n); text PostgreSQL version information. See also for a machine-readable version. + + + pg_catversion() + int + returns the catalog version number + diff --git a/src/backend/utils/adt/version.c b/src/backend/utils/adt/version.c index f247992..55f97fa 100644 --- a/src/backend/utils/adt/version.c +++ b/src/backend/utils/adt/version.c @@ -14,6 +14,7 @@ #include "postgres.h" +#include "catalog/catversion.h" #include "utils/builtins.h" @@ -22,3 +23,12 @@ pgsql_version(PG_FUNCTION_ARGS) { PG_RETURN_TEXT_P(cstring_to_text(PG_VERSION_STR)); } + +/* + * Return the catalog version number + */ +Datum +pgsql_catversion(PG_FUNCTION_ARGS) +{ + PG_RETURN_INT32(CATALOG_VERSION_NO); +} diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index cd7b909..b23f54a 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -235,6 +235,9 @@ DATA(insert OID = 1258 ( textcat PGNSP PGUID 12 1 0 0 0 f f f f t f i s 2 0 DATA(insert OID = 84 ( boolne PGNSP PGUID 12 1 0 0 0 f f f t t f i s 2 0 16 "16 16" _null_ _null_ _null_ _null_ _null_ boolne _null_ _null_ _null_ )); DATA(insert OID = 89 ( version PGNSP PGUID 12 1 0 0 0 f f f f t f s s 0 0 25 "" _null_ _null_ _null_ _null_ _null_ pgsql_version _null_ _null_ _null_ )); DESCR("PostgreSQL version string"); +DATA(insert OID = 7000 ( pg_catversion PGNSP PGUID 12 1 0 0 0 f f f f t f s s 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pgsql_catversion _null_ _null_ _null_ )); +DESCR("PostgreSQL catalog version number"); + DATA(insert OID = 86 ( pg_ddl_command_in PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 32 "2275" _null_ _null_ _null_ _null_ _null_ pg_ddl_command_in _null_ _null_ _null_ )); DESCR("I/O"); diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h index 7ed1623..83b2846 100644 --- a/src/include/utils/builtins.h +++ b/src/include/utils/builtins.h @@ -893,6 +893,7 @@ extern Datum text_format_nv(PG_FUNCTION_ARGS); /* version.c */ extern Datum pgsql_version(PG_FUNCTION_ARGS); +extern Datum pgsql_catversion(PG_FUNCTION_ARGS); /* xid.c */ extern Datum xidin(PG_FUNCTION_ARGS); -- 2.7.4 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
On 11/11/2016 12:11 AM, Ashutosh Sharma wrote: Thanks for reviewing this patch. I would like to update you that this patch has a dependency on a patch for concurrent hash index and WAL log in hash index. So, until these two patches for hash index are stable I won't be able to share a next version of the patch for supporting microvacuum in hash index. As the concurrent hash index patch was committed in 6d46f4, this patch needs a rebase. I have moved this submission to the next CF. Thanks for working on this ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cache Hash Index meta page.
On 09/28/2016 11:55 AM, Mithun Cy wrote: On Tue, Sep 27, 2016 at 1:53 AM, Jeff Janes wrote: > I think that this needs to be updated again for v8 of concurrent and v5 of wal Adding the rebased patch over [1] + [2] As the concurrent hash index patch was committed in 6d46f4, this patch needs a rebase. I have moved this submission to the next CF. Thanks for working on this ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi, On 11/02/2016 01:38 AM, Ashutosh Sharma wrote: While replaying the delete/vacuum record on standby, it can conflict with some already running queries. Basically, the replay can remove a row which is still visible on the standby. You need to resolve conflicts similar to what we do in btree delete records (refer btree_xlog_delete). Agreed. Thanks for raising this point. I have taken care of it in the attached v2 patch. Some initial comments. _hash_vacuum_one_page: + END_CRIT_SECTION(); + _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK); The _hash_chgbufaccess() needs a comment. You also need a place where you call pfree for so->killedItems - maybe in hashkillitems(). Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
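On the standby-conflict point above, the btree precedent boils down to the following; a sketch following btree_xlog_delete() (the hash-side latestRemovedXid helper is hypothetical here):

/*
 * Before replaying a deletion that may remove rows still visible to
 * running standby queries, cancel conflicting snapshots -- same shape
 * as the btree delete redo path.
 */
if (InHotStandby)
{
    TransactionId latestRemovedXid =
        hash_xlog_vacuum_get_latestRemovedXid(record);
    RelFileNode rnode;

    XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
    ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
}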
Re: [HACKERS] pageinspect: Hash index support
On 11/01/2016 03:28 PM, Peter Eisentraut wrote: Ok, thanks for your feedback. Maybe "Returned with Feedback" is more appropriate, as there is still development left. I have applied the documentation change that introduced subsections, which seems quite useful independently. I have also committed the tests that I proposed and will work through the failures. As no new patch has been posted for the 2016-11 CF, I will close the patch entry now. Please submit an updated patch when you have the time, keeping an eye on ongoing work to update hash indexes. Agreed. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
On 09/29/2016 04:02 PM, Peter Eisentraut wrote: On 9/29/16 4:00 PM, Peter Eisentraut wrote: Since the commit fest is drawing to a close, I'll set this patch as returned with feedback. Actually, the CF app informs me that moving to the next CF is more appropriate, so I have done that. Ok, thanks for your feedback. Maybe "Returned with Feedback" is more appropriate, as there is still development left. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
On 09/29/2016 11:58 AM, Peter Eisentraut wrote: On 9/27/16 10:10 AM, Jesper Pedersen wrote: contrib/pageinspect/pageinspect--1.5--1.6.sql | 59 contrib/pageinspect/pageinspect--1.5.sql | 279 -- contrib/pageinspect/pageinspect--1.6.sql | 346 ++ I think there is still a misunderstanding about these extension files. You only need to supply pageinspect--1.5--1.6.sql and leave all the other ones alone. Ok. Left the 'DATA' section in the Makefile, and the control file as is. Best regards, Jesper >From b48623beda6aad43116d46ff44de55bdc7547325 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 5 Aug 2016 10:16:32 -0400 Subject: [PATCH] pageinspect: Hash index support --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 408 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 59 contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 146 - 5 files changed, 609 insertions(+), 16 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index a98237e..c73d728 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" ifdef USE_PGXS diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..a76b683 --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,408 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/htup_details.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "utils/builtins.h" + +PG_FUNCTION_INFO_V1(hash_metapage_info); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_page_stats); + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 free_size; + char *type; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPageStat; + + +/* + * Verify that the given bytea contains a HASH page, or die in the attempt. + * A pointer to the page is returned. 
+ */ +static Page +verify_hash_page(bytea *raw_page, bool metap) +{ + Page page; + int raw_page_size; + HashPageOpaque pageopaque; + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (raw_page_size != BLCKSZ) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small"), + errdetail("Expected size %d, got %d", + BLCKSZ, raw_page_size))); + + page = VARDATA(raw_page); + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + if (pageopaque->hasho_page_id != HASHO_PAGE_ID) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a HASH page"), + errdetail("Expected %08x, got %08x.", + HASHO_PAGE_ID, pageopaque->hasho_page_id))); + + if (metap) + { + if (pageopaque->hasho_flag != LH_META_PAGE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a metadata page"), + errdetail("Expected %08x, got %08x.", + LH_META_PAGE, pageopaque->hasho_flag))); + } + else + { + if (pageopaque->hasho_flag == LH_META_PAGE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is a metadata page"), + errdetail("Flags %08x.", + pageopaque->hasho_flag))); + } + + return page; +} + +/* - + * GetHashPageStatistics() + * + * C
Re: [HACKERS] Write Ahead Logging for Hash Indexes
On 09/25/2016 01:00 AM, Amit Kapila wrote: The attached patch fixes the problem; we now perform full page writes for bitmap pages. Apart from that, I have rebased the patch on the latest concurrent index patch [1]. I have updated the README as well to reflect the WAL logging related information for the different operations. With the attached patch, all the review comments and issues found so far are addressed. I have been running various tests and applications with this patch together with the CHI v8 patch [1]. As I haven't seen any failures and don't currently have additional feedback, I'm moving this patch to "Ready for Committer" for their feedback. If others have comments, please move the patch status back in the CommitFest application. [1] https://www.postgresql.org/message-id/CAA4eK1%2BX%3D8sUd1UCZDZnE3D9CGi9kw%2Bkjxp2Tnw7SX5w8pLBNw%40mail.gmail.com Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
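For reference, the essential property under test can be sketched as follows (hypothetical object names; assumes a primary/standby pair with streaming replication already set up):

-- on the primary
CREATE TABLE w AS SELECT generate_series(1, 100000) AS id;
CREATE INDEX w_hash ON w USING hash (id);
-- on the standby, once the records have been replayed,
-- the hash index should be usable for scans:
SET enable_seqscan = off;
EXPLAIN SELECT * FROM w WHERE id = 42;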
Re: [HACKERS] Hash Indexes
On 09/20/2016 09:02 AM, Amit Kapila wrote: On Fri, Sep 16, 2016 at 11:22 AM, Amit Kapila wrote: I do want to work on it, but it is always possible that due to some other work this might get delayed. Also, I think there is always a chance that while doing that work we face some problem due to which we might not be able to use that optimization. So I will go with your suggestion of removing hashscan.c and its usage for now, and then if required we will pull it back. If nobody else thinks otherwise, I will update this in the next patch version. In the attached patch, I have removed the support for hashscans. I think it might improve performance by a few percent (especially for single row fetch transactions), as it eliminates the registration and destruction of hashscans. I have been running various tests and applications with this patch together with the WAL v5 patch [1]. As I haven't seen any failures and don't currently have additional feedback, I'm moving this patch to "Ready for Committer" for their feedback. If others have comments, please move the patch status back in the CommitFest application. [1] https://www.postgresql.org/message-id/CAA4eK1KE%3D%2BkkowyYD0vmch%3Dph4ND3H1tViAB%2B0cWTHqjZDDfqg%40mail.gmail.com Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
On 09/26/2016 10:45 PM, Peter Eisentraut wrote: On 9/26/16 1:39 PM, Jesper Pedersen wrote: Left as is, since BuildTupleFromCStrings() vs. xyzGetDatum() are equally readable in this case. But, I can change the patch if needed. The point is that to use BuildTupleFromCStrings() you need to convert numbers to strings, and then they are converted back. This is not a typical way to write row-returning functions. Ok. Changed: * BuildTupleFromCStrings -> xyzGetDatum * 'type' field: char -> text w/ full description * Removed 'type' information from documentation Best regards, Jesper >From d6dd73ffef9deafc938b9fe7119db0928494cbf9 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 5 Aug 2016 10:16:32 -0400 Subject: [PATCH] pageinspect: Hash index support --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 408 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 59 contrib/pageinspect/pageinspect--1.5.sql | 279 -- contrib/pageinspect/pageinspect--1.6.sql | 346 ++ contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 146 - 7 files changed, 955 insertions(+), 295 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql create mode 100644 contrib/pageinspect/pageinspect--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index a98237e..c73d728 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" ifdef USE_PGXS diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..a76b683 --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,408 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/htup_details.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "utils/builtins.h" + +PG_FUNCTION_INFO_V1(hash_metapage_info); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_page_stats); + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 free_size; + char *type; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPageStat; + + +/* + * Verify that the given bytea contains a HASH page, or die in the attempt. + * A pointer to the page is returned. 
+ */ +static Page +verify_hash_page(bytea *raw_page, bool metap) +{ + Page page; + int raw_page_size; + HashPageOpaque pageopaque; + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (raw_page_size != BLCKSZ) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small"), + errdetail("Expected size %d, got %d", + BLCKSZ, raw_page_size))); + + page = VARDATA(raw_page); + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + if (pageopaque->hasho_page_id != HASHO_PAGE_ID) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a HASH page"), + errdetail("Expected %08x, got %08x.", + HASHO_PAGE_ID, pageopaque->hasho_page_id))); + + if (metap) + { + if (pageopaque->hasho_flag != LH_META_PAGE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a metadata page"), + errdetail("Expected %08x, got %08x.", + LH_META_PAGE, pageopaque->hasho_flag))); + } + else + { + if (pageopaque->hasho_flag == LH_
Re: [HACKERS] pageinspect: Hash index support
On 09/23/2016 01:56 AM, Amit Kapila wrote: While looking at the patch, I noticed the below code, which seems somewhat problematic: + stat->max_avail = BLCKSZ - (BLCKSZ - phdr->pd_special + SizeOfPageHeaderData); + + /* page type (flags) */ + if (opaque->hasho_flag & LH_META_PAGE) + stat->type = 'm'; + else if (opaque->hasho_flag & LH_OVERFLOW_PAGE) + stat->type = 'v'; + else if (opaque->hasho_flag & LH_BUCKET_PAGE) + stat->type = 'b'; + else if (opaque->hasho_flag & LH_BITMAP_PAGE) + stat->type = 'i'; In the above code, it appears that you are trying to calculate the max_avail space for all pages in the same way. Don't you need to calculate it differently for bitmap and meta pages, as they don't share the same format as the other types of pages? Correct; however, the max_avail calculation was removed in v6, since we don't display the average item size anymore. Thanks for the feedback ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi, On 09/23/2016 12:10 AM, Peter Eisentraut wrote: On 9/21/16 9:30 AM, Jesper Pedersen wrote: Attached is v5, which adds basic page verification. There are still some things that I found that appear to follow the old style (btree) conventions instead of the new style (brin, gin) conventions. - Rename hash_metap to hash_metapage_info. Done. - Don't use BuildTupleFromCStrings(). (There is nothing inherently wrong with it, but why convert numbers to strings and back again when it can be done more directly.) Left as is, since BuildTupleFromCStrings() vs. xyzGetDatum() are equally readable in this case. But, I can change the patch if needed. - Documentation for hash_page_stats still has the blkno argument. Fixed. - In hash_metap, the magic field could be of type bytea, so that it displays in hex. Or text like the brin functions. Changed to 'text'. Some other comments: - The hash_metap result fields spares and mapp should be arrays of integer. B-tree and BRIN use a 'text' field as output, so left as is. (Incidentally, the comment in hash.h talks about bitmaps[] but I think it means hashm_mapp[].) - As of very recently, we don't need to move pageinspect--1.5.sql to pageinspect--1.6.sql anymore. Just add pageinspect--1.5--1.6.sql. We need to rename the pageinspect--1.5.sql file and add the new methods, right ? Or ? - In the documentation for hash_page_stats(), the "+" sign is misaligned. Fixed. - In hash_page_items, the fields itemlen, nulls, vars are always 16, false, false. So perhaps there is no need for them. Similarly, the avg_item_size in hash_page_stats is always 16. Fixed, by removing them. We can add them back later if needed. - The data field could be of type bytea. Left as is, for the same reasons as 'spares' and 'mapp'. - Add a pointer in the documentation to the relevant header file. Done. Bonus: - Add subsections in the documentation (general functions, B-tree functions, etc.) Done. - Add tests. Thanks for the review !
Best regards, Jesper >From 24c3ff3cb48ca2076d589eda9027a0abe73b5aad Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 5 Aug 2016 10:16:32 -0400 Subject: [PATCH] pageinspect: Hash index support --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 400 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 59 contrib/pageinspect/pageinspect--1.5.sql | 279 -- contrib/pageinspect/pageinspect--1.6.sql | 346 ++ contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 151 +- 7 files changed, 952 insertions(+), 295 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql create mode 100644 contrib/pageinspect/pageinspect--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index a98237e..c73d728 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" ifdef USE_PGXS diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..96e72a9 --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,400 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "funcapi.h" +#include "miscadmin.h" + +PG_FUNCTION_INFO_V1(hash_metapage_info); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_page_stats); + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 free_size; + char type; + + /* opaque data */ + BlockNumber hasho_prevblkno; + Bl
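A usage sketch for the renamed function (block 0 of a hash index is the metapage; the selected column names are assumed from the HashMetaPageData fields discussed in this thread):

CREATE TABLE t AS SELECT generate_series(1, 100000) AS id;
CREATE INDEX t_hash ON t USING hash (id);
SELECT magic, maxbucket, ovflpoint, spares, mapp
  FROM hash_metapage_info(get_raw_page('t_hash', 0));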
Re: [HACKERS] pageinspect: Hash index support
On 09/21/2016 08:43 AM, Michael Paquier wrote: page_stats / page_items should not be used on the metadata page. As these functions are marked as superuser only it is expected that people provides the correct input, especially since the raw page structure is used as the input. btree functions use the block number to do some sanity checks. You cannot do that here as only bytea functions are available, but you could do it in verify_hash_page by looking at the opaque data and look at LH_META_PAGE. Then add a boolean argument into verify_hash_page to see if the caller expects a meta page or not and just issue an error. Actually it would be a good idea to put in those safeguards, even if I agree with you that calling those functions is at the risk of the user... Could you update the patch in this sense? I had fun doing the same tests, aka running the items and stats functions on a meta page, and the meta function on a non-meta page, but at my surprise I did not see a crash, so perhaps I was lucky and perhaps that was because of OSX. Attached is v5, which add basic page verification. Thanks for the feedback ! Best regards, Jesper >From 05fe7b9056bcbb2f29809798229640499576c3a9 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 5 Aug 2016 10:16:32 -0400 Subject: [PATCH] pageinspect: Hash index support --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 422 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 63 contrib/pageinspect/pageinspect--1.5.sql | 279 - contrib/pageinspect/pageinspect--1.6.sql | 338 + contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 104 +++ 7 files changed, 933 insertions(+), 285 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql create mode 100644 contrib/pageinspect/pageinspect--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index a98237e..c73d728 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" ifdef USE_PGXS diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..3ee57dc --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,422 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "funcapi.h" +#include "miscadmin.h" + +PG_FUNCTION_INFO_V1(hash_metap); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_page_stats); + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; 
+ uint32 dead_items; + uint32 page_size; + uint32 max_avail; + uint32 free_size; + uint32 avg_item_size; + char type; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPageStat; + + +/* + * Verify that the given bytea contains a HASH page, or die in the attempt. + * A pointer to the page is returned. + */ +static Page +verify_hash_page(bytea *raw_page, bool metap) +{ + Page page; + int raw_page_size; + HashPageOpaque pageopaque; + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (raw_page_size != BLCKSZ) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small"), + errdetail("Expected size %d, got %d", + BLCKSZ, raw_page_size))); + + page = VARDATA(raw_page); + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + if (pageopaque->hasho_page_id != HASHO_PAGE_ID) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a HASH page"), + errdetail("Expected %08x, got %08x.", + HASHO_PAGE_ID, pageopaque->hasho_p
Re: [HACKERS] pageinspect: Hash index support
On 09/21/2016 02:14 AM, Michael Paquier wrote: Adjusted in v4. Thanks for the new version. Code/doc will need an update once the CHI patch goes in. If you are referring to the WAL support, I guess that this patch or the other patch could just rebase and adjust as needed. It is the main patch [1] that defines the new constants for the page types. But I'll submit an update for pageinspect when needed. hash_page_items and hash_page_stats share a lot of common points with their btree equivalents; still, doing a refactoring would just impact the visibility of the code, and it is wanted as educational in this module, so let's go with things the way you suggest. Ok. + + The type information will be 'm' for a metadata page, + 'v' for an overflow page, 'b' for a bucket page, + 'i' for a bitmap page, and 'u' for an unused page. + Other functions don't go into this level of detail, so I would just rip out this paragraph. I'll add an annotation for this part, and leave it for the committer to decide, since Jeff wanted documentation for the 'type' information. The patch looks in pretty good shape to me, so I am switching it to ready for committer. Thanks for your feedback ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
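A sketch that decodes the type letter when walking a whole index with the v4 interface (skips block 0, the metapage; assumes the default 8kB block size and an index named t_hash):

SELECT blkno,
       CASE s.type
            WHEN 'm' THEN 'metadata'
            WHEN 'v' THEN 'overflow'
            WHEN 'b' THEN 'bucket'
            WHEN 'i' THEN 'bitmap'
            ELSE 'unused'
       END AS page_type,
       s.live_items,
       s.free_size
  FROM generate_series(1, pg_relation_size('t_hash') / 8192 - 1) AS blkno,
       LATERAL hash_page_stats(get_raw_page('t_hash', blkno::int)) AS s;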
Re: [HACKERS] pageinspect: Hash index support
Hi, On 09/21/2016 07:24 AM, Ashutosh Sharma wrote: The development branch is @ https://github.com/jesperpedersen/postgres/tree/pageinspect_hash I am working on centOS 7. I am still facing the issue when applying your patch using "git apply" command but if i use 'patch' command it works fine however i am seeing some warning messages with 'patch' command as well. git apply w/ v4 works here, so you will have to investigate what happens on your side. I continued reviewing your v4 patch and here are some more comments: 1). Got below error if i pass meta page to hash_page_items(). Does hash_page_items() accept only bucket or bitmap page as input? postgres=# SELECT * FROM hash_page_items(get_raw_page('hash_i4_index', 0)); ERROR: invalid memory alloc request size 18446744073709551593 page_stats / page_items should not be used on the metadata page. As these functions are marked as superuser only it is expected that people provides the correct input, especially since the raw page structure is used as the input. 2). Server crashed when below query was executed on a code that was build with cassert-enabled. postgres=# SELECT * FROM hash_page_stats(get_raw_page('hash_i4_index', 0)); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> \q The callstack is as follows: (gdb) bt #0 0x7fef794f15f7 in raise () from /lib64/libc.so.6 #1 0x7fef794f2ce8 in abort () from /lib64/libc.so.6 #2 0x00967381 in ExceptionalCondition (conditionName=0x7fef699c942d "!(((id)->lp_len != 0))", errorType=0x7fef699c92d4 "FailedAssertion", fileName=0x7fef699c9397 "hashfuncs.c", lineNumber=123) at assert.c:54 #3 0x7fef699c6ed6 in GetHashPageStatistics (page=0x1060c2c "", stat=0x7ffd76846230) at hashfuncs.c:123 #4 0x7fef699c70a4 in hash_page_stats (fcinfo=0x7ffd768463e0) at hashfuncs.c:169 #5 0x00682e6b in ExecMakeTableFunctionResult (funcexpr=0x10457f0, econtext=0x1044e28, argContext=0x104ee38, expectedDesc=0x1047640, randomAccess=0 '\000') at execQual.c:2216 #6 0x006a88e2 in FunctionNext (node=0x1044d10) at nodeFunctionscan.c:94 #7 0x0068a3d9 in ExecScanFetch (node=0x1044d10, accessMtd=0x6a882b , recheckMtd=0x6a8c10 ) at execScan.c:95 #8 0x0068a448 in ExecScan (node=0x1044d10, accessMtd=0x6a882b , recheckMtd=0x6a8c10 ) at execScan.c:145 #9 0x006a8c45 in ExecFunctionScan (node=0x1044d10) at nodeFunctionscan.c:268 #10 0x0067f0cf in ExecProcNode (node=0x1044d10) at execProcnode.c:449 #11 0x0067b40b in ExecutePlan (estate=0x1044bf8, planstate=0x1044d10, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x10444c8) at execMain.c:1567 #12 0x00679527 in standard_ExecutorRun (queryDesc=0xfac778, direction=ForwardScanDirection, count=0) at execMain.c:338 #13 0x006793ab in ExecutorRun (queryDesc=0xfac778, direction=ForwardScanDirection, count=0) at execMain.c:286 #14 0x00816b3e in PortalRunSelect (portal=0xfa49e8, forward=1 '\001', count=0, dest=0x10444c8) at pquery.c:948 #15 0x008167d1 in PortalRun (portal=0xfa49e8, count=9223372036854775807, isTopLevel=1 '\001', dest=0x10444c8, altdest=0x10444c8, completionTag=0x7ffd76846c60 "") at pquery.c:789 #16 0x00810a27 in exec_simple_query (query_string=0x1007018 "SELECT * FROM hash_page_stats(get_raw_page('hash_i4_index', 0));") at postgres.c:1094 #17 0x00814b5e in PostgresMain (argc=1, argv=0xfb1d08, dbname=0xf873d8 "postgres", username=0xfb1b70 "edb") at postgres.c:4070 #18 
0x0078990c in BackendRun (port=0xfacb50) at postmaster.c:4260 Same deal here. 3). Could you please let me know, what does the hard coded values '*5' and '+1' represents in the below lines of code. You have used them when allocating memory for storing spare pages allocated before certain split point and block number of bitmap pages inside hash_metap(). spares = palloc0(HASH_MAX_SPLITPOINTS * 5 + 1); mapp = palloc0(HASH_MAX_BITMAPS * 5 + 1); I guess it would be better to add some comments if using any hard coded values. It is the space needed to output the values. 4). Inside hash_page_stats(), you are first calling verify_hash_page() to verify if it is a hash page or not and No, we assume that the page is a valid hash page structure, since there is an explicit cast w/o any further checks. then calling GetHashPageStatistics() to get the page stats wherein you are trying to check what is the type of hash page. Below is the code snippet for this, + /* page type (flags) */ + if (opaque->hasho_flag & LH_META_PAGE) + stat->type = 'm'; + else if (opaque->hasho_flag & LH_OVERFLOW_PAGE) + stat->type = 'v'; + else if (opaque->hasho_flag & LH_BUCKET_PAGE) + stat->type = 'b'; + else if (opaque-
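Once the metap safeguard discussed here landed in v5, both misuse cases turn into clean errors instead of crashes (a sketch; message wording per verify_hash_page in the patch above):

SELECT * FROM hash_page_stats(get_raw_page('hash_i4_index', 0));
-- ERROR:  page is a metadata page
SELECT * FROM hash_metap(get_raw_page('hash_i4_index', 1));
-- ERROR:  page is not a metadata page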
Re: [HACKERS] Write Ahead Logging for Hash Indexes
On 09/16/2016 02:42 AM, Amit Kapila wrote: Okay, thanks for pointing out the same. I have fixed it. Apart from that, I have changed _hash_alloc_buckets() to initialize the page instead of making it completely zero, because of problems discussed in another related thread [1]. I have also updated the README. Thanks. This needs a rebase against the CHI v8 [1] patch. [1] https://www.postgresql.org/message-id/CAA4eK1+X=8sud1uczdzne3d9cgi9kw+kjxp2tnw7sx5w8pl...@mail.gmail.com Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi, On 09/20/2016 01:46 PM, Ashutosh Sharma wrote: I am getting a fatal error along with some warnings when trying to apply the v3 patch using the "git apply" command. [ashu@localhost postgresql]$ git apply 0001-pageinspect-Hash-index-support_v3.patch 0001-pageinspect-Hash-index-support_v3.patch:29: trailing whitespace. brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) 0001-pageinspect-Hash-index-support_v3.patch:36: trailing whitespace. DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ 0001-pageinspect-Hash-index-support_v3.patch:37: trailing whitespace. pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ 0001-pageinspect-Hash-index-support_v3.patch:38: trailing whitespace. pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ 0001-pageinspect-Hash-index-support_v3.patch:39: trailing whitespace. pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql fatal: git apply: bad git-diff - expected /dev/null on line 46 Which platform are you on ? The development branch is @ https://github.com/jesperpedersen/postgres/tree/pageinspect_hash Also, I think the USAGE for hash_metap() is referring to hash_metap_bytea(). Please correct it. I have just started reviewing the patch, and will keep posting my comments upon further review. Fixed in v4. Thanks for the review. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
On 09/20/2016 12:45 PM, Jeff Janes wrote: Is the 2nd "1" in this call needed? SELECT * FROM hash_page_stats(get_raw_page('mytab_index', 1), 1) As far as I can tell, that argument is only used to stuff into the output field "blkno", it is not used to instruct the interpretation of the raw page itself. It doesn't seem worthwhile to have the parameter that only echos back to the user what the user already put in (twice). The only existing funtions which take the blkno argument are those that don't use the get_raw_page style. Also, should we document what the single letter values mean in the hash_page_stats.type column? It is not obvious that 'i' means bitmap, for example. Adjusted in v4. Code/doc will need an update once the CHI patch goes in. Best regards, Jesper >From 1f27a2bb28cc6dfea9cba015d7cceab768f67d0a Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 5 Aug 2016 10:16:32 -0400 Subject: [PATCH] pageinspect: Hash index support --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 403 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 63 contrib/pageinspect/pageinspect--1.5.sql | 279 -- contrib/pageinspect/pageinspect--1.6.sql | 338 + contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 104 +++ 7 files changed, 914 insertions(+), 285 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql create mode 100644 contrib/pageinspect/pageinspect--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index a98237e..c73d728 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" ifdef USE_PGXS diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..6efbf22 --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,403 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "funcapi.h" +#include "miscadmin.h" + +PG_FUNCTION_INFO_V1(hash_metap); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_page_stats); + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 max_avail; + uint32 free_size; + uint32 avg_item_size; + char type; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPageStat; + + +/* + * Verify that the given bytea contains a HASH page, or die in the 
attempt. + * A pointer to the page is returned. + */ +static Page +verify_hash_page(bytea *raw_page) +{ + Page page; + int raw_page_size; + HashPageOpaque pageopaque; + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (raw_page_size != BLCKSZ) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small"), + errdetail("Expected size %d, got %d", + BLCKSZ, raw_page_size))); + + page = VARDATA(raw_page); + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + if (pageopaque->hasho_page_id != HASHO_PAGE_ID) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a HASH page"), + errdetail("Expected %08x, got %08x.", + HASHO_PAGE_ID, pageopaque->hasho_page_id))); + + return page; +} + +/* - + * GetHashPageStatistics() + * + * Collect statistics of single hash page + * - + */ +static void +GetHashPageStatistics(Page page, HashPageStat *stat) +{ + PageHeader phdr = (PageHeader) page; + OffsetNumber maxoff = PageGetMaxOffsetNumber(pa
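With the echoed block number dropped in v4, the call reduces to (sketch):

-- the page to inspect is identified solely by get_raw_page
SELECT * FROM hash_page_stats(get_raw_page('mytab_index', 1));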
Re: [HACKERS] pageinspect: Hash index support
On 09/20/2016 03:19 AM, Michael Paquier wrote: You did not get right the comments from Alvaro upthread. The following functions are added with this patch: function hash_metap(text) function hash_metap_bytea(bytea) function hash_page_items(text,integer) function hash_page_items_bytea(bytea) function hash_page_stats(text,integer) function hash_page_stats_bytea(bytea,integer) Now the following set of functions would be sufficient: function hash_metapage_info(bytea) function hash_page_items(bytea) function hash_page_stats(bytea) The last time pageinspect has been updated, when BRIN functions have been added, it has been discussed to just use (bytea) as an argument interface and just rely on get_raw_page() to get the pages wanted, so I think that we had better stick with that and keep things simple. Yes, I know, Alvaro and you voted for the bytea methods, and Jeff asked for both. Attached is v3 with only the bytea based methods. Alvaro, Michael and Jeff - Thanks for the review ! Best regards, Jesper >From 0aff82ccb40f00efe9e48cacef9c8a45c1327da2 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 5 Aug 2016 10:16:32 -0400 Subject: [PATCH] pageinspect: Hash index support --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 408 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 64 contrib/pageinspect/pageinspect--1.5.sql | 279 -- contrib/pageinspect/pageinspect--1.6.sql | 339 + contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 100 +++ 7 files changed, 917 insertions(+), 285 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql create mode 100644 contrib/pageinspect/pageinspect--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index a98237e..c73d728 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" ifdef USE_PGXS diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..4d2cd2a --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,408 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "funcapi.h" +#include "miscadmin.h" + +PG_FUNCTION_INFO_V1(hash_metap); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_page_stats); + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 blkno; + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 max_avail; + uint32 free_size; + uint32 avg_item_size; + char 
type; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPageStat; + + +/* + * Verify that the given bytea contains a HASH page, or die in the attempt. + * A pointer to the page is returned. + */ +static Page +verify_hash_page(bytea *raw_page) +{ + Page page; + int raw_page_size; + HashPageOpaque pageopaque; + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (raw_page_size != BLCKSZ) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small"), + errdetail("Expected size %d, got %d", + BLCKSZ, raw_page_size))); + + page = VARDATA(raw_page); + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + if (pageopaque->hasho_page_id != HASHO_PAGE_ID) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("page is not a HASH page"), + errdetail("Expected %08x, got %08x.", + HASHO_PAGE_ID, pageopaque->hasho_page_id))); + + return page; +} + +/* - + * GetHashPageStatistics() + *
Re: [HACKERS] pageinspect: Hash index support
On 09/14/2016 04:21 PM, Jeff Janes wrote: I suggest that pageinspect functions are more convenient to use via the get_raw_page interface, that is, instead of reading the buffer themselves, the buffer is handed over from elsewhere and they receive it as bytea. This enables other use cases such as grabbing a page from one server and examining it in another one. I've never needed to do that, but it does sound like it might be useful. But it is also annoying to have to do that when you want to examine a current server. Could we use overloading, so that it can be used in either way? For hash_page_items, the 'data' is always a 32 bit integer, isn't it? (unlike other pageinspect features, where the data could be any user-defined data type). If so, I'd rather see it in plain hex (without the spaces, without the trailing zeros) + /* page type (flags) */ + if (opaque->hasho_flag & LH_UNUSED_PAGE) + stat->type = 'u'; This can never be true, because LH_UNUSED_PAGE is zero. You should make this be the fall-through case, and have LH_META_PAGE be explicitly tested for. This version adds the overloaded get_raw_page based methods. However, I'm not verifying the page other than page_id, as hasho_flag can contain multiple flags in Amit's patches. And, I don't see having different page ids as a benefit - at least not part of this patch. We should probably just have one of the method sets, so I'll send a v3 with the set voted for. I kept the 'data' field as is, for now. Best regards, Jesper >From 0dc44e4b3cc1d31c53684a41fbd7959978e69089 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 5 Aug 2016 10:16:32 -0400 Subject: [PATCH] pageinspect: Hash index support --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 766 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 111 contrib/pageinspect/pageinspect--1.5.sql | 279 -- contrib/pageinspect/pageinspect--1.6.sql | 386 + contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 103 7 files changed, 1372 insertions(+), 285 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql create mode 100644 contrib/pageinspect/pageinspect--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index a98237e..c73d728 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" ifdef USE_PGXS diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..380203c --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,766 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * 
contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/htup_details.h" +#include "catalog/namespace.h" +#include "catalog/pg_am.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "utils/builtins.h" +#include "utils/rel.h" + + +PG_FUNCTION_INFO_V1(hash_metap_bytea); +PG_FUNCTION_INFO_V1(hash_metap); +PG_FUNCTION_INFO_V1(hash_page_items_bytea); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_page_stats_bytea); +PG_FUNCTION_INFO_V1(hash_page_stats); + +#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX) +#define IS_HASH(r) ((r)->rd_rel->relam == HASH_AM_OID) + +/* note: BlockNumber is unsigned, hence can't be negative */ +#define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \ + if ( RelationGetNumberOfBlocks(rel) <= (BlockNumber) (blkno) ) \ + elog(ERROR, "block number out of range"); } + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 blkno; + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 max_avail; + uint32 free_size; + uint32 avg_item_size; + char type; + + /* opaque data */ + B
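Since, as Jeff notes, the 'data' column is effectively the 32-bit hash code of the indexed value, it can be cross-checked against the type's hash function (a sketch using the bytea interface the thread settled on, and the hash_i4_index name from the tests elsewhere in this thread):

-- expected hash code for a given key
SELECT hashint4(42);
-- compare against the data column of the bucket page holding that key
-- (note: hashint4 returns a signed integer, while the page item shows
-- the raw stored representation)
SELECT * FROM hash_page_items(get_raw_page('hash_i4_index', 1));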
Re: [HACKERS] Hash Indexes
On 09/16/2016 03:18 AM, Amit Kapila wrote: Attached is a run with 1000 rows. I think 1000 is still too few; you probably want to run it with 100,000 or more rows. I suspect that the reason why you are seeing the large difference between btree and hash index is that the range of values is narrow and there may be many overflow pages. Attached is 100,000. I think for CHI it would be Robert's and others' feedback. For WAL, there is [1]. I have fixed your feedback for WAL and posted the patch. Thanks ! I think the remaining thing to handle for the Concurrent Hash Index patch is to remove the usage of hashscan.c from the code if no one objects to it; do let me know if I am missing something here. Like Robert said, hashscan.c can always come back, and it would take a call-stack out of the 'am' methods. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
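The narrow-range effect described above is easy to reproduce (a sketch; with only ten distinct values, each bucket chains through many overflow pages and the build slows down accordingly):

CREATE TABLE narrow AS
  SELECT s AS id, s % 10 AS val FROM generate_series(1, 100000) AS s;
CREATE INDEX narrow_hash ON narrow USING hash (val);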
Re: [HACKERS] Hash Indexes
On 09/15/2016 02:03 AM, Amit Kapila wrote: Same thing here - where the fields involving the hash index aren't updated. Do you mean that for such cases you also see a 40-60% gain? No, UPDATEs are around 10-20% for our cases. I have done a run to look at the concurrency / TPS aspect of the implementation - to try something different from Mark's work on testing the pgbench setup. With definitions as above, with SELECT as -- select.sql -- \set id random(1,10) BEGIN; SELECT * FROM test WHERE id = :id; COMMIT; and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed w/o one. [1] [2] [3] new_hash is the patched version; old_hash is the existing hash implementation on master. btree is master too. The machine is a 28C/56T with 256Gb RAM and 2 x RAID10 SSD for data + wal. Clients ran with -M prepared. [1] https://www.postgresql.org/message-id/caa4ek1+erbp+7mdkkahjzwq_dtdkocbpt7lswfwcqvuhbxz...@mail.gmail.com [2] https://www.postgresql.org/message-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1wp5vylvk...@mail.gmail.com [3] https://www.postgresql.org/message-id/caa4ek1juyr_ab7bxfnsg5+jqhiwgklkgacfk9bfd4mlffk6...@mail.gmail.com I don't know if you find this useful due to the small number of rows, but let me know if there are other tests I can run, f.ex. bump the number of rows. It might be useful to test with a higher number of rows, because with so little data contention is not visible. Attached is a run with 1000 rows. But I think in general, with your, Jeff's and my own tests, it is clear that there is a significant win for read-only cases and for read-write cases where the index column is not updated. Also, we don't find any regression as compared to HEAD, which is sufficient to prove the worth of the patch. Very much agreed. I think we should not forget that one of the other main reasons for this patch is to allow WAL logging for hash indexes. Absolutely. There are scenarios that will benefit from switching to a hash index. I think for now we have done sufficient tests for this patch to ensure its benefit; now if any committer wants to see something more we can surely do it. Ok. I think the important thing at this stage is to find out what more (if anything) is left to make this patch "ready for committer". I think for CHI it would be Robert's and others' feedback. For WAL, there is [1]. [1] https://www.postgresql.org/message-id/5f8b4681-1229-92f4-4315-57d780d9c128%40redhat.com Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
Hi, On 09/14/2016 07:24 AM, Amit Kapila wrote: On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen wrote: On 09/13/2016 07:26 AM, Amit Kapila wrote: Attached, new version of patch which contains the fix for problem reported on write-ahead-log of hash index thread [1]. I have been testing patch in various scenarios, and it has a positive performance impact in some cases. This is especially seen in cases where the values of the indexed column are unique - SELECTs can see a 40-60% benefit over a similar query using b-tree. Here, I think it is better if we have the data comparing the situation of hash index with respect to HEAD as well. What I mean to say is that you are claiming that after the hash index improvements SELECT workload is 40-60% better, but where do we stand as of HEAD? The tests I have done are with a copy of a production database using the same queries sent with a b-tree index for the primary key, and the same with a hash index. Those are seeing a speed-up of the mentioned 40-60% in execution time - some involve JOINs. Largest of those tables is 390Mb with a CHAR() based primary key. UPDATE also sees an improvement. Can you explain this more? Is it more compare to HEAD or more as compare to Btree? Isn't this contradictory to what the test in below mail shows? Same thing here - where the fields involving the hash index aren't updated. In cases where the indexed column value isn't unique, it takes a long time to build the index due to the overflow page creation. Also in cases where the index column is updated with a high number of clients, ala -- ddl.sql -- CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val; CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id); CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val); ANALYZE; -- test.sql -- \set id random(1,10) \set val random(0,10) BEGIN; UPDATE test SET val = :val WHERE id = :id; COMMIT; w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for hash, and 10062 tps for b-tree). Thanks for doing the tests. Have you applied both concurrent index and cache the meta page patch for these tests? So from above tests, we can say that after these set of patches read-only workloads will be significantly improved even better than btree in quite-a-few useful cases. Agreed. However when the indexed column is updated, there is still a large gap as compare to btree (what about the case when the indexed column is not updated in read-write transaction as in our pgbench read-write transactions, by any chance did you ran any such test?). I have done a run to look at the concurrency / TPS aspect of the implementation - to try something different than Mark's work on testing the pgbench setup. With definitions as above, with SELECT as -- select.sql -- \set id random(1,10) BEGIN; SELECT * FROM test WHERE id = :id; COMMIT; and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed w/o one. [1] [2] [3] is new_hash - old_hash is the existing hash implementation on master. btree is master too. Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal. Clients ran with -M prepared. 
[1] https://www.postgresql.org/message-id/caa4ek1+erbp+7mdkkahjzwq_dtdkocbpt7lswfwcqvuhbxz...@mail.gmail.com [2] https://www.postgresql.org/message-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1wp5vylvk...@mail.gmail.com [3] https://www.postgresql.org/message-id/caa4ek1juyr_ab7bxfnsg5+jqhiwgklkgacfk9bfd4mlffk6...@mail.gmail.com I don't know if you find this useful due to the small number of rows, but let me know if there are other tests I can run, f.ex. bump the number of rows. I think we need to focus on improving cases where index columns are updated, but it is better to do that work as a separate patch. Ok. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On 09/13/2016 07:26 AM, Amit Kapila wrote: Attached is a new version of the patch, which contains the fix for the problem reported on the write-ahead-log of hash index thread [1]. I have been testing the patch in various scenarios, and it has a positive performance impact in some cases. This is especially seen in cases where the values of the indexed column are unique - SELECTs can see a 40-60% benefit over a similar query using b-tree. UPDATE also sees an improvement. In cases where the indexed column value isn't unique, it takes a long time to build the index due to the overflow page creation. Also, in cases where the index column is updated with a high number of clients, a la -- ddl.sql -- CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val; CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id); CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val); ANALYZE; -- test.sql -- \set id random(1,10) \set val random(0,10) BEGIN; UPDATE test SET val = :val WHERE id = :id; COMMIT; w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for hash, and 10062 tps for b-tree). Jeff mentioned upthread the idea of moving the lock to a bucket meta page instead of having it on the main meta page. Likely a question for the assigned committer. Thanks for working on this ! Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
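For comparison, the b-tree numbers quoted above presumably come from the same table and workload with just the access method swapped (a hypothetical counterpart of ddl.sql):

-- ddl-btree.sql (hypothetical) --
CREATE INDEX IF NOT EXISTS idx_id ON test USING btree (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING btree (val);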
Re: [HACKERS] Write Ahead Logging for Hash Indexes
On 09/13/2016 07:41 AM, Amit Kapila wrote: README: +in_complete split flag. The reader algorithm works correctly, as it will scan What flag ? The in-complete-split flag, which indicates that the split has to be finished for that particular bucket. The values of these flags are LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and old bucket respectively. It is set in hasho_flag in the special area of the page. I have slightly expanded the definition in the README, but I am not sure if it is a good idea to mention flags defined in hash.h. Let me know if it is still unclear or you want something additional to be added to the README. I think it is better now. hashxlog.c: hash_xlog_move_page_contents hash_xlog_squeeze_page Both have "bukcetbuf" (-> "bucketbuf"), and + if (BufferIsValid(bukcetbuf)); -> + if (BufferIsValid(bucketbuf)) and indent the following line: LockBufferForCleanup(bukcetbuf); hash_xlog_delete has the "if" issue too. Fixed all the above cosmetic issues. I meant there is an extra ';' on the "if" statements: + if (BufferIsValid(bukcetbuf)); <-- in hash_xlog_move_page_contents, hash_xlog_squeeze_page and hash_xlog_delete. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On 09/12/2016 10:42 PM, Amit Kapila wrote: The following script hangs on idx_val creation - just with v5, WAL patch not applied. Are you sure it is actually hanging? I see 100% CPU for a few minutes, but the index eventually completes ok for me (v5 patch applied to today's master). It completed for me as well. The second index creation is taking more time and CPU, because it is just inserting duplicate values, which need a lot of overflow pages. Yeah, sorry for the false alarm. It just took 3m45s to complete on my machine. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi, On 09/07/2016 05:58 AM, Amit Kapila wrote: Okay, I have fixed this issue as explained above. Apart from that, I have fixed another issue reported by Mark Kirkwood upthread and a few other issues found during internal testing by Ashutosh Sharma. The locking issue reported by Mark and Ashutosh is that the patch didn't maintain the locking order while adding an overflow page as it maintains in other write operations (lock the bucket pages first and then the metapage to perform the write operation). I have added comments in _hash_addovflpage() to explain the locking order used in the modified patch. During stress testing with pgbench using a master-standby setup, we found an issue which indicates that the WAL replay machinery doesn't expect completely zeroed pages (see the explanation of RBM_NORMAL mode atop XLogReadBufferExtended). Previously, before freeing the overflow page we were zeroing it; now I have changed it to just initialize the page such that the page will be empty. Apart from the above, I have added support for old snapshot threshold in the hash index code. Thanks to Ashutosh Sharma for doing the testing of the patch and helping me in analyzing some of the above issues. I forgot to mention in my initial mail that Robert and I had some off-list discussions about the design of this patch, many thanks to him for providing inputs. Some initial feedback. README: +in_complete split flag. The reader algorithm works correctly, as it will scan What flag ? hashxlog.c: hash_xlog_move_page_contents hash_xlog_squeeze_page Both have "bukcetbuf" (-> "bucketbuf"), and + if (BufferIsValid(bukcetbuf)); -> + if (BufferIsValid(bucketbuf)) and indent the following line: LockBufferForCleanup(bukcetbuf); hash_xlog_delete has the "if" issue too. hash.h: Move the XLog related content to hash_xlog.h Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On 09/01/2016 11:55 PM, Amit Kapila wrote: I have fixed all other issues you have raised. Updated patch is attached with this mail. The following script hangs on idx_val creation - just with v5, WAL patch not applied. Best regards, Jesper zero.sql Description: application/sql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cache Hash Index meta page.
On 09/05/2016 02:50 PM, Mithun Cy wrote: On Sep 2, 2016 7:38 PM, "Jesper Pedersen" wrote:

Could you provide a rebased patch based on Amit's v5 ?

Please find the patch, based on Amit's v5. I have fixed the following things:

1. In "_hash_first" we now check if (opaque->hasho_prevblkno == InvalidBlockNumber) to see if the bucket is from an older-version hash index that has been upgraded. Since, as of now, InvalidBlockNumber is one greater than the maximum value "metap->hashm_maxbucket" can be set to (see _hash_expandtable), we can distinguish it from the rest. I tested the upgrade issue reported by Amit; it is fixed now.

2. One case in which a bucket's hasho_prevblkno is used is the backward scan. So now, before testing for the previous block number, I test whether the current page is a bucket page, and if so we end the bucket scan (see changes in _hash_readprev). In the other places where hasho_prevblkno is used it is not for a bucket page, so I have not put in any extra check to verify whether it is a bucket page.

I think that the

+ pageopaque->hasho_prevblkno = metap->hashm_maxbucket;

trick should be documented in the README, as hashm_maxbucket is defined as uint32 whereas hasho_prevblkno is a BlockNumber. (All bucket variables should probably use the Bucket definition instead of uint32.)

For the archives, this patch conflicts with the WAL patch [1].

[1] https://www.postgresql.org/message-id/CAA4eK1JS%2BSiRSQBzEFpnsSmxZKingrRH7WNyWULJeEJSj1-%3D0w%40mail.gmail.com

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
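For the archives, a condensed sketch of the check described in 1. and in the earlier mail. The function and the cached_maxbucket parameter are hypothetical; only the field names come from the hash AM:

#include "postgres.h"
#include "access/hash.h"

/* Illustrative only: can the cached metapage be trusted for this
 * bucket page? */
static bool
cached_metap_is_valid(Page page, uint32 cached_maxbucket)
{
    HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

    /* A pre-upgrade bucket page never had maxbucket stamped into
     * hasho_prevblkno; InvalidBlockNumber is greater than any value
     * hashm_maxbucket can take, so the two cases cannot collide. */
    if (opaque->hasho_prevblkno == InvalidBlockNumber)
        return false;

    /* The bucket has not been split since the metapage was cached. */
    return cached_maxbucket >= opaque->hasho_prevblkno;
}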
Re: [HACKERS] Cache Hash Index meta page.
On 07/22/2016 06:02 AM, Mithun Cy wrote:

I have created a patch to cache the meta page of a hash index in backend-private memory. This saves reading the meta page buffer every time we want to find the bucket page; in the "_hash_first" call, we currently read the meta page buffer twice just to make sure the bucket was not split after we found the bucket page. With this patch the meta page buffer read is skipped if the bucket has not been split after caching the meta page.

The idea is to cache the meta page data in rd_amcache and store the maxbucket number in hasho_prevblkno of the bucket's primary page (which would otherwise always be invalid, so it is reused here for this purpose). When we do a hash lookup for the bucket page, if the locally cached maxbucket number is greater than or equal to the bucket page's maxbucket number, then we know the bucket has not been split after we cached the meta page, and we can avoid reading the meta page buffer.

I have attached the benchmark results and perf stats (refer to hash_index_perf_stat_and_benchmarking.odc [sheet 1: perf stats; sheet 2: benchmark results]). There we can see improvements at higher client counts, as lwlock contention due to the buffer reads grows with more clients. If I apply the same patch on top of Amit's concurrent hash index patch [1], we can see improvements at lower client counts also; Amit's patch removed a heavyweight page lock which was the bottleneck at lower client counts.

Could you provide a rebased patch based on Amit's v5 ?

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
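A rough sketch of the caching side as described above, assuming the usual rd_amcache conventions; the helper name is hypothetical and this is not the submitted patch:

#include "postgres.h"
#include "access/hash.h"
#include "utils/memutils.h"
#include "utils/rel.h"

/* Illustrative only: copy the metapage into relcache-private memory
 * on first use, so later lookups can consult the copy instead of
 * re-reading the metapage buffer. */
static HashMetaPage
get_cached_metap(Relation rel, Buffer metabuf)
{
    if (rel->rd_amcache == NULL)
    {
        HashMetaPage metap = HashPageGetMeta(BufferGetPage(metabuf));

        rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
                                             sizeof(HashMetaPageData));
        memcpy(rel->rd_amcache, metap, sizeof(HashMetaPageData));
    }

    return (HashMetaPage) rel->rd_amcache;
}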
Re: [HACKERS] Hash Indexes
On 08/05/2016 07:36 AM, Amit Kapila wrote: On Thu, Aug 4, 2016 at 8:02 PM, Mithun Cy wrote:

I did some basic testing of the same. In that I found one issue with cursors.

Thanks for the testing. The reason for the failure was that the patch didn't take into account the fact that for scrolling cursors, a scan can reacquire the lock and pin on the bucket buffer multiple times. I have fixed it such that we release the pin on the bucket buffers after we scan the last overflow page in the bucket. The attached patch fixes the issue for me, let me know if you still see the issue.

Needs a rebase.

hashinsert.c
+* reuse the space. There is no such apparent benefit from finsihing the -> finishing

hashpage.c
+ * retrun the buffer, else return InvalidBuffer. -> return
+ if (blkno == P_NEW) + elog(ERROR, "hash AM does not use P_NEW"); Left over ?
+ * for unlocking it. -> for unlocking them.

hashsearch.c
+* bucket, but not pin, then acuire the lock on new bucket and again -> acquire

hashutil.c
+ * half. It is mainly required to finsh the incomplete splits where we are -> finish

Ran some tests on a CHAR() based column which showed good results. Will have to compare with a run with the WAL patch applied.

make check-world passes.

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pageinspect: Hash index support
Hi, Attached is a patch that adds support for hash indexes in pageinspect. The functionality (hash_metap, hash_page_stats and hash_page_items) follows the B-tree functions. This patch will need an update once Amit's and Mithun's work on Concurrent Hash Indexes is committed to account for the new meta-page constants. I'll create a CommitFest entry for this submission. Feedback is most welcome. Best regards, Jesper >From 55262d5fa3822afbae94f4989627dd65e91fe098 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 5 Aug 2016 10:16:32 -0400 Subject: [PATCH] pageinspect: Hash index support --- contrib/pageinspect/Makefile | 10 +- contrib/pageinspect/hashfuncs.c | 472 ++ contrib/pageinspect/pageinspect--1.5--1.6.sql | 64 contrib/pageinspect/pageinspect--1.5.sql | 279 --- contrib/pageinspect/pageinspect--1.6.sql | 339 ++ contrib/pageinspect/pageinspect.control | 2 +- doc/src/sgml/pageinspect.sgml | 100 ++ 7 files changed, 981 insertions(+), 285 deletions(-) create mode 100644 contrib/pageinspect/hashfuncs.c create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql create mode 100644 contrib/pageinspect/pageinspect--1.6.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index a98237e..c73d728 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,13 +2,13 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \ - pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ - pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ - pageinspect--unpackaged--1.0.sql +DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \ + pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ + pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ + pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" ifdef USE_PGXS diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c new file mode 100644 index 000..598b7b2 --- /dev/null +++ b/contrib/pageinspect/hashfuncs.c @@ -0,0 +1,472 @@ +/* + * hashfuncs.c + * Functions to investigate the content of HASH indexes + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/hashfuncs.c + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/htup_details.h" +#include "catalog/namespace.h" +#include "catalog/pg_am.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "utils/builtins.h" +#include "utils/rel.h" + + +PG_FUNCTION_INFO_V1(hash_metap); +PG_FUNCTION_INFO_V1(hash_page_items); +PG_FUNCTION_INFO_V1(hash_page_stats); + +#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX) +#define IS_HASH(r) ((r)->rd_rel->relam == HASH_AM_OID) + +/* note: BlockNumber is unsigned, hence can't be negative */ +#define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \ + if ( RelationGetNumberOfBlocks(rel) <= (BlockNumber) (blkno) ) \ + elog(ERROR, "block number out of range"); } + +/* + * structure for single hash page statistics + * + */ +typedef struct HashPageStat +{ + uint32 blkno; + uint32 live_items; + uint32 dead_items; + uint32 page_size; + uint32 max_avail; + uint32 free_size; + uint32 avg_item_size; + char type; + + /* opaque data */ + BlockNumber hasho_prevblkno; + BlockNumber hasho_nextblkno; + 
Bucket hasho_bucket; + uint16 hasho_flag; + uint16 hasho_page_id; +} HashPageStat; + + +/* - + * GetHashPageStatistics() + * + * Collect statistics of single hash page + * - + */ +static void +GetHashPageStatistics(BlockNumber blkno, Buffer buffer, HashPageStat *stat) +{ + Page page = BufferGetPage(buffer); + PageHeader phdr = (PageHeader) page; + OffsetNumber maxoff = PageGetMaxOffsetNumber(page); + HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page); + int item_size = 0; + int off; + + stat->blkno = blkno; + + stat->max_avail = BLCKSZ - (BLCKSZ - phdr->pd_special + SizeOfPageHeaderData); + + stat->dead_items = stat->live_items = 0; + + stat->page_size = PageGetPageSize(page); + + /* page type (flags) */ + if (opaque->hasho_flag & LH_UNUSED_PAGE) + stat->type = 'u'; + else if (opaque->hasho_flag & LH_OVERFLOW_PAGE) + stat->type = 'v'; + else if (opaque->hasho_flag & LH_BUCKET_PAGE) + stat->type = 'b'; + else if (opaque->hasho_flag & LH_BITMAP_PAGE) + stat->type = 'i'; + else + stat->type = 'm'; + + /* hash page opaqu
Re: [HACKERS] Speedup twophase transactions
Hi,

On 04/13/2016 10:31 AM, Stas Kelvich wrote: On 13 Apr 2016, at 01:04, Michael Paquier wrote:

That's too late for 9.6 unfortunately, don't forget to add that in the next CF!

Fixed patch attached. There already was infrastructure to skip currently held locks during replay of standby_redo(), and I've extended that with a check for prepared xids.

The reason why I'm still active on this thread is that I see real problems in deploying 9.6 in its current state. Let me stress my concern: the current state of things _WILL_BREAK_ async replication under a substantial load of two-phase transactions on the master. And a lot of J2EE apps fall into that category, as they wrap most of their transactions in prepare/commit. The slave server will just keep increasing its lag and will never catch up. It is possible to deal with that by switching to synchronous replication or by inserting triggers with pg_sleep on the master, but that doesn't look like normal system behaviour.

Having discussed this with Noah off-list, I think we should revisit this for 9.6 due to the async replica lag as shown in [1]. The performance improvement for the master node is shown in [2].

As I see it there are 3 options to resolve this (in one way or another):

* Leave as is, and document the behaviour in the release notes/documentation
* Apply the patch for slaves
* Revert the changes done to twophase.c during the 9.6 cycle

All have pros/cons for the release. The latest slave patch by Stas is at [3]. Thoughts from others on the matter would be appreciated.

[1] http://www.postgresql.org/message-id/e7497864-de11-4099-83f5-89fb97af5...@postgrespro.ru
[2] http://www.postgresql.org/message-id/5693f703.3000...@redhat.com
[3] https://commitfest.postgresql.org/10/523/

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speedup twophase transactions
On 04/08/2016 02:37 PM, Robert Haas wrote: On Fri, Apr 8, 2016 at 8:49 AM, Jesper Pedersen wrote: On 04/07/2016 02:29 AM, Michael Paquier wrote:

So recovery is conflicting here. My guess is that this patch is missing some lock cleanup.

With the test case attached in my case the COMMIT PREPARED record does not even get replayed.

Should we create an entry for the open item list [0] for this, due to the replication lag [1] ? CommitFest entry [2] Original commit [3] Cc'ed RMT.

If there is something you think needs to be fixed that is a new issue in 9.6, then yes you should. I don't quite understand what that thing is from reading this, so please make sure to describe it clearly.

Michael, you seem to have the necessary permission for this. Could you add an entry ?

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speedup twophase transactions
On 04/08/2016 02:42 PM, Robert Haas wrote: On Tue, Jan 26, 2016 at 7:43 AM, Stas Kelvich wrote: Hi, Thanks for reviews and commit! I apologize for being clueless here, but was this patch committed? It's still marked as "Needs Review" in the CommitFest application. There are 2 parts to this - both in the same email thread. Part 1 [0] dealt with 2-phase commits on the master node. Part 2 [1] deals with replaying on slaves, which currently shows lag. There is still an open item found by Michael, so part 2 isn't ready to be moved to "Ready for Committer" yet. [0] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=978b2f65aa1262eb4ecbf8b3785cb1b9cf4db78e [1] https://commitfest.postgresql.org/9/523/ Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speedup twophase transactions
On 04/07/2016 02:29 AM, Michael Paquier wrote: So recovery is conflicting here. My guess is that this patch is missing some lock cleanup. With the test case attached in my case the COMMIT PREPARED record does not even get replayed. Should we create an entry for the open item list [0] for this, due to the replication lag [1] ? CommitFest entry [2] Original commit [3] Cc'ed RMT. [0] https://wiki.postgresql.org/wiki/PostgreSQL_9.6_Open_Items [1] http://www.postgresql.org/message-id/e7497864-de11-4099-83f5-89fb97af5...@postgrespro.ru [2] https://commitfest.postgresql.org/9/523/ [3] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=978b2f65aa1262eb4ecbf8b3785cb1b9cf4db78e Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 04/01/2016 04:39 PM, Andres Freund wrote: On April 1, 2016 10:25:51 PM GMT+02:00, Jesper Pedersen wrote: Hi, On 03/31/2016 06:21 PM, Andres Freund wrote: On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen wrote:

I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.

Yes please. I think the lock variant is realistic, the lockless one isn't.

I have done a run with -M prepared on unlogged tables, running 10min per data point, up to 300 connections. Using data + wal on HDD. I'm not seeing a difference between with and without USE_CONTENT_LOCK -- all points are within +/- 0.5%. Let me know if there are other tests I can perform.

How do either compare to just 0002 applied?

0001 + 0002 compared to 0001 + 0002 + 0003 (either way) were pretty much the same +/- 0.5% on the HDD run.

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
Hi,

On 03/31/2016 06:21 PM, Andres Freund wrote: On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen wrote:

I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.

Yes please. I think the lock variant is realistic, the lockless one isn't.

I have done a run with -M prepared on unlogged tables, running 10min per data point, up to 300 connections. Using data + wal on HDD. I'm not seeing a difference between with and without USE_CONTENT_LOCK -- all points are within +/- 0.5%.

Let me know if there are other tests I can perform.

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
Hi,

On 03/30/2016 07:09 PM, Andres Freund wrote:

Yes. That looks good.

My testing shows that increasing the number of buffers can both increase throughput and reduce latency variance. The former is a smaller effect with one of the discussed patches applied; the latter seems to actually increase in scale (with increased throughput).

I've attached patches to:
0001: Increase the max number of clog buffers
0002: Implement 64bit atomics fallback and optimize read/write
0003: Edited version of Simon's clog scalability patch

WRT 0003 - still clearly WIP - I've:
- made group_lsn pg_atomic_u64*, to allow for tear-free reads
- split the content lock from the IO lock
- made SimpleLruReadPage_optShared always return with only the share lock held
- implemented a different, experimental concurrency model for SetStatusBit using cmpxchg. A define USE_CONTENT_LOCK controls which variant is used.

I've tested this and saw it outperform Amit's approach, especially when using a read/write mix rather than only reads. I saw over a 30% increase on a large EC2 instance with -btpcb-like@1 -bselect-only@3. But that's in a virtualized environment, not very good for reproducibility.

Amit, could you run benchmarks on your bigger hardware? Both with USE_CONTENT_LOCK commented out and in?

I think we should go for 1) and 2) unconditionally. And then evaluate whether to go with your approach, or 3) from above. If the latter, we have some cleanup to do :)

I have been testing Amit's patch in various setups and workloads, with up to 400 connections on a 2 x Xeon E5-2683 (28C/56T @ 2 GHz), not seeing an improvement, but no regression either.

Testing with 0001 and 0002 does show up to a 5% improvement when using a HDD for data + wal - about 1% when using 2 x RAID10 SSD - unlogged.

I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.

Thanks for your work on this !

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
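For reference, the cmpxchg-based model boils down to a standard compare-and-swap retry loop. A generic sketch, not the actual 0003 code (the real code updates a two-bit transaction status inside a clog byte, and the function name here is hypothetical):

#include "postgres.h"
#include "port/atomics.h"

/* Illustrative only: set bits in a shared word without holding a
 * content lock, retrying until the CAS succeeds. */
static void
set_status_bits(volatile pg_atomic_uint32 *word, uint32 bits)
{
    uint32      oldval = pg_atomic_read_u32(word);
    uint32      newval;

    do
    {
        newval = oldval | bits;
        /* on failure, oldval is refreshed to the current value */
    } while (!pg_atomic_compare_exchange_u32(word, &oldval, newval));
}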
Re: [HACKERS] Speedup twophase transactions
On 03/30/2016 09:19 AM, Stas Kelvich wrote:

> +++ b/src/test/recovery/t/006_twophase.pl
> @@ -0,0 +1,226 @@
> +# Checks for recovery_min_apply_delay
> +use strict;
> This description is wrong, this file has been copied from 005.

Yep, done.

> +my $node_master = get_new_node("Candie");
> +my $node_slave = get_new_node('Django');
> Please let's use node names that are more descriptive.

Hm, it's hard to create descriptive names because the test changes master/slave roles for those nodes several times during the test. It's possible to call them "node1" and "node2" but I'm not sure that improves anything. But anyway I'm not insisting on those particular names and will agree with any reasonable suggestion.

> +# Switch to synchronous replication
> +$node_master->append_conf('postgresql.conf', qq(
> +synchronous_standby_names = '*'
> +));
> +$node_master->restart;
> Reloading would be fine.

Good catch, done.

> + /* During replay that lock isn't really necessary, but let's take
> it anyway */
> + LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
> + for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
> + {
> + gxact = TwoPhaseState->prepXacts[i];
> + proc = &ProcGlobal->allProcs[gxact->pgprocno];
> + pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
> +
> + if (TransactionIdEquals(xid, pgxact->xid))
> + {
> + gxact->locking_backend = MyBackendId;
> + MyLockedGxact = gxact;
> + break;
> + }
> + }
> + LWLockRelease(TwoPhaseStateLock);
> Not taking ProcArrayLock here?

All accesses to 2pc dummies in ProcArray are covered by TwoPhaseStateLock, so I think that's safe. Also, I've deleted the comment above that block; it's probably more confusing than descriptive.

> The comment at the top of XlogReadTwoPhaseData is incorrect.

Yep, fixed.

> RecoverPreparedFromXLOG and RecoverPreparedFromFiles have a lot of
> code in common, having this duplication is not good, and you could
> simplify your patch.

I reworked the patch to avoid duplicated code between RecoverPreparedFromXLOG/RecoverPreparedFromFiles and also between FinishPreparedTransaction/XlogRedoFinishPrepared.

Patch applies with hunks, and the test cases are passing.

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speedup twophase transactions
On 03/18/2016 12:50 PM, Stas Kelvich wrote: On 11 Mar 2016, at 19:41, Jesper Pedersen wrote:

Thanks for the review, Jesper. Some comments:

* The patch needs a rebase against the latest TwoPhaseFileHeader change

Done.

* Rework the check.sh script into a TAP test case (src/test/recovery), as suggested by Alvaro and Michael down thread

Done. Originally I thought about reducing the number of tests (11 right now), but now, after some debugging, I'm more convinced that it is better to include them all, as they are really testing different code paths.

* Add documentation for RecoverPreparedFromXLOG

Done.

Thanks, Stas. I have gone over this version, and tested with --enable-tap-tests + make check in src/test/recovery, which passes.

Simon, do you want to move this entry to "Ready for Committer" and take the 2REVIEWER section as part of that, or leave it in "Needs Review" and update the thread ?

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 03/15/2016 01:17 AM, Amit Kapila wrote:

I have updated the comments and changed the name of a variable from "all_trans_same_page" to "all_xact_same_page", as pointed out offlist by Alvaro.

I have done a run, and don't see any regressions. Intel Xeon 28C/56T @ 2GHz w/ 256GB + 2 x RAID10 (data + xlog) SSD. I can provide perf / flamegraph profiles if needed.

Thanks for working on this !

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers