Re: Detecting File Damage & Inconsistencies
On Thu, 12 Nov 2020 at 06:42, tsunakawa.ta...@fujitsu.com wrote:
> > From: Simon Riggs
> > I would like to propose a few points that will help us detect file
> > damage, inconsistencies in files and track actions of users.
>
> Hello, Simon san. Long time no see. I'm happy to see you be back here
> recently.

Thank you, happy to be back. It's good to have the time to contribute again.

> What kind of improvement do you expect? What problems would this make
> detectable?

If a rogue user/process is suspected, this would allow you to identify more easily the changes made by specific sessions/users.

> > * 2-byte pid (from MyProcPid)
>
> pid is 4 bytes on Windows. Isn't it also 4 byte on Linux when some kernel
> parameter is set to a certain value?

4 bytes is no problem, thanks for pointing that out.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: proposal: possibility to read dumped table's name from file
st 11. 11. 2020 v 16:17 odesílatel Alvaro Herrera napsal: > On 2020-Nov-11, Pavel Stehule wrote: > > > I think the proposed feature is very far to be the config file for > pg_dump > > (it implements a option "--filter"). This is not the target. It is not > > designed for this. This is just an alternative for options like -t, -T, > ... > > and I am sure so nobody will generate this file manually. Main target of > > this patch is eliminating problems with the max length of the command > line. > > So it is really not designed to be the config file for pg_dump. > > I agree that a replacement for existing command line arguments is a good > goal, but at the same time it's good to keep in mind the request that > more object types are supported as dumpable. While it's not necessary > that this infrastructure supports all object types in the first cut, > it'd be good to have it support that. I would propose that instead of a > single letter 't' etc we support keywords, maybe similar to those > returned by getObjectTypeDescription() (with additions -- for example > for "table data"). Then we can extend for other object types later > without struggling to find good letters for each. > > Of course we could allow abbrevations for common cases, such that "t" > means "table". > > For example: it'll be useful to support selective dumping of functions, > materialized views, foreign objects, etc. > Implementation of this is trivial. The hard work is mapping pg_dump options on database objects. t -> table is simple, but n -> schema looks a little bit inconsistent - although it is consistent with pg_dump. d or D - there is no system object like data. I am afraid so there are two independent systems - pg_dump options, and database objects, and it can be hard or not very practical to join these systems. Unfortunately there is not good consistency in the short options of pg_dump today. More - a lot of object names are multi words with inner space. This is not too practical. What about supporting two syntaxes? 1. first short current +-tndf filter - but the implementation should not be limited to one char - there can be any string until space 2. long syntax - all these pg_dump options has long options, and then we can extend this feature without any problem in future table|exclude-table|exclude-table-data|schema|exclude-schema|include-foreign-data=PATTERN so the content of filter file can looks like: +t mytable +t tabprefix* -t bigtable table=mytable2 exclude-table=mytable2 This format allows quick work for most common database objects, and it is extensible and consistent with pg_dump's long options. What do you think about it? Personally, I am thinking that it is over-engineering a little bit, maybe we can implement this feature just test first string after +- symbols (instead first char like now) - and any enhanced syntax can be implemented in future when there will be this requirement. Second syntax can be implemented very simply, because it can be identified by first char processing. We can implement second syntax only too. It will work too, but I think so short syntax is more practical for daily work (for common options). I expect so 99% percent of this objects will be "+t tablename". Regards Pavel
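For readability, the sample filter file Pavel gives inline above, laid out one entry per line (the first three lines use the short '+'/'-' syntax, the last two the proposed long keyword syntax):

```
+t mytable
+t tabprefix*
-t bigtable
table=mytable2
exclude-table=mytable2
```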
Re: Add statistics to pg_stat_wal view for wal related parameter tuning
On 2020/11/12 14:58, Fujii Masao wrote: On 2020/11/06 10:25, Masahiro Ikeda wrote: On 2020-10-30 11:50, Fujii Masao wrote: On 2020/10/29 17:03, Masahiro Ikeda wrote: Hi, Thanks for your comments and advice. I updated the patch. On 2020-10-21 18:03, Kyotaro Horiguchi wrote: At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda wrote in On 2020-10-20 12:46, Amit Kapila wrote: > I see that we also need to add extra code to capture these stats (some > of which is in performance-critical path especially in > XLogInsertRecord) which again makes me a bit uncomfortable. It might > be that it is all fine as it is very important to collect these stats > at cluster-level in spite that the same information can be gathered at > statement-level to help customers but I don't see a very strong case > for that in your proposal. We should avoid that duplication as possible even if the both number are important. Also about performance, I thought there are few impacts because it increments stats in memory. If I can implement to reuse pgWalUsage's value which already collects these stats, there is no impact in XLogInsertRecord. For example, how about pg_stat_wal() calculates the accumulated value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's value? I don't think that works, but it would work that pgstat_send_wal() takes the difference of that values between two successive calls. WalUsage prevWalUsage; ... pgstat_send_wal() { .. /* fill in some values using pgWalUsage */ WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes; WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records; WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi; ... pgstat_send(&WalStats, sizeof(WalStats)); /* remember the current numbers */ prevWalUsage = pgWalUsage; Thanks for Horiguchi-san's advice, I changed to reuse pgWalUsage which is already defined and eliminates the extra overhead. + /* fill in some values using pgWalUsage */ + WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes; + WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records; + WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi; It's better to use WalUsageAccumDiff() here? Yes, thanks. I fixed it. prevWalUsage needs to be initialized with pgWalUsage? + if (AmWalWriterProcess()){ + WalStats.m_wal_write_walwriter++; + } + else + { + WalStats.m_wal_write_backend++; + } I think that it's better not to separate m_wal_write_xxx into two for walwriter and other processes. Instead, we can use one m_wal_write_xxx counter and make pgstat_send_wal() send also the process type to the stats collector. Then the stats collector can accumulate the counters per process type if necessary. If we adopt this approach, we can easily extend pg_stat_wal so that any fields can be reported per process type. I'll remove the above source code because these counters are not useful. On 2020-10-30 12:00, Fujii Masao wrote: On 2020/10/20 11:31, Masahiro Ikeda wrote: Hi, I think we need to add some statistics to pg_stat_wal view. Although there are some parameter related WAL, there are few statistics for tuning them. I think it's better to provide the following statistics. Please let me know your comments. 
```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]-------+------------------------------
wal_records         | 2000224
wal_fpi             | 47
wal_bytes           | 248216337
wal_buffers_full    | 20954
wal_init_file       | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time      | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time       | 0
stats_reset         | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity
- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB's performance, first, we will check the performance trends for the entire database instance. For example, if the number of wal_fpi becomes higher, users may tune "wal_compression", "checkpoint_timeout" and so on. Although users can check the above statistics via EXPLAIN, auto_explain, autovacuum and pg_stat_statements now, if users want to see the performance trends for the entire database, they must recalculate the statistics. I think it is useful to add the sum of the basic statistics.

2. WAL segment file creation
- wal_init_file: Total number of WAL segment files created.

To create a new WAL file may have an impact on the performance of a write-heavy workload generating lots of WAL. If this number is reported high, to reduce the number of this initialization, we can tune WAL-related parameters so that more "recycled" WAL files can be held.

3. Number of when WAL
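As a rough illustration of the suggestions exchanged above -- compute the per-interval WAL counters with WalUsageAccumDiff() instead of subtracting each field by hand, and seed prevWalUsage from pgWalUsage before the first report -- pgstat_send_wal() might end up looking something like this. The WalStats message fields (m_wal_records and friends) come from the patch under discussion, and the prevWalUsageValid flag is purely illustrative:

```
/* Sketch only -- based on the review comments in this thread, not the final patch. */
static WalUsage prevWalUsage;			/* snapshot taken at the previous send */
static bool		prevWalUsageValid = false;

static void
pgstat_send_wal(void)
{
	WalUsage	walusage;

	/* Nothing to diff against on the very first call. */
	if (!prevWalUsageValid)
	{
		prevWalUsage = pgWalUsage;
		prevWalUsageValid = true;
	}

	/*
	 * WAL usage generated since the previous report, i.e.
	 * (pgWalUsage - prevWalUsage), computed with the existing helper
	 * rather than field-by-field subtraction.
	 */
	memset(&walusage, 0, sizeof(walusage));
	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

	WalStats.m_wal_records = walusage.wal_records;
	WalStats.m_wal_fpi = walusage.wal_fpi;
	WalStats.m_wal_bytes = walusage.wal_bytes;

	pgstat_send(&WalStats, sizeof(WalStats));

	/* Remember the current numbers for the next cycle. */
	prevWalUsage = pgWalUsage;
}
```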
Re: abstract Unix-domain sockets
On Wed, Nov 11, 2020 at 01:39:17PM +0100, Peter Eisentraut wrote: > Thinking about it further, I think the hint in the Unix-domain socket case > is bogus. A socket in the file-system namespace never reports EADDRINUSE > anyway, it just overwrites the file. For sockets in the abstract namespace, > you can get this error, but of course there is no file to remove. > > Perhaps we should change the hint in both the Unix and the IP cases to: > > "Is another postmaster already running at this address?" > (This also resolves the confusing reference to "port" in the Unix case.) Er, it is perfectly possible for two postmasters to use the same unix socket path, abstract or not, as long as they listen to different ports (all nodes in a single TAP test do that for example). So we should keep a reference to the port used in the log message, no? > Or we just drop the hint in the Unix case. The primary error message is > clear enough. Dropping the hint for the abstract case sounds fine to me. -- Michael
Re: Clean up optional rules in grammar
On 2020-11-11 11:54, Heikki Linnakangas wrote: On 11/11/2020 11:12, Peter Eisentraut wrote: The attached patch cleans this up to make them all look like the first style. +1 done
Re: [HACKERS] logical decoding of two-phase transactions
The subscriber tests are updated to include test cases for "cascading" pub/sub scenarios. i.e. NODE_A publisher => subscriber NODE_B publisher => subscriber NODE_C. PSA only the modified v20-0006 patch (the other 5 patches remain unchanged) Kind Regards, Peter Smith. Fujitsu Australia. v20-0006-Support-2PC-txn-subscriber-tests.patch Description: Binary data
Re: In-place persistance change of a relation
At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost wrote in > Greetings, > > * Kyotaro Horiguchi (horikyota@gmail.com) wrote: > > At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost > > wrote in > > > * Kyotaro Horiguchi (horikyota@gmail.com) wrote: > > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place > > > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some > > > > trials of several ways, I drifted to the following way after poking > > > > several ways. > > > > > > > > 1. Flip BM_PERMANENT of active buffers > > > > 2. adding/removing init fork > > > > 3. sync files, > > > > 4. Flip pg_class.relpersistence. > > > > > > > > It always skips table copy in the SET UNLOGGED case, and only when > > > > wal_level=minimal in the SET LOGGED case. Crash recovery seems > > > > working by some brief testing by hand. > > > > > > Somehow missed that this patch more-or-less does what I was referring to > > > down-thread, but I did want to mention that it looks like it's missing a > > > necessary FlushRelationBuffers() call before the sync, otherwise there > > > could be dirty buffers for the relation that's being set to LOGGED (with > > > wal_level=minimal), which wouldn't be good. See the comments above > > > smgrimmedsync(). > > > > Right. Thanks. However, since SetRelFileNodeBuffersPersistence() > > called just above scans shared buffers so I don't want to just call > > FlushRelationBuffers() separately. Instead, I added buffer-flush to > > SetRelFileNodeBuffersPersistence(). > > Maybe I'm missing something, but it sure looks like in the patch that > SetRelFileNodeBuffersPersistence() is being called after the > smgrimmedsync() call, and I don't think you get to just switch the order > of those- the sync is telling the kernel to make sure it's written to > disk, while the FlushBuffer() is just writing it into the kernel but > doesn't provide any guarantee that the data has actually made it to > disk. We have to FlushBuffer() first, and then call smgrimmedsync(). > Perhaps there's a way to avoid having to go through shared buffers > twice, and I generally agreed it'd be good if we could avoid doing so, > but this approach doesn't look like it actually works. Yeah, sorry for the rare-baked version.. I was confused about the order at the time. The next version works like this: LOGGED->UNLOGGED for each relations: minimal> if it is index call ambuildempty() (which syncs the init fork) else WAL-log smgr_create then sync the init file. ... commit time: abort time: UNLOGGED->LOGGED for each relations: minimal> ... commit time: abort time: > > FWIW this is a revised version of the PoC, which has some known > > problems. > > > > - Flipping of Buffer persistence is not WAL-logged nor even be able to > > be safely roll-backed. (It might be better to drop buffers). > > Not sure if it'd be better to drop buffers or not, but figuring out how > to deal with rollback seems pretty important. How is the persistence > change in the catalog not WAL-logged though..? Rollback works as the above. Buffer persistence change is registered in pending-deletes. Persistence change in catalog is rolled back in the ordinary way (or automatically). If wal_level > minimal, persistence change of buffers is propagated to standbys by WAL. However I'm not sure we need wal-logging otherwise, the next version emits WAL since SMGR_CREATE is always logged by existing code. > > - This version handles indexes but not yet handle toast relatins. > > Would need to be fixed, of course. Fixed. 
> > - tableAMs are supposed to support this feature. (but I'm not sure > > it's worth allowing them not to do so). > > Seems like they should. Init fork of index relations needs a call to ambuildempty() instead of "log_smgrcreate-smgrimmedsync" after smgrcreate. Instead of adding similar interface in indexAm, I reverted changes of tableam and make RelationCreate/DropInitFork() directly do that. That introduces new include of amapi.h to storage.c, which is a bit uneasy. The previous version give up the in-place persistence change in the case where wal_level > minimal and SET LOGGED since that needs WAL to be emitted. However, in-place change still has the advantage of not running a table copy. So the next verson always runs persistence change in-place. As suggested by Andres, I'll send a summary of this patch. The patch will be attached to the coming mail. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
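A schematic of the ordering Stephen insists on for the wal_level=minimal SET LOGGED case -- push dirty buffers into the kernel first, and only then ask the kernel to force them to disk -- might look like the sketch below. SetRelFileNodeBuffersPersistence() is the patch's function; its argument list and everything else here are illustrative only, not the actual patch code:

```
/*
 * Illustrative ordering only, not the actual patch.
 * UNLOGGED -> LOGGED with wal_level = minimal.
 */
static void
persistence_change_sketch(Relation rel)
{
	/*
	 * Scan shared buffers, flipping BM_PERMANENT for this relation and
	 * writing out any dirty buffers on the way (the FlushBuffer() part
	 * mentioned above).  This only gets the pages into the kernel.
	 */
	SetRelFileNodeBuffersPersistence(rel->rd_node, true /* make permanent */ );

	/*
	 * Only now may we ask the kernel to force the file contents to disk;
	 * doing smgrimmedsync() before the flush would sync stale data.
	 */
	RelationOpenSmgr(rel);
	smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);

	/* Finally flip pg_class.relpersistence (rolled back on abort). */
}
```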
Re: In-place persistance change of a relation
At Wed, 11 Nov 2020 14:18:04 -0800, Andres Freund wrote in > Hi, > > I suggest outlining what you are trying to achieve here. Starting a new > thread and expecting people to dig through another thread to infer what > you are actually trying to achive isn't great. Agreed. I'll post that. Thanks. > FWIW, I'm *extremely* doubtful it's worth adding features that depend on > a PGC_POSTMASTER wal_level=minimal being used. Which this does, a far as > I understand. If somebody added support for dynamically adapting > wal_level (e.g. wal_level=auto, that increases wal_level to > replica/logical depending on the presence of replication slots), it'd > perhaps be different. Yes, this depends on wal_level=minimal for switching from UNLOGGED to LOGGED, that's similar to COPY/INSERT-to-intransaction-created-tables optimization for wal_level=minimal. And it expands that optimization to COPY/INSERT-to-existent-tables, which seems worth doing. Switching to LOGGED needs to emit the initial state to WAL... Hmm.. I came to think that even in that case skipping table copy reduces I/O significantly, even though FPI-WAL is emitted. > On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote: > > FWIW this is a revised version of the PoC, which has some known > > problems. > > > > - Flipping of Buffer persistence is not WAL-logged nor even be able to > > be safely roll-backed. (It might be better to drop buffers). > > That's obviously a no-go. I think you might be able to address this if > you accept that the command cannot be run in a transaction (like > CONCURRENTLY). Then you can first do the catalog changes, change the > persistence level, and commit. Of course. The next version reverts persistence change at abort. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center
RE: Detecting File Damage & Inconsistencies
From: Simon Riggs > I would like to propose a few points that will help us detect file > damage, inconsistencies in files and track actions of users. Hello, Simon san. Long time no see. I'm happy to see you be back here recently. What kind of improvement do you expect? What problems would this make detectable? > * 2-byte pid (from MyProcPid) pid is 4 bytes on Windows. Isn't it also 4 byte on Linux when some kernel parameter is set to a certain value? Regards Takayuki Tsunakawa
Re: logical streaming of xacts via test_decoding is broken
On Thu, 12 Nov 2020 at 11:37 AM, Amit Kapila wrote: > On Thu, Nov 12, 2020 at 11:29 AM Dilip Kumar > wrote: > > > > On Thu, Nov 12, 2020 at 8:45 AM Amit Kapila > wrote: > > > > > > > > Another thing I am thinking let's just not expose skip_empty_stream to > > > the user and consider the behavior based on the value of > > > skip_empty_xacts. Internally, in the code, we can still have different > > > variables to distinguish between empty_xacts and empty_streams. > > > > Yeah, even I think in most of the cases it makes more sense to have > > skip_empty_xacts and skip_empty_stream similar values. So better we > > don't expose skip_empty_stream. I agree that we need to keep two > > variables to track the empty stream and empty xacts. > > > > So, let's try to do this way and if we see any problems then we can > re-think. Sounds good to me, I will send the updated patch. > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Feature request: Improve allowed values for generate series
On Wed, Nov 11, 2020 at 7:54 PM Eugen Konkov wrote: > Hello David, > > I have a table with services, each service have a period. After which > service is auto renewal > > Services also could be one-time. At this case its interval is '00:00:00' > In which case the concept of interval is undefined - there is no meaningful "second date" here, just the one expiration date - yet you are choosing to keep it in order to introduce an artificial similarity between one-time service and auto-renewal service. This use case isn't convincing for me. Writing the one-time service query without generate_series leaves out extraneous stuff that isn't important, which I would recommend even if generate_series were to work as described. If you are going to introduce code-specific stuff to make this work just write: SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL '1000 years' ); It is just as much a construction of code as the other. David J.
Re: logical streaming of xacts via test_decoding is broken
On Thu, Nov 12, 2020 at 11:29 AM Dilip Kumar wrote: > > On Thu, Nov 12, 2020 at 8:45 AM Amit Kapila wrote: > > > > > Another thing I am thinking let's just not expose skip_empty_stream to > > the user and consider the behavior based on the value of > > skip_empty_xacts. Internally, in the code, we can still have different > > variables to distinguish between empty_xacts and empty_streams. > > Yeah, even I think in most of the cases it makes more sense to have > skip_empty_xacts and skip_empty_stream similar values. So better we > don't expose skip_empty_stream. I agree that we need to keep two > variables to track the empty stream and empty xacts. > So, let's try to do this way and if we see any problems then we can re-think. -- With Regards, Amit Kapila.
Re: logical streaming of xacts via test_decoding is broken
On Thu, Nov 12, 2020 at 8:45 AM Amit Kapila wrote: > > On Wed, Nov 11, 2020 at 7:05 PM Dilip Kumar wrote: > > > > On Wed, Nov 11, 2020 at 6:59 PM Amit Kapila wrote: > > > > > > On Wed, Nov 11, 2020 at 10:00 AM Dilip Kumar > > > wrote: > > > > > > > > On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila > > > > wrote: > > > > > > > > > > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar > > > > > wrote: > > > > > > > > > > > > > > > > You can probably add a comment as to why we are disallowing this case. > > > > > I thought of considering 'stream-changes' parameter here because it > > > > > won't make sense to give this parameter without it, however, it seems > > > > > that is not necessary but maybe adding a comment > > > > > here in that regard would be a good idea. > > > > > > > > Should we also consider the case that if the user just passed > > > > skip_empty_streams to true then we should automatically set > > > > skip_empty_xacts to true? > > > > > > > > > > Is there any problem if we don't do this? Actually, adding > > > dependencies on parameters is confusing so I want to avoid that unless > > > it is really required. > > > > > > > And we will allow the 'skip_empty_streams' parameter only if > > > > stream-changes' is true. > > > > The reason behind this thought is that if the user doesn't pass any > > value for 'skip_empty_xacts' then the default value will be false and > > if the user only pass 'skip_empty_streams' to true then we will error > > out assuming that skip_empty_xacts is false but skip_empty_streams is > > true. So it seems instead of error out we can assume that > > skip_empty_streams true mean skip_empty_xacts is also true if nothing > > is passed for that. > > > > So, let's see the overall picture here. We can have four options: > skip_empty_xacts = true, skip_empty_stream = false; > skip_empty_xacts = true, skip_empty_stream = true; > skip_empty_xacts = false, skip_empty_stream = false; > skip_empty_xacts = false, skip_empty_stream = true; > > I think we want to say the first three could be supported and for the > last one either we can either give an error or make its behavior > similar to option-2? Is this what your understanding as well? For the last one if the user has specifically passed false for the skip_empty_xacts then error and if the user did not pass anything for skip_empty_xacts then make its behavior similar to option-2. > Another thing I am thinking let's just not expose skip_empty_stream to > the user and consider the behavior based on the value of > skip_empty_xacts. Internally, in the code, we can still have different > variables to distinguish between empty_xacts and empty_streams. Yeah, even I think in most of the cases it makes more sense to have skip_empty_xacts and skip_empty_stream similar values. So better we don't expose skip_empty_stream. I agree that we need to keep two variables to track the empty stream and empty xacts. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
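A minimal sketch of the direction agreed here -- expose only skip_empty_xacts and track emptiness with two internal flags, so that an empty transaction and an empty streamed block are both suppressed under the same option. All names below are illustrative rather than taken from the eventual patch:

```
/* Illustrative only -- not the actual test_decoding change. */
typedef struct SketchTxnState
{
	bool		xact_wrote_changes;		/* any change emitted for this xact? */
	bool		stream_wrote_changes;	/* any change emitted in the current stream? */
} SketchTxnState;

static void
sketch_before_emitting_change(bool skip_empty_xacts, bool in_streaming_block,
							  SketchTxnState *state)
{
	if (skip_empty_xacts && !state->xact_wrote_changes)
	{
		/* First change of the transaction: emit the deferred "BEGIN" now. */
	}
	if (skip_empty_xacts && in_streaming_block && !state->stream_wrote_changes)
	{
		/*
		 * First change of this stream: emit the deferred "opening a streamed
		 * block" message now.  A stream with no changes therefore produces no
		 * output at all, mirroring the empty-transaction behaviour.
		 */
	}

	state->xact_wrote_changes = true;
	if (in_streaming_block)
		state->stream_wrote_changes = true;
}
```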
Re: Add statistics to pg_stat_wal view for wal related parameter tuning
On 2020/11/06 10:25, Masahiro Ikeda wrote: On 2020-10-30 11:50, Fujii Masao wrote: On 2020/10/29 17:03, Masahiro Ikeda wrote: Hi, Thanks for your comments and advice. I updated the patch. On 2020-10-21 18:03, Kyotaro Horiguchi wrote: At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda wrote in On 2020-10-20 12:46, Amit Kapila wrote: > I see that we also need to add extra code to capture these stats (some > of which is in performance-critical path especially in > XLogInsertRecord) which again makes me a bit uncomfortable. It might > be that it is all fine as it is very important to collect these stats > at cluster-level in spite that the same information can be gathered at > statement-level to help customers but I don't see a very strong case > for that in your proposal. We should avoid that duplication as possible even if the both number are important. Also about performance, I thought there are few impacts because it increments stats in memory. If I can implement to reuse pgWalUsage's value which already collects these stats, there is no impact in XLogInsertRecord. For example, how about pg_stat_wal() calculates the accumulated value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's value? I don't think that works, but it would work that pgstat_send_wal() takes the difference of that values between two successive calls. WalUsage prevWalUsage; ... pgstat_send_wal() { .. /* fill in some values using pgWalUsage */ WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes; WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records; WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi; ... pgstat_send(&WalStats, sizeof(WalStats)); /* remember the current numbers */ prevWalUsage = pgWalUsage; Thanks for Horiguchi-san's advice, I changed to reuse pgWalUsage which is already defined and eliminates the extra overhead. + /* fill in some values using pgWalUsage */ + WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes; + WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records; + WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi; It's better to use WalUsageAccumDiff() here? Yes, thanks. I fixed it. prevWalUsage needs to be initialized with pgWalUsage? + if (AmWalWriterProcess()){ + WalStats.m_wal_write_walwriter++; + } + else + { + WalStats.m_wal_write_backend++; + } I think that it's better not to separate m_wal_write_xxx into two for walwriter and other processes. Instead, we can use one m_wal_write_xxx counter and make pgstat_send_wal() send also the process type to the stats collector. Then the stats collector can accumulate the counters per process type if necessary. If we adopt this approach, we can easily extend pg_stat_wal so that any fields can be reported per process type. I'll remove the above source code because these counters are not useful. On 2020-10-30 12:00, Fujii Masao wrote: On 2020/10/20 11:31, Masahiro Ikeda wrote: Hi, I think we need to add some statistics to pg_stat_wal view. Although there are some parameter related WAL, there are few statistics for tuning them. I think it's better to provide the following statistics. Please let me know your comments. 
``` postgres=# SELECT * from pg_stat_wal; -[ RECORD 1 ]---+-- wal_records | 2000224 wal_fpi | 47 wal_bytes | 248216337 wal_buffers_full | 20954 wal_init_file | 8 wal_write_backend | 20960 wal_write_walwriter | 46 wal_write_time | 51 wal_sync_backend | 7 wal_sync_walwriter | 8 wal_sync_time | 0 stats_reset | 2020-10-20 11:04:51.307771+09 ``` 1. Basic statistics of WAL activity - wal_records: Total number of WAL records generated - wal_fpi: Total number of WAL full page images generated - wal_bytes: Total amount of WAL bytes generated To understand DB's performance, first, we will check the performance trends for the entire database instance. For example, if the number of wal_fpi becomes higher, users may tune "wal_compression", "checkpoint_timeout" and so on. Although users can check the above statistics via EXPLAIN, auto_explain, autovacuum and pg_stat_statements now, if users want to see the performance trends for the entire database, they must recalculate the statistics. I think it is useful to add the sum of the basic statistics. 2. WAL segment file creation - wal_init_file: Total number of WAL segment files created. To create a new WAL file may have an impact on the performance of a write-heavy workload generating lots of WAL. If this number is reported high, to reduce the number of this initialization, we can tune WAL-related parameters so that more "recycled" WAL files can be held. 3. Number of when WAL is flushed - wal_write_backend : Total num
Re: Proposition for autoname columns
On Wed, Nov 11, 2020 at 5:56 PM Bruce Momjian wrote: > On Thu, Nov 12, 2020 at 12:18:49AM +, Dagfinn Ilmari Mannsåker wrote: > > Bruce Momjian writes: > > > I think we could do it, but it would only work if the column was output > > > as a single json value, and not a multi-key/value field. I am afraid > if > > > we tried to do it, the result would be too inconsistent to be useful. > > > > Could this be done via the support function, so that the top-level > > operator/function in each select list item can return a suggested column > > name if the relevant arguments are constants? > > Yes, the user explicitly calling a function would be much easier to > predict. > > For the user an operator and a function are different ways to invoke the same underlying thing using different syntax. I'm not seeing how this syntax difference makes this any easier to implement for explicit function invocation compared to operator function invocation. David J.
Re: Is it useful to record whether plans are generic or custom?
čt 12. 11. 2020 v 2:50 odesílatel torikoshia napsal: > On 2020-09-29 02:39, legrand legrand wrote: > > Hi Atsushi, > > > > +1: Your proposal is a good answer for time based performance analysis > > (even if parsing durationor blks are not differentiated) . > > > > As it makes pgss number of columns wilder, may be an other solution > > would be to create a pg_stat_statements_xxx view with the same key > > as pgss (dbid,userid,queryid) and all thoses new counters. > > Thanks for your ideas and sorry for my late reply. > > It seems creating pg_stat_statements_xxx views both for generic and > custom plans is better than my PoC patch. > > However, I also began to wonder how effective it would be to just > distinguish between generic and custom plans. Custom plans can > include all sorts of plans. and thinking cache validation, generic > plans can also include various plans. > > Considering this, I'm starting to feel that it would be better to > not just keeping whether generic or cutom but the plan itself as > discussed in the below thread. > > > https://www.postgresql.org/message-id/flat/CAKU4AWq5_jx1Vyai0_Sumgn-Ks0R%2BN80cf%2Bt170%2BzQs8x6%3DHew%40mail.gmail.com#f57e64b8d37697c808e4385009340871 > > > Any thoughts? > yes, the plan self is very interesting information - and information if plan was generic or not is interesting too. It is other dimension of query - maybe there can be rule - for any query store max 100 most slows plans with all attributes. The next issue is fact so first first 5 execution of generic plans are not generic in real. This fact should be visible too. Regards Pavel > > Regards, > > -- > Atsushi Torikoshi > > >
Re: Detecting File Damage & Inconsistencies
On 11/11/20 21:56, Simon Riggs wrote: [ŝnip] REINDEX VERIFY After the new index is created, but before we drop the old index: Check whether the two indexes match: * checks whether the previous index had pointers to row versions that don't exist * checks whether the heap has rows that were not in the old index This approach piggybacks on existing operations. AccessShareLock is held on both indexes before the old one is dropped. FWIW, as long as it's optional (due to the added runtime), it'd be a welcome feature. Maybe something along the lines of: REINDEX (verify yes) Other ideas are welcome. I still have nightmares from an specific customer case w/ shared storage (using VxFS) among two postmaster instances ---supposedly could never be active concurrently, not completely sure that it didn't actually happen--- and the corruption that we found there. I seem to remember that they even had scripts to remove the locking when switching over and back :S I don't think Postgres can do much about this, but maybe someone can come up with a suitable countermeasure. Just my .02€ Thanks, / J.L.
Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module
On 2020/11/10 21:30, Bharath Rupireddy wrote: On Tue, Nov 10, 2020 at 3:04 PM Fujii Masao wrote: The main reason for having SetLatch() in SignalHandlerForConfigReload() is to wake up the calling process if waiting in WaitLatchOrSocket() or WaitLatch() and reload the new config file and use the reloaded config variables. Maybe we should give a thought on the scenarios in which the walreceiver process waits, and what happens in case the latch is set when SIGHUP is received. The difference is whether the config file is processed at the next wakeup (by force-reply-request or SIGTERM) of walreceiver or immediately. If server-reload happened frequently, say, several times per second(?), we should consider to reduce the useless reloading, but actually that's not the case. So, attached is the patch that makes walreceiver use both standard SIGTERM and SIGHUP handlers. Currently I've not found any actual issues by making walreceiver use standard SIGHUP handler, yet. I think it makes sense to replace WalRcv->latch with MyProc->procLatch(as they point to the same latch) in the functions that are called in the walreceiver process. However, we are using walrcv->latch with spinlock, say in WalRcvForceReply() and RequestXLogStreaming() both are called from the startup process. See commit 45f9d08684, that made the access to the walreceiver's latch(WalRcv->latch) by the other process(startup) spinlock protected And looks like, in general it's a standard practice to set latch to wake up the process if waiting in case a SIGHUP signal is received and reload the relevant config variables. Going by the above analysis, the v3 patch looks good to me. Thanks for the analysis! I pushed the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Background writer and checkpointer in crash recovery
On Wed, Nov 11, 2020 at 9:57 PM Simon Riggs wrote: > Having said that, we did raise the checkpoint_timeout by a lot, so the > situation today might be quite different. A large checkpoint_timeout > could eventually overflow shared buffers, with the right workload. FWIW Jakuk Wartak did manage to show a 1.64x speedup while running crash recovery of an insert-only workload (with a variant of this patch that I shared in another thread), albeit with aggressive tuning: https://www.postgresql.org/message-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520%40VI1PR0701MB6960.eurprd07.prod.outlook.com > We don't have any stats to show whether this patch is worthwhile or > not, so I suggest adding the attached instrumentation patch as well so > we can see on production systems whether checkpoint_timeout is too > high by comparison with pg_stat_bgwriter. The patch is written in the > style of log_checkpoints. Very useful. I've also been wondering how to get that sort of information in hot standby.
RE: [Patch] Optimize dropping of relation buffers using dlist
The patch looks OK. I think, as Thomas-san suggested, we can remove the modification to smgrnblocks() and not care whether the size is cached or not. But I think the current patch is good too, so I'd like to leave it up to a committer to decide which to choose.

I measured performance from a different angle -- the time DropRelFileNodeBuffers() and DropRelFileNodeAllBuffers() took. That reveals the direct improvement and degradation. I used 1,000 tables, each of which is 1 MB. I used shared_buffers = 128 MB for the case where the traditional full buffer scan is done, and shared_buffers = 100 GB for the case where the optimization path takes effect. The results are almost all good, as follows:

A. UNPATCHED
  128 MB shared_buffers
    1. VACUUM   = 0.04 seconds
    2. TRUNCATE = 0.04 seconds
  100 GB shared_buffers
    3. VACUUM   = 69.4 seconds
    4. TRUNCATE = 69.1 seconds

B. PATCHED
  128 MB shared_buffers (full scan)
    5. VACUUM   = 0.04 seconds
    6. TRUNCATE = 0.07 seconds
  100 GB shared_buffers (optimized path)
    7. VACUUM   = 0.02 seconds
    8. TRUNCATE = 0.08 seconds

So, I'd like to mark this as ready for committer.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Tuesday, November 10, 2020 3:10 PM, Tsunakawa-san wrote: > From: Jamison, Kirk/ジャミソン カーク > > So I proceeded to update the patches using the "cached" parameter and > > updated the corresponding comments to it in 0002. > > OK, I'm in favor of the name "cached" now, although I first agreed with > Horiguchi-san in that it's better to use a name that represents the nature > (accurate) of information rather than the implementation (cached). Having > a second thought, since smgr is a component that manages relation files on > storage (file system), lseek(SEEK_END) is the accurate value for smgr. The > cached value holds a possibly stale size up to which the relation has > extended. > > > The patch looks almost good except for the minor ones: Thank you for the review! > (1) > +extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber > forknum, > +bool *accurate); > > It's still accurate here. Already fixed in 0002. > (2) > + * This is only called in recovery when the block count of any > fork is > + * cached and the total number of to-be-invalidated blocks per > relation > > count of any fork is > -> counts of all forks are Fixed in 0003/ > (3) > In 0004, I thought you would add the invalidated block counts of all relations > to determine if the optimization is done, as Horiguchi-san suggested. But I > find the current patch okay too. Yeah, I found my approach easier to implement. The new change in 0004 is that when entering the optimized path we now call FindAndDropRelFileNodeBuffers() instead of DropRelFileNodeBuffers(). I have attached all the updated patches. I'd appreciate your feedback. Regards, Kirk Jamison v31-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch Description: v31-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch v31-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch Description: v31-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch v31-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch Description: v31-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch v31-0004-TRUNCATE-optimization.patch Description: v31-0004-TRUNCATE-optimization.patch
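To summarize the shape of the optimization the v31 patches implement, as described in this thread (this is a condensed sketch, not code copied from the patches): during recovery, if smgrnblocks() can hand back a cached relation size for every fork and the total number of blocks to invalidate is small, each candidate block is looked up individually instead of scanning all of shared buffers. The threshold name and the argument lists of the patch functions below are placeholders:

```
/*
 * Sketch of the recovery-time optimization, with placeholder names
 * (BUF_DROP_THRESHOLD, argument lists of the patch functions, etc.).
 */
#define BUF_DROP_THRESHOLD	(NBuffers / 512)	/* made-up heuristic */

static void
drop_relfilenode_buffers_sketch(SMgrRelation smgr, ForkNumber *forks,
								int nforks, BlockNumber *firstDelBlock)
{
	BlockNumber nBlocksToInvalidate = 0;
	bool		cached = InRecovery;	/* cached sizes are only trusted here */
	int			i;

	for (i = 0; cached && i < nforks; i++)
	{
		bool		forkcached;
		BlockNumber nblocks = smgrnblocks(smgr, forks[i], &forkcached);

		if (forkcached)
			nBlocksToInvalidate += nblocks - firstDelBlock[i];
		else
			cached = false;		/* unknown size: cannot bound the work */
	}

	if (cached && nBlocksToInvalidate < BUF_DROP_THRESHOLD)
	{
		/*
		 * Few enough blocks: probe the buffer mapping table for each
		 * to-be-invalidated block instead of scanning every buffer.
		 */
		for (i = 0; i < nforks; i++)
			FindAndDropRelFileNodeBuffers(smgr->smgr_rnode.node, forks[i],
										  firstDelBlock[i]);
		return;
	}

	/* Otherwise fall back to the traditional full scan of shared buffers. */
}
```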
Re: Strange GiST logic leading to uninitialized memory access in pg_trgm gist code
Hi! On Wed, Nov 11, 2020 at 12:53 PM Andrew Gierth wrote: > Now the obvious simple fix is just to reorder those last two operations, > and the original reporter verified that doing so fixed their problem > (patch attached). But I'd really like to understand the logic here and > whether there is any reason to have this special treatment at all - why > would it not be better to just cache all N items upfront and consider > them all as potential seeds? I think this comes from the idea that when N items are passed to the picksplit method, then the first N-1 are existing items on the page, while the last Nth is the new item to be inserted. So, we are trying to split first N-1 items and then insert the last Nth item there. But this is wrong for two reasons. 1) As you've pointed out, GiST code doesn't necessarily pass items to the picksplit method in that way. 2) Even if items are passed as assumed, there is no point in having special handling of the item to be inserted. It's better to consider the whole set of items to produce a better split. > Another issue I don't understand yet is that even though this code is > largely unchanged since 8.x, the original reporter could not reproduce > the crash on any version before 13.0. I think this is related to my commit 911e702077. It has changed the memory allocation for the signatures to support the signatures of variable length. So, it seems that despite the error existing since 8.x, it started causing segfaults only since 911e702077. > Anyone have any ideas? (If not, I'll commit and backpatch something like > the attached patch at some suitable time.) I would rather propose to rip off special handling of the last item completely (see the attached patch). -- Regards, Alexander Korotkov 0001-Fix-handling-of-the-last-item-in-gtrgm_picksplit.patch Description: Binary data
Re: Delay of standby shutdown
On 2020/11/04 9:35, Soumyadeep Chakraborty wrote: Hello Fujii, On Thu, Sep 17, 2020 at 6:49 AM Fujii Masao wrote: As far as I understand after debugging, the problem is as follows: Yes. 1. After the primary is stopped, and the standby startup process is waiting inside: (void) WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, wait_time, WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL); it receives SIGTERM on account of stopping the standby and it leads to the WaitLatch call returning with WL_LATCH_SET. 2. WaitForWALToBecomeAvailable() then will return true after calling XLogFileReadAnyTLI() and eventually, XLogReadRecord() will return NULL since there is no new WAL to read, which means ReadRecord() will loop back and perform another XLogReadRecord(). 3. The additional XLogReadRecord() will lead to a 5s wait inside WaitForWALToBecomeAvailable() - a different WaitLatch() call this time: /* * Wait for more WAL to arrive. Time out after 5 seconds * to react to a trigger file promptly and to check if the * WAL receiver is still active. */ (void) WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 5000L, WAIT_EVENT_RECOVERY_WAL_STREAM); 4. And then eventually, the code will handle interrupts here inside WaitForWALToBecomeAvailable() after the 5s wait: /* * This possibly-long loop needs to handle interrupts of startup * process. */ HandleStartupProcInterrupts(); To avoid this issue, I think that ReadRecord() should call HandleStartupProcInterrupts() whenever looping back to retry. Alternatively, perhaps we can place it inside WaitForWALToBecomeAvailable() (which already has a similar call), since it is more semantically aligned to checking for interrupts, rather than ReadRecord()? Like this: Yes. Your approach looks better to me. Attached is the updated version of the patch implementing that. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index aa63f37615..7d97b96e72 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12225,6 +12225,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL); ResetLatch(MyLatch); now = GetCurrentTimestamp(); + + /* Handle interrupt signals of startup process */ + HandleStartupProcInterrupts(); } last_fail_time = now; currentSource = XLOG_FROM_ARCHIVE;
Re: logical streaming of xacts via test_decoding is broken
On Wed, Nov 11, 2020 at 7:05 PM Dilip Kumar wrote: > > On Wed, Nov 11, 2020 at 6:59 PM Amit Kapila wrote: > > > > On Wed, Nov 11, 2020 at 10:00 AM Dilip Kumar wrote: > > > > > > On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila > > > wrote: > > > > > > > > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar > > > > wrote: > > > > > > > > > > > > > You can probably add a comment as to why we are disallowing this case. > > > > I thought of considering 'stream-changes' parameter here because it > > > > won't make sense to give this parameter without it, however, it seems > > > > that is not necessary but maybe adding a comment > > > > here in that regard would be a good idea. > > > > > > Should we also consider the case that if the user just passed > > > skip_empty_streams to true then we should automatically set > > > skip_empty_xacts to true? > > > > > > > Is there any problem if we don't do this? Actually, adding > > dependencies on parameters is confusing so I want to avoid that unless > > it is really required. > > > > > And we will allow the 'skip_empty_streams' parameter only if > > > stream-changes' is true. > > The reason behind this thought is that if the user doesn't pass any > value for 'skip_empty_xacts' then the default value will be false and > if the user only pass 'skip_empty_streams' to true then we will error > out assuming that skip_empty_xacts is false but skip_empty_streams is > true. So it seems instead of error out we can assume that > skip_empty_streams true mean skip_empty_xacts is also true if nothing > is passed for that. > So, let's see the overall picture here. We can have four options: skip_empty_xacts = true, skip_empty_stream = false; skip_empty_xacts = true, skip_empty_stream = true; skip_empty_xacts = false, skip_empty_stream = false; skip_empty_xacts = false, skip_empty_stream = true; I think we want to say the first three could be supported and for the last one either we can either give an error or make its behavior similar to option-2? Is this what your understanding as well? Another thing I am thinking let's just not expose skip_empty_stream to the user and consider the behavior based on the value of skip_empty_xacts. Internally, in the code, we can still have different variables to distinguish between empty_xacts and empty_streams. -- With Regards, Amit Kapila.
RE: Disable WAL logging to speed up data loading
From: Stephen Frost > I'm not sure that I see this as really being much of an issue. Perhaps there > are > some things we can do, as I mentioned before, to make it easier for users to > have tables be created as unlogged from the start, or to be able to ALTER > TABLE a bunch of tables at once (using all in tablespace, or maybe having an > ALTER TABLE on a partitioned table cascade to the partitions), but overall the > risk here seems very low- clearly whatever processing is running to load the > data into a particular table knows what the table is and adding an ALTER > TABLE into it would be trivial. Thank you for your opinion. I'm glad to see great people like Robert, Magnus and you have given comments (honestly, I didn't expect this trivial feature wouldn't get attention.) I'll read those mails from the oldest including this, and reply to them if needed. > > Likewise, can't we do ALTER TABLE SET UNLOGGED/LOGGED against a > partitioned table? Currently, the statement succeeds but none of the > partitioned table nor its partitions is set unlogged (pg_class.relpersistence > remains 'p'). Is this intended? If it's a bug, I'm willing to fix it so > that it reports > an eror. Of course, it's good to make all partitions unlogged at once. > > I agree that this doesn't seem quite right and considering the way other > commands work like CREATE INDEX, I would think that doing such an ALTER > TABLE would recurse to the individual partitions (skipping over any which are > already set to the persistance desired..). OK, I'll try to propagate the alteration to partitions. (I wish the fix will be easy, but I'm afraid some devil is lurking...) I'd appreciate it if someone could kindly point out the devil if he/she knows the past history. Regards Takayuki Tsunakawa
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Tue, 10 Nov 2020 at 12:55, David Rowley wrote: > > On Tue, 10 Nov 2020 at 12:49, Tom Lane wrote: > > > > Alvaro Herrera writes: > > > Are you taking into account the possibility that generated machine code > > > is a small percent slower out of mere bad luck? I remember someone > > > suggesting that they can make code 2% faster or so by inserting random > > > no-op instructions in the binary, or something like that. So if the > > > difference between v8 and v9 is that small, then it might be due to this > > > kind of effect. > > > > Yeah. I believe what this arises from is good or bad luck about relevant > > tight loops falling within or across cache lines, and that sort of thing. > > We've definitely seen performance changes up to a couple percent with > > no apparent change to the relevant code. > > It possibly is this issue. > > Normally how I build up my confidence in which is faster is why just > rebasing on master as it advances and see if the winner ever changes. > The theory here is if one patch is consistently the fastest, then > there's more chance if there being a genuine reason for it. I kicked off a script last night that ran benchmarks on master, v8 and v9 of the patch on 1 commit per day for the past 30 days since yesterday. The idea here is that as the code changes that if the performance differences are due to code alignment then there should be enough churn in 30 days to show if this is the case. The quickly put together script is attached. It would need quite a bit of modification to run on someone else's machine. This took about 20 hours to run. I found that v8 is faster on 28 out of 30 commits. In the two cases where v9 was faster, v9 took 99.8% and 98.5% of the time of v8. In the 28 cases where v8 was faster it was generally about 2-4% faster, but a couple of times 8-10% faster. Full results attached in .csv file. Also the query I ran to compare the results once loaded into Postgres. 
David #!/bin/bash seconds=30 scale=10 pg_ctl stop -D pgdata -w for sha in e578c17d8 ae0f7b11f b94109ce3 7f4708818 bc49f8780 540849814 929c69aa1 f49b85d78 bbb927b4d 555eb1a4f f8721bd75 83d727e5b 21d36747d ba9f18abd 20d3fe900 f893e68d7 ad1c36b07 f90149e62 b401fa206 f90e80b91 b17ff07aa 5b3dca096 bf797a8d9 113d3591b 5b7bfc397 5ee180a39 b4c9695e7 8b39345a9 8f113698b d2e4bf688 do cd ~/pg_src git reset --hard git clean -f git checkout $sha ./configure --prefix=/home/drowley/pg > /dev/null for branch in master resultcache_v8 resultcache_v9 do cd ~/pg_src git reset --hard git clean -f for file in /home/drowley/$branch/* do patch -p1 < $file done make clean -s make -j -s make install -s cd contrib/pg_prewarm make -j -s make install -s cd sleep 1 # create database and load data when doing the first branch if [ $branch = master ] then rm -rf pgdata initdb -D pgdata > /dev/null cp postgresql.conf pgdata pg_ctl start -D pgdata -l pg.log psql -c "drop table if exists hundredk, lookup1, lookup100;" postgres psql -c "create table hundredk (hundredk int, tenk int, thousand int, hundred int, ten int, one int);" postgres psql -c "insert into hundredk select x%10,x%1,x%1000,x%100,x%10,1 from generate_Series(1,$scale) x;" postgres psql -c "create table lookup100 (a int);" postgres psql -c "insert into lookup100 select x from generate_Series(1,$scale)x,generate_Series(1,100);" postgres psql -c "create index on lookup100 (a);" postgres psql -c "create table lookup1 (a int);" postgres psql -c "insert into lookup1 select x from generate_Series(1,$scale)x;" postgres psql -c "create index on lookup1 (a);" postgres psql -c "vacuum analyze lookup1, lookup100, hundredk;" postgres psql -c "create extension pg_prewarm;" postgres pg_ctl stop -D pgdata -w fi pg_ctl start -D pgdata -l pg.log -w psql -c "select pg_prewarm('lookup1'), pg_prewarm('lookup100'), pg_prewarm('hundredk');" postgres for tbl in lookup1 lookup100 do for target in "count(*)" "count(l.a)" "'*'" do for col in thousand hundred ten one do echo "select $ta
Re: enable pg_stat_statements to track rows processed by REFRESH MATERIALIZED VIEW
On 2020/11/10 17:29, Fujii Masao wrote: On 2020/11/05 23:54, Seino Yuki wrote: 2020-11-02 20:01 に Fujii Masao さんは書きました: On 2020/11/02 14:02, Yuki Seino wrote: The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed +1. I checked the patch and there were no problems. + PG_END_TRY(); + SetQueryCompletion(qc, CMDTAG_REFRESH_MATERIALIZED_VIEW, processed); Isn't it better to call SetQueryCompletion() in ExecRefreshMatView() instead of ProcessUtilitySlow() (e.g., ExecCreateTableAs() does)? Regards, Sorry. I missed it. I've incorporated your point into this patch. So the changes to "matview.h" and "utility.c" have been canceld. We also confirmed that the new patch passed the regression test. Thanks for updating the patch! + /* save the rowcount if we're given a qc to fill */ + SetQueryCompletion(qc, CMDTAG_REFRESH_MATERIALIZED_VIEW, processed); I added the check "if (qc)" into the above. I also added the following comments about that we don't display the rowcount in the command completion tag output though we save it in qc. There is the discussion related to this topic, at [1]. Thought? + * Save the rowcount so that pg_stat_statements can track the total number + * of rows processed by REFRESH MATERIALIZED VIEW command. Note that we + * still don't display the rowcount in the command completion tag output, + * i.e., the display_rowcount flag of CMDTAG_REFRESH_MATERIALIZED_VIEW + * command tag is left false in cmdtaglist.h. Otherwise, the change of + * completion tag output might break applications using it. Attached is the updated version of the patch. Barring no objection, I will commit that. Pushed. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Add docs stub for recovery.conf
On Thu, Nov 12, 2020 at 4:01 AM Bruce Momjian wrote: > On Wed, Nov 11, 2020 at 08:59:40PM +0100, Daniel Gustafsson wrote: > > > On 11 Nov 2020, at 20:44, Bruce Momjian wrote: > > > On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote: > > > > >> I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) > the docs for > > >> it were removed completely as well. That's largely sensible, but is > confusing > > >> when users have upgraded and are trying to find out what happened, or > how to > > >> configure equivalent functionality. > > > > > > I don't see the logic in carrying doc stuff that we don't have anymore. > > > > Well, we do have that already in tips sprinkled across the docs where > it > > makes sense to help transitioning users, like this one in func.sgml: > > > > "Prior to PostgreSQL 12, it was possible to skip arbitrary text in > the > > input string using non-letter or non-digit characters..." > > > > It doesn't seem like a terrible idea to do a similar one for > recovery.conf. > > I am fine with a tip. The patch looked like it was creating a new > chapter for it.

It is. Or rather, an appendix right at the end to hold info on things we removed or renamed and where to find them now. You can't AFAICS make docbook create a toplevel linkable file for a tip. A tip won't un-break https://www.postgresql.org/docs/current/recovery-config.html or help people who visit https://www.postgresql.org/docs/11/recovery-config.html figure out what's going on if they're using pg12 and there's no link to version 12 in the nav section. A tip won't add index entries for renamed settings, so someone looking up "standby_mode" can find out that we've switched to a file called "standby.signal" instead.

Pretend you're a user who has upgraded from pg 11. You're looking at the Pg 12 docs. How long does it take you to find out how to make a server into a standby now? It took me longer than I would've expected...
Re: Add docs stub for recovery.conf
On Thu, Nov 12, 2020 at 3:44 AM Bruce Momjian wrote:
> On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote:
> > Hi all
> >
> > I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the docs for
> > it were removed completely as well. That's largely sensible, but is confusing
> > when users have upgraded and are trying to find out what happened, or how to
> > configure equivalent functionality.
>
> I don't see the logic in carrying doc stuff that we don't have anymore.

I explained why. Here's how the rendered docs look: https://imgur.com/a/VyjzEw5

Think. You're used to recovery.conf. You've recently switched to pg 12. You search for "recovery.conf" or "primary_slot_name" or "standby_mode" or something. You of course land up at https://www.postgresql.org/docs/11/recovery-config.html or another version. What do you do now? There's no "12" or "current" link for your version. You don't follow postgres development closely, and you have no idea we removed the file. You might work it out from the release notes. But even then, searching for "standby_mode" in the updated docs will tell you basically nothing; it has just vanished. Also, by simply deleting the page, we've broken web links to https://www.postgresql.org/docs/current/recovery-config.html

So there are three really good reasons:
* Help users of the web docs navigate to the right place when we remove things
* Don't break web links (breaking links without redirects is bad, the web is sad)
* Help upgrading users who know the old terms find the new terms

I'd welcome your suggestions on other ways to arrange this, so long as it meets the basic requirement "retain the linktable target 'recovery-config'".

In general I think it's quite unpleasant for users to have docs sections just vanish. I strongly suggest that we enact a policy going forwards that any rename or removal should be accompanied by a redirect or stub that helps users find the new contents. It regularly annoys me when I'm trying to navigate around various versions of the docs and things just vanish.
RE: POC: postgres_fdw insert batching
From: t...@corona.is.ed.ac.uk On Behalf Of > Does this patch affect trigger semantics on the base table? > > At the moment when I insert 1000 rows into a postgres_fdw table using a > single insert statement (e.g. INSERT INTO fdw_foo SELECT ... FROM bar) I > naively expect a "statement level" trigger on the base table to trigger > once. But this is not the case. The postgres_fdw implements this > operation as 1000 separate insert statements on the base table, so the > trigger happens 1000 times instead of once. Hence there is no > distinction between using a statement level and a row level trigger on > the base table in this context. > > So would this patch change the behaviour so only 10 separate insert > statements (each of 100 rows) would be made against the base table? > If so that's useful as it means improving performance using statement > level triggers becomes possible. But it would also result in more > obscure semantics and might break user processes dependent on the > existing behaviour after the patch is applied. Yes, the number of times the statement trigger defined on the base (remote) table fires will be reduced, as you said. > BTW is this subtlety documented, I haven't found anything but happy > to be proved wrong? Unfortunately, there doesn't seem to be any description of triggers on base tables. For example, if the local foreign table has an AFTER ROW trigger and its remote base table has a BEFORE ROW trigger that modifies the input record, it seems that the AFTER ROW trigger doesn't see the modified record. Regards Takayuki Tsunakawa
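For illustration, a minimal sketch of the behaviour discussed above (the table, trigger, and function names are hypothetical, not from the patch): a statement-level trigger on the remote base table fires once per INSERT that postgres_fdw sends, so today a 1000-row local INSERT fires it 1000 times, while batches of 100 rows would make it fire 10 times.

-- on the remote server
CREATE TABLE bar_base (a int, b text);
CREATE FUNCTION note_stmt() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE 'statement-level trigger fired';
  RETURN NULL;
END $$;
CREATE TRIGGER bar_base_stmt AFTER INSERT ON bar_base
  FOR EACH STATEMENT EXECUTE FUNCTION note_stmt();

-- on the local server, with fdw_foo a foreign table pointing at bar_base
INSERT INTO fdw_foo SELECT g, g::text FROM generate_series(1, 1000) AS g;
-- today: the notice is raised 1000 times (one remote INSERT per row);
-- with batching at 100 rows per remote INSERT it would be raised 10 times.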
Re: Avoiding useless SHA256 initialization with backup manifests, breaking base backups with FIPS
On Tue, Nov 10, 2020 at 11:00:14AM +0900, Michael Paquier wrote: > Attached is a patch that I would like to back-patch down to v13 to > avoid this useless initialization, giving users the possibility to > take base backups with FIPS when not using a backup manifest. Without > the solution in the first paragraph, you cannot make use of backup > manifests at all with OpenSSL+FIPS (one can still enforce the use of > the in-core SHA2 implementation even if building with OpenSSL), but at > least it gives an escape route with 13. Okay. Hearing nothing, I have applied that. -- Michael
Re: Is it useful to record whether plans are generic or custom?
On 2020-09-29 02:39, legrand legrand wrote: Hi Atsushi, +1: Your proposal is a good answer for time based performance analysis (even if parsing duration or blks are not differentiated). As it makes the pgss number of columns wider, maybe another solution would be to create a pg_stat_statements_xxx view with the same key as pgss (dbid, userid, queryid) and all those new counters. Thanks for your ideas and sorry for my late reply. It seems creating pg_stat_statements_xxx views both for generic and custom plans is better than my PoC patch. However, I also began to wonder how effective it would be to just distinguish between generic and custom plans. Custom plans can include all sorts of plans, and, considering plan cache invalidation, generic plans can also include various plans. Considering this, I'm starting to feel that it would be better to keep not just whether the plan was generic or custom but the plan itself, as discussed in the below thread. https://www.postgresql.org/message-id/flat/CAKU4AWq5_jx1Vyai0_Sumgn-Ks0R%2BN80cf%2Bt170%2BzQs8x6%3DHew%40mail.gmail.com#f57e64b8d37697c808e4385009340871 Any thoughts? Regards, -- Atsushi Torikoshi
Re: recovery_target immediate timestamp
On 2020/11/12 6:00, Euler Taveira wrote: Hi, While restoring a backup using recovery_target immediate, I noticed that there isn't a timestamp available. LOG: consistent recovery state reached at 0/A000100 LOG: recovery stopping after reaching consistency LOG: pausing at the end of recovery HINT: Execute pg_wal_replay_resume() to promote. LOG: database system is ready to accept read only connections if you decide to use one of the targets or just recover until the end of the WAL, you get a (last completed transaction) timestamp. LOG: redo done at 0/10FFEC38 system usage: CPU: user: 0.10 s, system: 0.05 s, elapsed: 1.65 s LOG: last completed transaction was at log time 2020-11-11 17:27:31.715251-03 LOG: restored log file "00010010" from archive cp: cannot stat '/a/pgarchive/0002.history': No such file or directory LOG: selected new timeline ID: 2 LOG: archive recovery complete cp: cannot stat '/a/pgarchive/0001.history': No such file or directory LOG: database system is ready to accept connections I dig into the pg_waldump output to figure out the timestamp, however, the checkpoint timestamp isn't printed by pg_waldump. The checkpoint timestamp might be useful information at least when you set large values for a checkpoint or need to investigate a performance/corruption issue. The first patch adds a new message that prints the latest completed checkpoint when the consistent state is reached. I'm not sure how useful this information is in practice. It also exposes the checkpoint timestamp in debug messages. ereport(DEBUG1, (errmsg("checkpoint record is at %X/%X", (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc))); + ereport(DEBUG1, + (errmsg("checkpoint time is %s", str_time(checkPoint.time; The above first debug message displays the LSN of the checkpoint record. OTOH, the second message displays the time when the checkpoint started (not the time when checkpoint record was written at the end of checkpoint). So isn't it confusing to display those inconsistent information together? LOG: consistent recovery state reached at 0/A000100 DETAIL: Last completed checkpoint was at log time 2020-11-11 17:31:50 -03. LOG: recovery stopping after reaching consistency LOG: pausing at the end of recovery HINT: Execute pg_wal_replay_resume() to promote. LOG: database system is ready to accept read only connections . . . DEBUG: checkpoint record is at 0/A60 DEBUG: checkpoint time is 2020-11-11 17:34:19 -03 The second patch provides the checkpoint timestamp in the pg_waldump output and also when you enable wal_debug parameter. The pg_waldump output looks like +1 +#ifdef FRONTEND + strftime(checkpointstr, sizeof(checkpointstr), "%c", localtime(&time_tmp)); +#else + pg_strftime(checkpointstr, sizeof(checkpointstr), "%c", pg_localtime(&time_tmp, log_timezone)); +#endif You can simplify the code by using timestamptz_to_str() here instead, like xact_desc_commit() does. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: pg_upgrade analyze script
On Wed, Nov 11, 2020 at 04:51:27PM +0100, Magnus Hagander wrote: > Gah. For some reason, I did not spot this email until after I had > applied the other patch. And not only that, I also missed one line in > my patch that you included in yours :/ > > My apologies. No worries, thanks. -- Michael
Re: Proposition for autoname columns
On Thu, Nov 12, 2020 at 12:18:49AM +, Dagfinn Ilmari Mannsåker wrote: > Bruce Momjian writes: > > I think we could do it, but it would only work if the column was output > > as a single json value, and not a multi-key/value field. I am afraid if > > we tried to do it, the result would be too inconsistent to be useful. > > Could this be done via the support function, so that the top-level > operator/function in each select list item can return a suggested column > name if the relevant arguments are constants? Yes, the user explicitly calling a function would be much easier to predict. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Proposition for autoname columns
Bruce Momjian writes: > On Mon, Nov 2, 2020 at 05:05:29PM +0200, Eugen Konkov wrote: >> Hello Pgsql-hackers, >> >> When selecting data from json column it named as '?column?' >> tucha=# select info->>'suma', docn from document order by id desc limit 5; >> ?column? | docn >> --+-- >> 665.97 | 695 >> 513.51 | 632 >> 665.97 | 4804 >> 492.12 | 4315 >> 332.98 | 1302 >> (5 rows) >> >> It would be useful if the name of column will be autoassigned based on >> name of json key. Like at next query: >> >> tucha=# select info->>'suma' as suma, docn from document order by id desc >> limit 5; >> suma | docn >> +-- >> 665.97 | 695 >> 513.51 | 632 >> 665.97 | 4804 >> 492.12 | 4315 >> 332.98 | 1302 >> (5 rows) >> >> >> Would it be useful this auto assigned name for column from json? > > I think we could do it, but it would only work if the column was output > as a single json value, and not a multi-key/value field. I am afraid if > we tried to do it, the result would be too inconsistent to be useful. Could this be done via the support function, so that the top-level operator/function in each select list item can return a suggested column name if the relevant arguments are constants? - ilmari -- - Twitter seems more influential [than blogs] in the 'gets reported in the mainstream press' sense at least. - Matt McLeod - That'd be because the content of a tweet is easier to condense down to a mainstream media article. - Calle Dybedahl
Returning NULL from satisfies_hash_partition() is a bad idea
So far as I can tell, the partition machinery is critically dependent on the idea that partition constraint conditions can never return NULL, only true or false. This is explicitly noted in the comments in get_qual_for_list, for instance, and it's visibly true by construction for both list and range constraints. But hash partition constraints are just calls to satisfies_hash_partition(), and look what we've got there: /* Return null if the parent OID, modulus, or remainder is NULL. */ if (PG_ARGISNULL(0) || PG_ARGISNULL(1) || PG_ARGISNULL(2)) PG_RETURN_NULL(); parent = try_relation_open(parentId, AccessShareLock); if (parent == NULL) PG_RETURN_NULL(); OK, the first one probably shouldn't happen, but it's far from clear that the second cannot. If we did return NULL, that would be taken as a "pass" of the constraint, possibly allowing a non-matching row to be inserted into a partition. So this seems like a seriously bad idea. I don't have a strong position on whether it'd be better to return FALSE (causing a constraint failure error) or just throw an error outright. But we can't leave it like this. regards, tom lane
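To make the hazard concrete, a small sketch (the table names are illustrative): the implicit constraint of a hash partition is simply a call to satisfies_hash_partition(), and a constraint condition that evaluates to NULL is treated as satisfied.

CREATE TABLE ht (id int, v text) PARTITION BY HASH (id);
CREATE TABLE ht_p0 PARTITION OF ht FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- the implicit partition constraint is essentially
--   satisfies_hash_partition('<parent oid>'::oid, 4, 0, id)
SELECT pg_get_partition_constraintdef('ht_p0'::regclass);
-- if that function returned NULL (e.g. because the parent could not be
-- opened), the row would be treated as passing the constraint, which is
-- the failure mode described above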
Re: PATCH: Batch/pipelining support for libpq
Hi, On 2020-11-03 10:42:34 -0300, Alvaro Herrera wrote: > It would definitely help if you (and others) could think about the API > being added: Does it fulfill the promises being made? Does it offer the > guarantees that real-world apps want to have? I'm not much of an > application writer myself -- particularly high-traffic apps that would > want to use this. Somewhere earlier in this thread there was a patch with support for batching in pgbench. I think it'd be good to refresh that. Both because it shows at least some real-world-lite usage of the feature and because we need a way to stress it to see whether it has unnecessary bottlenecks. Greetings, Andres Freund
Tracking cluster upgrade and configuration history
Hello Hackers, While supporting customers, it would frequently be useful to have more information about the history of a cluster. For example, which prior versions were ever installed and running on the cluster? Has the cluster ever been run with fsync=off? When did the server last enter recovery, if ever? Was a backup_label file present at that time? Some of this type of information could strictly be fixed size, such as a fixed set of timestamps for the time at which a fixed set of things last occurred, or a fixed set of bits indicating whether a fixed set of things ever happened. Some other types would be variable size, but hopefully short in practice, like a list of all postgres versions that have ever been run on the cluster. Logging the information via the usual log mechanism seems insufficient, as log files may get rotated and this information lost. Would it be acceptable to store some fixed set of flag bits and timestamps in pg_control? Space there is at a premium. Would it make sense to alternately, or additionally, store some of this information in a flat text file in pg_data, say a new file named "cluster_history" or such? I'm happy to put together a more concrete proposal, but solicit your opinions first on the merits of the idea generally, and if you think the idea good, on the specifics you'd like to see included. Thanks! — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: In-placre persistance change of a relation
Hi, I suggest outlining what you are trying to achieve here. Starting a new thread and expecting people to dig through another thread to infer what you are actually trying to achieve isn't great. FWIW, I'm *extremely* doubtful it's worth adding features that depend on a PGC_POSTMASTER wal_level=minimal being used. Which this does, as far as I understand. If somebody added support for dynamically adapting wal_level (e.g. wal_level=auto, that increases wal_level to replica/logical depending on the presence of replication slots), it'd perhaps be different. On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote: > FWIW this is a revised version of the PoC, which has some known > problems. > > - Flipping of Buffer persistence is not WAL-logged nor even be able to > be safely roll-backed. (It might be better to drop buffers). That's obviously a no-go. I think you might be able to address this if you accept that the command cannot be run in a transaction (like CONCURRENTLY). Then you can first do the catalog changes, change the persistence level, and commit. Greetings, Andres Freund
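For readers who have not followed the other thread, a minimal SQL sketch of the operation under discussion (the table name is hypothetical): these commands exist today but, as far as I understand, are currently implemented by rewriting the whole relation, and the proposal is to flip the persistence in place instead.

ALTER TABLE mytab SET UNLOGGED;
ALTER TABLE mytab SET LOGGED;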
Re: Allow matching whole DN from a client certificate
Greetings, * Andrew Dunstan (and...@dunslane.net) wrote: > Currently we only match the Common Name (CN) of a client certificate > when authenticating a user. The attached patch allows matching the > entire Distinguished Name (DN) of the certificate. This is enabled by > the HBA line option "clientname", which can take the values "CN" or > "DN". "CN" is the default. > > The idea is that you might have a role with a CN of, say, "dbauser" in > two different parts of the organization, say one with "OU=marketing" and > the other with "OU=engineering", and you only want to allow access to > one of them. > > This feature is best used in conjunction with a map. e.g. in testing I > have this pg_hba.conf line: > > hostssl all all 127.0.0.1/32 cert clientname=DN map=dn > > and this pg_ident.conf line: > > dn /^C=US,ST=North.Carolina,O=test,OU=eng,CN=andrew$ andrew > > If people like this idea I'll add tests and docco and add it to the next CF. Yeah, this is definitely a worthwhile feature. Thanks, Stephen signature.asc Description: PGP signature
recovery_target immediate timestamp
Hi, While restoring a backup using recovery_target immediate, I noticed that there isn't a timestamp available. LOG: consistent recovery state reached at 0/A000100 LOG: recovery stopping after reaching consistency LOG: pausing at the end of recovery HINT: Execute pg_wal_replay_resume() to promote. LOG: database system is ready to accept read only connections if you decide to use one of the targets or just recover until the end of the WAL, you get a (last completed transaction) timestamp. LOG: redo done at 0/10FFEC38 system usage: CPU: user: 0.10 s, system: 0.05 s, elapsed: 1.65 s LOG: last completed transaction was at log time 2020-11-11 17:27:31.715251-03 LOG: restored log file "00010010" from archive cp: cannot stat '/a/pgarchive/0002.history': No such file or directory LOG: selected new timeline ID: 2 LOG: archive recovery complete cp: cannot stat '/a/pgarchive/0001.history': No such file or directory LOG: database system is ready to accept connections I dig into the pg_waldump output to figure out the timestamp, however, the checkpoint timestamp isn't printed by pg_waldump. The checkpoint timestamp might be useful information at least when you set large values for a checkpoint or need to investigate a performance/corruption issue. The first patch adds a new message that prints the latest completed checkpoint when the consistent state is reached. It also exposes the checkpoint timestamp in debug messages. LOG: consistent recovery state reached at 0/A000100 DETAIL: Last completed checkpoint was at log time 2020-11-11 17:31:50 -03. LOG: recovery stopping after reaching consistency LOG: pausing at the end of recovery HINT: Execute pg_wal_replay_resume() to promote. LOG: database system is ready to accept read only connections . . . DEBUG: checkpoint record is at 0/A60 DEBUG: checkpoint time is 2020-11-11 17:34:19 -03 The second patch provides the checkpoint timestamp in the pg_waldump output and also when you enable wal_debug parameter. The pg_waldump output looks like rmgr: XLOGlen (rec/tot):114/ 114, tx: 0, lsn: 0/0A60, prev 0/0A28, desc: CHECKPOINT_ONLINE redo 0/A28; timestamp qua 11 nov 2020 17:34:19 -03; tli 1; prev tli 1; fpw true; xid 0:519; oid 24576; multi 1; offset 0; oldest xid 501 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 519; online and the debug messages are DEBUG: checkpoint record is at 0/A60 DEBUG: checkpoint time is 2020-11-11 17:37:47 -03 LOG: REDO @ 0/A60; LSN 0/AD8: prev 0/A28; xid 0; len 88 - XLOG/CHECKPOINT_ONLINE: redo 0/A28; timestamp Wed Nov 11 17:37:47 2020; tli 1; prev tli 1; fpw true; xid 0:519; oid 24576; multi 1; offset 0; oldest xid 501 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 519; online Regards, -- Euler Taveira http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 4b725a45dfb89ee9a463e6ee7987cd8b660053cd Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Wed, 11 Nov 2020 11:24:25 -0300 Subject: [PATCH 1/2] Report latest completed checkpoint timestamp A new detail message prints the latest completed checkpoint when reaching the consistent state. This information is useful when you use recovery_target = immediate because it is not clear what timestamp it stops applying the WAL. It also adds new debug messages that report the checkpoint timestamp. It might be useful for debug purposes. 
--- src/backend/access/transam/xlog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index aa63f37615..cf70950932 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6524,6 +6524,8 @@ StartupXLOG(void) ereport(DEBUG1, (errmsg("checkpoint record is at %X/%X", (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc))); + ereport(DEBUG1, + (errmsg("checkpoint time is %s", str_time(checkPoint.time; InRecovery = true; /* force recovery even if SHUTDOWNED */ /* @@ -6657,6 +6659,8 @@ StartupXLOG(void) ereport(DEBUG1, (errmsg("checkpoint record is at %X/%X", (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc))); + ereport(DEBUG1, + (errmsg("checkpoint time is %s", str_time(checkPoint.time; } else { @@ -8033,7 +8037,9 @@ CheckRecoveryConsistency(void) ereport(LOG, (errmsg("consistent recovery state reached at %X/%X", (uint32) (lastReplayedEndRecPtr >> 32), - (uint32) lastReplayedEndRecPtr))); + (uint32) lastReplayedEndRecPtr), + errdetail("Last completed checkpoint was at log time %s.", + str_time(ControlFile->checkPointCopy.time; } /* -- 2.20.1 From 9b7b3effad8c860a003dc60eb1ea53aa15a26ad9 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Wed, 11 Nov 202
Detecting File Damage & Inconsistencies
I would like to propose a few points that will help us detect file damage, inconsistencies in files and track actions of users. Optionally, we could add these fields onto the WAL commit record: * 8-byte session start time (from MyStartTime) * 2-byte pid (from MyProcPid) * 4-byte user oid Identified by a new flag bit, XACT_XINFO_HAS_USERINFO. That allows us to match up transactions with the server log %c option. Another option might allow a text username to be added to each commit as well. XLogLongPageHeaderData has 4 bytes of unused data because of alignment. We could use that space for 1) the Xid Epoch value or 2) a CRC value - since only WAL records are covered by a CRC, not page or file headers. Perhaps we should add both? There are also 2 bytes unused in the XLogRecord header (with 8 byte alignment). We could optionally use that space for the pid that wrote the record, but that's not compelling. What can we use those 2 bytes for? REINDEX VERIFY: After the new index is created, but before we drop the old index, check whether the two indexes match: * checks whether the previous index had pointers to row versions that don't exist * checks whether the heap has rows that were not in the old index This approach piggybacks on existing operations. AccessShareLock is held on both indexes before the old one is dropped. Other ideas are welcome. Thoughts? -- Simon Riggs http://www.EnterpriseDB.com/
Re: cutting down the TODO list thread
Here is the next section on data types, proposed to be moved to the "not worth doing" page. As before, if there are any objections, do speak up. I'll make the move in a few days. **Datatypes - Fix data types where equality comparison is not intuitive, e.g. box There is likely no way to do this without breaking applications. We already have a big warning about this in the docs. - Add IMMUTABLE column attribute Controversial *Domains (this entire section would go) - Allow functions defined as casts to domains to be called during casting - Allow values to be cast to domain types - Make domains work better with polymorphic functions Old and difficult *Date/Time - Allow TIMESTAMP WITH TIME ZONE to store the original timezone information - Have timestamp subtraction not call justify_hours() Very old - Allow a comma to denote fractional seconds in ISO-8601-compliant times (and timestamps) Apparent lack of interest *Arrays - Add function to detect if an array is empty - Improve handling of NULLs in arrays Lack of interest *Money (this entire section would go) - Add locale-aware MONEY type, and support multiple currencies - MONEY dumps in a locale-specific format making it difficult to restore to a system with a different locale The money datatype seems kind of obsolete anyway, and there doesn't seem to be demand to improve it. *Text Search - Allow dictionaries to change the token that is passed on to later dictionaries - Consider a function-based API for '@@' searches Very old - Improve text search error messages One of the gripes has been fixed already, in any case it's very old - tsearch and tsdicts regression tests fail in Turkish locale on glibc Bug report that refers to locale behavior from 2009 - Improve handling of dash and plus signs in email address user names, and perhaps improve URL parsing Difficult *XML (difficult section -- plenty of bugs which should be fixed, but also old and low interest) - Allow XML arrays to be cast to other data types Very old - Allow reliable XML operation non-UTF8 server encodings Difficult - Move XSLT from contrib/xml2 to a more reasonable location Lack of consensus - Improve the XSLT parameter passing API Lack of consensus - xpath_table needs to be implemented/implementable to get rid of contrib/xml2 - xpath_table is pretty broken anyway Unclear path forward - better handling of XPath data types - Improve handling of PIs and DTDs in xmlconcat() Zero interest - Restructure XML and /contrib/xml2 functionality As discussed in the thread, it's an unrealistically large project -- John Naylor EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Allow matching whole DN from a client certificate
Currently we only match the Common Name (CN) of a client certificate when authenticating a user. The attached patch allows matching the entire Distinguished Name (DN) of the certificate. This is enabled by the HBA line option "clientname", which can take the values "CN" or "DN". "CN" is the default. The idea is that you might have a role with a CN of, say, "dbauser" in two different parts of the organization, say one with "OU=marketing" and the other with "OU=engineering", and you only want to allow access to one of them. This feature is best used in conjunction with a map. e.g. in testing I have this pg_hba.conf line: hostssl all all 127.0.0.1/32 cert clientname=DN map=dn and this pg_ident.conf line: dn /^C=US,ST=North.Carolina,O=test,OU=eng,CN=andrew$ andrew If people like this idea I'll add tests and docco and add it to the next CF. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com " diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c index d132c5cb48..7e40de82e0 100644 --- a/src/backend/libpq/auth.c +++ b/src/backend/libpq/auth.c @@ -2869,12 +2869,15 @@ static int CheckCertAuth(Port *port) { int status_check_usermap = STATUS_ERROR; + char *peer_username; Assert(port->ssl); /* Make sure we have received a username in the certificate */ - if (port->peer_cn == NULL || - strlen(port->peer_cn) <= 0) + peer_username = port->hba->clientcertname == clientCertCN ? port->peer_cn : port->peer_dn; + + if (peer_username == NULL || + strlen(peer_username) <= 0) { ereport(LOG, (errmsg("certificate authentication failed for user \"%s\": client certificate contains no user name", @@ -2882,8 +2885,8 @@ CheckCertAuth(Port *port) return STATUS_ERROR; } - /* Just pass the certificate cn to the usermap check */ - status_check_usermap = check_usermap(port->hba->usermap, port->user_name, port->peer_cn, false); + /* Just pass the certificate cn/dn to the usermap check */ + status_check_usermap = check_usermap(port->hba->usermap, port->user_name, peer_username, false); if (status_check_usermap != STATUS_OK) { /* @@ -2894,7 +2897,7 @@ CheckCertAuth(Port *port) if (port->hba->clientcert == clientCertFull && port->hba->auth_method != uaCert) { ereport(LOG, - (errmsg("certificate validation (clientcert=verify-full) failed for user \"%s\": CN mismatch", + (errmsg("certificate validation (clientcert=verify-full) failed for user \"%s\": CN/DN mismatch", port->user_name))); } } diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c index e10260051f..dbb32c86bb 100644 --- a/src/backend/libpq/be-secure-openssl.c +++ b/src/backend/libpq/be-secure-openssl.c @@ -520,22 +520,30 @@ aloop: /* Get client certificate, if available. */ port->peer = SSL_get_peer_certificate(port->ssl); - /* and extract the Common Name from it. */ + /* and extract the Common Name / Distinguished Name from it. 
*/ port->peer_cn = NULL; + port->peer_dn = NULL; port->peer_cert_valid = false; if (port->peer != NULL) { int len; + X509_NAME *x509name = X509_get_subject_name(port->peer); + char *peer_cn; + char *peer_dn; + BIO *bio = NULL; + BUF_MEM*bio_buf = NULL; - len = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer), - NID_commonName, NULL, 0); - if (len != -1) + if (!x509name) { - char *peer_cn; + return -1; + } + len = X509_NAME_get_text_by_NID(x509name, NID_commonName, NULL, 0); + if (len != -1) + { peer_cn = MemoryContextAlloc(TopMemoryContext, len + 1); - r = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer), - NID_commonName, peer_cn, len + 1); + r = X509_NAME_get_text_by_NID(x509name, NID_commonName, peer_cn, + len + 1); peer_cn[len] = '\0'; if (r != len) { @@ -559,6 +567,32 @@ aloop: port->peer_cn = peer_cn; } + + bio = BIO_new(BIO_s_mem()); + if (!bio) + { + return -1; + } + /* use commas instead of slashes */ + X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_SEP_COMMA_PLUS); + BIO_get_mem_ptr(bio, &bio_buf); + peer_dn = MemoryContextAlloc(TopMemoryContext, bio_buf->length + 1); + memcpy(peer_dn, bio_buf->data, bio_buf->length); + peer_dn[bio_buf->length] = '\0'; + if (bio_buf->length != strlen(peer_dn)) + { + ereport(COMMERROR, + (errcode(ERRCODE_PROTOCOL_VIOLATION), + errmsg("SSL certificate's distinguished name contains embedded null"))); + BIO_free(bio); + pfree(peer_dn); + return -1; + } + + BIO_free(bio); + + port->peer_dn = peer_dn; + port->peer_cert_valid = true; } @@ -590,6 +624,12 @@ be_tls_close(Port *port) pfree(port->peer_cn); port->peer_cn = NULL; } + + if (port->peer_dn) + { + pfree(port->peer_dn); + port->peer_dn = NULL; + } } ssize_t diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c index 2ae507a902..ef1320e686 100644 --- a/src/backend/libpq/be-secure.c +++ b/src/backend/
Re: Add docs stub for recovery.conf
> On 11 Nov 2020, at 21:01, Bruce Momjian wrote: > > On Wed, Nov 11, 2020 at 08:59:40PM +0100, Daniel Gustafsson wrote: >>> On 11 Nov 2020, at 20:44, Bruce Momjian wrote: >>> On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote: >> I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the docs for it were removed completely as well. That's largely sensible, but is confusing when users have upgraded and are trying to find out what happened, or how to configure equivalent functionality. >>> >>> I don't see the logic in carrying doc stuff that we don't have anymore. >> >> Well, we do have that already in <note>'s sprinkled across the docs where it >> makes sense to help transitioning users, like this one in func.sgml: >> >>"Prior to PostgreSQL 12, it was possible to skip arbitrary text in the >>input string using non-letter or non-digit characters..." >> >> It doesn't seem like a terrible idea to do a similar one for recovery.conf. > > I am fine with a tip. The patch looked like it was creating a new > chapter for it. I admittedly hadn't looked at the patch, but now that I have I agree with not adding a separate "obsolete" topic for it. I'd prefer to use tips within the docs; they will also help guide users who search for recovery.conf and land in the tip which is next to the relevant updated documentation on the topic. cheers ./daniel
Re: Add docs stub for recovery.conf
On Wed, Nov 11, 2020 at 08:59:40PM +0100, Daniel Gustafsson wrote: > > On 11 Nov 2020, at 20:44, Bruce Momjian wrote: > > On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote: > > >> I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the > >> docs for > >> it were removed completely as well. That's largely sensible, but is > >> confusing > >> when users have upgraded and are trying to find out what happened, or how > >> to > >> configure equivalent functionality. > > > > I don't see the logic in carrying doc stuff that we don't have anymore. > > Well, we do have that already in <note>'s sprinkled across the docs where it > makes sense to help transitioning users, like this one in func.sgml: > > "Prior to PostgreSQL 12, it was possible to skip arbitrary text in the > input string using non-letter or non-digit characters..." > > It doesn't seem like a terrible idea to do a similar one for recovery.conf. I am fine with a tip. The patch looked like it was creating a new chapter for it. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Add docs stub for recovery.conf
> On 11 Nov 2020, at 20:44, Bruce Momjian wrote: > On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote: >> I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the docs >> for >> it were removed completely as well. That's largely sensible, but is confusing >> when users have upgraded and are trying to find out what happened, or how to >> configure equivalent functionality. > > I don't see the logic in carrying doc stuff that we don't have anymore. Well, we do have that already in <note>'s sprinkled across the docs where it makes sense to help transitioning users, like this one in func.sgml: "Prior to PostgreSQL 12, it was possible to skip arbitrary text in the input string using non-letter or non-digit characters..." It doesn't seem like a terrible idea to do a similar one for recovery.conf. cheers ./daniel
Re: Add docs stub for recovery.conf
On Wed, Nov 11, 2020 at 12:44 PM Bruce Momjian wrote: > On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote: > > Hi all > > > > I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the > docs for > > it were removed completely as well. That's largely sensible, but is > confusing > > when users have upgraded and are trying to find out what happened, or > how to > > configure equivalent functionality. > > I don't see the logic in carrying doc stuff that we don't have anymore. > > I do. Saying why something went away has value. For small stuff you have commit messages. For user-facing documentation stuff that warranted its own page, having said page remain and describe the change seems worthwhile. David J.
Re: Parallel bitmap index scan
Hi, I took a look at this today, doing a bit of stress-testing, and I can get it to crash because of segfaults in pagetable_create (not sure if the issue is there, it might be just a symptom of an issue elsewhere). Attached is a shell script I use to run the stress test - it's using 'test' database, generates tables of different size and then runs queries with various parameter combinations. It takes a while to trigger the crash, so it might depend on timing or something like that. I've also attached two examples of backtraces. I've also seen infinite loop in pagetable_create, but the crashes are much more common. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Core was generated by `postgres: postgres test [local] SELECT '. Program terminated with signal SIGSEGV, Segmentation fault. #0 MemoryContextAllocZero (context=context@entry=0x561c0598cf40, size=size@entry=48) at mcxt.c:852 852 ret = context->methods->alloc(context, size); (gdb) bt #0 MemoryContextAllocZero (context=context@entry=0x561c0598cf40, size=size@entry=48) at mcxt.c:852 #1 0x561c0467b95c in pagetable_create (nelements=128, private_data=0x7f155f7ef000, ctx=0x561c0598cf40) at ../../../src/include/lib/simplehash.h:457 #2 tbm_create_pagetable (tbm=tbm@entry=0x7f155f7ef000) at tidbitmap.c:296 #3 0x561c0467bae3 in tbm_get_pageentry (tbm=tbm@entry=0x7f155f7ef000, pageno=3779) at tidbitmap.c:1303 #4 0x561c0467bed5 in tbm_union_page (a=a@entry=0x7f155f7ef000, bpage=0x7f155f7e21b8) at tidbitmap.c:514 #5 0x561c0467c5a0 in tbm_union (b=0x561c059cd468, a=0x7f155f7ef000) at tidbitmap.c:474 #6 tbm_union (a=0x7f155f7ef000, b=0x561c059cd468) at tidbitmap.c:457 #7 0x561c0467cb77 in tbm_merge (tbm=0x561c059cd468, dp_tbm=, dp_pagetable=0x7f155f8d4418) at tidbitmap.c:822 #8 0x561c0461c80f in BitmapHeapNext (node=node@entry=0x561c05996400) at nodeBitmapHeapscan.c:228 #9 0x561c0460f611 in ExecScanFetch (recheckMtd=0x561c0461bf10 , accessMtd=0x561c0461bfa0 , node=0x561c05996400) at execScan.c:133 #10 ExecScan (node=0x561c05996400, accessMtd=0x561c0461bfa0 , recheckMtd=0x561c0461bf10 ) at execScan.c:182 #11 0x561c04615d31 in ExecProcNode (node=0x561c05996400) at ../../../src/include/executor/executor.h:247 #12 fetch_input_tuple (aggstate=aggstate@entry=0x561c05995d98) at nodeAgg.c:589 #13 0x561c046188c8 in agg_retrieve_direct (aggstate=) at nodeAgg.c:2356 #14 ExecAgg (pstate=) at nodeAgg.c:2171 #15 0x561c0461f487 in ExecProcNode (node=0x561c05995d98) at ../../../src/include/executor/executor.h:247 #16 gather_getnext (gatherstate=0x561c05995bf8) at nodeGather.c:295 #17 ExecGather (pstate=0x561c05995bf8) at nodeGather.c:227 #18 0x561c04615d31 in ExecProcNode (node=0x561c05995bf8) at ../../../src/include/executor/executor.h:247 #19 fetch_input_tuple (aggstate=aggstate@entry=0x561c059955d0) at nodeAgg.c:589 #20 0x561c046188c8 in agg_retrieve_direct (aggstate=) at nodeAgg.c:2356 #21 ExecAgg (pstate=) at nodeAgg.c:2171 #22 0x561c04606b4b in ExecProcNode (node=0x561c059955d0) at ../../../src/include/executor/executor.h:247 #23 ExecutePlan (execute_once=, dest=0x561c059c6550, direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x561c059955d0, estate=0x561c05995378) at execMain.c:1542 #24 standard_ExecutorRun (queryDesc=0x561c058fdac8, direction=, count=0, execute_once=) at execMain.c:364 #25 0x561c0475f39c in PortalRunSelect (portal=0x561c0593f758, forward=, count=0, dest=) at pquery.c:912 #26 0x561c04760546 in PortalRun (portal=portal@entry=0x561c0593f758, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x561c059c6550, altdest=altdest@entry=0x561c059c6550, qc=0x7ffd52084c80) at pquery.c:756 #27 0x561c0475c33c in exec_simple_query (query_string=0x561c058dbce8 "select count(b)from t where a between 3387 and 4027;") at postgres.c:1239 #28 0x561c0475dee0 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffd520850e0, dbname=, username=) at postgres.c:4305 #29 0x561c046e709c in BackendRun (port=, port=) at postmaster.c:4488 #30 BackendStartup (port=) at postmaster.c:4210 #31 ServerLoop () at postmaster.c:1727 #32 0x561c046e7f3f in PostmasterMain (argc=, argv=0x561c058d6550) at postmaster.c:1400 #33 0x561c0448d970 in main (argc=3, argv=0x561c058d6550) at main.c:209 Core was generated by `postgres: postgres test [local] EXPLAIN '. Program terminated with signal SIGSEGV, Segmentation fault. #0 MemoryContextAllocZero (context=context@entry=0x561c0598ed90, size=size@entry=48) at mcxt.c:852 852 ret = context->methods->alloc(context, size); (gdb) bt #0 MemoryContextAllocZero (context=context@entry=0x561c0598ed90, size=size@entry=48) at mcxt.c:852 #1 0x561c0467b95c in
Re: Split copy.c
Hey Heikki, On Tue, Nov 3, 2020 at 1:30 AM Heikki Linnakangas wrote: Thanks for working on this refactor. LGTM! I had a few minor comments: 1. We should rename the CopyFormatOptions *cstate param in ProcessCopyOptions() to CopyFormatOptions *options and List *options to List *raw_options IMO, to make it more readable. 2. We need to update the header comment for Copy{From,To}StateData. It is currently the old comment from CopyStateData. 3. Can we add docs for the missing fields in the header comment for BeginCopyFrom()? 4. > /* > * Working state for COPY TO/FROM > */ > MemoryContext copycontext; /* per-copy execution context */ Comment needs to be updated for the COPY operation. 5. > I also split/duplicated the CopyStateData struct into CopyFromStateData > and CopyToStateData. Many of the fields were common, but many were not, > and I think some duplication is nicer than a struct where you use some > fields and others are unused. I put the common formatting options into a > new CopyFormatOptions struct. Would we be better off if we sub-struct CopyState <- Copy{From,To}State? Like this: typedef struct Copy{From|To}StateData { CopyState cs; // Fields specific to COPY FROM/TO follow.. } 6. > /* create workspace for CopyReadAttributes results */ > if (!cstate->opts.binary) Can we replace this if with an else? Regards, Soumyadeep (VMware)
Re: Add docs stub for recovery.conf
On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote: > Hi all > > I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the docs > for > it were removed completely as well. That's largely sensible, but is confusing > when users have upgraded and are trying to find out what happened, or how to > configure equivalent functionality. I don't see the logic in carrying doc stuff that we don't have anymore. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Allow some recovery parameters to be changed with reload
Hello > Anyway, for now I think that your first patch would be enough, i.e., > just change the context of restore_command to PGC_SIGHUP. Glad to hear. Attached a rebased version of the original proposal. regards, Sergeidiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index f043433e31..ec74cb43ad 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3542,7 +3542,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows -This parameter can only be set at server start. +This parameter can only be set in the postgresql.conf +file or on the server command line. diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index bb34630e8e..a732279f52 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3699,7 +3699,7 @@ static struct config_string ConfigureNamesString[] = }, { - {"restore_command", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY, + {"restore_command", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY, gettext_noop("Sets the shell command that will be called to retrieve an archived WAL file."), NULL }, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 9cb571f7cc..9c9091e601 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -253,7 +253,6 @@ # placeholders: %p = path of file to restore # %f = file name only # e.g. 'cp /mnt/server/archivedir/%f %p' -# (change requires restart) #archive_cleanup_command = '' # command to execute at every restartpoint #recovery_end_command = '' # command to execute at completion of recovery
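A quick sketch of how this would be exercised once restore_command is PGC_SIGHUP (the command path is illustrative): edit restore_command in postgresql.conf on the recovering server, e.g. restore_command = 'cp /mnt/server/archivedir/%f "%p"', then reload and check the effective value.

SELECT pg_reload_conf();
-- with the patch, context should report 'sighup' rather than 'postmaster'
SELECT name, setting, context FROM pg_settings WHERE name = 'restore_command';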
Re: Feature request: Improve allowed values for generate series
On Wed, Nov 11, 2020 at 12:12 PM Eugen Konkov wrote: > > > So I feature request to allow zero size step for cases when > start point is equest to finish > > > What do you think? > > > > hm probably with step 0 we always should generate series of one > value and exit, despite on finish value. > Because with step 0 we always stay at current position, so there is > always should be just one value. > > How is this better than writing "VALUES (start date)"? David J.
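For comparison, a minimal sketch of the two ways to get that single row today, reusing the dates from the example upthread:

-- with equal endpoints, any non-zero step already yields exactly one row
SELECT * FROM generate_series('2020-11-09'::timestamptz, '2020-11-09', INTERVAL '1 day');
-- and the degenerate case can be written without generate_series at all
VALUES ('2020-11-09'::timestamptz);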
Re: Add session statistics to pg_stat_database
On Tue, 2020-11-10 at 15:03 +, Georgios Kokolatos wrote: > I noticed that the cfbot fails for this patch. > > For this, I am setting the status to: 'Waiting on Author'. Thanks for noticing, it was only the documentation build. Version 5 attached, status changed back to "waiting for review". Yours, Laurenz Albe From afc37856c12fd0a85587c638fca291a0b5652d9b Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Wed, 11 Nov 2020 20:14:28 +0100 Subject: [PATCH] Add session statistics to pg_stat_database If "track_counts" is active, track the following per database: - total number of connections - number of sessions that ended other than with a client disconnect - total time spent in database sessions - total time spent executing queries - total idle in transaction time This is useful to check if connection pooling is working. It also helps to estimate the size of the connection pool required to keep the database busy, which depends on the percentage of the transaction time that is spent idling. Discussion: https://postgr.es/m/b07e1f9953701b90c66ed368656f2aef40cac4fb.ca...@cybertec.at Reviewed-By: Soumyadeep Chakraborty, Justin Pryzby, Masahiro Ikeda (This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID) --- doc/src/sgml/monitoring.sgml | 49 + src/backend/catalog/system_views.sql | 5 ++ src/backend/postmaster/pgstat.c | 105 ++- src/backend/tcop/postgres.c | 5 ++ src/backend/utils/adt/pgstatfuncs.c | 68 + src/include/catalog/pg_proc.dat | 20 + src/include/pgstat.h | 27 +++ src/test/regress/expected/rules.out | 5 ++ 8 files changed, 283 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 98e1995453..89610d1010 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -3704,6 +3704,55 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + + session_time double precision + + + Time spent by database sessions in this database, in milliseconds + (note that statistics are only updated when the state of a session + changes, so if sessions have been idle for a long time, this idle time + won't be included) + + + + + + active_time double precision + + + Time spent executing SQL statements in this database, in milliseconds + + + + + + idle_in_transaction_time double precision + + + Time spent idling while in a transaction in this database, in milliseconds + + + + + + connections bigint + + + Total number of connections established to this database + + + + + + aborted_sessions bigint + + + Number of database sessions to this database that were terminated + by something else than a regular client disconnection + + + stats_reset timestamp with time zone diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 2e4aa1c4b6..998b4d542a 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -924,6 +924,11 @@ CREATE VIEW pg_stat_database AS pg_stat_get_db_checksum_last_failure(D.oid) AS checksum_last_failure, pg_stat_get_db_blk_read_time(D.oid) AS blk_read_time, pg_stat_get_db_blk_write_time(D.oid) AS blk_write_time, +pg_stat_get_db_session_time(D.oid) AS session_time, +pg_stat_get_db_active_time(D.oid) AS active_time, +pg_stat_get_db_idle_in_transaction_time(D.oid) AS idle_in_transaction_time, +pg_stat_get_db_connections(D.oid) AS connections, +pg_stat_get_db_aborted_sessions(D.oid) AS aborted_sessions, pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset FROM ( SELECT 0 AS oid, NULL::name AS datname diff 
--git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index e76e627c6b..9978aab60a 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -249,6 +249,9 @@ static int pgStatXactCommit = 0; static int pgStatXactRollback = 0; PgStat_Counter pgStatBlockReadTime = 0; PgStat_Counter pgStatBlockWriteTime = 0; +static PgStat_Counter pgStatActiveTime = 0; +static PgStat_Counter pgStatTransactionIdleTime = 0; +bool pgStatSessionDisconnected = false; /* Record that's written to 2PC state file when pgstat state is persisted */ typedef struct TwoPhasePgStatRecord @@ -334,6 +337,7 @@ static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg); static void pgstat_send_funcstats(void); static void pgstat_send_slru(void); static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid); +static void pgstat_send_connstats(bool dis
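For illustration, a sketch of how the new counters could be read once the patch is applied (the column names are taken from the patch; the ratio is just one way to interpret them for connection-pool sizing, as described in the commit message):

SELECT datname,
       connections,
       aborted_sessions,
       session_time,
       active_time,
       idle_in_transaction_time,
       round((active_time / nullif(session_time, 0))::numeric, 2) AS active_fraction
FROM pg_stat_database
WHERE datname = current_database();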
Re: Proposition for autoname columns
Hello Bruce, Wednesday, November 11, 2020, 5:56:08 PM, you wrote: > On Mon, Nov 2, 2020 at 05:05:29PM +0200, Eugen Konkov wrote: >> Hello Pgsql-hackers, >> >> When selecting data from json column it named as '?column?' >> tucha=# select info->>'suma', docn from document order by id desc limit 5; >> ?column? | docn >> --+-- >> 665.97 | 695 >> 513.51 | 632 >> 665.97 | 4804 >> 492.12 | 4315 >> 332.98 | 1302 >> (5 rows) >> >> It would be useful if the name of column will be autoassigned based on >> name of json key. Like at next query: >> >> tucha=# select info->>'suma' as suma, docn from document order by id desc >> limit 5; >> suma | docn >> +-- >> 665.97 | 695 >> 513.51 | 632 >> 665.97 | 4804 >> 492.12 | 4315 >> 332.98 | 1302 >> (5 rows) >> >> >> Would it be useful this auto assigned name for column from json? > I think we could do it, but it would only work if the column was output > as a single json value, and not a multi-key/value field. I am afraid if > we tried to do it, the result would be too inconsistent to be useful. cool, thank you. -- Best regards, Eugen Konkov
Re: Feature request: Improve allowed values for generate series
Hello Eugen, Wednesday, November 11, 2020, 8:50:59 PM, you wrote: > Hello Pgsql-hackers, > Seems I fall into corner case: test=>> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL '00:00:00' ); > ERROR: step size cannot equal zero > But: test=>> SELECT * FROM generate_series( '2020-11-09', '2020-11-10', INTERVAL '1 day' ); > generate_series > > 2020-11-09 00:00:00+02 > 2020-11-10 00:00:00+02 > (2 rows) > Here we start at 2020-11-09, add interval of one day and finish at > 2020-11-10 > Done! series is generated. > In first case I expect that I start at 2020-11-09, add interval of zero > and finish at 2020-11-09 > Everything is consistent. test=>> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL '00:00:00' ); > generate_series > > 2020-11-09 00:00:00+02 > (1 row) > So I feature request to allow zero size step for cases when start > point is equest to finish > What do you think? Hm, probably with step 0 we should always generate a series of one value and exit, regardless of the finish value. Because with step 0 we always stay at the current position, there should always be just one value. -- Best regards, Eugen Konkov
Re: Feature request: Improve allowed values for generate series
On Wed, Nov 11, 2020 at 11:59 AM Eugen Konkov wrote: > So I feature request to allow zero size step for cases when start > point is equest to finish > > What do you think? > I don't see how this is useful. If they are equal and you use a non-zero step you get back the one record you are looking for anyway, plus the non-zero step allows them to be unequal. If zero step is allowed it is only useful for when they are equal, being undefined when they are unequal. David J.
Re: Support for NSS as a libpq TLS backend
On Nov 11, 2020, at 10:17 AM, Jacob Champion wrote: > > (Two failures left on my machine.) False alarm -- the stderr debugging I'd added in to track down the assertion tripped up the "no stderr" tests. Zero failing tests now. --Jacob
Re: Feature request: Improve allowed values for generate series
st 11. 11. 2020 v 19:59 odesílatel Eugen Konkov napsal: > Hello Pgsql-hackers, > > Seems I fall into corner case: > > test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL > '00:00:00' ); > ERROR: step size cannot equal zero > > But: > test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-10', INTERVAL > '1 day' ); > generate_series > > 2020-11-09 00:00:00+02 > 2020-11-10 00:00:00+02 > (2 rows) > > Here we start at 2020-11-09, add interval of one day and finish at > 2020-11-10 > Done! series is generated. > > In first case I expect that I start at 2020-11-09, add interval of > zero and finish at 2020-11-09 > Everything is consistent. > > test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL > '00:00:00' ); > generate_series > > 2020-11-09 00:00:00+02 > (1 row) > > > So I feature request to allow zero size step for cases when start > point is equest to finish > > What do you think? > What is the real use case? Current implementation is very simple - increment should not be zero, and then we know so there is no infinity cycle. Regards Pavel > > -- > Best regards, > Eugen Konkov > > > >
Feature request: Improve allowed values for generate series
Hello Pgsql-hackers, Seems I fell into a corner case: test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL '00:00:00' ); ERROR: step size cannot equal zero But: test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-10', INTERVAL '1 day' ); generate_series 2020-11-09 00:00:00+02 2020-11-10 00:00:00+02 (2 rows) Here we start at 2020-11-09, add an interval of one day and finish at 2020-11-10. Done! The series is generated. In the first case I expect that I start at 2020-11-09, add an interval of zero and finish at 2020-11-09. Everything is consistent. test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL '00:00:00' ); generate_series 2020-11-09 00:00:00+02 (1 row) So my feature request is to allow a zero-size step for cases when the start point is equal to the finish. What do you think? -- Best regards, Eugen Konkov
Re: Allow some recovery parameters to be changed with reload
On 2020/11/07 6:36, Sergei Kornilov wrote: Hello I'm wondering if it's safe to allow restore_command to be emptied during archive recovery. Even when it's emptied, archive recovery can proceed by reading WAL files from pg_wal directory. This is the same behavior as when restore_command is set to, e.g., /bin/false. I am always confused by this implementation detail. restore_command fails? Fine, let's just read file from pg_wal. But this is different topic... I do not know the history of this fatal ereport. It looks like "must specify restore_command when standby mode is not enabled" check is only intended to protect the user from misconfiguration and the rest code will treat empty restore_command correctly, just like /bin/false. Maybe. Anyway, for now I think that your first patch would be enough, i.e., just change the context of restore_command to PGC_SIGHUP. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Support for NSS as a libpq TLS backend
On Nov 10, 2020, at 2:28 PM, Daniel Gustafsson wrote: > > Digging through the archives from when this landed in curl, the assertion > failure was never fully identified back then but happened spuriously. Which > version of NSPR is this happening with? This is NSPR 4.29, with debugging enabled. The fd that causes the assertion is the custom layer that's added during be_tls_open_server(), which connects a Port as the layer secret. It looks like NSPR is trying to help surface potential memory leaks by asserting if the secret is non-NULL at the time the stack is being closed. In this case, it doesn't matter since the Port lifetime is managed elsewhere, but it looks easy enough to add a custom close in the way that cURL and the NSPR test programs [1] do. Sample patch attached, which gets me to the end of the tests without any assertions. (Two failures left on my machine.) --Jacob [1] https://hg.mozilla.org/projects/nspr/file/bf6620c143/pr/tests/nblayer.c#l354
Re: PATCH: Attempt to make dbsize a bit more consistent
Hey Georgios, On Tue, Nov 10, 2020 at 6:20 AM wrote: > > > > > > > ‐‐‐ Original Message ‐‐‐ > On Monday, November 9, 2020 7:50 PM, Soumyadeep Chakraborty > wrote: > > > Hey Georgios, > > > > Thanks for looking for more avenues to invoke tableAM APIS! Please find > > my review below: > > A great review Soumyadeep, it is much appreciated. > Please remember to add yourself as a reviewer in the commitfest > [https://commitfest.postgresql.org/30/2701/] Ah yes. Sorry, I forgot that! > > > > On Tue, Oct 13, 2020 at 6:28 AM gkokola...@pm.me wrote: > > > > 1. > > > > > /* > > > > > > - - heap size, including FSM and VM > > > > > > - - table size, including FSM and VM > > > */ > > > > > > > We should not mention FSM and VM in dbsize.c at all as these are > > heap AM specific. We can say: > > table size, excluding TOAST relation > > Yeah, I was thinking that the notion that FSM and VM are still taken > into account should be stated. We are iterating over ForkNumber > after all. > > How about phrasing it as: > > + table size, including all implemented forks from the AM (e.g. FSM, VM) > + excluding TOAST relations > > Thoughts? Yes, I was thinking along the same lines. The concept of a "fork" forced should not be forced down into the tableAM. But that is a discussion for another day. We can probably say: + table size, including all implemented forks from the AM (e.g. FSM, VM + for the heap AM) excluding TOAST relations > > > > 2. > > > > > /* > > > > > > - Size of toast relation > > > */ > > > if (OidIsValid(rel->rd_rel->reltoastrelid)) > > > > > > > > > - size += calculate_toast_table_size(rel->rd_rel->reltoastrelid); > > > > > > - { > > > - Relation toastRel; > > > - > > > - toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock); > > > > We can replace the OidIsValid check with relation_needs_toast_table() > > and then have the OidIsValid() check in an Assert. Perhaps, that will > > make things more readable. > > Please, allow me to kindly disagree. > > Relation is already open at this stage. Even create_toast_table(), the > internal workhorse for creating toast relations, does check reltoastrelid > to test if the relation is already toasted. > > Furthermore, we do call: > > + toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock); > > and in order to avoid elog() errors underneath us, we ought to have > verified the validity of reltoastrelid. > > In short, I think that the code in the proposal is not too unreadable > nor that it breaks the coding patterns throughout the codebase. > > Am I too wrong? No not at all. The code in the proposal is indeed adhering to the codebase. What I was going for here was to increase the usage of relation_needs_toast_table(). What I meant was: if (table_relation_needs_toast_table(rel)) { if (!OidIsValid(rel->rd_rel->reltoastrelid)) { elog(ERROR, ); } //size calculation here.. } We want to error out if the toast table can't be found and the relation is expected to have one, which the existing code doesn't handle. > > > > > 3. > > > > > - if (rel->rd_rel->relkind == RELKIND_RELATION || > > > - rel->rd_rel->relkind == RELKIND_TOASTVALUE || > > > - rel->rd_rel->relkind == RELKIND_MATVIEW) > > > - size = calculate_table_size(rel); > > > - else > > > - { > > > - relation_close(rel, AccessShareLock); > > > - PG_RETURN_NULL(); > > > - } > > > > This leads to behavioral changes: > > > > I am talking about the case where one calls pg_table_size() on an index. > > W/o your patch, it returns the size of the index. W/ your patch it > > returns NULL. 
Judging by the documentation, this function should not > > ideally apply to indexes but it does! I have a sinking feeling that lots > > of users use it for this purpose, as there is no function to calculate > > the size of a single specific index (except pg_relation_size()). > > The same argument I have made above applies to sequences. Users may have > > trial-and-errored their way into finding that pg_table_size() can tell > > them the size of a specific index/sequence! I don't know how widespread > > the use is in the user community, so IMO maybe we should be conservative > > and not introduce this change? Alternatively, we could call out that > > pg_table_size() is only for tables by throwing an error if anything > > other than a table is passed in. > > > > If we decide to preserve the existing behavior of the pg_table_size(): > > It seems that for things not backed by the tableAM (indexes and > > sequences), they should still go through calculate_relation_size(). > > We can call table_relation_size() based on if relkind is > > RELKIND_RELATION, RELKIND_TOASTVALUE or RELKIND_MATVIEW. Perhaps it > > might be worthwhile making a new macro RELKIND_HAS_TABLE_STORAGE to > > capture these three cases (See RELKIND_HAS_STORAGE). This also ensures > > that we return 0 for things that don't qualify as RELKIND_HAS_STORAGE, > > such as a partitioned table (Cu
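Returning to the relation_needs_toast_table() point above: a minimal sketch of what the reworked branch in calculate_table_size() could look like is below. The error text is illustrative only, and a single fork is shown where the patch actually sums every fork of the TOAST relation (and its index):

    if (table_relation_needs_toast_table(rel))
    {
        Relation    toastRel;

        if (!OidIsValid(rel->rd_rel->reltoastrelid))
            elog(ERROR, "expected a TOAST relation for \"%s\", but none was found",
                 RelationGetRelationName(rel));

        toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);
        size += table_relation_size(toastRel, MAIN_FORKNUM);  /* plus the other forks and the TOAST index */
        relation_close(toastRel, AccessShareLock);
    }

This keeps the relation_open() call the patch already makes, while erroring out in the "expected but missing" case that an OidIsValid()-only check silently skips.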
Re: Parallel copy
On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila wrote: > > On Tue, Nov 10, 2020 at 7:12 PM vignesh C wrote: > > > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila wrote: > > > > > > > I have worked to provide a patch for the parallel safety checks. It > > checks if parallel copy can be performed. Parallel copy cannot be > > performed for the following: a) If relation is temporary table b) If > > relation is foreign table c) If relation has non parallel safe index > > expressions d) If relation has triggers present whose type is of non > > before statement trigger type e) If relation has check constraint > > which are not parallel safe f) If relation has partition and any > > partition has the above type. This patch has the checks for it. This > > patch will be used by parallel copy implementation. > > > > How did you ensure that this is sufficient? For parallel-insert's > patch we have enabled parallel-mode for Inserts and ran the tests with > force_parallel_mode to see if we are not missing anything. Also, it > seems there are many common things here w.r.t parallel-insert patch, > is it possible to prepare this atop that patch or do you have any > reason to keep this separate? > I have done similar testing for copy too: I had set force_parallel_mode to regress, hardcoded in the code to pick parallel workers for copy operation and ran make installcheck-world to verify. Many checks in this patch are common between both patches, but I was not sure how to handle it as both the projects are in-progress and are being updated based on the reviewer's opinion. How to handle this? Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
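To make the list of checks above a little more concrete, the overall shape is roughly the following sketch; copy_is_parallel_safe() and has_nonsafe_triggers() are illustrative names only, not the names used in the actual patch:

    static bool
    copy_is_parallel_safe(Relation rel)
    {
        /* a) temporary tables are backend-local, so workers cannot see them */
        if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
            return false;

        /* b) foreign tables go through the FDW API */
        if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
            return false;

        /* d) any trigger other than a BEFORE ... FOR EACH STATEMENT trigger */
        if (rel->trigdesc != NULL && has_nonsafe_triggers(rel->trigdesc))
            return false;

        /*
         * c) parallel-unsafe index expressions, e) parallel-unsafe check
         * constraints, and f) recursion into every partition of a
         * partitioned table are checked along the same lines.
         */
        return true;
    }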
Re: Implementing Incremental View Maintenance
Hello Konstantin, I remember testing it with pg_stat_statements (and planning counters enabled). Maybe identifying the internal queries associated with this (simple) test case could help the dev team? Regards PAscal
Re: Prevent printing "next step instructions" in initdb and pg_upgrade
On Wed, Nov 11, 2020 at 05:39:10PM +0100, Magnus Hagander wrote: > On Wed, Nov 11, 2020 at 5:38 PM Bruce Momjian wrote: > > > > On Wed, Nov 11, 2020 at 11:21:22AM -0500, Bruce Momjian wrote: > > > On Wed, Nov 11, 2020 at 05:11:38PM +0100, Magnus Hagander wrote: > > > > > In summary, I think the vacuumdb --analyze is now a one-line command, > > > > > but the delete part can be complex and not easily typed. > > > > > > > > I definitely agree to that part -- my argument is that it's more > > > > complex (or rather, more platform-specific) than can easily be put in > > > > the script either. If it were to be replaced it wouldn't be withy "a > > > > command",it would be with instructions basically saying "delete the > > > > old cluster". Platform specific wrappers (debian, redhat, windows or > > > > whatever) knows more and could do a more complete job, but trying to > > > > make all that generic is very complicated. > > > > > > Agreed, but I don't see OS-specific instruction on how to delete the > > > tablespaces as reasonable --- that should be done by pg_upgrade, and we > > > have had almost no complaints about that. > > > > Stepping back, I have always felt there was need for a wrapper program > > to setup pg_upgrade, and clean up the old cluster on finish, though no > > one has created such a tool yet. > > Sure they have, it's just platform specific. > > pg_upgradecluster on debian is a great example :) Yes, true. Ideally it would be nice to have a cross-platform wrapper around pg_upgrade. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Proposition for autoname columns
On Wed, Nov 11, 2020 at 8:56 AM Bruce Momjian wrote: > > It would be useful if the name of column will be autoassigned based on > > name of json key. Like at next query: > > > > tucha=# select info->>'suma' as suma, docn from document order by id > desc limit 5; > > suma | docn > > +-- > > > Would it be useful this auto assigned name for column from json? > > I think we could do it, but it would only work if the column was output > as a single json value, and not a multi-key/value field. I am afraid if > we tried to do it, the result would be too inconsistent to be useful. > Doing it seems problematic given the nature of SQL and existing means to assign names to columns. If it can be done I don't see how the output value would make any difference. What is being asked for is the simple textual value on the right side of the ->> (and other similar) operators to be converted into a column name. I could image doing this at rewrite time by saying (in parse terms): info->>'suma to' becomes info->>'suma' AS "suma to" (specifically, add AS, double-quote the literal and stick it after the AS). If {AS "suma to"} isn't valid syntax for some value of "suma to" just drop the attempt and move on. I agree that this feature would be useful. David J.
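One concrete place such a rewrite could plausibly live is the default-column-name logic (FigureColnameInternal() in parse_target.c) rather than the rewriter proper. The following is only a rough sketch of that idea, and it assumes the function has no conflicting handling for A_Expr nodes today:

    case T_A_Expr:
        {
            A_Expr     *a = (A_Expr *) node;

            /* info->>'suma' : use the key string as the default column name */
            if (a->kind == AEXPR_OP &&
                list_length(a->name) == 1 &&
                strcmp(strVal(linitial(a->name)), "->>") == 0 &&
                IsA(a->rexpr, A_Const) &&
                IsA(&((A_Const *) a->rexpr)->val, String))
            {
                *name = strVal(&((A_Const *) a->rexpr)->val);
                return 2;
            }
        }
        break;

The consistency concern Bruce raises still applies, of course: chained operators or non-constant keys would continue to fall back to "?column?".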
Re: Prevent printing "next step instructions" in initdb and pg_upgrade
On Wed, Nov 11, 2020 at 5:38 PM Bruce Momjian wrote: > > On Wed, Nov 11, 2020 at 11:21:22AM -0500, Bruce Momjian wrote: > > On Wed, Nov 11, 2020 at 05:11:38PM +0100, Magnus Hagander wrote: > > > > In summary, I think the vacuumdb --analyze is now a one-line command, > > > > but the delete part can be complex and not easily typed. > > > > > > I definitely agree to that part -- my argument is that it's more > > > complex (or rather, more platform-specific) than can easily be put in > > > the script either. If it were to be replaced it wouldn't be withy "a > > > command",it would be with instructions basically saying "delete the > > > old cluster". Platform specific wrappers (debian, redhat, windows or > > > whatever) knows more and could do a more complete job, but trying to > > > make all that generic is very complicated. > > > > Agreed, but I don't see OS-specific instruction on how to delete the > > tablespaces as reasonable --- that should be done by pg_upgrade, and we > > have had almost no complaints about that. > > Stepping back, I have always felt there was need for a wrapper program > to setup pg_upgrade, and clean up the old cluster on finish, though no > one has created such a tool yet. Sure they have, it's just platform specific. pg_upgradecluster on debian is a great example :) -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Prevent printing "next step instructions" in initdb and pg_upgrade
On Wed, Nov 11, 2020 at 11:21:22AM -0500, Bruce Momjian wrote: > On Wed, Nov 11, 2020 at 05:11:38PM +0100, Magnus Hagander wrote: > > > In summary, I think the vacuumdb --analyze is now a one-line command, > > > but the delete part can be complex and not easily typed. > > > > I definitely agree to that part -- my argument is that it's more > > complex (or rather, more platform-specific) than can easily be put in > > the script either. If it were to be replaced it wouldn't be withy "a > > command",it would be with instructions basically saying "delete the > > old cluster". Platform specific wrappers (debian, redhat, windows or > > whatever) knows more and could do a more complete job, but trying to > > make all that generic is very complicated. > > Agreed, but I don't see OS-specific instruction on how to delete the > tablespaces as reasonable --- that should be done by pg_upgrade, and we > have had almost no complaints about that. Stepping back, I have always felt there was need for a wrapper program to setup pg_upgrade, and clean up the old cluster on finish, though no one has created such a tool yet. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Supporting = operator in gin/gist_trgm_ops
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: not tested Hi, this patch implements a useful and missing feature. Thank you. It includes documentation, which, to a non-native speaker like myself, seems appropriate. It includes comprehensive tests that cover the implemented cases. In the thread, Alexander has pointed out, quote: "It would be more efficient to generate trigrams for equal operator using generate_trgm() instead of generate_wildcard_trgm()" I will echo the sentiment, though from a slightly different and possibly less important point of view. The method used to extract trigrams from the query should match the method used to extract trigrams from the values when they get added to the index. This is gin_extract_value_trgm(), which is indeed using generate_trgm(). I have no opinion on Alexander's second comment regarding costing. I am changing the status to 'Waiting on Author', but please feel free to override my opinion if you feel I am wrong and reset it to 'Needs review'. Cheers, //Georgios The new status of this patch is: Waiting on Author
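For readers of the thread, the suggested change amounts to something like the following in gin_extract_query_trgm()'s strategy switch; the exact strategy-number constant is whatever the patch ends up defining for the = operator:

    case EqualStrategyNumber:
        /*
         * Equality can reuse the same trigram set that
         * gin_extract_value_trgm() stores for indexed values, i.e.
         * generate_trgm(), instead of generate_wildcard_trgm().
         */
        trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
        break;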
Re: Prevent printing "next step instructions" in initdb and pg_upgrade
On Wed, Nov 11, 2020 at 05:11:38PM +0100, Magnus Hagander wrote: > > Uh, pg_upgrade does enumerate things like tablespaces in > > create_script_for_old_cluster_deletion(). I think config file locations > > are beyond the scope of what we want pg_upgrade to handle. > > Ah, that's right. Forgot that it did actually handle tablespaces. > > But how would config file locations be beyond the scope? WIthout that, > it's incomplete for any user using that... I would be worried the config file would somehow be shared between old and new clusters. Potentially you could use the same pg_hba.conf for old and new, and then deleting one would stop both. The cluster and tablespace directories are clearly isolated to a single cluster. I am not sure that is 100% true of config files, or if deletion permission would be possible for those. It just seems too error-prone to attempt. > And for the vast majority of users, it would for example also have to > include the un-defining of a system service (systemd, sysvinit, or > whatever) as well, which would be even more out of scope by that > argument, yet at least as important in actually deleting the old > cluster... Well, the question is what can the user reaonsably do. Finding all the tablespace subdirecties is a pain, but managing systemctl seems like something they would already have had to deal with. I am happy for pg_upgrade to handle the mostly-internal parts and leave the user-facing stuff to the user, which is what the pg_upgrade usage instructions do as well. > > In summary, I think the vacuumdb --analyze is now a one-line command, > > but the delete part can be complex and not easily typed. > > I definitely agree to that part -- my argument is that it's more > complex (or rather, more platform-specific) than can easily be put in > the script either. If it were to be replaced it wouldn't be withy "a > command",it would be with instructions basically saying "delete the > old cluster". Platform specific wrappers (debian, redhat, windows or > whatever) knows more and could do a more complete job, but trying to > make all that generic is very complicated. Agreed, but I don't see OS-specific instruction on how to delete the tablespaces as reasonable --- that should be done by pg_upgrade, and we have had almost no complaints about that. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Prevent printing "next step instructions" in initdb and pg_upgrade
On Wed, Nov 11, 2020 at 4:53 PM Bruce Momjian wrote: > > On Mon, Nov 2, 2020 at 02:23:35PM +0100, Magnus Hagander wrote: > > On Tue, Oct 27, 2020 at 12:40 PM Bruce Momjian wrote: > > > > > > On Tue, Oct 27, 2020 at 12:35:56PM +0100, Peter Eisentraut wrote: > > > > On 2020-10-27 11:53, Bruce Momjian wrote: > > > > > On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote: > > > > > > On 2020-10-06 12:26, Magnus Hagander wrote: > > > > > > > I went with the name --no-instructions to have the same name for > > > > > > > both > > > > > > > initdb and pg_upgrade. The downside is that "no-instructions" also > > > > > > > causes the scripts not to be written in pg_upgrade, which > > > > > > > arguably is a > > > > > > > different thing. We could go with "--no-instructions" and > > > > > > > "--no-scripts", but that would leave the parameters different. I > > > > > > > also > > > > > > > considered "--no-next-step", but that one didn't quite have the > > > > > > > right > > > > > > > ring to me. I'm happy for other suggestions on the parameter > > > > > > > names. > > > > > > > > > > > > What scripts are left after we remove the analyze script, as > > > > > > discussed in a > > > > > > different thread? > > > > > > > > > > There is still delete_old_cluster.sh. > > > > > > > > Well, that one can trivially be replaced by a printed instruction, too. > > > > > > True. On my system is it simply: > > > > > > rm -rf '/u/pgsql.old/data' > > > > > > The question is whether the user is going to record the vacuumdb and rm > > > instructions that display at the end of the pg_upgrade run, or do we > > > need to write it down for them in script files. > > > > That assumes for example that you've had no extra tablespaces defined > > in it. And it assumes your config files are actually in the same > > directory etc. > > > > Now, pg_upgrade *could* create a script that "actually works" for most > > things, since it connects to the system and could then enumerate > > things like tablespaces and config file locations, and generate a > > script that actually uses that information. > > Uh, pg_upgrade does enumerate things like tablespaces in > create_script_for_old_cluster_deletion(). I think config file locations > are beyond the scope of what we want pg_upgrade to handle. Ah, that's right. Forgot that it did actually handle tablespaces. But how would config file locations be beyond the scope? WIthout that, it's incomplete for any user using that... And for the vast majority of users, it would for example also have to include the un-defining of a system service (systemd, sysvinit, or whatever) as well, which would be even more out of scope by that argument, yet at least as important in actually deleting the old cluster... > In summary, I think the vacuumdb --analyze is now a one-line command, > but the delete part can be complex and not easily typed. I definitely agree to that part -- my argument is that it's more complex (or rather, more platform-specific) than can easily be put in the script either. If it were to be replaced it wouldn't be withy "a command",it would be with instructions basically saying "delete the old cluster". Platform specific wrappers (debian, redhat, windows or whatever) knows more and could do a more complete job, but trying to make all that generic is very complicated. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Implementing Incremental View Maintenance
On 05.10.2020 12:16, Yugo NAGATA wrote: Hi, Attached is the rebased patch (v18) to add support for Incremental Materialized View Maintenance (IVM). It is able to be applied to current latest master branch. Thank you very much for this work. I consider incremental materialized views as "reincarnation" of OLAP hypercubes. There are two approaches of making OLAP queries faster: 1. speed up query execution (using JIT, columnar store, vector operations and parallel execution) 2. precalculate requested data Incremental materialize views make it possible to implement second approach. But how competitive it is? I do not know current limitations of incremental materialized views, but I checked that basic OLAP functionality: JOIN+GROUP_BY+AGGREGATION is supported. The patch is not applied to the current master because makeFuncCall prototype is changed, I fixed it by adding COAERCE_CALL_EXPLICIT. Then I did the following simple test: 1. Create pgbench database with scale 100. pgbench speed at my desktop is about 10k TPS: pgbench -M prepared -N -c 10 -j 4 -T 30 -P 1 postgres tps = 10194.951827 (including connections establishing) 2. Then I created incremental materialized view: create incremental materialized view teller_sums as select t.tid,sum(abalance) from pgbench_accounts a join pgbench_tellers t on a.bid=t.bid group by t.tid; SELECT 1000 Time: 20805.230 ms (00:20.805) 20 second is reasonable time, comparable with time of database initialization. Then obviously we see advantages of precalculated aggregates: postgres=# select * from teller_sums where tid=1; tid | sum -+ 1 | -96427 (1 row) Time: 0.871 ms postgres=# select t.tid,sum(abalance) from pgbench_accounts a join pgbench_tellers t on a.bid=t.bid group by t.tid having t.tid=1 ; tid | sum -+ 1 | -96427 (1 row) Time: 915.508 ms Amazing. Almost 1000 times difference! 3. Run pgbench once again: Ooops! Now TPS are much lower: tps = 141.767347 (including connections establishing) Speed of updates is reduced more than 70 times! Looks like we loose parallelism because almost the same result I get with just one connection. 4. Finally let's create one more view (it is reasonable to expect that analytics will run many different queries and so need multiple views). create incremental materialized view teller_avgs as select t.tid,avg(abalance) from pgbench_accounts a join pgbench_tellers t on a.bid=t.bid group by t.tid; It is great that not only simple aggregates like SUM are supported, but also AVG. But insertion speed now is reduced twice - 72TPS. I tried to make some profiling but didn't see something unusual: 16.41% postgres postgres [.] ExecInterpExpr 8.78% postgres postgres [.] slot_deform_heap_tuple 3.23% postgres postgres [.] ExecMaterial 2.71% postgres postgres [.] AllocSetCheck 2.33% postgres postgres [.] AllocSetAlloc 2.29% postgres postgres [.] slot_getsomeattrs_int 2.26% postgres postgres [.] ExecNestLoop 2.11% postgres postgres [.] MemoryContextReset 1.98% postgres postgres [.] tts_minimal_store_tuple 1.87% postgres postgres [.] heap_compute_data_size 1.78% postgres postgres [.] fill_val 1.56% postgres postgres [.] tuplestore_gettuple 1.44% postgres postgres [.] sentinel_ok 1.35% postgres postgres [.] heap_fill_tuple 1.27% postgres postgres [.] tuplestore_gettupleslot 1.17% postgres postgres [.] ExecQual 1.14% postgres postgres [.] tts_minimal_clear 1.13% postgres postgres [.] CheckOpSlotCompatibility 1.10% postgres postgres [.] base_yyparse 1.10% postgres postgres [.] heapgetpage 1.04% postgres postgres [.] 
heap_form_minimal_tuple 1.00% postgres postgres [.] slot_getsomeattrs So good news is that incremental materialized views really work. And bad news is that maintenance overhead is too large which significantly restrict applicability of this approach. Certainly in case of dominated read-only workload such materialized views can significantly improve performance. But unfortunately my dream that them allow to combine OLAP+OLPT is not currently realized. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
RE: POC: postgres_fdw insert batching
On Wed, 11 Nov 2020, tsunakawa.ta...@fujitsu.com wrote: From: Tomas Vondra I see the patch builds the "bulk" query in execute_foreign_modify. IMO that's something we should do earlier, when we're building the simple query (for 1-row inserts). I'd understand if you were concerned about overhead in case of 1-row inserts, trying to not plan the bulk query until necessary, but I'm not sure this actually helps. Or was the goal to build a query for every possible number of slots? I don't think that's really useful, considering it requires deallocating the old plan, preparing a new one, etc. IMO it should be sufficient to have two queries - one for 1-row inserts, one for the full batch. The last incomplete batch can be inserted using a loop of 1-row queries. That's what my patch was doing, but I'm not insisting on that - it just seems like a better approach to me. So feel free to argue why this is better. Don't be concerned, the processing is not changed for 1-row inserts: the INSERT query string is built in PlanForeignModify(), and the remote statement is prepared in execute_foreign_modify() during the first call to ExecForeignInsert() and it's reused for subsequent ExecForeignInsert() calls. The re-creation of INSERT query string and its corresponding PREPARE happen when the number of tuples to be inserted is different from the previous call to ExecForeignInsert()/ExecForeignBulkInsert(). That's because we don't know how many tuples will be inserted during planning (PlanForeignModify) or execution (until the scan ends for SELECT). For example, if we insert 10,030 rows with the bulk size 100, the flow is: PlanForeignModify(): build the INSERT query string for 1 row ExecForeignBulkInsert(100): drop the INSERT query string and prepared statement for 1 row build the query string and prepare statement for 100 row INSERT execute it ExecForeignBulkInsert(100): reuse the prepared statement for 100 row INSERT and execute it ... ExecForeignBulkInsert(30): drop the INSERT query string and prepared statement for 100 row build the query string and prepare statement for 30 row INSERT execute it I think it'd be good to have such GUC, even if only for testing and development. We should probably have a way to disable the batching, which the GUC could also do, I think. So +1 to have the GUC. OK, I'll add it. The name would be max_bulk_insert_tuples, because a) it might cover bulk insert for local relations in the future, and b) "tuple" is used in cpu_(index_)tuple_cost and parallel_tuple_cost, while "row" or "record" is not used in GUC (except for row_security). The valid range would be between 1 and 1,000 (I said 10,000 previously, but I think it's an overreaction and am a bit worried about unforeseen trouble too many tuples might cause.) 1 disables the bulk processing and uses the traditional ExecForeignInsert(). The default value is 100 (would 1 be sensible as a default value to avoid surprising users by increased memory usage?) 2. Should we accumulate records per relation in ResultRelInfo instead? That is, when inserting into a partitioned table that has foreign partitions, delay insertion until a certain number of input records accumulate, and then insert accumulated records per relation (e.g., 50 records to relation A, 30 records to relation B, and 20 records to relation C.) 
If we do that, I think there's a chunk of text missing here? If we do that, then what? Sorry, the two bullets below there are what follows. Perhaps I should have written ":" instead of ",". Anyway, I don't see why accumulating the records in ResultRelInfo would be better than what the patch does now. It seems to me like fairly specific to FDWs, so keeping it int FDW state seems appropriate. What would be the advantage of stashing it in ResultRelInfo? I thought of distributing input records to their corresponding partitions' ResultRelInfos. For example, input record for partition 1 comes, store it in the ResultRelInfo for partition 1, then input record for partition 2 comes, store it in the ResultRelInfo for partition 2. When a ResultRelInfo accumulates some number of rows, insert the accumulated rows therein into the partition. When the input endds, perform bulk inserts for ResultRelInfos that have accumulated rows. I think that's OK for most use cases, and if it's not (e.g. when there's something requiring the exact order of writes) then it's not possible to use batching. That's one of the reasons why I think we should have a GUC to disable the batching. Agreed. * Should the maximum count of accumulated records be applied per relation or the query? When many foreign partitions belong to a partitioned table, if the former is chosen, it may use much memory in total. If the latter i
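As an aside, the re-preparation described above boils down to rebuilding the remote statement with a multi-row VALUES list whenever the batch size changes. A simplified sketch follows (illustrative only; the real work happens in postgres_fdw's deparse and prepare machinery, and append_batch_values() is not an actual function there):

    static void
    append_batch_values(StringInfo buf, int ncols, int nrows)
    {
        int         pindex = 1;

        for (int row = 0; row < nrows; row++)
        {
            appendStringInfoString(buf, row == 0 ? " VALUES (" : ", (");
            for (int col = 0; col < ncols; col++)
                appendStringInfo(buf, "%s$%d", col == 0 ? "" : ", ", pindex++);
            appendStringInfoChar(buf, ')');
        }
    }

With a batch size of 100 and two columns, the statement prepared once and reused has the shape INSERT ... VALUES ($1, $2), ($3, $4), ..., ($199, $200); only the final 30-row remainder forces another PREPARE.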
Re: Proposition for autoname columns
On Mon, Nov 2, 2020 at 05:05:29PM +0200, Eugen Konkov wrote: > Hello Pgsql-hackers, > > When selecting data from json column it named as '?column?' > tucha=# select info->>'suma', docn from document order by id desc limit 5; > ?column? | docn > --+-- > 665.97 | 695 > 513.51 | 632 > 665.97 | 4804 > 492.12 | 4315 > 332.98 | 1302 > (5 rows) > > It would be useful if the name of column will be autoassigned based on > name of json key. Like at next query: > > tucha=# select info->>'suma' as suma, docn from document order by id desc > limit 5; > suma | docn > +-- > 665.97 | 695 > 513.51 | 632 > 665.97 | 4804 > 492.12 | 4315 > 332.98 | 1302 > (5 rows) > > > Would it be useful this auto assigned name for column from json? I think we could do it, but it would only work if the column was output as a single json value, and not a multi-key/value field. I am afraid if we tried to do it, the result would be too inconsistent to be useful. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Prevent printing "next step instructions" in initdb and pg_upgrade
On Mon, Nov 2, 2020 at 02:23:35PM +0100, Magnus Hagander wrote: > On Tue, Oct 27, 2020 at 12:40 PM Bruce Momjian wrote: > > > > On Tue, Oct 27, 2020 at 12:35:56PM +0100, Peter Eisentraut wrote: > > > On 2020-10-27 11:53, Bruce Momjian wrote: > > > > On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote: > > > > > On 2020-10-06 12:26, Magnus Hagander wrote: > > > > > > I went with the name --no-instructions to have the same name for > > > > > > both > > > > > > initdb and pg_upgrade. The downside is that "no-instructions" also > > > > > > causes the scripts not to be written in pg_upgrade, which arguably > > > > > > is a > > > > > > different thing. We could go with "--no-instructions" and > > > > > > "--no-scripts", but that would leave the parameters different. I > > > > > > also > > > > > > considered "--no-next-step", but that one didn't quite have the > > > > > > right > > > > > > ring to me. I'm happy for other suggestions on the parameter names. > > > > > > > > > > What scripts are left after we remove the analyze script, as > > > > > discussed in a > > > > > different thread? > > > > > > > > There is still delete_old_cluster.sh. > > > > > > Well, that one can trivially be replaced by a printed instruction, too. > > > > True. On my system is it simply: > > > > rm -rf '/u/pgsql.old/data' > > > > The question is whether the user is going to record the vacuumdb and rm > > instructions that display at the end of the pg_upgrade run, or do we > > need to write it down for them in script files. > > That assumes for example that you've had no extra tablespaces defined > in it. And it assumes your config files are actually in the same > directory etc. > > Now, pg_upgrade *could* create a script that "actually works" for most > things, since it connects to the system and could then enumerate > things like tablespaces and config file locations, and generate a > script that actually uses that information. Uh, pg_upgrade does enumerate things like tablespaces in create_script_for_old_cluster_deletion(). I think config file locations are beyond the scope of what we want pg_upgrade to handle. In summary, I think the vacuumdb --analyze is now a one-line command, but the delete part can be complex and not easily typed. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: pg_upgrade analyze script
On Wed, Nov 11, 2020 at 2:21 AM Michael Paquier wrote: > > On Tue, Nov 10, 2020 at 10:21:04AM +0900, Michael Paquier wrote: > > On Mon, Nov 09, 2020 at 03:47:22PM +0100, Peter Eisentraut wrote: > >> You should just remove those calls. There is no need to replace them with > >> vacuumdb calls. The reason those calls were there is that they were > >> testing > >> the generated script itself. If the script is gone, there is no more need. > >> There are already separate tests for testing vacuumdb. > > > > True, 102_vacuumdb_stages.pl already does all that. > > Let's fix that. From what I can see it only involves the attached, > and as a bonus it also reduces the number of extra characters showing > on my terminal after running pg_upgrade tests. Gah. For some reason, I did not spot this email until after I had applied the other patch. And not only that, I also missed one line in my patch that you included in yours :/ My apologies. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: pg_upgrade analyze script
On Mon, Nov 9, 2020 at 3:47 PM Peter Eisentraut wrote: > > On 2020-11-09 11:22, Magnus Hagander wrote: > >> I have spotted one small-ish thing. This patch is missing to update > >> the following code in vcregress.pl: > >> print "\nSetting up stats on new cluster\n\n"; > >> system(".\\analyze_new_cluster.bat") == 0 or exit 1; > > Ah, nice catch -- thanks! I guess this is unfortunately not a test > > that's part of what cfbot tests. > > > > Untested on Windows, but following the patterns of the rows before it. > > I will go ahead and push this version in a bit. > > You should just remove those calls. There is no need to replace them > with vacuumdb calls. The reason those calls were there is that they > were testing the generated script itself. If the script is gone, there > is no more need. There are already separate tests for testing vacuumdb. > Good point. Done. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: proposal: possibility to read dumped table's name from file
On 2020-Nov-11, Pavel Stehule wrote: > I think the proposed feature is very far to be the config file for pg_dump > (it implements a option "--filter"). This is not the target. It is not > designed for this. This is just an alternative for options like -t, -T, ... > and I am sure so nobody will generate this file manually. Main target of > this patch is eliminating problems with the max length of the command line. > So it is really not designed to be the config file for pg_dump. I agree that a replacement for existing command line arguments is a good goal, but at the same time it's good to keep in mind the request that more object types are supported as dumpable. While it's not necessary that this infrastructure supports all object types in the first cut, it'd be good to have it support that. I would propose that instead of a single letter 't' etc we support keywords, maybe similar to those returned by getObjectTypeDescription() (with additions -- for example for "table data"). Then we can extend for other object types later without struggling to find good letters for each. Of course we could allow abbrevations for common cases, such that "t" means "table". For example: it'll be useful to support selective dumping of functions, materialized views, foreign objects, etc.
Re: In-place persistance change of a relation
Greetings, * Kyotaro Horiguchi (horikyota@gmail.com) wrote: > At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost wrote > in > > * Kyotaro Horiguchi (horikyota@gmail.com) wrote: > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place > > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some > > > trials of several ways, I drifted to the following way after poking > > > several ways. > > > > > > 1. Flip BM_PERMANENT of active buffers > > > 2. adding/removing init fork > > > 3. sync files, > > > 4. Flip pg_class.relpersistence. > > > > > > It always skips table copy in the SET UNLOGGED case, and only when > > > wal_level=minimal in the SET LOGGED case. Crash recovery seems > > > working by some brief testing by hand. > > > > Somehow missed that this patch more-or-less does what I was referring to > > down-thread, but I did want to mention that it looks like it's missing a > > necessary FlushRelationBuffers() call before the sync, otherwise there > > could be dirty buffers for the relation that's being set to LOGGED (with > > wal_level=minimal), which wouldn't be good. See the comments above > > smgrimmedsync(). > > Right. Thanks. However, since SetRelFileNodeBuffersPersistence() > called just above scans shared buffers so I don't want to just call > FlushRelationBuffers() separately. Instead, I added buffer-flush to > SetRelFileNodeBuffersPersistence(). Maybe I'm missing something, but it sure looks like in the patch that SetRelFileNodeBuffersPersistence() is being called after the smgrimmedsync() call, and I don't think you get to just switch the order of those- the sync is telling the kernel to make sure it's written to disk, while the FlushBuffer() is just writing it into the kernel but doesn't provide any guarantee that the data has actually made it to disk. We have to FlushBuffer() first, and then call smgrimmedsync(). Perhaps there's a way to avoid having to go through shared buffers twice, and I generally agreed it'd be good if we could avoid doing so, but this approach doesn't look like it actually works. > FWIW this is a revised version of the PoC, which has some known > problems. > > - Flipping of Buffer persistence is not WAL-logged nor even be able to > be safely roll-backed. (It might be better to drop buffers). Not sure if it'd be better to drop buffers or not, but figuring out how to deal with rollback seems pretty important. How is the persistence change in the catalog not WAL-logged though..? > - This version handles indexes but not yet handle toast relatins. Would need to be fixed, of course. > - tableAMs are supposed to support this feature. (but I'm not sure > it's worth allowing them not to do so). Seems like they should. Thanks, Stephen signature.asc Description: PGP signature
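For clarity, the ordering Stephen describes is, per fork, essentially:

    /* 1. write any dirty shared buffers for the relation out to the kernel */
    FlushRelationBuffers(rel);

    /* 2. only then ask the kernel to push the file down to stable storage */
    RelationOpenSmgr(rel);
    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);

This is only a sketch of the required order; the patch's SetRelFileNodeBuffersPersistence() could fold the flush into its existing buffer scan, and every fork of every affected relation needs the same treatment.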
Re: Disable WAL logging to speed up data loading
Greetings, * tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote: > * ALTER TABLE SET UNLOGGED/LOGGED without data copy > Good: > - Does not require server restart (if this feature can be used in all > wal_level settings). > > Bad: > - The user has to maintain and modify some scripts to use ALTER TABLE when > adding or removing the tables/partitions to load data into. For example, if > the data loading job specifies a partitioned table, he may forget to add > ALTER TABLE for new partitions, resulting in slow data loading. I'm not sure that I see this as really being much of an issue. Perhaps there are some things we can do, as I mentioned before, to make it easier for users to have tables be created as unlogged from the start, or to be able to ALTER TABLE a bunch of tables at once (using all in tablespace, or maybe having an ALTER TABLE on a partitioned table cascade to the partitions), but overall the risk here seems very low- clearly whatever processing is running to load the data into a particular table knows what the table is and adding an ALTER TABLE into it would be trivial. Specifically, for a partitioned table, I would think the load would go something like: CREATE UNLOGGED TABLE ... load all of the data ALTER TABLE ... SET LOGGED ALTER TABLE ... ATTACH PARTITION If the data load could all be done in a single transaction then you wouldn't even need to create the table as UNLOGGED or issue the SET LOGGED, with wal_level=minimal, you just need to create the table in the same transaction that you do the data load in. > * wal_level = none > Good: > - Easy to use. The user does not have to be aware of what tables are loaded. > This can come in handy when migrating from an older version or another DBMS, > building test databases, and consolidating databases. A GUC that allowed users to set a default for newly created tables to be unlogged would also address this. > Bad: > - Requires server restart. Introducing yet another wal_level strikes me as a very large step, one that the arguments presented here for why it'd be worth it don't come anywhere near justifying that step. > I expect both features will be able to meet our customer's needs. The worst > scenario (I don't want to imagine!) is that neither feature fails to be > committed. So, let us continue both features. I'll join Horiguchi-san's new > thread, and please help us here too. (I'll catch up with the recent > discussion in this thread and reply.) While you're certainly welcome to spend your time where you wish to, if, as you say, making these changes to how tables can be switched from unlogged to logged with wal_level=minimal meets this use-case then that strikes me as definitely the right approach and removes any justification for adding another wal_level. > > Couldn't we have something like the following? > > > > ALTER TABLE table1, table2, table3 SET UNLOGGED; > > > > That is, multiple target object specification in ALTER TABLE sttatement. > > Likewise, can't we do ALTER TABLE SET UNLOGGED/LOGGED against a partitioned > table? Currently, the statement succeeds but none of the partitioned table > nor its partitions is set unlogged (pg_class.relpersistence remains 'p'). Is > this intended? If it's a bug, I'm willing to fix it so that it reports an > eror. Of course, it's good to make all partitions unlogged at once. 
I agree that this doesn't seem quite right and considering the way other commands work like CREATE INDEX, I would think that doing such an ALTER TABLE would recurse to the individual partitions (skipping over any which are already set to the persistance desired..). Thanks, Stephen signature.asc Description: PGP signature
Re: avoid bitmapOR-ing indexes with scan condition inconsistent with partition constraint
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: not tested I think that work on improving operator_predicate_proof should really be done in a separate patch. And this minimal patch is doing its work well. The new status of this patch is: Ready for Committer
Re: Skip ExecCheckRTPerms in CTAS with no data
On Tue, Nov 10, 2020 at 1:18 AM Anastasia Lubennikova wrote: > > I took a look at the patch. It is quite simple, so no comments about the > code. It would be good to add a test to select_into.sql to show that it > only applies to 'WITH NO DATA' and that subsequent insertions will fail > if permissions are not set. > Done. > > Maybe we should also mention it a documentation, but I haven't found any > specific paragraph about permissions on CTAS. > Yes we do not have anything related to permissions mentioned in the documentation. So, I'm not adding it now. Apart from the above, I also noticed that we unnecessarily allocate bulk insert state(16MB memory) in case of WITH NO DATA, just to free it in intorel_shutdown() without actually using it. So, in the v2 patch I have made changes to not allocate bulk insert state. Attaching v2 patch. Consider it for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v2-Skip-Insert-Perm-Check-Bulk-Insert-State-alloc-in-WITH-NO-DATA.patch Description: Binary data
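For reference, the bulk-insert-state part of the change is essentially the following in createas.c's intorel_startup() (a sketch only; the attached v2 patch is authoritative), with intorel_shutdown() then freeing the state only when it was actually allocated:

    /* WITH NO DATA inserts nothing, so skip the bulk-insert state entirely */
    if (!into->skipData)
        myState->bistate = GetBulkInsertState();
    else
        myState->bistate = NULL;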
Re: logical streaming of xacts via test_decoding is broken
On Wed, Nov 11, 2020 at 6:59 PM Amit Kapila wrote: > > On Wed, Nov 11, 2020 at 10:00 AM Dilip Kumar wrote: > > > > On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila wrote: > > > > > > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar wrote: > > > > > > > > > > You can probably add a comment as to why we are disallowing this case. > > > I thought of considering 'stream-changes' parameter here because it > > > won't make sense to give this parameter without it, however, it seems > > > that is not necessary but maybe adding a comment > > > here in that regard would be a good idea. > > > > Should we also consider the case that if the user just passed > > skip_empty_streams to true then we should automatically set > > skip_empty_xacts to true? > > > > Is there any problem if we don't do this? Actually, adding > dependencies on parameters is confusing so I want to avoid that unless > it is really required. > > > And we will allow the 'skip_empty_streams' parameter only if > > stream-changes' is true. The reason behind this thought is that if the user doesn't pass any value for 'skip_empty_xacts' then the default value will be false and if the user only pass 'skip_empty_streams' to true then we will error out assuming that skip_empty_xacts is false but skip_empty_streams is true. So it seems instead of error out we can assume that skip_empty_streams true mean skip_empty_xacts is also true if nothing is passed for that. > Can't we simply ignore 'skip_empty_streams' if 'stream-changes' is not given? Yeah, we can do that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
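In code terms, the defaulting Dilip describes would look roughly like the sketch below in test_decoding's pg_decode_startup(); xacts_given/streams_given are illustrative locals, and data->skip_empty_streams is the option field the patch introduces. Amit's alternative of simply ignoring skip-empty-streams when stream-changes is off would drop the last step:

    bool        xacts_given = false;
    bool        streams_given = false;

    foreach(option, ctx->output_plugin_options)
    {
        DefElem    *elem = (DefElem *) lfirst(option);

        if (strcmp(elem->defname, "skip-empty-xacts") == 0)
        {
            xacts_given = true;
            /* existing parse_bool() handling of data->skip_empty_xacts */
        }
        else if (strcmp(elem->defname, "skip-empty-streams") == 0)
        {
            streams_given = true;
            /* parse_bool() handling of data->skip_empty_streams */
        }
        /* ... other options ... */
    }

    /* skip-empty-streams=true without an explicit skip-empty-xacts implies it */
    if (streams_given && !xacts_given && data->skip_empty_streams)
        data->skip_empty_xacts = true;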
Re: logical streaming of xacts via test_decoding is broken
On Wed, Nov 11, 2020 at 10:00 AM Dilip Kumar wrote: > > On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila wrote: > > > > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar wrote: > > > > > > > You can probably add a comment as to why we are disallowing this case. > > I thought of considering 'stream-changes' parameter here because it > > won't make sense to give this parameter without it, however, it seems > > that is not necessary but maybe adding a comment > > here in that regard would be a good idea. > > Should we also consider the case that if the user just passed > skip_empty_streams to true then we should automatically set > skip_empty_xacts to true? > Is there any problem if we don't do this? Actually, adding dependencies on parameters is confusing so I want to avoid that unless it is really required. > And we will allow the 'skip_empty_streams' parameter only if > stream-changes' is true. > Can't we simply ignore 'skip_empty_streams' if 'stream-changes' is not given? -- With Regards, Amit Kapila.
Re: cutting down the TODO list thread
On Tue, Nov 10, 2020 at 7:08 PM Bruce Momjian wrote: > On Tue, Nov 3, 2020 at 02:06:13PM -0400, John Naylor wrote: > > I was thinking of not having the next updates during commitfest, but it > could > > also be argued this is a type of review, and the things here will be > returned > > with feedback or rejected, in a way. Ultimately, it comes down to "when > time > > permits". > > I don't understand what this is referring to. Thanks for the rest of > the work. > This was awkwardly phrased, but I was concerned future proposals for removal would be easy to miss during commitfest. At this point, I'm thinking it isn't an issue. -- John Naylor EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Clean up optional rules in grammar
On Wed, Nov 11, 2020 at 5:13 AM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > It's obviously confusing to have multiple different styles to do the > same thing. And these extra rules (including the empty ones) also end > up in the output, so they create more work down the line. > > The attached patch cleans this up to make them all look like the first > style. > +1 for standardizing in this area. It's worth noting that Bison 3.0 introduced %empty for this situation, which is self-documenting. Until we get there, this is a good step. -- John Naylor EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Prefer TG_TABLE_NAME over TG_RELNAME in tests
Thanks so much: the Bucardo PR has been merged in.
Re: ModifyTable overheads in generic plans
I'm still a bit confused and unhappy about the initialization of ResultRelInfos and the various fields in them. We've made progress in the previous patches, but it's still a bit messy. /* * If transition tuples will be captured, initialize a map to convert * child tuples into the format of the table mentioned in the query * (root relation), because the transition tuple store can only store * tuples in the root table format. However for INSERT, the map is * only initialized for a given partition when the partition itself is * first initialized by ExecFindPartition. Also, this map is also * needed if an UPDATE ends up having to move tuples across * partitions, because in that case the child tuple to be moved first * needs to be converted into the root table's format. In that case, * we use GetChildToRootMap() to either create one from scratch if * we didn't already create it here. * * Note: We cannot always initialize this map lazily, that is, use * GetChildToRootMap(), because AfterTriggerSaveEvent(), which needs * the map, doesn't have access to the "target" relation that is * needed to create the map. */ if (mtstate->mt_transition_capture && operation != CMD_INSERT) { Relationrelation = resultRelInfo->ri_RelationDesc; RelationtargetRel = mtstate->rootResultRelInfo->ri_RelationDesc; resultRelInfo->ri_ChildToRootMap = convert_tuples_by_name(RelationGetDescr(relation), RelationGetDescr(targetRel)); /* First time creating the map for this result relation. */ Assert(!resultRelInfo->ri_ChildToRootMapValid); resultRelInfo->ri_ChildToRootMapValid = true; } The comment explains that AfterTriggerSaveEvent() cannot use GetChildToRootMap(), because it doesn't have access to the root target relation. But there is a field for that in ResultRelInfo: ri_PartitionRoot. However, that's only set up when we do partition routing. How about we rename ri_PartitionRoot to e.g ri_RootTarget, and set it always, even for non-partition inheritance? We have that information available when we initialize the ResultRelInfo, so might as well. Some code currently checks ri_PartitionRoot, to determine if a tuple that's been inserted, has been routed. For example: /* * Also check the tuple against the partition constraint, if there is * one; except that if we got here via tuple-routing, we don't need to * if there's no BR trigger defined on the partition. */ if (resultRelationDesc->rd_rel->relispartition && (resultRelInfo->ri_PartitionRoot == NULL || (resultRelInfo->ri_TrigDesc && resultRelInfo->ri_TrigDesc->trig_insert_before_row))) ExecPartitionCheck(resultRelInfo, slot, estate, true); So if we set ri_PartitionRoot always, we would need some other way to determine if the tuple at hand has actually been routed or not. But wouldn't that be a good thing anyway? Isn't it possible that the same ResultRelInfo is sometimes used for routed tuples, and sometimes for tuples that have been inserted/updated "directly"? ExecLookupUpdateResultRelByOid() sets that field lazily, so I think it would be possible to get here with ri_PartitionRoot either set or not, depending on whether an earlier cross-partition update was routed to the table. The above check is just an optimization, to skip unnecessary ExecPartitionCheck() calls, but I think this snippet in ExecConstraints() needs to get this right: /* * If the tuple has been routed, it's been converted to the * partition's rowtype, which might differ from the root * table's. 
We must convert it back to the root table's * rowtype so that val_desc shown error message matches the * input tuple. */ if (resultRelInfo->ri_PartitionRoot) { AttrMap*map; rel = resultRelInfo->ri_PartitionRoot; tupdesc = RelationGetDescr(rel);
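To illustrate the consequence of always setting the field: the "did this tuple arrive via routing?" test would then need its own marker. With a hypothetical ri_TupleRouted flag (not an existing field), the optimization quoted above would read:

    if (resultRelationDesc->rd_rel->relispartition &&
        (!resultRelInfo->ri_TupleRouted ||
         (resultRelInfo->ri_TrigDesc &&
          resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
        ExecPartitionCheck(resultRelInfo, slot, estate, true);

and ExecConstraints() would consult the same flag to decide whether to convert the tuple back to the root rowtype for error reporting.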
Re: Clean up optional rules in grammar
On 2020-11-11 12:43, Vik Fearing wrote: No objections, but could we also take this opportunity to standardize the comment itself? Even in your patch there is a mix of spacing and casing. My preference is /* EMPTY */. That is, uppercase with spaces, but whatever gets chosen is fine with me. Looks like /*EMPTY*/ is the most common right now. I agree this should be straightened out.
Re: abstract Unix-domain sockets
On 2020-11-10 07:24, Michael Paquier wrote: Can you sketch how you would structure this? I realize it's not very elegant, but I couldn't come up with a better way that didn't involve having to duplicate some of the error messages into multiple branches. I think that I would use a StringInfo to build each sentence of the hint separately. The first sentence, "Is another postmaster already running on port %d?" is already known. Then the second sentence could be built depending on the two other conditions. I'm not sure concatenating sentences like that is okay for translatability. FWIW, I think that it is confusing to mention in the hint to remove a socket file that cannot be removed. Thinking about it further, I think the hint in the Unix-domain socket case is bogus. A socket in the file-system namespace never reports EADDRINUSE anyway, it just overwrites the file. For sockets in the abstract namespace, you can get this error, but of course there is no file to remove. Perhaps we should change the hint in both the Unix and the IP cases to: "Is another postmaster already running at this address?" (This also resolves the confusing reference to "port" in the Unix case.) Or we just drop the hint in the Unix case. The primary error message is clear enough. -- Peter Eisentraut 2ndQuadrant, an EDB company https://www.2ndquadrant.com/
RE: pgbench: option delaying queries till connections establishment?
Hello, I can remove the line, but I strongly believe that reporting performance figures if some client connection failed thus the bench could not run as prescribed is a bad behavior. The good news is that it is probably quite unlikely. So I'd prefer to keep it and possibly submit a patch to change the behavior. I agree such a situation is very bad, and I understood you have a plan to submit patches for fix it. If so leaving lines as a TODO is OK. Thanks. Should be this one: https://commitfest.postgresql.org/30/2624/ This discussion is still on-going, but I can see that the starting time may be delayed for looking up all pgbench-variables. Yep, that's it. (I think the status of this thread might be wrong. it should be 'Needs review,' but now 'Waiting on Author.') I changed it to "Needs review". This patch is mostly good and can change a review status soon, however, I think it should wait that related patch. Hmmm. Please discuss how to fix it with Tom, I would not have the presumption to pressure Tom's agenda in any way! and this will commit soon. and this will wait till its time comes. In the mean time, I think that you should put the patch status as you see fit, independently of the other patch: it seems unlikely that they would be committed together, and I'll have to merge the remaining one anyway. -- Fabien.
Re: making update/delete of inheritance trees scale better
On 29/10/2020 15:03, Amit Langote wrote: Rebased over the recent executor result relation related commits. ModifyTablePath didn't get the memo that a ModifyTable can only have one subpath after these patches. Attached patch, on top of your v5 patches, cleans that up. - Heikki diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index fffe5bb7f0e..715eaece37e 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -2623,7 +2623,7 @@ static ModifyTable * create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path) { ModifyTable *plan; - Path *subpath = (Path *) linitial(best_path->subpaths); + Path *subpath = best_path->subpath; Plan *subplan; /* Build the plan. */ diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index a1c015435f9..2c8f8b7fc0c 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -1825,8 +1825,7 @@ grouping_planner(PlannerInfo *root, double tuple_fraction) rootRelation, root->partColsUpdated, resultRelations, - list_make1(path), - list_make1(root), + path, updateTargetLists, withCheckOptionLists, returningLists, diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 5581cd30088..bf4d7fc348b 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -3521,8 +3521,7 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel, * 'partColsUpdated' is true if any partitioning columns are being updated, * either from the target relation or a descendent partitioned table. * 'resultRelations' is an integer list of actual RT indexes of target rel(s) - * 'subpaths' is a list of Path(s) producing source data (one per rel) - * 'subroots' is a list of PlannerInfo structs (one per rel) + * 'subpath' is a Path producing source data * 'updateTargetLists' is a list of UPDATE targetlists. * 'withCheckOptionLists' is a list of WCO lists (one per rel) * 'returningLists' is a list of RETURNING tlists (one per rel) @@ -3535,15 +3534,14 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel, CmdType operation, bool canSetTag, Index nominalRelation, Index rootRelation, bool partColsUpdated, - List *resultRelations, List *subpaths, - List *subroots, List *updateTargetLists, + List *resultRelations, Path *subpath, + List *updateTargetLists, List *withCheckOptionLists, List *returningLists, List *rowMarks, OnConflictExpr *onconflict, int epqParam) { ModifyTablePath *pathnode = makeNode(ModifyTablePath); double total_size; - ListCell *lc; Assert(withCheckOptionLists == NIL || list_length(resultRelations) == list_length(withCheckOptionLists)); @@ -3575,18 +3573,13 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel, pathnode->path.total_cost = 0; pathnode->path.rows = 0; total_size = 0; - foreach(lc, subpaths) - { - Path *subpath = (Path *) lfirst(lc); - if (lc == list_head(subpaths)) /* first node? 
*/ - pathnode->path.startup_cost = subpath->startup_cost; - pathnode->path.total_cost += subpath->total_cost; - if (returningLists != NIL) - { - pathnode->path.rows += subpath->rows; - total_size += subpath->pathtarget->width * subpath->rows; - } + pathnode->path.startup_cost = subpath->startup_cost; + pathnode->path.total_cost += subpath->total_cost; + if (returningLists != NIL) + { + pathnode->path.rows += subpath->rows; + total_size += subpath->pathtarget->width * subpath->rows; } /* @@ -3605,8 +3598,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel, pathnode->rootRelation = rootRelation; pathnode->partColsUpdated = partColsUpdated; pathnode->resultRelations = resultRelations; - pathnode->subpaths = subpaths; - pathnode->subroots = subroots; + pathnode->subpath = subpath; pathnode->updateTargetLists = updateTargetLists; pathnode->withCheckOptionLists = withCheckOptionLists; pathnode->returningLists = returningLists; diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h index 291c27ac503..6e8f2734f1d 100644 --- a/src/include/nodes/pathnodes.h +++ b/src/include/nodes/pathnodes.h @@ -1824,8 +1824,7 @@ typedef struct ModifyTablePath Index rootRelation; /* Root RT index, if target is partitioned */ bool partColsUpdated; /* some part key in hierarchy updated */ List *resultRelations; /* integer list of RT indexes */ - List *subpaths; /* Path(s) producing source data */ - List *subroots; /* per-target-table PlannerInfos */ + Path *subpath; /* Path producing source data */ List *updateTargetLists; /* per-target-table UPDATE tlists */ List *withCheckOptionLists; /* per-target-table WCO lists */ List *returningLists; /* per-target-table
Re: Clean up optional rules in grammar
On 11/11/20 10:12 AM, Peter Eisentraut wrote: > There are a number of rules like this in the grammar: > > opt_foo: FOO > | /*EMPTY*/ > ; > > And there are some like this: > > opt_foo: FOO {} > | /*EMPTY*/ {} > ; > > and some even like this: > > %type opt_foo > > opt_foo: FOO { $$ = NULL; } > | /*EMPTY*/ { $$ = NULL; } > ; > > (I mean here specifically those rules where FOO is a noise word and the > actions are the same in each branch.) > > It's obviously confusing to have multiple different styles to do the > same thing. And these extra rules (including the empty ones) also end > up in the output, so they create more work down the line. > > The attached patch cleans this up to make them all look like the first > style. No objections, but could we also take this opportunity to standardize the comment itself? Even in your patch there is a mix of spacing and casing. My preference is /* EMPTY */. That is, uppercase with spaces, but whatever gets chosen is fine with me. -- Vik Fearing
RE: pgbench: option delaying queries till connections establishment?
Dear Fabien,

> Usually I run many pgbench through scripts, so I'm probably not there to
> check a lone stderr failure at the beginning if performance figures are
> actually reported.
> I can remove the line, but I strongly believe that reporting performance
> figures if some client connection failed thus the bench could not run as
> prescribed is a bad behavior. The good news is that it is probably quite
> unlikely. So I'd prefer to keep it and possibly submit a patch to change
> the behavior.

I agree that such a situation is very bad, and I understand that you plan to
submit a patch to fix it. If so, leaving the lines in as a TODO is OK.

> Should be this one: https://commitfest.postgresql.org/30/2624/

That discussion is still ongoing, but I can see that the start of the
benchmark may be delayed by looking up all the pgbench variables. (I think
the status of that thread might be wrong: it should be 'Needs review', but
it is currently 'Waiting on Author'.)

This patch is mostly good and its review status could be changed soon, but I
think it should wait for that related patch. Please discuss with Tom how to
fix it, and then this can be committed.

Hayato Kuroda
FUJITSU LIMITED
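[To make the scripted-run concern concrete, here is a rough sketch of the kind of wrapper involved; the option values, database name, and file names are arbitrary examples, not taken from this thread. The point is that if pgbench exits non-zero when a client connection fails, a script can discard the run instead of relying on someone noticing a stray stderr line.]

    #!/bin/sh
    # One benchmark iteration; treat any pgbench failure (for example a
    # client that could not connect) as "no result" rather than trusting
    # the reported numbers.
    if ! pgbench -c 64 -j 8 -T 60 bench_db > run.log 2> run.err; then
        echo "pgbench reported an error; discarding this run" >&2
        exit 1
    fi
    # Only a successful run gets its throughput figures collected.
    grep '^tps' run.log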
Re: [HACKERS] logical decoding of two-phase transactions
Did some further tests on the problem I saw, and it does not have anything
to do with this patch. I picked code from the top of HEAD. If I have enough
changes in a transaction to initiate streaming, then it will also stream the
same changes again after a commit:

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 800) g(i);
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'two-phase-commit', '1', 'include-xids', '0',
    'skip-empty-xacts', '1', 'stream-changes', '1');
** see streamed output here **
END;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'two-phase-commit', '1', 'include-xids', '0',
    'skip-empty-xacts', '1', 'stream-changes', '1');
** see the same streamed output here **

I think this is because, since the transaction has not been committed,
SnapBuildCommitTxn is not called; that is what advances
builder->start_decoding_at, so later calls to pg_logical_slot_get_changes
start from the previous LSN. I did a quick test with pgoutput using pub/sub
and I don't see duplication of data there, but I haven't verified exactly
what happens.

regards,
Ajin Cherian
Fujitsu Australia
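[For reference, the pub/sub check mentioned above could look roughly like the following; the table definition, publication/subscription names, and connection string are assumptions made up for illustration, not the exact test that was run.]

    -- Publisher side: publish the table used in the test above.
    CREATE PUBLICATION stream_pub FOR TABLE stream_test;

    -- Subscriber side: matching table plus a subscription.
    CREATE TABLE stream_test (data text);
    CREATE SUBSCRIPTION stream_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION stream_pub;

    -- After the INSERT is committed on the publisher, duplicated decoding
    -- would show up as extra rows on the subscriber.
    SELECT count(*) FROM stream_test;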
Re: Clean up optional rules in grammar
On 11/11/2020 11:12, Peter Eisentraut wrote:
> The attached patch cleans this up to make them all look like the first
> style.

+1

- Heikki
Re: warn_unused_results
On 2020-11-10 04:34, Michael Paquier wrote:
>>> I am not sure about the addition of repalloc(), as it is quite obvious
>>> that one has to use its result. Lists are fine, these are proper to PG
>>> internals and beginners tend to be easily confused in the way to use
>>> them.
>>
>> realloc() is listed in the GCC documentation as the reason this option
>> exists, and glibc tags its realloc() with this attribute, so doing the
>> same for repalloc() seems sensible.
>
> Good point.

committed

-- 
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
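[As an illustration of why realloc-style functions deserve the attribute: a generic sketch using the raw GCC/Clang attribute; the function name grow_buffer is invented and this is not the declaration the committed patch actually touches.]

    #include <stdlib.h>
    #include <string.h>

    /*
     * A realloc-style API: the old pointer may have been freed or moved, so
     * ignoring the return value is almost always a bug.  Marking the
     * function warn_unused_result lets GCC/Clang flag such call sites.
     */
    __attribute__((warn_unused_result))
    static char *
    grow_buffer(char *buf, size_t newsize)
    {
        return realloc(buf, newsize);
    }

    int
    main(void)
    {
        char   *buf = malloc(16);
        char   *newbuf;

        if (buf == NULL)
            return 1;
        strcpy(buf, "hello");

        /*
         * Wrong: the result is ignored, so "buf" may now dangle.  With the
         * attribute in place the compiler warns here:
         *
         *     grow_buffer(buf, 1024);
         */

        /* Right: keep the (possibly relocated) pointer that is returned. */
        newbuf = grow_buffer(buf, 1024);
        if (newbuf == NULL)
        {
            free(buf);
            return 1;
        }
        buf = newbuf;

        free(buf);
        return 0;
    }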
Strange GiST logic leading to uninitialized memory access in pg_trgm gist code
(From a report by user "ftzdomino" on IRC of a segfault while loading data
with a pg_trgm gist index)

If gtrgm_picksplit is invoked on a vector of exactly 2 items (which I think
is rare, but it can happen if gistSplit recurses or I think in cases of
secondary splits), then it tries to access cache[2] without ever having
initialized it, causing hilarity to ensue.

What I don't entirely understand is why the code is insisting on treating
the last item as special: given N items, it tries to find seeds from the
first N-1 items only. This would make a vague sort of sense if the N'th item
was always the just-inserted one, but this doesn't appear to be always the
case (e.g. recursive calls in gistSplit, or secondary splits), and even if
it were it's not clear that it would be correct logic.

What gtrgm_picksplit currently does, as I read it, is:

  - take the first N-1 items (note that entryvec->n is N+1; it sets
    maxoff = entryvec->n - 2)
  - populate a cache of their signatures
  - find the two furthest apart as seeds
  - if we didn't choose two, then default to items 1 and 2 as seeds
    (note here that if N=2 then item 2 is not cached)
  - make datums from the cache entries of the two seeds
    (explodes here when N=2)
  - increase maxoff and construct the cache entry for item N
  - split all N items using the two seeds

Now the obvious simple fix is just to reorder those last two operations, and
the original reporter verified that doing so fixed their problem (patch
attached). But I'd really like to understand the logic here and whether
there is any reason to have this special treatment at all - why would it not
be better to just cache all N items upfront and consider them all as
potential seeds?

Another issue I don't understand yet is that even though this code is
largely unchanged since 8.x, the original reporter could not reproduce the
crash on any version before 13.0. Anyone have any ideas?

(If not, I'll commit and backpatch something like the attached patch at some
suitable time.)

-- 
Andrew (irc:RhodiumToad)

diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index 9937ef9253..c7f2873e13 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -847,15 +847,22 @@ gtrgm_picksplit(PG_FUNCTION_ARGS)
     v->spl_nleft = 0;
     v->spl_nright = 0;

+    /*
+     * We ignored the last entry in the list above, and did not populate its
+     * cache entry yet, but if we were called with exactly 2 items then
+     * seed_2 now points to it, so we must fill in the cache entry before
+     * trying to build datums from the seeds.
+     */
+    maxoff = OffsetNumberNext(maxoff);
+    fillcache(&cache[maxoff], GETENTRY(entryvec, maxoff),
+              &cache_sign[siglen * maxoff], siglen);
+
     /* form initial .. */
     datum_l = gtrgm_alloc(cache[seed_1].allistrue, siglen, cache[seed_1].sign);
     datum_r = gtrgm_alloc(cache[seed_2].allistrue, siglen, cache[seed_2].sign);
     union_l = GETSIGN(datum_l);
     union_r = GETSIGN(datum_r);
-    maxoff = OffsetNumberNext(maxoff);
-    fillcache(&cache[maxoff], GETENTRY(entryvec, maxoff),
-              &cache_sign[siglen * maxoff], siglen);

     /* sort before ... */
     costvector = (SPLITCOST *) palloc(sizeof(SPLITCOST) * maxoff);
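[For context, the kind of index involved is simply the following; table and column names are invented, and this is not a known reproducer for the crash, just the sort of setup whose bulk loading exercises gtrgm_picksplit.]

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE TABLE docs (body text);
    CREATE INDEX docs_body_trgm_idx ON docs USING gist (body gist_trgm_ops);

    -- Bulk inserts cause GiST page splits, which is where gtrgm_picksplit
    -- gets called.
    INSERT INTO docs SELECT md5(g::text) FROM generate_series(1, 100000) g;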