Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Just for information: in PG's current behavior, "\set AUTOCOMMIT OFF" implicitly opens a "BEGIN;" block. So "\set AUTOCOMMIT ON" has no effect once "\set AUTOCOMMIT OFF" has been issued, until "END;", "COMMIT;", or "ROLLBACK;". However, I think that if exiting the session releases the open transaction, then changing this session setting should also release it.

Thanks
Sridhar

On Mon, Aug 8, 2016 at 10:34 PM, Vik Fearing wrote:
> On 08/08/16 17:02, Robert Haas wrote:
> > On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed wrote:
> >> Thank you for inputs everyone.
> >>
> >> The opinions on this thread can be classified into following
> >> 1. Commit
> >> 2. Rollback
> >> 3. Error
> >> 4. Warning
> >>
> >> As per opinion upthread, issuing implicit commit immediately after switching
> >> autocommit to ON, can be unsafe if it was not desired. While I agree that
> >> its difficult to judge users intention here, but if we were to base it on
> >> some assumption, the closest would be implicit COMMIT in my opinion. There is
> >> higher likelihood of a user being happy with issuing a commit when setting
> >> autocommit ON than a transaction being rolled back. Also there are quite
> >> some interfaces which provide this.
> >>
> >> As mentioned upthread, issuing a warning on switching back to autocommit
> >> will not be effective inside a script. It won't allow subsequent commands to
> >> be committed as set autocommit to ON is not committed. Scripts will have to
> >> be rerun with changes which will impact user friendliness.
> >>
> >> While I agree that issuing an ERROR and rolling back the transaction ranks
> >> higher in safe behaviour, it is not as common (according to instances stated
> >> upthread) as immediately committing any open transaction when switching back
> >> to autocommit.
> >
> > I think I like the option of having psql issue an error.
> > On the server side, the transaction would still be open, but the user would
> > receive a psql error message and the autocommit setting would not be
> > changed. So the user could type COMMIT or ROLLBACK manually and then
> > retry changing the value of the setting.
>
> This is my preferred action.
>
> > Alternatively, I also think it would be sensible to issue an immediate
> > COMMIT when the autocommit setting is changed from off to on. That
> > was my first reaction.
>
> I don't care for this very much.
>
> > Aborting the server-side transaction - with or without notice -
> > doesn't seem very reasonable.
>
> Agreed.
>
> --
> Vik Fearing +33 6 46 75 15 36
> http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
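The error-on-switch behavior preferred above might look like this in outline (an illustrative Python sketch of the proposed state machine, not psql's actual C implementation; the class and method names are invented for the example):

```python
# Sketch of the proposed rule: switching AUTOCOMMIT back on while a
# transaction is open raises an error and leaves both the setting and
# the server-side transaction untouched, so the user can COMMIT or
# ROLLBACK manually and then retry.
class PsqlSession:
    def __init__(self):
        self.autocommit = True
        self.in_transaction = False  # models the server-side transaction state

    def execute(self, sql):
        # With autocommit off, the first statement implicitly opens a block.
        if not self.autocommit and not self.in_transaction:
            self.in_transaction = True  # implicit BEGIN
        if sql.strip().rstrip(";").upper() in ("COMMIT", "END", "ROLLBACK"):
            self.in_transaction = False

    def set_autocommit(self, value):
        if value and self.in_transaction:
            # Proposed behavior: refuse, keep the setting unchanged.
            raise RuntimeError(
                "cannot enable AUTOCOMMIT inside an open transaction; "
                "COMMIT or ROLLBACK first")
        self.autocommit = value
```

After a COMMIT or ROLLBACK, `set_autocommit(True)` then succeeds, matching the "retry changing the value of the setting" step described above.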
Re: [HACKERS] dsm_unpin_segment
On Tue, Aug 9, 2016 at 12:53 PM, Tom Lane wrote:
> The larger picture here is that Robert is exhibiting a touching but
> unfounded faith that extensions using this feature will contain zero bugs.
> IMO there needs to be some positive defense against mistakes in using the
> pin/unpin API. As things stand, multiple pin requests don't have any
> fatal consequences (especially not on non-Windows), so I have little
> confidence that it's not happening in the field. I have even less
> confidence that there wouldn't be too many unpin requests.

OK, here is a version that defends against invalid sequences of pin/unpin calls. I had to move dsm_impl_pin_segment into the block protected by DynamicSharedMemoryControlLock, so that it could come after the already-pinned check but before updating any state, since it makes a Windows syscall that can fail. That said, I've only tested on Unix and will need to ask someone to test on Windows.

> What exactly
> is an extension going to be doing to ensure that it doesn't do too many of
> one or the other?

An extension that manages segment lifetimes like this needs a carefully designed protocol to get it right, probably involving state in another shared memory area and some interlocking, not to mention a lot of thought about cleanup.

Here's one use case: I have a higher-level object, a multi-segment-backed shared memory allocator, which owns any number of segments that together form a shared memory area. The protocol is that the allocator always pins segments when it needs to create new ones, because they need to survive as long as the control segment, even though no one backend is guaranteed to have all of the auxiliary segments mapped in (since they're created and attached on demand). But when the control segment is detached by all backends and is due to be destroyed, then we need to unpin all the auxiliary segments so they can also be destroyed, and that can be done from an on_dsm_detach callback on the control segment.
So I'm riding on the coattails of the existing cleanup mechanism for the control segment, while making sure that the auxiliary segments get pinned and unpinned exactly once. I'll have more to say about that when I post that patch...

--
Thomas Munro
http://www.enterprisedb.com

dsm-unpin-segment-v2.patch
Description: Binary data
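The segment-lifetime protocol described above could be sketched roughly as follows (illustrative Python, not the actual DSM code; the `Segment` and `SharedArea` names are invented for the example): auxiliary segments are pinned when created, and an on-detach callback on the control segment unpins them exactly once when the last backend detaches.

```python
# Toy model of reference-counted segments with pinning and detach callbacks.
class Segment:
    def __init__(self, name):
        self.name = name
        self.refcnt = 0
        self.pinned = False
        self.destroyed = False
        self.on_detach = []          # callbacks run before the detach release

    def attach(self):
        self.refcnt += 1

    def pin(self):
        assert not self.pinned, "already pinned"
        self.pinned = True
        self.refcnt += 1             # extra reference, as dsm_pin_segment does

    def unpin(self):
        assert self.pinned, "not pinned"
        self.pinned = False
        self._release()

    def detach(self):
        for cb in self.on_detach:
            cb()
        self._release()

    def _release(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.destroyed = True

class SharedArea:
    def __init__(self):
        self.control = Segment("control")
        self.aux = []
        # When the last backend detaches the control segment, unpin every
        # auxiliary segment so it can be destroyed as well.
        self.control.on_detach.append(self._unpin_all)

    def add_aux_segment(self):
        seg = Segment("aux%d" % len(self.aux))
        seg.pin()                    # survives even with no backend attached
        self.aux.append(seg)
        return seg

    def _unpin_all(self):
        if self.control.refcnt == 1:  # this is the last detach in progress
            for seg in self.aux:
                seg.unpin()
```

Because `pin()` and `unpin()` assert on double calls, the "pinned and unpinned exactly once" invariant is checked mechanically, mirroring the defense added in the v2 patch.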
Re: [HACKERS] Wait events monitoring future development
From: pgsql-hackers-ow...@postgresql.org
> I used to think that this kind of feature should be enabled by default,
> because when I was working at my previous company, I had only a few features
> available to understand what was happening inside PostgreSQL by observing
> production databases. I needed those features enabled in the production
> databases when I was called in.
>
> However, now I have another opinion. When we release the next major version,
> say 10.0, with wait monitoring, many people will start their benchmark tests
> with a configuration using *the default values*, and if they see some
> performance decrease, for example around 10%, they will talk about it as a
> performance decrease in PostgreSQL 10.0. That means PostgreSQL will face a
> difficult reputation problem.
>
> So, I agree that the feature should be disabled by default for a while.

I understand your feeling well. This is a difficult decision. Let's hope for trivial overhead.

Regards
Takayuki Tsunakawa
Re: [HACKERS] Wait events monitoring future development
From: pgsql-hackers-ow...@postgresql.org
> If you want to know why people are against enabling this monitoring by
> default, above is the reason. What percentage of people do you think would
> be willing to take a 10% performance penalty for monitoring like this? I
> would bet very few, but the argument above doesn't seem to address the fact
> that it is a small percentage.
>
> In fact, the argument above goes even farther, saying that we should enable
> it all the time because people will be unwilling to enable it on their own.
> I have to question the value of the information if users are not willing
> to enable it. And the solution proposed is to force the 10% default overhead
> on everyone, whether they are currently doing debugging, whether they will
> ever do this level of debugging, because people will be too scared to enable
> it. (Yes, I think Oracle took this approach.)
>
> We can talk about this feature all we want, but if we are not willing to
> be realistic in how much performance penalty the _average_ user is willing
> to lose to have this monitoring, I fear we will make little progress on
> this feature.

OK, 10% was an overstatement. Anyway, as Amit said, we can discuss the default value based on the performance evaluation before release.

As another idea, we can stand on middle ground. Interestingly, MySQL also enables its event monitoring (Performance Schema) by default, but not all events are collected. I guess frequently encountered events are not collected by default to minimize the overhead.

http://dev.mysql.com/doc/refman/5.7/en/performance-schema-quick-start.html
--
Assuming that the Performance Schema is available, it is enabled by default. ...

[mysqld]
performance_schema=ON

...
Initially, not all instruments and consumers are enabled, so the performance schema does not collect all events. To turn all of these on and enable event timing, execute two statements (the row counts may differ depending on MySQL version):

mysql> UPDATE setup_instruments SET ENABLED = 'YES', TIMED = 'YES';
Query OK, 560 rows affected (0.04 sec)
mysql> UPDATE setup_consumers SET ENABLED = 'YES';
Query OK, 10 rows affected (0.00 sec)
--

BTW, I remember EnterpriseDB has a wait event monitoring feature. Is it disabled by default? What was the overhead?

Regards
Takayuki Tsunakawa
Re: [HACKERS] Declarative partitioning
On 2016/08/09 6:02, Robert Haas wrote:
> On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote wrote:
>>> +1, if we could do it. It will need a change in the way Amit's patch
>>> stores partitioning scheme in PartitionDesc.
>>
>> Okay, I will try to implement this in the next version of the patch.
>>
>> One thing that comes to mind is what if a user wants to apply hash
>> operator class equality to a list partitioned key by specifying a hash
>> operator class for the corresponding column. In that case, we would not
>> have the ordering procedure with a hash operator class, hence any
>> ordering based optimization becomes impossible to implement. The current
>> patch rejects a column for the partition key if its type does not have a
>> btree operator class for both range and list methods, so this issue
>> doesn't exist; however, it could be seen as a limitation.
>
> Yes, I think you should expect careful scrutiny of that issue. It
> seems clear to me that range partitioning requires a btree opclass,
> that hash partitioning requires a hash opclass, and that list
> partitioning requires at least one of those things. It would probably
> be reasonable to pick one or the other and insist that list
> partitioning always requires exactly that, but I can't see how it's
> reasonable to insist that you must have both types of opclass for any
> type of partitioning.

So because we intend to implement optimizations for list partition metadata that presuppose the existence of a corresponding btree operator class, we should just always require the user to specify one (or error out if the user doesn't specify one and a default doesn't exist). That way, we explicitly do not support specifying a hash equality operator for list partitioning.
>> So, we have 3 choices for the internal representation of list partitions:
>>
>> Choice 1 (the current approach): Load them in the same order as they are
>> found in the partition catalog:
>>
>> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
>> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>>
>> In this case, a mismatch on the first list would make the two tables
>> incompatibly partitioned, whereas they really aren't incompatible.
>
> Such a limitation seems clearly unacceptable. We absolutely must be
> able to match up compatible partitioning schemes without getting
> confused by little details like the order of the partitions.

Agreed. Will change my patch to adopt the method below.

>> Choice 2: Representation with 2 arrays:
>>
>> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
>> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>>
>> It still doesn't help the case of pairwise joins because it's hard to tell
>> which value belongs to which partition (the 2nd array carries the original
>> partition numbers). Although it might still work for tuple routing.
>
> It's very good for tuple routing. It can also be used to match up
> partitions for pairwise joins. Compare the first arrays. If they are
> unequal, stop. Else, compare the second arrays, incrementally
> building a mapping between them and returning false if the mapping
> turns out to be non-bijective. For example, in this case, we look at
> index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3.
> We look at index 2 and decide 2 -> 1. We look at index 4 and find
> that we already have a mapping for 2, but it's compatible because we
> need 2 -> 1 and that's what is already there. Similarly for the
> remaining entries. This is really a pretty easy loop to write and it
> should run very quickly.

I see; it does make sense to try to implement it this way.
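The matching loop described above could be sketched like this (illustrative Python, not part of the patch; `partitions_match` is an invented name):

```python
# Given the choice-2 representation (sorted array of bound values plus a
# parallel array of partition numbers), decide whether two tables' list
# partitioning schemes match up to a renumbering of partitions, by
# incrementally building a bijection between partition numbers.
def partitions_match(values1, parts1, values2, parts2):
    if values1 != values2:          # the bound sets must be identical
        return False
    fwd, rev = {}, {}               # mapping parts1 -> parts2 and its inverse
    for p1, p2 in zip(parts1, parts2):
        if fwd.setdefault(p1, p2) != p2:
            return False            # p1 is already mapped to a different p2
        if rev.setdefault(p2, p1) != p1:
            return False            # p2 is already the image of another p1
    return True
```

Running it on the example arrays above yields a match via the mapping 3 -> 2, 1 -> 3, 2 -> 1; any inconsistent pair makes it return False in a single pass.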
>> Choice 3: Order all lists' elements for each list individually and then
>> order the lists themselves on their first values:
>>
>> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
>> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>>
>> This representation makes pairing partitions for pairwise joining
>> convenient, but for tuple routing we still need to visit each partition in
>> the worst case.
>
> I think this is clearly not good enough for tuple routing. If the
> algorithm I proposed above turns out to be too slow for matching
> partitions, then we could keep both this representation and the
> previous one. We are not limited to just one. But I don't think
> that's likely to be the case.

I agree. Let's see how option 2 turns out.

> Also, note that all of this presupposes we're doing range
> partitioning, or perhaps list partitioning with a btree opclass. For
> partitioning based on a hash opclass, you'd organize the data based on
> the hash values rather than range comparisons.

Yes, the current patch does not implement hash partitioning, although I have to think about how to support the hash case when designing the internal data structures.

By the way, I am planning to start a new thread with the latest set of patches, which I will post in a day or two. I have tried to implement all the bug fixes and improvements
Re: [HACKERS] Wait events monitoring future development
2016-08-07 21:03 GMT+09:00 Ilya Kosmodemiansky :
> I've summarized the Wait events monitoring discussion at the Developer
> unconference in Ottawa this year on the wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)

Thanks for your effort to make us move forward.

> If you attended, feel free to point out if I missed something, and I will
> put it on the wiki too.
>
> Wait event monitoring looks once again stuck on the way through community
> approval in spite of the huge progress made last year in that direction. The
> importance of the topic is beyond discussion now; if you talk to any
> PostgreSQL person about implementing such a tool in Postgres and the person
> does not get excited, you are probably talking to a full-time PostgreSQL
> developer ;-) Obviously it needs a better design, both the user interface
> and the implementation, and perhaps this is why full-time developers are
> still sceptical.
>
> In order to move forward, imho we need at least the following steps, which
> can be done in parallel:
>
> 1. Further requirements need to be collected from DBAs.
>
> If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres, you are an ideal person to share your experience,
> but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing is needed, in different
> environments.
>
> According to the developers, the overhead is small, but many people have
> doubts that it can be much more significant for intensive workloads.
> Obviously, it is not an easy task to test, because you need to put
> doubtfully non-production-ready code into mission-critical production for
> such tests. As a result it will become clear whether this design should be
> abandoned and we need to think about less invasive solutions, or this
> design is acceptable.
>
> Any thoughts?

Seems a good starting point.
I'm interested in both, and I would like to contribute by running (or writing) several tests.

Regards,
--
Satoshi Nagayasu
Re: [HACKERS] Wait events monitoring future development
2016-08-09 11:49 GMT+09:00 Joshua D. Drake :
> On 08/08/2016 07:37 PM, Bruce Momjian wrote:
>> On Tue, Aug 9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
>>> I hope wait event monitoring will be on by default even if the overhead
>>> is not almost zero, because the data needs to be readily available for
>>> faster troubleshooting. IMO, the benefit would be worth even 10%
>>> overhead. If you disable it by default because of overhead, how can we
>>> convince users to enable it in production systems to solve some
>>> performance problem? I’m afraid severe users would say “we can’t change
>>> any setting that might cause more trouble, so investigate the cause with
>>> existing information.”
>>
>> If you want to know why people are against enabling this monitoring by
>> default, above is the reason. What percentage of people do you think
>> would be willing to take a 10% performance penalty for monitoring like
>> this? I would bet very few, but the argument above doesn't seem to
>> address the fact that it is a small percentage.
>
> I would argue it is zero. There are definitely users for this feature, but
> enabling it by default is looking for trouble. *MOST* users do not need
> this.

I used to think that this kind of feature should be enabled by default, because when I was working at my previous company, I had only a few features available to understand what was happening inside PostgreSQL by observing production databases. I needed those features enabled in the production databases when I was called in.

However, now I have another opinion. When we release the next major version, say 10.0, with wait monitoring, many people will start their benchmark tests with a configuration using *the default values*, and if they see some performance decrease, for example around 10%, they will talk about it as a performance decrease in PostgreSQL 10.0. That means PostgreSQL will face a difficult reputation problem.

So, I agree that the feature should be disabled by default for a while.

Regards,
--
Satoshi Nagayasu
Re: [HACKERS] Wait events monitoring future development
On 08/08/2016 07:37 PM, Bruce Momjian wrote:
> On Tue, Aug 9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
>> I hope wait event monitoring will be on by default even if the overhead
>> is not almost zero, because the data needs to be readily available for
>> faster troubleshooting. IMO, the benefit would be worth even 10% overhead.
>> If you disable it by default because of overhead, how can we convince
>> users to enable it in production systems to solve some performance
>> problem? I’m afraid severe users would say “we can’t change any setting
>> that might cause more trouble, so investigate the cause with existing
>> information.”
>
> If you want to know why people are against enabling this monitoring by
> default, above is the reason. What percentage of people do you think
> would be willing to take a 10% performance penalty for monitoring like
> this? I would bet very few, but the argument above doesn't seem to
> address the fact that it is a small percentage.

I would argue it is zero. There are definitely users for this feature, but enabling it by default is looking for trouble. *MOST* users do not need this.

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.
Re: [HACKERS] Wait events monitoring future development
On Tue, Aug 9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
> I hope wait event monitoring will be on by default even if the overhead is
> not almost zero, because the data needs to be readily available for faster
> troubleshooting. IMO, the benefit would be worth even 10% overhead. If you
> disable it by default because of overhead, how can we convince users to
> enable it in production systems to solve some performance problem? I’m
> afraid severe users would say “we can’t change any setting that might cause
> more trouble, so investigate the cause with existing information.”

If you want to know why people are against enabling this monitoring by default, above is the reason. What percentage of people do you think would be willing to take a 10% performance penalty for monitoring like this? I would bet very few, but the argument above doesn't seem to address the fact that it is a small percentage.

In fact, the argument above goes even farther, saying that we should enable it all the time because people will be unwilling to enable it on their own. I have to question the value of the information if users are not willing to enable it. And the solution proposed is to force the 10% default overhead on everyone, whether they are currently doing debugging, whether they will ever do this level of debugging, because people will be too scared to enable it. (Yes, I think Oracle took this approach.)

We can talk about this feature all we want, but if we are not willing to be realistic in how much performance penalty the _average_ user is willing to lose to have this monitoring, I fear we will make little progress on this feature.

--
Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Re: [HACKERS] Wait events monitoring future development
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Ilya Kosmodemiansky
> I've summarized Wait events monitoring discussion at Developer unconference
> in Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring

I hope wait event monitoring will be on by default even if the overhead is not almost zero, because the data needs to be readily available for faster troubleshooting. IMO, the benefit would be worth even 10% overhead. If you disable it by default because of overhead, how can we convince users to enable it in production systems to solve some performance problem? I’m afraid severe users would say “we can’t change any setting that might cause more trouble, so investigate the cause with existing information.”

We should positively consider the performance with wait event monitoring on as the new normal. Then we should develop more features that leverage the wait event data, which will make that data all the more crucial. The manual can explain to users that wait event monitoring can be turned off for maximal performance, but that this is not recommended.

BTW, taking advantage of this chance, why don’t we enrich the content on performance tuning in the manual? At least it needs to explain how to analyze the wait event data and tune the system.

Performance Tips
https://www.postgresql.org/docs/devel/static/performance-tips.html

Regards
Takayuki Tsunakawa
Re: [HACKERS] dsm_unpin_segment
Thomas Munro writes:
> Yeah, I was considering unbalanced pin/unpin requests to be a
> programming error. To be more defensive about that, how about I add a
> boolean 'pinned' to dsm_control_item, and elog(ERROR, ...) if it's not
> in the expected state when you try to pin or unpin?

Well, what you have there is a one-bit-wide pin request counter. I do not see why that's better than an actual counter, but if that's what you want to do, fine.

The larger picture here is that Robert is exhibiting a touching but unfounded faith that extensions using this feature will contain zero bugs. IMO there needs to be some positive defense against mistakes in using the pin/unpin API. As things stand, multiple pin requests don't have any fatal consequences (especially not on non-Windows), so I have little confidence that it's not happening in the field. I have even less confidence that there wouldn't be too many unpin requests. What exactly is an extension going to be doing to ensure that it doesn't do too many of one or the other?

regards, tom lane
Re: [HACKERS] dsm_unpin_segment
On Tue, Aug 9, 2016 at 12:22 PM, Robert Haas wrote:
> On Mon, Aug 8, 2016 at 7:33 PM, Tom Lane wrote:
>> Thomas Munro writes:
>>> Please find attached a patch to add a corresponding operation
>>> 'dsm_unpin_segment'. This gives you a way to ask for the segment to
>>> survive only until you decide to unpin it, at which point the usual
>>> reference counting semantics apply again. It decrements the reference
>>> count, undoing the effect of dsm_pin_segment and destroying the
>>> segment if appropriate.
>>
>> What happens if dsm_unpin_segment is called more times than
>> dsm_pin_segment? Seems like you could try to destroy a segment that
>> still has processes attached.
>
> Calling dsm_pin_segment more than once is not supported and has never
> been supported. As the comments explain:
>
> * This function should not be called more than once per segment;
> * on Windows, doing so will create unnecessary handles which will
> * consume system resources to no benefit.
>
> Therefore, I don't see the problem. You can pin a segment that is not
> pinned, and you can unpin a segment that is pinned. You may not
> re-pin a segment that is already pinned, nor unpin a segment that is
> not pinned. If you try to do so, you are using the API contrary to
> specification, and if it breaks (as it will) you get to keep both
> pieces.
>
> We could add the reference counting behavior for which you are asking,
> but that seems to be an entirely new feature for which I know of no
> demand.

Yeah, I was considering unbalanced pin/unpin requests to be a programming error. To be more defensive about that, how about I add a boolean 'pinned' to dsm_control_item, and elog(ERROR, ...) if it's not in the expected state when you try to pin or unpin?

--
Thomas Munro
http://www.enterprisedb.com
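The boolean-flag defense proposed here might look like this in outline (an illustrative Python sketch, not the actual patch; a real implementation would do this in C under DynamicSharedMemoryControlLock and report via elog):

```python
# Sketch: track a per-segment 'pinned' flag alongside the reference
# count, and report an error on a double pin or an unmatched unpin.
class DsmControlItem:
    def __init__(self):
        self.refcnt = 1          # segment exists
        self.pinned = False

def dsm_pin_segment(item):
    if item.pinned:
        raise RuntimeError("cannot pin a segment that is already pinned")
    item.pinned = True
    item.refcnt += 1             # the extra reference that pinning holds

def dsm_unpin_segment(item):
    if not item.pinned:
        raise RuntimeError("cannot unpin a segment that is not pinned")
    item.pinned = False
    item.refcnt -= 1             # drop the pin's reference
```

As Tom notes downthread, this is effectively a one-bit-wide pin request counter: it catches unbalanced calls, but cannot support multiple independent pinners.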
Re: [HACKERS] dsm_unpin_segment
On Mon, Aug 8, 2016 at 7:33 PM, Tom Lane wrote:
> Thomas Munro writes:
>> DSM segments have a concept of 'pinning'. Normally, segments are
>> destroyed when they are no longer mapped by any backend, using a
>> reference counting scheme. If you call dsm_pin_segment(segment), that
>> is disabled so that the segment won't be destroyed until the cluster
>> is shut down. It works by incrementing the reference count an extra
>> time.
>>
>> Please find attached a patch to add a corresponding operation
>> 'dsm_unpin_segment'. This gives you a way to ask for the segment to
>> survive only until you decide to unpin it, at which point the usual
>> reference counting semantics apply again. It decrements the reference
>> count, undoing the effect of dsm_pin_segment and destroying the
>> segment if appropriate.
>
> What happens if dsm_unpin_segment is called more times than
> dsm_pin_segment? Seems like you could try to destroy a segment that
> still has processes attached.

Calling dsm_pin_segment more than once is not supported and has never been supported. As the comments explain:

 * This function should not be called more than once per segment;
 * on Windows, doing so will create unnecessary handles which will
 * consume system resources to no benefit.

Therefore, I don't see the problem. You can pin a segment that is not pinned, and you can unpin a segment that is pinned. You may not re-pin a segment that is already pinned, nor unpin a segment that is not pinned. If you try to do so, you are using the API contrary to specification, and if it breaks (as it will) you get to keep both pieces.

We could add the reference counting behavior for which you are asking, but that seems to be an entirely new feature for which I know of no demand.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] dsm_unpin_segment
Thomas Munro writes:
> DSM segments have a concept of 'pinning'. Normally, segments are
> destroyed when they are no longer mapped by any backend, using a
> reference counting scheme. If you call dsm_pin_segment(segment), that
> is disabled so that the segment won't be destroyed until the cluster
> is shut down. It works by incrementing the reference count an extra
> time.
>
> Please find attached a patch to add a corresponding operation
> 'dsm_unpin_segment'. This gives you a way to ask for the segment to
> survive only until you decide to unpin it, at which point the usual
> reference counting semantics apply again. It decrements the reference
> count, undoing the effect of dsm_pin_segment and destroying the
> segment if appropriate.

What happens if dsm_unpin_segment is called more times than dsm_pin_segment? Seems like you could try to destroy a segment that still has processes attached.

I don't object to the concept, but you need a less half-baked implementation if you want to add this. I'd suggest separate counters for process attaches and pin requests, with code in dsm_unpin_segment to disallow decrementing the pin request count below zero, and segment destruction only when both counters go to zero.

regards, tom lane
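The dual-counter scheme Tom suggests could be sketched as follows (illustrative Python, not PostgreSQL source; `SegmentState` is an invented name):

```python
# Sketch: separate counters for process attaches and pin requests.
# Unpinning below zero is rejected, and the segment is destroyed only
# when both counters reach zero, so an extra unpin can never destroy a
# segment that still has processes attached.
class SegmentState:
    def __init__(self):
        self.attach_count = 0
        self.pin_count = 0
        self.destroyed = False

    def attach(self):
        self.attach_count += 1

    def detach(self):
        self.attach_count -= 1
        self._maybe_destroy()

    def pin(self):
        self.pin_count += 1

    def unpin(self):
        if self.pin_count == 0:
            raise RuntimeError("unpin without a matching pin")
        self.pin_count -= 1
        self._maybe_destroy()

    def _maybe_destroy(self):
        if self.attach_count == 0 and self.pin_count == 0:
            self.destroyed = True
```

Unlike the single boolean flag, this also supports multiple independent pinners, at the cost of a second counter in shared state.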
Re: [HACKERS] Slowness of extended protocol
>> On the other hand, usage of some well-defined statement name to trigger
>> the special case would be fine: all pgbouncer versions would pass those
>> parse/bind/exec messages on as if they were regular messages.
>
> I do not accept this idea that retroactively defining special semantics
> for certain statement names is not a protocol break. If it causes any
> change in what the server's response would be, then it is a protocol
> break.

+1. It definitely is a protocol break.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
[HACKERS] dsm_unpin_segment
Hi hackers,

DSM segments have a concept of 'pinning'. Normally, segments are destroyed when they are no longer mapped by any backend, using a reference counting scheme. If you call dsm_pin_segment(segment), that is disabled so that the segment won't be destroyed until the cluster is shut down. It works by incrementing the reference count an extra time.

Please find attached a patch to add a corresponding operation 'dsm_unpin_segment'. This gives you a way to ask for the segment to survive only until you decide to unpin it, at which point the usual reference counting semantics apply again. It decrements the reference count, undoing the effect of dsm_pin_segment and destroying the segment if appropriate.

I think this is very useful for any core or extension code that wants to store data in dynamic shared memory that survives even when no backends are running, without having to commit to keeping the segment forever. We have several projects in the 10.x pipeline which make use of DSM segments, and we ran into the need for this finer-grained control of their lifetime.

Thanks to my colleague Amit Kapila for the Windows part, and also to Robert Haas for internal feedback on an earlier version. The Windows implementation has an extra quirk: Windows has its own reference counting scheme that destroys mapped memory when there are no attached processes, so pinning is implemented by sending a duplicate of the handle to the postmaster process to keep the segment alive. This patch needs to close that handle when unpinning. Amazingly, that can be done without any cooperation from the postmaster.

I'd be grateful for any feedback or thoughts, and will add this to the commitfest.

--
Thomas Munro
http://www.enterprisedb.com

dsm-unpin-segment.patch
Description: Binary data
Re: [HACKERS] Wait events monitoring future development
On Mon, Aug 8, 2016 at 11:47:11PM +0200, Ilya Kosmodemiansky wrote: > On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian wrote: > > It seems asking users to run pg_test_timing before deploying to check > > the overhead would be sufficient. > > I'm not sure. Time measurement for waits is slightly more complicated > than a time measurement for explain analyze: a good workload plus > using gettimeofday in a straightforward manner can cause huge > overhead. That's why proper testing is important - if we can see a > significant performance drop if we have for example large > shared_buffers with the same concurrency, that shows gettimeofday is > too expensive to use. Am I correct that we do not have such accurate > tests now? Well, if we find that pg_test_timing is insufficient, we can perhaps add a parallel test option to that utility. > Another concern of mine is that it is a bad idea to release a feature > which allegedly has huge performance impact even if it is not turned > on by default. I often meet people who do not use exceptions in > plpgsql because of the tip "A block containing an EXCEPTION clause is > significantly more expensive to enter ..." in the PostgreSQL documentation. Well, if we document that it can be slow, it is up to the user to decide if they want to use it. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] Wait events monitoring future development
On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian wrote: > It seems asking users to run pg_test_timing before deploying to check > the overhead would be sufficient. I'm not sure. Time measurement for waits is slightly more complicated than a time measurement for explain analyze: a good workload plus using gettimeofday in a straightforward manner can cause huge overhead. That's why proper testing is important - if we can see a significant performance drop if we have for example large shared_buffers with the same concurrency, that shows gettimeofday is too expensive to use. Am I correct that we do not have such accurate tests now? Another concern of mine is that it is a bad idea to release a feature which allegedly has huge performance impact even if it is not turned on by default. I often meet people who do not use exceptions in plpgsql because of the tip "A block containing an EXCEPTION clause is significantly more expensive to enter ..." in the PostgreSQL documentation. -- Ilya Kosmodemiansky, PostgreSQL-Consulting.com tel. +14084142500 cell. +4915144336040 i...@postgresql-consulting.com
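The kind of check discussed above — is reading the clock cheap enough to wrap around every wait? — is what pg_test_timing estimates. A rough analogue can be sketched in a few lines (illustrative Python using time.perf_counter_ns; the real tool is a C utility measuring the platform clock source):

```python
# Measure the average cost of one clock read by timing a tight loop of
# clock reads. If this per-call overhead is large relative to typical
# wait durations, timing every wait event will distort the workload.
import time

def timing_overhead_ns(loops=200_000):
    start = time.perf_counter_ns()
    for _ in range(loops):
        time.perf_counter_ns()   # the call whose cost we are measuring
    elapsed = time.perf_counter_ns() - start
    return elapsed / loops       # average nanoseconds per clock read
```

As noted in the thread, this measures only raw timer overhead; it says nothing about contention effects under a concurrent workload, which is why Ilya argues for workload-level testing as well.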
Re: [HACKERS] Declarative partitioning
On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote wrote: >> +1, if we could do it. It will need a change in the way Amit's patch stores >> partitioning scheme in PartitionDesc. > > Okay, I will try to implement this in the next version of the patch. > > One thing that comes to mind is what if a user wants to apply hash > operator class equality to list partitioned key by specifying a hash > operator class for the corresponding column. In that case, we would not > have the ordering procedure with a hash operator class, hence any > ordering based optimization becomes impossible to implement. The current > patch rejects a column for partition key if its type does not have a btree > operator class for both range and list methods, so this issue doesn't > exist, however it could be seen as a limitation. Yes, I think you should expect careful scrutiny of that issue. It seems clear to me that range partitioning requires a btree opclass, that hash partitioning requires a hash opclass, and that list partitioning requires at least one of those things. It would probably be reasonable to pick one or the other and insist that list partitioning always requires exactly that, but I can't see how it's reasonable to insist that you must have both types of opclass for any type of partitioning. >> This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', >> 'a')} and {('h', 'd','m') , ('e', 'i', 'f'), and ('l', 'b', 'a')} which are >> essentially the same, are represented in different ways. It makes matching >> partitions for partition-wise join a bit tedious. We have to make sure that >> the first array matches for both the joining relations and then make sure >> that all the values belonging to the same partition for one table also >> belong to the same partition in the other table. Some more complex logic >> for matching subsets of lists for partition-wise join.
> > So, we have 3 choices for the internal representation of list partitions: > > Choice 1 (the current approach): Load them in the same order as they are > found in the partition catalog: > > Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} > Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} > > In this case, mismatch on the first list would make the two tables > incompatibly partitioned, whereas they really aren't incompatible. Such a limitation seems clearly unacceptable. We absolutely must be able to match up compatible partitioning schemes without getting confused by little details like the order of the partitions. > Choice 2: Representation with 2 arrays: > > Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1] > Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3] > > It still doesn't help the case of pairwise joins because it's hard to tell > which value belongs to which partition (the 2nd array carries the original > partition numbers). Although it might still work for tuple-routing. It's very good for tuple routing. It can also be used to match up partitions for pairwise joins. Compare the first arrays. If they are unequal, stop. Else, compare the second arrays, incrementally building a mapping between them and returning false if the mapping turns out to be non-bijective. For example, in this case, we look at index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3. We look at index 2 and decide 2 -> 1. We look at index 3 and find that we already have a mapping for 2, but it's compatible because we need 2 -> 1 and that's what is already there. Similarly for the remaining entries. This is really a pretty easy loop to write and it should run very quickly.
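The matching loop described above can be sketched directly (illustrative Python, using the two-array representation from Choice 2: a sorted value array plus a parallel array of partition numbers):

```python
# Decide whether two list-partitioned tables are compatibly partitioned:
# the sorted value arrays must be identical, and the partition-number
# arrays must be related by a bijection, built incrementally.

def partitions_compatible(values1, parts1, values2, parts2):
    if values1 != values2:                  # first arrays must match exactly
        return False
    fwd, rev = {}, {}
    for p1, p2 in zip(parts1, parts2):
        if fwd.setdefault(p1, p2) != p2:    # p1 already mapped to another p2
            return False
        if rev.setdefault(p2, p1) != p1:    # mapping must be one-to-one
            return False
    return True
```

On the example in the message, the pairs (3,2), (1,3), (2,1) establish the mapping and every later pair is consistent with it, so the tables match; flipping any one entry makes the mapping non-bijective and the function returns False.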
> Choice 3: Order all lists' elements for each list individually and then > order the lists themselves on their first values: > > Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} > Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} > > This representation makes pairing partitions for pairwise joining > convenient but for tuple-routing we still need to visit each partition in > the worst case. I think this is clearly not good enough for tuple routing. If the algorithm I proposed above turns out to be too slow for matching partitions, then we could keep both this representation and the previous one. We are not limited to just one. But I don't think that's likely to be the case. Also, note that all of this presupposes we're doing range partitioning, or perhaps list partitioning with a btree opclass. For partitioning based on a hash opclass, you'd organize the data based on the hash values rather than range comparisons. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Heap WARM Tuples - Design Draft
On Mon, Aug 8, 2016 at 5:24 PM, Bruce Momjian wrote: > On Mon, Aug 8, 2016 at 03:36:12PM -0300, Claudio Freire wrote: >> I think I prefer a more thorough approach. >> >> Increment/decrement may get very complicated with custom opclasses, >> for instance. A column-change bitmap won't know how to handle >> functional indexes, etc. > > How does HOT handle them? If it does, it has to look at all the columns > passed to the expression index, so it seems to be the same amount of > work. The way HOT handles it IIRC induces false positives (as in: HOT is forbidden when it might be ok) because it may flag as changed expressions that don't change. Think date_trunc('day', timestamp): if the timestamp changes within a day, there's no change in reality, but the column changed, so HOT is forbidden. But the logic won't create false negatives (allow HOT when it's not allowed). But for WARM that might be the case, because the guarantee that you need is that no key already present in the WARM chain will be re-added. An index over (a - b) could have both a and b in increasing order yet repeat keys and violate the WARM chain invariant. >> What I intend to try, is modify btree to allow efficient search of a >> key-ctid pair, by adding the ctid to the sort order. Only inner pages >> need to be changed, to include the ctid in the pointers, leaf pages >> already have the ctid there, so they don't need any kind of change. A >> bit in the metapage could indicate whether it's a new format or an old >> one, and yes, only new indices will be able to use WARM. >> >> But with efficient key-ctid searches, you handle all cases, and not >> just a few common ones. > > True. I am worried about page spills caused by having to insert a row > into an existing page and an index entry having to be pushed to an > adjacent page to maintain ctid index order. I don't think it would be a concern, just as indexing a serial column isn't.
Maybe some pages will split that wouldn't have, but the split will add room to the new leaf pages, and some splits that would have happened before won't happen afterwards. Think of it as equivalent to adding the oid to the index - it's some immutable attribute of the row being inserted; the fact that it is a tid shouldn't make a difference to the performance of the btree, aside from the extra comparisons perhaps.
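The effect of Claudio's proposal — adding the ctid as a trailing sort key — can be illustrated in miniature (illustrative Python; a real btree compares ItemPointers inside index tuples, not tuples in a list). Once duplicate keys become distinct (key, ctid) pairs, a specific index entry can be found by ordinary binary search instead of scanning all duplicates of the key:

```python
# Model an index leaf as a sorted list of (key, ctid) pairs, where the
# ctid is a (block, offset) pair and acts as a tie-breaker for equal keys.
import bisect

entries = sorted([
    (42, (3, 7)),
    (42, (1, 2)),
    (42, (2, 5)),
    (17, (9, 1)),
])

def find_entry(entries, key, ctid):
    # Binary search for the exact (key, ctid) pair; returns its position,
    # or None if that particular heap tuple has no entry under this key.
    i = bisect.bisect_left(entries, (key, ctid))
    return i if i < len(entries) and entries[i] == (key, ctid) else None
```

All three entries for key 42 sort into a deterministic order by ctid, which is what makes the efficient key-ctid search (and hence WARM's duplicate-entry check) possible.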
Re: [HACKERS] Heap WARM Tuples - Design Draft
On Mon, Aug 8, 2016 at 03:36:12PM -0300, Claudio Freire wrote: > I think I prefer a more thorough approach. > > Increment/decrement may get very complicated with custom opclasses, > for instance. A column-change bitmap won't know how to handle > functional indexes, etc. How does HOT handle them? If it does, it has to look at all the columns passed to the expression index, so it seems to be the same amount of work. > What I intend to try, is modify btree to allow efficient search of a > key-ctid pair, by adding the ctid to the sort order. Only inner pages > need to be changed, to include the ctid in the pointers, leaf pages > already have the ctid there, so they don't need any kind of change. A > bit in the metapage could indicate whether it's a new format or an old > one, and yes, only new indices will be able to use WARM. > > But with efficient key-ctid searches, you handle all cases, and not > just a few common ones. True. I am worried about page spills caused by having to insert a row into an existing page and an index entry having to be pushed to an adjacent page to maintain ctid index order. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] Heap WARM Tuples - Design Draft
On Mon, Aug 8, 2016 at 11:22:49PM +0530, Pavan Deolasee wrote: > What I am currently trying to do is to reuse at least the BlockNumber field in > t_ctid. For HOT/WARM chains, that field is really unused (except the last > tuple > when regular update needs to store block number of the new block). My idea is > to use one free bit in t_infomask2 to tell us that t_ctid is really not a > CTID, > but contains new information (for pg_upgrade's sake). For example, one bit in > bi_hi can tell us that this is the last tuple in the chain (information today > conveyed by t_ctid pointing to self). Another bit can tell us that this tuple > was WARM updated. We will still have plenty of bits to store additional > information about WARM chains. Yes, that works too, though it probably only makes sense if we can make use of more than one bit in bi_hi. > My guess is we would need one bit to mark a WARM chain, and perhaps > reuse obsolete pre-9.0 HEAP_MOVED_OFF to indicate increment-only or > decrement-only. > > > I am not convinced that the checking for increment/decrement adds a lot of > value. Sure, we might be able to address some typical work load, but is that > really a common use case? Instead, what I am looking at storing a bitmap which I have no idea, but it guarantees that the first WARM update works because there is no direction set. Then, if the direction changes, you create a new chain and hope the changes stay in that direction for a while. > shows us which table columns have changed so far in the WARM chain. We only > have limited bits, so we can track only limited columns. This will help the > cases where different columns are updated, but not so much if the same column > is updated repeatedly. Well, I don't think in 15 bits we have enough space to store many column numbers, let alone column numbers and _values_. You would need four bits (1-16) to exceed what you can store in a simple bitmap of the first 15 columns. 
If you want to extend that range, you can use 8 bits (2^8) to record one of the first 256 column numbers. You could do 7 bits (2^7 = 128) and use another bit per column to record the increment/decrement direction, meaning that repeated changes to the same column in the same direction would be allowed in the same WARM chain. I think it is more likely that the same column is going to be changed in the same WARM chain, than changes in different columns. Frankly, with only 16 bits, I can't see how recording specific columns really buys us much because we have to limit the column number storage. Plus, if a column changes twice, you need to create a new WARM chain unless you record the increment/decrement direction. What we could do is to record the first two changed columns in the 16-bit field, with direction, then record a bit for direction of all columns not in the first two that change. That allows you to record three sets of directions in the same HOT chain. It does not allow you to change the direction of any column previously recorded in the WARM chain. You could say you are going to scan the WARM chain for changes, but that limits pruning. You could try storing just the changes for pruned rows, but then you are going to have a lot of overhead scanning the WARM chain looking for changes. It would be interesting to store the change _direction_ for the first 15 columns in the bitmap, and then use the 16th bit for the rest of the columns, but I can't figure out how to record which bits are set with a direction and which are the default/unused. You really need two bits per column, so that only records the first seven or eight columns. > What will help, and something I haven't yet applied any thoughts, is when we > can turn WARM chains back to HOT by removing stale index entries. I can't see how we can ever do that because we have multiple indexes pointing to the chain, and keys that might be duplicated if we switched to HOT. Seems only VACUUM can fix that. 
> Some heuristics and limits on amount of work done to detect duplicate index > entries will help too. Yeah, I have kind of given up on that. > > We can't use the bits LP_REDIRECT lp_len because we need to create WARM > > chains before pruning, and I don't think walking the pre-pruned chain is > > worth it. (As I understand HOT, LP_REDIRECT is only created during > > pruning.) > > That's correct. But lp_len provides us some place to stash information from > heap tuples when they are pruned. Right. However, I see storing information at prune time as only useful if you are willing to scan the chain, and frankly, I have given up on chain scanning (with column comparisons) as being too expensive for its limited value. What we could do is to use the LP_REDIRECT lp_len to store information of two more changed column numbers, with directions. However, once you store one bit that records the direction of all other columns, you can't go back and record the changes, unless you do a chain scan at prune time. Yo
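One concrete encoding of the 16-bit bitmap weighed above can be sketched as follows (illustrative Python; which encoding wins, and whether direction bits are folded in, is exactly what the thread is debating): bits 0-14 track the first 15 attributes individually, and bit 15 is an overflow flag meaning "some column beyond the 15th changed". The bitmap may over-report changes but must never under-report them.

```python
# 16-bit changed-column bitmap: one bit per attribute for attributes
# 1..15, plus a single overflow bit shared by all higher attributes.
OVERFLOW_BIT = 1 << 15

def mark_changed(bitmap, attno):            # attno is 1-based
    if attno <= 15:
        return bitmap | (1 << (attno - 1))
    return bitmap | OVERFLOW_BIT

def definitely_unchanged(bitmap, attno):
    # False answers are conservative: a set overflow bit makes every
    # high-numbered attribute look changed, which is safe but lossy.
    if attno <= 15:
        return not bitmap & (1 << (attno - 1))
    return not bitmap & OVERFLOW_BIT
```

This makes Bruce's trade-off concrete: precise per-column tracking is limited to however many bits remain after any direction information is stored, and everything past that collapses into one conservative overflow bit.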
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
On Sun, Aug 7, 2016 at 11:53 PM, Alvaro Herrera wrote: >> Building on the has-property approach Andrew suggested, I wonder if >> we need something like pg_index_column_has_property(indexoid, colno, >> propertyname) with properties like "sortable", "desc", "nulls first". > > This seems simple enough, on the surface. Why not run with this design? Andrew's patch, plus this, covers everything I can think of. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Parallel tuplesort, partitioning, merging, and the future
Over on the "Parallel tuplesort (for parallel B-Tree index creation)" thread [1], there has been some discussion of merging vs. partitioning. There is a concern about the fact that the merge of the tuplesort used to build a B-Tree is not itself parallelized. There is a weak consensus that we'd be better off with partitioning rather than merging anyway. I've started this new thread to discuss where we want to be with partitioning (and further parallel merging). In short: How can we maximize the benefit of parallel tuplesort? I've said or at least implied that I favor keeping the CREATE INDEX merge serial, while pursuing partitioning (for other uses) at a later date. Further parallelization is useful for parallel query much more so than parallel CREATE INDEX. To summarize, I favor not parallelizing merging because: * The scalability of parallel CREATE INDEX as implemented is apparently comparable to the scalability of similar features in other systems [2] (systems with very mature implementations). ISTM that we cannot really expect to do too much better than that. * We already do some merging in parallel to get runs for the leader to merge in serial, if and only if workers each produce more than one initial run (not unlikely). That's still pretty effective, and ensures that the final leader merge is cache efficient. I think that I'll probably end up changing the patch to share maintenance_work_mem among workers (not have it be per-worker), making it common for workers to merge, while the leader merge ends up needing fairly few rounds of preloading to merge the final output. * Even if it helps a bit, parallel merging is not the future; partitioning is (more on this later). I don't think partitioning is urgent for CREATE INDEX, and may be inappropriate for CREATE INDEX under any circumstances, because: * Possible problems with parallel infrastructure and writes. Can the parallel infrastructure even be made to write and WAL log an index in parallel, too?
Partitioning more or less forces this, at least if you expect any benefit at all -- reading the concatenated output files in a serial pass where the index is written seems likely to hurt more than it helps. This is not much of an advantage when you're likely to be I/O bound anyway, and when the tuple startup cost *per worker* is not of particular concern. * Unbalanced B-Trees (or the risk thereof). This is the really big one for me. We can only write leaf pages out in parallel; we have to "merge together sub B-Trees" to produce internal pages that make up a finished B-Tree. That risks creating a set of internal pages that reflect a skew that any partition-to-sort approach happened to have, due to whatever problem might emerge in the choice of separators in tuplesort.c (this is a documented issue with SQL Server Enterprise Edition, in fact). Can we tolerate the risk of ending up with an unbalanced final B-Tree in the event of a misestimation? Seems like that's quite a lot worse than poor query performance (due to poor load balancing among parallel workers). DBAs are more or less used to the latter variety of problems. They are not used to the former, especially because the problem might go unnoticed for a long time. * What I've come up with is minimally divergent from the existing approach to tuplesorting. This is useful because the code complexity of having multiple workers consuming final output (writing an index) in parallel looks to be fairly high. Consider B-Tree related features that tuplesort must care about today, which have further special considerations in a world where this simple "consume all output at once in one process" model is replaced for parallel CREATE INDEX. You'd get all kinds of subtle problems at various boundaries in the final keyspace for things like CREATE UNIQUE INDEX CONCURRENTLY. That seems like something that I'd be quite happy to avoid worrying about (while still allowing parallel CREATE UNIQUE INDEX CONCURRENTLY).
Note that some other systems don't allow the equivalent of parallel CREATE INDEX CONCURRENTLY at all, presumably because of similar concerns. My patch supports CIC, and does so almost automatically (although the second pass isn't performed in parallel). I bet that CREATE INDEX won't be the last case where that simplicity and generality clearly matters. Suggested partitioning algorithm: I think a hybrid partitioning + merging approach would work well for us. The paper "Parallel Sorting on a Shared-Nothing Architecture using Probabilistic Splitting" [3] has influenced my thinking here (this was written by prominent researchers from the influential UW-Madison database group). Currently, I have in mind something that is closer to what they call exact splitting than to what they call probabilistic splitting, because I don't think it's going to be generally possible to have good statistics on partition boundaries immediately available (e.g., through something like their probabilistic splitting
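The splitting idea referenced above can be sketched in miniature (illustrative Python, modeling the sampling-based variant from [3]; the actual proposal leans toward exact splitting, but the routing mechanism is the same): choose separator keys so that each of N partitions receives a roughly equal share of the keyspace, then route each tuple by binary search on the separators.

```python
# Pick nparts-1 separator keys as quantiles of a random sample, then
# route values to partitions by binary search over the separators.
import bisect
import random

def choose_separators(sample, nparts):
    sample = sorted(sample)
    return [sample[len(sample) * i // nparts] for i in range(1, nparts)]

def route(value, separators):
    return bisect.bisect_right(separators, value)

random.seed(0)
data = [random.random() for _ in range(10_000)]
seps = choose_separators(random.sample(data, 500), 4)
parts = [[] for _ in range(4)]
for v in data:
    parts[route(v, seps)].append(v)
```

Every value in partition i sorts below every value in partition i+1, so the sorted partitions can simply be concatenated; how badly a misestimated separator skews the partition sizes is exactly the unbalanced-output risk discussed earlier.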
Re: [HACKERS] Slowness of extended protocol
I'm sorry, we are discussing technical details with no real-life use case to cover them. I do not want to suck time for no reason. Please accept my sincere apologies for not asking for the real-life case earlier. Shay, can you come up with a real-life use case where that "I claim the statement will be used only once" hint would indeed improve performance? Or, to put it another way: "do you have a real-life case when the simple protocol is faster than the extended protocol with statement reuse"? I do have a couple of Java applications, and it turns out there's a huge win from reusing server-prepared statements. There's a problem that "the generic plan after the 5th execution might be much worse than a specific one", however those statements are not frequent, and I just put hints in the SQL (limit 0, +0, CTE, those kinds of things). Tom Lane : > I do not accept this idea that retroactively defining special semantics > for certain statement names is not a protocol break. Sir, any new SQL keyword is what you call "retroactively defining special semantics". It's obvious that very few current clients use named server-prepared statements. Statement names are not something that is provided by the end-user in a web page, so it is not rocket science to come up with a statement name that is both short and "never ever used in the wild". Tom Lane : > If it causes any > change in what the server's response would be, then it is a protocol > break. > I see no changes except "the backend would report a protocol violation for the case when the special statement is used and the message sequence is wrong". > > Note: it is quite easy to invent a name that is not yet used in the wild, > > so it is safe. > > Sir, that is utter nonsense. Tom Lane : > And even if it were true, why is it that > this way would safely pass through existing releases of pgbouncer when > other ways would not? Either pgbouncer needs to understand what it's > passing through, or it doesn't.
> Once again: existing pgbouncer versions know how to parse Parse/Bind/Exec/Deallocate messages, so if we bless some well-defined statement name with the semantics that "it is forbidden to reuse that name for multiple executions in a row", then that is completely transparent for pgbouncer. Pgbouncer would just think that "the application is dumb since it reparses the same statement again and again", but it would work just fine. On the contrary, if a new message type is added, then pgbouncer would fail to understand the new message. Vladimir
Re: [HACKERS] Slowness of extended protocol
Shay Rojansky : > Ah, I understand the proposal better now - you're not proposing encoding a > new message type in an old one, but rather a magic statement name in Parse > which triggers an optimized processing path in PostgreSQL, that wouldn't go > through the query cache etc. > Exactly. > If so, isn't that what the empty statement is already supposed to do? I > know there's already some optimizations in place around the scenario of > empty statement name (and empty portal). > The problem with the "empty statement name" is that statements with an empty name can be reused (for instance, for batch insert executions), so the server side has to do a defensive copy (it cannot predict how many times this unnamed statement will be used). Shay Rojansky : > Also, part of the point here is to reduce the number of protocol messages > needed in order to send a parameterized query - not to have to do > Parse/Describe/Bind/Execute/Sync - since part of the slowdown comes from > that (although I'm not sure how much). Your proposal keeps the 5 messages. > As my benchmarks show, notable overhead is due to "defensive copying of the execution plan". So I would measure first, and only then claim where the overhead is. Some more profiling is required to tell which part is the main time consumer. Technically speaking, I would prefer to have a more real-life looking example instead of SELECT 1. Do you have something in mind? For instance, for more complex queries, "Parse/Plan" could cost much more than we shave by adding "a special non-cached statement name" or by reducing "5 messages into 1". There's a side problem: the Describe message requires a full roundtrip since there are cases when the client needs to know how to encode parameters. An additional roundtrip hurts much more than just an additional message that is pipelined (e.g. sent in the same TCP packet).
Shay Rojansky : > But people seem to be suggesting that a significant part of the overhead > comes from the fact that there are 5 messages, meaning there's no way to > optimize this without a new message type. > Of course 5 messages are slower than 1 message. However, that does not mean "there's no way to optimize without a new message type". Profiling can easily reveal the time-consuming parts, and then we can decide if there's a solution. Note: if we improve "SELECT 1" by 10%, it does not mean we improved statement execution by 10%. Real-life statements matter for proper profiling/discussion. Shay Rojansky : > Note: it is quite easy to invent a name that is not yet used in the wild, >> so it is safe. >> > > That's problematic, how do you know what's being used in the wild and what > isn't? The protocol has a specification, it's very problematic to get up > one day and to change it retroactively. But again, the empty statement > seems to already be there for that. > The empty statement name has different semantics, and it is widely used. For instance, pgjdbc uses unnamed statements a lot. On the other hand, a statement name of "!pq@#!@#42" is rather safe to use as a special case. Note: statement names are not typically created by humans (a statement name is not SQL), and very few PG clients support named statements. Vladimir
Re: [HACKERS] Slowness of extended protocol
Shay Rojansky writes: > Ah, I understand the proposal better now - you're not proposing encoding a > new message type in an old one, but rather a magic statement name in Parse > which triggers an optimized processing path in PostgreSQL, that wouldn't go > through the query cache etc. > If so, isn't that what the empty statement is already supposed to do? I > know there's already some optimizations in place around the scenario of > empty statement name (and empty portal). Right, but the problem is that per the current protocol spec, those optimizations are not allowed to change any protocol-visible behavior. We might be able to do a bit more optimization than we're doing now, but we still have to be able to cache the statement in case it's executed more than once, we still have to do planning in response to a separate 'bind' message, etc. AFAICT most of the added overhead comes from having to allow for the possibility of re-execution: that forces at least one extra copy of the parse tree to be made, as well as adding manipulations of the plan cache. We can't get out from under that without a protocol change. > Again, if it's possible to make "Parse/Describe/Bind/Execute/Sync" perform > close to Query, e.g. when specifying empty statement/portal, that's > obviously the best thing here. But people seem to be suggesting that a > significant part of the overhead comes from the fact that there are 5 > messages, meaning there's no way to optimize this without a new message > type. The rather crude measurements I've done put most of the blame on the caching part, although the number of messages might matter more on platforms with slow process-title-update support (I'm looking at you, Windows). regards, tom lane
Re: [HACKERS] Heap WARM Tuples - Design Draft
On Mon, Aug 8, 2016 at 2:52 PM, Pavan Deolasee wrote: >> >> My guess is we would need one bit to mark a WARM chain, and perhaps >> reuse obsolete pre-9.0 HEAP_MOVED_OFF to indicate increment-only or >> decrement-only. > > > I am not convinced that the checking for increment/decrement adds a lot of > value. Sure, we might be able to address some typical work load, but is that > really a common use case? Instead, what I am looking at is storing a bitmap > which shows us which table columns have changed so far in the WARM chain. We > only have limited bits, so we can track only limited columns. This will help > the cases where different columns are updated, but not so much if the same > column is updated repeatedly. > > What will help, and something I haven't yet applied any thought to, is when we > can turn WARM chains back to HOT by removing stale index entries. > > Some heuristics and limits on the amount of work done to detect duplicate index > entries will help too. I think I prefer a more thorough approach. Increment/decrement may get very complicated with custom opclasses, for instance. A column-change bitmap won't know how to handle functional indexes, etc. What I intend to try is to modify btree to allow efficient search of a key-ctid pair, by adding the ctid to the sort order. Only inner pages need to be changed, to include the ctid in the pointers; leaf pages already have the ctid there, so they don't need any kind of change. A bit in the metapage could indicate whether it's a new format or an old one, and yes, only new indices will be able to use WARM. But with efficient key-ctid searches, you handle all cases, and not just a few common ones. A lot of the work for the key-ctid search can be shared with the update itself, so it's probably not such a big performance impact.
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
I wrote: > Having said all that, it is unfortunate that 9.6 is going to go out > without any good solution to this need. But as Robert already pointed > out, trying to fix it now would force delaying 9.6rc1 by several weeks > (and that's assuming that it doesn't take very long to get consensus > on a solution). There's not, AFAICT, desire on the part of the release > team to do that. We'd like to ship 9.6 on time for a change. After some back-and-forth among the release team, there's been a decision to change this position: today's wrap will be 9.6beta4, and we'll try to get something done about the AM property access issue before issuing rc1. We only have a week or two to get this done without starting to impact v10 development as well as hurting our chances of shipping 9.6.0 in September. So let's start getting down to details. regards, tom lane
Re: [HACKERS] Slowness of extended protocol
Vladimir wrote: > On the other hand, usage of some well-defined statement name to trigger >>> the special case would be fine: all pgbouncer versions would pass those >>> parse/bind/exec message as if it were regular messages. >>> >> >> Can you elaborate on what that means exactly? Are you proposing to >> somehow piggyback on an existing message (e.g. Parse) but use some special >> statement name that would make PostgreSQL interpret it as a different >> message type? >> > > Exactly. > For instance: if client sends Parse(statementName=I_swear_ > the_statement_will_be_used_only_once), then the subsequent message must > be BindMessage, and the subsequent must be ExecMessage for exactly the same > statement id. > Ah, I understand the proposal better now - you're not proposing encoding a new message type in an old one, but rather a magic statement name in Parse which triggers an optimized processing path in PostgreSQL, one that wouldn't go through the query cache etc. If so, isn't that what the empty statement is already supposed to do? I know there are already some optimizations in place around the scenario of an empty statement name (and empty portal). Also, part of the point here is to reduce the number of protocol messages needed in order to send a parameterized query - not to have to do Parse/Describe/Bind/Execute/Sync - since part of the slowdown comes from that (although I'm not sure how much). Your proposal keeps the 5 messages. Again, if it's possible to make "Parse/Describe/Bind/Execute/Sync" perform close to Query, e.g. when specifying empty statement/portal, that's obviously the best thing here. But people seem to be suggesting that a significant part of the overhead comes from the fact that there are 5 messages, meaning there's no way to optimize this without a new message type. > Note: it is quite easy to invent a name that is not yet used in the wild, > so it is safe. > That's problematic: how do you know what's being used in the wild and what isn't?
The protocol has a specification; it's very problematic to get up one day and change it retroactively. But again, the empty statement name seems to already be there for that purpose.
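[Editor's note] For context on the per-message cost being debated, here is a rough sketch of framing a v3 Parse message ('P', then an int32 length that counts itself but not the type byte, then the NUL-terminated statement name, the NUL-terminated query, and an int16 count of parameter-type OIDs, all big-endian). A simplified illustration of the documented wire format, not a client implementation:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* Frame a minimal Parse message (zero parameter-type OIDs) into buf;
 * returns total bytes written.  buf must be large enough. */
static size_t
frame_parse(uint8_t *buf, const char *stmt, const char *query)
{
    size_t name_len = strlen(stmt) + 1;    /* include the NUL */
    size_t query_len = strlen(query) + 1;
    uint32_t len = 4 + (uint32_t)(name_len + query_len) + 2;
    size_t i = 0;

    buf[i++] = 'P';
    buf[i++] = (uint8_t)((len >> 24) & 0xFF);  /* int32 length, big-endian */
    buf[i++] = (uint8_t)((len >> 16) & 0xFF);
    buf[i++] = (uint8_t)((len >> 8) & 0xFF);
    buf[i++] = (uint8_t)(len & 0xFF);
    memcpy(buf + i, stmt, name_len);   i += name_len;
    memcpy(buf + i, query, query_len); i += query_len;
    buf[i++] = 0;                              /* int16 param count = 0 */
    buf[i++] = 0;
    return i;
}
```

Each of the five extended-protocol messages carries a similar fixed header, which is part of why the per-query byte and dispatch overhead adds up relative to a single Query message.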
Re: [HACKERS] Slowness of extended protocol
Vladimir Sitnikov writes: > The point is "adding a message to current v3 protocol is not a backward > compatible change". > The problem with adding new message types is not only "client support", but > deployment issues as well: new message would require simultaneous upgrade > of both backend, client, and pgbouncer. Right ... > It could make sense to add some kind of transparent extensibility to the > protocol, so clients can just ignore unknown message types. I'm not really sure what use that particular behavior would be. We certainly want to try to have some incremental extensibility in there when we do v4, but I think it would more likely take the form of a way for client and server to agree on which extensions they support. > On the other hand, usage of some well-defined statement name to trigger > the special case would be fine: all pgbouncer versions would pass those > parse/bind/exec message as if it were regular messages. I do not accept this idea that retroactively defining special semantics for certain statement names is not a protocol break. If it causes any change in what the server's response would be, then it is a protocol break. > Note: it is quite easy to invent a name that is not yet used in the wild, > so it is safe. Sir, that is utter nonsense. And even if it were true, why is it that this way would safely pass through existing releases of pgbouncer when other ways would not? Either pgbouncer needs to understand what it's passing through, or it doesn't. regards, tom lane
Re: [HACKERS] Heap WARM Tuples - Design Draft
On Mon, Aug 8, 2016 at 11:08 PM, Bruce Momjian wrote: > On Sun, Aug 7, 2016 at 12:55:01PM -0400, Bruce Momjian wrote: > > On Sun, Aug 7, 2016 at 10:49:45AM -0400, Bruce Momjian wrote: > > > OK, crazy idea time --- what if we only do WARM chain additions when > all > > > indexed values are increasing (with NULLs higher than all values)? (If > > > a key is always-increasing, it can't match a previous value in the > > > chain.) That avoids the problem of having to check the WARM chain, > > > except for the previous tuple, and the problem of pruning removing > > > changed rows. It avoids having to check the index for matching > key/ctid > > > values, and it prevents CREATE INDEX from having to index WARM chain > > > values. > > > > > > Any decreasing value would cause a normal tuple be created. > > > > Actually, when we add the first WARM tuple, we can mark the HOT/WARM > > chain as either all-incrementing or all-decrementing. We would need a > > bit to indicate that. > > FYI, is see at least two available tuple header bits here, 0x0800 and > 0x1000: > > /* > * information stored in t_infomask2: > */ > #define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of > attributes */ > /* bits 0x1800 are available */ > #define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and > key cols > * modified, or tuple > deleted */ > #define HEAP_HOT_UPDATED0x4000 /* tuple was HOT-updated */ > #define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple > */ > > #define HEAP2_XACT_MASK 0xE000 /* visibility-related bits > */ > What I am currently trying to do is to reuse at least the BlockNumber field in t_ctid. For HOT/WARM chains, that field is really unused (except the last tuple when regular update needs to store block number of the new block). My idea is to use one free bit in t_infomask2 to tell us that t_ctid is really not a CTID, but contains new information (for pg_upgrade's sake). 
For example, one bit in bi_hi can tell us that this is the last tuple in the chain (information today conveyed by t_ctid pointing to self). Another bit can tell us that this tuple was WARM updated. We will still have plenty of bits to store additional information about WARM chains.

> My guess is we would need one bit to mark a WARM chain, and perhaps reuse obsolete pre-9.0 HEAP_MOVED_OFF to indicate increment-only or decrement-only.

I am not convinced that the checking for increment/decrement adds a lot of value. Sure, we might be able to address some typical workload, but is that really a common use case? Instead, what I am looking at is storing a bitmap which shows us which table columns have changed so far in the WARM chain. We only have limited bits, so we can track only limited columns. This will help the cases where different columns are updated, but not so much if the same column is updated repeatedly.

What will help, and something I haven't yet applied much thought to, is when we can turn WARM chains back to HOT by removing stale index entries.

Some heuristics and limits on the amount of work done to detect duplicate index entries will help too.

> We can't use the bits in LP_REDIRECT's lp_len because we need to create WARM chains before pruning, and I don't think walking the pre-pruned chain is worth it. (As I understand HOT, LP_REDIRECT is only created during pruning.)

That's correct. But lp_len provides us some place to stash information from heap tuples when they are pruned.

Thanks, Pavan

Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
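[Editor's note] Pavan's idea — stealing the otherwise-unused BlockNumber half of t_ctid inside a HOT/WARM chain — reduces to simple bit packing once an infomask2 bit says "this t_ctid is not a real CTID". The flag names and layout below are purely hypothetical, only to show that bi_hi has room for several chain flags:

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical flags stashed in the bi_hi half of t_ctid, valid only
 * when a (hypothetical) free t_infomask2 bit marks t_ctid as repurposed. */
#define WARM_CHAIN_LAST    0x8000  /* last tuple in the chain (replaces
                                    * "t_ctid points to self") */
#define WARM_TUPLE_UPDATED 0x4000  /* this tuple was WARM updated */

static uint16_t
warm_set_flag(uint16_t bi_hi, uint16_t flag)
{
    return bi_hi | flag;
}

static int
warm_has_flag(uint16_t bi_hi, uint16_t flag)
{
    return (bi_hi & flag) != 0;
}
```

The remaining 14 bits of bi_hi (plus bi_lo) would stay free for the "additional information about WARM chains" mentioned above.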
Re: [HACKERS] Slowness of extended protocol
Shay Rojansky : > That's definitely a valid point. But do you think it's a strong enough > argument to avoid ever adding new messages? >

The point is "adding a message to the current v3 protocol is not a backward compatible change". The problem with adding new message types is not only "client support", but deployment issues as well: a new message would require a simultaneous upgrade of backend, client, and pgbouncer.

It could make sense to add some kind of transparent extensibility to the protocol, so clients can just ignore unknown message types. For instance, suppose a new message "ExtensibleMessage" is added: '#', int16 (message id), int32 (message length), contents. For the pgbouncer case above, pgbouncer could just proxy "unknown requests" as is and it would cover many of the cases out of the box. That message has a well-defined message length, so it is easy to skip. If new messages are required, a new message_id can be allocated and assigned to the new message type. Old clients would be able to skip the message as they know its length.

Of course, adding that '#' message does require support from pgbouncer and so on; however, I'm sure it would simplify protocol evolution in the future. Of course there might appear a need for a message that cannot be ignored by the client, but I think "even a bit of flexibility" is better than no flexibility at all. Technically speaking, there's a "NotificationResponse" message, however it is not that good since it cannot have 0 bytes in the contents.

Shay Rojansky : > >> On the other hand, usage of some well-defined statement name to trigger >> the special case would be fine: all pgbouncer versions would pass those >> parse/bind/exec message as if it were regular messages. >> > > Can you elaborate on what that means exactly? Are you proposing to somehow > piggyback on an existing message (e.g. Parse) but use some special > statement name that would make PostgreSQL interpret it as a different > message type? > Exactly.
For instance: if the client sends Parse(statementName=I_swear_the_statement_will_be_used_only_once), then the subsequent message must be BindMessage, and the one after that must be ExecMessage for exactly the same statement id. Then the backend could recognize the pattern and perform the optimization.

Note: it is quite easy to invent a name that is not yet used in the wild, so it is safe.

From a security point of view (e.g. if a client wanted to exploit use-after-free kinds of issues), the backend could detect deviations from this parse-bind-exec sequence and just drop the mic off.

Shay Rojansky : > Apart from being a pretty horrible hack, >

I would not call it a horrible hack. That is just a clever use of existing bits, and it breaks neither backward nor forward compatibility. Backward compatibility: new clients would be compatible with old PG versions (even 8.4). Forward compatibility: even if support for that special statement name were dropped for some reason, there would be no application issues (it would just result in a slight performance degradation).

Shay Rojansky : > it would still break pgbouncer, which has to actually inspect and > understand SQL being sent to the database (e.g. to know when transactions > start and stop). >

Note: I do not suggest changing message formats. The message itself is just fine and existing pgbouncer versions can inspect the SQL. The difference is a special statementName, and I see no reason for that kind of change to break pgbouncer.

Vladimir
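[Editor's note] The proposed '#' envelope is attractive precisely because a client or proxy that does not understand the inner message id can still skip it: the int32 length tells it how many bytes to consume. A sketch of that skip logic (the framing is hypothetical, taken from the message above; here the int32 is assumed to be the payload length, big-endian as in the v3 protocol):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Given a buffer positioned at a hypothetical ExtensibleMessage
 * ('#', int16 message id, int32 payload length, payload), return the
 * offset of the next message, or 0 if the type byte does not match. */
static size_t
skip_extensible(const uint8_t *buf)
{
    if (buf[0] != '#')
        return 0;
    uint32_t payload_len = ((uint32_t)buf[3] << 24) |
                           ((uint32_t)buf[4] << 16) |
                           ((uint32_t)buf[5] << 8)  |
                            (uint32_t)buf[6];
    /* 1 type byte + 2 id bytes + 4 length bytes + payload */
    return 1 + 2 + 4 + (size_t)payload_len;
}
```

An old client or pgbouncer would call this for any message id it does not recognize and simply continue at the returned offset.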
Re: [HACKERS] Heap WARM Tuples - Design Draft
On Sun, Aug 7, 2016 at 12:55:01PM -0400, Bruce Momjian wrote:
> On Sun, Aug 7, 2016 at 10:49:45AM -0400, Bruce Momjian wrote:
> > OK, crazy idea time --- what if we only do WARM chain additions when all indexed values are increasing (with NULLs higher than all values)? (If a key is always-increasing, it can't match a previous value in the chain.) That avoids the problem of having to check the WARM chain, except for the previous tuple, and the problem of pruning removing changed rows. It avoids having to check the index for matching key/ctid values, and it prevents CREATE INDEX from having to index WARM chain values.
> >
> > Any decreasing value would cause a normal tuple to be created.
>
> Actually, when we add the first WARM tuple, we can mark the HOT/WARM chain as either all-incrementing or all-decrementing. We would need a bit to indicate that.

FYI, I see at least two available tuple header bits here, 0x0800 and 0x1000:

/*
 * information stored in t_infomask2:
 */
#define HEAP_NATTS_MASK    0x07FF  /* 11 bits for number of attributes */
/* bits 0x1800 are available */
#define HEAP_KEYS_UPDATED  0x2000  /* tuple was updated and key cols
                                    * modified, or tuple deleted */
#define HEAP_HOT_UPDATED   0x4000  /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE    0x8000  /* this is heap-only tuple */

#define HEAP2_XACT_MASK    0xE000  /* visibility-related bits */

My guess is we would need one bit to mark a WARM chain, and perhaps reuse the obsolete pre-9.0 HEAP_MOVED_OFF to indicate increment-only or decrement-only. These flags have to be passed to all forward tuples in the chain so an addition to the chain knows quickly if it is a WARM chain, and which direction.

We can't use the bits in LP_REDIRECT's lp_len because we need to create WARM chains before pruning, and I don't think walking the pre-pruned chain is worth it. (As I understand HOT, LP_REDIRECT is only created during pruning.)
-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
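[Editor's note] Taking Bruce's listing at face value, the two free t_infomask2 bits could mark a chain as WARM and record its direction. A hypothetical encoding (the WARM flag names are invented here; only HEAP_NATTS_MASK matches the real header), showing that the flags stay clear of the attribute-count mask and can be propagated forward along the chain:

```c
#include <stdint.h>
#include <assert.h>

#define HEAP_NATTS_MASK  0x07FF  /* 11 bits: number of attributes */
#define HEAP_WARM_CHAIN  0x0800  /* hypothetical: tuple is on a WARM chain */
#define HEAP_WARM_INCR   0x1000  /* hypothetical: chain is increment-only */

/* The flag bits must never collide with the attribute-count mask. */
static int
flags_are_disjoint(void)
{
    return ((HEAP_WARM_CHAIN | HEAP_WARM_INCR) & HEAP_NATTS_MASK) == 0;
}

/* Propagate chain flags to a new tuple's infomask2 without clobbering
 * its attribute count, so the chain's state is readable from any
 * forward tuple without walking back. */
static uint16_t
copy_warm_flags(uint16_t old_mask, uint16_t new_mask)
{
    return new_mask | (old_mask & (HEAP_WARM_CHAIN | HEAP_WARM_INCR));
}
```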
Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]
06.08.2016 19:51, Andrew Borodin:

Anastasia, thank you for your attentive code review.

2016-08-05 21:19 GMT+05:00 Anastasia Lubennikova : First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize? Although, I'm quite sure that it was already aligned somewhere before. I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary. I'd rather add the check: (offset+size_diff < pd_lower).

Actually, that's a tricky question. There may be different code expectations:

1. If we expect that a not-maxaligned tuple may be placed by any other routine, we should remove the check (size_diff != MAXALIGN(size_diff)), since it will fail for a not-maxaligned tuple.

2. If we expect that no one should use PageIndexTupleOverwrite with not-maxaligned tuples, then the current checks are OK: we will break execution if we find non-maxaligned tuples, with an almost correct message.

I suggest that this check could be a debug-only assertion: in production code the routine will work with not-maxaligned tuples if they already reside on the page, while in a dev build it will inform the developer that something is going wrong. Is this behavior Postgres-style compliant?

I agree that the pd_lower check makes sense. In pointing out this comparison I was worried about a possible call of MAXALIGN on a negative size_diff value, although I'm not sure whether it can cause any problem.

Tuples on a page are always maxaligned. But, as far as I recall, itemid->lp_len can contain a non-maxaligned value. So, I'd suggest performing MAXALIGN(oldsize) before computing size_diff: size_diff = oldsize - alignednewsize; Since both arguments are maxaligned, the check of size_diff is not needed.

BTW, I'm very surprised that it improves performance so much. And also the size is reduced significantly: 89MB against 289MB without the patch. Could you explain in detail why that happens? The size reduction is unexpected for me.

There might be 4 plausible explanations. I'll list them ordered by descending probability:

1. Before this patch every update was throwing the recently updated tuple to the end of a page. Thus, in somehow-ordered data, the recent best fit will be discovered last. This can increase MBB overlap in the spatial index and make the tree slightly taller. But a 3x size decrease is unlikely. How did you obtain those results? Can I look at a test case?

Nothing special, I've just run the test you provided in this thread. And checked the index size via select pg_size_pretty(pg_relation_size('idx'));

2. A bug in PageIndexTupleDelete causing unused space to appear. I've searched for it, unsuccessfully.

3. A bug in PageIndexTupleOverwrite. I cannot imagine the nature of such a bug. Maybe we are not placing something not very important and big on a page?

4. Magic. I do not see what one should do with the R-tree to change its size by a factor of 3. The first three explanations are not better than the fourth, actually.

Those 89 MB, they do not include WAL, right?

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
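[Editor's note] The MAXALIGN arithmetic under discussion can be illustrated in isolation. Assuming 8-byte maximum alignment (a typical value on 64-bit platforms; the macro below mirrors the usual rounding idiom rather than the exact PostgreSQL source), aligning oldsize before computing size_diff guarantees the difference of two maxaligned quantities is itself maxaligned, making the size_diff check redundant:

```c
#include <stdint.h>
#include <assert.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(len) \
    (((uintptr_t)(len) + MAXIMUM_ALIGNOF - 1) & ~((uintptr_t)MAXIMUM_ALIGNOF - 1))

/* Compute the space freed when overwriting a tuple in place, aligning
 * the old size first as suggested above.  Both operands are then
 * multiples of MAXIMUM_ALIGNOF, so the difference is too. */
static long
aligned_size_diff(uintptr_t oldsize, uintptr_t alignednewsize)
{
    return (long)(MAXALIGN(oldsize) - alignednewsize);
}
```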
Re: [HACKERS] Wait events monitoring future development
On Mon, Aug 8, 2016 at 10:03 AM, Bruce Momjian wrote: > On Mon, Aug 8, 2016 at 04:43:40PM +0530, Amit Kapila wrote: >> >According to developers, overhead is small, but many people have doubts >> > that it can be much more significant for intensive workloads. Obviously, it >> > is not an easy task to test, because you need to put doubtfully >> > non-production ready code into mission-critical production for such tests. >> >As a result it will be clear if this design should be abandoned and we >> > need to think about less-invasive solutions or this design is acceptable. >> > >> >> I think here main objection was that gettimeofday can cause >> performance regression which can be taken care by using configurable >> knob. I am not aware if any other part of the design has been >> discussed in detail to conclude whether it has any obvious problem. > > It seems asking users to run pg_test_timing before deploying to check > the overhead would be sufficient. They should also run it in parallel, as sometimes the real overhead is in synchronization between multiple CPUs and doesn't show up when only a single CPU is involved. Cheers, Jeff
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
Vik Fearing writes: > On 08/08/16 15:38, Tom Lane wrote: >> Well, the next step is to fill in the blanks: what properties need to be >> queryable? > That's less urgent; adding missing properties does not require a > catversion bump. If the complaint is "I can do X before 9.6.0 and after 9.6.0, but not in 9.6.0", it doesn't really matter whether it would take a catversion bump to resolve; the problem is lack of a time machine. Once there's a released 9.6.0 out there that is missing something important, applications that care about version portability are going to have to deal with it. So we need to get this right the first time. regards, tom lane
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
On 08/08/16 17:02, Robert Haas wrote: > On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed wrote: >> Thank you for inputs everyone. >> >> The opinions on this thread can be classified into following >> 1. Commit >> 2. Rollback >> 3. Error >> 4. Warning >> >> As per opinion upthread, issuing implicit commit immediately after switching >> autocommit to ON, can be unsafe if it was not desired. While I agree that >> its difficult to judge users intention here, but if we were to base it on >> some assumption, the closest would be implicit COMMIT in my opinion.There is >> higher likelihood of a user being happy with issuing a commit when setting >> autocommit ON than a transaction being rolled back. Also there are quite >> some interfaces which provide this. >> >> As mentioned upthread, issuing a warning on switching back to autocommit >> will not be effective inside a script. It won't allow subsequent commands to >> be committed as set autocommit to ON is not committed. Scripts will have to >> be rerun with changes which will impact user friendliness. >> >> While I agree that issuing an ERROR and rolling back the transaction ranks >> higher in safe behaviour, it is not as common (according to instances stated >> upthread) as immediately committing any open transaction when switching back >> to autocommit. > > I think I like the option of having psql issue an error. On the > server side, the transaction would still be open, but the user would > receive a psql error message and the autocommit setting would not be > changed. So the user could type COMMIT or ROLLBACK manually and then > retry changing the value of the setting. This is my preferred action. > Alternatively, I also think it would be sensible to issue an immediate > COMMIT when the autocommit setting is changed from off to on. That > was my first reaction. I don't care for this very much. > Aborting the server-side transaction - with or without notice - > doesn't seem very reasonable. Agreed. 
-- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Wait events monitoring future development
On Mon, Aug 8, 2016 at 04:43:40PM +0530, Amit Kapila wrote: > >According to developers, overhead is small, but many people have doubts > > that it can be much more significant for intensive workloads. Obviously, it > > is not an easy task to test, because you need to put doubtfully > > non-production ready code into mission-critical production for such tests. > >As a result it will be clear if this design should be abandoned and we > > need to think about less-invasive solutions or this design is acceptable. > > > > I think here main objection was that gettimeofday can cause > performance regression which can be taken care by using configurable > knob. I am not aware if any other part of the design has been > discussed in detail to conclude whether it has any obvious problem. It seems asking users to run pg_test_timing before deploying to check the overhead would be sufficient. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
On 08/08/16 15:38, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> Building on the has-property approach Andrew suggested, I wonder if >>> we need something like pg_index_column_has_property(indexoid, colno, >>> propertyname) with properties like "sortable", "desc", "nulls first". > >> This seems simple enough, on the surface. Why not run with this design? > > Well, the next step is to fill in the blanks: what properties need to be > queryable? That's less urgent; adding missing properties does not require a catversion bump. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Heap WARM Tuples - Design Draft
On Mon, Aug 8, 2016 at 06:34:46PM +0530, Amit Kapila wrote: > I think here expensive part would be recheck for the cases where the > index value is changed to a different value (value which doesn't exist > in WARM chain). You anyway have to add the new entry (key,TID) in > index, but each time traversing the WARM chain would be an additional > effort. For cases, where there are just two index entries and one > them is being updated, it might regress as compare to what we do now. Yes, I think the all-increment or all-decrement WARM chain is going to be the right approach. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
On Mon, Aug 8, 2016 at 11:17 AM, David G. Johnston wrote: > On Mon, Aug 8, 2016 at 11:02 AM, Robert Haas wrote: >> >> On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed >> wrote: >> > Thank you for inputs everyone. >> > >> > The opinions on this thread can be classified into following >> > 1. Commit >> > 2. Rollback >> > 3. Error >> > 4. Warning >> > >> > As per opinion upthread, issuing implicit commit immediately after >> > switching >> > autocommit to ON, can be unsafe if it was not desired. While I agree >> > that >> > its difficult to judge users intention here, but if we were to base it >> > on >> > some assumption, the closest would be implicit COMMIT in my >> > opinion.There is >> > higher likelihood of a user being happy with issuing a commit when >> > setting >> > autocommit ON than a transaction being rolled back. Also there are >> > quite >> > some interfaces which provide this. >> > >> > As mentioned upthread, issuing a warning on switching back to autocommit >> > will not be effective inside a script. It won't allow subsequent >> > commands to >> > be committed as set autocommit to ON is not committed. Scripts will have >> > to >> > be rerun with changes which will impact user friendliness. >> > >> > While I agree that issuing an ERROR and rolling back the transaction >> > ranks >> > higher in safe behaviour, it is not as common (according to instances >> > stated >> > upthread) as immediately committing any open transaction when switching >> > back >> > to autocommit. >> >> I think I like the option of having psql issue an error. On the >> server side, the transaction would still be open, but the user would >> receive a psql error message and the autocommit setting would not be >> changed. So the user could type COMMIT or ROLLBACK manually and then >> retry changing the value of the setting. >> >> Alternatively, I also think it would be sensible to issue an immediate >> COMMIT when the autocommit setting is changed from off to on. 
That >> was my first reaction. >> >> Aborting the server-side transaction - with or without notice - >> doesn't seem very reasonable. >> > > Best of both worlds? > > IF (ON_ERROR_STOP == 1) > THEN (treat this like any other error) > ELSE (send COMMIT; to server) No, that's not a good idea. The purpose of ON_ERROR_STOP is something else entirely, and we shouldn't reuse it for an unrelated purpose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
On Mon, Aug 8, 2016 at 11:02 AM, Robert Haas wrote: > On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed > wrote: > > Thank you for inputs everyone. > > > > The opinions on this thread can be classified into following > > 1. Commit > > 2. Rollback > > 3. Error > > 4. Warning > > > > As per opinion upthread, issuing implicit commit immediately after > switching > > autocommit to ON, can be unsafe if it was not desired. While I agree > that > > its difficult to judge users intention here, but if we were to base it on > > some assumption, the closest would be implicit COMMIT in my > opinion.There is > > higher likelihood of a user being happy with issuing a commit when > setting > > autocommit ON than a transaction being rolled back. Also there are quite > > some interfaces which provide this. > > > > As mentioned upthread, issuing a warning on switching back to autocommit > > will not be effective inside a script. It won't allow subsequent > commands to > > be committed as set autocommit to ON is not committed. Scripts will have > to > > be rerun with changes which will impact user friendliness. > > > > While I agree that issuing an ERROR and rolling back the transaction > ranks > > higher in safe behaviour, it is not as common (according to instances > stated > > upthread) as immediately committing any open transaction when switching > back > > to autocommit. > > I think I like the option of having psql issue an error. On the > server side, the transaction would still be open, but the user would > receive a psql error message and the autocommit setting would not be > changed. So the user could type COMMIT or ROLLBACK manually and then > retry changing the value of the setting. > > Alternatively, I also think it would be sensible to issue an immediate > COMMIT when the autocommit setting is changed from off to on. That > was my first reaction. > > Aborting the server-side transaction - with or without notice - > doesn't seem very reasonable. > > Best of both worlds? 
IF (ON_ERROR_STOP == 1)
THEN (treat this like any other error)
ELSE (send COMMIT; to server)

David J.
Re: [HACKERS] Refactoring of heapam code.
Anastasia Lubennikova wrote: > By the way, I thought about looking at the mentioned patch and > probably reviewing it, but didn't find any of your patches on the > current commitfest. Could you point out the thread? Sorry, I haven't posted anything yet. > >Agreed. But changing its name while keeping it in heapam.c does not > >really improve things enough. I'd rather have it moved elsewhere that's > >not specific to "heaps" (somewhere in access/common perhaps). However, > >renaming the functions would break third-party code for no good reason. > >I propose that we only rename things if we also break it for other > >reasons (say because the API changes in a fundamental way). > > Yes, I agree that it should be moved to another file. > Just to be more accurate: it's not in heapam.c now, it is in > "src/backend/catalog/heap.c" which requires much more changes > than I did. Argh. Clearly, the code organization in this area is not good at all. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed wrote: > Thank you for inputs everyone. > > The opinions on this thread can be classified into following > 1. Commit > 2. Rollback > 3. Error > 4. Warning > > As per opinion upthread, issuing implicit commit immediately after switching > autocommit to ON, can be unsafe if it was not desired. While I agree that > its difficult to judge users intention here, but if we were to base it on > some assumption, the closest would be implicit COMMIT in my opinion.There is > higher likelihood of a user being happy with issuing a commit when setting > autocommit ON than a transaction being rolled back. Also there are quite > some interfaces which provide this. > > As mentioned upthread, issuing a warning on switching back to autocommit > will not be effective inside a script. It won't allow subsequent commands to > be committed as set autocommit to ON is not committed. Scripts will have to > be rerun with changes which will impact user friendliness. > > While I agree that issuing an ERROR and rolling back the transaction ranks > higher in safe behaviour, it is not as common (according to instances stated > upthread) as immediately committing any open transaction when switching back > to autocommit. I think I like the option of having psql issue an error. On the server side, the transaction would still be open, but the user would receive a psql error message and the autocommit setting would not be changed. So the user could type COMMIT or ROLLBACK manually and then retry changing the value of the setting. Alternatively, I also think it would be sensible to issue an immediate COMMIT when the autocommit setting is changed from off to on. That was my first reaction. Aborting the server-side transaction - with or without notice - doesn't seem very reasonable. 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Thank you for inputs everyone.

The opinions on this thread can be classified into the following:
1. Commit
2. Rollback
3. Error
4. Warning

As per opinion upthread, issuing an implicit commit immediately after switching autocommit to ON can be unsafe if it was not desired. While I agree that it's difficult to judge the user's intention here, if we were to base it on some assumption, the closest would be an implicit COMMIT, in my opinion. There is a higher likelihood of a user being happy with issuing a commit when setting autocommit ON than with a transaction being rolled back. Also there are quite some interfaces which provide this.

As mentioned upthread, issuing a warning on switching back to autocommit will not be effective inside a script. It won't allow subsequent commands to be committed, as the set autocommit to ON is not committed. Scripts will have to be rerun with changes, which will impact user friendliness.

While I agree that issuing an ERROR and rolling back the transaction ranks higher in safe behaviour, it is not as common (according to instances stated upthread) as immediately committing any open transaction when switching back to autocommit.

Thank you,
Rahila Syed

On Sun, Aug 7, 2016 at 4:42 AM, Matt Kelly wrote:
> It's worth noting that JDBC's behavior when you switch back to
> autocommit is to immediately commit the open transaction.
>
> Personally, I think committing immediately or erroring are unsurprising
> behaviors. The current behavior is surprising and obviously wrong.
> Rolling back without an error would be very surprising (no other
> database API I know of does that) and would take forever to debug issues
> around the behavior. And committing after the next statement is
> definitely the most surprising behavior suggested.
>
> IMHO, I think committing immediately and erroring are both valid. I
> think I'd prefer the error in principle, and per the law of bad code
> I'm sure that, although no one has ever intended to use this behavior,
> there is probably some code out there that is relying on it for
> "correctness". I think a hard failure and making the dev add an
> explicit commit is least likely to cause people serious issues. As for
> the other options, consider me opposed.
>
> - Matt K.
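The "error" option favoured above can be modeled as a tiny state machine. The following is purely an illustrative sketch of the proposed semantics, not psql code; the `Session` class and all of its names are invented:

```python
class OpenTransactionError(Exception):
    """Raised when AUTOCOMMIT is switched on while a transaction is open."""


class Session:
    """Toy model of the client-side state for \\set AUTOCOMMIT."""

    def __init__(self):
        self.autocommit = True
        self.in_transaction = False

    def execute(self, sql):
        # With autocommit off, the first statement implicitly opens
        # a transaction block (the current psql behavior).
        if not self.autocommit and not self.in_transaction:
            self.in_transaction = True

    def commit(self):
        self.in_transaction = False

    def set_autocommit(self, on):
        # Proposed behavior: refuse to change the setting while a
        # transaction is open; the user must COMMIT or ROLLBACK first.
        if on and self.in_transaction:
            raise OpenTransactionError(
                "cannot enable AUTOCOMMIT inside an open transaction")
        self.autocommit = on
```

Under this model, the server-side transaction stays open and the setting stays unchanged until the user resolves it explicitly.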
[HACKERS] extract text from XML
Hi,

I have found a basic use case which is supported by the xml2 module, but is unsupported by the new XML API. It is not possible to correctly extract text (either from text nodes or attribute values) which contains the characters '<', '&', or '>'. xpath() (correctly) returns XML text nodes for queries targeting these node types, and there is no inverse to xmlelement(). For example:

=> select (xpath('/a/text()', xmlelement(name a, '<&>')))[1]::text;
     xpath
----------------
 &lt;&amp;&gt;
(1 row)

Again, not a bug; but there is no way to specify my desired intent. The xml2 module does provide such a function, xpath_string:

=> select xpath_string(xmlelement(name a, '<&>')::text, '/a/text()');
 xpath_string
--------------
 <&>
(1 row)

One workaround is to return the node's text value by serializing the XML value and textually replacing those three entities with the characters they represent, but this relies on the xpath() function not generating other entities. (My use case is importing data in XML format, and processing it with Postgres into a relational format.)

Perhaps a function xpath_value(text, xml) -> text[] would close the gap? (I did search, and no such function seems to exist currently, outside xml2.)

Thanks,
Chris
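For what it's worth, the textual-replacement workaround described above can be sketched as follows (illustrative only, and it shares the stated caveat that xpath() must not generate other entities):

```python
def xml_text_value(serialized):
    """Undo the three predefined entities in a serialized XML text node.

    '&amp;' must be replaced last; otherwise '&amp;lt;' would be
    decoded twice and yield '<' instead of '&lt;'.
    """
    return (serialized
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&amp;", "&"))
```

Python's `xml.sax.saxutils.unescape` performs the same three replacements by default, if a stdlib equivalent is preferred.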
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
Alvaro Herrera writes:
> Tom Lane wrote:
>> Building on the has-property approach Andrew suggested, I wonder if
>> we need something like pg_index_column_has_property(indexoid, colno,
>> propertyname) with properties like "sortable", "desc", "nulls first".
>
> This seems simple enough, on the surface. Why not run with this design?

Well, the next step is to fill in the blanks: what properties need to be queryable?

regards, tom lane
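If the blanks were filled in, usage of the proposed function might look like the following. This is a hypothetical sketch only: neither the function nor the property names are settled at this point in the thread, and the index name is invented.

```sql
-- Hypothetical: pg_index_column_has_property() as sketched in this thread.
SELECT pg_index_column_has_property('my_idx'::regclass, 1, 'sortable');
SELECT pg_index_column_has_property('my_idx'::regclass, 1, 'desc');
SELECT pg_index_column_has_property('my_idx'::regclass, 1, 'nulls first');
```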
Re: [HACKERS] Heap WARM Tuples - Design Draft
On Fri, Aug 5, 2016 at 9:57 AM, Pavan Deolasee wrote:
>
> On Fri, Aug 5, 2016 at 8:23 AM, Claudio Freire wrote:
>>
>> On Thu, Aug 4, 2016 at 11:15 PM, Bruce Momjian wrote:
>> >
>> > OK, that's a lot of text, and I am confused. Please tell me the
>> > advantage of having an index point to a string of HOT chains, rather
>> > than a single one? Is the advantage that the index points into the
>> > middle of the HOT chain rather than at the beginning? I can see some
>> > value to that, but having more ctid HOT headers that can't be removed
>> > except by VACUUM seems bad, plus the need for index rechecks as you
>> > cross to the next HOT chain seems bad.
>> >
>> > The value of WARM is to avoid index bloat --- right now we traverse the
>> > HOT chain on a single page just fine with no complaints about speed, so
>> > I don't see a value in optimizing that traversal, and I think the key
>> > rechecks and ctid bloat will make it all much slower.
>> >
>> > It also seems much more complicated.
>>
>> The point is avoiding duplicate rows in the output of index scans.
>>
>> I don't think you can avoid it simply by applying index condition
>> rechecks as the original proposal implies, in this case:
>>
>> CREATE TABLE t (id integer not null primary key, someid integer, dat integer);
>> CREATE INDEX i1 ON t (someid);
>>
>> INSERT INTO t (id, someid, dat) VALUES (1, 2, 100);
>> UPDATE t SET dat = dat + 1 where id = 1;
>> UPDATE t SET dat = dat + 1, id = 2 where id = 1;
>> UPDATE t SET dat = dat + 1, id = 3, someid = 3 where id = 2;
>> UPDATE t SET dat = dat + 1, id = 1, someid = 2 where id = 3;
>> SELECT * FROM t WHERE someid = 2;
>>
>> That, I believe, will cause the problematic chains where the condition
>> recheck passes both times the last visible tuple is visited during the
>> scan. It will return more than one tuple when in reality there is only
>> one.
>
> Hmm. That seems like a real problem. The only way to address that is to
> ensure that for a given WARM chain and a given index key, there must
> exist only a single index entry. There can and will be multiple entries
> if the index key changes, but if the index key changes back to the
> original value (or any other value already in the WARM chain), we must
> not add another index entry. Now that does not seem like a very common
> scenario, and skipping a duplicate index entry does have its own
> benefit, so maybe it's not a terrible idea to try that. You're right,
> it may be expensive to search for an existing matching index key for
> the same root line pointer. But if we could somehow short-circuit the
> more common case, it might still be worth trying.

I think here the expensive part would be the recheck for the cases where the index value is changed to a different value (a value which doesn't exist in the WARM chain). You anyway have to add the new entry (key, TID) in the index, but traversing the WARM chain each time would be an additional effort. For cases where there are just two index entries and one of them is being updated, it might regress as compared to what we do now.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
On 8 August 2016 at 03:49, Vladimir Sitnikov wrote:
> Tom Lane:
>> FWIW, this thread started on 25-July, less than two weeks ago.
>
> Technically speaking, there was a pgsql-jdbc thread started on May 14:
> https://www.postgresql.org/message-id/nh72v6%24582%241%40ger.gmane.org
> 9.6beta1 was released on May 12.
>
>> The fact that it wasn't raised till more than 6 months after we
>> committed the pg_am changes
>
> This means that nobody was testing compatibility of postgresql's master
> branch with existing third-party clients. Testing against well-known
> clients makes sense to catch bugs early.
>
> I've added a "build postgresql from master branch" test to pgjdbc's
> regression suite a week ago, so I hope it would highlight issues early
> (even before the official postgresql beta is released).
>
> However, pgjdbc tests are executed only for pgjdbc commits, so if there
> are no pgjdbc changes, then there is no logic to trigger "try newer
> postgres with current pgjdbc".
>
> Ideally, postgresql's regression suite should validate well-known
> clients as well. I've no idea how long it would take to add something
> to postgresql's buildfarm, so I just went ahead and created a Travis
> test configuration at https://github.com/vlsi/postgres
> Do you think those Travis changes can be merged to the upstream?
>
> I mean the following:
> 1) Activate Travis CI integration for the
> https://github.com/postgres/postgres mirror.
> 2) Add the relevant Travis CI file so it checks postgresql's regression
> suite, and the other
> 3) Add a build badge to the readme (link to the Travis CI build) for
> simplified navigation to test results.
>
> Here's the commit: https://github.com/vlsi/postgres/commit/4841f8bc00b7c6717d91f51c98979ce84b4f7df3
> Here's how the test results look: https://travis-ci.org/vlsi/postgres

Nice work, +1!

Dave Cramer
da...@postgresintl.com
www.postgresintl.com
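For readers unfamiliar with Travis CI, a minimal configuration for this kind of check might look roughly like the following. This is an illustrative sketch, not the contents of the commit linked above; the package list and configure flags are assumptions:

```yaml
# Sketch: build PostgreSQL from source and run its regression suite.
language: c
addons:
  apt:
    packages:
      - libreadline-dev
      - zlib1g-dev
      - flex
      - bison
script:
  - ./configure --enable-cassert --enable-debug
  - make -j2
  - make check
```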
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Mon, Aug 8, 2016 at 5:23 PM, Amit Kapila wrote:
>>> Your other options and the option you choose are same.
>>
>> Sorry, typo. I meant ERRCODE_INVALID_OBJECT_DEFINITION.
>
> Did you consider using ERRCODE_UNDEFINED_COLUMN with error messages
> like "type %u does not exist" or "type id %u does not exist"? Errorcode
> ERRCODE_UNDEFINED_COLUMN seems to be used for syscache lookup failure
> cases at certain places in the code.

But I think OBJECT will be more appropriate than COLUMN in this place. However, I can change the error message to "type id %u does not exist" if that seems better?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Mon, Aug 8, 2016 at 5:11 PM, Amit Kapila wrote:
> On Mon, Aug 8, 2016 at 4:58 PM, Dilip Kumar wrote:
>>
>> On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas wrote:
>>>
>>> I think that's just making life difficult. If nothing else, sqlsmith
>>> hunts around for functions it can call that return internal errors,
>>> and if we refuse to fix all of them to return user-facing errors, then
>>> it's just crap for the people running sqlsmith to sift through, and
>>> it's a judgment call whether to fix each particular case. Even aside
>>> from that, I think it's much better to have a clear and unambiguous
>>> rule that elog is only for can't-happen things, not
>>> we-don't-recommend-it things.
>>
>> I have changed all these functions to report a more appropriate error
>> with ereport.
>>
>> I used the ERRCODE_WRONG_OBJECT_TYPE error code for reporting such
>> errors. I think this is the closest error code among all existing
>> error codes; other options can be (ERRCODE_WRONG_OBJECT_TYPE).
>
> Your other options and the option you choose are same.

Did you consider using ERRCODE_UNDEFINED_COLUMN with error messages like "type %u does not exist" or "type id %u does not exist"? Errorcode ERRCODE_UNDEFINED_COLUMN seems to be used for syscache lookup failure cases at certain places in the code.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Mon, Aug 8, 2016 at 4:58 PM, Dilip Kumar wrote:
>
> On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas wrote:
>>
>> I think that's just making life difficult. If nothing else, sqlsmith
>> hunts around for functions it can call that return internal errors,
>> and if we refuse to fix all of them to return user-facing errors, then
>> it's just crap for the people running sqlsmith to sift through, and
>> it's a judgment call whether to fix each particular case. Even aside
>> from that, I think it's much better to have a clear and unambiguous
>> rule that elog is only for can't-happen things, not
>> we-don't-recommend-it things.
>
> I have changed all these functions to report a more appropriate error
> with ereport.
>
> I used the ERRCODE_WRONG_OBJECT_TYPE error code for reporting such
> errors. I think this is the closest error code among all existing
> error codes; other options can be (ERRCODE_WRONG_OBJECT_TYPE).

Your other options and the option you choose are same.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas wrote:
> I think that's just making life difficult. If nothing else, sqlsmith
> hunts around for functions it can call that return internal errors,
> and if we refuse to fix all of them to return user-facing errors, then
> it's just crap for the people running sqlsmith to sift through, and
> it's a judgment call whether to fix each particular case. Even aside
> from that, I think it's much better to have a clear and unambiguous
> rule that elog is only for can't-happen things, not
> we-don't-recommend-it things.

I have changed all these functions to report a more appropriate error with ereport.

I used the ERRCODE_WRONG_OBJECT_TYPE error code for reporting such errors. I think this is the closest error code among all existing error codes; other options can be (ERRCODE_WRONG_OBJECT_TYPE). But I think ERRCODE_WRONG_OBJECT_TYPE is the better option. Patch attached for the same.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

cache_lookup_failure_v1.patch
Description: Binary data
Re: [HACKERS] Wait events monitoring future development
On Sun, Aug 7, 2016 at 5:33 PM, Ilya Kosmodemiansky wrote:
> Hi,
>
> I've summarized the Wait events monitoring discussion at the Developer
> unconference in Ottawa this year on the wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
> (Thanks to Alexander Korotkov for patiently pushing me to get this
> thing finally done.)
>
> If you attended, feel free to point out anything I missed, and I will
> put it on the wiki too.

Thanks for the summarization.

> Wait event monitoring looks once again stuck on the way through
> community approval, in spite of the huge progress made last year in
> that direction. The importance of the topic is beyond discussion now:
> if you talk to any PostgreSQL person about implementing such a tool in
> Postgres and the person does not get excited, you are probably talking
> to a full-time PostgreSQL developer ;-) Obviously it needs a better
> design, both the user interface and the implementation, and perhaps
> this is why full-time developers are still sceptical.
>
> In order to move forward, imho we need at least the following steps,
> which can be done in parallel:
>
> 1. Further requirements need to be collected from DBAs.
>
> If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres - you are an ideal person to share your
> experience, but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing is needed, and in
> different environments.

I think it is better to first go with a knob whose default value is off. We can do the performance testing as well, and if by the end of the release nobody has reported any visible regression, then we can discuss changing the default to on.

> According to the developers, the overhead is small, but many people
> have doubts that it can be much more significant for intensive
> workloads. Obviously, it is not an easy task to test, because you need
> to put doubtfully non-production-ready code into mission-critical
> production for such tests.
>
> As a result it will be clear whether this design should be abandoned
> and we need to think about less-invasive solutions, or this design is
> acceptable.

I think the main objection here was that gettimeofday can cause a performance regression, which can be taken care of by using a configurable knob. I am not aware of any other part of the design having been discussed in enough detail to conclude whether it has any obvious problem.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
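As a rough illustration of the kind of measurement involved, the per-call cost of a clock read can be estimated from user space as below. This is an illustrative sketch only: it says nothing about the patch's actual overhead, and Python's `perf_counter` merely stands in for `gettimeofday()`:

```python
import time

def timer_overhead(calls=100_000):
    """Estimate the average cost of one clock read by timing a tight loop."""
    start = time.perf_counter()
    for _ in range(calls):
        time.perf_counter()  # stand-in for gettimeofday()
    elapsed = time.perf_counter() - start
    return elapsed / calls  # seconds per clock read

if __name__ == "__main__":
    print("~%.0f ns per clock read" % (timer_overhead() * 1e9))
```

The real question in the thread is how such a per-event cost accumulates under an intensive workload, which only testing the patch itself can answer.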
Re: [HACKERS] Slowness of extended protocol
>> We could call this "protocol 3.1" since it doesn't break backwards
>> compatibility (no incompatible server-initiated message changes, but it
>> does include a feature that won't be supported by servers which only
>> support 3.0). This could be a sort of "semantic versioning" for the
>> protocol - optional new client-initiated features are a minor version
>> bump, others are a major version bump...
>
> Adding a new message is not backward compatible, since it will fail in
> pgbouncer kinds of deployments. Suppose there's a "new backend", "old
> pgbouncer", "new client" deployment. If the client tries to send the
> new message, it will fail, since pgbouncer would have no idea what to
> do with that new message.

That's definitely a valid point. But do you think it's a strong enough argument to avoid ever adding new messages?

> On the other hand, usage of some well-defined statement name to trigger
> the special case would be fine: all pgbouncer versions would pass those
> parse/bind/exec messages as if they were regular messages.

Can you elaborate on what that means exactly? Are you proposing to somehow piggyback on an existing message (e.g. Parse) but use some special statement name that would make PostgreSQL interpret it as a different message type? Apart from being a pretty horrible hack, it would still break pgbouncer, which has to actually inspect and understand SQL being sent to the database (e.g. to know when transactions start and stop).
Re: [HACKERS] handling unconvertible error messages
On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI wrote:

> I don't see charset compatibility to be easily detectable,

In the worst case we can hardcode an explicit compatibility table. There is a limited set of languages which have translated error messages, and a limited (albeit wide) set of encodings supported by PostgreSQL. So it is possible to define a complete list of encodings compatible with each translation, and fall back to untranslated messages if the client encoding is not in this list.

> because locale (or character set) is not a matter of PostgreSQL
> (except for some encodings bound to one particular character
> set)... So the conversion-fallback might be the only available
> solution.

Conversion fallback may be a solution for data. For NLS messages I think it is better to fall back to English (untranslated) messages than to use transliteration or something alike. I think that for now we can assume that the best effort is already done for the data, and think about how to improve the situation with messages.
Re: [HACKERS] Refactoring of heapam code.
08.08.2016 03:51, Michael Paquier:
> On Sat, Aug 6, 2016 at 2:56 AM, Alvaro Herrera wrote:
>> Anastasia Lubennikova wrote:
>>> So there is a couple of patches. They do not cover all the mentioned
>>> problems, but I'd like to get feedback before continuing.
>>
>> I agree that we could improve things in this general area, but I do
>> not endorse any particular changes you propose in these patches; I
>> haven't reviewed your patches.
>
> Well, I am interested in the topic. And just had a look at the patches
> proposed.
>
> + /*
> +  * Split PageGetItem into a set of different macros
> +  * in order to make code more readable.
> +  */
> +#define PageGetItemHeap(page, itemId) \
> +( \
> +	AssertMacro(PageIsValid(page)), \
> +	AssertMacro(ItemIdHasStorage(itemId)), \
> +	(HeapTupleHeader)(((char *)(page)) + ItemIdGetOffset(itemId)) \
> +)
>
> Placing into bufpage.h macros that are specific to heap and indexes is
> not that much a good idea. I'd suggest having the heap ones in
> heapam.h, and the index ones in a more generic place, such as
> src/access/common as already mentioned by Alvaro.
>
> + onpage_tup = PageGetItemHeapHeaderOnly(page, newitemid);
>
> Just PageGetItemHeapHeader would be fine.
>
> @@ -2324,7 +2087,6 @@ FreeBulkInsertState(BulkInsertState bistate)
>  	pfree(bistate);
>  }
> -
>  /*
>   * heap_insert - insert tuple into a heap
>
> Those patches have some noise. Patch 0002 is doing two things:
> reorganizing the order of the routines in heapam.c and moving some
> routines from heapam.c to hio.c. Those two could be separate patches.
>
> If the final result is to be able to extend custom AMs for tables, we
> should have a structure like src/access/common/index.c,
> src/access/common/table.c and src/access/common/relation.c for example,
> and have headers dedicated to them. But yes, there is definitely a lot
> of room for refactoring as we are aiming, at least I guess so, at
> having more various code paths for access methods.

Thank you for the review, I'll fix these problems in the final version. Posting the first message, I intended to raise the discussion. Patches were attached mainly to illustrate the problem and to be more concrete.

Thank you for the attention and feedback. Since there are no objections to the idea in general, I'll write an exhaustive README about the current state of the code and then propose an API design to discuss the details. Stay tuned.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] handling unconvertible error messages
On Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI wrote:

> At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI wrote in
> <20160808.171821.100221089.horiguchi.kyot...@lab.ntt.co.jp>
>
> Somewhat wrong. The core problem is that the procedures offered by
> PrepareClientEncoding are chosen only on an encoding-to-encoding
> basis, not counting character set compatibility. So, currently
> this is not detectable before actually doing conversion of a
> character stream.

Yes, my idea was to check language/encoding compatibility: make sure that NLS messages can be represented in the client-specified encoding in a readable way.

As far as I know, there is no platform-independent bulletproof way to do so. On Unix you can try to initialize a locale with the given language and the given encoding, but it can fail even if the encoding is compatible with the language, simply because the corresponding locale is not generated on this system. But this seems to be a problem of system administration and can be left to local sysadmins.

Once you have correctly initialized LC_MESSAGES, you don't need encoding conversion routines for the NLS messages. You can use the bind_textdomain_codeset function to provide messages in the client-desired encoding. (But this can cause problems with server logs, where messages from different sessions would come in different encodings.)

On Windows things are more complicated. There is just one ANSI code page associated with a given language, and locale initialization would fail with any other codepage, including UTF-8.
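The Unix-side check described above (try to initialize a locale for the language/encoding pair, fall back to untranslated messages if that fails) can be sketched like this. It is illustrative only, and inherits the stated caveat: a perfectly compatible pair still fails when the locale is not generated on the system.

```python
import locale

def nls_compatible(lang, encoding):
    """Return True if LC_MESSAGES can be initialized for lang.encoding.

    If this fails, the server would fall back to untranslated
    (English) messages for the session.
    """
    saved = locale.setlocale(locale.LC_MESSAGES)  # query current setting
    try:
        locale.setlocale(locale.LC_MESSAGES, "%s.%s" % (lang, encoding))
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_MESSAGES, saved)  # restore the old locale
```

Note that `LC_MESSAGES` is not available on Windows, which is exactly the platform divergence discussed above.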
Re: [HACKERS] handling unconvertible error messages
At Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20160808.181154.252052789.horiguchi.kyot...@lab.ntt.co.jp>
> At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI wrote in
> <20160808.171821.100221089.horiguchi.kyot...@lab.ntt.co.jp>
>> Looking at check_client_encoding(), the comment says the following.
>>
>> | * If we are not within a transaction then PrepareClientEncoding will not
>> | * be able to look up the necessary conversion procs. If we are still
>> | * starting up, it will return "OK" anyway, and InitializeClientEncoding
>> | * will fix things once initialization is far enough along. After
>>
>> We should overcome this to realize a startup-time check for
>> conversion procs.
>
> Somewhat wrong. The core problem is that the procedures offered by
> PrepareClientEncoding are chosen only on an encoding-to-encoding
> basis, not counting character set compatibility. So, currently
> this is not detectable before actually doing conversion of a
> character stream.
>
> Conversely, providing a means to check character-set
> compatibility will naturally fix this. A check at session startup
> (an out-of-transaction check?) is still another problem.

I don't see charset compatibility to be easily detectable, because locale (or character set) is not a matter of PostgreSQL (except for some encodings bound to one particular character set)... So the conversion-fallback might be the only available solution.

Thoughts?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Refactoring of heapam code.
05.08.2016 20:56, Alvaro Herrera:
> Anastasia Lubennikova wrote:
>> Working on page compression and some other issues related to access
>> methods, I found out that the code related to heap looks too
>> complicated. Much more complicated than it should be. Since I anyway
>> got into this area, I want to suggest a set of improvements.
>
> Hm. I'm hacking on heapam.c pretty heavily ATM. Your first patch
> causes no problem I think, or if it does it's probably pretty minor,
> so that's okay. I'm unsure about the second one -- I think it's okay
> too, because it mostly seems to be about moving stuff from heapam.c to
> hio.c and shuffling some code around that I don't think I'm modifying.

I'm sure these patches will not cause any problems. The second one is mostly about moving unrelated stuff from heapam.c to hio.c and renaming a couple of functions in heap.c. Anyway, they are not a final solution, just a proof of concept.

By the way, I thought about looking at the mentioned patch and probably reviewing it, but didn't find any of your patches on the current commitfest. Could you point out the thread?

>> Also I think that it's quite strange to have a function heap_create()
>> that is called from index_create(). It has absolutely nothing to do
>> with the heap access method.
>
> Agreed. But changing its name while keeping it in heapam.c does not
> really improve things enough. I'd rather have it moved elsewhere
> that's not specific to "heaps" (somewhere in access/common perhaps).
> However, renaming the functions would break third-party code for no
> good reason. I propose that we only rename things if we also break it
> for other reasons (say because the API changes in a fundamental way).

Yes, I agree that it should be moved to another file. Just to be more accurate: it's not in heapam.c now, it is in "src/backend/catalog/heap.c", which requires many more changes than I did.

Concerning the third-party code: it's a good observation and we definitely should keep it in mind. Nevertheless, I doubt that there are a lot of external calls to these functions. And I hope this thread will end up with fundamental changes introducing a clear API for all the mentioned problems.

>> One more thing that requires refactoring is the "RELKIND_RELATION"
>> name. We have a type of relation called "relation"...
>
> I don't see us fixing this bit, really. Historical baggage and all
> that. For example, a lot of SQL queries use "WHERE relkind = 'r'". If
> we change the name, we should probably also change the relkind
> constant, and that would break the queries. If we change the name and
> do not change the constant, then we have to have a comment "we call
> them RELKIND_TABLE but the char is r because it was called RELATION
> previously", which isn't any better.

Good point. I'd rather change it to RELKIND_BASIC_RELATION or something like that, which is more specific and still goes well with the 'r' constant. But minor changes like that can wait until we clarify the API in general.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] handling unconvertible error messages
On Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI wrote:

> I'm not sure what messages may be raised before authentication,
> but it can be a more generic solution. (Adding a check during
> on-session.)

Definitely, there can be an authentication error message, which is sent if authentication didn't happen. Also, as far as I understand, the message "Database ... doesn't exist" is also sent before authentication. Also, there are CancelRequests, where normal authentication is not used, and the server key provided in another session is used instead.
Re: [HACKERS] handling unconvertible error messages
At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20160808.171821.100221089.horiguchi.kyot...@lab.ntt.co.jp>
> Looking at check_client_encoding(), the comment says the following.
>
> | * If we are not within a transaction then PrepareClientEncoding will not
> | * be able to look up the necessary conversion procs. If we are still
> | * starting up, it will return "OK" anyway, and InitializeClientEncoding
> | * will fix things once initialization is far enough along. After
>
> We should overcome this to realize a startup-time check for
> conversion procs.

Somewhat wrong. The core problem is that the procedures offered by PrepareClientEncoding are chosen only on an encoding-to-encoding basis, not counting character set compatibility. So, currently this is not detectable before actually doing conversion of a character stream.

Conversely, providing a means to check character-set compatibility will naturally fix this. A check at session startup (an out-of-transaction check?) is still another problem.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)
05.08.2016 19:41, Robert Haas:
> 2. This inserts additional code in a bunch of really low-level places
> like heap_hot_search_buffer, heap_update, heap_delete, etc. I think
> what you've basically done here is create a new, in-memory heap AM and
> then, because we don't have an API for adding new storage managers,
> you've bolted it onto the existing heapam layer. That's certainly a
> reasonable approach for creating a PoC, but I think we actually need a
> real API here. Otherwise, when the next person comes along and wants
> to add a third heap implementation, they've got to modify all of these
> same places again. I don't think this code is reasonably maintainable
> in this form.

As I can see, you recommend cleaning up the API of the storage management code. I strongly agree that it's high time to do it. So, I started a discussion about refactoring and improving the API of heapam and heap relations. You can find it on the commitfest: https://commitfest.postgresql.org/10/700/

I'll be glad to see your thoughts on the thread.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] per-statement-level INSTEAD OF triggers
Hi hackers, I'm asking out of curiosity: does anyone know why we don't have per-statement-level INSTEAD OF triggers? I looked into the SQL standard (20xx draft), but I can't find a restriction that INSTEAD OF triggers must be row-level. Are there any technical difficulties, or other reasons for the current implementation? Best regards, -- Yugo Nagata
Re: [HACKERS] handling unconvertible error messages
At Mon, 8 Aug 2016 10:19:10 +0300, Victor Wagner wrote in <20160808101910.49bee...@fafnir.local.vm> > On Fri, 5 Aug 2016 11:21:44 -0400 > Peter Eisentraut wrote: > > > On 8/4/16 2:45 AM, Victor Wagner wrote: > > > 4. At session startup, try to reinitialize the LC_MESSAGES locale > > > category with the combination of the server (or better, client-sent) language and region and the > > > client-supplied encoding, and if this fails, use untranslated error > > > messages. Obviously, an attempt to set the locale to ru_RU.ISO8859-1 would > > > fail; so, if a client were to ask a server with a ru_RU.UTF-8 default > > > locale to use LATIN1 encoding, the server would fall back to > > > untranslated messages. > > > I think this is basically my solution (1), with the same problems. > > I think that there is a big difference from the server's point of view. > > You propose that both the translated and untranslated message should be > passed around inside the backend. It has some benefits, but requires > considerable reworking of server internals. Agreed. > My solution doesn't require keeping both the original message and the > translated one during all the call-stack unwinding. It just checks whether the > combination of language and encoding is supported by the NLS subsystem, > and if not, falls back to untranslated messages for the entire session. Looking at check_client_encoding(), the comment says the following. | * If we are not within a transaction then PrepareClientEncoding will not | * be able to look up the necessary conversion procs. If we are still | * starting up, it will return "OK" anyway, and InitializeClientEncoding | * will fix things once initialization is far enough along. After We should overcome this to realize a startup-time check for conversion procs. > It is a much more local change and is comparable in complexity with the one > proposed by Tom Lane. I'm not sure what messages may be raised before authentication, but it could be a more generic solution. (Adding the check during the session.)
regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
Tom Lane : > FWIW, this thread started on 25-July, less than two weeks ago. Technically speaking, there was a pgsql-jdbc thread started on May 14: https://www.postgresql.org/message-id/nh72v6%24582%241%40ger.gmane.org 9.6beta1 was released on May 12. > The fact that it wasn't raised till more than 6 months after we committed the pg_am changes This means that nobody was testing the compatibility of PostgreSQL's master branch with existing third-party clients. Testing against well-known clients makes sense to catch bugs early. I added a "build postgresql from master branch" test to pgjdbc's regression suite a week ago, so I hope it will highlight issues early (even before an official postgresql beta is released). However, pgjdbc tests are executed only for pgjdbc commits, so if there are no pgjdbc changes, there is no logic to trigger "try newer postgres with current pgjdbc". Ideally, postgresql's regression suite should validate well-known clients as well. I have no idea how long it would take to add something to postgresql's buildfarm, so I just went ahead and created a Travis test configuration at https://github.com/vlsi/postgres Do you think those Travis changes can be merged to the upstream? I mean the following: 1) Activate Travis CI integration for the https://github.com/postgres/postgres mirror. 2) Add the relevant Travis CI file so it checks postgresql's regression suite, and the others. 3) Add a build badge to the readme (a link to the Travis CI build) for simplified navigation to test results. Here's the commit: https://github.com/vlsi/postgres/commit/4841f8bc00b7c6717d91f51c98979ce84b4f7df3 Here's what the test results look like: https://travis-ci.org/vlsi/postgres Vladimir
Re: [HACKERS] Possible duplicate release of buffer lock.
Hello, At Sat, 06 Aug 2016 12:45:32 -0400, Tom Lane wrote in <16748.1470501...@sss.pgh.pa.us> > Amit Kapila writes: > > Isn't the problem here that, due to some reason (like a concurrent > > split), the code in question (the loop -- > > while (P_ISDELETED(opaque) || opaque->btpo_next != target)) releases > > the lock on the target buffer, and the caller again tries to release the > > lock on the target buffer when false is returned? > > Yeah. I do not think there is anything wrong with the loop-to-find-current-leftsib logic. The problem is lack of care about what > resources are held on failure return. The sole caller thinks it > should do "_bt_relbuf(rel, buf)" --- that is, drop lock and pin on > what _bt_unlink_halfdead_page calls the leafbuf. But that function > already dropped the lock (line 1582 in 9.4), and won't have reacquired > it unless target != leafblkno. Another problem is that if the target > is different from leafblkno, we'll hold a pin on the target page, which > is leaked entirely in this code path. > > The caller knows nothing of the "target" block so it can't reasonably > deal with all these cases. I'm inclined to change the function's API > spec to say that on failure return, it's responsible for dropping > lock and pin on the passed buffer. We could alternatively reacquire > the lock before returning, but that seems pointless. Agreed. > Another thing that looks fishy is at line 1611: > > leftsib = opaque->btpo_prev; > > At this point we already released the lock on leafbuf, so it seems pretty > unsafe to fetch its left-link. And it's unnecessary because we already > did; the line should be just > > leftsib = leafleftsib; Right. And I found that it has already been committed. Thanks. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
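The revised API contract discussed above (on failure return, the callee is responsible for dropping the lock and pin on the buffer it was passed) can be sketched like this. A toy Python model, not the actual C functions; the names are illustrative only.

```python
class Buffer:
    """Toy model of a buffer with a pin and a content lock."""
    def __init__(self):
        self.pinned = True
        self.locked = True
    def release(self):             # analogous to _bt_relbuf: drop lock and pin
        self.pinned = self.locked = False

def unlink_halfdead_page(leafbuf: Buffer, fail: bool) -> bool:
    """Sketch of the revised spec: on failure return, this function has
    already dropped the lock and pin on leafbuf, so the caller must not
    release it again."""
    if fail:
        leafbuf.release()          # callee cleans up before returning False
        return False
    return True                    # on success, the caller releases leafbuf

buf = Buffer()
if unlink_halfdead_page(buf, fail=True):
    buf.release()                  # caller releases only on success
assert not buf.pinned and not buf.locked   # no leaked pin or double release
```

Exactly one party owns the cleanup on each path, which is what the original code got wrong: the caller called _bt_relbuf on a buffer whose lock the callee had already dropped.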
Re: [HACKERS] handling unconvertible error messages
On Fri, 5 Aug 2016 11:21:44 -0400 Peter Eisentraut wrote: > On 8/4/16 2:45 AM, Victor Wagner wrote: > > 4. At session startup, try to reinitialize the LC_MESSAGES locale > > category with the combination of the server (or better, client-sent) language and region and the > > client-supplied encoding, and if this fails, use untranslated error > > messages. Obviously, an attempt to set the locale to ru_RU.ISO8859-1 would > > fail; so, if a client were to ask a server with a ru_RU.UTF-8 default > > locale to use LATIN1 encoding, the server would fall back to > > untranslated messages. > > I think this is basically my solution (1), with the same problems. > I think that there is a big difference from the server's point of view. You propose that both the translated and untranslated message should be passed around inside the backend. It has some benefits, but requires considerable reworking of server internals. My solution doesn't require keeping both the original message and the translated one during all the call-stack unwinding. It just checks whether the combination of language and encoding is supported by the NLS subsystem, and if not, falls back to untranslated messages for the entire session. It is a much more local change and is comparable in complexity with the one proposed by Tom Lane.
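A minimal Python sketch of the session-startup check described above: decide once whether the language/encoding combination can work, and fall back to untranslated messages for the whole session if it cannot. The catalog, function name, and message strings are made up for illustration; the real check would consult the NLS subsystem (setlocale), with encoding a sample translated message standing in for it here.

```python
# Hypothetical translated-message catalog (language -> sample message).
CATALOG = {
    "ru": "Ошибка: отношение не существует",
    "en": "ERROR: relation does not exist",
}

def session_translation_ok(lang: str, client_encoding: str) -> bool:
    """Decide once, at session start, whether translated messages can be
    delivered in the client encoding; otherwise the entire session uses
    untranslated (English) messages."""
    try:
        CATALOG[lang].encode(client_encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False

# ru_RU.UTF-8 translations cannot be delivered to a LATIN1 client ...
assert session_translation_ok("ru", "iso-8859-1") is False
# ... but the same translations work fine for a UTF-8 client.
assert session_translation_ok("ru", "utf-8") is True
```

The point of the per-session flag is locality: no call site inside the backend needs to carry both a translated and an untranslated copy of each message.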
Re: [HACKERS] handling unconvertible error messages
On Fri, 5 Aug 2016 11:23:37 -0400 Peter Eisentraut wrote: > On 8/4/16 9:42 AM, Tom Lane wrote: > > I'm inclined to think that we should reset the message locale > > to C as soon as we've forked away from the postmaster, and leave it > > that way until we've absorbed settings from the startup packet. > > Sending messages of this sort in English isn't great, but it's > > better than sending completely-unreadable ones. Or is that just my > > English-centricity showing? > > Well, most of the time this all works; you might have problems only if there are different > client and server settings. We wouldn't want > to partially disable the NLS feature for the normal case. There are cases where the client cannot tell the server which encoding it wants to use, and the server cannot tell which encoding it uses, but it can still send error messages. For example, CancelRequest. The only way to ensure that the message is readable in this case is to fall back to some encoding that is definitely known to both client and server. And for now that is US-ASCII. This is, as far as I understand, what Tom is proposing: fall back to the untranslated message at the beginning of the session, and return to NLS only once the encoding has been successfully negotiated between client and server. Maybe there is another possible solution - prepare the client to accept UTF-8 messages from the server regardless of encoding, i.e. if a message starts with a BOM marker (the U+FEFF Unicode character, the EF BB BF byte sequence in UTF-8), interpret it as UTF-8. This would require the client to support some kind of encoding conversion, and in some 8-bit environments it would pose problems with displaying these messages.
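The BOM idea above can be sketched on the client side like this (a hedged illustration in Python; the function name and framing are hypothetical, not libpq behavior): if a server message starts with the EF BB BF sequence, decode it as UTF-8 regardless of the negotiated session encoding.

```python
BOM_UTF8 = b"\xef\xbb\xbf"   # U+FEFF encoded in UTF-8

def decode_server_message(raw: bytes, session_encoding: str) -> str:
    """If the message starts with a UTF-8 BOM, treat it as UTF-8
    regardless of the session encoding; otherwise decode with the
    session encoding as usual."""
    if raw.startswith(BOM_UTF8):
        return raw[len(BOM_UTF8):].decode("utf-8")
    return raw.decode(session_encoding)

# A BOM-tagged UTF-8 message stays readable even in a LATIN1 session ...
msg = BOM_UTF8 + "Ошибка".encode("utf-8")
assert decode_server_message(msg, "iso-8859-1") == "Ошибка"
# ... while untagged messages keep their normal interpretation.
assert decode_server_message(b"ERROR", "iso-8859-1") == "ERROR"
```

As the message notes, this pushes an encoding-conversion requirement onto every client, which is the main cost of the approach.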
Re: [HACKERS] Slowness of extended protocol
Shay Rojansky : > That sounds right to me. As you say, the server version is sent early in > the startup phase, before any queries are sent to the backend, so frontends > know which server they're communicating with. > > We could call this "protocol 3.1" since it doesn't break backwards > compatibility (no incompatible server-initiated message changes, but it > does include a feature that won't be supported by servers which only > support 3.0). This could be a sort of "semantic versioning" for the protocol > - optional new client-initiated features are a minor version bump, others > are a major version bump... Adding a new message is not backward compatible, since it will fail in pgbouncer-style deployments. Suppose there's a "new backend", "old pgbouncer", "new client" deployment. If the client tries to send the new message, it will fail, since pgbouncer would have no idea what to do with that new message. On the other hand, usage of some well-defined statement name to trigger the special case would be fine: all pgbouncer versions would pass those parse/bind/execute messages through as if they were regular messages. Vladimir
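The compatibility argument can be made concrete with a toy model of a protocol intermediary (all names, type bytes beyond the real P/B/E/Q, and the statement name are hypothetical): an old pooler rejects a message type it has never seen, but a magic statement name rides inside an ordinary Parse message and passes through untouched.

```python
# Toy pgbouncer-like relay: it routes messages by their type byte and
# must reject any type it does not know about.
KNOWN_TYPES = {"P": "Parse", "B": "Bind", "E": "Execute", "Q": "Query"}

def relay(msg_type: str, payload: str) -> str:
    if msg_type not in KNOWN_TYPES:
        raise ValueError(f"unknown frontend message type {msg_type!r}")
    return f"{KNOWN_TYPES[msg_type]}:{payload}"   # forwarded unchanged

# A brand-new message type breaks the old intermediary ...
try:
    relay("Z", "new-feature")
    forwarded = True
except ValueError:
    forwarded = False
assert forwarded is False

# ... but a special, well-known statement name travels inside an
# ordinary Parse message, so every intermediary version forwards it.
assert relay("P", "stmt=__server_cached__") == "Parse:stmt=__server_cached__"
```

This is why the statement-name convention is deployable immediately, while a new message type effectively requires a protocol version bump across the whole chain.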