Re: [HACKERS] Extensions makefiles - coverage
On Sunday 22 September 2013 01:34:53 Peter Eisentraut wrote: On Thu, 2013-07-25 at 17:07 +0200, Ronan Dunklau wrote:

I am using approximately the layout that was proposed here: http://www.postgresql.org/message-id/51bb1b6e.2070...@dunslane.net It looks like everything is hard-coded to take the source and the gcda, gcno files in the base directory, but these files lie in a src directory with the proposed layout.

The PostgreSQL build system isn't going to work very well if you build files outside of the current directory. If you want to put your source files into a src/ subdirectory, then your top-level makefile should do a $(MAKE) -C src, and you need to have a second makefile in the src directory. If you do that, then the existing coverage targets will work alright, I think.

The PGXS build system allows for the definition of an OBJS variable, which works fine with almost every other make target.

Maybe we need to take a step back, and think about what kind of extension layouts we want to support? At the time of this writing, the HOW TO on http://manager.pgxn.org/howto strongly encourages putting all C files in a src directory. As a result, many extensions on pgxn use this layout. It would be great not to have to change them to measure code coverage.
Re: [HACKERS] logical changeset generation v6
On 2013-09-23 23:12:53 -0400, Robert Haas wrote: What exactly is the purpose of this tool? My impression is that the output of logical replication is a series of function calls to a logical replication plugin, but does that plugin necessarily have to produce an output format that gets streamed to a client via a tool like this?

There needs to be a client acking the reception of the data in some form. There's currently two output methods, SQL and walstreamer, but there could easily be further ones; it's basically two functions you have to write. There are several reasons I think the tool is useful, starting with the fact that it makes the initial use of the feature easier. Writing a client for CopyBoth messages wrapping 'w' style binary messages, with the correct select() loop, isn't exactly trivial. I also think it's actually useful in real scenarios where you want to ship the data to a remote system for auditing purposes.

For example, for replication, I'd think you might want the plugin to connect to a remote database and directly shove the data in;

That sounds like a bad idea to me. If you pull the data from the remote side, you get the data in a streaming fashion and the latency sensitive part of issuing statements to your local database is done locally. Doing things synchronously like that also makes it way harder to use synchronous_commit = off on the remote side, which is a tremendous efficiency win. If somebody needs something like this, e.g. because they want to replicate into hundreds of shards depending on some key or such, the question I don't know the answer to is how to actually initiate the streaming. Somebody would need to start the logical decoding.

for materialized views, we might like to push the changes into delta relations within the source database.

Yes, that's not a bad usecase and I think the only thing missing to use output plugins that way is a convenient function to tell up to where data has been received (aka synced to disk, aka applied).

In either case, there's no particular need for any sort of client at all, and in fact it would be much better if none were required. The existence of a tool like pg_receivellog seems to presuppose that the goal is to spit out logical change records as text, but I'm not sure that's actually going to be a very common thing to want to do...

It doesn't really rely on anything being text - I've used it with a binary plugin without problems. Obviously you might not want to use -f -, but an actual file instead...

Greetings,

Andres Freund
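For readers who have not written one, a minimal output plugin skeleton looks roughly like the sketch below. The callback names (begin_cb, change_cb, commit_cb) and the OutputPlugin* helpers are taken from the logical decoding API as it eventually shipped in 9.4, so details may differ from the v6 patch under discussion here; the textual output is purely illustrative.

#include "postgres.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

/* Emit a marker at transaction start. */
static void
demo_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfoString(ctx->out, "BEGIN");
	OutputPluginWrite(ctx, true);
}

/* Emit one line per decoded change; a real plugin would inspect the tuples. */
static void
demo_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
			Relation relation, ReorderBufferChange *change)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfoString(ctx->out, "change");
	OutputPluginWrite(ctx, true);
}

/* Emit a marker at transaction commit. */
static void
demo_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
			XLogRecPtr commit_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfoString(ctx->out, "COMMIT");
	OutputPluginWrite(ctx, true);
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	cb->begin_cb = demo_begin;
	cb->change_cb = demo_change;
	cb->commit_cb = demo_commit;
}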
Re: [HACKERS] record identical operator
On 2013-09-23 21:21:53 -0400, Stephen Frost wrote: Here's an example to illustrate what I'm talking about when it comes down to "you can't claim that you'll always produce exactly what the query would" with these types:

=# table citext_table;
 id | name
----+-------
  1 | one
  3 | three
  5 |
  4 | Three
  2 | Three
(5 rows)

=# create MATERIALIZED VIEW citext_matview AS select name from citext_table where id > 0 group by name;
SELECT 3
=# table citext_matview;
 name
-------
 one
 three

(3 rows)

I don't really understand why you have a problem with this specific thing here. Kevin's goal seems to be to make materialized views behave consistently with the way a plain view would if the matview had just been refreshed fully, concurrently or incrementally, and there have been no further data changes.

SELECT * FROM table WHERE candidate_key IN (...); used in a view or plainly currently guarantees you that you get the original casing for citext. And if you e.g. have some additional expensive joins, such a matview can very well make sense. What does it matter that ... GROUP BY citext_col; doesn't return the same casing consistently? You're aggregating, not accessing the original data. If you need to separate the different casings now, cast to text.

Now, I can perfectly understand having a bit of architectural qualms about Kevin's approach of doing things on the SQL level. But this doesn't seem to be the thread to discuss that. FWIW, I can't foresee any realistic approach that actually won't do such comparisons (although not necessarily through C).

Greetings,

Andres Freund
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On Tue, Sep 24, 2013 at 5:16 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Sep 20, 2013 at 11:28 PM, David Rowley dgrowle...@gmail.com wrote:

I put the above results into the attached spreadsheet. On my Intel i5 laptop I'm seeing a slow down of about 1 second per million queries for the longer log_line_prefix and about 1 second per 5 million queries with the shorter log_line_prefix. In the attached spreadsheet I've mocked up some very rough estimates on the performance cost of this change. I think a while ago I read about a benchmark Robert ran on a 64 core machine which ran around 400k transactions per second. I've included some workings in the spreadsheet to show how this change would affect that benchmark, but I have assumed that the hardware would only be able to execute the log_line_prefix function at the same speed as my i5 laptop. With these very worst case estimates my calculations say that the performance hit is 0.6% with the log_line_prefix which contains all of the variables and 0.11% for the shorter log_line_prefix. I would imagine that the hardware that performed 400k TPS would be able to run log_line_prefix faster than my 3 year old i5 laptop, so this is likely quite a big overestimation of the hit.

Well, on those tests, I was hardly logging anything, so it's hard to really draw any conclusions that way. I think the real concern with this patch, performance-wise, is what happens in environments where there is a lot of logging going on. We've had previous reports of people who can't even enable the logging they want because the performance cost is unacceptably high. That's why we added that logging hook in 9.2 or 9.3, so that people can write alternative loggers that just throw away messages if the recipient can't eat them fast enough. I wouldn't be keen to accept a 25% performance hit on logging overall, but log_line_prefix() is only a small part of that cost. So... I guess the question that I'd ask is, if you write a PL/pgsql function that does RAISE NOTICE in a loop a large number of times, can you measure any difference in how fast that function executes on the patched and unpatched code? If so, how much?

Ok, I've been doing a bit of benchmarking on this. To mock up a really fast I/O system I created a RAM disk which I will ask Postgres to send the log files to.

mkdir /tmp/ramdisk; chmod 777 /tmp/ramdisk
mount -t tmpfs -o size=256M tmpfs /tmp/ramdisk/

I created the following function in plpgsql:

create function stresslog(n int) returns int as $$
begin
  while n > 0 loop
    raise notice '%', n;
    n := n - 1;
  end loop;
  return 0;
end;
$$ language plpgsql;

I was running postgresql with:

david@ubuntu64:~/9.4/bin$ ./pg_ctl start -D /home/david/9.4/data -l /tmp/ramdisk/pg.log

I ran the following command 5 times for each version. client_min_messages is set to warning and log_min_messages is set to notice.

postgres=# select stresslog(10);
 stresslog
-----------
         0
(1 row)

I do see a 15-18% slow down with the patched version, so perhaps I'll need to look to see if I can speed it up a bit, although I do feel this benchmark is not quite a normal workload. Please see the results below, or if you want to play about with them a bit, please use the attached spreadsheet.
* Round 1

Patched: log_line_prefix = %a %u %d %r %h %p %t %m %i %e %c %l %s %v %x
Time: 1822.105 ms
Time: 1762.529 ms
Time: 1839.724 ms
Time: 1837.372 ms
Time: 1813.240 ms

Unpatched: log_line_prefix = %a %u %d %r %h %p %t %m %i %e %c %l %s %v %x
Time: 1519.032 ms
Time: 1528.760 ms
Time: 1484.074 ms
Time: 1552.786 ms
Time: 1569.410 ms

* Round 2

Patched: log_line_prefix = %a %u %d %e
Time: 625.969 ms
Time: 668.444 ms
Time: 648.310 ms
Time: 663.655 ms
Time: 715.397 ms

Unpatched: log_line_prefix = %a %u %d %e
Time: 598.146 ms
Time: 518.827 ms
Time: 532.858 ms
Time: 529.584 ms
Time: 537.276 ms

Regards

David Rowley
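To make the source of that overhead concrete: the patch under review adds printf-style field-width padding to log_line_prefix escapes. A hypothetical, simplified sketch of the kind of per-escape work involved (not the actual patch; the real code parses the width out of the prefix string itself) might look like this:

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Hypothetical helper: append one expanded escape value to the log line,
 * honouring an optional printf-style field width.  A padding of 0 takes
 * the cheap path that the unpadded case used before the patch.
 */
static void
append_prefix_value(StringInfo buf, int padding, const char *value)
{
	if (padding != 0)
		appendStringInfo(buf, "%*s", padding, value);
	else
		appendStringInfoString(buf, value);
}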
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - visibility semantics
Hi,

Various messages are discussing semantics around visibility. By now I have a hard time keeping track, so let's keep the discussion of the desired semantics to this thread.

There have been some remarks about serialization failures in read committed transactions. I agree, those shouldn't occur. But I don't actually think they are so much of a problem if we follow the path set by existing uses of the EPQ logic.

The scenario described seems to be an UPSERT conflicting with a row it cannot see in the original snapshot of the query. In that case I think we just have to follow the example laid down by ExecUpdate, ExecDelete and heap_lock_tuple: use the EPQ machinery (or an alternative approach with similar enough semantics) to get a new snapshot and follow the ctid chain. When we've found the end of the chain we try to update that tuple.

That surely isn't free of surprising semantics, but it would follow existing semantics, which everybody writing concurrent applications in read committed should (but doesn't) know. Adding a different set of semantics seems like a bad idea.

Robert seems to have been the primary sceptic around this; what scenario are you actually concerned about? There are some scenarios that this doesn't trivially answer. But I'd like to understand the primary concerns first.

Regards,

Andres
Re: [HACKERS] record identical operator
On 09/23/2013 10:38 PM, Stephen Frost wrote: and heavily caveated.

I'm not sure what caveats would be needed. It seems to me that a clear description of what it does would suffice. Like all the other non-default opclasses in core, it will be non-default because it is less frequently useful.

"This will claim things are different, even when they aren't different when cast to text, or possibly even when extracted in binary mode; ensure this is really what you want" is a pretty big caveat, imv.

Yes, it should be documented that it tests for sameness and gives no guarantee that a lack of sameness means the values are different (as determined by some other operator).

-- Hannu Krosing
Re: [HACKERS] ENABLE/DISABLE CONSTRAINT NAME
--On 13. September 2013 20:17:19 -0400 Robert Haas robertmh...@gmail.com wrote:

You're missing the point. Peter wasn't worried that your patch throws an error; he's concerned about the fact that it doesn't. In PostgreSQL, you can only create the following view because test1 has a primary key over column a:

= create table test1 (a int constraint pk primary key, b text);
= create view test2 as select a, b from test1 group by a;
= alter table test1 drop constraint pk;

The reason is that, if the primary key weren't there, it would be ambiguous which row should be returned from among multiple values where a is equal and b is not. If you can disable the constraint, then you can create precisely that problem.

Hmm, not sure I understand this argument either: this patch doesn't allow disabling a primary key. It only supports FKs and CHECK constraints explicitly.

-- Thanks, Bernd
Re: [HACKERS] Reasoning behind LWLOCK_PADDED_SIZE/increase it to a full cacheline
Andres Freund and...@2ndquadrant.com writes: So, what we do is we guarantee that LWLocks are aligned to 16 or 32 byte boundaries. That means that on x86-64 (64 byte cachelines, 24 bytes unpadded lwlock) two lwlocks share a cacheline.

Yup.

In my benchmarks changing the padding to 64 bytes increases performance in workloads with contended lwlocks considerably.

At a huge cost in RAM. Remember we make two LWLocks per shared buffer. I think that rather than using a blunt instrument like that, we ought to see if we can identify pairs of hot LWLocks and make sure they're not adjacent.

regards, tom lane
Re: [HACKERS] Reasoning behind LWLOCK_PADDED_SIZE/increase it to a full cacheline
On 2013-09-24 12:39:39 +0200, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: So, what we do is we guarantee that LWLocks are aligned to 16 or 32 byte boundaries. That means that on x86-64 (64 byte cachelines, 24 bytes unpadded lwlock) two lwlocks share a cacheline. In my benchmarks changing the padding to 64 bytes increases performance in workloads with contended lwlocks considerably.

At a huge cost in RAM. Remember we make two LWLocks per shared buffer. I think that rather than using a blunt instrument like that, we ought to see if we can identify pairs of hot LWLocks and make sure they're not adjacent.

That's a good point. What about making all but the shared buffer lwlocks 64 bytes? It seems hard to analyze the interactions between all the locks and keep that maintained.

Greetings,

Andres Freund
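For context, the padding being discussed is the LWLockPadded union in lwlock.c, where LWLOCK_PADDED_SIZE is today either 16 or 32 bytes depending on sizeof(LWLock). The experiment amounts to padding each lock out to a full cache line instead, roughly as in the sketch below (the value 64 is an assumption for x86-64 cache lines, not the current in-tree definition):

/* Sketch: pad each LWLock so that no two locks share a cache line.
 * Mirrors the existing LWLockPadded union in lwlock.c, but with the
 * pad grown from 16/32 bytes to a full 64-byte x86-64 cache line. */
#define LWLOCK_PADDED_SIZE	64

typedef union LWLockPadded
{
	LWLock		lock;
	char		pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;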
Re: [HACKERS] record identical operator
* Robert Haas (robertmh...@gmail.com) wrote: Your example demonstrates that if a given query can generate two different outputs, A and B, based on the same input data, then the contents of the materialized view cannot be equal to be A and also equal to B. Well, no duh. Two different outputs based on what *plan* is chosen. Of course, you don't need citext, or any other data type with a loose notion of equality, to generate that sort of problem: [...] rhaas=# set datestyle = 'german'; SET I'm talking about *planner differences* changing the results. If we've got a ton of cases where a different plan means different output, then we've got some serious problems. I'd argue that it's pretty darn clear that datestyle is going to be a *slightly* different animal. My example doesn't *require* changing any GUCs, it was just expedient for illustration. But I'm still wondering what this is intended to prove. These types are, to those that use them at least, a known quantity wrt what you get when using them in GROUP BYs, JOINs, etc. You're trying to 'fix' something that isn't really broken and you're only doing it half-baked anyway because your 'fix' isn't going to actually make these types produce consistent results. There are an infinite number of ways to write queries that produce different results, and I think we all know that materialized views aren't going to hold up very well if given such queries. That seems a poor excuse for not fixing the cases that can be made to work. New operators are not without cost, I don't think it's a good idea to expose our internal binary representations of data out to the SQL level, and the justification for adding them is that they solve a case that they don't. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] all_visible replay aborting due to uninitialized pages
On 2013-09-23 14:41:16 +0300, Heikki Linnakangas wrote: On 06.06.2013 17:22, Robert Haas wrote: On Thu, May 30, 2013 at 2:29 AM, Andres Freund and...@2ndquadrant.com wrote:

Yeah, I think it's fine. The patch also looks fine, although I think the comments could use a bit of tidying. I guess we need to back-patch this all the way back to 8.4? It will require some adjustments for the older branches.

I think 9.2 is actually far enough and it should apply there. Before that we only logged the unsetting of all_visible via heap_(insert|update|delete)'s WAL records, not the setting, as far as I can tell. So I don't immediately see a danger in 9.2.

OK. I have committed this. For 9.2, I had to backport log_newpage_buffer() and use XLByteEQ rather than ==.

I'm afraid this patch was a few bricks shy of a load. The log_newpage_buffer() function asserts that:

/* We should be in a critical section. */
Assert(CritSectionCount > 0);

But the call in vacuumlazy.c is not inside a critical section. Also, the comments in log_newpage_buffer() say that the caller should mark the buffer dirty *before* calling log_newpage_buffer(), but in vacuumlazy.c, it's marked dirty afterwards. I'm not sure what consequences that might have, but at least it contradicts the comment. (spotted this while working on a patch, and ran into the assertion on crash recovery)

What about the attached patches (one for 9.3 and master, the other for 9.2)? I've tested that I can trigger the assert before and not after by inserting faults...

Greetings,

Andres Freund

From 4fe13d241062c3aa47ddfe3cfb967d80809aa826 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 24 Sep 2013 11:58:34 +0200
Subject: [PATCH] Use critical section when ensuring empty pages are
 initialized during vacuum.

a6370fd9 tried to fix the problem that replay of XLOG_HEAP2_VISIBLE records
can fail when uninitialized pages are referenced during replay. Unfortunately
the patch/I missed the fact that log_newpage() should be used inside a
critical section, leading to an assertion failure during WAL replay when
assertions are enabled, while working otherwise.

Also fix the issue that pages should be marked dirty before calling
log_newpage_buffer() (check src/backend/access/transam/README for reasons).

Both issues noticed by Heikki.
---
 src/backend/commands/vacuumlazy.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index bb4e03e..af7cd59 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -663,6 +663,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			/* empty pages are always all-visible */
 			if (!PageIsAllVisible(page))
 			{
+				START_CRIT_SECTION();
+
+				/* dirty page before any possible XLogInsert()s */
+				MarkBufferDirty(buf);
+
 				/*
 				 * It's possible that another backend has extended the heap,
 				 * initialized the page, and then failed to WAL-log the page
@@ -682,9 +687,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 				log_newpage_buffer(buf);
 
 				PageSetAllVisible(page);
-				MarkBufferDirty(buf);
 				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
 								  vmbuffer, InvalidTransactionId);
+				END_CRIT_SECTION();
 			}
 
 			UnlockReleaseBuffer(buf);
-- 
1.8.4.21.g992c386.dirty

From 2aee405f62207bd7691fd13225ed5be62cc1033a Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 24 Sep 2013 11:58:34 +0200
Subject: [PATCH] Use critical section when ensuring empty pages are
 initialized during vacuum.

a6370fd9 tried to fix the problem that replay of XLOG_HEAP2_VISIBLE records
can fail when uninitialized pages are referenced during replay. Unfortunately
the patch/I missed the fact that log_newpage() should be used inside a
critical section, leading to an assertion failure during WAL replay when
assertions are enabled, while working otherwise.

Also fix the issue that pages should be marked dirty before calling
log_newpage_buffer() (check src/backend/access/transam/README for reasons).

Both issues noticed by Heikki.
---
 src/backend/commands/vacuumlazy.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index ff8764b..87757aa 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -650,6 +650,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			/* empty pages are always all-visible */
 			if (!PageIsAllVisible(page))
 			{
+				START_CRIT_SECTION();
+
+				/* dirty page before any possible XLogInsert()s */
+				MarkBufferDirty(buf);
+
 				/*
 				 * It's possible that another backend has extended the heap,
 				 * initialized the page, and then failed to WAL-log the
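For reference, the rule cited from src/backend/access/transam/README is that a WAL-logged page change marks the buffer dirty and inserts the WAL record inside one critical section, before setting the page LSN. A condensed sketch of that canonical pattern follows (declarations and rdata setup omitted; XLogInsert is shown with its 9.3-era rdata-chain signature):

	START_CRIT_SECTION();

	/* ... apply the change to the page in shared buffers ... */

	MarkBufferDirty(buf);		/* must happen before XLogInsert() */

	recptr = XLogInsert(RM_HEAP2_ID, info, rdata);
	PageSetLSN(BufferGetPage(buf), recptr);

	END_CRIT_SECTION();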
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Peter Eisentraut pete...@gmx.net That assumes that the conversion client encoding -> server encoding -> NCHAR encoding is not lossy.

Yes, so Tatsuo-san suggested restricting the server encoding -> NCHAR encoding combinations to those with lossless conversion.

I thought one main point of this exercise was to avoid these conversions and be able to go straight from client encoding into NCHAR.

It's slightly different. Please see the following excerpt: http://www.postgresql.org/message-id/B1A7485194DE4FDAB8FA781AFB570079@maumau

4. I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode:
- to avoid code conversion between the server and the client for performance
- because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes)

This use case is described in chapter 6 of Oracle Database Globalization Support Guide.

Regards, MauMau
Re: [HACKERS] Strange hanging bug in a simple milter
Hello,

On Mon, 23 Sep 2013, Stephen Frost wrote: I've now committed a fix for this issue.

I cloned the 9.4devel branch and linked my authmilter and a test program (based on Heikki's earlier design) against the libpq that comes with it. After hours of pretty extensive stress testing using 2, 3, 4, 5, 6, 50, 100 and 500 parallel threads, I have observed no deadlocks. Everything runs smoothly.

Many thanks to all who contributed to the fix.

Regards,

vmk

-- Tietotekniikkakeskus / Helsingin yliopisto IT department / University of Helsinki
Re: [HACKERS] record identical operator
Stephen Frost sfr...@snowman.net wrote: Skipping out on much of the side-discussion to try and drive at the heart of this.. Robert Haas (robertmh...@gmail.com) wrote: I would suggest that you read the referenced papers for details as to how the algorithm works. To make a long story short, they do need to keep track of what's changed, and how. I suppose it's good to know that I wasn't completely misunderstanding the discussion in Ottawa. However, that still seems largely orthogonal to the present discussion. It *solves* this issue, from where I'm sitting, without any binary operators at all. [ argument that the only way we should ever do REFRESH is by using captured deltas, through incremental maintenance techniques ] That would ensure that a query could not be used to define a matview until we had implemented incremental maintenance for queries of that type and complexity. I expect that to come *close* to covering all useful queries that way will take five to ten years, if all goes well. The approach I'm taking makes all queries available *now*, with improvements in how many can be maintained incrementally improving over time. This is the route every other database I've looked at has taken (for good reason, I think). Ultimately, even when we have incremental maintenance supported for all queries that can be used to define a matview, I think there will be demand for a REFRESH command that re-runs the base query. Not only does that fit the workload for some matviews, but consider this, from the paper I cited[1]: | Recomputing the view from scratch is too wasteful in most cases. | Using the heuristic of inertia (only a part of the view changes | in response to changes in the base relations), it is often | cheaper to compute only the changes in the view. We stress that | the above is only a heuristic. For example, if an entire base | relation is deleted, it may be cheaper to recompute a view that | depends on the deleted relation (if the new view will quickly | evaluate to an empty relation) than to compute the changes to the | view. What we're talking about is a performance heuristic -- not something more profound than that. The route I'm taking is to get it *working correctly now* using the simple technique, and then embarking on the long journey of optimizing progressively more cases. What your argument boils down to IMV is essentially a case of premature optimization. You have yet to show any case where the existing patch does not yield correct results, or show that there is a better way to get to the point this patch takes us. [ later post shows a query that does not produce deterministic results ] Sure, two direct runs of that same query, or two runs through a regular view, could show different results (considering synchronized scans, among other things). I don't see what that proves. Of course a refresh of a matview is only going to produce one of those and then will not produce a different result until it is refreshed or (once we add incremental maintenance) something changes in the underlying data. Nobody ever claimed that a query which does not produce consistent results would somehow produce them with this patch. There are queries using citext, numeric, and other types which *do* provide consistent results which are consistently produced by a straight query, a simple view, or a non-concurrent refresh of a materialized view; this patch will cause a concurrent refresh to produce the same results as those, rather than something different. Period. That's the point, and the whole point. 
You have not shown that it doesn't. You have not shown why adding a 12th non-default opclass is a particular problem here (although we have a consensus to use different operators, to reserve this operator namespace for other things). You have not made any case at all for why people should wait for incremental maintenance to be mature (a project which will take years) before being able to use materialized views with concurrent refreshes. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] A. Gupta, I. S. Mumick, and V. S. Subrahmanian. Maintaining Views Incrementally. In SIGMOD 1993, pages 157-167. http://dl.acm.org/citation.cfm?id=170066 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
* Vesa-Matti J Kari (vmk...@cc.helsinki.fi) wrote: Many thanks to all who contributed to the fix.

Great! Thanks for the report and the testing.

Stephen
Re: [HACKERS] 9.3 Json Array's
On 09/24/2013 12:59 AM, Chris Travers wrote: I am still in the process of wrapping my head around the current JSON logic. I hope to produce a proof of concept that can later be turned into a patch. See my previous post on this topic. Again collaboration is welcome.

Feel free to ask questions. The heart of the API is the event handlers defined in this struct in include/utils/jsonapi.h:

typedef struct JsonSemAction
{
    void       *semstate;
    json_struct_action object_start;
    json_struct_action object_end;
    json_struct_action array_start;
    json_struct_action array_end;
    json_ofield_action object_field_start;
    json_ofield_action object_field_end;
    json_aelem_action array_element_start;
    json_aelem_action array_element_end;
    json_scalar_action scalar;
} JsonSemAction;

Basically there is a handler for the start and end of each non-scalar structural element in JSON, plus a handler for scalars. There are several problems that will be posed by processing nested arrays and objects, including:

* in effect you would need to construct a stack of state that could be pushed and popped
* JSON arrays aren't a very good match for SQL arrays - they are unidimensional and heterogeneous

I'm not saying this can't be done - it will just take a bit of effort.

cheers

andrew
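To show how the handlers get wired up, here is a minimal sketch of a callback set that just tracks nesting depth, i.e. the simplest form of the state stack mentioned above. It assumes the 9.3-era jsonapi.h entry points makeJsonLexContext(text *, bool) and pg_parse_json(JsonLexContext *, JsonSemAction *); the exact signatures should be double-checked against the actual header.

#include "postgres.h"
#include "utils/jsonapi.h"

typedef struct DepthState
{
	int		depth;		/* current nesting level */
	int		maxdepth;	/* deepest level seen so far */
} DepthState;

/* Called at '{' and (here) also at '['. */
static void
depth_start(void *state)
{
	DepthState *ds = (DepthState *) state;

	if (++ds->depth > ds->maxdepth)
		ds->maxdepth = ds->depth;
}

/* Called at '}' and ']'. */
static void
depth_end(void *state)
{
	((DepthState *) state)->depth--;
}

/* Parse a json value and report its maximum nesting depth. */
static int
json_max_depth(text *json)
{
	DepthState	ds = {0, 0};
	JsonLexContext *lex = makeJsonLexContext(json, false);
	JsonSemAction sem;

	memset(&sem, 0, sizeof(sem));		/* unused callbacks stay NULL */
	sem.semstate = &ds;
	sem.object_start = depth_start;
	sem.object_end = depth_end;
	sem.array_start = depth_start;		/* arrays nest the same way */
	sem.array_end = depth_end;

	pg_parse_json(lex, &sem);
	return ds.maxdepth;
}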
Re: [HACKERS] PostgreSQL 9.3 beta breaks some extensions make install
On 09/23/2013 12:15 PM, Cédric Villemain wrote: I'm working on it. It appears to have a slight problem or two I want to fix at the same time, rather than backpatch something broken.

Would you mind sharing the problems you are facing? You've noticed the problem about installdirs, I suppose. The attached patch is the fix currently applied in Debian.

I will when I'm sure it's not a case of pilot error on my part.

cheers

andrew
Re: [HACKERS] record identical operator
On Tue, Sep 24, 2013 at 7:14 AM, Stephen Frost sfr...@snowman.net wrote: Of course, you don't need citext, or any other data type with a loose notion of equality, to generate that sort of problem: [...] rhaas=# set datestyle = 'german'; SET

I'm talking about *planner differences* changing the results. If we've got a ton of cases where a different plan means different output, then we've got some serious problems. I'd argue that it's pretty darn clear that datestyle is going to be a *slightly* different animal.

My example doesn't *require* changing any GUCs, it was just expedient for illustration.

I'm completely confused here. What's so special about planner differences? Obviously, there are tons of queries that can produce different results based on planner differences, but materialized views didn't create that problem and they aren't responsible for solving it. Also, even restricting ourselves to planner differences, there's no particular link between the behavior of the type's equality operator and whether or not the query always produces the same results. For example, let's consider good old text.

rhaas=# create table tag_data (id integer, tag text, userid text, primary key (id, tag));
CREATE TABLE
rhaas=# insert into tag_data values (1, 'foo', 'tom'), (1, 'bar', 'dick'), (2, 'baz', 'harry');
INSERT 0 3
rhaas=# select id, string_agg(tag||':'||userid, ', ') tags from tag_data group by 1;
 id |       tags
----+-------------------
  1 | foo:tom, bar:dick
  2 | baz:harry
(2 rows)

rhaas=# set enable_seqscan=false;
SET
rhaas=# select id, string_agg(tag||':'||userid, ', ') tags from tag_data group by 1;
 id |       tags
----+-------------------
  1 | bar:dick, foo:tom
  2 | baz:harry
(2 rows)

Now texteq() is just a glorified version of memcmp(), so what does this complaint possibly have to do with Kevin's patch, or even materialized views in general?

-- Robert Haas
Re: [HACKERS] Cube extension point support // GSoC'13
Hello,

There is a new version of the patch. I have separated the ordering operators into a different patch (https://commitfest.postgresql.org/action/patch_view?id=1243), fixed formatting issues and implemented backward compatibility with old-style points in cube_is_point() and cube_out().

Also, comparing the output files I've discovered that these four files are a combination of two types of differing behavior:

1) SELECT '-1e-700'::cube AS cube; can produce (0) or (-0)
2) The number of zeros in the exponent of a floating point value, i.e. SELECT '1e27'::cube AS cube; can produce (1e+027) or (1e+27)

On my system (OS X) it is the second option in both situations. I've also tested it on FreeBSD 9.0 and Ubuntu 12.04 with the same results. So, are there any ideas on how I can reproduce the other results?

Stas.

On Sep 16, 2013, at 10:48 AM, Heikki Linnakangas wrote: On 12.07.2013 14:57, Stas Kelvich wrote: Hello. Here is a patch adding support to the cube extension for a compressed representation of point cubes. If a cube is a point, i.e. has coincident lower left and upper right corners, then only one corner is stored. The first bit of the cube header indicates whether the cube is a point or not. A few notes:

* Patch preserves binary compatibility with old indices
* All functions that create cubes from user input check whether it is a point or not
* All internal functions that can return cubes take care of all cases where a cube might become a point

Great!

cube_is_point() needs to still handle old-style points. An NDBOX without the point-flag set, where the ll and ur coordinates for each dimension are the same, still needs to be considered a point. Even if you are careful to never construct such structs in the code, they can be present on-disk if you have upgraded from an earlier version with pg_upgrade. Same in cube_out().

* Added tests for checking correct point behavior

You'll need to adjust all the expected output files, not only cube_1.out.

Also this patch includes an adapted version of Alexander Korotkov's patch with a kNN-based ordering operator, which he wrote for postgresql-9.0beta1 with the knngist patch. More info here: http://www.postgresql.org/message-id/aanlktimhfaq6hcibrnk0tlcqmiyhywhwaq2zd87wb...@mail.gmail.com

To make review easier, it would be better to keep that as a separate patch, actually. Could you split it up again, please?

- Heikki
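A sketch of the backward-compatible point test being asked for follows. The macro and field names used here (POINT_BIT, DIM(), LL_COORD(), UR_COORD(), cube->header) are assumptions modeled on how contrib/cube eventually ended up, not necessarily what this version of the patch defines:

#include "postgres.h"
#include "cubedata.h"

/*
 * Hypothetical sketch: a cube is a point either if the new header bit
 * says so, or, for old-style cubes written before the flag existed,
 * if every lower-left coordinate equals its upper-right counterpart.
 */
static bool
cube_is_point_internal(NDBOX *cube)
{
	int		i;

	if (cube->header & POINT_BIT)
		return true;

	for (i = 0; i < DIM(cube); i++)
	{
		if (LL_COORD(cube, i) != UR_COORD(cube, i))
			return false;
	}
	return true;
}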
Re: [HACKERS] 9.3 Json Array's
On 24 September 2013 at 13:46 Andrew Dunstan and...@dunslane.net wrote: Feel free to ask questions. The heart of the API is the event handlers defined in this struct in include/utils/jsonapi.h:

typedef struct JsonSemAction
{
    void       *semstate;
    json_struct_action object_start;
    json_struct_action object_end;
    json_struct_action array_start;
    json_struct_action array_end;
    json_ofield_action object_field_start;
    json_ofield_action object_field_end;
    json_aelem_action array_element_start;
    json_aelem_action array_element_end;
    json_scalar_action scalar;
} JsonSemAction;

Basically there is a handler for the start and end of each non-scalar structural element in JSON, plus a handler for scalars. There are several problems that will be posed by processing nested arrays and objects, including:

* in effect you would need to construct a stack of state that could be pushed and popped

True.

* JSON arrays aren't a very good match for SQL arrays - they are unidimensional and heterogeneous

This is true, but I think one would have to start with an assumption that the data is valid for an SQL type and then check again once the conversion is done. JSON is a pretty flexible format, which makes it a poor match in many cases for SQL types generally. But I think the example so far (with json_populate_recordset) is a good one, namely a best effort conversion.

I'm not saying this can't be done - it will just take a bit of effort.

Yeah, looking through the code, I think it will be more work than I originally thought, but that just means it will take longer.

cheers

andrew

Best Wishes,

Chris Travers
Re: [HACKERS] Reasoning behind LWLOCK_PADDED_SIZE/increase it to a full cacheline
On Tue, Sep 24, 2013 at 6:48 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-09-24 12:39:39 +0200, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: So, what we do is we guarantee that LWLocks are aligned to 16 or 32byte boundaries. That means that on x86-64 (64byte cachelines, 24bytes unpadded lwlock) two lwlocks share a cacheline. In my benchmarks changing the padding to 64byte increases performance in workloads with contended lwlocks considerably. At a huge cost in RAM. Remember we make two LWLocks per shared buffer. I think that rather than using a blunt instrument like that, we ought to see if we can identify pairs of hot LWLocks and make sure they're not adjacent. That's a good point. What about making all but the shared buffer lwlocks 64bytes? It seems hard to analyze the interactions between all the locks and keep it maintained. I think somebody had a patch a few years ago that made it so that the LWLocks didn't have to be in a single array, but could instead be anywhere in shared memory. Internally, lwlock.c held onto LWLock pointers instead of LWLockIds. That idea seems like it might be worth revisiting, in terms of giving us more options as to how LWLocks can be laid out in shared memory. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On Tue, Sep 24, 2013 at 5:04 AM, David Rowley dgrowle...@gmail.com wrote: So... I guess the question that I'd ask is, if you write a PL/pgsql function that does RAISE NOTICE in a loop a large number of times, can you measure any difference in how fast that function executes on the patch and unpatched code? If so, how much? I do see a 15-18% slow down with the patched version, so perhaps I'll need to look to see if I can speed it up a bit, although I do feel this benchmark is not quite a normal workload. Ouch! That's pretty painful. I agree that's not a normal workload, but I don't think it's an entirely unfair benchmark, either. There certainly are people who suffer because of the cost of logging as things are; for example, log_min_duration_statement is commonly used and can produce massive amounts of output on a busy system. I wouldn't mind too much if the slowdown you are seeing only occurred when the feature is actually used, but taking a 15-18% hit on logging even when the new formatting features aren't being used seems too expensive to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertions in PL/PgSQL
On Mon, Sep 23, 2013 at 5:48 AM, Amit Khandekar amit.khande...@enterprisedb.com wrote: The assert levels sound a bit like something a user might be confused by, with these levels being present in both places: in the RAISE syntax itself, and in the assert GUC. But I like the syntax. How about keeping the ASSERT keyword optional? When we have WHEN, we anyway mean that we are asserting that this condition must be true. So something like this:

RAISE [ level ] 'format' [, expression [, ... ]] [ USING option = expression [, ... ] ];
RAISE [ level ] condition_name [ USING option = expression [, ... ] ];
RAISE [ level ] SQLSTATE 'sqlstate' [ USING option = expression [, ... ] ];
RAISE [ level ] USING option = expression [, ... ];
RAISE [ ASSERT ] WHEN bool_expression;
RAISE ;

I don't think it is a good idea. The WHEN clause should be independent of the exception level.

I am ok with generalizing the WHEN clause across all levels. The main proposal was for adding assertion support, so we can keep the WHEN generalization as a nice-to-have and do it only if it comes as a natural extension in the assertion support patch.

I think that's right: ISTM that at this point there are two different proposals here.

1. Allowing ASSERT as an argument to RAISE.
2. Allowing RAISE to have a WHEN clause.

Those two things are logically separate. We could do either one without doing the other one.

-- Robert Haas
Re: [HACKERS] record identical operator
* Kevin Grittner (kgri...@ymail.com) wrote: That's the point, and the whole point. You have not shown that it doesn't. You have not shown why adding a 12th non-default opclass is a particular problem here (although we have a consensus to use different operators, to reserve this operator namespace for other things). We need justification to add operators, imv, especially ones that expose our internal binary representation of data. I worry that adding these will come back to bite us later and that we're making promises we won't be able to keep. If these inconsistencies in what happens with these data types are an issue then REFRESH can be handled as a wholesale DELETE/INSERT. Trying to do this incremental-but-not-really maintenance where the whole query is run but we try to skimp on what's actually getting updated in the matview is a premature optimization, imv, and one which may be less performant and more painful, with more gotchas and challenges for our users, to deal with in the long run. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] SSL renegotiation
On Mon, Sep 23, 2013 at 4:51 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Here's an updated version; this mainly simplifies code, per comments from Andres (things were a bit too baroque in places due to the way the code had evolved, and I hadn't gone over it to simplify it). The only behavior change is that the renegotiation is requested 1kB before the limit is hit: the raise to 1% of the configured limit was removed.

What basis do we have for thinking that 1kB is definitely enough to avoid spurious disconnects? (I have a bad feeling that you're going to say something along the lines of "well, we tried it a bunch of times, and...")

-- Robert Haas
Re: [HACKERS] record identical operator
Stephen Frost sfr...@snowman.net wrote: We need justification to add operators, imv, especially ones that expose our internal binary representation of data.

I believe I have done so.

I worry that adding these will come back to bite us later

How?

and that we're making promises we won't be able to keep.

The promise that a concurrent refresh will produce the same set of rows as a non-concurrent one?

If these inconsistencies in what happens with these data types are an issue then REFRESH can be handled as a wholesale DELETE/INSERT.

I have real-life experience with handling faux materialized views by creating tables (at Wisconsin Courts). We initially did it the way you describe, but the run time was excessive (in Milwaukee County the overnight run did not always complete before the start of business hours the next day). Switching to logic similar to what I've implemented here, it completed an order of magnitude faster, and generated a small fraction of the WAL and logical replication data that the technique you describe did. Performing preliminary steps in temporary tables to minimize the changes needed to permanent tables seems beneficial enough on the face of it that I think the burden of proof should be on someone arguing that deleting all rows and re-inserting (in the same transaction) is, in general, a superior approach to finding the differences and applying only those.

Trying to do this incremental-but-not-really maintenance where the whole query is run but we try to skimp on what's actually getting updated in the matview is a premature optimization, imv, and one which may be less performant and more painful, with more gotchas and challenges for our users, to deal with in the long run.

I have the evidence of a ten-fold performance improvement plus minimized WAL and replication work on my side. What evidence do you have to back your assertions? (Don't forget to work in bloat and vacuum truncation issues to the costs of your proposal.)

-- Kevin Grittner
Re: [HACKERS] Improving avg performance for numeric
On Mon, Sep 23, 2013 at 4:15 PM, Tomas Vondra t...@fuzzy.cz wrote: Seems ready for committer to me. I'll wait a few days for others to comment, and then I'll update the commitfest page.

Some thoughts:

The documentation doesn't reflect the restriction to type internal. On a related note, why restrict this to type internal?

Formatting fixes are needed:

+	if (aggtransspace > 0)
+	{
+		costs->transitionSpace += aggtransspace;
+	}

Project style is not to use curly-braces for single statements. Also, the changes to numeric.c add blank lines before and after function header comments, which is not the usual style.

!	if (state == NULL)
!		PG_RETURN_NULL();
!	else
!		PG_RETURN_POINTER(state);

I think this should just say PG_RETURN_POINTER(state). PG_RETURN_NULL is for returning an SQL NULL, not (void *) 0. Is there some reason why we need an SQL NULL here, rather than a NULL pointer?

On the whole this looks fairly solid on a first read-through.

-- Robert Haas
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Mon, Sep 23, 2013 at 7:05 PM, Peter Geoghegan p...@heroku.com wrote: It suits my purposes to have the value locks be held for only an instant, because: [ detailed explanation ] I don't really disagree with any of that. TBH, I think the question of how long value locks (as you call them) are held is going to boil down to a question of how they end up being implemented. As I mentioned to you at PG Open (going through the details here for those following along at home), we could optimistically insert the new heap tuple, then go add index entries for it, and if we find a conflict, then instead of erroring out, we mark the tuple we were inserting dead and go try update the conflicting tuple instead. In that implementation, if we find that we have to wait for some other transaction along the way, it's probably not worth reversing out the index entries already inserted, because getting them into the index in the first place was a WAL-logged operation, and therefore relatively expensive, and IMHO it's most likely better to just hope things work out than to risk having to redo all of that. On the other hand, if the locks are strictly in-memory, then the cost of releasing them all before we go to wait, and of reacquiring them after we finish waiting, is pretty low. There might be some modularity issues to work through there, but they might not turn out to be very painful, and the advantages you mention are certainly worth accruing if it turns out to be fairly straightforward. Personally, I think that trying to keep it all in-memory is going to be hard. The problem is that we can't de-optimize regular inserts or updates to any significant degree to cater to this feature - because as valuable as this feature is, the number of times it gets used is still going to be a whole lot smaller than the number of times it doesn't get used. Also, I tend to think that we might want to define the operation as a REPLACE-type operation with respect to a certain set of key columns; and so we'll do the insert-or-update behavior with respect only to the index on those columns and let the chips fall where they may with respect to any others. In that case this all becomes much less urgent. Even *considering* this is largely academic, though, because without some kind of miracle guaranteeing serial ordering, a miracle that doesn't allow for serialization failures and also doesn't seriously slow down, for example, updates (by making them care about value locks *before* they do anything, or in the case of HOT updates *at all*), all of this is _impossible_. So, I say let's just do the actually-serial-ordering for value lock acquisition with serialization failures where we're read committed. I've seriously considered what it would take to do it any other way so things would work how you and Andres expect for read committed, and it makes my head hurt, because apart from seeming unnecessary to me, it also seems completely hopeless. Am I being too negative here? Well, I guess that's possible. The fact is that it's really hard to judge, because all of this is really hard to reason about. That's what I really don't like about it. Suppose we define the operation as REPLACE rather than INSERT...ON DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this: 1. Try to insert a tuple. If no unique index conflicts occur, stop. 2. Note the identity of the conflicting tuple and mark the inserted heap tuple dead. 3. If the conflicting tuple's inserting transaction is still in progress, wait for the inserting transaction to end. 
4. If the conflicting tuple is dead (e.g. because the inserter aborted), start over. 5. If the conflicting tuple's key columns no longer match the key columns of the REPLACE operation, start over. 6. If the conflicting tuple has a valid xmax, wait for the deleting or locking transaction to end. If xmax is still valid, follow the CTID chain to the updated tuple, let that be the new conflicting tuple, and resume from step 5. 7. Update the tuple, even though it may be invisible to our snapshot (a deliberate MVCC violation!). While this behavior is admittedly wonky from an MVCC perspective, I suspect that it would make a lot of people happy. I respectfully suggest that that exact definition of upsert isn't a useful one, because other snapshot isolation/MVCC systems operating within the same constraints must have the same issues, and yet they manage to implement something that could be called upsert that people seem happy with. Yeah. I wonder how they do that. My guess is that they have some fancy snapshot type that is used by the equivalent of ModifyTable subplans, that is appropriately paranoid about the Halloween problem and so on. How that actually might work is far from clear, but it's a question that I have begun to consider. As I said, a concern is that it would be in tension with the generalized, composable syntax, where we don't explicitly have a special
Re: [HACKERS] record identical operator
* Kevin Grittner (kgri...@ymail.com) wrote: Stephen Frost sfr...@snowman.net wrote: I worry that adding these will come back to bite us later How? User misuse is certainly one consideration, but I wonder what's going to happen if we change our internal representation of data (eg: numerics get changed again), or when incremental matview maintenance happens and we start looking at subsets of rows instead of the entire query. Will the first update of a matview after a change to numeric's internal data structure cause the entire thing to be rewritten? and that we're making promises we won't be able to keep. The promise that a concurrent refresh will produce the same set of rows as a non-concurrent one? The promise that we'll always return the binary representation of the data that we saw last. When greatest(x,y) comes back 'false' for a MAX(), we then have to go check well, does the type consider them equal?, because, if the type considers them equal, we then have to decide if we should replace x with y anyway, because it's different at a binary level. That's what we're saying we'll always do now. We're also saying that we'll replace things based on plan differences rather than based on if the rows underneath actually changed at all. We could end up with material differences in the result of matviews updated through incremental REFRESH and matviews updated through actual incremental mainteance- and people may *care* about those because we've told them (or they discover) they can depend on these types of changes to be reflected in the result. Trying to do this incremental-but-not-really maintenance where the whole query is run but we try to skimp on what's actually getting updated in the matview is a premature optimization, imv, and one which may be less performant and more painful, with more gotchas and challenges for our users, to deal with in the long run. I have the evidence of a ten-fold performance improvement plus minimized WAL and replication work on my side. What evidence do you have to back your assertions? (Don't forget to work in bloat and vacuum truncation issues to the costs of your proposal.) I don't doubt that there are cases in both directions and I'm not trying to argue that it'd always be faster, but I doubt it's always slower. I'm surprised that you had a case where the query was apparently quite fast yet the data set hardly changed and resulted in a very large result but I don't doubt that it happened. What I was trying to get at is really that the delete/insert approach would be good enough in very many cases and it wouldn't have what look, to me anyway, as some pretty ugly warts around these cases. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] ENABLE/DISABLE CONSTRAINT NAME
On Tue, Sep 24, 2013 at 5:58 AM, Bernd Helmle maili...@oopsware.de wrote: --On 13. September 2013 20:17:19 -0400 Robert Haas robertmh...@gmail.com wrote: You're missing the point. Peter wasn't worried that your patch throws an error; he's concerned about the fact that it doesn't. In PostgreSQL, you can only create the following view because test1 has a primary key over column a: = create table test1 (a int constraint pk primary key, b text); = create view test2 as select a, b from test1 group by a; = alter table test1 drop constraint pk; The reason that, if the primary key weren't there, it would be ambiguous which row should be returned as among multiple values where a is equal and b is not. If you can disable the constraint, then you can create precisely that problem. Hmm not sure i understand this argument either: this patch doesn't allow disabling a primary key. It only supports FKs and CHECK constraints explicitly. Well, that is certainly one way of skating around the specific concern Peter raised. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - visibility semantics
On Tue, Sep 24, 2013 at 5:14 AM, Andres Freund and...@2ndquadrant.com wrote: Various messages are discussing semantics around visibility. I by now have a hard time keeping track. So let's keep the discussion of the desired semantics to this thread. There have been some remarks about serialization failures in read committed transactions. I agree, those shouldn't occur. But I don't actually think they are so much of a problem if we follow the path set by existing uses of the EPQ logic. The scenario described seems to be an UPSERT conflicting with a row it cannot see in the original snapshot of the query. In that case I think we just have to follow the example laid by ExecUpdate, ExecDelete and heap_lock_tuple. Use the EPQ machinery (or an alternative approach with similar enough semantics) to get a new snapshot and follow the ctid chain. When we've found the end of the chain we try to update that tuple. That surely isn't free of surprising semantics, but it would follows existing semantics. Which everybody writing concurrent applications in read committed should (but doesn't) know. Adding a different set of semantics seems like a bad idea. Robert seems to have been the primary sceptic around this, what scenario are you actually concerned about? I'm not skeptical about offering it as an option; in fact, I just suggested basically the same thing on the other thread, before reading this. Nonetheless it IS an MVCC violation; the chances that someone will be able to demonstrate serialization anomalies that can't occur today with this new facility seem very high to me. I feel it's perfectly fine to respond to that by saying: yep, we know that's possible, if it's a concern in your environment then don't use this feature. But it should be clearly documented. I do think that it will be easier to get this to work if we have a define the operation as REPLACE, bundling all of the magic inside a single SQL command. If the user issues an INSERT first and then must try an UPDATE afterwards if the INSERT doesn't actually insert, then you're going to have problems if the UPDATE can't see the tuple with which the INSERT conflicted, and you're going to need some kind of a loop in case the UPDATE itself fails. Even if we can work out all the details, a single command that does insert-or-update seems like it will be easier to use and more efficient. You might also want to insert multiple tuples using INSERT ... VALUES (...), (...), (...); figuring out which ones were inserted and which ones must now be updated seems like a chore better avoided. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
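For context on the "some kind of a loop" Robert mentions: this is roughly what applications have to write today on the client side, here sketched with libpq. Table and column names are hypothetical, and the sketch assumes autocommit mode (inside an explicit transaction block the failed INSERT would abort the transaction, so a savepoint would be needed around it):

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static void
upsert_counter(PGconn *conn, const char *key)
{
    const char *params[1] = { key };

    for (;;)
    {
        const char *sqlstate;
        PGresult   *res;

        res = PQexecParams(conn,
                           "INSERT INTO counters(k, n) VALUES ($1, 1)",
                           1, NULL, params, NULL, NULL, 0);
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
        {
            PQclear(res);
            return;                     /* inserted */
        }
        /* 23505 = unique_violation; anything else is a real error */
        sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
        if (sqlstate == NULL || strcmp(sqlstate, "23505") != 0)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            PQclear(res);
            return;
        }
        PQclear(res);

        res = PQexecParams(conn,
                           "UPDATE counters SET n = n + 1 WHERE k = $1",
                           1, NULL, params, NULL, NULL, 0);
        if (PQresultStatus(res) == PGRES_COMMAND_OK &&
            strcmp(PQcmdTuples(res), "0") != 0)
        {
            PQclear(res);
            return;                     /* updated an existing row */
        }
        PQclear(res);                   /* row vanished again: retry INSERT */
    }
}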
Re: [HACKERS] logical changeset generation v6
On Tue, Sep 24, 2013 at 4:15 AM, Andres Freund and...@2ndquadrant.com wrote: There needs to be a client acking the reception of the data in some form. There's currently two output methods, SQL and walstreamer, but there easily could be further, it's basically two functions you have write. There are several reasons I think the tool is useful, starting with the fact that it makes the initial use of the feature easier. Writing a client for CopyBoth messages wrapping 'w' style binary messages, with the correct select() loop isn't exactly trivial. I also think it's actually useful in real scenarios where you want to ship the data to a remote system for auditing purposes. I have two basic points here: - Requiring a client is a short-sighted design. There's no reason we shouldn't *support* having a client, but IMHO it shouldn't be the only way to use the feature. - Suppose that you use pg_receivellog (or whatever we decide to call it) to suck down logical replication messages. What exactly are you going to do with that data once you've got it? In the case of pg_receivexlog it's quite obvious what you will do with the received files: you'll store them in archive of some kind and maybe eventually use them for archive recovery, streaming replication, or PITR. But the answer here is a lot less obvious, at least to me. For example, for replication, I'd think you might want the plugin to connect to a remote database and directly shove the data in; That sounds like a bad idea to me. If you pull the data from the remote side, you get the data in a streaming fashion and the latency sensitive part of issuing statements to your local database is done locally. Doing things synchronously like that also makes it way harder to use synchronous_commit = off on the remote side, which is a tremendous efficiency win. This sounds like the voice of experience talking, so I won't argue too much, but I don't think it's central to my point. And anyhow, even if it is a bad idea, that doesn't mean someone won't want to do it. :-) If somebody needs something like this, e.g. because they want to replicate into hundreds of shards depending on some key or such, the question I don't know is how to actually initiate the streaming. Somebody would need to start the logical decoding. Sounds like a job for a background worker. It would be pretty swell if you could write a background worker that connects to a logical replication slot and then does whatever. for materialized views, we might like to push the changes into delta relations within the source database. Yes, that's not a bad usecase and I think the only thing missing to use output plugins that way is a convenient function to tell up to where data has been received (aka synced to disk, aka applied). Yes. It feels to me (and I only work here) like the job of the output plugin ought to be to put the data somewhere, and the replication code shouldn't make too many assumptions about where it's actually going. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] trivial one-off memory leak in guc-file.l ParseConfigFile
On Sun, Sep 22, 2013 at 3:40 PM, didier did...@gmail.com wrote: fix a small memory leak in guc-file.l ParseConfigFile AbsoluteConfigLocation() return a strdup string but it's never free or referenced outside ParseConfigFile Courtesy Valgrind and Noah Misch MEMPOOL work. I'd like to look at this, but I haven't got time right now. Could you add it to the next CommitFest so it doesn't get forgotten about? https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LDAP: bugfix and deprecated OpenLDAP API
Abhijit Menon-Sen wrote: I read through the patch, and it looks sensible. Thanks for the thorough review! I would have preferred the ldap_simple_bind_s() call in the HAVE_LIBLDAP branch to not be inside an else {} (the if block above returns if there is an error anyway), but that's a minor point. I agree, changed. I tested the code as follows: 1. Built the patched source --with-ldap 2. Set up ~/.pg_service.conf: [foo] ldap://localhost:3343/cn=dbserver,cn=hosts?pgconnectinfo?base?(objectclass=*) ldap://localhost:3443/cn=dbserver,cn=hosts?pgconnectinfo?base?(objectclass=*) 3. iptables -A INPUT -p tcp -d 127.0.0.1 --dport 3343 -j DROP 4. netcat -l 127.0.0.1 3343 ; netcat -l 127.0.0.1 3443 5. PGSERVICE=foo bin/psql psql did connect to localhost:3443 after a few seconds of trying to connect to :3343 and failing. (I tried without the iptables rule, so I know that it does try to connect to both.) This doesn't seem to handle timeouts in the sense of a server that doesn't respond after you connect (or perhaps the timeout was long enough that it outlasted my patience), but that's not the fault of this patch, anyway. That's a bug. I thought that setting LDAP_OPT_NETWORK_TIMEOUT would also take care of an unresponsive server, but it seems that I was wrong. The attached patch uses ldap_simple_bind / ldap_result to handle this kind of timeout (like the original code). I can't say anything about the patch on Windows, but since Magnus seemed to think it was OK, I'm marking this ready for committer. The Windows part should handle all kinds of timeouts the same. Yours, Laurenz Albe ldap-bug-3.patch Description: ldap-bug-3.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
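For those not familiar with the asynchronous calls being discussed: the ldap_simple_bind()/ldap_result() pairing from the deprecated OpenLDAP API lets the caller bound how long it will wait for an unresponsive server. A sketch of the shape of that code (not the patch itself; the two-second timeout is made up):

#include <ldap.h>

/* Sketch: anonymous bind that gives up after a fixed timeout. */
static int
bind_with_timeout(LDAP *ld)
{
    LDAPMessage *result;
    struct timeval tv;
    int         msgid;
    int         rc;

    msgid = ldap_simple_bind(ld, NULL, NULL);   /* asynchronous; returns a message id */
    if (msgid == -1)
        return -1;

    tv.tv_sec = 2;              /* hypothetical timeout */
    tv.tv_usec = 0;

    rc = ldap_result(ld, msgid, LDAP_MSG_ALL, &tv, &result);
    if (rc == -1)
        return -1;              /* error */
    if (rc == 0)
        return -1;              /* timed out: connected, but server not answering */

    ldap_msgfree(result);
    return 0;                   /* got an answer (caller still checks it) */
}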
Re: [HACKERS] record identical operator
Stephen Frost sfr...@snowman.net wrote: Kevin Grittner (kgri...@ymail.com) wrote: Stephen Frost sfr...@snowman.net wrote: I worry that adding these will come back to bite us later How? User misuse is certainly one consideration, I think that one has been talked to death already, with the consensus that we should use different operator names and document them as a result. Any comparison operator can be misused. Andres said he has seen the operators from text_pattern_ops used in production; but we have not removed, for example, ~=~ from our text operators as a result. but I wonder what's going to happen if we change our internal representation of data (eg: numerics get changed again), or when incremental matview maintenance happens and we start looking at subsets of rows instead of the entire query. Will the first update of a matview after a change to numeric's internal data structure cause the entire thing to be rewritten? Something like that could happen, if the newly generated values, for example, all took less space than the old. On the first REFRESH on the new version of the software it might delete and insert all rows; then it would stay with the new representation. I have trouble seeing that as a problem, since presumably we only come up with a new representation because it is smaller, faster, or cures some bug. and that we're making promises we won't be able to keep. The promise that a concurrent refresh will produce the same set of rows as a non-concurrent one? The promise that we'll always return the binary representation of the data that we saw last. When greatest(x,y) comes back 'false' for a MAX(), we then have to go check well, does the type consider them equal?, because, if the type considers them equal, we then have to decide if we should replace x with y anyway, because it's different at a binary level. That's what we're saying we'll always do now. I'm having a little trouble following that. The query runs as it always has, with all the old definitions of comparisons. After it is done, we check whether the rows are the same. The operation of MAX() will not be affected in any way. If MAX() returns a value which is not the same as what the matview has, the matview will be modified to match what MAX() returned. We're also saying that we'll replace things based on plan differences rather than based on if the rows underneath actually changed at all. Only if the user uses a query which does not produce deterministic results. We could end up with material differences in the result of matviews updated through incremental REFRESH and matviews updated through actual incremental mainteance- and people may *care* about those because we've told them (or they discover) they can depend on these types of changes to be reflected in the result. Perhaps we should document the recommendation that people not create materialized views with non-deterministic results, but I don't think that should be a hard restriction. For example, I could see someone creating a materialized view to pick a random sample from a jury pool to be the on the jury panel for today (from which juries are selected on that day), and not want that to be predictable and not want it to change until the next refresh of the panel matview. (That would not be my first design choice, but it would not be a horrid choice if there were sufficient logging of the REFRESH statements in a place the user could not modify.) 
Trying to do this incremental-but-not-really maintenance where the whole query is run but we try to skimp on what's actually getting updated in the matview is a premature optimization, imv, and one which may be less performant and more painful, with more gotchas and challenges for our users, to deal with in the long run. I have the evidence of a ten-fold performance improvement plus minimized WAL and replication work on my side. What evidence do you have to back your assertions? (Don't forget to work in bloat and vacuum truncation issues to the costs of your proposal.) I don't doubt that there are cases in both directions and I'm not trying to argue that it'd always be faster, but I doubt it's always slower. Sure. We provide a way to support those cases, although it involves blocking. So far, even the tests I expected to be faster with heap replacement have come out marginally slower that way than with CONCURRENTLY, due to index activity; but I have no doubt cases can be constructed that go the other way. I'm surprised that you had a case where the query was apparently quite fast yet the data set hardly changed and resulted in a very large result but I don't doubt that it happened. The history of all events in the county (tens of millions of records in Milwaukee County) need to be scanned to generate the status of cases and the date of last activity, as well as scanning all future calendar events to see what is scheduled. This is compared to tables setting
Re: [HACKERS] logical changeset generation v6
On 2013-09-24 11:04:06 -0400, Robert Haas wrote: - Requiring a client is a short-sighted design. There's no reason we shouldn't *support* having a client, but IMHO it shouldn't be the only way to use the feature. There really aren't many limitations preventing you from doing anything else. - Suppose that you use pg_receivellog (or whatever we decide to call it) to suck down logical replication messages. What exactly are you going to do with that data once you've got it? In the case of pg_receivexlog it's quite obvious what you will do with the received files: you'll store them in archive of some kind and maybe eventually use them for archive recovery, streaming replication, or PITR. But the answer here is a lot less obvious, at least to me. Well, it's not like it's going to be the only client. But it's a useful one. I don't see this as an argument against pg_receivelogical? Most sites don't use pg_receivexlog either. Not having a consumer of the walsender interface included sounds like a bad idea to me, even if it were only useful for testing. Now, you could argue it should be in /contrib - and I wouldn't argue against that except it shares code with the rest of src/bin/pg_basebackup. If somebody needs something like this, e.g. because they want to replicate into hundreds of shards depending on some key or such, the question I don't know is how to actually initiate the streaming. Somebody would need to start the logical decoding. Sounds like a job for a background worker. It would be pretty swell if you could write a background worker that connects to a logical replication slot and then does whatever. That's already possible. In that case you don't have to connect to a walsender, although doing so would give you some parallelism, one decoding the data, the other processing it ;). There's one usecase I do not forsee decoupling from the walsender interface this release though - synchronous logical replication. There currently is no code changes required to make sync rep work for this, and decoupling sync rep from walsender is too much to bite off in one go. for materialized views, we might like to push the changes into delta relations within the source database. Yes, that's not a bad usecase and I think the only thing missing to use output plugins that way is a convenient function to tell up to where data has been received (aka synced to disk, aka applied). Yes. It feels to me (and I only work here) like the job of the output plugin ought to be to put the data somewhere, and the replication code shouldn't make too many assumptions about where it's actually going. The output plugin just has two functions it calls to send out data, 'prepare_write' and 'write'. The callsite has to provide those callbacks. Two are included. walsender and an SQL SRF. Check the 'test_logical_decoding commit, it includes the SQL consumer. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Cube extension types support
Hello, hackers. In this patch I've implemented support for different storage types for cubes. It now supports float8, float4, int4, int2, int1. The type is stored in the header of each cube, once for all coordinates, so cubes with int1 coordinates can save up to 8x disk space. Typed cubes can be created in two ways: 1) Via cube_suffix() functions, where suffix can be f4, i4, i2, i1, and the arguments are two or one numbers or arrays, e.g.

# select cube_i1(1,2) as c;
     c
------------
 (1),(2):i1

# select cube_f4(array[1,2,3], array[5,6,7]) as c;
           c
------------------------
 (1, 2, 3),(5, 6, 7):f4

2) Via a modifier at the end of a string that will be cast to cube, e.g.

# select '(1,2,3):i2'::cube as c;
      c
--------------
 (1, 2, 3):i2

When no modifier is given, float8 is used by default. Old-style cubes without a type in the header are also treated as float8 for backward compatibility. This patch changes a lot of things in the code, so it interferes with other patches to the cube extension. I will update this patch if those other patches are accepted. types_cumulative.diff Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] record identical operator
* Kevin Grittner (kgri...@ymail.com) wrote: Stephen Frost sfr...@snowman.net wrote: The promise that we'll always return the binary representation of the data that we saw last. When greatest(x,y) comes back 'false' for a MAX(), we then have to go check well, does the type consider them equal?, because, if the type considers them equal, we then have to decide if we should replace x with y anyway, because it's different at a binary level. That's what we're saying we'll always do now. I'm having a little trouble following that. You're looking at it from the perspective of what's committed today, I think. The query runs as it always has, with all the old definitions of comparisons. After it is done, we check whether the rows are the same. The operation of MAX() will not be affected in any way. If MAX() returns a value which is not the same as what the matview has, the matview will be modified to match what MAX() returned. I'm referring to a case where we're doing incremental maintenance of the matview (in the far distant future..) and all we've got is the current MAX value and the value from the row being updated. We're going to update that MAX on cases where the values are equal-but-binary-different. We're also saying that we'll replace things based on plan differences rather than based on if the rows underneath actually changed at all. Only if the user uses a query which does not produce deterministic results. Which is apt to happen.. We could end up with material differences in the result of matviews updated through incremental REFRESH and matviews updated through actual incremental mainteance- and people may *care* about those because we've told them (or they discover) they can depend on these types of changes to be reflected in the result. Perhaps we should document the recommendation that people not create materialized views with non-deterministic results, but I don't think that should be a hard restriction. I wasn't suggesting a hard restriction, but I was postulating about how the behavior may change in the future (or we may need to do very hacky things to prevent it from channging) and how these decisions may impact that. What I was trying to get at is really that the delete/insert approach would be good enough in very many cases and it wouldn't have what look, to me anyway, as some pretty ugly warts around these cases. I guess we could add a DELETE everything and INSERT the new version of everything option for REFRESH in addition to what is there now, but I would be very reluctant to use it as a replacement. It wouldn't address my concerns anyway, which are around these binary operators and the update-in-place approach being risky and setting us up for problems down the road. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] logical changeset generation v6
On 09/24/2013 11:21 AM, Andres Freund wrote: Not having a consumer of the walsender interface included sounds like a bad idea to me, even if it were only useful for testing. Now, you could argue it should be in /contrib - and I wouldn't argue against that except it shares code with the rest of src/bin/pg_basebackup. +1 on pg_receivellog (or whatever better name we pick) being somewhere. I found the pg_receivellog code very useful as an example and for debugging/development purposes. It isn't something that I see useful for the average user and I think the use-cases it meets are closer to other things we usually put in /contrib Steve -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory
On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund and...@2ndquadrant.com wrote: Hm, I guess you dont't want to add it to global/ or so because of the mmap implementation where you presumably scan the directory? Yes, and also because I thought this way would make it easier to teach things like pg_basebackup (or anybody's home-brew scripts) to just skip that directory completely. Actually, I was wondering if we ought to have a directory under pgdata whose explicit charter it was to contain files that shouldn't be copied as part of a base backup. pg_do_not_back_this_up. Wondered exactly about that as soon as you've mentioned pg_basebackup. pg_local/? That seems reasonable. It's not totally transparent what that's supposed to mean, but it's fairly mnemonic once you know. Other suggestions welcome, if anyone has ideas. Are there any other likely candidates for inclusion in that directory other than this stuff? + /* Create or truncate the file. */ + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600); Doesn't this need a | PG_BINARY? It's a text file. Do we need PG_BINARY anyway? I'd say yes. Non binary mode stuff on windows does stuff like transforming LF = CRLF on reading/writing, which makes sizes not match up and similar ugliness. Imo there's little reason to use non-binary mode for anything written for postgres' own consumption. Well, I'm happy to do whatever the consensus is. AFAICT you and Noah are both for it and Amit's position is that it doesn't matter either way, so I'll go ahead and change that unless further discussion sheds a different light on things. Why are you using open() and not BasicOpenFile or even OpenTransientFile? Because those don't work in the postmaster. Oh, that's news to me. Seems strange, especially for BasicOpenFile. Per its header comment, InitFileAccess is not called in the postmaster, so there's no VFD cache. Thus, any attempt by BasicOpenFile to call ReleaseLruFile would be pointless at best. dsm_handle is an alias for uint32. Is that always exactly an unsigned int or can it sometimes be an unsigned long? I thought the latter, so couldn't figure out how to write this portably without casting to a type that explicitly matched the format string. That should always be an unsigned int on platforms we support. Note that we've printed TransactionIds that way (i.e. %u) for a long time and they are a uint32 as well. Fixed. Not sure whether it's sensible to only LOG in these cases. After all there's something unexpected happening. The robustness argument doesn't count since we're already shutting down. I see no point in throwing an error. The fact that we're having trouble cleaning up one dynamic shared memory segment doesn't mean we shouldn't try to clean up others, or that any remaining postmaster shutdown hooks shouldn't be executed. Well, it means we'll do a regular shutdown instead of PANICing and *not* writing a checkpoint. If something has corrupted our state to the point we cannot unregister shared memory we registered, something has gone terribly wrong. Quite possibly we've scribbled over our control structures or such. In that case it's not proper to do a normal shutdown, we're quite possibly writing bad data. I have to admit I didn't consider the possibility of an otherwise-clean shutdown that hit only this problem. I'm not sure how seriously to take that case. I guess we could emit warnings for individual failures and then throw an error at the end if there were 0, but that seems a little ugly. 
Or we could go whole hog and treat any failure as a critical error. Anyone else have an opinion on what to do here? Imo that's a PANIC or at the very least a FATAL. Sure, that's a tempting option, but it doesn't seem to serve any very necessary point. There's no data corruption problem if we proceed here. Most likely either (1) there's a bug in the code, which panicking won't fix or (2) the DBA hand-edited the state file, in which case maybe he shouldn't have done that, but if he thinks the best way to recover from that is a cluster-wide restart, he can do that himself. There's no data corruption problem if we proceed - but there likely has been one leading to the current state. I doubt it. It's more likely that the file permissions got changed or something. Do we rely on being run in an environment with proper setup for lwlock cleanup? I can imagine shared libraries doing this pretty early on... Yes, we rely on that. I don't really see that as a problem. You'd better connect to the main shared memory segment before starting to create your own. I am not talking about lwlocks itself being setup but an environment that has resource owners defined and catches errors. I am specifically asking because you're a) ereport()ing without releasing an LWLock b) unconditionally relying on the fact that there's a current resource owner. In
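To make the PG_BINARY question raised earlier in this exchange concrete, the change being asked for is just adding the flag to the open() call quoted from the patch. A sketch of the shape of the call (not the committed code; PG_BINARY expands to 0 on non-Windows platforms):

/*
 * As discussed above: open the state file in binary mode so that Windows
 * does not perform LF <-> CRLF translation on it.
 */
statefd = open(PG_DYNSHMEM_NEW_STATE_FILE,
               O_RDWR | O_CREAT | O_TRUNC | PG_BINARY,
               0600);
if (statefd < 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not create file \"%s\": %m",
                    PG_DYNSHMEM_NEW_STATE_FILE)));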
Re: [HACKERS] SSL renegotiation
Robert Haas escribió: On Mon, Sep 23, 2013 at 4:51 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Here's an updated version; this mainly simplifies code, per comments from Andres (things were a bit too baroque in places due to the way the code had evolved, and I hadn't gone over it to simplify it). The only behavior change is that the renegotiation is requested 1kB before the limit is hit: the raise to 1% of the configured limit was removed. What basis do we have for thinking that 1kB is definitely enough to avoid spurious disconnects? I noticed that the count variable (which is what we use to determine when to start the renegotiation and eventually kill the connection) is only incremented when there's successful SSL transmission: it doesn't count low-level network transmission. If OpenSSL returns a WANT_READ or WANT_WRITE error code, that variable is not incremented. The number of bytes returned does not include network data transmitted only to satisfy the renegotiation. Sadly, with the OpenSSL codebase, there isn't much documented field experience to go by. Even something battle-tested such as Apache's mod_ssl gets this wrong; but apparently they don't care because their sessions are normally so short-lived that they don't get these problems. Also, I spent several days trying to understand the OpenSSL codebase to figure out how this works, and I think there might be bugs in there too, at least with nonblocking sockets. I wasn't able to reproduce an actual failure, though. Funnily enough, their own test utilities do not stress this area too much (at least the ones they include in their release tarballs). (I have a bad feeling that you're going to say something along the lines of well, we tried it a bunch of times, and) Well, I did try a few times and saw no failure :-) I have heard about processes in production environments that are restarted periodically to avoid SSL failures which they blame on renegotiation. Some other guys have ssl_renegotiation_limit=0 because they know it causes network problems. I suggest we need to get this patch out there, so that they can test it; and if 1kB turns out not to be sufficient, we will have field experience including appropriate error messages on what is actually going on. (Right now, the error messages we get are complaining about completely the wrong thing.) I mean, if that 1kB limit is the only quarrel you have with this patch, I'm happy. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
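For readers trying to follow the 1kB argument: the logic under discussion amounts to counting only bytes that went through successful SSL reads/writes and asking OpenSSL for a renegotiation slightly before the configured limit, so the handshake can finish before the connection would otherwise be killed. A rough sketch of that shape (names are approximate and this is not the actual patch):

/* After a successful SSL_read()/SSL_write() that returned n > 0 bytes: */
port->count += n;

/* Request renegotiation 1kB before the configured limit (in kB). */
if (ssl_renegotiation_limit > 0 &&
    port->count > ssl_renegotiation_limit * 1024L - 1024)
{
    SSL_renegotiate(port->ssl);
    /*
     * Resetting port->count belongs wherever the handshake is known to have
     * completed; doing it here is a simplification of the real logic.
     */
    port->count = 0;
}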
Re: [HACKERS] Completing PL support for Event Triggers
Peter Eisentraut wrote: In the source code, I'm dubious about the use of is_dml_trigger. I can see where you are coming from, but in most of the code, a trigger is a trigger and an event trigger is an event trigger. Let's not introduce more terminology. Other opinions on that? I'm with you. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] record identical operator
On Tue, Sep 24, 2013 at 12:00 PM, Stephen Frost sfr...@snowman.net wrote: It wouldn't address my concerns anyway, which are around these binary operators and the update-in-place approach being risky and setting us up for problems down the road. I think that if you want to hold up a bug fix to a feature that's already committed, you need to be more specific than to say that there might be problems down the road. I can't see that you've identified any consequence of this change that is clearly bad. I understand that you don't like the idea of making binary-comparison operators user-visible due to the risk of user-confusion, but we have lots of confusing features already and we handle that by documenting them. This one doesn't seem any worse than average; in fact, to me it seems rather minor. At any rate, I think we're getting distracted from the real point here. Here's what this patch is attempting to fix: rhaas=# create table sales (txn_date date, amount numeric); CREATE TABLE rhaas=# insert into sales values ('2013-09-01', '100.00'), ('2013-09-01', '150.00'), ('2013-09-02', '75.0'); INSERT 0 3 rhaas=# create materialized view sales_summary as select txn_date, sum(amount) as amount from sales group by 1; SELECT 2 rhaas=# create unique index on sales_summary (txn_date); CREATE INDEX rhaas=# select * from sales_summary; txn_date | amount + 2013-09-01 | 250.00 2013-09-02 | 75.0 (2 rows) rhaas=# update sales set amount = round(amount, 2); UPDATE 3 rhaas=# refresh materialized view concurrently sales_summary; REFRESH MATERIALIZED VIEW rhaas=# select * from sales_summary; txn_date | amount + 2013-09-01 | 250.00 2013-09-02 | 75.0 (2 rows) rhaas=# refresh materialized view sales_summary; REFRESH MATERIALIZED VIEW rhaas=# select * from sales_summary; txn_date | amount + 2013-09-01 | 250.00 2013-09-02 | 75.00 (2 rows) Note that, after we change the data in the underlying table, the output of the query changes. But REFRESH MATERIALIZED VIEW CONCURRENTLY doesn't fix it. However, REFRESH MATERIALIZED VIEW without CONCURRENTLY does fix it. That's a bug, because if there are two ways of refreshing a materialized view they should both produce the same answer. Shouldn't they? The fact that users can write queries that don't always give the same answer is not a reason why it's OK for REFRESH CONCURRENTLY to misbehave on queries that do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSL renegotiation
On Tue, Sep 24, 2013 at 12:30 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I mean, if that 1kB limit is the only quarrel you have with this patch, I'm happy. You should probably be happy, then. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
David Rowley escribió: I do see a 15-18% slow down with the patched version, so perhaps I'll need to look to see if I can speed it up a bit, although I do feel this benchmark is not quite a normal workload. Ouch. That's certainly way too much. Is the compiler inlining process_log_prefix_padding()? If not, does it do it if you add inline to it? That might speed up things a bit. If that's not enough, maybe you need some way to return to the original coding for the case where no padding is set in front of each option. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
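For readers who have not seen the patch, the helper Álvaro refers to is essentially a small strtol() wrapper that consumes an optional '-' and digits after the '%'. A hypothetical sketch of it (the real patch's body may differ), marked inline as suggested:

#include <stdlib.h>

/*
 * Hypothetical sketch of process_log_prefix_padding(): p points just past
 * the '%'.  Hand back the requested padding (negative means left-justify),
 * or 0 if no padding was given, and return the advanced pointer.
 */
static inline const char *
process_log_prefix_padding(const char *p, int *ppadding)
{
    if (*p == '-' || (*p >= '0' && *p <= '9'))
        *ppadding = (int) strtol(p, (char **) &p, 10);
    else
        *ppadding = 0;
    return p;
}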
Re: [HACKERS] [PATCH] bitmap indexes
On Sat, Sep 14, 2013 at 11:14 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: Hi. This is a cleaned-up and rebased version of the bitmap index patch from Gavin Sherry, later revised by Gianni Ciolli and Gabriele Bartolini, and others including Daniel Bausch. I've been working on this patch for a while, and have made some progress towards (a) general fixing, and (b) a working VACUUM implementation (the major remaining piece). Unfortunately, I've been busy moving house, and the latter is not complete (and not in this patch). I will continue working on the code, and I'll post updates. I expect to have more to show in just a few days. Nevertheless, I'm posting it for review now as I keep working. Given the size and age of the patch, I would appreciate any comments, no matter how nitpicky. Hi Abhijit, I get wrong answers from this index sometimes. It seems to occur when the position of the column within the index is not the same as its position within the table. So I think that what is happening is somewhere the offset into the list of table columns is misused to offset into the list of index columns. I didn't see any XXX notes that indicate this is a known problem.

create table foo as select floor(random()*10) as a, floor(random()*10) as b, floor(random()*10) as c, d from generate_series(1,1000) d;
vacuum ANALYZE;
create index on foo using bitmap (a);
create index on foo using bitmap (b);

select count(*) from foo where a=4;
 1000173
select count(*) from foo where a+0=4;
 1000173
select count(*) from foo where b=4;
 0
select count(*) from foo where b+0=4;
 999750

Cheers, Jeff
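If the diagnosis above is right, the fix likely involves going through the index's own column list when translating attribute numbers. A sketch of the mapping that has to happen somewhere (illustration only; exactly where it is missing in the patch is not established here):

/*
 * Index column i (0-based) describes heap attribute
 * indexrel->rd_index->indkey.values[i]; using i directly as a heap attribute
 * number only works when the index happens to start at the table's first
 * column.
 */
AttrNumber  heapattno = indexrel->rd_index->indkey.values[i];
Form_pg_attribute attr = RelationGetDescr(heaprel)->attrs[heapattno - 1];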
Re: [HACKERS] record identical operator
* Robert Haas (robertmh...@gmail.com) wrote: On Tue, Sep 24, 2013 at 12:00 PM, Stephen Frost sfr...@snowman.net wrote: It wouldn't address my concerns anyway, which are around these binary operators and the update-in-place approach being risky and setting us up for problems down the road. I think that if you want to hold up a bug fix to a feature that's already committed, you need to be more specific than to say that there might be problems down the road. Committed isn't released and I simply can't agree that introducing new operators is a 'bug fix'. Had it gone out as-is, I can't see us back-patching a bunch of new operators to fix it. Note that, after we change the data in the underlying table, the output of the query changes. But REFRESH MATERIALIZED VIEW CONCURRENTLY doesn't fix it. However, REFRESH MATERIALIZED VIEW without CONCURRENTLY does fix it. That's a bug, because if there are two ways of refreshing a materialized view they should both produce the same answer. Shouldn't they? The same queries run over time without changes to the underlying data really should return the same data, shoudln't they? Is it a bug that they don't? In general, I agree that they should produce the same results, as should incrementally maintained views when they happen. I'm not convinced that choosing whatever the 'new' value is to represent the value in the matview for the equal-but-not-binary-identical will always be the correct answer though. The fact that users can write queries that don't always give the same answer is not a reason why it's OK for REFRESH CONCURRENTLY to misbehave on queries that do. This really is, imv, agreeing to hold a higher standard for matviews than we do for what matviews are *based* on- which is queries. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Some interesting news about Linux 3.12 OOM
All, I've sent kernel.org a message that we're keen on seeing these changes get committed. BTW, in the future if anyone sees kernel.org contemplating a patch which helps or hurts Postgres, don't hesitate to speak up to them. They don't get nearly enough feedback from DB developers. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On 2013-09-24 13:51:04 -0300, Alvaro Herrera wrote: David Rowley escribió: I do see a 15-18% slow down with the patched version, so perhaps I'll need to look to see if I can speed it up a bit, although I do feel this benchmark is not quite a normal workload. Ouch. That's certainly way too much. Is the compiler inlining process_log_prefix_padding()? If not, does it do it if you add inline to it? That might speed up things a bit. If that's not enough, maybe you need some way to return to the original coding for the case where no padding is set in front of each option. From a very short look without actually running it I'd guess the issue is all the $* things you're now passing to do appendStringInfo (which passes them off to vsnprintf). How does it look without that? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] record identical operator
On Tue, Sep 24, 2013 at 1:04 PM, Stephen Frost sfr...@snowman.net wrote: * Robert Haas (robertmh...@gmail.com) wrote: On Tue, Sep 24, 2013 at 12:00 PM, Stephen Frost sfr...@snowman.net wrote: It wouldn't address my concerns anyway, which are around these binary operators and the update-in-place approach being risky and setting us up for problems down the road. I think that if you want to hold up a bug fix to a feature that's already committed, you need to be more specific than to say that there might be problems down the road. Committed isn't released and I simply can't agree that introducing new operators is a 'bug fix'. Had it gone out as-is, I can't see us back-patching a bunch of new operators to fix it. You're perfectly welcome to argue that we should rip REFRESH MATERIALIZED VIEW CONCURRENTLY back out, or change the implementation of it. However, that's not the topic of this thread. Had it gone out as is, we would have either had to come up with some more complex fix that could make things right without catalog changes, or we would have had to document that it doesn't work correctly. Fortunately we are not faced with that conundrum. Note that, after we change the data in the underlying table, the output of the query changes. But REFRESH MATERIALIZED VIEW CONCURRENTLY doesn't fix it. However, REFRESH MATERIALIZED VIEW without CONCURRENTLY does fix it. That's a bug, because if there are two ways of refreshing a materialized view they should both produce the same answer. Shouldn't they? The same queries run over time without changes to the underlying data really should return the same data, shoudln't they? Not necessarily. There are and always have been plenty of cases where that isn't true. Is it a bug that they don't? No. In general, I agree that they should produce the same results, as should incrementally maintained views when they happen. I'm not convinced that choosing whatever the 'new' value is to represent the value in the matview for the equal-but-not-binary-identical will always be the correct answer though. Well, you have yet to provide an example of where it isn't the right behavior. I initially thought that perhaps we needed a type-specific concept of exact equality, so that two things would be exactly equal if they would both be dumped the same way by pg_dump, even if the internal representations were different. But on further thought I'm no longer convinced that's a good idea. For example, consider the compact-numeric format that I introduced a few releases back. The changes are backward compatible, so you can run pg_upgrade on a cluster containing the old format and everything works just fine. But your table will be larger than you really want it to be until you rewrite it. Any materialized view that was built by selecting those numeric values will *also* be larger than you want it to be until you rewrite it. Fortunately, you can shrink the table by using a rewriting variant of ALTER TABLE. After you do that, you will perhaps also want to shrink the materialized view. And in fact, REFRESH will do that for you, but currently, REFRESH CONCURRENTLY won't. It seems to me that it would be nicer if it did. Now I admit that's an arguable point. We could certainly define an intermediate notion of equality that is more equal than whatever = does, but not as equal as exact binary equality. However, such a new notion would have no precedence in our existing code, whereas the use of binary equality does. 
Going to a lot of work to build up this intermediate notion of equality does not seem like a good idea to me; I think a general rule of good programming is not to invent more ways to do things than are clearly necessary, and requiring every core and extension type to define an exact-equality operator just to support this feature seems like excessive in the extreme. Moreover, I think what to include in that intermediate notion of equality would be kind of hard to decide, because there are a lot of subtly different questions that one can ask, and we'd inevitably have to answer the question how equal does it have to be to be equal enough?. Numerics and arrays have multiple ways of referencing what is intended to be the exact same value, but those can be disambiguated by passing them to a function that cares, like pg_column_size(). Then, on a user-visible level, numerics allow variation in the number of trailing zeroes after the decimal point. Floats have extra digits that aren't shown except with extra_float_digits, so that the value can be less than totally equal even if the text representation is the same. citext intentionally ignores case. In every one of those cases, it's possible that the user of materialized views might say you know, if it's equal enough that the = operator says it's the same, then I really don't mind if the materialized view maintenance gets skipped. But it's also possible that they might say, you know, those
Re: [HACKERS] dynamic shared memory
On 2013-09-24 12:19:51 -0400, Robert Haas wrote: On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund and...@2ndquadrant.com wrote: Hm, I guess you dont't want to add it to global/ or so because of the mmap implementation where you presumably scan the directory? Yes, and also because I thought this way would make it easier to teach things like pg_basebackup (or anybody's home-brew scripts) to just skip that directory completely. Actually, I was wondering if we ought to have a directory under pgdata whose explicit charter it was to contain files that shouldn't be copied as part of a base backup. pg_do_not_back_this_up. Wondered exactly about that as soon as you've mentioned pg_basebackup. pg_local/? That seems reasonable. It's not totally transparent what that's supposed to mean, but it's fairly mnemonic once you know. Other suggestions welcome, if anyone has ideas. pg_node_local/ was the only reasonable thing I could think of otherwise, and I disliked it because it seems we shouldn't introduce node as a term just for this. Are there any other likely candidates for inclusion in that directory other than this stuff? You could argue that pg_stat_tmp/ is one. Why are you using open() and not BasicOpenFile or even OpenTransientFile? Because those don't work in the postmaster. Oh, that's news to me. Seems strange, especially for BasicOpenFile. Per its header comment, InitFileAccess is not called in the postmaster, so there's no VFD cache. Thus, any attempt by BasicOpenFile to call ReleaseLruFile would be pointless at best. Well, but it makes code running in both backends and postmaster easier to write. Good enough for me anyway. Imo that's a PANIC or at the very least a FATAL. Sure, that's a tempting option, but it doesn't seem to serve any very necessary point. There's no data corruption problem if we proceed here. Most likely either (1) there's a bug in the code, which panicking won't fix or (2) the DBA hand-edited the state file, in which case maybe he shouldn't have done that, but if he thinks the best way to recover from that is a cluster-wide restart, he can do that himself. There's no data corruption problem if we proceed - but there likely has been one leading to the current state. I doubt it. It's more likely that the file permissions got changed or something. We panic in that case during a shutdown, don't we? ... Yep: PANIC: could not open control file global/pg_control: Permission denied I am not talking about lwlocks itself being setup but an environment that has resource owners defined and catches errors. I am specifically asking because you're a) ereport()ing without releasing an LWLock b) unconditionally relying on the fact that there's a current resource owner. In shared_preload_libraries neither is the case afair? I don't really feel like solving all of those problems and, TBH, I don't see why it's particularly important. If somebody wants a loadable module that can be loaded either from shared_preload_libraries or on the fly, and they use dynamic shared memory in the latter case, then they can use it in the former case as well. If they've already got logic to create the DSM when it's first needed, it doesn't cost extra to do it that way in both cases. Fair enough. They'll continue to see the portion they have mapped, but must do dsm_remap() if they want to see the whole thing. But resizing can shrink, can it not? And we do an ftruncate() in at least the posix shmem case. Which means the other backend will get a SIGSEGV accessing that memory IIRC. Yep. 
Shrinking the shared memory segment will require special caution. Caveat emptor. Then a comment to that effect would be nice. We're not talking about a missed munmap() but about one that failed. If we unpin the leaked pins and notice that we haven't actually pinned it anymore we do error (well, Assert) out. Same for TupleDescs. If there were valid scenarios in which you could get into that situation, maybe. But which would that be? ISTM we can only get there if our internal state is messed up. I don't know. I think that's part of why it's hard to decide what we want to happen. But personally I think it's paranoid to say, well, something happened that we weren't expecting, so that must mean something totally horrible has happened and we'd better die in a fire. Well, by that argument we wouldn't need to PANIC on a whole host of issues. Like segfaults. Anyway, I guess we need other opinions here. Why isn't the port number part of the posix shmem identifiers? Sure, we retry, but using a logic similar to sysv_shmem.c seems like a good idea. According to the man page for shm_open on Solaris, For maximum portability, name should include no more than 14 characters, but this limit is not enforced. What about /pgsql.%u or something similar? That should still fit. Well, if you
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On 2013-09-24 19:56:32 +0200, Andres Freund wrote: On 2013-09-24 13:51:04 -0300, Alvaro Herrera wrote: David Rowley escribió: I do see a 15-18% slow down with the patched version, so perhaps I'll need to look to see if I can speed it up a bit, although I do feel this benchmark is not quite a normal workload. Ouch. That's certainly way too much. Is the compiler inlining process_log_prefix_padding()? If not, does it do it if you add inline to it? That might speed up things a bit. If that's not enough, maybe you need some way to return to the original coding for the case where no padding is set in front of each option. From a very short look without actually running it I'd guess the issue is all the $* things you're now passing to do appendStringInfo (which passes them off to vsnprintf). How does it look without that? That was maybe unclear: what I mean is to have an if (padding > 0) around the changed appendStringInfo invocations and use the old ones otherwise. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
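Concretely, the suggestion amounts to something like this inside the %-escape switch; variable names here are assumed, not taken from the patch:

case 'u':
    if (padding != 0)
        appendStringInfo(buf, "%*s", padding, username);
    else
        appendStringInfoString(buf, username);  /* original, cheap path */
    break;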
Re: [HACKERS] pgbench progress report improvements - split 2
Hello Noah, meet all those goals simultaneously with simpler code, can we not? int64 wait = (int64) (throttle_delay * Min(7.0, -log(1 - pg_erand48(thread->random_state)))); If you roughly truncate the multiplier, as is done here with Min, you necessarily create a bias; my random guess would be a 0.5% underestimation, but maybe it is more... Thus you would have to compute the correcting factor as well. Its computation is a little bit less easy than with the quantized formula, where I just used a simple SUM, and you have to really do the maths here. So I would keep the simple solution, but it is fine with me if you do the maths! -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
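Doing the maths for this particular truncation is actually straightforward: for X ~ Exp(1), E[min(X, c)] = 1 - exp(-c), so with the 7.0 cutoff the bias is only about 0.09%, and the correction is a constant factor. A sketch based on the snippet quoted above (pgbench context assumed; postgres_fe.h supplies int64 and Min):

#include "postgres_fe.h"
#include <math.h>

#define THROTTLE_CUTOFF 7.0

/*
 * Sketch only.  The mean of min(Exp(1), cutoff) is 1 - exp(-cutoff), so
 * dividing by that factor removes the truncation bias (~0.09% at 7.0).
 */
static int64
throttle_wait(TState *thread, double throttle_delay)
{
    double      u = pg_erand48(thread->random_state);
    double      x = Min(THROTTLE_CUTOFF, -log(1.0 - u));
    double      correction = 1.0 / (1.0 - exp(-THROTTLE_CUTOFF));

    return (int64) (throttle_delay * correction * x);
}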
Re: [HACKERS] record identical operator
* Robert Haas (robertmh...@gmail.com) wrote: Now I admit that's an arguable point. We could certainly define an intermediate notion of equality that is more equal than whatever = does, but not as equal as exact binary equality. I suggested it up-thread and don't recall seeing a response, so here it is again- passing the data through the binary-out function for the type and comparing *that* would allow us to change the interal binary representation of data types and would be something which we could at least explain to users as to why X isn't the same as Y according to this binary operator. I think the conservative (and therefore correct) approach is to decide that we're always going to update if we detect any difference at all. And there may be users who are surprised that a refresh changed parts of the table that have nothing to do with what was changed in the underlying relation, because a different plan was used and the result ended up being binary-different. It's easy to explain to users why that would be when we're doing a wholesale replace but I don't think it'll be nearly as clear why that happened when we're not replacing the whole table and why REFRESH can basically end up changing anything (but doesn't consistently). If we're paying attention to the records changed and only updating the matview's records when they're involved, that becomes pretty clear. What's happening here feels very much like unintended consequences. It is obviously true, and unavoidable, that if the query can produce more than one result set depending on the query plan or other factors, then the materialized view contents can match only one of those possible result sets. But you are arguing that it should be allowed to produce some OTHER result set, that a direct execution of the query could *never* have produced, and I can't see how that can be right. I agree that the matview shouldn't produce things which *can't* exist through an execution of the query. I don't intend to argue against that but I realize that's a fallout of not accepting the patch to implement the binary comparison operators. My complaint is more generally that if this approach needs such then there's going to be problems down the road. No, I can't predict exactly what they are and perhaps I'm all wet here, but this kind of binary-equality operations are something I've tried to avoid since discovering what happens when you try to compare a couple of floating point numbers. Thanks, Stephen signature.asc Description: Digital signature
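For what it's worth, the "compare the output of the type's binary-out function" idea floated above would look roughly like this on the backend side. Illustration only, with a per-call lookup that real code would cache:

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

/*
 * Illustration of the alternative: treat two values as "the same" when
 * their binary send-function output matches, regardless of how they are
 * stored internally.
 */
static bool
send_func_equal(Oid typid, Datum a, Datum b)
{
    Oid         sendfunc;
    bool        isvarlena;
    bytea      *abin;
    bytea      *bbin;

    getTypeBinaryOutputInfo(typid, &sendfunc, &isvarlena);
    abin = OidSendFunctionCall(sendfunc, a);
    bbin = OidSendFunctionCall(sendfunc, b);

    return VARSIZE_ANY_EXHDR(abin) == VARSIZE_ANY_EXHDR(bbin) &&
           memcmp(VARDATA_ANY(abin), VARDATA_ANY(bbin),
                  VARSIZE_ANY_EXHDR(abin)) == 0;
}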
Re: [HACKERS] pgbench progress report improvements
Dear Robert, (1) ... I really don't think it's a good idea to change the default behavior. We've had many discussions about how the overhead of gettimeofday() can sometimes be surprisingly large; I don't think we should assume it's guaranteed to be small in this case. Also, changing established defaults will annoy users who like to present defaults; I don't see any reason to assume that your preferences will be universal. In doubtful cases we should favor leaving the behavior the way it's been in previous releases, because backward-compatibility is very desirable. I argued in another mail precisely, with worst figures found on the net, about the relative overhead of gettimeofday as used by pgbench, which is also handling network traffic and waiting for the database to perform actual transactions. I do not thing that I'm assuming, and I'm trying to argument with objective data. Anyway, this particular behavior is preserved by the current version of the patch, so no worry. The additional gettimeofday is *not* perform under standard silent benchmarking, and the standard deviation of the latency is not measured, because it can't. It is however measured under --rate and --progress. It is necessary for rate because otherwise the computed latency is false. It is done for progress because if you are interested to know what is going on, I assume that it would make sense to provide this data. I must admit that I think, un-universaly, that people should care to know that their transaction latency can vary by several order of magnitudes, but this opinion is clearly not shared. I tried to preserve the row-counting behavior because I thought that someone would object otherwise, but I would be really in favor of dropping the row-counting report behavior altogether and only keep the time triggered report. -1 for changing this again. Frankly, I might have come down in a different place if I had understood exactly how this was going to end up being committed; it ended up being quite different from what I was expecting. But I really think that relitigating this just so we can break backward compatibility again one release later is not a good use of anyone's time, or a good idea in general. The current status on my small (SSD) laptop is that pgbench -i throws about 10 lines of 100,000-rows progress report per second. I must be a slow reader because I can't read that fast. So either other users can read much faster than me, or they have slower computers:-) ISTM that it is no big deal to change this kind of thing on a minor contrib tool of postgresql if it is reasonnably better and useful, and I would be surprise and seeing protests about changing an unreadable flow to a readable one. Anyway, let us keep this default behavior, as I hope there must be people who like it. Well, if anyone could tell me that he/she likes better having 10 lines a second thrown on the screen than a regular progress report every few seconds, I would feel less stupid at reinstating this behavior and re-submitting a new version of the patch. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
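On the standard-deviation point: the extra gettimeofday() per transaction under --rate/--progress only needs to feed running aggregates, so no samples have to be stored to report an average and a standard deviation at each progress interval. A sketch (names are illustrative, not the patch's):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Running aggregates for per-transaction latency, in microseconds. */
typedef struct LatencyAgg
{
    int64_t     count;
    double      sum;            /* sum of latencies */
    double      sum2;           /* sum of squared latencies */
} LatencyAgg;

static void
latency_add(LatencyAgg *agg, double latency_us)
{
    agg->count++;
    agg->sum += latency_us;
    agg->sum2 += latency_us * latency_us;
}

static void
latency_report(const LatencyAgg *agg)
{
    double      mean = agg->sum / agg->count;
    double      stddev = sqrt(agg->sum2 / agg->count - mean * mean);

    fprintf(stderr, "latency avg %.3f ms, stddev %.3f ms\n",
            mean / 1000.0, stddev / 1000.0);
}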
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On Wed, Sep 25, 2013 at 6:25 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-09-24 19:56:32 +0200, Andres Freund wrote: On 2013-09-24 13:51:04 -0300, Alvaro Herrera wrote: David Rowley wrote: I do see a 15-18% slowdown with the patched version, so perhaps I'll need to look to see if I can speed it up a bit, although I do feel this benchmark is not quite a normal workload. Ouch. That's certainly way too much. Is the compiler inlining process_log_prefix_padding()? If not, does it do it if you add inline to it? That might speed up things a bit. If that's not enough, maybe you need some way to return to the original coding for the case where no padding is set in front of each option. From a very short look without actually running it I'd guess the issue is all the %* things you're now passing to appendStringInfo (which passes them off to vsnprintf). How does it look without that? That's maybe easy to misunderstand; what I mean is to have an if (padding > 0) around the changed appendStringInfo invocations and use the old ones otherwise. Yeah, I had the same idea to try that next. I suspect that's where the slowdown is, rather than the processing of the padding. I'm thinking these small tweaks are going to make the code a bit ugly, but I agree that the 15-18% slowdown is a no-go. The only other thing apart from checking if padding > 0 is to check if the char after the % is > '9'; in that case it can't be formatting, as we're only allowing '-' and '0' to '9'. I think that's a bit hackish, but perhaps it is acceptable if it helps narrow the performance gap. More benchmarks to follow soon. David Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
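The fast-path idea being discussed amounts to bypassing the printf-family machinery entirely when no padding was requested. The sketch below is standalone and uses invented names (append_padded(), plain sprintf()/strcat()) rather than the backend's appendStringInfo() routines, but it shows the shape of the check: only pay for formatted output when padding is non-zero.

/*
 * Illustrative only: take the cheap path when no padding is requested,
 * and fall back to printf-style width formatting otherwise.
 */
#include <stdio.h>
#include <string.h>

static void
append_padded(char *dst, const char *value, int padding)
{
    if (padding != 0)
        sprintf(dst + strlen(dst), "%*s", padding, value);  /* slow path */
    else
        strcat(dst, value);                                  /* fast path */
}

int
main(void)
{
    char        buf[64] = "";

    append_padded(buf, "psql", 0);      /* no padding: direct append */
    strcat(buf, "|");
    append_padded(buf, "psql", 10);     /* right-justified in 10 columns */
    puts(buf);                          /* prints "psql|      psql" */
    return 0;
}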
Re: [HACKERS] 9.3 Json Array's
I agree with the best-effort type of conversion, and only being able to handle JSON arrays that conform to an SQL array. With that said I would love to collaborate with you on this, but there is one thing holding me back. The current company I work for (an insurance company) says it is a conflict of interest so I have to be careful. I can try to help out in other ways if possible, and I will double check with our HR. On Tue, Sep 24, 2013 at 8:12 AM, Chris Travers ch...@2ndquadrant.com wrote: On 24 September 2013 at 13:46 Andrew Dunstan and...@dunslane.net wrote: Feel free to ask questions. The heart of the API is the event handlers defined in this struct in include/utils/jsonapi.h:
typedef struct JsonSemAction
{
    void *semstate;
    json_struct_action object_start;
    json_struct_action object_end;
    json_struct_action array_start;
    json_struct_action array_end;
    json_ofield_action object_field_start;
    json_ofield_action object_field_end;
    json_aelem_action array_element_start;
    json_aelem_action array_element_end;
    json_scalar_action scalar;
} JsonSemAction;
Basically there is a handler for the start and end of each non-scalar structural element in JSON, plus a handler for scalars. There are several problems that will be posed by processing nested arrays and objects, including:
* in effect you would need to construct a stack of state that could be pushed and popped
True.
* JSON arrays aren't a very good match for SQL arrays - they are unidimensional and heterogeneous.
This is true, but I think one would have to start with an assumption that the data is valid for an SQL type and then check again once one gets it done. JSON is a pretty flexible format, which makes it a poor match in many cases for SQL types generally. But I think the example so far (with json_populate_recordset) is a good one, namely a best-effort conversion. I'm not saying this can't be done - it will just take a bit of effort. Yeah, looking through the code, I think it will be more work than I originally thought, but that just means it will take longer. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers Best Wishes, Chris Travers http://www.2ndquadrant.com PostgreSQL Services, Training, and Support
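On the first point (the stack of state), the shape of what is needed is roughly the following. This is a standalone sketch with invented names (ParseState, container_start(), container_end()); in a real implementation these would be driven from the JsonSemAction object/array start and end handlers above, with semstate pointing at the stack.

#include <stdio.h>

#define MAX_DEPTH 64

typedef enum { IN_OBJECT, IN_ARRAY } ContainerKind;

typedef struct ParseState
{
    ContainerKind stack[MAX_DEPTH];
    int         depth;
} ParseState;

/* called from an object_start/array_start handler */
static void
container_start(ParseState *st, ContainerKind kind)
{
    if (st->depth >= MAX_DEPTH)
    {
        fprintf(stderr, "nesting too deep\n");
        return;
    }
    st->stack[st->depth++] = kind;
}

/* called from an object_end/array_end handler */
static void
container_end(ParseState *st)
{
    if (st->depth > 0)
        st->depth--;
}

int
main(void)
{
    ParseState  st = {{IN_OBJECT}, 0};

    /* simulate parsing [ { ... } ] */
    container_start(&st, IN_ARRAY);
    container_start(&st, IN_OBJECT);
    printf("depth inside object: %d\n", st.depth);  /* 2 */
    container_end(&st);
    container_end(&st);
    printf("depth at end: %d\n", st.depth);         /* 0 */
    return 0;
}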
Re: [HACKERS] 9.3 Json Array's
On Tue, Sep 24, 2013 at 3:14 PM, Adam Jelinek ajeli...@gmail.com wrote: I agree with the best-effort type of conversion, and only being able to handle JSON arrays that conform to an SQL array. With that said I would love to collaborate with you on this, but there is one thing holding me back. The current company I work for (an insurance company) says it is a conflict of interest so I have to be careful. I can try to help out in other ways if possible, and I will double check with our HR. pro tip: don't ask until you've already done the work. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench progress report improvements - split 3
On 09/22/2013 10:44 PM, Fabien COELHO wrote: Due to the possibly repetitive table structure of the data, maybe CSV would make sense as well. It is less extensible, but it is directly usable by sqlite or pg. I'd be OK with CSV. At least I wouldn't be regexing the output. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench progress report improvements - split 3 v2
Split 3 of the initial submission, which actually deals with data measured and reported on stderr under various options. This version currently takes into account many comments by Noah Misch. In particular, the default no-report behavior under benchmarking is not changed, although I really think that it should be. Also, the standard deviation is only shown when available, which is not the case under all settings, because of concerns expressed about the cost of additional calls to gettimeofday. ISTM that these concerns are misplaced in this particular case. Yet another version to fulfill the requirement by Robert Haas not to change current default behaviors. The initialization progress is reported every 100-k rows, and there is no progress report for benchmarking. Improve pgbench measurements & progress report. These changes are coupled because the measures are changed, and their reporting as well. Submitting separate patches for these different features would result in conflicting or dependent patches, so I wish to avoid that if possible.
- Use the same progress and quiet options both under init and bench. However, the default reporting is NOT changed, as required by Robert Haas. It is currently every 100k rows when initializing, and nothing when benchmarking. I would suggest changing this default to a few seconds for both. The 100-k row report is very verbose and unreadable on good hardware. For benchmarking, the rationale is that it is error-prone, as it must run a long time to be significant because of warmup effects (esp. on HDD, less true on SSD). Seeing how the current performance evolves would help pgbench users realise that. See a sample output down below.
- Measure transaction latency instead of computing it, for --rate and --progress. The previously computed figure does not make sense under throttling: as sleep throttling time was included in the figures, the displayed results were plain false. The latency and its standard deviation are printed out under progress and in the final report when available. It could be made always available, but that would require accepting additional gettimeofday calls. I do not think that there is a performance issue here, but there is significant opposition to the idea.
- Take the thread start time at the beginning of the thread (!). Otherwise it includes pretty slow thread/fork system start times in the measurements. May help with bug #8326. This also helps with throttling, as the start time is used to adjust the rate: if it is not the actual start time, there is a bit of a rush at the beginning in order to catch up. Taking the start time when the thread actually starts is the sane thing to do to avoid surprises at possibly strange measures on short runs.
Sample output under initialization with --progress=1
creating tables...
1126000 of 3000000 tuples (37%) done (elapsed 1.00 s, remaining 1.67 s).
2106000 of 3000000 tuples (70%) done (elapsed 2.00 s, remaining 0.85 s).
2824000 of 3000000 tuples (94%) done (elapsed 3.00 s, remaining 0.19 s).
3000000 of 3000000 tuples (100%) done (elapsed 3.19 s, remaining 0.00 s).
vacuum... set primary keys... done.
Sample output under benchmarking with --progress=1
progress: 1.0 s, 2626.1 tps, 0.374 stddev 0.597 ms lat
progress: 2.0 s, 2766.6 tps, 0.358 stddev 0.588 ms lat
progress: 3.0 s, 2567.4 tps, 0.385 stddev 0.665 ms lat
progress: 4.0 s, 3014.2 tps, 0.328 stddev 0.593 ms lat
progress: 5.0 s, 2959.3 tps, 0.334 stddev 0.553 ms lat
...
progress: 16.0 s, 5009.2 tps, 0.197 stddev 0.381 ms lat
...
progress: 24.0 s, 7051.2 tps, 0.139 stddev 0.284 ms lat
...
progress: 50.0 s, 6860.5 tps, 0.143 stddev 0.052 ms lat
...
The warmup is quite fast because the DB is on an SSD. In the beginning the standard deviation is well over the average transaction time, but when the steady state is reached (later) it is well below it. -- Fabien.
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index ad8e272..934244b 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -167,11 +167,13 @@ char *index_tablespace = NULL;
 #define SCALE_32BIT_THRESHOLD 2
 bool use_log; /* log transaction latencies to a file */
-bool use_quiet; /* quiet logging onto stderr */
 int agg_interval; /* log aggregates instead of individual
  * transactions */
-int progress = 0; /* thread progress report every this seconds */
+bool use_quiet = false; /* more or less quiet logging onto stderr */
+int progress = -1; /* report every this seconds.
+ 0 is no report, -1 is not set yet */
 int progress_nclients = 0; /* number of clients for progress report */
+int progress_nthreads = 0; /* number of threads for progress report */
 bool is_connect; /* establish connection for each transaction */
 bool is_latencies; /* report per-command latencies */
 int main_pid;
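For reference, the standard deviation shown in those progress lines can be maintained without keeping every sample: accumulate the count, the sum, and the sum of squares of the per-transaction latencies, and derive the figures at report time. The following standalone sketch uses invented names (LatencyStats, latency_account(), latency_report()); it illustrates the calculation, not pgbench's actual code.

#include <math.h>
#include <stdio.h>

typedef struct LatencyStats
{
    long        count;
    double      sum;        /* sum of latencies, in microseconds */
    double      sum2;       /* sum of squared latencies */
} LatencyStats;

static void
latency_account(LatencyStats *s, double usec)
{
    s->count++;
    s->sum += usec;
    s->sum2 += usec * usec;
}

static void
latency_report(const LatencyStats *s)
{
    double      avg = s->sum / s->count;
    double      stddev = sqrt(s->sum2 / s->count - avg * avg);

    printf("%.3f stddev %.3f ms lat\n", avg / 1000.0, stddev / 1000.0);
}

int
main(void)
{
    LatencyStats s = {0, 0.0, 0.0};
    double      samples[] = {350.0, 420.0, 280.0, 1900.0, 330.0};
    int         i;

    for (i = 0; i < 5; i++)
        latency_account(&s, samples[i]);
    latency_report(&s);     /* one slow transaction dominates the stddev */
    return 0;
}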
Re: [HACKERS] Suggestion: Issue warning when calling SET TRANSACTION outside transaction block
On Thu, Sep 12, 2013 at 09:38:43AM +0530, Amit Kapila wrote: I have created the attached patch which issues an error when SET TRANSACTION and SET LOCAL are used outside of transactions: test= set transaction isolation level serializable; ERROR: SET TRANSACTION can only be used in transaction blocks test= reset transaction isolation level; ERROR: RESET TRANSACTION can only be used in transaction blocks test= set local effective_cache_size = '3MB'; ERROR: SET LOCAL can only be used in transaction blocks test= set local effective_cache_size = default; ERROR: SET LOCAL can only be used in transaction blocks Shouldn't we do it for Set Constraints as well? Oh, very good point. I missed that one. Updated patch attached. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c new file mode 100644 index b1023c4..3ffdfe0 *** a/src/backend/tcop/utility.c --- b/src/backend/tcop/utility.c *** standard_ProcessUtility(Node *parsetree, *** 688,694 break; case T_VariableSetStmt: ! ExecSetVariableStmt((VariableSetStmt *) parsetree); break; case T_VariableShowStmt: --- 688,694 break; case T_VariableSetStmt: ! ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel); break; case T_VariableShowStmt: *** standard_ProcessUtility(Node *parsetree, *** 754,759 --- 754,760 break; case T_ConstraintsSetStmt: + RequireTransactionChain(isTopLevel, SET CONSTRAINTS); AfterTriggerSetState((ConstraintsSetStmt *) parsetree); break; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c new file mode 100644 index 3107f9c..ff39920 *** a/src/backend/utils/misc/guc.c --- b/src/backend/utils/misc/guc.c *** flatten_set_variable_args(const char *na *** 6252,6258 * SET command */ void ! ExecSetVariableStmt(VariableSetStmt *stmt) { GucAction action = stmt-is_local ? GUC_ACTION_LOCAL : GUC_ACTION_SET; --- 6252,6258 * SET command */ void ! ExecSetVariableStmt(VariableSetStmt *stmt, bool isTopLevel) { GucAction action = stmt-is_local ? GUC_ACTION_LOCAL : GUC_ACTION_SET; *** ExecSetVariableStmt(VariableSetStmt *stm *** 6260,6265 --- 6260,6267 { case VAR_SET_VALUE: case VAR_SET_CURRENT: + if (stmt-is_local) + RequireTransactionChain(isTopLevel, SET LOCAL); (void) set_config_option(stmt-name, ExtractSetVariableArgs(stmt), (superuser() ? PGC_SUSET : PGC_USERSET), *** ExecSetVariableStmt(VariableSetStmt *stm *** 6269,6275 0); break; case VAR_SET_MULTI: - /* * Special-case SQL syntaxes. The TRANSACTION and SESSION * CHARACTERISTICS cases effectively set more than one variable --- 6271,6276 *** ExecSetVariableStmt(VariableSetStmt *stm *** 6281,6286 --- 6282,6289 { ListCell *head; + RequireTransactionChain(isTopLevel, SET TRANSACTION); + foreach(head, stmt-args) { DefElem*item = (DefElem *) lfirst(head); *** ExecSetVariableStmt(VariableSetStmt *stm *** 6325,6330 --- 6328,6335 { A_Const*con = (A_Const *) linitial(stmt-args); + RequireTransactionChain(isTopLevel, SET TRANSACTION); + if (stmt-is_local) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), *** ExecSetVariableStmt(VariableSetStmt *stm *** 6338,6344 --- 6343,6355 stmt-name); break; case VAR_SET_DEFAULT: + if (stmt-is_local) + RequireTransactionChain(isTopLevel, SET LOCAL); + /* fall through */ case VAR_RESET: + if (strcmp(stmt-name, transaction_isolation) == 0) + RequireTransactionChain(isTopLevel, RESET TRANSACTION); + (void) set_config_option(stmt-name, NULL, (superuser() ? 
PGC_SUSET : PGC_USERSET), diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h new file mode 100644 index 99211c1..89ee40c *** a/src/include/utils/guc.h --- b/src/include/utils/guc.h *** extern void SetPGVariable(const char *na *** 334,340 extern void GetPGVariable(const char *name, DestReceiver *dest); extern TupleDesc GetPGVariableResultDesc(const char *name); ! extern void ExecSetVariableStmt(VariableSetStmt *stmt); extern char *ExtractSetVariableArgs(VariableSetStmt *stmt); extern void ProcessGUCArray(ArrayType *array, --- 334,340 extern void GetPGVariable(const char *name, DestReceiver *dest); extern TupleDesc GetPGVariableResultDesc(const char *name); ! extern void ExecSetVariableStmt(VariableSetStmt *stmt, bool isTopLevel); extern
Re: [HACKERS] unaccent module - two params function should be immutable
On Tue, Sep 17, 2013 at 10:15:47AM -0400, Robert Haas wrote: On Sat, Sep 14, 2013 at 9:42 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I have developed the attached patch based on your suggestion. I did not see anything in the code that would make it STABLE, except a lookup of a dictionary library: dictOid = get_ts_dict_oid(stringToQualifiedNameList("unaccent"), false); yes, it is a risk, but it is similar with timezones and other external data. That's a catalog lookup - not a reliance on external data. Sorry, I was wrong. Only unaccent_dict() calls get_ts_dict_oid(), and we aren't changing the signature of that function. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Jaime Casanova wrote: Found another problem with the this steps: create table t1 (i int); create index idx_t1_i on t1 using minmax(i); insert into t1 select generate_series(1, 200); ERROR: could not read block 1 in file base/12645/16397_vm: read only 0 of 8192 bytes Thanks. This was a trivial off-by-one bug; fixed in the attached patch. While studying it, I noticed that I was also failing to notice extension of the fork by another process. I have tried to fix that also in the current patch, but I'm afraid that a fully robust solution for this will involve having a cached fork size in the index's relcache entry -- just like we have smgr_vm_nblocks. In fact, since the revmap fork is currently reusing the VM forknum, I might even be able to use the same variable to keep track of the fork size. But I don't really like this bit of reusing the VM forknum for revmap, so I've refrained from extending that assumption into further code for the time being. There was also a bug that we would try to initialize a revmap page twice during recovery, if two backends thought they needed to extend it; that would cause the data written by the first extender to be lost. This patch applies on top of the two previous incremental patches. I will send a full patch later, including all those fixes and the fix for the opr_sanity regression test. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services *** a/src/backend/access/minmax/mmrevmap.c --- b/src/backend/access/minmax/mmrevmap.c *** *** 30,36 #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \ ((heapBlk / pagesPerRange) % IDXITEMS_PER_PAGE) ! static void mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno); /* typedef appears in minmax_revmap.h */ struct mmRevmapAccess --- 30,36 #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \ ((heapBlk / pagesPerRange) % IDXITEMS_PER_PAGE) ! static bool mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno); /* typedef appears in minmax_revmap.h */ struct mmRevmapAccess *** *** 52,62 mmRevmapAccessInit(Relation idxrel, BlockNumber pagesPerRange) { mmRevmapAccess *rmAccess = palloc(sizeof(mmRevmapAccess)); rmAccess-idxrel = idxrel; rmAccess-pagesPerRange = pagesPerRange; rmAccess-currBuf = InvalidBuffer; rmAccess-physPagesInRevmap = ! RelationGetNumberOfBlocksInFork(idxrel, MM_REVMAP_FORKNUM); return rmAccess; } --- 52,64 { mmRevmapAccess *rmAccess = palloc(sizeof(mmRevmapAccess)); + RelationOpenSmgr(idxrel); + rmAccess-idxrel = idxrel; rmAccess-pagesPerRange = pagesPerRange; rmAccess-currBuf = InvalidBuffer; rmAccess-physPagesInRevmap = ! smgrnblocks(idxrel-rd_smgr, MM_REVMAP_FORKNUM); return rmAccess; } *** *** 111,121 mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk, /* * If the revmap is out of space, extend it first. */ ! if (mapBlk rmAccess-physPagesInRevmap - 1) ! { ! mmRevmapExtend(rmAccess, mapBlk); ! extend = true; ! } /* * Obtain the buffer from which we need to read. If we already have the --- 113,120 /* * If the revmap is out of space, extend it first. */ ! if (mapBlk = rmAccess-physPagesInRevmap) ! extend = mmRevmapExtend(rmAccess, mapBlk); /* * Obtain the buffer from which we need to read. If we already have the *** *** 202,211 mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk, mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess-pagesPerRange, heapBlk); ! if (mapBlk rmAccess-physPagesInRevmap) { ! ItemPointerSetInvalid(out); ! 
return; } if (rmAccess-currBuf == InvalidBuffer || --- 201,229 mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess-pagesPerRange, heapBlk); ! /* ! * If we are asked for a block of the map which is beyond what we know ! * about it, try to see if our fork has grown since we last checked its ! * size; a concurrent inserter could have extended it. ! */ ! if (mapBlk = rmAccess-physPagesInRevmap) { ! RelationOpenSmgr(rmAccess-idxrel); ! LockRelationForExtension(rmAccess-idxrel, ShareLock); ! rmAccess-physPagesInRevmap = ! smgrnblocks(rmAccess-idxrel-rd_smgr, MM_REVMAP_FORKNUM); ! ! if (mapBlk = rmAccess-physPagesInRevmap) ! { ! /* definitely not in range */ ! ! UnlockRelationForExtension(rmAccess-idxrel, ShareLock); ! ItemPointerSetInvalid(out); ! return; ! } ! ! /* the block exists now, proceed */ ! UnlockRelationForExtension(rmAccess-idxrel, ShareLock); } if (rmAccess-currBuf == InvalidBuffer || *** *** 273,286 mmRevmapCreate(Relation idxrel) } /* ! * Extend the reverse range map to cover the given block number. * * NB -- caller is responsible for ensuring this action is properly WAL-logged. */ ! static void
Re: [HACKERS] record identical operator
Stephen Frost sfr...@snowman.net wrote: I think the conservative (and therefore correct) approach is to decide that we're always going to update if we detect any difference at all. And there may be users who are surprised that a refresh changed parts of the table that have nothing to do with what was changed in the underlying relation, because a different plan was used and the result ended up being binary-different. Binary different for equal values could include box (or other geometry) objects moved to completely new locations and/or not quite the same size. Here is v2 of the patch which changes from the universally disliked operator names v1 used. It also fixes bugs in the row comparisons for pass-by-reference types, fixes a couple nearby comments, and adds regression tests for a matview containing a box column. The box type is interesting in that its = operator only worries about the area of the box, and considers two boxes with no more than EPSILON difference to be equal. This means that boxes at totally different locations can be equal, and that if A = B and B = C it is not necessarily true that A = C. This doesn't cause too much trouble in general because boxes don't have btree opclasses. Since there are types, including core types, without a default btree opclass, materialized views containing them cannot use RMVC as currently committed. This patch would fix that, or we could rip out the current implementation and go to the delete everything and insert the entire new matview contents approach. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company*** a/contrib/citext/expected/citext.out --- b/contrib/citext/expected/citext.out *** *** 2276,2278 SELECT like_escape( name::text, ''::citext ) = like_escape( name::text, '' ) AS --- 2276,2319 t (5 rows) + -- Ensure correct behavior for citext with materialized views. + CREATE TABLE citext_table ( + id serial primary key, + name citext + ); + INSERT INTO citext_table (name) + VALUES ('one'), ('two'), ('three'), (NULL), (NULL); + CREATE MATERIALIZED VIEW citext_matview AS + SELECT * FROM citext_table; + CREATE UNIQUE INDEX citext_matview_id + ON citext_matview (id); + SELECT * + FROM citext_matview m + FULL JOIN citext_table t ON (t.id = m.id AND t *= m) + WHERE t.id IS NULL OR m.id IS NULL; + id | name | id | name + +--++-- + (0 rows) + + UPDATE citext_table SET name = 'Two' WHERE name = 'TWO'; + SELECT * + FROM citext_matview m + FULL JOIN citext_table t ON (t.id = m.id AND t *= m) + WHERE t.id IS NULL OR m.id IS NULL; + id | name | id | name + +--++-- + | | 2 | Two + 2 | two || + (2 rows) + + REFRESH MATERIALIZED VIEW CONCURRENTLY citext_matview; + SELECT * FROM citext_matview ORDER BY id; + id | name + +--- + 1 | one + 2 | Two + 3 | three + 4 | + 5 | + (5 rows) + *** a/contrib/citext/expected/citext_1.out --- b/contrib/citext/expected/citext_1.out *** *** 2276,2278 SELECT like_escape( name::text, ''::citext ) = like_escape( name::text, '' ) AS --- 2276,2319 t (5 rows) + -- Ensure correct behavior for citext with materialized views. 
+ CREATE TABLE citext_table ( + id serial primary key, + name citext + ); + INSERT INTO citext_table (name) + VALUES ('one'), ('two'), ('three'), (NULL), (NULL); + CREATE MATERIALIZED VIEW citext_matview AS + SELECT * FROM citext_table; + CREATE UNIQUE INDEX citext_matview_id + ON citext_matview (id); + SELECT * + FROM citext_matview m + FULL JOIN citext_table t ON (t.id = m.id AND t *= m) + WHERE t.id IS NULL OR m.id IS NULL; + id | name | id | name + +--++-- + (0 rows) + + UPDATE citext_table SET name = 'Two' WHERE name = 'TWO'; + SELECT * + FROM citext_matview m + FULL JOIN citext_table t ON (t.id = m.id AND t *= m) + WHERE t.id IS NULL OR m.id IS NULL; + id | name | id | name + +--++-- + | | 2 | Two + 2 | two || + (2 rows) + + REFRESH MATERIALIZED VIEW CONCURRENTLY citext_matview; + SELECT * FROM citext_matview ORDER BY id; + id | name + +--- + 1 | one + 2 | Two + 3 | three + 4 | + 5 | + (5 rows) + *** a/contrib/citext/sql/citext.sql --- b/contrib/citext/sql/citext.sql *** *** 711,713 SELECT COUNT(*) = 19::bigint AS t FROM try; --- 711,736 SELECT like_escape( name, '' ) = like_escape( name::text, '' ) AS t FROM srt; SELECT like_escape( name::text, ''::citext ) = like_escape( name::text, '' ) AS t FROM srt; + + -- Ensure correct behavior for citext with materialized views. + CREATE TABLE citext_table ( + id serial primary key, + name citext + ); + INSERT INTO citext_table (name) + VALUES ('one'), ('two'), ('three'), (NULL), (NULL); + CREATE MATERIALIZED VIEW citext_matview AS + SELECT * FROM citext_table; + CREATE UNIQUE INDEX citext_matview_id + ON citext_matview (id); +
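Kevin's observation about the box type's = operator (two boxes whose areas differ by no more than EPSILON are considered equal, so A = B and B = C does not imply A = C) is the classic reason tolerance-based equality cannot back a btree opclass. The tiny standalone C example below shows the general phenomenon; the EPSILON value and numbers are made up here, and this is not PostgreSQL's geometry code.

#include <math.h>
#include <stdio.h>

#define EPSILON 1.0E-06

/* "equal within a tolerance", as a box-style = operator might do */
static int
fuzzy_eq(double a, double b)
{
    return fabs(a - b) <= EPSILON;
}

int
main(void)
{
    double      a = 1.0000000;
    double      b = 1.0000008;
    double      c = 1.0000016;

    printf("a = b: %d\n", fuzzy_eq(a, b));  /* 1: within EPSILON        */
    printf("b = c: %d\n", fuzzy_eq(b, c));  /* 1: within EPSILON        */
    printf("a = c: %d\n", fuzzy_eq(a, c));  /* 0: not transitive        */
    return 0;
}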
Re: [HACKERS] Some interesting news about Linux 3.12 OOM
On Sep 24, 2013 10:12 AM, Josh Berkus j...@agliodbs.com wrote: All, I've sent kernel.org a message that we're keen on seeing these changes get committed. I thought it was merged already in 3.12. There are a few related patches, but here's one: commit 519e52473ebe9db5cdef44670d5a97f1fd53d721 Author: Johannes Weiner han...@cmpxchg.org Date: Thu Sep 12 15:13:42 2013 -0700 mm: memcg: enable memcg OOM killer only for user faults System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner han...@cmpxchg.org Acked-by: Michal Hocko mho...@suse.cz Cc: David Rientjes rient...@google.com Cc: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Cc: azurIt azu...@pobox.sk Cc: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com Signed-off-by: Andrew Morton a...@linux-foundation.org Signed-off-by: Linus Torvalds torva...@linux-foundation.org $ git tag --contains 519e52473ebe9db5cdef44670d5a97f1fd53d721 v3.12-rc1 v3.12-rc2 Searching for recent work by Johannes Weiner shows the pertinent stuff more exhaustively. BTW, in the future if anyone sees kernel.org contemplating a patch which helps or hurts Postgres, don't hesitate to speak up to them. They don't get nearly enough feedback from DB developers. I don't hesitate; most of the time I simply don't know. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] bitmap indexes
At 2013-09-24 09:51:00 -0700, jeff.ja...@gmail.com wrote: I get wrong answers from this index sometimes. Thanks for the report and the test case. I'll investigate. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] record identical operator
On Tue, Sep 24, 2013 at 2:22 PM, Stephen Frost sfr...@snowman.net wrote: * Robert Haas (robertmh...@gmail.com) wrote: Now I admit that's an arguable point. We could certainly define an intermediate notion of equality that is more equal than whatever = does, but not as equal as exact binary equality. I suggested it up-thread and don't recall seeing a response, so here it is again: passing the data through the binary-out function for the type and comparing *that* would allow us to change the internal binary representation of data types and would be something which we could at least explain to users as to why X isn't the same as Y according to this binary operator. I think the conservative (and therefore correct) approach is to decide that we're always going to update if we detect any difference at all. And there may be users who are surprised that a refresh changed parts of the table that have nothing to do with what was changed in the underlying relation, because a different plan was used and the result ended up being binary-different. It's easy to explain to users why that would be when we're doing a wholesale replace, but I don't think it'll be nearly as clear why that happened when we're not replacing the whole table and why REFRESH can basically end up changing anything (but doesn't consistently). If we're paying attention to the records changed and only updating the matview's records when they're involved, that becomes pretty clear. What's happening here feels very much like unintended consequences. FWIW you make some interesting points (I did a triple take on the plan-dependent changes) but I'm 100% OK with the proposed behavior. Matviews satisfy 'truth' as *defined by the underlying query only*. This is key: there may be N candidate 'truths' for that query: it's not IMNSHO reasonable to expect the post-refresh truth to be approximately based on the pre-refresh truth. Even if the implementation happened to do what you're asking for, it would only be demonstrating undefined but superficially useful behavior... a good analogy would be the old scan behavior where an unordered scan would come up in 'last update order'. That (again, superficially useful) behavior was undefined and we reserved the right to change it. And we did. Unnecessarily defined behaviors defeat future performance optimizations. So Kevin's patch AIUI defines a hitherto non-user-accessible (except in the very special case of row-wise comparison) mechanic to try and cut down the number of rows that *must* be refreshed. It may or may not do a good job at that on a situational basis -- if it was always better we'd probably be overriding the default behavior. I don't think it's astonishing at all for a matview to pseudo-randomly adjust case over a citext column; that's just part of the deal with equality-ambiguous types. As long as the matview doesn't expose a dataset that was impossible to have been generated by the underlying query, I'm good. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Tue, Sep 24, 2013 at 7:35 AM, Robert Haas robertmh...@gmail.com wrote: I don't really disagree with any of that. TBH, I think the question of how long value locks (as you call them) are held is going to boil down to a question of how they end up being implemented. Well, I think we can rule out value locks that are held for the duration of a transaction right away. That's just not going to fly. As I mentioned to you at PG Open (going through the details here for those following along at home), we could optimistically insert the new heap tuple, then go add index entries for it, and if we find a conflict, then instead of erroring out, we mark the tuple we were inserting dead and go try update the conflicting tuple instead. In that implementation, if we find that we have to wait for some other transaction along the way, it's probably not worth reversing out the index entries already inserted, because getting them into the index in the first place was a WAL-logged operation, and therefore relatively expensive, and IMHO it's most likely better to just hope things work out than to risk having to redo all of that. I'm afraid that there are things that concern me about this design. It does have one big advantage over promise-tuples, which is that the possibility of index-only bloat, and even the possible need to freeze indexes separately from their heap relation is averted (or are you going to have recovery do promise clean-up instead? Does recovery look for an eventual successful insertion relating to the promise? How far does it look?). However, while I'm just as concerned as you that backing out is too expensive, I'm equally concerned that there is no reasonable alternative to backing out, which is why cheap, quick in-memory value locks are so compelling to me. See my remarks below. On the other hand, if the locks are strictly in-memory, then the cost of releasing them all before we go to wait, and of reacquiring them after we finish waiting, is pretty low. There might be some modularity issues to work through there, but they might not turn out to be very painful, and the advantages you mention are certainly worth accruing if it turns out to be fairly straightforward. It's certainly a difficult situation to judge. Personally, I think that trying to keep it all in-memory is going to be hard. The problem is that we can't de-optimize regular inserts or updates to any significant degree to cater to this feature - because as valuable as this feature is, the number of times it gets used is still going to be a whole lot smaller than the number of times it doesn't get used. Right - I don't think that anyone would argue that any other standard should be applied. Fortunately, I'm reasonably confident that it can work. The last part of index tuple insertion, where we acquire an exclusive lock on a buffer, needs to look out for a page header bit (on pages considered for insertion of its value). The cost of that to anyone not using this feature is likely to be infinitesimally small. We can leave clean-up of that bit to the next inserter, who needs the exclusive lock anyway and doesn't find a corresponding SLRU entry. But really, that's a discussion for another day. I think we'd want to track value locks per pinned-by-upserter buffer, to localize any downsides on concurrency. If we forbid page-splits in respect of a value-locked page, we can still have a stable value (buffer number) to use within a shared memory hash table, or something along those lines. 
We're still going to want to minimize the duration of locking under this scheme, by doing TOASTing before locking values and so on, which is quite possible. If we're really lucky, maybe the value locking stuff can be generalized or re-used as part of a btree index insertion buffer feature. Also, I tend to think that we might want to define the operation as a REPLACE-type operation with respect to a certain set of key columns; and so we'll do the insert-or-update behavior with respect only to the index on those columns and let the chips fall where they may with respect to any others. In that case this all becomes much less urgent. Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT, not try an INSERT, then maybe mark the heap tuple if there's a unique index dup and then go UPDATE the conflicting tuple. I mention this only because the term REPLACE has a certain baggage, and I feel it's important to be careful about such things. The only way that's going to work is if you say use this unique index, which will look pretty gross in DML. That might actually be okay with me if we had somewhere to go from there in a future release, but I doubt that's the case. Another issue is that I'm not sure that this helps Andres much (or rather, clients of the logical changeset generation infrastructure that need to do conflict resolution), and that matters a lot to me here. Suppose we define the operation as REPLACE rather
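To make the "cheap, quick in-memory value locks" idea a bit more concrete, here is a toy, single-process sketch of the kind of structure being described: a small table keyed by (index buffer, hash of the proposed key value), whose entries can be dropped wholesale before blocking on another transaction and retaken afterwards. Every name and size here is hypothetical; a real version would live in shared memory (or an SLRU tied to the page-header-bit scheme above) with proper locking and collision handling.

#include <stdio.h>
#include <stdint.h>

#define N_VALUE_LOCK_SLOTS 128

typedef struct ValueLock
{
    int         buffer;         /* index buffer the value would go on */
    uint32_t    valuehash;      /* hash of the would-be-unique value */
    int         holder;         /* backend id, 0 = free */
} ValueLock;

static ValueLock value_locks[N_VALUE_LOCK_SLOTS];

/* returns 1 if the lock was taken, 0 if someone else holds it */
static int
value_lock_acquire(int buffer, uint32_t valuehash, int backend)
{
    int         slot = (buffer * 31 + valuehash) % N_VALUE_LOCK_SLOTS;
    ValueLock  *vl = &value_locks[slot];

    if (vl->holder != 0 && vl->holder != backend)
        return 0;               /* caller must wait (or go update) */
    vl->buffer = buffer;
    vl->valuehash = valuehash;
    vl->holder = backend;
    return 1;
}

/* cheap to drop before blocking on another xact, retake afterwards */
static void
value_lock_release_all(int backend)
{
    int         i;

    for (i = 0; i < N_VALUE_LOCK_SLOTS; i++)
        if (value_locks[i].holder == backend)
            value_locks[i].holder = 0;
}

int
main(void)
{
    printf("backend 1 locks: %d\n", value_lock_acquire(7, 0xdeadbeef, 1));
    printf("backend 2 locks: %d\n", value_lock_acquire(7, 0xdeadbeef, 2));
    value_lock_release_all(1);
    printf("backend 2 retry: %d\n", value_lock_acquire(7, 0xdeadbeef, 2));
    return 0;
}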
Re: [HACKERS] ENABLE/DISABLE CONSTRAINT NAME
On Tue, 2013-09-24 at 11:58 +0200, Bernd Helmle wrote: Hmm not sure i understand this argument either: this patch doesn't allow disabling a primary key. It only supports FKs and CHECK constraints explicitly. Well, as soon as the patch for cataloging not-null constraints as check constraints is available, it will be possible to create views that depend functionally on check constraints. Then you'll have the same problem there. It's also not clear why this patch only supports foreign keys and check constraints. Maybe that's what was convenient to implement, but it's not a principled solution to the general issue that constraints can be involved in dependencies. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote: 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of Oracle Database Globalization Support Guide. But your proposal wouldn't address the first point, because data would have to go client -> server -> NCHAR. The second point is valid, but it's going to be an awful amount of work for that limited result. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory
On Tue, Sep 24, 2013 at 9:49 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund and...@2ndquadrant.com wrote: Hm, I guess you don't want to add it to global/ or so because of the mmap implementation where you presumably scan the directory? Yes, and also because I thought this way would make it easier to teach things like pg_basebackup (or anybody's home-brew scripts) to just skip that directory completely. Actually, I was wondering if we ought to have a directory under pgdata whose explicit charter it was to contain files that shouldn't be copied as part of a base backup. pg_do_not_back_this_up. Wondered exactly about that as soon as you mentioned pg_basebackup. pg_local/? That seems reasonable. It's not totally transparent what that's supposed to mean, but it's fairly mnemonic once you know. Other suggestions welcome, if anyone has ideas. Are there any other likely candidates for inclusion in that directory other than this stuff? pgsql_tmp. Refer to sendDir() in basebackup.c; there we avoid sending such files in a backup. Some future features, like ALTER SYSTEM, could also use it for temp files. + /* Create or truncate the file. */ + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600); Doesn't this need a | PG_BINARY? It's a text file. Do we need PG_BINARY anyway? I'd say yes. Non-binary mode stuff on Windows does stuff like transforming LF <=> CRLF on reading/writing, which makes sizes not match up and similar ugliness. Imo there's little reason to use non-binary mode for anything written for postgres' own consumption. Well, I'm happy to do whatever the consensus is. AFAICT you and Noah are both for it, and Amit's position is that it doesn't matter either way. I am sorry if my mails didn't say that I am in favour of keeping the code as it is, unless there is really a case which requires it. Basically, as per my understanding, I have presented some facts in the above mails which indicate that there is no need for PG_BINARY in this case. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
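Andres's LF/CRLF point is easy to reproduce with a few lines of standalone C: on Windows, text-mode stdio expands \n to \r\n on write, so the two files below end up with different sizes (14 vs 12 bytes), while on POSIX systems they are identical. The file names here are arbitrary; this is only a demonstration of the text-vs-binary mode difference, not of the dynamic shared memory state file itself.

#include <stdio.h>

static long
write_and_size(const char *path, const char *mode)
{
    FILE       *f = fopen(path, mode);
    long        size;

    fputs("line1\nline2\n", f);     /* 12 characters as written */
    fclose(f);

    f = fopen(path, "rb");          /* measure the raw on-disk bytes */
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    fclose(f);
    return size;
}

int
main(void)
{
    printf("text mode  : %ld bytes\n", write_and_size("state_text.tmp", "w"));
    printf("binary mode: %ld bytes\n", write_and_size("state_bin.tmp", "wb"));
    return 0;
}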