[HACKERS] Bloom Filter lookup for hash joins
Hi All, I have been researching bloom filters and discussed it on IRC with RhodiumToad and David Fetter, and they pointed me to the various places that could potentially have bloom filters, apart from the places that already have them currently. I have been reading the current implementation of hash joins, and in ExecScanHashBucket, which I understand is the actual lookup function, we could potentially look at a bloom filter per bucket. Instead of actually looking up each hash value for the outer relation, we could just check the corresponding bloom filter for that bucket, and if we get a positive, then lookup the actual values i.e. continue with our current behaviour (since we could be looking at a false positive). This doesn't involve a lot of new logic. We need to add a bloom filter in HashJoinState and set it when the hash table is being made. Also, one thing we need to look at is the set of hash functions being used for the bloom filters. This can be a point of further discussion. The major potential benefit we could gain is that bloom filters never give false negatives. So, we could save a lot of lookup where the corresponding bloom filter gives a negative. This approach can also be particularly useful for hash anti joins, where we look for negatives. Since bloom filters can easily be stored and processed, by straight bit operations, we could be looking at a lot of saving here. I am not sure if we already do something like this. Please direct me to the relevant parts in the code if we already do. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
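To make the idea above concrete, here is a minimal standalone sketch of a per-bucket Bloom filter of the kind described: the filter is populated from the inner tuples' hash values while the hash table is built, and probed before walking a bucket chain for an outer tuple. All names, sizes, and hash choices below are illustrative assumptions, not PostgreSQL code.

#include <stdint.h>
#include <stdbool.h>

/*
 * Toy per-bucket Bloom filter: one 64-bit word per bucket, set from two
 * cheap hashes derived from the join hash value.  Sizes and hash choices
 * are illustrative only.
 */
typedef struct BucketBloom
{
    uint64_t    bits;
} BucketBloom;

static inline uint64_t
bloom_mask(uint32_t hashvalue)
{
    /* two "hash functions" derived from the join hash value */
    uint32_t    h1 = hashvalue % 64;
    uint32_t    h2 = (hashvalue >> 6) % 64;

    return ((uint64_t) 1 << h1) | ((uint64_t) 1 << h2);
}

/* called for each inner tuple while the hash table is being built */
static inline void
bloom_add(BucketBloom *b, uint32_t hashvalue)
{
    b->bits |= bloom_mask(hashvalue);
}

/*
 * Called before scanning a bucket for an outer tuple.  A false result is
 * definitive (no match in this bucket); a true result may be a false
 * positive, so the caller still walks the bucket chain as today.
 */
static inline bool
bloom_might_contain(const BucketBloom *b, uint32_t hashvalue)
{
    uint64_t    mask = bloom_mask(hashvalue);

    return (b->bits & mask) == mask;
}

A negative probe skips the bucket scan entirely, which is where the hoped-for savings would come from; a positive probe falls through to the existing lookup, so a false positive costs no more than the current behaviour.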
Re: [HACKERS] Add more regression tests for dbcommands
Hi Andres, Attached is a patch which does not CREATE DATABASE, but now the regression tests do not test the following: - ALTER DATABASE RENAME TO is not allowed on a database in use. Had to remove two tests that were using this. - ALTER DATABASE SET TABLESPACE is also not allowed on a database in use, so removed it (the existing test was supposed to fail too, but it was to fail at a later stage in the function when checking for whether the tablespace exists, but with the 'regression' database, it just doesn't even reach that part) - The CREATE DATABASE test itself was checking whether the 'CONNECTION LIMIT' was working. Removed that as well. Code coverage improved from 36% to 68%. -- Robins Tharakan On 24 June 2013 07:47, Andres Freund and...@2ndquadrant.com wrote: On 2013-05-21 02:58:25 +0530, Robins Tharakan wrote: Attached is an updated patch that does only 1 CREATE DATABASE and reuses that for all other tests. The code-coverage with this patch goes up from 36% to 70%. Even creating one database seems superfluous. The plain CREATE DATABASE has been tested due to the creation of the regression database itself anyway? All the other tests should be doable using the normal regression db? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services regress_db_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
On 26 June 2013 01:22, Josh Berkus j...@agliodbs.com wrote: Folks, (the below was already discussed on IRC) Leaving names aside on this patch, I'm wondering about a piece of functionality I have with the current unnest() and with the unnest_ordinality()[1] extension: namely, the ability to unnest several arrays in parallel by using unnest() in the target list. For example, given the table:

lotsarrays ( id serial PK, arr1 int[], arr2 numeric[], arr3 boolean[] )

I can currently do:

SELECT id, unnest(arr1) as arr1, unnest(arr2) as arr2, unnest(arr3) as arr3 FROM lotsarrays;

... and if arr1, 2 and 3 are exactly the same length, this creates a coordinated dataset. I can even use the unnest_ordinality() extension function to get the ordinality of this combined dataset:

SELECT id, (unnest_ordinality(arr1)).element_number as array_index, unnest(arr1) as arr1, unnest(arr2) as arr2, unnest(arr3) as arr3 FROM lotsarrays;

There are reasons why this will be complicated to implement WITH ORDINALITY; DF, Andrew and I discussed them on IRC. So allowing WITH ORDINALITY in the target list is a TODO, either for later in 9.4 development, or for 9.5. So, this isn't stopping the patch; I just want a TODO for implement WITH ORDINALITY in the target list for SRFs.

So if I'm understanding correctly, your issue is that WITH ORDINALITY is currently only accepted on SRFs in the FROM list, not that it isn't working as expected in any way. I have no objection to adding a TODO item to extend it, but note that the restriction is trivial to work around:

CREATE TABLE lotsarrays ( id serial primary key, arr1 int[], arr2 numeric[], arr3 boolean[] );

INSERT INTO lotsarrays(arr1, arr2, arr3) VALUES
  (ARRAY[1,2], ARRAY[1.1, 2.2], ARRAY[true, false]),
  (ARRAY[10,20,30], ARRAY[10.1, 20.2, 30.3], ARRAY[true, false, true]);

CREATE OR REPLACE FUNCTION unnest_ordinality(anyarray)
  RETURNS TABLE(element_number bigint, element anyelement) AS $$
  SELECT ord, elt FROM unnest($1) WITH ORDINALITY AS t(elt, ord)
$$ LANGUAGE sql STRICT IMMUTABLE;

SELECT id, (unnest_ordinality(arr1)).element_number as array_index,
       unnest(arr1) as arr1, unnest(arr2) as arr2, unnest(arr3) as arr3
  FROM lotsarrays;

 id | array_index | arr1 | arr2 | arr3
----+-------------+------+------+------
  1 |           1 |    1 |  1.1 | t
  1 |           2 |    2 |  2.2 | f
  2 |           1 |   10 | 10.1 | t
  2 |           2 |   20 | 20.2 | f
  2 |           3 |   30 | 30.3 | t
(5 rows)

Personally I'm not a fan of SRFs in the select list, especially not multiple SRFs there, since the results are hard to deal with if they return differing numbers of rows. So I would tend to write this as a LATERAL FULL join on the ordinality columns:

SELECT id, COALESCE(u1.ord, u2.ord, u3.ord) AS array_index, u1.arr1, u2.arr2, u3.arr3
  FROM lotsarrays,
       unnest(arr1) WITH ORDINALITY AS u1(arr1, ord)
       FULL JOIN unnest(arr2) WITH ORDINALITY AS u2(arr2, ord) ON u2.ord = u1.ord
       FULL JOIN unnest(arr3) WITH ORDINALITY AS u3(arr3, ord) ON u3.ord = COALESCE(u1.ord, u2.ord);

Either way, I think the WITH ORDINALITY patch is working as expected. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions
On 26 June 2013 01:40, Steve Singer st...@ssinger.info wrote: On 06/25/2013 06:42 AM, Szymon Guz wrote: Hi, I've attached a new patch. I've fixed all the problems you've found, except for the efficiency problem, which has been described in previous email. thanks, Szymon This version of the patch addresses the issues I mentioned. Thanks for looking into seeing if the performance issue is with our conversions to strings or inherit with the python decimal type. I guess we (Postgresql) can't do much about it. A runtime switch to use cdecimal if it is available is a good idea, but I agree with you that could be a different patch. One minor thing I noticed in this round, PLy_elog(ERROR, could not import module 'decimal'); I think should have decimal in double-quotes. I think this patch is ready for a committer to look at it. Steve Hi Steve, thanks for the review. I was thinking about speeding up the Decimal conversion using the module you wrote about. What about trying to import it, if it fails, than trying to load decimal.Decimal? There will be no warning in logs, just additional information in documentation that it uses this module if it is available? thanks, Szymon
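A minimal sketch of the fallback import being discussed, using the CPython C API; the function name and error handling here are illustrative assumptions, not the actual PL/Python patch code.

#include <Python.h>

/*
 * Prefer cdecimal.Decimal when the module is installed, silently fall
 * back to decimal.Decimal otherwise.
 */
static PyObject *
get_decimal_ctor(void)
{
    PyObject   *module;
    PyObject   *ctor;

    module = PyImport_ImportModule("cdecimal");
    if (module == NULL)
    {
        /* cdecimal not available: clear the ImportError and use decimal */
        PyErr_Clear();
        module = PyImport_ImportModule("decimal");
        if (module == NULL)
            return NULL;    /* caller reports "could not import module 'decimal'" */
    }

    ctor = PyObject_GetAttrString(module, "Decimal");
    Py_DECREF(module);
    return ctor;            /* new reference, or NULL on error */
}

No warning is logged when the fallback is taken, matching the proposal above; the documentation would simply note that cdecimal is used when available.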
Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List
Hi, Apologies for being unable to respond promptly. I've been traveling (without much access) and this was the fastest I could settle down. I was free for months and had to travel smack in the middle of the commitfest. Incidentally I had reviewed one patch after your direct email, but as someone earlier mentioned, actually pasting my name as 'reviewer' seemed awkward and so didn't then (but currently its set to David Fetter for some reason). I guess the point Tom mentioned earlier makes good sense and I agree with the spirit of the process . In my part would try to review more patches and mark them so on the commitfest page soon . -- Robins Tharakan On 23 June 2013 23:41, Josh Berkus j...@agliodbs.com wrote: Folks, For 9.2, we adopted it as policy that anyone submitting a patch to a commitfest is expected to review at least one patch submitted by someone else. And that failure to do so would affect the attention your patches received in the future. For that reason, I'm publishing the list below of people who submitted patches and have not yet claimed any patch in the commitfest to review. For those of you who are contributing patches for your company, please let your boss know that reviewing is part of contributing, and that if you don't do the one you may not be able to do the other. The two lists below, idle submitters and committers, constitutes 26 people. This *outnumbers* the list of contributors who are busy reviewing patches -- some of them four or more patches. If each of these people took just *one* patch to review, it would almost entirely clear the list of patches which do not have a reviewer. If these folks continue not to do reviews, this commitfest will drag on for at least 2 weeks past its closure date. Andrew Geirth Nick White Peter Eisentrout Alexander Korotkov Etsuro Fujita Hari Babu Jameison Martin Jon Nelson Oleg Bartunov Chris Farmiloe Samrat Revagade Alexander Lakhin Mark Kirkwood Liming Hu Maciej Gajewski Josh Kuperschmidt Mark Wong Gurjeet Singh Robins Tharakan Tatsuo Ishii Karl O Pinc Additionally, the following committers are not listed as reviewers on any patch. Note that I have no way to search which ones might be *committers* on a patch, so these folks are not necessarily slackers (although with Bruce, we know for sure): Bruce Momjian (momjian) Michael Meskes (meskes) Teodor Sigaev (teodor) Andrew Dunstan (adunstan) Magnus Hagander (mha) Itagaki Takahiro (itagaki) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add more regression tests for CREATE OPERATOR
Hi Szymon, The commented out test that you're referring to, is an existing test (not that I added or commented). I was going to remove but interestingly its testing a part of code where (prima-facie) it should fail, but it passes (probably why it was disabled in the first place) ! So technically I hope this regression patch I submitted could go through since this feedback isn't towards that patch, but in my part I am quite intrigued about this test (and how it passes) and probably I'd get back on this thread about this particular commented out test in question, as time permits. -- Robins Tharakan On 25 June 2013 04:12, Robins Tharakan thara...@gmail.com wrote: Thanks a ton Szymon (for a reminder on this one). As a coincidental turn of events, I have had to travel half way across the world and am without my personal laptop (without a linux distro etc.) and just recovering from a jet-lag now. I'll try to install a VM on a make-shift laptop and get something going to respond as soon as is possible. Thanks -- Robins Tharakan -- Robins Tharakan On 17 June 2013 05:19, Szymon Guz mabew...@gmail.com wrote: On 23 May 2013 00:34, Robins Tharakan thara...@gmail.com wrote: Hi, Please find attached a patch to take code-coverage of CREATE OPERATOR (src/backend/commands/operatorcmds.c) from 56% to 91%. Any and all feedback is welcome. -- Robins Tharakan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers Hi, there is one commented out test. I think it should be run, or deleted. There is no use of commented sql code which is not run. What do you think? regards, Szymon
Re: [HACKERS] Add more regression tests for CREATE OPERATOR
OK, so I think this patch can be committed, I will change the status. thanks, Szymon On 26 June 2013 09:26, Robins Tharakan thara...@gmail.com wrote: Hi Szymon, The commented out test that you're referring to, is an existing test (not that I added or commented). I was going to remove but interestingly its testing a part of code where (prima-facie) it should fail, but it passes (probably why it was disabled in the first place) ! So technically I hope this regression patch I submitted could go through since this feedback isn't towards that patch, but in my part I am quite intrigued about this test (and how it passes) and probably I'd get back on this thread about this particular commented out test in question, as time permits. -- Robins Tharakan On 25 June 2013 04:12, Robins Tharakan thara...@gmail.com wrote: Thanks a ton Szymon (for a reminder on this one). As a coincidental turn of events, I have had to travel half way across the world and am without my personal laptop (without a linux distro etc.) and just recovering from a jet-lag now. I'll try to install a VM on a make-shift laptop and get something going to respond as soon as is possible. Thanks -- Robins Tharakan -- Robins Tharakan On 17 June 2013 05:19, Szymon Guz mabew...@gmail.com wrote: On 23 May 2013 00:34, Robins Tharakan thara...@gmail.com wrote: Hi, Please find attached a patch to take code-coverage of CREATE OPERATOR (src/backend/commands/operatorcmds.c) from 56% to 91%. Any and all feedback is welcome. -- Robins Tharakan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers Hi, there is one commented out test. I think it should be run, or deleted. There is no use of commented sql code which is not run. What do you think? regards, Szymon
Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages
On 2013-06-26 08:50:27 +0530, Amit Kapila wrote: On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote: On 2013-06-16 17:19:49 -0700, Josh Berkus wrote: Amit posted a new version of this patch on January 23rd. But last comment on it by Tom is not sure everyone wants this. https://commitfest.postgresql.org/action/patch_view?id=905 ... so, what's the status of this patch? That comment was referencing a mail of mine - so perhaps I better explain: I think the usecase for this utility isn't big enough to be included in postgres since it really can only help in a very limited circumstances. And I think it's too likely to be misused for stuff it's not useable for (e.g. remastering). The only scenario I see is that somebody deleted/corrupted pg_controldata. In that scenario the tool is supposed to be used to find the biggest lsn used so far so the user then can use pg_resetxlog to set that as the wal starting point. But that can be way much easier solved by just setting the LSN to something very, very high. The database cannot be used for anything reliable afterwards anyway. One of the main reason this was written was to make server up in case of corruption and user can take dump of some useful information if any. By setting LSN very, very high user might loose the information which he wants to take dump. Which information would that loose? We don't currently use the LSN for tuple visibility. And you sure wouldn't do anything but dump such a cluster. Now you could argue that you could modify this to find the current xid used - but that's not that easy due to the wraparound semantics of xids. And you are more likely to be successfull by looking at pg_clog. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
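For reference, the core of what the proposed tool does can be sketched as a standalone loop over a relation segment file. This assumes the default 8 kB block size and the page LSN sitting in the first 8 bytes of the page header (xlogid, then xrecoff); a real tool must also skip zeroed pages, visit every segment of every relation, and verify the block size from pg_control.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192             /* assumes the default block size */

/*
 * Walk one relation segment file block by block and remember the
 * largest page LSN seen.
 */
static uint64_t
max_lsn_in_file(FILE *fp)
{
    char        page[BLCKSZ];
    uint64_t    max_lsn = 0;

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        uint32_t    xlogid;
        uint32_t    xrecoff;
        uint64_t    lsn;

        memcpy(&xlogid, page, sizeof(xlogid));
        memcpy(&xrecoff, page + sizeof(xlogid), sizeof(xrecoff));
        lsn = ((uint64_t) xlogid << 32) | xrecoff;

        if (lsn > max_lsn)
            max_lsn = lsn;
    }
    return max_lsn;
}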
Re: [HACKERS] Add visibility map information to pg_freespace.
Hello, I'm looking into this patch as a reviewer. I'd appreciate your time to review. I've had some suggestions so far, - I should be cautious in changing existing interface. You're right. It was somehow gone out of my mind. It might be better to provide a separate function from the compatibility view despite the loss of the pertinence to stay in this extension. However, it is too small to be a standalone extension. On the other hand the newly-added-column-to-the-tail could be said to be harmless for the most cases considering the usage of this extension, I suppose. - Historical note is needed in pg_freespace doc. Agreed, I'll provide documents not only for freespace, but for other modules I'll touch in this patch later. - How about pageinspect? I proposed a simple representation format as a basis for discussion. Nevertheless, the VM pages has no more structure than a simple bit string. Given the VM info in pg_freespacemap, I've come in doubt of the necessity of vm_page_contnets() for the reason besides the orthogonality in the this extension's interface (which paid no attention before:-). - How about pgstattuple? It could even be said to be meaningful to add the number of not-all-visible pages or the ratio of it in the total pages.. | postgres=# select * from pgstattuple('t'); | -[ RECORD 1 ]+- | table_len| 88711168 | tuple_count | 61 | tuple_len| 26400044 | tuple_percent| 29.76 | dead_tuple_count | 39 | dead_tuple_len | 17599956 | dead_tuple_percent | 19.84 | free_space | 33607960 | free_percent | 37.88 + not_all_visible_page_percent | 23.54 # This column name looks too long, though. In addition, the discussion above about the stability of the interface is also applicable to this. Any suggestions? regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions
The v2 patch does not work for me: regression tests for plpython fail on the plpython_types test: every numeric is converted to None. It seems the global decimal ctor is not initialized. Please find two patches, to be applied on top of the v2 patch: one initializes the decimal ctor, the other uses cdecimal when possible. Using the performance test from steve, on my machine: - with cdecimal installed: ~84ms - without cdecimal installed (standard decimal module): ~511ms 2013/6/26 Szymon Guz mabew...@gmail.com On 26 June 2013 01:40, Steve Singer st...@ssinger.info wrote: On 06/25/2013 06:42 AM, Szymon Guz wrote: Hi, I've attached a new patch. I've fixed all the problems you've found, except for the efficiency problem, which has been described in previous email. thanks, Szymon This version of the patch addresses the issues I mentioned. Thanks for looking into seeing if the performance issue is with our conversions to strings or inherit with the python decimal type. I guess we (Postgresql) can't do much about it. A runtime switch to use cdecimal if it is available is a good idea, but I agree with you that could be a different patch. One minor thing I noticed in this round, PLy_elog(ERROR, could not import module 'decimal'); I think should have decimal in double-quotes. I think this patch is ready for a committer to look at it. Steve Hi Steve, thanks for the review. I was thinking about speeding up the Decimal conversion using the module you wrote about. What about trying to import it, if it fails, than trying to load decimal.Decimal? There will be no warning in logs, just additional information in documentation that it uses this module if it is available? thanks, Szymon cnumeric.patch Description: Binary data null_ctor.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add visibility map information to pg_freespace.
On 26 June 2013 09:09, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jpwrote: - How about pageinspect? I proposed a simple representation format as a basis for discussion. Nevertheless, the VM pages has no more structure than a simple bit string. Given the VM info in pg_freespacemap, I've come in doubt of the necessity of vm_page_contnets() for the reason besides the orthogonality in the this extension's interface (which paid no attention before:-). I don't think that will be needed, now I understand. - How about pgstattuple? It could even be said to be meaningful to add the number of not-all-visible pages or the ratio of it in the total pages.. | postgres=# select * from pgstattuple('t'); | -[ RECORD 1 ]+- | table_len| 88711168 | tuple_count | 61 | tuple_len| 26400044 | tuple_percent| 29.76 | dead_tuple_count | 39 | dead_tuple_len | 17599956 | dead_tuple_percent | 19.84 | free_space | 33607960 | free_percent | 37.88 + not_all_visible_page_percent | 23.54 # This column name looks too long, though. Yes, please. But name should be all_visible_percent. Anybody that wants not_all_visible_percent can do the math. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
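A sketch of how the agreed-upon all_visible_percent figure would be computed, assuming one visibility-map bit per heap page; this only illustrates the arithmetic, not the pgstattuple patch itself, which would read the bits through the visibility map API inside the backend.

#include <stdint.h>
#include <stddef.h>

/*
 * Given the visibility-map bits for a relation (one bit per heap page)
 * and the number of heap pages, compute the percentage of all-visible
 * pages.
 */
static double
all_visible_percent(const uint8_t *vm_bits, size_t rel_pages)
{
    size_t      visible = 0;

    for (size_t blk = 0; blk < rel_pages; blk++)
    {
        if (vm_bits[blk / 8] & (1 << (blk % 8)))
            visible++;
    }

    return rel_pages > 0 ? (100.0 * visible) / rel_pages : 0.0;
}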
Re: FILTER for aggregates [was Re: [HACKERS] Department of Redundancy Department: makeNode(FuncCall) division]
On 26 June 2013 01:01, Josh Berkus j...@agliodbs.com wrote: I know it's heresy in these parts, but maybe we should consider adopting a non-spec syntax for this feature? In particular, it's really un-obvious why the FILTER clause shouldn't be inside rather than outside the aggregate's parens, like ORDER BY. Well, what other DBMSes support this feature? Will being non-spec introduce migration pain? I can't find any, but that doesn't mean they don't exist, or aren't being worked on. To recap, the options currently on offer are: 1). Make FILTER a new partially reserved keyword, accepting that that might break some users' application code. 2). Make FILTER unreserved, accepting that that will lead to syntax errors rather than more specific error messages if the user tries to use an aggregate/window function with FILTER or OVER in the FROM clause of a query, or as an index expression. 3). Adopt a non-standard syntax for this feature, accepting that that might conflict with other databases, and that we can never then claim to have implemented T612, Advanced OLAP operations. 4). Some other parser hack that will offer a better compromise? My preference is for (2) as the lesser of several evils --- it's a fairly narrow case where the quality of the error message is reduced. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions
Thanks Steve, that's exactly what I wanted to send when you sent your patches :) I need to figure out why that patch v2 worked for me, I think I made mess somewhere in my git repo and didn't create the patch properly. Sorry for that. Patch is attached, I've also added information about cdecimal to plpython documentation. I'm just wondering how to make integration tests to check when cdecimal is installed and when it is not. thanks, Szymon On 26 June 2013 10:12, Ronan Dunklau rdunk...@gmail.com wrote: The v2 patch does not work for me: regression tests for plpython fail on the plpython_types test: every numeric is converted to None. It seems the global decimal ctor is not initialized. Please find two patches, to be applied on top of the v2 patch: one initializes the decimal ctor, the other uses cdecimal when possible. Using the performance test from steve, on my machine: - with cdecimal installed: ~84ms - without cdecimal installed (standard decimal module): ~511ms 2013/6/26 Szymon Guz mabew...@gmail.com On 26 June 2013 01:40, Steve Singer st...@ssinger.info wrote: On 06/25/2013 06:42 AM, Szymon Guz wrote: Hi, I've attached a new patch. I've fixed all the problems you've found, except for the efficiency problem, which has been described in previous email. thanks, Szymon This version of the patch addresses the issues I mentioned. Thanks for looking into seeing if the performance issue is with our conversions to strings or inherit with the python decimal type. I guess we (Postgresql) can't do much about it. A runtime switch to use cdecimal if it is available is a good idea, but I agree with you that could be a different patch. One minor thing I noticed in this round, PLy_elog(ERROR, could not import module 'decimal'); I think should have decimal in double-quotes. I think this patch is ready for a committer to look at it. Steve Hi Steve, thanks for the review. I was thinking about speeding up the Decimal conversion using the module you wrote about. What about trying to import it, if it fails, than trying to load decimal.Decimal? There will be no warning in logs, just additional information in documentation that it uses this module if it is available? thanks, Szymon plpython_decimal_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
Tom, I clearly understand your point. I actually came from corporate market such as Verizon, Barclays... I remember very good that PostgreSQL is open source, but let's forget it for a moment. The key issue for corporate market always been a partitioning(vertical and lately horizontal). Because of that Oracle has too many types and combinations of partitions, the other vendors as well. Easy partitions maintenance (automatic, simple syntax) is very important for everybody who lives in corporate RDBMS world and not only use DB's for free in order to create some virtual shop. The main purpose of partitioning in my world is to store billions of rows and be able to search by date, hour or even minute as fast as possible. When you dealing with company, which has ~350.000.000 users, and you don't want to use key/value data stores: you need hash partitioned tables and hash partitioned table clusters to perform fast search and 4-6 tables join based on user phone number for example. I believe to increase PostgreSQL popularity in corporate world, to make real money from support, the next features might be: better vertical and later horizontal partitioning, columnar-oriented tables, DB freeze for NetApp/EMC snapshots and similar. Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222 -Original Message- From: Tom Lane [mailto:t...@sss.pgh.pa.us] Sent: Tuesday, June 25, 2013 10:33 PM To: Christopher Browne Cc: Yuri Levinsky; Robert Haas; Bruce Momjian; PostgreSQL Mailing Lists Subject: Re: [HACKERS] Hash partitioning. Christopher Browne cbbro...@gmail.com writes: There would indeed be merit in improving the partitioning apparatus, and actually, I think it's been a couple of years since there has been serious discussion of this. We could certainly use a partitioning mechanism that's easier to use than what we have now, which is basically build it yourself, here's the parts bin. There would also be some performance benefits from moving the partitioning logic into hard-wired code. However, I find it hard to think that hash partitioning as such is very high on the to-do list. As was pointed out upthread, the main practical advantage of partitioning is *not* performance of routine queries, but improved bulk-data management such as the ability to do periodic housecleaning by dropping a partition. If your partitioning is on a hash, you've thrown away any such advantage, because there's no real-world meaning to the way the data's been split up. So I find range and list partitioning way more plausible. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
Thank you for comments! On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas Hmm, so the write patch doesn't do much, but the fsync patch makes the response times somewhat smoother. I'd suggest that we drop the write patch for now, and focus on the fsyncs. Write patch is effective in TPS! I think that delay of checkpoint write is caused long time fsync and heavy load in fsync phase. Because it go slow disk right in write phase. Therefore, combination of write patch and fsync patch are suiter each other than only write patch. I think that amount of WAL write in beginning of checkpoint can indicate effect of write patch. What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold settings did you use with the fsync patch? It's disabled by default. I used these parameters. checkpointer_fsync_delay_ratio = 1 checkpointer_fsync_delay_threshold = 1000ms As a matter of fact, I used long time sleep in slow fsyncs. And other maintains parameters are here. checkpoint_completion_target = 0.7 checkpoint_smooth_target = 0.3 checkpoint_smooth_margin = 0.5 checkpointer_write_delay = 200ms Attached is a quick patch to implement a fixed, 100ms delay between fsyncs, and the assumption that fsync phase is 10% of the total checkpoint duration. I suspect 100ms is too small to have much effect, but that happens to be what we have currently in CheckpointWriteDelay(). Could you test this patch along with yours? If you can test with different delays (e.g 100ms, 500ms and 1000ms) and different ratios between the write and fsync phase (e.g 0.5, 0.7, 0.9), to get an idea of how sensitive the test case is to those settings. It seems interesting algorithm! I will test it in same setting and study about your patch essence. (2013/06/26 5:28), Heikki Linnakangas wrote: On 25.06.2013 23:03, Robert Haas wrote: On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I'm not sure it's a good idea to sleep proportionally to the time it took to complete the previous fsync. If you have a 1GB cache in the RAID controller, fsyncing the a 1GB segment will fill it up. But since it fits in cache, it will return immediately. So we proceed fsyncing other files, until the cache is full and the fsync blocks. But once we fill up the cache, it's likely that we're hurting concurrent queries. ISTM it would be better to stay under that threshold, keeping the I/O system busy, but never fill up the cache completely. Isn't the behavior implemented by the patch a reasonable approximation of just that? When the fsyncs start to get slow, that's when we start to sleep. I'll grant that it would be better to sleep when the fsyncs are *about* to get slow, rather than when they actually have become slow, but we have no way to know that. Well, that's the point I was trying to make: you should sleep *before* the fsyncs get slow. Actuary, fsync time is changed by progress of background disk writes in OS. We cannot know about progress of background disk write before fsyncs. I think Robert's argument is right. Please see under following log messages. 
* fsync file which had been already wrote in disk DEBUG: 0: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec DEBUG: 0: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec DEBUG: 0: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec DEBUG: 0: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec DEBUG: 0: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec DEBUG: 0: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec * fsync file which had not been wrote in disk very much DEBUG: 0: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759 msec DEBUG: 0: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075 msec DEBUG: 0: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec DEBUG: 0: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec DEBUG: 0: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec DEBUG: 0: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec I think it is wasteful of sleep every fsyncs including short time, and fsync time performance is also changed by hardware which is like RAID card and kind of or number of disks and OS. So it is difficult to set fixed-sleep-time. My proposed method will be more adoptive in these cases. The only feedback we have on how bad things are is how long it took the last fsync to complete, so I actually think that's a much better way to go than any fixed sleep - which will often be unnecessarily long on a well-behaved system, and which will often be far too short on one that's having trouble. I'm inclined to think think Kondo-san has got it right. Quite possible, I really don't know. I'm inclined to first try the
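A rough sketch of the adaptive sleep being described, based on my reading of the checkpointer_fsync_delay_ratio / checkpointer_fsync_delay_threshold settings quoted above; the exact formula and names in the actual patch may differ, and a real implementation would use the server's own sleep/latch primitives rather than usleep.

#include <unistd.h>

/*
 * Sleep after an fsync only when that fsync was slow, for a time
 * proportional to how long it took.  With the quoted settings
 * (ratio = 1, threshold = 1000 ms) a 3.4 s fsync would be followed by
 * roughly a 3.4 s sleep, while the sub-millisecond fsyncs in the log
 * above get no sleep at all.
 */
static void
delay_after_fsync(double fsync_msec,
                  double delay_ratio,       /* checkpointer_fsync_delay_ratio */
                  double threshold_msec)    /* checkpointer_fsync_delay_threshold */
{
    if (fsync_msec > threshold_msec)
    {
        double      sleep_msec = fsync_msec * delay_ratio;

        usleep((useconds_t) (sleep_msec * 1000.0));
    }
}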
Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions
You had a great idea; the time with cdecimal is really good. The difference on my machine is 64 ms vs 430 ms. Szymon
Re: FILTER for aggregates [was Re: [HACKERS] Department of Redundancy Department: makeNode(FuncCall) division]
2013/6/26 Dean Rasheed dean.a.rash...@gmail.com: On 26 June 2013 01:01, Josh Berkus j...@agliodbs.com wrote: I know it's heresy in these parts, but maybe we should consider adopting a non-spec syntax for this feature? In particular, it's really un-obvious why the FILTER clause shouldn't be inside rather than outside the aggregate's parens, like ORDER BY. Well, what other DBMSes support this feature? Will being non-spec introduce migration pain? I can't find any, but that doesn't mean they don't exist, or aren't being worked on. To recap, the options currently on offer are: 1). Make FILTER a new partially reserved keyword, accepting that that might break some users' application code. 2). Make FILTER unreserved, accepting that that will lead to syntax errors rather than more specific error messages if the user tries to use an aggregate/window function with FILTER or OVER in the FROM clause of a query, or as an index expression. 3). Adopt a non-standard syntax for this feature, accepting that that might conflict with other databases, and that we can never then claim to have implemented T612, Advanced OLAP operations. 4). Some other parser hack that will offer a better compromise? My preference is for (2) as the lesser of several evils --- it's a fairly narrow case where the quality of the error message is reduced. @2 looks well Pavel Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bloom Filter lookup for hash joins
Hello it looks interesting - please, try to send some performance results, Regards Pavel 2013/6/26 Atri Sharma atri.j...@gmail.com: Hi All, I have been researching bloom filters and discussed it on IRC with RhodiumToad and David Fetter, and they pointed me to the various places that could potentially have bloom filters, apart from the places that already have them currently. I have been reading the current implementation of hash joins, and in ExecScanHashBucket, which I understand is the actual lookup function, we could potentially look at a bloom filter per bucket. Instead of actually looking up each hash value for the outer relation, we could just check the corresponding bloom filter for that bucket, and if we get a positive, then lookup the actual values i.e. continue with our current behaviour (since we could be looking at a false positive). This doesn't involve a lot of new logic. We need to add a bloom filter in HashJoinState and set it when the hash table is being made. Also, one thing we need to look at is the set of hash functions being used for the bloom filters. This can be a point of further discussion. The major potential benefit we could gain is that bloom filters never give false negatives. So, we could save a lot of lookup where the corresponding bloom filter gives a negative. This approach can also be particularly useful for hash anti joins, where we look for negatives. Since bloom filters can easily be stored and processed, by straight bit operations, we could be looking at a lot of saving here. I am not sure if we already do something like this. Please direct me to the relevant parts in the code if we already do. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Kudos for Reviewers -- straw poll
Dean Rasheed wrote: How should reviewers get credited in the release notes? a) not at all b) in a single block titled Reviewers for this version at the bottom. c) on the patch they reviewed, for each patch b) Unless they contribute enough to the patch to be considered a co-author. Should there be a criteria for a creditable review? a) no, all reviews are worthwhile b) yes, they have to do more than it compiles c) yes, only code reviews should count a) Sometimes even it compiles can be worthwhile, if there is doubt over platform compatibility. All contributions should be acknowledged and encouraged. Should reviewers for 9.4 get a prize, such as a t-shirt, as a promotion to increase the number of non-submitter reviewers? a) yes b) no c) yes, but submitters and committers should get it too b) Getting your name in the fine manual is reward enough ;-) +1, except that I like Josh's idea about a free ticket to pgCon. Yours, Laurenz Albe -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bloom Filter lookup for hash joins
On Wed, Jun 26, 2013 at 2:09 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello it looks interesting - please, try to send some performance results, Regards Pavel Hi, Sure. I will think more about it and put up a design on the list soon. My current aim would be to work on hash joins. If it works well for us, we can look for hash anti joins as well. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Kudos for Reviewers -- straw poll
For me, B, B, and another B work. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated emacs configuration
Peter Eisentraut pete...@gmx.net writes: emacs -q src/backend/commands/extension.c emacs -q -l ../emacs.samples src/backend/commands/extension.c Well, the first one uses 8-space tabs, the second 4-space tabs, so they look completely different. I'm not sure what you are trying to point out. With the .dir-locals.el in place, the first one should be using 4-space tabs too, right? - Name the lambda used in the hook for easier removing / reference Interesting, I had never thought of that. I don't see that used in Emacs source code or core packages, however. Do you have a reference that this is recommended practice? Well, a friend of mine pointed that out to me one day, and he's an Emacs and Gnus committer. The thing is that you're not guaranteed that the same lambda source code will evaluate to the same object each time, and that would prevent you from removing it later with remove-hook. $ git clone git://git.postgresql.org/git/postgresql.git Cloning into 'postgresql'... I can reproduce that here. I don't know why I have those postgres dirs then, and I'm pretty confused about my round of testing now. -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]
Dean Rasheed said: To recap, the options currently on offer are: 1). Make FILTER a new partially reserved keyword, accepting that that might break some users' application code. 2). Make FILTER unreserved, accepting that that will lead to syntax errors rather than more specific error messages if the user tries to use an aggregate/window function with FILTER or OVER in the FROM clause of a query, or as an index expression. 3). Adopt a non-standard syntax for this feature, accepting that that might conflict with other databases, and that we can never then claim to have implemented T612, Advanced OLAP operations. 4). Some other parser hack that will offer a better compromise? My preference is for (2) as the lesser of several evils --- it's a fairly narrow case where the quality of the error message is reduced. Possibly significant in this context is that there is a proof-of-concept patch in development for another part of T612, namely inverse distribution functions (e.g. percentile_disc and percentile_cont) which should be available by the next CF, and which will require a similar decision with respect to the keyword WITHIN (to support doing: select percentile_cont(0.5) within group (order by x) from ...; which returns the median value of x). This syntax is much more widely supported than FILTER, including by both Oracle and MSSQL, so a non-standard alternative is much less attractive. A solution which suits both (i.e. either option 1 or option 2) would make a whole lot more sense than trying to handle them differently. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated emacs configuration
On 26/06/13 10:51, Dimitri Fontaine wrote: Peter Eisentraut pete...@gmx.net writes: $ git clone git://git.postgresql.org/git/postgresql.git Cloning into 'postgresql'... I can reproduce that here. I don't know why I have those postgres dirs then, and I'm pretty confused about my round of testing now. Maybe you cloned from GitHub, where the mirrored repository is called 'postgres'? Cheers, Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Review] Add SPI_gettypmod() to return a field's typemod from a TupleDesc
On Wed, Jun 26, 2013 at 7:49 AM, Mark Wong mark...@gmail.com wrote: On Tue, Jun 25, 2013 at 1:38 AM, Jeevan Chalke jeevan.cha...@enterprisedb.com wrote: Hi Mark, Is this the latest patch you are targeting for 9.4 CF1 ? I am going to review it. From the comment, here is one issue you need to resolve first: *** exec_eval_datum(PLpgSQL_execstate *estat *** 4386,4396 errmsg(record \%s\ has no field \%s\, rec-refname, recfield-fieldname))); *typeid = SPI_gettypeid(rec-tupdesc, fno); ! /* XXX there's no SPI_gettypmod, for some reason */ ! if (fno 0) ! *typetypmod = rec-tupdesc-attrs[fno - 1]-atttypmod; ! else ! *typetypmod = -1; *value = SPI_getbinval(rec-tup, rec-tupdesc, fno, isnull); break; } --- 4386,4392 errmsg(record \%s\ has no field \%s\, rec-refname, recfield-fieldname))); *typeid = SPI_gettypeid(rec-tupdesc, fno); ! *typetypmod = SPI_gettypeid(rec-tupdesc, fno); *value = SPI_getbinval(rec-tup, rec-tupdesc, fno, isnull); break; } Once you confirm, I will go ahead reviewing it. Hi Jeevan, Oopsies, I've updated the patch and attached it. Here are my review points: 1. Patch is very simple and straight forward. 2. Applies well with patch command. No issues at all. 3. Regression test passes. We have good coverage for that. Also NO issues found with my testing. 4. New function is analogous to other SPI_get* functions 5. Ready for committer However, while walking through your changes, I see following line: /* XXX there's no SPI_getcollation either */ It says we do need function for SPI_getcollation as well. It will be another simple patch. Anyway this is not part of this topic so I will go ahead and mark it as Ready for committer Thanks Regards, Mark -- Jeevan B Chalke Senior Software Engineer, RD EnterpriseDB Corporation The Enterprise PostgreSQL Company Phone: +91 20 30589500 Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
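For context, the new accessor essentially wraps the attrs[fno - 1]->atttypmod access that plpgsql previously open-coded (see the XXX comment in the quoted diff). A rough sketch of its shape follows, omitting system-attribute handling; the committed version may differ in details, and this only compiles against the backend headers.

#include "postgres.h"
#include "executor/spi.h"

/*
 * Return the typmod of the given column, mirroring SPI_gettypeid.
 * System attributes (negative fnumber) are not handled in this sketch.
 */
int32
SPI_gettypmod(TupleDesc tupdesc, int fnumber)
{
    SPI_result = 0;

    if (fnumber > tupdesc->natts || fnumber <= 0)
    {
        SPI_result = SPI_ERROR_NOATTRIBUTE;
        return -1;
    }

    return tupdesc->attrs[fnumber - 1]->atttypmod;
}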
Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions
It seems like you confused me with steve :) The patch applies cleanly, and the regression tests pass on python2 when cdecimal is not installed. When it is, the type info returned for the converted numeric value is cdecimal.Decimal instead of decimal.Decimal. The regression tests expected output have not been modified for python3, and as such they fail on the type conversions. I am a bit confused with the use of PyModule_GetDict: shouldn't PyObj_GetAttrString be used directly instead ? Moreover, the reference count in the current implementation might be off: the reference count for the decimal module is never decreased, while the reference count to the module dict is, when the docs say it returns a borrowed reference. Please find a patch that fixes both issues. 2013/6/26 Szymon Guz mabew...@gmail.com Thanks Steve, that's exactly what I wanted to send when you sent your patches :) I need to figure out why that patch v2 worked for me, I think I made mess somewhere in my git repo and didn't create the patch properly. Sorry for that. Patch is attached, I've also added information about cdecimal to plpython documentation. I'm just wondering how to make integration tests to check when cdecimal is installed and when it is not. thanks, Szymon On 26 June 2013 10:12, Ronan Dunklau rdunk...@gmail.com wrote: The v2 patch does not work for me: regression tests for plpython fail on the plpython_types test: every numeric is converted to None. It seems the global decimal ctor is not initialized. Please find two patches, to be applied on top of the v2 patch: one initializes the decimal ctor, the other uses cdecimal when possible. Using the performance test from steve, on my machine: - with cdecimal installed: ~84ms - without cdecimal installed (standard decimal module): ~511ms 2013/6/26 Szymon Guz mabew...@gmail.com On 26 June 2013 01:40, Steve Singer st...@ssinger.info wrote: On 06/25/2013 06:42 AM, Szymon Guz wrote: Hi, I've attached a new patch. I've fixed all the problems you've found, except for the efficiency problem, which has been described in previous email. thanks, Szymon This version of the patch addresses the issues I mentioned. Thanks for looking into seeing if the performance issue is with our conversions to strings or inherit with the python decimal type. I guess we (Postgresql) can't do much about it. A runtime switch to use cdecimal if it is available is a good idea, but I agree with you that could be a different patch. One minor thing I noticed in this round, PLy_elog(ERROR, could not import module 'decimal'); I think should have decimal in double-quotes. I think this patch is ready for a committer to look at it. Steve Hi Steve, thanks for the review. I was thinking about speeding up the Decimal conversion using the module you wrote about. What about trying to import it, if it fails, than trying to load decimal.Decimal? There will be no warning in logs, just additional information in documentation that it uses this module if it is available? thanks, Szymon plpython_decimal_fix_regression.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages
On Wednesday, June 26, 2013 1:20 PM Andres Freund wrote: On 2013-06-26 08:50:27 +0530, Amit Kapila wrote: On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote: On 2013-06-16 17:19:49 -0700, Josh Berkus wrote: Amit posted a new version of this patch on January 23rd. But last comment on it by Tom is not sure everyone wants this. https://commitfest.postgresql.org/action/patch_view?id=905 ... so, what's the status of this patch? That comment was referencing a mail of mine - so perhaps I better explain: I think the usecase for this utility isn't big enough to be included in postgres since it really can only help in a very limited circumstances. And I think it's too likely to be misused for stuff it's not useable for (e.g. remastering). The only scenario I see is that somebody deleted/corrupted pg_controldata. In that scenario the tool is supposed to be used to find the biggest lsn used so far so the user then can use pg_resetxlog to set that as the wal starting point. But that can be way much easier solved by just setting the LSN to something very, very high. The database cannot be used for anything reliable afterwards anyway. One of the main reason this was written was to make server up in case of corruption and user can take dump of some useful information if any. By setting LSN very, very high user might loose the information which he wants to take dump. Which information would that loose? Information from WAL replay which can be more appropriate by selecting LSN. Also for a developer, guessing very high LSN might be easy, but for users it might not be equally easy, and getting such value by utility would be comfortable. One more use case for which this utility was done is as below: It will be used to decide that on new-standby (old-master) whether a full backup is needed from New-master(old-standby). The backup is required when the data page in old-master precedes the last applied LSN in old-standby (i.e., new-master) at the moment of the failover. We don't currently use the LSN for tuple visibility. And you sure wouldn't do anything but dump such a cluster. Now you could argue that you could modify this to find the current xid used - but that's not that easy due to the wraparound semantics of xids. And you are more likely to be successfull by looking at pg_clog. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions
On 26 June 2013 12:04, Ronan Dunklau rdunk...@gmail.com wrote: It seems like you confused me with steve :) Hi Ronan, Oh, yes. I'm sorry for that :) The patch applies cleanly, and the regression tests pass on python2 when cdecimal is not installed. When it is, the type info returned for the converted numeric value is cdecimal.Decimal instead of decimal.Decimal. The regression tests expected output have not been modified for python3, and as such they fail on the type conversions. I am a bit confused with the use of PyModule_GetDict: shouldn't PyObj_GetAttrString be used directly instead ? Moreover, the reference count in the current implementation might be off: the reference count for the decimal module is never decreased, while the reference count to the module dict is, when the docs say it returns a borrowed reference. Please find a patch that fixes both issues. Thanks for the patch. I assume you generated that from clean trunk, and it includes all the changes (mine and yours) right? I've checked the patch, everything looks great. I've attached it to this email with changed name, just for consistent naming in commitfest app. thanks, Szymon plpython_decimal_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bloom Filter lookup for hash joins
On Wed, Jun 26, 2013 at 9:46 AM, Atri Sharma atri.j...@gmail.com wrote: I have been researching bloom filters and discussed it on IRC with RhodiumToad and David Fetter, and they pointed me to the various places that could potentially have bloom filters, apart from the places that already have them currently. One idea that I had was to use them to do CLOG lookups from smaller datastructures. You could store the list of aborted xids in bloom filters. When a xid is not found in the filter, then it is known to have committed, if the filter returns true, then you have to check the real CLOG. This would achieve for example a 1:8 compression ratio at 99% commit rate and 1:160'000 false positive rate, or 1:16 compression ratio at 1:400 false positive rate. Nothing too exciting, so I didn't really develop the idea any further. I have been reading the current implementation of hash joins, and in ExecScanHashBucket, which I understand is the actual lookup function, we could potentially look at a bloom filter per bucket. Instead of actually looking up each hash value for the outer relation, we could just check the corresponding bloom filter for that bucket, and if we get a positive, then lookup the actual values i.e. continue with our current behaviour (since we could be looking at a false positive). The problem here is that if the hash table is in memory, doing a hash table lookup directly is likely to be cheaper than a bloom filter lookup, even if the bloom filter fits into the processor cache and the hash table doesn't (10 last level cache hits is slower than one cache miss). Bloom filter will help when its not feasible to use an actual hash table (lots of large keys), the real lookup is slow (e.g. an index lookup), you are doing a lot of lookups to amortize the construction cost and the condition is expected to be selective (most lookups give a negative). There might be some dataware house types of queries where it would help, but it seems like an awfully narrow use case with a potential for performance regressions when the planner has a bad estimate. Regards, Ants Aasma -- Cybertec Schönig Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages
Hi Amit, On 2013-06-26 16:22:28 +0530, Amit Kapila wrote: On Wednesday, June 26, 2013 1:20 PM Andres Freund wrote: On 2013-06-26 08:50:27 +0530, Amit Kapila wrote: On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote: On 2013-06-16 17:19:49 -0700, Josh Berkus wrote: Amit posted a new version of this patch on January 23rd. But last comment on it by Tom is not sure everyone wants this. https://commitfest.postgresql.org/action/patch_view?id=905 ... so, what's the status of this patch? That comment was referencing a mail of mine - so perhaps I better explain: I think the usecase for this utility isn't big enough to be included in postgres since it really can only help in a very limited circumstances. And I think it's too likely to be misused for stuff it's not useable for (e.g. remastering). The only scenario I see is that somebody deleted/corrupted pg_controldata. In that scenario the tool is supposed to be used to find the biggest lsn used so far so the user then can use pg_resetxlog to set that as the wal starting point. But that can be way much easier solved by just setting the LSN to something very, very high. The database cannot be used for anything reliable afterwards anyway. One of the main reason this was written was to make server up in case of corruption and user can take dump of some useful information if any. By setting LSN very, very high user might loose the information which he wants to take dump. Which information would that loose? Information from WAL replay which can be more appropriate by selecting LSN. Sorry, I can't follow. If wal replay still is an option you can just look at the WAL and get a sensible value way easier. The whole tool seems to only make sense if you've lost pg_xlog. Also for a developer, guessing very high LSN might be easy, but for users it might not be equally easy, and getting such value by utility would be comfortable. Well, then we can just document some very high lsn and be done with it. Like CF00/. That would leave enough space for eventual writes caused while dumping the database (say hint bit writes in a checksummed database) and cannot yet be realistically be reached during normal operation. One more use case for which this utility was done is as below: It will be used to decide that on new-standby (old-master) whether a full backup is needed from New-master(old-standby). The backup is required when the data page in old-master precedes the last applied LSN in old-standby (i.e., new-master) at the moment of the failover. That's exactly what I was afraid of. Unless I miss something the tool is *NOT* sufficient to do this. Look at the mail introducing pg_rewind and the ensuing discussion for what's necessary for that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checking variadic any argument in parser - should be array
Hi Pavel On Sat, Jan 26, 2013 at 9:22 AM, Pavel Stehule pavel.steh...@gmail.comwrote: Hello Tom you did comment ! * Non-null argument had better be an array. The parser doesn't ! * enforce this for VARIADIC ANY functions (maybe it should?), so ! * that check uses ereport not just elog. ! */ So I moved this check to parser. It is little bit stricter - requests typed NULL instead unknown NULL, but it can mark error better and early Tom suggested that this check should be better done by parser. This patch tries to accomplish that. I will go review this. However, is it possible to you to re-base it on current master? I can't apply it using git apply but patch -p1 was succeeded with lot of offset. Thanks Regards Pavel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Jeevan B Chalke Senior Software Engineer, RD EnterpriseDB Corporation The Enterprise PostgreSQL Company Phone: +91 20 30589500 Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 26.06.2013 11:37, KONDO Mitsumasa wrote: On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas Hmm, so the write patch doesn't do much, but the fsync patch makes the response times somewhat smoother. I'd suggest that we drop the write patch for now, and focus on the fsyncs. Write patch is effective in TPS! Your test results don't agree with that. You got 3465.96 TPS with the write patch, and 3474.62 and 3469.03 without it. The fsync+write combination got slightly more TPS than just the fsync patch, but only by about 1%, and then the response times were worse. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)
Dear Matsumasa, Here is a v4 that takes into account most of your points: The report is performed for all threads by thread 0, however --progress is not supported under thread fork emulation if there are more than one thread. The report time does not slip anymore. However I've kept the format scarse. It is a style thing:-) and it is more consistent with the kind of format used in the log. I have not added the latency measure because it is redundant with the tps, and the latency that people are expecting is the actual latency of each transactions, not the apparent latency of transactions running in parallel, which is really a throuput. -- Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 1303217..707ea37 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -74,7 +74,7 @@ static int pthread_join(pthread_t th, void **thread_return); #include pthread.h #else /* Use emulation with fork. Rename pthread identifiers to avoid conflicts */ - +#define PTHREAD_FORK_EMULATION #include sys/wait.h #define pthread_tpg_pthread_t @@ -164,6 +164,8 @@ bool use_log; /* log transaction latencies to a file */ bool use_quiet; /* quiet logging onto stderr */ int agg_interval; /* log aggregates instead of individual * transactions */ +int progress = 0; /* thread progress report every this seconds */ +int progress_nclients = 0; /* number of clients for progress report */ bool is_connect; /* establish connection for each transaction */ bool is_latencies; /* report per-command latencies */ int main_pid; /* main process id used in log filename */ @@ -354,6 +356,8 @@ usage(void) protocol for submitting queries to server (default: simple)\n -n do not run VACUUM before tests\n -N do not update tables \pgbench_tellers\ and \pgbench_branches\\n + -P NUM, --progress NUM\n + show progress report about every NUM seconds\n -r report average latency per command\n -s NUM report this scale factor in output\n -S perform SELECT-only transactions\n @@ -2115,6 +2119,7 @@ main(int argc, char **argv) {unlogged-tables, no_argument, unlogged_tables, 1}, {sampling-rate, required_argument, NULL, 4}, {aggregate-interval, required_argument, NULL, 5}, + {progress, required_argument, NULL, 'P'}, {NULL, 0, NULL, 0} }; @@ -2181,7 +2186,7 @@ main(int argc, char **argv) state = (CState *) pg_malloc(sizeof(CState)); memset(state, 0, sizeof(CState)); - while ((c = getopt_long(argc, argv, ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:, long_options, optindex)) != -1) + while ((c = getopt_long(argc, argv, ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:, long_options, optindex)) != -1) { switch (c) { @@ -2336,6 +2341,16 @@ main(int argc, char **argv) exit(1); } break; + case 'P': +progress = atoi(optarg); +if (progress = 0) +{ + fprintf(stderr, + thread progress delay (-P) must not be negative (%s)\n, + optarg); + exit(1); +} +break; case 0: /* This covers long options which take no argument. */ break; @@ -2401,6 +2416,15 @@ main(int argc, char **argv) exit(1); } +#ifdef PTHREAD_FORK_EMULATION + if (progress nthreads1) + { + fprintf(stderr, +option --progress does not work with thread fork emulation); + exit(1); + } +#endif /* PTHREAD_FORK_EMULATION */ + /* --sampling-rate may be used only with -l */ if (sample_rate 0.0 !use_log) { @@ -2461,6 +2485,7 @@ main(int argc, char **argv) * changed after fork. 
*/ main_pid = (int) getpid(); + progress_nclients = nclients; if (nclients 1) { @@ -2712,6 +2737,11 @@ threadRun(void *arg) int nstate = thread-nstate; int remains = nstate; /* number of remaining clients */ int i; + /* for reporting progress: */ + int64 thread_start = INSTR_TIME_GET_MICROSEC(thread-start_time); + int64 last_report = thread_start; + int64 next_report = last_report + progress * 100; + int64 last_count = 0; AggVals aggs; @@ -2875,6 +2905,30 @@ threadRun(void *arg) st-con = NULL; } } + + /* progress report by thread 0 */ + if (progress thread-tid == 0) + { + instr_time now_time; + int64 now; + INSTR_TIME_SET_CURRENT(now_time); + now = INSTR_TIME_GET_MICROSEC(now_time); + if (now = next_report) + { +/* generate and show report */ +int64 count = 0; +int64 run = now - last_report; +/* thread 0 reports other threads data */ +for (i = 0 ; i progress_nclients ; i++) + count += state[i].cnt; +fprintf(stderr, progress: %.1f s %.1f tps\n, + (now - thread_start) / 100.0, + 100.0 * (count-last_count) / run); +last_count = count; +last_report = now; +next_report += progress * 100; + } + } } done: diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml index
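The inline diff above has been mangled in the archive (the '<', '>' and '&' characters were stripped, and it is cut off mid-hunk), so here is a small stand-alone sketch of the reporting logic it adds, with the lost operators and the microsecond constants filled in by guess; variable names follow the diff, but this is not the committed code:

#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* stand-in for INSTR_TIME_GET_MICROSEC() */
static int64_t
now_usec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (int64_t) tv.tv_sec * 1000000 + tv.tv_usec;
}

int
main(void)
{
    int         progress = 2;   /* as with -P 2: report every 2 seconds */
    int64_t     thread_start = now_usec();
    int64_t     last_report = thread_start;
    int64_t     next_report = last_report + (int64_t) progress * 1000000;
    int64_t     count = 0;      /* stands in for the summed per-client cnt */
    int64_t     last_count = 0;

    for (;;)
    {
        int64_t     now;

        count++;                /* pretend a transaction just completed */
        usleep(1000);

        now = now_usec();
        if (now >= next_report)
        {
            int64_t     run = now - last_report;

            fprintf(stderr, "progress: %.1f s %.1f tps\n",
                    (now - thread_start) / 1000000.0,
                    1000000.0 * (count - last_count) / run);
            last_count = count;
            last_report = now;
            next_report += (int64_t) progress * 1000000;
        }
        if (now - thread_start > 10 * 1000000)
            break;              /* stop the demo after ~10 seconds */
    }
    return 0;
}

Note how next_report advances by a fixed step rather than being recomputed from "now"; that is what keeps the report times from slipping, per the description above.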
Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn
Hi, On 2013-06-17 16:16:22 +0200, Andres Freund wrote: When postgres on linux receives connection on a high rate client connections sometimes error out with: could not send data to server: Transport endpoint is not connected could not send startup packet: Transport endpoint is not connected To reproduce start something like on a server with sufficiently high max_connections: pgbench -h /tmp -p 5440 -T 10 -c 400 -j 400 -n -f /tmp/simplequery.sql Now that's strange since that error should happen at connect(2) time, not when sending the startup packet. Some investigation led me to fe-secure.c's PQConnectPoll: So, we're accepting EWOULDBLOCK as a valid return value for connect(2). Which it isn't. EAGAIN in contrast is on some BSDs and on linux. Unfortunately POSIX allows those two to share the same value... My manpage tells me: EAGAIN No more free local ports or insufficient entries in the routing cache. For AF_INET see the description of /proc/sys/net/ipv4/ip_local_port_range ip(7) for information on how to increase the number of local ports. So, the problem is that we took a failed connection as having been initially successfull but in progress. Not accepting EWOULDBLOCK in the above if() results in: could not connect to server: Resource temporarily unavailable Is the server running locally and accepting connections on Unix domain socket /tmp/.s.PGSQL.5440? which makes more sense. Trivial patch attached. Could I convince a committer to NACK or commit backpatch that patch? It has come up before: http://www.postgresql.org/message-id/CAMnJ+Beq0hCBuTY_=nz=ru0u-no543_raeunlvsayu8tugd...@mail.gmail.com possibly also: http://lists.pgfoundry.org/pipermail/pgpool-general/2007-March/000595.html Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
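For readers less familiar with this connect(2) corner case, here is a generic (non-libpq) sketch of how a non-blocking connect is usually handled: EINPROGRESS (and EINTR) mean "still connecting, wait for writability", while EAGAIN -- which shares a value with EWOULDBLOCK on many platforms -- coming back from connect() on Linux means resource exhaustion and should be reported as a failure, not waited on. The socket path below is only an example.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Start a non-blocking connect to a Unix-domain socket; returns the fd or -1. */
static int
start_connect(const char *path)
{
    struct sockaddr_un addr;
    int         fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0)
        return fd;              /* connected immediately */

    if (errno == EINPROGRESS || errno == EINTR)
        return fd;              /* fine: wait for the socket to become writable */

    /*
     * Anything else -- including EAGAIN/EWOULDBLOCK on Linux, which here
     * means resource exhaustion, not "try again later" -- is a real failure.
     */
    fprintf(stderr, "connect failed: %s\n", strerror(errno));
    close(fd);
    return -1;
}

int
main(void)
{
    int         fd = start_connect("/tmp/.s.PGSQL.5432");  /* path is illustrative */

    if (fd >= 0)
        close(fd);
    return 0;
}

The bug described above is essentially the opposite of this: EAGAIN/EWOULDBLOCK was treated as the "in progress" case, so the failure only surfaced later, when the startup packet was sent.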
Re: [HACKERS] Hash partitioning.
On 26.06.2013 11:17, Yuri Levinsky wrote: The main purpose of partitioning in my world is to store billions of rows and be able to search by date, hour or even minute as fast as possible. Hash partitioning sounds like a bad fit for that use case. A regular b-tree, possibly with range partitioning, sounds optimal for that. When you dealing with company, which has ~350.000.000 users, and you don't want to use key/value data stores: you need hash partitioned tables and hash partitioned table clusters to perform fast search and 4-6 tables join based on user phone number for example. B-trees are surprisingly fast for key-value lookups. There is no reason to believe that a hash partitioned table would be faster for that than a plain table. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add more regression tests for dbcommands
Hi, I am generally a bit unsure whether the regression tests you propose aren't a bit too verbose. Does any of the committers have an opinion about this? My feeling is that they are ok if they aren't slowing down things much. On 2013-06-26 01:55:53 -0500, Robins Tharakan wrote: The CREATE DATABASE test itself was checking whether the 'CONNECTION LIMIT' was working. Removed that as well. You should be able to test that by setting the connection limit to 1 and then try to connect using \c. The old connection is only dropped after the new one has been successfully performed. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement
On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule pavel.steh...@gmail.comwrote: 2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com: On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello This is fragment ofmy old code from orafce package - it is functional, but it is written little bit more generic due impossible access to static variables (in elog.c) static char* dbms_utility_format_call_stack(char mode) { MemoryContext oldcontext = CurrentMemoryContext; ErrorData *edata; ErrorContextCallback *econtext; StringInfo sinfo; #if PG_VERSION_NUM = 80400 errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO, TEXTDOMAIN); #else errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO); #endif MemoryContextSwitchTo(oldcontext); for (econtext = error_context_stack; econtext != NULL; econtext = econtext-previous) (*econtext-callback) (econtext-arg); edata = CopyErrorData(); FlushErrorState(); https://github.com/orafce/orafce/blob/master/utility.c 2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com: Hi, Use of this feature is to get call stack for debugging without raising exception. This definitely seems very useful. Here are my comments about the submitted patch: Patch gets applied cleanly (there are couple of white space warning but that's into test case file). make and make install too went smooth. make check was smooth too. Did some manual testing about feature and its went smooth. However would like to share couple of points: My main motive is concentration to stack info string only. So I didn't use err_start function, and I didn't use CopyErrorData too. These routines does lot of other work. 1) I noticed that you did the initialization of edata manually, just wondering can't we directly use CopyErrorData() which will do that job ? yes, we can, but in this context on context field is interesting for us. I found that inside CopyErrorData() there is Assert: Assert(CurrentMemoryContext != ErrorContext); And in the patch we are setting using ErrorContext as short living context, is it the reason of not using CopyErrorData() ? it was not a primary reason, but sure - this routine is not designed for direct using from elog module. Next, we can save one palloc call. Hmm ok. 2) To reset stack to empty, FlushErrorState() can be used. yes, it can be. I use explicit rows due different semantics, although it does same things. FlushErrorState some global ErrorState and it is used from other modules. I did a reset of ErrorContext. I fill a small difference there - so FlushErrorState is not best name for my purpose. What about modification: static void resetErrorStack() { errordata_stack_depth = -1; recursion_depth = 0; /* Delete all data in ErrorContext */ MemoryContextResetAndDeleteChildren(ErrorContext); } and then call in InvokeErrorCallbacks -- resetErrorStack() and rewrote flushErrorState like void FlushErrorState(void) { /* reset ErrorStack is enough */ resetErrorStack(); } ??? Nope. rather then that I would still prefer direct call of FlushErrorState(). but I can live well with direct call of FlushErrorState() - it is only minor change. 3) I was think about the usability of the feature and wondering how about if we add some header to the stack, so user can differentiate between STACK and normal NOTICE message ? 
Something like: postgres=# select outer_outer_func(10); NOTICE: - Call Stack - PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS PL/pgSQL function outer_func(integer) line 3 at RETURN PL/pgSQL function outer_outer_func(integer) line 3 at RETURN CONTEXT: PL/pgSQL function outer_func(integer) line 3 at RETURN PL/pgSQL function outer_outer_func(integer) line 3 at RETURN outer_outer_func -- 20 (1 row) I understand, but I don't think so it is good idea. Sometimes, you would to use context for different purposes than debug log - for example - you should to identify top most call or near call - and any additional lines means some little bit more difficult parsing. You can add this line simply (if you want) RAISE NOTICE e'--- Call Stack ---\n%', stack but there are more use cases, where you would not this information (or you can use own header (different language)), and better returns only clean data. I don't know but I still feel like given header from function it self would be nice to have things. Because it will help user to differentiate between STACK and normal NOTICE context message. I worked on point 2) and 3) and PFA patch for
Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement
2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com: On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com: On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello This is fragment ofmy old code from orafce package - it is functional, but it is written little bit more generic due impossible access to static variables (in elog.c) static char* dbms_utility_format_call_stack(char mode) { MemoryContext oldcontext = CurrentMemoryContext; ErrorData *edata; ErrorContextCallback *econtext; StringInfo sinfo; #if PG_VERSION_NUM = 80400 errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO, TEXTDOMAIN); #else errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO); #endif MemoryContextSwitchTo(oldcontext); for (econtext = error_context_stack; econtext != NULL; econtext = econtext-previous) (*econtext-callback) (econtext-arg); edata = CopyErrorData(); FlushErrorState(); https://github.com/orafce/orafce/blob/master/utility.c 2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com: Hi, Use of this feature is to get call stack for debugging without raising exception. This definitely seems very useful. Here are my comments about the submitted patch: Patch gets applied cleanly (there are couple of white space warning but that's into test case file). make and make install too went smooth. make check was smooth too. Did some manual testing about feature and its went smooth. However would like to share couple of points: My main motive is concentration to stack info string only. So I didn't use err_start function, and I didn't use CopyErrorData too. These routines does lot of other work. 1) I noticed that you did the initialization of edata manually, just wondering can't we directly use CopyErrorData() which will do that job ? yes, we can, but in this context on context field is interesting for us. I found that inside CopyErrorData() there is Assert: Assert(CurrentMemoryContext != ErrorContext); And in the patch we are setting using ErrorContext as short living context, is it the reason of not using CopyErrorData() ? it was not a primary reason, but sure - this routine is not designed for direct using from elog module. Next, we can save one palloc call. Hmm ok. 2) To reset stack to empty, FlushErrorState() can be used. yes, it can be. I use explicit rows due different semantics, although it does same things. FlushErrorState some global ErrorState and it is used from other modules. I did a reset of ErrorContext. I fill a small difference there - so FlushErrorState is not best name for my purpose. What about modification: static void resetErrorStack() { errordata_stack_depth = -1; recursion_depth = 0; /* Delete all data in ErrorContext */ MemoryContextResetAndDeleteChildren(ErrorContext); } and then call in InvokeErrorCallbacks -- resetErrorStack() and rewrote flushErrorState like void FlushErrorState(void) { /* reset ErrorStack is enough */ resetErrorStack(); } ??? Nope. rather then that I would still prefer direct call of FlushErrorState(). but I can live well with direct call of FlushErrorState() - it is only minor change. 3) I was think about the usability of the feature and wondering how about if we add some header to the stack, so user can differentiate between STACK and normal NOTICE message ? 
Something like: postgres=# select outer_outer_func(10); NOTICE: - Call Stack - PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS PL/pgSQL function outer_func(integer) line 3 at RETURN PL/pgSQL function outer_outer_func(integer) line 3 at RETURN CONTEXT: PL/pgSQL function outer_func(integer) line 3 at RETURN PL/pgSQL function outer_outer_func(integer) line 3 at RETURN outer_outer_func -- 20 (1 row) I understand, but I don't think so it is good idea. Sometimes, you would to use context for different purposes than debug log - for example - you should to identify top most call or near call - and any additional lines means some little bit more difficult parsing. You can add this line simply (if you want) RAISE NOTICE e'--- Call Stack ---\n%', stack but there are more use cases, where you would not this information (or you can use own header (different language)), and better returns only clean data. I don't know but I still feel like given header from function it self would be nice to have things. Because it will help user to differentiate between STACK and normal NOTICE context
Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages
On Wednesday, June 26, 2013 4:40 PM Andres Freund wrote: Hi Amit, On 2013-06-26 16:22:28 +0530, Amit Kapila wrote: On Wednesday, June 26, 2013 1:20 PM Andres Freund wrote: On 2013-06-26 08:50:27 +0530, Amit Kapila wrote: On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote: On 2013-06-16 17:19:49 -0700, Josh Berkus wrote: Amit posted a new version of this patch on January 23rd. But last comment on it by Tom is not sure everyone wants this. https://commitfest.postgresql.org/action/patch_view?id=905 ... so, what's the status of this patch? That comment was referencing a mail of mine - so perhaps I better explain: I think the usecase for this utility isn't big enough to be included in postgres since it really can only help in a very limited circumstances. And I think it's too likely to be misused for stuff it's not useable for (e.g. remastering). The only scenario I see is that somebody deleted/corrupted pg_controldata. In that scenario the tool is supposed to be used to find the biggest lsn used so far so the user then can use pg_resetxlog to set that as the wal starting point. But that can be way much easier solved by just setting the LSN to something very, very high. The database cannot be used for anything reliable afterwards anyway. One of the main reason this was written was to make server up in case of corruption and user can take dump of some useful information if any. By setting LSN very, very high user might loose the information which he wants to take dump. Which information would that loose? Information from WAL replay which can be more appropriate by selecting LSN. Sorry, I can't follow. If wal replay still is an option you can just look at the WAL and get a sensible value way easier. Originally 2 parts were proposed, one was to get LSN from data pages and other from data pages. Original proposal is: http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382851FFA 1@szxeml509-mbs The second part for looking into WAL was written but due to xlogreader patch, it was postponed and I didn't get time after that to pursue it. The whole tool seems to only make sense if you've lost pg_xlog. The tool's initial intent was if pg_controldata is lost and this idea is originated in below mail chain: http://www.postgresql.org/message-id/4274.1340084...@sss.pgh.pa.us Also for a developer, guessing very high LSN might be easy, but for users it might not be equally easy, and getting such value by utility would be comfortable. Well, then we can just document some very high lsn and be done with it. Like CF00/. That would leave enough space for eventual writes caused while dumping the database (say hint bit writes in a checksummed database) and cannot yet be realistically be reached during normal operation. Can we be ultra sure, that this LSN is not reached. I think it will take vary long to reach such LSN, but still theoretically it can be possible. I don't have any evidence. One more use case for which this utility was done is as below: It will be used to decide that on new-standby (old-master) whether a full backup is needed from New-master(old-standby). The backup is required when the data page in old-master precedes the last applied LSN in old-standby (i.e., new-master) at the moment of the failover. That's exactly what I was afraid of. Unless I miss something the tool is *NOT* sufficient to do this. You mean to say if user knows the max LSN of data pages in old-master and last applied LSN in new master, he cannot decide whether Full backup is needed? 
It should be straightforward decision that skip a backup if that old-master LSN is less than the new-master (i.e., last applied LSN, IOW, timeline switch LSN). It was proposed as a usecase in this below mail: http://www.postgresql.org/message-id/CAHGQGwHyd1fY0hF0qKh0-uKDh-gcbYxMOFBYVk kh4jzji-f...@mail.gmail.com Look at the mail introducing pg_rewind and the ensuing discussion for what's necessary for that. I had briefly looked into that discussion at the time it was going on, but I will look into it more carefully. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bloom Filter lookup for hash joins
One idea that I had was to use them to do CLOG lookups from smaller datastructures. You could store the list of aborted xids in bloom filters. When a xid is not found in the filter, then it is known to have committed, if the filter returns true, then you have to check the real CLOG. This would achieve for example a 1:8 compression ratio at 99% commit rate and 1:160'000 false positive rate, or 1:16 compression ratio at 1:400 false positive rate. Nothing too exciting, so I didn't really develop the idea any further. Right. The problem here is that if the hash table is in memory, doing a hash table lookup directly is likely to be cheaper than a bloom filter lookup, even if the bloom filter fits into the processor cache and the hash table doesn't (10 last level cache hits is slower than one cache miss). Bloom filter will help when its not feasible to use an actual hash table (lots of large keys), the real lookup is slow (e.g. an index lookup), you are doing a lot of lookups to amortize the construction cost and the condition is expected to be selective (most lookups give a negative). There might be some dataware house types of queries where it would help, but it seems like an awfully narrow use case with a potential for performance regressions when the planner has a bad estimate. Ok, sounds good. Can't we use bloom filters for the case where the hash table doesn't fit in memory? Specifically, when reading from disk is inevitable for accessing the hash table, we can use bloom filters to decide which keys to actually read from disk. Thus, we save the I/O costs we would have incurred reading keys that are not the ones we want and for which the bloom filter gives us a negative. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move unused buffers to freelist
On Tuesday, June 25, 2013 10:25 AM Amit Kapila wrote: On Monday, June 24, 2013 11:00 PM Robert Haas wrote: On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila amit.kap...@huawei.com wrote: To avoid above 3 factors in test readings, I used below steps: 1. Initialize the database with scale factor such that database size + shared_buffers = RAM (shared_buffers = 1/4 of RAM). For example: Example -1 if RAM = 128G, then initialize db with scale factor = 6700 and shared_buffers = 32GB. Database size (98 GB) + shared_buffers (32GB) = 130 (which is approximately equal to total RAM) Example -2 (this is based on your test m/c) If RAM = 64GB, then initialize db with scale factor = 3400 and shared_buffers = 16GB. 2. reboot m/c 3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm. I had loaded 3 times, so that usage count of buffers will be approximately 3. Hmm. I don't think the usage count will actually end up being 3, though, because the amount of data you're loading is sized to 3/4 of RAM, and shared_buffers is just 1/4 of RAM, so I think that each run of pg_prewarm will end up turning over the entire cache and you'll never get any usage counts more than 1 this way. Am I confused? The way I am pre-warming is that loading the data of relation (table/index) continuously 3 times, so mostly the buffers will contain the data of relations loaded in last which are indexes and also they got accessed more during scans. So usage count should be 3. Can you please once see load_all_buffers.sql, may be my understanding has some gap. Now about the question why then load all the relations. Apart from PostgreSQL shared buffers, loading data this way can also make sure OS buffers will have the data with higher usage count which can lead to better OS scheduling. I wonder if it would be beneficial to test the case where the database size is just a little more than shared_buffers. I think that would lead to a situation where the usage counts are high most of the time, which - now that you mention it - seems like the sweet spot for this patch. I will check this case and take the readings for same. Thanks for your suggestions. Configuration Details O/S - Suse-11 RAM - 128GB Number of Cores - 16 Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min, synchronous_commit = 0FF, shared_buffers = 14GB, AutoVacuum=off Pgbench - Select-only Scalefactor - 1200 Time - 30 mins 8C-8T16C-16T32C-32T64C-64T Head 62403101810 99516 94707 Patch 62827101404 99109 94744 On 128GB RAM, if use scalefactor=1200 (database=approx 17GB) and 14GB shared buffers, this is no major difference. One of the reasons could be that there is no much swapping in shared buffers as most data already fits in shared buffers. I think more readings are need for combinations related to below settings: scale factor such that database size + shared_buffers = RAM (shared_buffers = 1/4 of RAM). I can try varying shared_buffer size. Kindly let me know your suggestions? With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
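For readers following the usage-count discussion, a much-simplified model of the buffer clock sweep is sketched below. It is only meant to make the argument concrete -- PostgreSQL's real logic lives in freelist.c and is more involved -- but it shows why it matters whether prewarming leaves buffers at usage count 3 or 1: each full sweep pass decrements the count by only one, so higher counts survive eviction pressure longer.

#include <stdio.h>

#define NBUFFERS        8
#define MAX_USAGE_COUNT 5       /* PostgreSQL caps usage_count at 5 */

typedef struct
{
    int         page;           /* which page is cached, -1 if free */
    int         usage_count;
} Buffer;

static Buffer buffers[NBUFFERS];
static int  clock_hand = 0;

/* On every access, bump the buffer's usage count (up to the cap). */
static void
buffer_touch(Buffer *buf)
{
    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;
}

/* Clock sweep: decrement counts until some buffer reaches zero, then evict it. */
static Buffer *
buffer_evict_victim(void)
{
    for (;;)
    {
        Buffer     *buf = &buffers[clock_hand];

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (buf->usage_count == 0)
            return buf;         /* victim found */
        buf->usage_count--;     /* survives this pass, weaker next time */
    }
}

int
main(void)
{
    /* "Prewarm" every buffer three times, as in the test setup above. */
    for (int pass = 0; pass < 3; pass++)
        for (int i = 0; i < NBUFFERS; i++)
        {
            buffers[i].page = i;
            buffer_touch(&buffers[i]);
        }

    Buffer     *victim = buffer_evict_victim();

    printf("evicted page %d after its count fell to 0\n", victim->page);
    return 0;
}

In this toy run every buffer ends up at count 3, so the sweep needs three full passes before it finds a victim -- which is the behaviour the prewarming steps above are trying to set up.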
Re: [HACKERS] Bloom Filter lookup for hash joins
On 26 June 2013 07:46, Atri Sharma atri.j...@gmail.com wrote: I have been researching bloom filters and discussed it on IRC with RhodiumToad and David Fetter, and they pointed me to the various places that could potentially have bloom filters, apart from the places that already have them currently. I have been reading the current implementation of hash joins, and in ExecScanHashBucket, which I understand is the actual lookup function, we could potentially look at a bloom filter per bucket. Instead of actually looking up each hash value for the outer relation, we could just check the corresponding bloom filter for that bucket, and if we get a positive, then lookup the actual values i.e. continue with our current behaviour (since we could be looking at a false positive). Exactly this was suggested by me on the NTUP_PER_BUCKET thread last week. Probably good idea to join in there also. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Bloom Filter lookup for hash joins
On Wed, Jun 26, 2013 at 5:47 PM, Simon Riggs si...@2ndquadrant.com wrote: On 26 June 2013 07:46, Atri Sharma atri.j...@gmail.com wrote: I have been researching bloom filters and discussed it on IRC with RhodiumToad and David Fetter, and they pointed me to the various places that could potentially have bloom filters, apart from the places that already have them currently. I have been reading the current implementation of hash joins, and in ExecScanHashBucket, which I understand is the actual lookup function, we could potentially look at a bloom filter per bucket. Instead of actually looking up each hash value for the outer relation, we could just check the corresponding bloom filter for that bucket, and if we get a positive, then lookup the actual values i.e. continue with our current behaviour (since we could be looking at a false positive). Exactly this was suggested by me on the NTUP_PER_BUCKET thread last week. Probably good idea to join in there also. Great.I would love to assist you in implementing this, if possible. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement
On Wed, Jun 26, 2013 at 5:11 PM, Pavel Stehule pavel.steh...@gmail.comwrote: 2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com: On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com: On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello This is fragment ofmy old code from orafce package - it is functional, but it is written little bit more generic due impossible access to static variables (in elog.c) static char* dbms_utility_format_call_stack(char mode) { MemoryContext oldcontext = CurrentMemoryContext; ErrorData *edata; ErrorContextCallback *econtext; StringInfo sinfo; #if PG_VERSION_NUM = 80400 errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO, TEXTDOMAIN); #else errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO); #endif MemoryContextSwitchTo(oldcontext); for (econtext = error_context_stack; econtext != NULL; econtext = econtext-previous) (*econtext-callback) (econtext-arg); edata = CopyErrorData(); FlushErrorState(); https://github.com/orafce/orafce/blob/master/utility.c 2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com: Hi, Use of this feature is to get call stack for debugging without raising exception. This definitely seems very useful. Here are my comments about the submitted patch: Patch gets applied cleanly (there are couple of white space warning but that's into test case file). make and make install too went smooth. make check was smooth too. Did some manual testing about feature and its went smooth. However would like to share couple of points: My main motive is concentration to stack info string only. So I didn't use err_start function, and I didn't use CopyErrorData too. These routines does lot of other work. 1) I noticed that you did the initialization of edata manually, just wondering can't we directly use CopyErrorData() which will do that job ? yes, we can, but in this context on context field is interesting for us. I found that inside CopyErrorData() there is Assert: Assert(CurrentMemoryContext != ErrorContext); And in the patch we are setting using ErrorContext as short living context, is it the reason of not using CopyErrorData() ? it was not a primary reason, but sure - this routine is not designed for direct using from elog module. Next, we can save one palloc call. Hmm ok. 2) To reset stack to empty, FlushErrorState() can be used. yes, it can be. I use explicit rows due different semantics, although it does same things. FlushErrorState some global ErrorState and it is used from other modules. I did a reset of ErrorContext. I fill a small difference there - so FlushErrorState is not best name for my purpose. What about modification: static void resetErrorStack() { errordata_stack_depth = -1; recursion_depth = 0; /* Delete all data in ErrorContext */ MemoryContextResetAndDeleteChildren(ErrorContext); } and then call in InvokeErrorCallbacks -- resetErrorStack() and rewrote flushErrorState like void FlushErrorState(void) { /* reset ErrorStack is enough */ resetErrorStack(); } ??? Nope. rather then that I would still prefer direct call of FlushErrorState(). but I can live well with direct call of FlushErrorState() - it is only minor change. 3) I was think about the usability of the feature and wondering how about if we add some header to the stack, so user can differentiate between STACK and normal NOTICE message ? 
Something like: postgres=# select outer_outer_func(10); NOTICE: - Call Stack - PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS PL/pgSQL function outer_func(integer) line 3 at RETURN PL/pgSQL function outer_outer_func(integer) line 3 at RETURN CONTEXT: PL/pgSQL function outer_func(integer) line 3 at RETURN PL/pgSQL function outer_outer_func(integer) line 3 at RETURN outer_outer_func -- 20 (1 row) I understand, but I don't think so it is good idea. Sometimes, you would to use context for different purposes than debug log - for example - you should to identify top most call or near call - and any additional lines means some little bit more difficult parsing. You can add this line simply (if you want) RAISE NOTICE e'--- Call Stack ---\n%', stack but there are more use cases, where you would not this information (or you can use own header
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On Sun, Jun 23, 2013 at 8:41 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 23.06.2013 01:48, Simon Riggs wrote: On 22 June 2013 21:40, Stephen Frostsfr...@snowman.net wrote: I'm actually not a huge fan of this as it's certainly not cheap to do. If it can be shown to be better than an improved heuristic then perhaps it would work but I'm not convinced. We need two heuristics, it would seem: * an initial heuristic to overestimate the number of buckets when we have sufficient memory to do so * a heuristic to determine whether it is cheaper to rebuild a dense hash table into a better one. Although I like Heikki's rebuild approach we can't do this every x2 overstretch. Given large underestimates exist we'll end up rehashing 5-12 times, which seems bad. It's not very expensive. The hash values of all tuples have already been calculated, so rebuilding just means moving the tuples to the right bins. Better to let the hash table build and then re-hash once, it we can see it will be useful. That sounds even less expensive, though. Hi all, I just popped in here on Simon's advice to put an idea I had about optimizing hash joins on this thread. Essentially, I was thinking of using bloom filters in the part where we match the actual hash values of the outer relation's tuple and the hash table. We could do a bloom filter lookup first, and if it gives a positive, then we can proceed like we do currently, since we could have a false positive. However, if we have a negative from the bloom filter, then, we can skip that tuple since bloom filters never give false negatives. Another thing we could potentially look at is that in the case when the hash table doesnt fit into memory, and we have to do disk lookups, then, we could do bloom filter lookups first in order to select the tuples to be read from disk(only the positive ones). Thoughts/Comments please? Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
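To show where such a filter would sit in the out-of-memory (batched) case: while inner tuples are written to batch files, their hash values also go into a filter, and outer tuples whose hash is definitely absent can be skipped without touching the batch file at all (for an inner join). The sketch below uses a crude one-bit-per-hash table rather than a real multi-hash bloom filter, and all names and sizes are made up for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FILTER_BITS (1u << 20)  /* 1 Mbit filter; size is an assumption */

static uint8_t filter[FILTER_BITS / 8];

static void
filter_add(uint32_t hash)
{
    uint32_t    bit = hash % FILTER_BITS;

    filter[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

static bool
filter_might_match(uint32_t hash)
{
    uint32_t    bit = hash % FILTER_BITS;

    return (filter[bit / 8] & (1 << (bit % 8))) != 0;
}

int
main(void)
{
    /* Phase 1: "write" inner tuples to batch files, noting their hashes. */
    uint32_t    inner_hashes[] = {12345, 67890, 424242};

    for (size_t i = 0; i < sizeof(inner_hashes) / sizeof(inner_hashes[0]); i++)
        filter_add(inner_hashes[i]);

    /* Phase 2: outer tuples whose hash is filtered out never hit the disk. */
    uint32_t    outer_hashes[] = {12345, 99999, 424242, 31337};

    for (size_t i = 0; i < sizeof(outer_hashes) / sizeof(outer_hashes[0]); i++)
    {
        if (!filter_might_match(outer_hashes[i]))
            printf("hash %u: skipped, no batch-file lookup needed\n",
                   outer_hashes[i]);
        else
            printf("hash %u: maybe present, probe the batch as usual\n",
                   outer_hashes[i]);
    }
    return 0;
}

Whether the filter pays for itself then depends on how selective the join is and how good the planner's estimate of that selectivity is, which is the same caveat raised earlier in the thread.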
Re: [HACKERS] Hash partitioning.
On Tue, Jun 25, 2013 at 02:52:33PM -0700, Kevin Grittner wrote: Claudio Freire klaussfre...@gmail.com wrote: Did you try select * from foo where (a % 16) = (1::int % 16)? I did. Using Robert's hashed partitioning table definitions: test=# explain select * from foo where a = 1 and (a % 16) = (1 % 16); QUERY PLAN Append (cost=0.00..31.53 rows=2 width=36) -> Seq Scan on foo (cost=0.00..0.00 rows=1 width=36) Filter: ((a = 1) AND ((a % 16) = 1)) -> Seq Scan on foo1 (cost=0.00..31.53 rows=1 width=36) Filter: ((a = 1) AND ((a % 16) = 1)) (5 rows) So if you are generating your queries through something capable of generating that last clause off of the first, this could work. Not OK, so what is it in our code that requires that? It is a type mismatch? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Kudos for Reviewers -- straw poll
On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote: On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote: How should reviewers get credited in the release notes? a) not at all b) in a single block titled Reviewers for this version at the bottom. c) on the patch they reviewed, for each patch A weak preference for (c), with (b) running a close second. As others have suggested, a review that leads to significant commitable changes to the patch should bump the credit to co-authorship. As a reminder, I tried a variant of C for 9.2 beta release notes, and got lots of complaints, particularly because the line describing the feature now had many more names on it. In my opinion, adding reviewer names to each feature item might result in the removal of all names from features. A poll is nice for gauging interest, but many people who vote don't understand the ramifications of what they are voting on. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Kudos for Reviewers -- straw poll
On 06/26/2013 09:14 AM, Bruce Momjian wrote: On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote: On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote: How should reviewers get credited in the release notes? a) not at all b) in a single block titled Reviewers for this version at the bottom. c) on the patch they reviewed, for each patch A weak preference for (c), with (b) running a close second. As others have suggested, a review that leads to significant commitable changes to the patch should bump the credit to co-authorship. As a reminder, I tried a variant of C for 9.2 beta release notes, and got lots of complaints, particularly because the line describing the feature now had many more names on it. In my opinion, adding reviewer names to each feature item might result in the removal of all names from features. A poll is nice for gauging interest, but many people who vote don't understand the ramifications of what they are voting on. That's why I voted for b :-) cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimizing pglz compressor
On Wednesday, June 26, 2013 2:15 AM Heikki Linnakangas wrote: On 19.06.2013 14:01, Amit Kapila wrote: Observations -- 1. For small data performance is always good with patch. 2. For random small/large data performance is good. 3. For medium and large text and same-byte data (3K, 5K text; 10K, 100K, 500K same byte), performance is degraded. Wow, that's strange. What platform and CPU did you test on? CPU - 4 core, RAM - 24GB, OS - SUSE 11 SP2, Kernel version - 3.0.13. Are you sure you used the same compiler flags with and without the patch? Yes. Can you also try the attached patch, please? It's the same as before, but in this version, I didn't replace the prev and next pointers in PGLZ_HistEntry struct with int16s. That avoids some table lookups, at the expense of using more memory. It's closer to what we have without the patch, so maybe that helps on your system. Yes, it helped a lot on my system.

Head:
 testname          |      auto
-------------------+----------
 5k text           |  3499.888
 512b text         |  1425.106
 256b text         |  1769.126
 1K text           |  1378.151
 3K text           |  4081.254
 2k random         |  -410.928
 100k random       |   -10.235
 500k random       |    -2.094
 512b random       |  -770.665
 256b random       | -1120.173
 1K random         |  -570.351
 10k of same byte  |  3602.610
 100k of same byte | 36187.863
 500k of same byte | 26055.472

After your patch (pglz-variable-size-hash-table-2.patch):
 testname          |      auto
-------------------+----------
 5k text           |  3524.306
 512b text         |   954.962
 256b text         |   832.527
 1K text           |  1273.970
 3K text           |  3963.329
 2k random         |  -300.516
 100k random       |    -7.538
 500k random       |    -1.525
 512b random       |  -439.726
 256b random       |  -440.154
 1K random         |  -391.070
 10k of same byte  |  3570.921
 100k of same byte | 37498.502
 500k of same byte | 26904.426

There was a minor problem in your patch: in one of the experiments it crashed. The fix is not to access the 0th history entry in pglz_find_match(); the modified patch is attached. After the fix, readings are almost similar:

 testname          |      auto
-------------------+----------
 5k text           |  3347.961
 512b text         |   938.442
 256b text         |   834.496
 1K text           |  1243.618
 3K text           |  3790.835
 2k random         |  -306.470
 100k random       |    -7.589
 500k random       |    -1.517
 512b random       |  -442.649
 256b random       |  -438.781
 1K random         |  -392.106
 10k of same byte  |  3565.449
 100k of same byte | 37355.595
 500k of same byte | 26776.076

I guess some difference might be due to the different way of accessing in pglz_hist_add(). With Regards, Amit Kapila. pglz-variable-size-hash-table-3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
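For context on the crash fix mentioned above: as far as I understand the pglz history table, entry 0 is reserved as the "invalid" entry so that a zero link can serve as the end-of-chain marker, which is why the match loop must never dereference it as real data. The fragment below is a stripped-down sketch of that convention, not the actual pg_lzcompress.c code.

#include <stddef.h>
#include <stdint.h>

#define HIST_SIZE     4096      /* illustrative history size */
#define INVALID_ENTRY 0         /* entry 0 is reserved as the list terminator */

typedef struct HistEntry
{
    int16_t     next;           /* index of next entry with the same hash, or 0 */
    const char *pos;            /* position in the input this entry refers to */
} HistEntry;

static HistEntry hist_entries[HIST_SIZE];
static int16_t hist_start[HIST_SIZE];   /* head of chain per hash bucket */

/*
 * Walk the chain of previous occurrences for one hash bucket.  The loop
 * must stop at INVALID_ENTRY; dereferencing entry 0 as if it held real
 * data is exactly the kind of bug the fixed patch avoids.
 */
static const char *
find_first_candidate(uint32_t hash)
{
    int16_t     idx = hist_start[hash % HIST_SIZE];

    while (idx != INVALID_ENTRY)
    {
        const HistEntry *e = &hist_entries[idx];

        if (e->pos != NULL)
            return e->pos;      /* real code would compare input bytes here */
        idx = e->next;
    }
    return NULL;                /* no earlier occurrence */
}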
Re: [HACKERS] Hash partitioning.
Heikki, As far as I understand the height of the btree will affect the number of I/Os necessary. The height of the tree does not increase linearly with the number of records. May be I wrong in terminology but when I am trying to insert data into empty table the insertion time is increasing when number of records is growing. In order to keep indexes as small as possible I usually split the table by hash if I don't have any better alternative. On some systems hash functions +index might work faster when only index for insert and search operations. This especially usable when you have non unique index with small number of possible values that you don't know in advance or that changing between your customers. In that case the hash partition has to be used instead of index. Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222 -Original Message- From: Heikki Linnakangas [mailto:hlinnakan...@vmware.com] Sent: Wednesday, June 26, 2013 2:23 PM To: Yuri Levinsky Cc: Tom Lane; Christopher Browne; Robert Haas; Bruce Momjian; PostgreSQL Mailing Lists Subject: Re: [HACKERS] Hash partitioning. On 26.06.2013 11:17, Yuri Levinsky wrote: The main purpose of partitioning in my world is to store billions of rows and be able to search by date, hour or even minute as fast as possible. Hash partitioning sounds like a bad fit for that use case. A regular b-tree, possibly with range partitioning, sounds optimal for that. When you dealing with company, which has ~350.000.000 users, and you don't want to use key/value data stores: you need hash partitioned tables and hash partitioned table clusters to perform fast search and 4-6 tables join based on user phone number for example. B-trees are surprisingly fast for key-value lookups. There is no reason to believe that a hash partitioned table would be faster for that than a plain table. - Heikki This mail was received via Mail-SeCure System. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated emacs configuration
On Tue, Jun 25, 2013 at 11:17:47PM -0400, Peter Eisentraut wrote: On Sun, 2013-06-23 at 16:03 -0400, Noah Misch wrote: ((c-mode . ((c-basic-offset . 4) (fill-column . 79) I don't know whether you'd consider it to fall within the scope of this update, but 78 is the fill-column setting that matches pgindent. Well, well, well. I did some extensive tests some time ago when that setting was added. I have a suspicion that this could be related to the recent pgindent changes (which everyone claims were no changes). I'm going to have to research that some more. pgindent has not changed the following xlog.c comment since April 2011, but fill-column 77 or 79 changes it: /* * fullPageWrites is the master copy used by all backends to determine * whether to write full-page to WAL, instead of using process-local one. * This is required because, when full_page_writes is changed by SIGHUP, * we must WAL-log it before it actually affects WAL-logging by backends. * Checkpointer sets at startup or after SIGHUP. */ Note that emacs and pgindent remain at odds over interior tabs in comments. When pgindent finds a double-space (typically after a sentence) ending at a tab stop, it replaces the double-space with a tab. c-fill-paragraph will convert that tab to a *single* space, and that can be enough to change many line break positions. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
Atri, * Atri Sharma (atri.j...@gmail.com) wrote: I just popped in here on Simon's advice to put an idea I had about optimizing hash joins on this thread. I'd encourage reading the thread a bit first, in the future.. :) Essentially, I was thinking of using bloom filters in the part where we match the actual hash values of the outer relation's tuple and the hash table. We could do a bloom filter lookup first, and if it gives a positive, then we can proceed like we do currently, since we could have a false positive. However, if we have a negative from the bloom filter, then, we can skip that tuple since bloom filters never give false negatives. I suggested this up-thread already, but it's not really a bloom filter as there's only one hash function available- I can't see us requiring every data type to provide multiple hash functions. Still, I do think breaking the single 32-bit hash key space up into fixed-sized chunks and then having a bitfield array which we test against (very similar to how the visibility map works) to see if there's any chance that a given hash key exists might be valuable. The problem is that, because we don't have multiple hash functions, it's not clear how much empty space we'd actually end up with. This needs to be tested with different data sets and different sizes for the bitfield (leading to correspondingly different sizes for the amount of key space covered by a single bit), along with performance metrics which determine how expensive it is to build and use the bitfield. Another thing we could potentially look at is that in the case when the hash table doesnt fit into memory, and we have to do disk lookups, then, we could do bloom filter lookups first in order to select the tuples to be read from disk(only the positive ones). We could have a bitfield filter (as I described above) created for each bucket and then test against that before considering if we actually have to go look in that bucket, yes. I'm not sure if that's quite what you were thinking, but I can see how a bitfield per bucket might work. If you were suggesting something else, please clarify. Thanks, Stephen signature.asc Description: Digital signature
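A rough sketch of the "bitfield over the hash key space" idea as described above: the 32-bit space of hash values is divided into fixed-size chunks, one bit per chunk, so a cleared bit proves that no inner key hashed into that chunk and the bucket probe can be skipped. The chunk size below (and therefore the memory/false-positive trade-off) is a placeholder; as noted, that is exactly what would need measuring.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * One bit per chunk of the 32-bit hash key space.  With a 16-bit shift the
 * bitmap covers 2^16 chunks in 8 kB; the shift is a tunable placeholder.
 */
#define CHUNK_SHIFT 16
#define NCHUNKS     (1u << (32 - CHUNK_SHIFT))

typedef struct HashKeyspaceFilter
{
    uint8_t     bits[NCHUNKS / 8];
} HashKeyspaceFilter;

static void
keyspace_filter_init(HashKeyspaceFilter *f)
{
    memset(f->bits, 0, sizeof(f->bits));
}

/* Called for every inner-side hash value while the hash table is built. */
static void
keyspace_filter_add(HashKeyspaceFilter *f, uint32_t hashvalue)
{
    uint32_t    chunk = hashvalue >> CHUNK_SHIFT;

    f->bits[chunk / 8] |= (uint8_t) (1 << (chunk % 8));
}

/* Called for each outer hash value before probing the bucket. */
static bool
keyspace_filter_might_match(const HashKeyspaceFilter *f, uint32_t hashvalue)
{
    uint32_t    chunk = hashvalue >> CHUNK_SHIFT;

    return (f->bits[chunk / 8] & (1 << (chunk % 8))) != 0;
}

With only one hash function available, the false-positive rate is entirely a function of how densely the inner keys populate the chunks, so this can only be judged against real data distributions and real build/probe timings.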
Re: [HACKERS] Hash partitioning.
On 06/25/2013 11:52 PM, Kevin Grittner wrote: At least until we have parallel query execution. At *that* point this all changes. Can you elaborate on that, please? I currently have a hard time imagining how partitions can help performance in that case, either. At least compared to modern RAID and read-ahead capabilities. After all, RAID can be thought of as hash partitioning with a very weird hash function. Or maybe rather range partitioning on an internal key. Put another way: ideally, the system should take care of optimally distributing data across its physical storage itself. If you need to do partitioning manually for performance reasons, that's actually a deficiency of it, not a feature. I certainly agree that manageability may be a perfectly valid reason to partition your data. Maybe there even exist other good reasons. I don't think performance optimization is one. (It's more like giving the system a hint. And we all dislike hints, don't we? *ducks*) Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote: On 06/25/2013 11:52 PM, Kevin Grittner wrote: At least until we have parallel query execution. At *that* point this all changes. Can you elaborate on that, please? I currently have a hard time imagining how partitions can help performance in that case, either. At least compared to modern RAID and read-ahead capabilities. After all, RAID can be thought of as hash partitioning with a very weird hash function. Or maybe rather range partitioning on an internal key. Put another way: ideally, the system should take care of optimally distributing data across its physical storage itself. If you need to do partitioning manually for performance reasons, that's actually a deficiency of it, not a feature. I certainly agree that manageability may be a perfectly valid reason to partition your data. Maybe there even exist other good reasons. I don't think performance optimization is one. (It's more like giving the system a hint. And we all dislike hints, don't we? *ducks*) Regards Markus Wanner Hi Markus, I think he is referring to the fact that with parallel query execution, multiple partitions can be processed simultaneously instead of serially as they are now with the resulting speed increase. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]
FYI I submitted a slightly modified patch since Amit's measurements that is slightly faster. On Jun 25, 2013, at 1:26 PM, Amit Kapila amit.kap...@huawei.com wrote: On Monday, June 24, 2013 8:20 PM Tom Lane wrote: Amit Kapila amit.kap...@huawei.com writes: I will summarize the results, and if most of us feel that they are not good enough, then we can return this patch. Aside from the question of whether there's really any generally-useful performance improvement from this patch, it strikes me that this change forecloses other games that we might want to play with interpretation of the value of a tuple's natts. In particular, when I was visiting Salesforce last week, the point came up that they'd really like ALTER TABLE ADD COLUMN to be free even when the column had a non-null DEFAULT. It's not too difficult to imagine how we might support that, at least when the default expression is a constant: decree that if the requested attribute number is beyond natts, look aside at the tuple descriptor to see if the column is marked as having a magic default value, and if so, substitute that rather than just returning NULL. (This has to be a magic value separate from the current default, else a subsequent DROP or SET DEFAULT would do the wrong thing.) However, this idea conflicts with an optimization that supposes it can reduce natts to suppress null columns: if the column was actually stored as null, you'd lose that information, and would incorrectly return the magic default on subsequent reads. I think it might be possible to make both ideas play together, by not reducing natts further than the last column with a magic default. However, that means extra complexity in heap_form_tuple, which would invalidate the performance measurements done in support of this patch. It can have slight impact on normal scenario's, but I am not sure how much because the change will be very less(may be one extra if check and one assignment) For this Patch's scenario, I think the major benefit for Performance is in heap_fill_tuple() where the For loop is reduced. However added some logic in heap_form_tuple can reduce the performance improvement, but there can still be space saving benefit. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
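To make the conflict concrete, here is a toy model (plain C, nothing from the backend) of attribute fetch with a hypothetical "magic default": columns beyond the stored natts normally read as NULL, but with a magic default they would read as the stored default instead -- and if natts were also truncated to drop trailing NULLs, the two cases become indistinguishable, which is exactly the problem described.

#include <stdbool.h>
#include <stdio.h>

/* Grossly simplified stand-ins for the real tuple structures. */
typedef struct
{
    int         natts;          /* number of attributes physically stored */
    bool        isnull[8];
    int         values[8];
} StoredTuple;

typedef struct
{
    bool        has_magic_default;  /* hypothetical per-column catalog flag */
    int         magic_default;
} ColumnDesc;

/*
 * Fetch column attnum (1-based).  Columns beyond natts normally read as
 * NULL; with "magic defaults" they read as the stored default instead.
 */
static int
fetch_attribute(const StoredTuple *tup, const ColumnDesc *col,
                int attnum, bool *isnull)
{
    if (attnum > tup->natts)
    {
        if (col[attnum - 1].has_magic_default)
        {
            *isnull = false;
            return col[attnum - 1].magic_default;
        }
        *isnull = true;
        return 0;
    }
    *isnull = tup->isnull[attnum - 1];
    return tup->values[attnum - 1];
}

int
main(void)
{
    /* Tuple stored before the third column was added: 2 attributes on disk. */
    StoredTuple before = {.natts = 2, .values = {1, 2}};
    ColumnDesc  cols[3] = {{false, 0}, {false, 0}, {true, 42}};
    bool        isnull;

    printf("col 3 reads as %d\n", fetch_attribute(&before, cols, 3, &isnull));

    /*
     * If a later tuple stored an explicit NULL in col 3 but its natts was
     * truncated back to 2 to save the null bitmap, this same code would
     * wrongly return 42 instead of NULL.
     */
    return 0;
}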
Re: [HACKERS] Hash partitioning.
On 26.06.2013 16:41, Yuri Levinsky wrote: Heikki, As far as I understand the height of the btree will affect the number of I/Os necessary. The height of the tree does not increase linearly with the number of records. The height of a b-tree is O(log n), where n is the number of records. Informally, if we assume that you have on average, say, 1000 keys on one b-tree page, a two-level b-tree can hold one million items, and a three level one billion items, and so on. The height of the tree affects the number of I/Os needed for searches, but keep in mind that the top levels of the tree are going to be very frequently accessed and in practice will stay permanently cached. You will only perform actual I/O on the 1-2 bottom levels of the tree (or 0 if it all fits in cache) Now let's compare that with a hash partitioned table, with 1000 partitions and a b-tree index on every partition. When doing a search, you first hash the key to look up the right partition, then you search the index of that partition. This is almost equivalent to just having a b-tree that's one level taller - instead of looking up the right partition in the hash table, you look up the right child page at the root of the b-tree. From a very coarse theoretical point of view, the only difference is that you replaced the binary search on the b-tree root page with an equivalent hash lookup. A hash lookup can be somewhat cheaper than binary search, but in practice there is very little difference. There certainly isn't any difference in the number of actual I/O performed. In practice, there might be a lot of quirks and inefficiencies and locking contention etc. involved in various DBMS's, that you might be able to work around with hash partitioning. But from a theoretical point of view, there is no reason to expect just partitioning a table on a hash to make key-value lookups any faster. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
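Heikki's fan-out argument is easy to check with a couple of lines of arithmetic: with roughly f keys per page, a b-tree of depth h addresses about f^h keys, so the depth needed for n keys is ceil(log n / log f). A quick sketch using the informal 1000-keys-per-page figure from above (only an assumption):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    const double fanout = 1000.0;   /* assumed keys per b-tree page */
    const double rows[] = {1e6, 1e9, 350e6};

    for (int i = 0; i < 3; i++)
    {
        int         levels = (int) ceil(log(rows[i]) / log(fanout));

        printf("%12.0f rows -> roughly a %d-level tree\n", rows[i], levels);
    }
    return 0;
}

Since the top level or two of such a tree stays cached, even the 350-million-row case from upthread costs on the order of one real I/O per lookup, which is the point being made against hash partitioning as a lookup accelerator.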
Re: [HACKERS] Hash partitioning.
Markus, There is no relation between partitions and RAID, even though both distribute data somehow. At the end of the day, when you use RAID you still have one single device with some performance limitations. When you then want to improve your data access and avoid huge indexes that you are unable to maintain, or you don't want to use an index at all (as with range partitioning by time, or hash partitioning), you are welcome to use partitions. You typically don't want to use a b-tree index when you select more than ~1-2% of your data. Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222 -Original Message- From: k...@rice.edu [mailto:k...@rice.edu] Sent: Wednesday, June 26, 2013 5:01 PM To: Markus Wanner Cc: Kevin Grittner; Claudio Freire; Robert Haas; Bruce Momjian; Yuri Levinsky; PostgreSQL-Dev Subject: Re: [HACKERS] Hash partitioning. On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote: On 06/25/2013 11:52 PM, Kevin Grittner wrote: At least until we have parallel query execution. At *that* point this all changes. Can you elaborate on that, please? I currently have a hard time imagining how partitions can help performance in that case, either. At least compared to modern RAID and read-ahead capabilities. After all, RAID can be thought of as hash partitioning with a very weird hash function. Or maybe rather range partitioning on an internal key. Put another way: ideally, the system should take care of optimally distributing data across its physical storage itself. If you need to do partitioning manually for performance reasons, that's actually a deficiency of it, not a feature. I certainly agree that manageability may be a perfectly valid reason to partition your data. Maybe there even exist other good reasons. I don't think performance optimization is one. (It's more like giving the system a hint. And we all dislike hints, don't we? *ducks*) Regards Markus Wanner Hi Markus, I think he is referring to the fact that with parallel query execution, multiple partitions can be processed simultaneously instead of serially as they are now, with the resulting speed increase. Regards, Ken This mail was received via Mail-SeCure System. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
On Wed, Jun 26, 2013 at 05:10:00PM +0300, Heikki Linnakangas wrote: In practice, there might be a lot of quirks and inefficiencies and locking contention etc. involved in various DBMS's, that you might be able to work around with hash partitioning. But from a theoretical point of view, there is no reason to expect just partitioning a table on a hash to make key-value lookups any faster. Good analysis. Has anyone benchmarked this to know our btree is efficient in this area? -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
Heikki, This is the most professional explanation I have ever seen; still, let me please disagree with the bottom line. It heavily depends on the amount of memory and the actual index sizes. I did a benchmark ~6 years ago and I won a glass of beer. Anyway, I am talking about hash partitioning as a feature, and my example comparing it with a unique b-tree index scan is a little bit extreme. In case you have 2, 4, 8 .. 1024 distinct values (not known in advance), the index might be eliminated entirely. That's the whole feature: no competition for the hash function. Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222 -Original Message- From: Heikki Linnakangas [mailto:hlinnakan...@vmware.com] Sent: Wednesday, June 26, 2013 5:10 PM To: Yuri Levinsky Cc: Tom Lane; Christopher Browne; Robert Haas; Bruce Momjian; PostgreSQL Mailing Lists Subject: Re: [HACKERS] Hash partitioning. On 26.06.2013 16:41, Yuri Levinsky wrote: Heikki, As far as I understand the height of the btree will affect the number of I/Os necessary. The height of the tree does not increase linearly with the number of records. The height of a b-tree is O(log n), where n is the number of records. Informally, if we assume that you have on average, say, 1000 keys on one b-tree page, a two-level b-tree can hold one million items, and a three level one billion items, and so on. The height of the tree affects the number of I/Os needed for searches, but keep in mind that the top levels of the tree are going to be very frequently accessed and in practice will stay permanently cached. You will only perform actual I/O on the 1-2 bottom levels of the tree (or 0 if it all fits in cache) Now let's compare that with a hash partitioned table, with 1000 partitions and a b-tree index on every partition. When doing a search, you first hash the key to look up the right partition, then you search the index of that partition. This is almost equivalent to just having a b-tree that's one level taller - instead of looking up the right partition in the hash table, you look up the right child page at the root of the b-tree. From a very coarse theoretical point of view, the only difference is that you replaced the binary search on the b-tree root page with an equivalent hash lookup. A hash lookup can be somewhat cheaper than binary search, but in practice there is very little difference. There certainly isn't any difference in the number of actual I/O performed. In practice, there might be a lot of quirks and inefficiencies and locking contention etc. involved in various DBMS's, that you might be able to work around with hash partitioning. But from a theoretical point of view, there is no reason to expect just partitioning a table on a hash to make key-value lookups any faster. - Heikki This mail was received via Mail-SeCure System. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Developer meeting photos
Hackers, at the last developer meeting we missed Oleg Bartunov. So, it's not surprising that the photos are also missing. I remember that somebody took photos, but unfortunately I don't remember who. My employer, who sponsored my attendance at PGCon, wants to publish a post about it on the website. So, it would be very nice to have a photo of the developer meeting. Does anybody have one? -- With best regards, Alexander Korotkov.
Re: [HACKERS] Hash partitioning.
Heikki Linnakangas hlinnakan...@vmware.com writes: On 26.06.2013 11:17, Yuri Levinsky wrote: When you dealing with company, which has ~350.000.000 users, and you don't want to use key/value data stores: you need hash partitioned tables and hash partitioned table clusters to perform fast search and 4-6 tables join based on user phone number for example. B-trees are surprisingly fast for key-value lookups. There is no reason to believe that a hash partitioned table would be faster for that than a plain table. Or in short: the quoted advice may very well be true for Oracle, but applying it blindly to Postgres is not a good idea. PG's performance characteristics are a lot different, especially in the area of partitioned tables. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bloom Filter lookup for hash joins
Ants Aasma a...@cybertec.at writes: On Wed, Jun 26, 2013 at 9:46 AM, Atri Sharma atri.j...@gmail.com wrote: I have been reading the current implementation of hash joins, and in ExecScanHashBucket, which I understand is the actual lookup function, we could potentially look at a bloom filter per bucket. Instead of actually looking up each hash value for the outer relation, we could just check the corresponding bloom filter for that bucket, and if we get a positive, then lookup the actual values i.e. continue with our current behaviour (since we could be looking at a false positive). The problem here is that if the hash table is in memory, doing a hash table lookup directly is likely to be cheaper than a bloom filter lookup, Yeah. Given the plan to reduce NTUP_PER_BUCKET to 1, it's hard to see how adding a Bloom filter phase could be anything except overhead. Even with the current average bucket length, it doesn't sound very promising. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On Wed, Jun 26, 2013 at 7:12 PM, Stephen Frost sfr...@snowman.net wrote: Atri, * Atri Sharma (atri.j...@gmail.com) wrote: I just popped in here on Simon's advice to put an idea I had about optimizing hash joins on this thread. I'd encourage reading the thread a bit first, in the future.. :) Yeah, I actually read a bit(admitted, not much) of the above thread. I was following it a bit as well. I suggested this up-thread already, but it's not really a bloom filter as there's only one hash function available- I can't see us requiring every data type to provide multiple hash functions. Still, I do think breaking the single 32-bit hash key space up into fixed-sized chunks and then having a bitfield array which we test against (very similar to how the visibility map works) to see if there's any chance that a given hash key exists might be valuable. The problem is that, because we don't have multiple hash functions, it's not clear how much empty space we'd actually end up with. Agreed. We could have a bitfield filter (as I described above) created for each bucket and then test against that before considering if we actually have to go look in that bucket, yes. I'm not sure if that's quite what you were thinking, but I can see how a bitfield per bucket might work. If you were suggesting something else, please clarify. Yeah, this is what I wanted. My point is that I would like to help in the implementation, if possible. :) Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
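A minimal sketch of the per-bucket bitfield being discussed might look like the following (plain C, not based on the actual nodeHash.c structures; the 64-bit filter size and the reuse of high hash bits are assumptions). Each bucket gets a small bitmap that is set at build time from the tuple's hash value and probed before walking the bucket chain:

#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS 64          /* assumed per-bucket filter size */

typedef struct BucketFilter
{
    uint64_t    bits;
} BucketFilter;

/*
 * With only one hash function available, reuse bits of the same 32-bit
 * hash: the low bits already chose the bucket, so use some higher bits
 * to pick a position in the per-bucket filter.
 */
static inline unsigned
filter_pos(uint32_t hashvalue)
{
    return (hashvalue >> 20) % FILTER_BITS;
}

/* Called for every tuple inserted into the bucket at build time. */
static inline void
filter_add(BucketFilter *f, uint32_t hashvalue)
{
    f->bits |= UINT64_C(1) << filter_pos(hashvalue);
}

/* false means "definitely not in this bucket"; true means "must probe". */
static inline bool
filter_maybe_contains(const BucketFilter *f, uint32_t hashvalue)
{
    return (f->bits & (UINT64_C(1) << filter_pos(hashvalue))) != 0;
}

Whether this beats simply probing the bucket is exactly the open question in the thread: once NTUP_PER_BUCKET drops to 1 the chain is about as cheap to walk as the filter is to test, so any win would have to come from avoiding cache misses on the tuple data itself.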
Re: [HACKERS] LATERAL quals revisited
On 06/26/2013 12:52 AM, Tom Lane wrote: Instead of setting it aside, can we (mis)use placeholder var (PHV), to ensure that the WHERE clause is evaluated below the OJ; instead of combining it with the ON clause? No, that doesn't help; it has to be part of the joinquals at the join node, or you don't get the right execution semantics. When I wrote 'below the OJ' I meant to retain something of the original semantics (just like the subquery applies the WHERE clause below the OJ). However that's probably too restrictive and your next arguments Plus you could lose some optimization opportunities, for example if we fail to see that there's a strict join clause associated with the outer join (cf lhs_strict). Worse, I think wrapping a PHV around an otherwise indexable clause would prevent using it for an indexscan. also confirm the restrictiveness. So I can forget. One more concern anyway: doesn't your proposal make subquery pull-up a little bit risky in terms of cost of the resulting plan? IMO the subquery in the original query may filter out many rows and thus decrease the number of pairs to be evaluated by the join the ON clause belongs to. If the WHERE clause moves up, then the resulting plan might be less efficient than the one we'd get if the subquery hadn't been pulled-up at all. However at the time of cost evaluation there's no way to get back (not even to notice the higher cost) because the original subquery has gone at earlier stage of the planning. Regards, Antonin Houska (Tony) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
On 06/26/2013 04:10 PM, Yuri Levinsky wrote: You typically don't want to use b-tree index when yo select more when ~1-2% of your data. Agreed. Indices on columns with very low selectivity don't perform well. (Postgres knows that and uses a sequential scan based on selectivity estimates. Being able to eliminate entire partitions from such a seq scan would certainly be beneficial, yes.) In the Postgres world, though, I think CLUSTERing might be the better approach to solve that problem. (Note: this has nothing to do with distributed systems in this case.) I'm not sure what the current status of auto clustering or optimized scans on such a permanently clustered table is, though. The minmax indices proposed for 9.4 might be another feature worth looking at. Both of these approaches may eventually provide a more general and more automatic way to speed up scans on large portions of a relation, IMO. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LATERAL quals revisited
Antonin Houska antonin.hou...@gmail.com writes: If the WHERE clause moves up, then the resulting plan might be less efficient than the one we'd get if the subquery hadn't been pulled-up at all. No, because we can push the qual back down again (using a parameterized path) if that's appropriate. The problem at this stage is to understand the semantics of the outer join correctly, not to make a choice of what the plan will be. In fact, the reason we'd not noticed this bug before is exactly that all the test cases in the regression tests do end up pushing the qual back down. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugfix and new feature for PGXS
On 06/25/2013 11:29 AM, Cédric Villemain wrote: Le mardi 25 juin 2013 17:18:51, Andrew Dunstan a écrit : On 06/24/2013 07:24 PM, Cédric Villemain wrote: Le mardi 25 juin 2013 00:18:26, Andrew Dunstan a écrit : On 06/24/2013 04:02 PM, Cédric Villemain wrote: WIth extension, we do have to set VPATH explicitely if we want to use VPATH (note that contribs/extensions must not care that postgresql has been built with or without VPATH set). My patches try to fix that. No, this is exactly what I'm objecting to. I want to be able to do: invoke_vpath_magic standard_make_commands_as_for_non_vpath Your patches have been designed to overcome your particular issues, but they don't meet the general case, IMNSHO. This is why I want to have more discussion about how vpath builds could work for PGXS modules. The patch does not restrict anything, it is not supposed to lead to regression. The assignment of VPATH and srcdir are wrong in the PGXS case, the patch correct them. You're still free to do make VPATH=/mypath ... the VPATH provided won't be erased by the pgxs.mk logic. I still think this is premature. I have spent some more time trying to make it work the way I think it should, so far without success. I think we need more discussion about how we'd like VPATH to work for PGXS would. There's no urgency about this - we've lived with the current situation for quite a while. Sure... I did a quick and dirty patch (I just validate without lot of testing), attached to this email to fix json_build (at least for make, make install) As you're constructing targets based on commands to execute in the srcdir directory, and because srcdir is only set in pgxs.mk, it is possible to define srcdir early in the json_build/Makefile and use it. This still doesn't do what I really want, which is to be able to invoke make without anything special in the invocation that triggers VPATH processing. Here's what I did that works like I think it should. First the attached patch on top of yours. Second, I did: mkdir vpath.json_build cd vpath.json_build sh/path/to/pgsource/config/prep_buildtree ../json_build . ln -s ../json_build/json_build.control . Then I created vpath.mk with these contents: ext_srcdir = ../json_build USE_VPATH = $(ext_srcdir) Finally I vpath-ized the Makefile (see attached). Given all of that, I was able to do, in the vpath directory: make make install make installcheck make clean and it all just worked, with exactly the same make invocations as work in an in-tree build. So what's lacking to make this nice is a tool to automate the second and third steps above. Maybe there are other ways of doing this, but this does what I'd like to see. cheers andrew diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk index f70d5f7..42e3654 100644 --- a/src/makefiles/pgxs.mk +++ b/src/makefiles/pgxs.mk @@ -61,18 +61,21 @@ ifdef PGXS top_builddir := $(dir $(PGXS))../.. include $(top_builddir)/src/Makefile.global -# If Makefile is not in current directory we are building the extension with -# VPATH so we set the variable here -# XXX what about top_srcdir ? -ifeq ($(CURDIR),$(dir $(firstword $(MAKEFILE_LIST top_srcdir = $(top_builddir) +# If USE_VPATH is set or Makefile is not in current directory we are building +# the extension with VPATH so we set the variable here +ifdef USE_VPATH +srcdir = $(USE_VPATH) +VPATH = $(USE_VPATH) +else +ifeq ($(CURDIR),$(dir $(firstword $(MAKEFILE_LIST srcdir = . 
VPATH = else -top_srcdir = $(top_builddir) srcdir = $(dir $(firstword $(MAKEFILE_LIST))) VPATH = $(srcdir) endif +endif # These might be set in Makefile.global, but if they were not found # during the build of PostgreSQL, supply default values so that users EXTENSION= json_build ifeq ($(wildcard vpath.mk),vpath.mk) include vpath.mk else ext_srcdir = . endif EXTVERSION = $(shell grep default_version $(EXTENSION).control | sed -e s/default_version[[:space:]]*=[[:space:]]*'\([^']*\)'/\1/) DATA = $(filter-out $(wildcard sql/*--*.sql),$(wildcard sql/*.sql)) DOCS = $(wildcard $(ext_srcdir)/doc/*.md) USE_MODULE_DB = 1 TESTS= $(wildcard $(ext_srcdir)/test/sql/*.sql) REGRESS_OPTS = --inputdir=$(ext_srcdir)/test --outputdir=test \ --load-extension=$(EXTENSION) REGRESS = $(patsubst $(ext_srcdir)/test/sql/%.sql,%,$(TESTS)) MODULE_big = $(EXTENSION) OBJS = $(patsubst $(ext_srcdir)/src/%.c,src/%.o,$(wildcard $(ext_srcdir)/src/*.c)) PG_CONFIG= pg_config all: sql/$(EXTENSION)--$(EXTVERSION).sql sql/$(EXTENSION)--$(EXTVERSION).sql: $(ext_srcdir)/sql/$(EXTENSION).sql cp $ $@ DATA_built = sql/$(EXTENSION)--$(EXTVERSION).sql DATA = $(filter-out $(ext_srcdir)/sql/$(EXTENSION)--$(EXTVERSION).sql, $(wildcard $(ext_srcdir)/sql/*--*.sql)) EXTRA_CLEAN = sql/$(EXTENSION)--$(EXTVERSION).sql PGXS := $(shell $(PG_CONFIG) --pgxs) include $(PGXS) # we put all the tests in a test subdir, but
Re: [HACKERS] Computer VARSIZE_ANY(PTR) during debugging
Amit Langote escribió: The segfault in question happens at line 1141: off = att_align_pointer(off, thisatt->attalign, -1, tp + off); char *tp; /* ptr to tuple data */ long off; /* offset in tuple data */ Disassembling seems to suggest (tp + off) is the faulting address. Apparently, the segfault happens when the 5th (text) column is being extracted from a tuple (char(n), char(n), int4, char(n), text, ...). Since tp is fixed for the whole duration of the loop and only off changes over iterations, it may have happened due to a wrong offset in this iteration. Has anything of this kind been encountered/reported before? Yes, I vaguely recall I have seen this in cases where tuples contain corrupt data. I think you just need the length word of the fourth datum to be wrong. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
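When chasing this sort of crash it helps to be able to decode a varlena length word by hand, since VARSIZE_ANY is a macro and not something a debugger can call. A hedged sketch of the same logic (this mirrors the little-endian definitions in postgres.h; on big-endian builds the bit layout differs, so treat it as illustrative only):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Decode the length of an inline varlena datum from its leading bytes,
 * the way VARSIZE_ANY does on little-endian builds: a 4-byte header has
 * the low bit clear, a 1-byte (short) header has it set, and the single
 * byte 0x01 marks an external TOAST pointer.
 */
static size_t
varsize_any_le(const unsigned char *p)
{
    uint32_t    hdr;

    if ((p[0] & 0x01) == 0x01)
    {
        if (p[0] == 0x01)
            return 0;   /* external TOAST pointer; size lives in the pointer struct */
        /* 1-byte header: total length (including the header byte) in bits 1..7 */
        return (p[0] >> 1) & 0x7F;
    }
    /* 4-byte header: total length (including the header) in bits 2..31 */
    memcpy(&hdr, p, sizeof(hdr));
    return (hdr >> 2) & 0x3FFFFFFF;
}

int
main(void)
{
    /* A short-header varlena holding 3 payload bytes: header byte 0x09. */
    unsigned char example[] = {0x09, 'a', 'b', 'c'};

    printf("decoded size = %zu\n", varsize_any_le(example));    /* prints 4 */
    return 0;
}

In a debugger the same arithmetic can be applied to the byte at tp + off, which also makes it easy to spot the kind of corrupted length word in the preceding datum that Álvaro suspects here.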
Re: [HACKERS] Hash partitioning.
On 06/26/2013 04:01 PM, k...@rice.edu wrote: I think he is referring to the fact that with parallel query execution, multiple partitions can be processed simultaneously instead of serially as they are now with the resulting speed increase. Processing simultaneously is the purpose of parallel query execution, yes. But I see no reason for that not to work equally well for unpartitioned tables. Disk I/O is already pretty well optimized and parallelized, I think. Trying to parallelize a seq scan on the Postgres side is likely to yield far inferior performance. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
On Wed, Jun 26, 2013 at 05:04:11PM +0200, Markus Wanner wrote: On 06/26/2013 04:01 PM, k...@rice.edu wrote: I think he is referring to the fact that with parallel query execution, multiple partitions can be processed simultaneously instead of serially as they are now with the resulting speed increase. Processing simultaneously is the purpose of parallel query execution, yes. But I see no reason for that not to work equally well for unpartitioned tables. Well, I think by definition you are going to be able to spread lookups for a particular range across more partitions with 'hash' than with range or list partitioning. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
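Bruce's point is easy to visualize: consecutive keys land in different hash partitions but in the same range partition, so work on a narrow key range can be spread over many partitions under hashing but funnels into one partition under range partitioning. A small illustrative C sketch (the partition count, the multiplicative hash and the one-million-keys-per-range split are all arbitrary choices):

#include <stdint.h>
#include <stdio.h>

#define PART_BITS 3             /* 2^3 = 8 hash partitions */

/* Stand-in hash; PostgreSQL would use the datatype's own hash function. */
static uint32_t
toy_hash(uint32_t key)
{
    return key * 2654435761u;   /* Knuth's multiplicative hash */
}

int
main(void)
{
    for (uint32_t key = 100; key < 108; key++)
    {
        unsigned    hash_part = toy_hash(key) >> (32 - PART_BITS);  /* top bits */
        unsigned    range_part = key / 1000000; /* one million keys per range partition */

        printf("key %u -> hash partition %u, range partition %u\n",
               key, hash_part, range_part);
    }
    return 0;
}

All eight consecutive keys map to range partition 0, while the hash scheme scatters them across the eight partitions; that is the sense in which hashing spreads lookups for a particular key range, and why parallel execution could touch many partitions at once for such a workload.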
Re: [HACKERS] Kudos for Reviewers -- straw poll
On 06/25/2013 08:26 PM, Andres Freund wrote: It's not about the reviewers being less. It's a comparison of effort. The effort for a casual review simply isn't comparable with the effort spent on developing a nontrivial patch. Remember: Debugging is twice as hard as writing the code in the first place. ... (Brian Kernighan) IMO, the kind of reviews we need are of almost debug-level difficulty. (To the point where the reviewer becomes a co-author or even takes over and submits a completely revamped patch instead.) I agree that the casual review is several levels below that, so your point holds. I doubt we need more reviews of that kind, though. Thus, I'm in the AAB camp. The remaining question being: What's the criterion for becoming a co-author (and thus getting mentioned in the release notes)? If at all, we should honor quality work with a prize. Maybe a prize for the best reviewer per CF? Maybe even based on votes from the general public on who's been the best, so reviews gain attention that way... Click here to vote for my review. ... Maybe a crazy idea. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console
On Mon, Jun 24, 2013 at 04:00:00PM +0400, Alexander Law wrote: Thanks for your work, your patch is definitely better. I agree that this approach much more generic. Committed. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
Markus Wanner mar...@bluegap.ch wrote: On 06/25/2013 11:52 PM, Kevin Grittner wrote: At least until we have parallel query execution. At *that* point this all changes. Can you elaborate on that, please? I currently have a hard time imagining how partitions can help performance in that case, either. Well, partitioning will *still* be a net loss for overall throughput on a machine with enough active connections to keep all the resources busy. Where it will help is when you have a machine with a lot of cores and a few big reporting-style queries. Since we currently can only use one core for a single query, we leave a lot of CPU time (often the bottleneck for such queries) unused. If we allow a large query to search multiple partitions in parallel, a big query can complete sooner. It will consume more resources overall in doing so, but if those resources would otherwise sit idle, it could be a big win for some use cases. Put another way: ideally, the system should take care of optimally distributing data across its physical storage itself. Agreed. That's not where I see the win. Most cases where I've seen people attempt to micro-manage object placement have performed worse than simply giving the OS a big RAID (we had 40 spindles in some of ours at Wis. Courts) and letting the filesystem figure it out. There are exceptions for special cases like WAL. I found out by accident how much that can matter. If you need to do partitioning manually for performance reasons, that's actually a deficiency of it, not a feature. +1 I certainly agree that manageability may be a perfectly valid reason to partition your data. It can definitely help there. Maybe there even exist other good reasons. I'm arguing that better concurrency can be one in the future. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
On 26.06.2013 18:34, Kevin Grittner wrote: Markus Wannermar...@bluegap.ch wrote: On 06/25/2013 11:52 PM, Kevin Grittner wrote: At least until we have parallel query execution. At *that* point this all changes. Can you elaborate on that, please? I currently have a hard time imagining how partitions can help performance in that case, either. Well, partitioning will *still* be a net loss for overall throughput on a machine with enough active connections to keep all the resources busy. Where it will help is when you have a machine with a lot of cores and a few big reporting style queries. Since we currently can only use one core for a single query, we leave a lot of CPU time (often the bottleneck for such queries) unused. If we allow a large query to search multiple partitions in parallel, a big query can complete sooner. We could also allow a large query to search a single table in parallel. A seqscan would be easy to divide into N equally-sized parts that can be scanned in parallel. It's more difficult for index scans, but even then it might be possible at least in some limited cases. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
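Carving a heap of nblocks into N roughly equal, contiguous chunks, as Heikki suggests, is a few lines of arithmetic. A sketch (the worker count and table size are made-up numbers):

#include <stdio.h>

typedef struct BlockRange
{
    unsigned    start;          /* first block, inclusive */
    unsigned    end;            /* last block, exclusive */
} BlockRange;

/* Assign worker i (0-based) its contiguous slice of the table's blocks. */
static BlockRange
seqscan_slice(unsigned nblocks, int nworkers, int i)
{
    BlockRange  r;

    r.start = (unsigned) (((unsigned long long) nblocks * i) / nworkers);
    r.end = (unsigned) (((unsigned long long) nblocks * (i + 1)) / nworkers);
    return r;
}

int
main(void)
{
    unsigned    nblocks = 100003;   /* arbitrary table size in 8kB blocks */
    int         nworkers = 4;

    for (int i = 0; i < nworkers; i++)
    {
        BlockRange  r = seqscan_slice(nblocks, nworkers, i);

        printf("worker %d scans blocks [%u, %u)\n", i, r.start, r.end);
    }
    return 0;
}

This splits the table into contiguous block ranges up front; the alternative raised in the next message is to keep a single sequential read and fan the pages or tuples out to workers afterwards.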
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
* Atri Sharma (atri.j...@gmail.com) wrote: My point is that I would like to help in the implementation, if possible. :) Feel free to go ahead and implement it.. I'm not sure when I'll have a chance to (probably not in the next week or two anyway). Unfortunately, the bigger issue here is really about testing the results and determining if it's actually faster/better with various data sets (including ones which have duplicates). I've got one test data set which has some interesting characteristics (for one thing, hashing the large side and then seq-scanning the small side is actually faster than going the other way, which is quite 'odd' imv for a hashing system): http://snowman.net/~sfrost/test_case2.sql You might also look at the other emails that I sent regarding this subject and NTUP_PER_BUCKET. Having someone confirm what I saw wrt changing that parameter would be nice and it would be a good comparison point against any kind of pre-filtering that we're doing. One thing that re-reading the bloom filter description reminded me of is that it's at least conceivable that we could take the existing hash functions for each data type and do double-hashing or perhaps seed the value to be hashed with additional data to produce an independent hash result to use. Again, a lot of things that need to be tested and measured to see if they improve overall performance. Thanks, Stephen signature.asc Description: Digital signature
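The double-hashing Stephen mentions is usually done by deriving all k Bloom-filter positions from just two base values. With only one datatype hash function available, one option (an assumption here, not something the backend provides) is to re-mix the single 32-bit hash to get the second value, in the style of Kirsch and Mitzenmacher:

#include <stdint.h>

/*
 * Cheap integer mixer (the finalizer from MurmurHash3), used to derive a
 * second, roughly independent value from the one datatype hash we have.
 */
static uint32_t
remix(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

/*
 * Kirsch-Mitzenmacher double hashing: bit i sits at h1 + i*h2 (mod nbits).
 * nbits is assumed to be a power of two.
 */
static void
bloom_add(uint64_t *filter, uint32_t nbits, uint32_t hashvalue, int k)
{
    uint32_t    h1 = hashvalue;
    uint32_t    h2 = remix(hashvalue) | 1;  /* odd stride, so every bit is reachable */

    for (int i = 0; i < k; i++)
    {
        uint32_t    bit = (h1 + (uint32_t) i * h2) & (nbits - 1);

        filter[bit / 64] |= UINT64_C(1) << (bit % 64);
    }
}

static int
bloom_maybe_contains(const uint64_t *filter, uint32_t nbits, uint32_t hashvalue, int k)
{
    uint32_t    h1 = hashvalue;
    uint32_t    h2 = remix(hashvalue) | 1;

    for (int i = 0; i < k; i++)
    {
        uint32_t    bit = (h1 + (uint32_t) i * h2) & (nbits - 1);

        if (!(filter[bit / 64] & (UINT64_C(1) << (bit % 64))))
            return 0;           /* definitely not present */
    }
    return 1;                   /* possibly present */
}

The catch, already identified upthread, is that both base values are derived from the same 32-bit hash, so two keys that collide on the datatype hash also collide in the filter; genuinely independent positions would need either a second hash support function per type or seeded hashing, which is exactly the measurement question Stephen raises.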
Re: [HACKERS] XLogInsert scaling, revisited
On 24.06.2013 21:01, Andres Freund wrote: Ok, I started to look at this: Thanks! * Could you document the way slots prevent checkpoints from occurring when XLogInsert rechecks for full page writes? I think it's correct - but not very obvious on a glance. There's this in the comment near the top of the file: * To update RedoRecPtr or fullPageWrites, one has to make sure that all * subsequent inserters see the new value. This is done by reserving all the * insertion slots before changing the value. XLogInsert reads RedoRecPtr and * fullPageWrites after grabbing a slot, so by holding all the slots * simultaneously, you can ensure that all subsequent inserts see the new * value. Those fields change very seldom, so we prefer to be fast and * non-contended when they need to be read, and slow when they're changed. Does that explain it well enough? XLogInsert holds onto a slot while it rechecks for full page writes. * The read of Insert-RedoRecPtr while rechecking whether it's out of date now is unlocked, is that correct? And why? Same here, XLogInsert holds the slot while rechecking Insert-RedoRecptr. * Have you measured whether it works to just keep as many slots as max_backends requires around and not bothering with dynamically allocating them to inserters? That seems to require to keep actually waiting slots in a separate list which very well might make that too expensive. Correctness wise the biggest problem for that probably is exlusive acquiration of all slots CreateCheckpoint() does... Hmm. It wouldn't be much different, each backend would still need to reserve its own dedicated slot, because it might be held by the a backend that grabbed all the slots. Also, bgwriter and checkpointer need to flush the WAL, so they'd need slots too. More slots make WaitXLogInsertionsToFinish() more expensive, as it has to loop through all of them. IIRC some earlier pgbench tests I ran didn't show much difference in performance, whether there were 40 slots or 100, as long as there was enough of them. I can run some more tests with even more slots to see how it behaves. * What about using some sort of linked list of unused slots for WALInsertSlotAcquireOne? Everytime we're done we put it to the *end* of the list so it's less likely to have been grabbed by somebody else so we can reuse it. a) To grab a new one go to the head of the linked list spinlock it and recheck whether it's still free. If not, restart. Otherwise, remove from list. b) To reuse a previously used slot That way we only have to do the PGSemaphoreLock() dance if there really aren't any free slots. That adds a spinlock acquisition / release into the critical path, to protect the linked list. I'm trying very hard to avoid serialization points like that. While profiling the tests I've done, finding a free slot hasn't shown up much. So I don't think it's a problem performance-wise as it is, and I don't think it would make the code simpler. * The queuing logic around slots seems to lack documentation. It's complex enough to warrant that imo. Yeah, it's complex. I expanded the comments above XLogInsertSlot, to explain how xlogInsertingAt works. Did that help, or was it some other part of that that needs more docs? * Not a big fan of the holdingAll variable name, for a file-global one that seems a bit too generic. Renamed to holdingAllSlots. * Could you add a #define or comment for the 64 used in XLogInsertSlotPadded? Not everyone might recognize that immediately as the most common cacheline size. 
Commenting on the reason we pad in general would be a good idea as well. Copy-pasted and edited the explanation from LWLockPadded for that. I also changed the way it's allocated so that it's aligned at 64-byte boundary, like we do for lwlocks. * Is it correct that WALInsertSlotAcquireOne() resets xlogInsertingAt of all slots *before* it has acquired locks in all of them? If yes, why haven't we already reset it to something invalid? I didn't understand this part. Can you elaborate? * Is GetXLogBuffer()'s unlocked read correct without a read barrier? Hmm. I thought that it was safe because GetXLogBuffer() handles the case that you get a torn read of the 64-bit XLogRecPtr value. But now that I think of it, I wonder if it's possible for reads/writes to be reordered so that AdvanceXLInsertBuffer() overwrites WAL data that another backend has copied onto a page. Something like this: 1. Backend B zeroes a new WAL page, and sets its address in xlblocks 2. Backend A calls GetXLogBuffer(), and gets a pointer to that page 3. Backend A copies the WAL data to the page. There is no lock acquisition in backend A during those steps, so I think in theory the writes from step 3 might be reordered to happen before step 1, so that that step 1 overwrites the WAL data written in step 3. It sounds crazy, but after reading
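For readers following along, the padding under discussion is the usual trick of rounding each hot, per-backend slot up to a full cache line so that two slots never share one. A generic sketch in the style of LWLockPadded (64 bytes is the assumed line size, and the field names here are invented, not taken from the patch):

#define CACHE_LINE_SIZE 64      /* most common cache line size */

typedef struct XLogInsertSlotLike
{
    unsigned long xlogInsertingAt;  /* how far this backend has copied WAL */
    int         owner;              /* illustrative bookkeeping fields */
    int         waiters;
} XLogInsertSlotLike;

/*
 * Pad each slot so that slot i and slot i+1 never share a cache line,
 * avoiding false sharing when different backends update their own slots.
 */
typedef union XLogInsertSlotPaddedLike
{
    XLogInsertSlotLike slot;
    char        pad[CACHE_LINE_SIZE];
} XLogInsertSlotPaddedLike;

/* Compile-time check that the slot still fits inside one line. */
typedef char slot_fits_in_cache_line[sizeof(XLogInsertSlotLike) <= CACHE_LINE_SIZE ? 1 : -1];

Padding the element type is only half of it; as noted above, the slot array itself also has to be allocated on a 64-byte boundary, or every slot ends up straddling two lines anyway.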
Re: [HACKERS] Hash partitioning.
On 06/26/2013 05:46 PM, Heikki Linnakangas wrote: We could also allow a large query to search a single table in parallel. A seqscan would be easy to divide into N equally-sized parts that can be scanned in parallel. It's more difficult for index scans, but even then it might be possible at least in some limited cases. So far reading sequentially is still faster than hopping between different locations. Purely from the I/O perspective, that is. For queries where the single CPU core turns into a bottle-neck and which we want to parallelize, we should ideally still do a normal, fully sequential scan and only fan out after the scan and distribute the incoming pages (or even tuples) to the multiple cores to process. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On Wed, Jun 26, 2013 at 9:20 PM, Stephen Frost sfr...@snowman.net wrote: * Atri Sharma (atri.j...@gmail.com) wrote: My point is that I would like to help in the implementation, if possible. :) Feel free to go ahead and implement it.. I'm not sure when I'll have a chance to (probably not in the next week or two anyway). Unfortunately, the bigger issue here is really about testing the results and determining if it's actually faster/better with various data sets (including ones which have duplicates). I've got one test data set which has some interesting characteristics (for one thing, hashing the large side and then seq-scanning the small side is actually faster than going the other way, which is quite 'odd' imv for a hashing system): http://snowman.net/~sfrost/test_case2.sql You might also look at the other emails that I sent regarding this subject and NTUP_PER_BUCKET. Having someone confirm what I saw wrt changing that parameter would be nice and it would be a good comparison point against any kind of pre-filtering that we're doing. One thing that re-reading the bloom filter description reminded me of is that it's at least conceivable that we could take the existing hash functions for each data type and do double-hashing or perhaps seed the value to be hashed with additional data to produce an independent hash result to use. Again, a lot of things that need to be tested and measured to see if they improve overall performance. Right, let me look. Although I am pretty busy at the moment with ordered set functions, so I will get it done maybe in the last week of this month. Another thing I believe is that we should have multiple hash functions for the bloom filters, generating different values so that the coverage is good. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn
Andres Freund and...@2ndquadrant.com writes: On 2013-06-17 16:16:22 +0200, Andres Freund wrote: Not accepting EWOULDBLOCK in the above if() results in: could not connect to server: Resource temporarily unavailable Is the server running locally and accepting connections on Unix domain socket /tmp/.s.PGSQL.5440? which makes more sense. Could I convince a committer to NACK or commit backpatch that patch? Some trawling in the commit history shows that the current logic dates from my commit 6d0d838cebdf2bcd5c03f5449a1888f1e120496f, which unified Windows and non-Windows code paths; the check for EWOULDBLOCK was added in the earlier commit ca5a51627919c6fb6ab5e23739615a674caa4037 which (claimed to) add support for non-blocking connect on Windows. So I'm concerned that your patch could break that platform. A possible compromise is { if (SOCK_ERRNO == EINPROGRESS || +#ifndef WIN32 SOCK_ERRNO == EWOULDBLOCK || +#endif SOCK_ERRNO == EINTR || SOCK_ERRNO == 0) but I wonder whether it's safe to remove the case altogether. Could anyone research the situation for non-blocking connect() on Windows? Perhaps on Windows we shouldn't test for EINPROGRESS at all? I'm also pretty suspicious of the SOCK_ERRNO == 0 case, now that I look at it, especially in view of the lack of any attempt to set errno to 0 before the call. The Single Unix Spec is pretty clear that EINTR and EINPROGRESS are the only legit cases, so unless Windows is doing its own thing, we could probably get rid of that too. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Computer VARSIZE_ANY(PTR) during debugging
On Thu, Jun 27, 2013 at 12:02 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Amit Langote escribió: The segfault in question happens at line 1141: off = att_align_pointer(off, thisatt-attalign, -1, tp + off); char *tp; /* ptr to tuple data */ longoff;/* offset in tuple data */ Disassembling seems to suggest (tp + off) is the faulting address. Apparently, the segfault happens when 5th text column is being extracted from a tuple (char(n), char(n), int4, char(n), text, ...). Since, tp is fixed for the whole duration of loop and only off is subject to change over iterations, it may have happened due to wrong offset in this iteration. Has anything of this kind been encountered/reported before? Yes, I vaguely recall I have seen this in cases where tuples contain corrupt data. I think you just need the length word of the fourth datum to be wrong. The query in question is: select col1, col2, col4, octet_length(col5) from table where octet_length(col5) 800; In case of corrupt data, even select * from table should give segfault, shouldn't it? -- Amit Langote -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Kudos for Reviewers -- straw poll
Josh Berkus j...@agliodbs.com writes: Well, one of the other prizes which occurred to me today would be a pgCon lottery. That is, each review posted by a non-committer would go in a hat, and in February we would draw one who would get a free registration and airfare to pgCon. +1, I like that idea! Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add more regression tests for dbcommands
Andres Freund and...@2ndquadrant.com writes: I am generally a bit unsure whether the regression tests you propose aren't a bit too verbose. Does any of the committers have an opinion about this? My feeling is that they are ok if they aren't slowing down things much. Yeah, I'm concerned about speed too. If the regression tests take a long time it makes it harder for developers to run them. (I like to point at mysql's regression tests, which take well over an hour even on fast machines. If lots of tests are so helpful, why is their bug rate no better than ours?) I was intending to suggest that much of what Robins has submitted doesn't belong in the core regression tests, but could usefully be put into an optional set of big regression tests. We already have a numeric_big test in that spirit. What seems to be lacking is an organizational principle for this (should the big tests live with the regular ones, or in a separate directory?) and a convenient make target for running the big ones. Once we had a setup like that, we could get some or all of the buildfarm machines to run the big tests, but we wouldn't be penalizing development work. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fixing pg_ctl with relative paths
On Wed, Jun 26, 2013 at 2:36 PM, Hari Babu haribabu.ko...@huawei.com wrote: On June 26, 2013 5:02 AM Josh Kupershmidt wrote: Thanks for the feedback. Attached is a rebased version of the patch with the two small issues noted fixed. The following description in the document of pg_ctl needs to be modified? restart might fail if relative paths specified were specified on the command-line during server start. +#define DATADIR_SPEC \-D\ \ + + datadir = strstr(post_opts, DATADIR_SPEC); Though this is a corner case, the patch doesn't seem to handle properly the case where -D appears as other option value, e.g., -k option value, in postmaster.opts file. Just idea to work around that problem is to just append the specified -D option and value to post_opts. IOW, -D option and value appear twice in post_opts. In this case, posteriorly-located ones are used in the end. Thought? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages
On Wed, Jun 26, 2013 at 8:57 PM, Amit Kapila amit.kap...@huawei.com wrote: On Wednesday, June 26, 2013 4:40 PM Andres Freund wrote: Hi Amit, On 2013-06-26 16:22:28 +0530, Amit Kapila wrote: On Wednesday, June 26, 2013 1:20 PM Andres Freund wrote: On 2013-06-26 08:50:27 +0530, Amit Kapila wrote: On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote: On 2013-06-16 17:19:49 -0700, Josh Berkus wrote: Amit posted a new version of this patch on January 23rd. But last comment on it by Tom is not sure everyone wants this. https://commitfest.postgresql.org/action/patch_view?id=905 ... so, what's the status of this patch? That comment was referencing a mail of mine - so perhaps I better explain: I think the usecase for this utility isn't big enough to be included in postgres since it really can only help in a very limited circumstances. And I think it's too likely to be misused for stuff it's not useable for (e.g. remastering). The only scenario I see is that somebody deleted/corrupted pg_controldata. In that scenario the tool is supposed to be used to find the biggest lsn used so far so the user then can use pg_resetxlog to set that as the wal starting point. But that can be way much easier solved by just setting the LSN to something very, very high. The database cannot be used for anything reliable afterwards anyway. One of the main reason this was written was to make server up in case of corruption and user can take dump of some useful information if any. By setting LSN very, very high user might loose the information which he wants to take dump. Which information would that loose? Information from WAL replay which can be more appropriate by selecting LSN. Sorry, I can't follow. If wal replay still is an option you can just look at the WAL and get a sensible value way easier. Originally 2 parts were proposed, one was to get LSN from data pages and other from data pages. Original proposal is: http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382851FFA 1@szxeml509-mbs The second part for looking into WAL was written but due to xlogreader patch, it was postponed and I didn't get time after that to pursue it. The whole tool seems to only make sense if you've lost pg_xlog. The tool's initial intent was if pg_controldata is lost and this idea is originated in below mail chain: http://www.postgresql.org/message-id/4274.1340084...@sss.pgh.pa.us Also for a developer, guessing very high LSN might be easy, but for users it might not be equally easy, and getting such value by utility would be comfortable. Well, then we can just document some very high lsn and be done with it. Like CF00/. That would leave enough space for eventual writes caused while dumping the database (say hint bit writes in a checksummed database) and cannot yet be realistically be reached during normal operation. Can we be ultra sure, that this LSN is not reached. I think it will take vary long to reach such LSN, but still theoretically it can be possible. I don't have any evidence. One more use case for which this utility was done is as below: It will be used to decide that on new-standby (old-master) whether a full backup is needed from New-master(old-standby). The backup is required when the data page in old-master precedes the last applied LSN in old-standby (i.e., new-master) at the moment of the failover. That's exactly what I was afraid of. Unless I miss something the tool is *NOT* sufficient to do this. 
You mean to say if user knows the max LSN of data pages in old-master and last applied LSN in new master, he cannot decide whether Full backup is needed? It should be straightforward decision that skip a backup if that old-master LSN is less than the new-master (i.e., last applied LSN, IOW, timeline switch LSN). It was proposed as a usecase in this below mail: http://www.postgresql.org/message-id/CAHGQGwHyd1fY0hF0qKh0-uKDh-gcbYxMOFBYVk kh4jzji-f...@mail.gmail.com I guess he meant the commit hint bit problem. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
* Atri Sharma (atri.j...@gmail.com) wrote: Right, let me look.Although, I am pretty busy atm with ordered set functions, so will get it done maybe last week of this month. Isn't it currently the last week of this month? :) I'm guessing you mean July. Another thing I believe in is that we should have multiple hashing functions for bloom filters, which generate different probability values so that the coverage is good. I really don't see that happening, to be honest.. I think it would be interesting to try some of the surrogate-additional-hashing that I described. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
Isn't it currently the last week of this month? :) I'm guessing you mean July. Heh, no. I lose track of time these days. Alright, second week of July then. I really don't see that happening, to be honest.. I think it would be interesting to try some of the surrogate-additional-hashing that I described. Yes, I agree with that. Let me think more and get back on this. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement
Hello updated patch with some basic doc Regards Pavel 2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com: On Wed, Jun 26, 2013 at 5:11 PM, Pavel Stehule pavel.steh...@gmail.com wrote: 2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com: On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com: On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello This is fragment ofmy old code from orafce package - it is functional, but it is written little bit more generic due impossible access to static variables (in elog.c) static char* dbms_utility_format_call_stack(char mode) { MemoryContext oldcontext = CurrentMemoryContext; ErrorData *edata; ErrorContextCallback *econtext; StringInfo sinfo; #if PG_VERSION_NUM = 80400 errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO, TEXTDOMAIN); #else errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO); #endif MemoryContextSwitchTo(oldcontext); for (econtext = error_context_stack; econtext != NULL; econtext = econtext-previous) (*econtext-callback) (econtext-arg); edata = CopyErrorData(); FlushErrorState(); https://github.com/orafce/orafce/blob/master/utility.c 2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com: Hi, Use of this feature is to get call stack for debugging without raising exception. This definitely seems very useful. Here are my comments about the submitted patch: Patch gets applied cleanly (there are couple of white space warning but that's into test case file). make and make install too went smooth. make check was smooth too. Did some manual testing about feature and its went smooth. However would like to share couple of points: My main motive is concentration to stack info string only. So I didn't use err_start function, and I didn't use CopyErrorData too. These routines does lot of other work. 1) I noticed that you did the initialization of edata manually, just wondering can't we directly use CopyErrorData() which will do that job ? yes, we can, but in this context on context field is interesting for us. I found that inside CopyErrorData() there is Assert: Assert(CurrentMemoryContext != ErrorContext); And in the patch we are setting using ErrorContext as short living context, is it the reason of not using CopyErrorData() ? it was not a primary reason, but sure - this routine is not designed for direct using from elog module. Next, we can save one palloc call. Hmm ok. 2) To reset stack to empty, FlushErrorState() can be used. yes, it can be. I use explicit rows due different semantics, although it does same things. FlushErrorState some global ErrorState and it is used from other modules. I did a reset of ErrorContext. I fill a small difference there - so FlushErrorState is not best name for my purpose. What about modification: static void resetErrorStack() { errordata_stack_depth = -1; recursion_depth = 0; /* Delete all data in ErrorContext */ MemoryContextResetAndDeleteChildren(ErrorContext); } and then call in InvokeErrorCallbacks -- resetErrorStack() and rewrote flushErrorState like void FlushErrorState(void) { /* reset ErrorStack is enough */ resetErrorStack(); } ??? Nope. rather then that I would still prefer direct call of FlushErrorState(). but I can live well with direct call of FlushErrorState() - it is only minor change. 3) I was think about the usability of the feature and wondering how about if we add some header to the stack, so user can differentiate between STACK and normal NOTICE message ? 
Something like: postgres=# select outer_outer_func(10); NOTICE: - Call Stack - PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS PL/pgSQL function outer_func(integer) line 3 at RETURN PL/pgSQL function outer_outer_func(integer) line 3 at RETURN CONTEXT: PL/pgSQL function outer_func(integer) line 3 at RETURN PL/pgSQL function outer_outer_func(integer) line 3 at RETURN outer_outer_func -- 20 (1 row) I understand, but I don't think so it is good idea. Sometimes, you would to use context for different purposes than debug log - for example - you should to identify top most call or near call - and any additional lines means some little bit more difficult parsing. You can add this line simply (if you want)
Re: [HACKERS] Kudos for Reviewers -- straw poll
On Wed, Jun 26, 2013 at 10:25 AM, Andrew Dunstan and...@dunslane.net wrote: On 06/26/2013 09:14 AM, Bruce Momjian wrote: On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote: On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote: How should reviewers get credited in the release notes? a) not at all b) in a single block titled Reviewers for this version at the bottom. c) on the patch they reviewed, for each patch A weak preference for (c), with (b) running a close second. As others have suggested, a review that leads to significant commitable changes to the patch should bump the credit to co-authorship. As a reminder, I tried a variant of C for 9.2 beta release notes, and got lots of complaints, particularly because the line describing the feature now had many more names on it. In my opinion, adding reviewer names to each feature item might result in the removal of all names from features. A poll is nice for gauging interest, but many people who vote don't understand the ramifications of what they are voting on. That's why I voted for b :-) Yeah, with that in mind, I'd also switch to b. I wouldn't complain, but if it's been tried and failed... what can I say? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn
On 2013-06-26 12:07:54 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2013-06-17 16:16:22 +0200, Andres Freund wrote: Not accepting EWOULDBLOCK in the above if() results in: could not connect to server: Resource temporarily unavailable Is the server running locally and accepting connections on Unix domain socket /tmp/.s.PGSQL.5440? which makes more sense. Could I convince a committer to NACK or commit backpatch that patch? Some trawling in the commit history shows that the current logic dates from my commit 6d0d838cebdf2bcd5c03f5449a1888f1e120496f, which unified Windows and non-Windows code paths; the check for EWOULDBLOCK was added in the earlier commit ca5a51627919c6fb6ab5e23739615a674caa4037 which (claimed to) add support for non-blocking connect on Windows. So I'm concerned that your patch could break that platform. A possible compromise is { if (SOCK_ERRNO == EINPROGRESS || +#ifndef WIN32 SOCK_ERRNO == EWOULDBLOCK || +#endif SOCK_ERRNO == EINTR || SOCK_ERRNO == 0) but I wonder whether it's safe to remove the case altogether. Could anyone research the situation for non-blocking connect() on Windows? Perhaps on Windows we shouldn't test for EINPROGRESS at all? The way EWOULDBLOCK is mentioned on msdn (http://msdn.microsoft.com/en-us/library/windows/desktop/ms737625%28v=vs.85%29.aspx) it indeed seems to be required. I don't see how we could trigger the conditions for EINPROGRESS on windows that msdn lists, but since we need it on unixoid systems and its valid to treat the connect as partiall successfull on windows, there seems little benefit in dropping it. So I guess the #ifdef you suggested is the way to go. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Kudos for Reviewers -- straw poll
On Wed, 26 Jun 2013 09:14:07 -0400 Bruce Momjian br...@momjian.us wrote: On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote: On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote: How should reviewers get credited in the release notes? a) not at all b) in a single block titled Reviewers for this version at the bottom. c) on the patch they reviewed, for each patch A weak preference for (c), with (b) running a close second. As others have suggested, a review that leads to significant commitable changes to the patch should bump the credit to co-authorship. As a reminder, I tried a variant of C for 9.2 beta release notes, and got lots of complaints, particularly because the line describing the feature now had many more names on it. I am just someone who thinks he can maybe review things... I am not voting, OK, but I have a comment about your last email... If people (and by people I am not talking about myself but about regular committers and reviewers) think that option c is good, then I think we should change the tool (or the way) the release notes are done. I mean (and excuse my poor English): if people think that is the way to go, we should make the tools good enough for what people want, not change what people want because the tools are not good enough. In my opinion, adding reviewer names to each feature item might result in the removal of all names from features. Let's fix the way the release notes are done. A poll is nice for gauging interest, but many people who vote don't understand the ramifications of what they are voting on. I agree, but cost-benefit is what we should look at here, not just cost. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Kudos for Reviewers -- straw poll
On Wed, Jun 26, 2013 at 03:12:00PM -0300, Rodrigo Gonzalez wrote: On Wed, 26 Jun 2013 09:14:07 -0400 Bruce Momjian br...@momjian.us wrote: On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote: On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote: How should reviewers get credited in the release notes? a) not at all b) in a single block titled Reviewers for this version at the bottom. c) on the patch they reviewed, for each patch A weak preference for (c), with (b) running a close second. As others have suggested, a review that leads to significant committable changes to the patch should bump the credit to co-authorship. As a reminder, I tried a variant of C for 9.2 beta release notes, and got lots of complaints, particularly because the line describing the feature now had many more names on it. I am just someone who thinks he might be able to review things... I am not casting a vote, but I have a comment about your last email. If people (and by people I mean the regular committers and reviewers, not myself) think that option (c) is good, then I think we should change the tool (or the way) the release notes are produced. I mean (and excuse my poor English): if people think that is the way to go, we should make the tools good enough for what people want, not change what people want because the tools are not good enough. Production of the release notes was not the problem; it was the text in the release notes. I don't see how we could modify the release note format. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash partitioning.
On Wed, Jun 26, 2013 at 11:14 AM, Bruce Momjian br...@momjian.us wrote: On Wed, Jun 26, 2013 at 05:10:00PM +0300, Heikki Linnakangas wrote: In practice, there might be a lot of quirks and inefficiencies and locking contention etc. involved in various DBMSs that you might be able to work around with hash partitioning. But from a theoretical point of view, there is no reason to expect just partitioning a table on a hash to make key-value lookups any faster. Good analysis. Has anyone benchmarked this to confirm our btree is efficient in this area? Yep. I had at one point, and came to the same conclusion. I ended up building a few partial indices, and have been happy ever since. Granted, my DB isn't that big, just around 200G. No, I don't have the benchmark results. It's been a while. Back then, it was 8.3, so I did the partitioning in the application. It still wasn't worth it. Now I just have two indices. One indexes only hot tuples; it's very heavily queried and works blazingly fast. The other indexes by (hotness, key). I include the hotness value in the query, and it still works quite fast. Luckily, rows only become cold via an update that marks them cold, so I can maintain that flag. I included hotness in the index to cluster updates on the hot part of the index, but I could have just used a regular index and paid a small price on the updates. Indeed, for a while it worked without the hotness, and there was no significant trouble. I later found out that WAL bandwidth decreased noticeably when I added that hotness column, so I kept it; it helps a bit with replication. It has worked ever since. Today, I only use partitioning to split conceptually different tables, so I can have different schemas for each table (and normalize with a view). Now it's 9.2, so the view works quite nicely and transparently. I have yet to find a use for hash partitioning. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
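To make the hot-tuple scheme above concrete, a minimal SQL sketch might look like the following; the events table, its columns, and the index names are all invented for the illustration, since the actual schema is not shown in the thread.

-- Hypothetical key/value table; a flag records whether a row is still "hot".
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    key     bigint  NOT NULL,
    hot     boolean NOT NULL DEFAULT true,
    payload text
);

-- Small partial index covering only the heavily queried hot rows.
CREATE INDEX events_hot_key_idx ON events (key) WHERE hot;

-- Wider index led by the hotness flag, so updates cluster on the hot part.
CREATE INDEX events_hotness_key_idx ON events (hot, key);

-- Queries repeat the predicate so the planner can choose the partial index.
SELECT payload FROM events WHERE hot AND key = 42;

-- Rows are "cooled" by an explicit update, which keeps the partial index small.
UPDATE events SET hot = false WHERE key = 42;

The point is only that a small, cache-resident partial index over the hot subset can serve most lookups, which is the effect described above, without any hash partitioning.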
Re: [HACKERS] Kudos for Reviewers -- straw poll
On Wed, 26 Jun 2013 14:13:32 -0400 Bruce Momjian br...@momjian.us wrote: On Wed, Jun 26, 2013 at 03:12:00PM -0300, Rodrigo Gonzalez wrote: On Wed, 26 Jun 2013 09:14:07 -0400 Bruce Momjian br...@momjian.us wrote: On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote: On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote: How should reviewers get credited in the release notes? a) not at all b) in a single block titled Reviewers for this version at the bottom. c) on the patch they reviewed, for each patch A weak preference for (c), with (b) running a close second. As others have suggested, a review that leads to significant committable changes to the patch should bump the credit to co-authorship. As a reminder, I tried a variant of C for 9.2 beta release notes, and got lots of complaints, particularly because the line describing the feature now had many more names on it. I am just someone who thinks he might be able to review things... I am not casting a vote, but I have a comment about your last email. If people (and by people I mean the regular committers and reviewers, not myself) think that option (c) is good, then I think we should change the tool (or the way) the release notes are produced. I mean (and excuse my poor English): if people think that is the way to go, we should make the tools good enough for what people want, not change what people want because the tools are not good enough. Production of the release notes was not the problem; it was the text in the release notes. I don't see how we could modify the release note format. Well... checking the release notes for 9.2.4, you have: Fix insecure parsing of server command-line switches (Mitsumasa Kondo, Kyotaro Horiguchi) What about (if people think it is a good idea) a second set of parentheses with reviewed by so-and-so? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bloom Filter lookup for hash joins
On Wed, Jun 26, 2013 at 5:01 AM, Atri Sharma atri.j...@gmail.com wrote: The problem here is that if the hash table is in memory, doing a hash table lookup directly is likely to be cheaper than a bloom filter lookup, even if the bloom filter fits into the processor cache and the hash table doesn't (10 last-level cache hits are slower than one cache miss). A Bloom filter will help when it's not feasible to use an actual hash table (lots of large keys), when the real lookup is slow (e.g. an index lookup), when you are doing enough lookups to amortize the construction cost, and when the condition is expected to be selective (most lookups give a negative). There might be some data-warehouse types of queries where it would help, but it seems like an awfully narrow use case with a potential for performance regressions when the planner has a bad estimate. OK, sounds good. Can't we use bloom filters for the case where the hash table doesn't fit in memory? Specifically, when reading from disk is inevitable for accessing the hash table, we can use bloom filters for deciding which keys to actually read from disk. I don't think that sounds all that promising. When the hash table does not fit in memory, it is either partitioned into multiple passes, each of which does fit in memory, or it chooses a different plan altogether. Do we know under what conditions a Bloom filter would be superior to those options, and could we reliably detect those conditions? Cheers, Jeff
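The multi-batch behaviour described above is easy to observe directly. A throwaway example along these lines should show it; the big_outer/big_inner tables are invented for the illustration, and the exact batch count will vary with the data and settings.

-- Two scratch tables large enough that the inner hash exceeds work_mem.
CREATE TABLE big_outer AS
    SELECT g AS id, md5(g::text) AS val FROM generate_series(1, 1000000) g;
CREATE TABLE big_inner AS
    SELECT g AS id, md5(g::text) AS val FROM generate_series(1, 1000000) g;
ANALYZE big_outer;
ANALYZE big_inner;

SET work_mem = '1MB';   -- force the hash to spill into multiple batches

EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM big_outer o JOIN big_inner i USING (id);
-- The Hash node should report something like "Batches: 64", i.e. the inner
-- relation was split into passes that each fit within work_mem.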
Re: [HACKERS] MD5 aggregate
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: I've been playing around with the idea of an aggregate that computes the sum of the md5 hashes of each of its inputs, which I've called md5_total() for now, although I'm not particularly wedded to that name. Comparing it with md5_agg() on a 100M row table (see attached test script) produces interesting results: SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo; 50bc42127fb9b028c9708248f835ed8f Time: 92960.021 ms SELECT md5_total(foo.*::text) FROM foo; 02faea7fafee4d253fc94cfae031afc43c03479c Time: 96190.343 ms Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its result is longer) but it seems like it would be very useful for quickly comparing data in SQL, since its value is not dependent on the row order, making it easier to use and better performing if there is no usable index for ordering. I took a look at this patch. First, as Peter suggested upthread, it should have been two patches: one to optimize calculateDigestFromBuffer() and callees, another atop the first adding new SQL functions and adjusting the infrastructure to meet their needs. + <primary>md5_total</primary> +</indexterm> +<function>md5_total(<replaceable class="parameter">expression</replaceable>)</function> + </entry> + <entry><type>text</type> or <type>bytea</type></entry> + <entry><type>text</type></entry> + <entry> +sum of the MD5 hash values of the input values, independent of their +order This is not specific enough. We would need to provide either an algorithm specification or a literature reference. However, that just leads to the problem that we should not put a literature-unrecognized cryptographic algorithm into core PostgreSQL. I realize that the use case inspiring this patch wasn't concerned with cryptographic properties, but the association with md5 immediately makes it a cryptography offering. md5_agg() is well-defined and not cryptographically novel, and your use case is credible. However, not every useful-sometimes function belongs in core; we mostly stick to functions with ubiquitous value and functions that would be onerous to implement externally. (We do carry legacy stuff that wouldn't make the cut today.) In my estimation, md5_agg() does not make that cut. The variety of intuitive md5_agg() definitions attested upthread doesn't bode well for its broad applicability. It could just as well live in an extension published on PGXN. Mine is just one opinion, though; I won't be horrified if the community wants an md5_agg() in core. *** a/src/backend/libpq/md5.c --- b/src/backend/libpq/md5.c *** *** 1,14 /* * md5.c * ! * Implements the MD5 Message-Digest Algorithm as specified in ! * RFC 1321. This implementation is a simple one, in that it ! * needs every input byte to be buffered before doing any ! * calculations. I do not expect this file to be used for ! * general purpose MD5'ing of large amounts of data, only for ! * generating hashed passwords from limited input. In other words, performance wasn't a design criterion. I wonder if we would be wasting our time with a narrow performance change like removing the data copy. What's your opinion of copying pgcrypto's md5 implementation, which already avoids the data copy? Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
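As an aside, an order-independent digest can be approximated today with the existing md5() function and ordinary aggregates. The sketches below assume the same test table foo used above; they are only illustrations (this is not the md5_total() algorithm, and the second variant has weaker collision behaviour than a real digest).

-- Order-independent digest: hash the sorted per-row hashes, so the result
-- does not depend on physical row order (at the cost of a sort).
SELECT md5(string_agg(row_hash, '' ORDER BY row_hash)) AS table_digest
FROM (SELECT md5(foo::text) AS row_hash FROM foo) s;

-- Cheaper variant: sum a 64-bit slice of each row hash.  No sort is needed,
-- but collisions are easier to construct than with a real digest.
SELECT sum(('x' || substr(md5(foo::text), 1, 16))::bit(64)::bigint) AS table_checksum
FROM foo;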
Re: [HACKERS] Kudos for Reviewers -- straw poll
On Wed, Jun 26, 2013 at 03:22:06PM -0300, Rodrigo Gonzalez wrote: On Wed, 26 Jun 2013 14:13:32 -0400 Production of the release notes was not the problem; it was the text in the release notes. I don't see how we could modify the release note format. Well... checking the release notes for 9.2.4, you have: Fix insecure parsing of server command-line switches (Mitsumasa Kondo, Kyotaro Horiguchi) What about (if people think it is a good idea) a second set of parentheses with reviewed by so-and-so? That's what we had, and people didn't like it. If we overload that list of names, we might find we want to remove all the names. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bloom Filter lookup for hash joins
On Thu, Jun 27, 2013 at 12:01 AM, Jeff Janes jeff.ja...@gmail.com wrote: I don't think that sounds all that promising. When the hash table does not fit in memory, it is either partitioned into multiple passes, each of which does fit in memory, or it chooses a different plan altogether. Yeah, my point is, we could potentially avoid bringing in all of the blocks for the hash table. Even if they are processed in parts, we still end up reading all of them. Why not do a local bloom filter lookup, and never bring in the tuples for which the bloom filter gives a negative? This could save us some reads anyhow. Do we know under what conditions a Bloom filter would be superior to those options, and could we reliably detect those conditions? Yes, this needs to be researched. Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add more regression tests for CREATE OPERATOR
On 06/26/2013 12:29 AM, Szymon Guz wrote: OK, so I think this patch can be committed, I will change the status. Can we have a full review before you mark it ready for committer? How did you test it? What kinds of review have you done? The committer can't know whether it's ready or not if he doesn't have a full report from you. Thanks! -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add more regression tests for CREATE OPERATOR
On 26 June 2013 20:55, Josh Berkus j...@agliodbs.com wrote: On 06/26/2013 12:29 AM, Szymon Guz wrote: OK, so I think this patch can be committed, I will change the status. Can we have a full review before you mark it ready for committer? How did you test it? What kinds of review have you done? The committer can't know whether it's ready or not if he doesn't have a full report from you. Thanks! Hi Josh, I will add more detailed descriptions to all patches I set as ready for committer. Szymon
Re: [HACKERS] Kudos for Reviewers -- straw poll
Bruce Momjian wrote: On Wed, Jun 26, 2013 at 03:22:06PM -0300, Rodrigo Gonzalez wrote: Checking the release notes for 9.2.4, you have: Fix insecure parsing of server command-line switches (Mitsumasa Kondo, Kyotaro Horiguchi) What about (if people think it is a good idea) a second set of parentheses with reviewed by so-and-so? That's what we had, and people didn't like it. If we overload that list of names, we might find we want to remove all the names. Yeah, it becomes too long. (For security patches, in particular, it's probably not wise to list reviewers; there might well be reviewers whose input you never see because it happened on the closed security list.) See the entry for foreign key locks: Prevent non-key-field row updates from locking foreign key rows (Álvaro Herrera, Noah Misch, Andres Freund, Alexander Shulgin, Marti Raudsepp) I am the author of most of the code, yet I chose to add Noah and Andres because they contributed a huge amount of time to reviewing the patch, and Alex and Marti because they submitted some code. They are all listed as coauthors, which seems a reasonable compromise to me. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers