Re: [GENERAL] free RAM not being used for page cache
This is a super-interesting topic, thanks for all the info.

On Thu, Sep 4, 2014 at 7:44 AM, Shaun Thomas <stho...@optionshouse.com> wrote:
> Check /proc/meminfo for a better breakdown of how the memory is being
> used. This should work:
>
>     grep -A1 Active /proc/meminfo
>
> I suspect your inactive file cache is larger than the active set,
> suggesting an overly aggressive memory manager.

$ grep -A1 Active /proc/meminfo
Active:          34393512 kB
Inactive:        20765832 kB
Active(anon):    13761028 kB
Inactive(anon):    890688 kB
Active(file):    20632484 kB
Inactive(file):  19875144 kB

The inactive set isn't larger than the active set; they're about even. But I'm still reading that as the memory manager being aggressive about marking pages inactive. Is that what it says to you too?

Interestingly, I just looked at the memory graph for our standby backup database. While it *normally* uses all the available RAM as page cache, which is what I'd expect to see, when it was the active database for a time in April and May, the page cache size was reduced by about the same margin. So it's the act of running an active postgres instance that causes the phenomenon.

http://s76.photobucket.com/user/kgoesspb/media/db2-mem-historic.png.html
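For anyone else watching this, a quick way to see whether the kernel keeps demoting file pages over time is to sample those two counters on an interval; something like this rough sketch (the interval and counter names are just what worked for me on a 2.6-era /proc/meminfo):

    # Sample the file-cache aging counters once a minute; Inactive(file)
    # steadily growing at the expense of Active(file) suggests aggressive
    # page demotion rather than a genuine lack of demand.
    while true; do
        date
        grep -E '^(Active|Inactive)\(file\)' /proc/meminfo
        sleep 60
    done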
Re: [GENERAL] free RAM not being used for page cache
On Tue, Aug 5, 2014 at 8:27 AM, Shaun Thomas <stho...@optionshouse.com> wrote:
> On 07/30/2014 12:51 PM, Kevin Goess wrote:
>> A couple months ago we upgraded the RAM on our database servers from
>> 48GB to 64GB. Immediately afterwards the new RAM was being used for page
>> cache, which is what we want, but that seems to have dropped off over
>> time, and there's currently about 12GB of totally unused RAM.
>
> What version of the Linux kernel are you using? We had exactly this
> problem when we were on 3.2. We've since moved to 3.8 and that solved
> this issue, along with a few others.

Debian squeeze, still on 2.6.32.

> If you're having the same problem, this is not a NUMA issue or in any way
> related to zone_reclaim_mode. The memory page aging algorithm in pre-3.7
> kernels is simply broken, judging by the traffic on the Linux Kernel
> Mailing List (LKML).

Darn, really? I just learned about the MySQL "swap insanity" problem and noticed that all the free memory is concentrated on one of the two nodes.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 32768 MB
node 0 free: 9105 MB
node 1 cpus: 1 3 5 7
node 1 size: 32755 MB
node 1 free: 259 MB

$ free
             total       used       free     shared    buffers     cached
Mem:      66099280   56565804    9533476          0      11548   51788624

I haven't been able to get any traction on what that means yet, though.
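If the node imbalance does turn out to matter, the mitigation the "swap insanity" write-ups suggest is to start the postmaster with interleaved allocation, so its memory is spread across both nodes instead of filling node 0 first. A sketch, assuming a pg_ctl-style init script (the data directory path is illustrative):

    # Hypothetical init fragment: interleave the postmaster's allocations
    # across all NUMA nodes rather than exhausting the local node first.
    numactl --interleave=all \
        pg_ctl -D /var/lib/postgresql/9.2/main start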
[GENERAL] free RAM not being used for page cache
A couple months ago we upgraded the RAM on our database servers from 48GB to 64GB. Immediately afterwards the new RAM was being used for page cache, which is what we want, but that seems to have dropped off over time, and there's currently about 12GB of totally unused RAM.

http://s76.photobucket.com/user/kgoesspb/media/db1-mem-historical.png.html

Is that expected? Is there a setting we need to tune for that? We have 400GB of databases on this box, so I know it's not all fitting in that 49.89GB.
Re: [GENERAL] free RAM not being used for page cache
Good suggestion, but nope, that ain't it:

$ cat /proc/sys/vm/zone_reclaim_mode
0

On Wed, Jul 30, 2014 at 11:49 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
> On Wed, Jul 30, 2014 at 12:51 PM, Kevin Goess <kgo...@bepress.com> wrote:
>> A couple months ago we upgraded the RAM on our database servers from
>> 48GB to 64GB. Immediately afterwards the new RAM was being used for page
>> cache, which is what we want, but that seems to have dropped off over
>> time, and there's currently about 12GB of totally unused RAM.
>>
>> http://s76.photobucket.com/user/kgoesspb/media/db1-mem-historical.png.html
>>
>> Is that expected? Is there a setting we need to tune for that? We have
>> 400GB of databases on this box, so I know it's not all fitting in that
>> 49.89GB.
>
> could be a numa issue. Take a look at:
> http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html
>
> merlin
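A couple of other read-only checks that help distinguish a NUMA problem from plain cache aging, for anyone following along (assuming the numactl package is installed):

    # Per-node allocation counters; large numa_miss / numa_foreign values
    # on one node point at NUMA allocation pressure rather than cache aging.
    numastat

    # Per-node memory breakdown for node 0, including file cache.
    cat /sys/devices/system/node/node0/meminfo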
Re: [GENERAL] Merge a sharded master into a single read-only slave
> So my conclusion is that for now, the best way to scale read-only queries
> for a sharded master is to implement map-reduce at the application level.

That's the conclusion I would expect. It's the price you pay for sharding; it's part of the deal. But it's also the benefit you get from sharding: once your read traffic grows to the point that it's too much for a single host, you're going to have to re-shard it all again *anyway*. The whole point of sharding is that it allows you to grow beyond the capacities of a single host.
Re: [GENERAL] are analyze statistics synced with replication?
On Thu, May 15, 2014 at 6:39 AM, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> If you don't do read queries on the slave then it will not have hot
> data/pages/rows/tables/indexes in RAM like the primary?

Yeah, that was the first thing we noticed. The cacti graph shows it took two hours for the page cache to fill up our 64GB of RAM, but I/O didn't stop sucking after that.
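A follow-up for the archives: that warm-up window can be shortened on releases where the pg_prewarm extension is available (9.4 or later; older releases would need something like pgfincore or plain sequential scans instead). A sketch, with the relation names as placeholders:

    -- Load the extension once per database (9.4+ only).
    CREATE EXTENSION IF NOT EXISTS pg_prewarm;

    -- Pull a hot table and its primary-key index into shared buffers
    -- before pointing traffic at the freshly promoted standby.
    SELECT pg_prewarm('some_hot_table');
    SELECT pg_prewarm('some_hot_table_pkey');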
[GENERAL] are analyze statistics synced with replication?
We have a master/slave setup with replication. Today we failed over to the slave and saw disk I/O go through the roof. Are the pg_statistic statistics synced along with streaming replication? Are you expected to have to do a vacuum analyze after failing over? That's what we're trying now to see if it makes a difference. Our next step will be to fall back to the first host and see where this one went wrong (society? lax discipline at home? the wrong sort of friends?)
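One thing worth checking right after a failover: pg_statistic is an ordinary system catalog, so the planner statistics themselves do ride along with streaming replication; it's the activity counters (pg_stat_user_tables and friends) that start from scratch on the promoted node. A quick hedged check that the planner data is actually there:

    -- A populated pg_stats row means ANALYZE data for the column exists
    -- on the promoted node, even though last_analyze in
    -- pg_stat_user_tables may show NULL right after promotion.
    SELECT tablename, attname, n_distinct, correlation
    FROM pg_stats
    WHERE schemaname = 'public'
    LIMIT 10;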
[GENERAL] upgrading from debian 6 to 7--do in place or wipe-and-install?
We're looking at upgrading our database hosts running postgres 9.2 from debian 6/squeeze to debian 7/wheezy. It seems to me that using apt to upgrade in-place would be less work than wiping the boxes and reinstalling debian 7 from scratch, but the latter way would be cleaner. Does anybody have any experience with upgrading in-place? Are there any recommendations for one or the other?
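For reference, the in-place path is the standard Debian two-step. A minimal sketch, assuming a stock squeeze sources.list and that you have backups and a tested recovery plan first:

    # Point apt at wheezy instead of squeeze.
    sed -i 's/squeeze/wheezy/g' /etc/apt/sources.list

    # Refresh the package lists, then do the staged upgrade Debian's
    # release notes describe: upgrade first, dist-upgrade second.
    apt-get update
    apt-get upgrade
    apt-get dist-upgrade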
Re: [GENERAL] pgpgout/s without swapping--what does it mean?
That makes sense: updates to rows that are already in memory, either in blocks in the kernel page cache or in blocks in the postgres cache, would trigger writes but not reads. Thanks for the sanity check.

On Mon, Mar 17, 2014 at 11:04 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Monday, March 17, 2014, Kevin Goess <kgo...@bepress.com> wrote:
>> We had a big increase in load, iowait, and disk i/o on a dedicated
>> database host the other night. Looking at the sar logs, the problem shows
>> itself in a big increase in pgpgout/s, which I believe is postgres paging
>> out parts of itself to disk?
>>
>> 02:15:01 AM  pgpgin/s pgpgout/s  fault/s majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s   %vmeff
>> ...
>>
>> However, there isn't a corresponding increase in pages *in*, so if
>> postgres is writing portions of itself out to disk, they can't be very
>> important.
>
> As far as I can tell, pgpgout/s includes all data written to disk, not
> just process memory being paged. So it includes WAL and data files being
> written, for example due to bulk loads. Seems like an odd name for that
> parameter, and I don't know how it differs from bwrtn/s, other than the
> different units.
>
> If it is a bulk load, that would explain why it is not being read back
> in. Also, it could be that the data is needed, but when it is needed it
> is still in cache and so doesn't lead to disk reads. But it still needs
> to be written for durability reasons.
>
> Cheers, Jeff
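To attribute that write volume from the PostgreSQL side rather than sar's whole-box view, the bgwriter statistics view can be sampled before and after the busy window; the diff shows whether the writes were checkpoint flushes, background-writer cleaning, or backends writing out their own dirty buffers. Roughly:

    -- Snapshot the cluster-wide write counters; diff two snapshots taken
    -- across the window to see where the writes came from.
    SELECT checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,   -- written during checkpoints
           buffers_clean,        -- written by the background writer
           buffers_backend       -- written directly by backends
    FROM pg_stat_bgwriter;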
[GENERAL] pgpgout/s without swapping--what does it mean?
We had a big increase in load, iowait, and disk i/o on a dedicated database host the other night. Looking at the sar logs, the problem shows itself in a big increase in pgpgout/s, which I believe is postgres paging out parts of itself to disk?

02:15:01 AM  pgpgin/s pgpgout/s  fault/s majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s   %vmeff
02:25:01 AM     49.63    163.83   836.80     0.00    467.69      4.80      0.00      4.80   100.00
02:35:01 AM     44.85    230.29  1248.68     0.00    677.18     16.21      0.00     16.21   100.00
02:45:01 AM     47.67    190.82  1059.58     0.00    517.72      5.33      0.00      5.33   100.00
02:55:01 AM     59.00    175.58   986.87     0.00    514.08     18.13      0.00     18.13   100.00
03:05:01 AM     24.67    849.77  1382.60     0.00   1267.94     37.35      0.00     37.35   100.00
03:15:01 AM     28.67   1701.67   717.88     0.00   1231.48      0.00      0.00      0.00     0.00
03:25:02 AM     42.64  21342.02  4086.19     0.04   9701.70    415.92      0.00    414.44    99.64
03:35:01 AM     35.60  28290.69  4305.38     0.10  12906.73    623.89      0.00    615.85    98.71
03:45:01 AM     36.94  31119.30  3675.34     0.01  12456.54    527.55      0.00    521.61    98.87
03:55:01 AM     42.77  29020.72  3458.96     0.01  12165.57    557.57      0.00    553.10    99.20
Average:        41.25  11306.39  2175.59     0.02   5189.70    220.62      0.00    218.63    99.10

However, there isn't a corresponding increase in pages *in*, so if postgres is writing portions of itself out to disk, they can't be very important. And there's no swapping going on:

02:15:01 AM  pswpin/s pswpout/s
02:25:01 AM      0.00      0.00
02:35:01 AM      0.00      0.00
02:45:01 AM      0.00      0.00
02:55:01 AM      0.00      0.00
03:05:01 AM      0.00      0.00
03:15:01 AM      0.00      0.00
03:25:02 AM      0.00      0.00
03:35:01 AM      0.00      0.00
03:45:01 AM      0.00      0.00
03:55:01 AM      0.00      0.00
Average:         0.00      0.00

Can anybody help me understand what these statistics are suggesting, and what's actually going on on this box/in postgresql? What is it writing to disk, and why? Is it just writing out new/changed rows, or what?
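(For anyone reproducing this: the paging table above is what sar -B reports and the swap table is sar -W. Reading them back from a saved daily log looks roughly like this; the log path is the Debian sysstat default and may differ elsewhere.)

    # Paging statistics (pgpgin/s, pgpgout/s, ...) for the saved day.
    sar -B -f /var/log/sysstat/sa17

    # Swapping statistics (pswpin/s, pswpout/s) from the same log.
    sar -W -f /var/log/sysstat/sa17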
Re: [GENERAL] simple query with radically different plan after 9.0 - 9.2 upgrade
On Tue, Nov 12, 2013 at 2:47 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> That's right, we store 90 days and roll up data older than that into a
>> different table.
>
> Ah-hah. The default statistics target is 100, so indeed ANALYZE is going
> to be able to fit every date entry in the table into the
> most-common-values list. In this situation, you'd rather that there were
> some uncertainty left. Given that the distribution of the date column is
> (I assume) pretty level, you don't really need full information about
> this column. I'd try backing off the stats target for the date column
> (and only the date column --- see ALTER TABLE SET STATISTICS) to 50 or
> even less.

That was it! I set it to 50 on all the 90-day tables, and now we no longer see that regular increase in disk reads between midnight of the new day and the 1:37am re-analyze. Thanks!
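(The exact statement didn't make it into the thread; applying Tom's suggestion to the table upthread would look something like this.)

    -- Shrink the per-column statistics target so ANALYZE keeps some
    -- uncertainty instead of a complete most-common-values list,
    -- then rebuild the statistics.
    ALTER TABLE hits_user_daily_count ALTER COLUMN date SET STATISTICS 50;
    ANALYZE hits_user_daily_count;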
Re: [GENERAL] simple query with radically different plan after 9.0 - 9.2 upgrade
Thanks for the reply! Your analysis matches everything I see here, so what you say is probably the case. As to why it changed for us with the 9.0 -> 9.2 upgrade, I also don't know--the change was pretty dramatic, though. Since we've compensated for it, and since you say the current behavior is actually what's expected, I'm happy. But since you went to the trouble to reply:

> Now, the only way to get to a zero selectivity estimate for var = const
> is if the planner believes that the pg_stats most-common-values list for
> the column is complete, and the constant is nowhere in the list. So one
> plausible explanation for the change in behavior is that you jacked up
> the statistics target for the date column enough so that it includes all
> of the date values you keep in that column.

I'm not following you there, but I'm not a full-time database guy. Attached is the pg_stats for that column in case you find it interesting or helpful.

> Am I right in guessing that you drop old data from this table? How far
> back?

That's right, we store 90 days and roll up data older than that into a different table.

# select * from pg_stats where tablename = 'hits_user_daily_count' and attname = 'date';
-[ RECORD 1 ]----------+-----------
schemaname             | public
tablename              | hits_user_daily_count
attname                | date
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 91
most_common_vals       | {2013-11-11,2013-10-21,2013-10-07,2013-10-09,2013-10-03,2013-10-23,2013-11-06,2013-09-16,2013-10-02,2013-10-08,2013-09-23,2013-10-29,2013-10-15,2013-09-30,2013-10-22,2013-11-07,2013-10-28,2013-09-11,2013-11-05,2013-10-16,2013-10-30,2013-10-10,2013-11-04,2013-09-24,2013-09-17,2013-10-14,2013-10-01,2013-10-17,2013-11-08,2013-10-24,2013-09-09,2013-09-19,2013-09-10,2013-09-25,2013-10-04,2013-09-18,2013-10-31,2013-09-04,2013-09-26,2013-10-20,2013-08-29,2013-10-18,2013-08-27,2013-10-13,2013-09-12,2013-08-14,2013-09-13,2013-09-02,2013-10-25,2013-11-03,2013-08-19,2013-09-05,2013-09-27,2013-10-06,2013-10-11,2013-09-15,2013-09-03,2013-09-22,2013-10-27,2013-11-10,2013-08-28,2013-11-01,2013-08-26,2013-09-20,2013-10-19,2013-11-09,2013-10-12,2013-08-15,2013-08-30,2013-08-16,2013-08-25,2013-09-21,2013-09-28,2013-11-02,2013-10-05,2013-08-23,2013-09-08,2013-09-06,2013-09-29,2013-10-26,2013-09-07,2013-09-14,2013-09-01,2013-08-31,2013-08-20,2013-08-17,2013-08-24,2013-08-18,2013-08-22,2013-08-21,2013-11-12}
most_common_freqs      | {0.0144667,0.0137556,0.0136889,0.0135333,0.0134,0.0134,0.013,0.0133111,0.0132667,0.0132,0.0131556,0.0131333,0.0130667,0.013,0.0129778,0.0129333,0.0128889,0.0128,0.0127333,0.0126444,0.0126,0.0125778,0.0125111,0.0124889,0.0123778,0.0123778,0.0122667,0.0122667,0.0122,0.0121556,0.0120889,0.0119778,0.0119111,0.0118444,0.0118,0.0117333,0.0116,0.0114222,0.0114,0.0113556,0.0113111,0.0112889,0.011,0.011,0.0109556,0.0109333,0.0109111,0.0108889,0.0108444,0.0108444,0.0108222,0.0107333,0.0107333,0.0107333,0.0107333,0.0106889,0.0106,0.0105778,0.0105556,0.0105111,0.0104889,0.0104889,0.0104,0.0103778,0.010,0.0102667,0.0102,0.00995556,0.0098,0.00975556,0.0097,0.0097,0.0097,0.0096,0.0096,0.00948889,0.0094,0.0094,0.0092,0.0092,0.0086,0.00848889,0.00846667,0.0083,0.00828889,0.00826667,0.0081,0.0077,0.00717778,0.00497778,0.0018}
histogram_bounds       |
correlation            | 0.276451
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
[GENERAL] simple query with radically different plan after 9.0 - 9.2 upgrade
Given this table:

articles_1=> \d hits_user_daily_count;
        Table "public.hits_user_daily_count"
          Column           |  Type   | Modifiers
---------------------------+---------+-----------
 userid                    | integer | not null
 date                      | date    | not null
 num_hits                  | integer | default 0
 num_cover_page_hits       | integer | default 0
 num_additional_files_hits | integer | default 0
Indexes:
    "hits_user_daily_count_pkey" PRIMARY KEY, btree (userid, date)
    "hits_user_daily_count_date" btree (date)

whose data looks like this:

articles_1=> select * from hits_user_daily_count limit 5;
 userid  |    date    | num_hits | num_cover_page_hits | num_additional_files_hits
---------+------------+----------+---------------------+---------------------------
 1590185 | 2013-07-10 |        3 |                   4 |                         0
  391907 | 2013-07-10 |       16 |                  12 |                         0
 1739541 | 2013-08-03 |        1 |                   0 |                         0
 1798435 | 2013-07-10 |        0 |                   1 |                         0
 1521468 | 2013-07-10 |        2 |                   0 |                         0

We noticed a big change after upgrading from 9.0 to 9.2. For *yesterday's* date, the query plan is fine, like you'd expect:

articles_1=> explain (analyze, buffers) select 1 from hits_user_daily_count where userid = 1590185 and date = '2013-07-30';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan using hits_user_daily_count_pkey on hits_user_daily_count (cost=0.00..8.02 rows=1 width=0) (actual time=0.02
   Index Cond: ((userid = 1590185) AND (date = '2013-07-30'::date))
   Heap Fetches: 1
   Buffers: shared hit=5
 Total runtime: 0.044 ms

but for *today's* date it looks like it's reading all the rows for that date, 15,277 buffers:

articles_1=> explain (analyze, buffers) select 1 from hits_user_daily_count where userid = 1590185 and date = '2013-08-01';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using hits_user_daily_count_date on hits_user_daily_count (cost=0.00..7.92 rows=1 width=0) (actual time=11.957..1
   Index Cond: (date = '2013-08-01'::date)
   Filter: (userid = 1590185)
   Rows Removed by Filter: 20149
   Buffers: shared hit=15277
 Total runtime: 17.924 ms

(The dates in the queries are old because I've had this email in draft for a while, but the behavior is still identical.)

We've addressed the problem by running ANALYZE on the table every day at about 1:30am. Buffer hits on that table go from about 1,000/sec to 70,000/sec between midnight and that analyze job, and then go back down to 1,000/sec and stay flat until midnight rolls around again. Before the 9.0 -> 9.2 upgrade, the behavior was flat all day.

Any ideas what would be causing that problem?
[GENERAL] process deadlocking on its own transactionid?
We're seeing a problem with some of our processes hanging on locks. The select below makes it look like the process is *waiting* for a ShareLock on transactionid, but it *has* an ExclusiveLock on the same value in virtualxid. That makes it look like the process has deadlocked on its own transactionid. Or are we reading the results of this query wrong, and this is expected behavior, and our problem lies elsewhere?

(Yes, the process is doing a SELECT ... FOR UPDATE on this context_objects table, according to pg_stat_activity.)

production=> select locktype, database, relname, relation, virtualxid, virtualtransaction, pid, mode, granted
             from pg_locks left outer join pg_class on pg_class.oid = pg_locks.relation
             where pid = 2288;
   locktype    | database |       relname        | relation |  virtualxid  | virtualtransaction | pid  |      mode       | granted
---------------+----------+----------------------+----------+--------------+--------------------+------+-----------------+---------
 relation      |    41194 | context_objects_pkey | 95318843 |              | 123/45694692       | 2288 | AccessShareLock | t
 relation      |    41194 | context_objects      |    41553 |              | 123/45694692       | 2288 | RowShareLock    | t
 virtualxid    |          |                      |          | 123/45694692 | 123/45694692       | 2288 | ExclusiveLock   | t
 transactionid |          |                      |          |              | 123/45694692       | 2288 | ShareLock       | f
 tuple         |    41194 | context_objects      |    41553 |              | 123/45694692       | 2288 | ExclusiveLock   | t
(5 rows)
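For the archives: an ungranted ShareLock on a transactionid normally means the backend is waiting for some *other* transaction to commit or abort (typically a row-level lock conflict), not for itself; the XID it's waiting on just isn't visible above because the transactionid column wasn't selected. Finding the holder by joining pg_locks on that column looks roughly like this (newer releases also have pg_blocking_pids() for the same job):

    -- Who holds the transactionid lock our backend is waiting on?
    SELECT waiting.pid  AS waiting_pid,
           holder.pid   AS holding_pid,
           holder.mode  AS holder_mode
    FROM pg_locks waiting
    JOIN pg_locks holder
      ON holder.locktype = 'transactionid'
     AND holder.transactionid = waiting.transactionid
     AND holder.granted
    WHERE waiting.locktype = 'transactionid'
      AND NOT waiting.granted;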
Re: [GENERAL] Why are stored procedures looked on so negatively?
On Tue, Jul 23, 2013 at 5:40 PM, Adrian Klaver <adrian.kla...@gmail.com> wrote:
> On 07/23/2013 05:29 PM, Some Developer wrote:
>> I'm in the middle of building a database and was going to make extensive
>> use of stored procedures and trigger functions, because it makes more
>> sense for the actions to happen at the database layer rather than in the
>> app layer. Should I use them or not?
>
> Personally I figure the arguments for and against are closely correlated
> with where on the development chain you are, and are tied in with job
> security. If you are an app developer then it is in your interest to have
> code in the app; if you are a database developer, in the database.

What he says is very true. But make sure to think about the things that may already be set up to manage your application code--versioning, testing, unit testing, packaging, release process, and documentation--and ask how much of that is in place for your stored procedures and triggers. If a developer makes a change to application code, it gets checked in to source control, unit tested, code reviewed, run through some QA, and staged for the next roll to production--will all that happen for your stored procedures? And consider that there is already logic in the application; now some of the logic will be in the application and some of it in the database--does it make sense to have it in two places?

I think those are the kinds of concerns that make people shy about putting too much logic in the database. None of them are insurmountable, but you should at least think about them.
Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)
Thanks for looking into it, Tom. We're using 9.0.4, so that might indeed be the problem. What additional data (if any) would you like to see? If you want to look into it further, I can give you the schema, though I hesitate to spam the whole list. I could also mock up some tables, see what's the smallest data set that shows the problem, and send you those in a dump. The fact that the behavior changes so radically when the limit on the joined table goes from 199 to 200 rows does make me suspect something's not behaving the way it should.

On Tue, Mar 20, 2012 at 4:27 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> You've still got a nasty join-size estimation error:
>>   -> Nested Loop (cost=6.18..1939.43 rows=411736 width=8)
>>      (actual time=0.203..3.487 rows=35 loops=1)
>> It's not apparent why that's so far off ...
>
> What PG version is this, anyway? It strikes me that this estimation error
> might have something to do with the eqjoinsel bugs that we repaired in
> 9.0.5. I'm not having any luck reproducing such a bogus estimate with
> current code, either, though that may just mean you've omitted some
> critical info about how the tables are set up.
>
> regards, tom lane
Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)
On Mon, Mar 19, 2012 at 9:24 AM, Albe Laurenz <laurenz.a...@wien.gv.at> wrote:
> That means that your statistics are not accurate. As a first measure, you
> should ANALYZE the tables involved and see if the problem persists. If
> yes, post the new plans.

Aha, thanks, that explains why my test table with one row was so bad. But even with all freshly ANALYZEd tables, I still see the query reverting to a sequential scan on that big contexts table once the number of rows in the subselect goes over 199. Here's a simplified version that demonstrates the problem:

production=> explain (analyze, buffers)
SELECT contexts.context_key
FROM contexts JOIN articles ON (articles.context_key = contexts.context_key)
WHERE contexts.context_key IN (SELECT context_key FROM virtual_ancestors limit 200)
  AND articles.indexed;

 Hash Join (cost=7086.13..219322.15 rows=411736 width=4) (actual time=50.118..1213.046 rows=35 loops=1)
   Hash Cond: (contexts.context_key = articles.context_key)
   Buffers: shared hit=72539 read=100104
   -> Seq Scan on contexts (cost=0.00..190285.83 rows=1783283 width=4) (actual time=0.040..769.891 rows=1786074 loops=1)
        Buffers: shared hit=72399 read=100054
   -> Hash (cost=1939.43..1939.43 rows=411736 width=8) (actual time=3.510..3.510 rows=35 loops=1)
        Buckets: 65536  Batches: 1  Memory Usage: 2kB
        Buffers: shared hit=140 read=50
        -> Nested Loop (cost=6.18..1939.43 rows=411736 width=8) (actual time=0.203..3.487 rows=35 loops=1)
             Buffers: shared hit=140 read=50
             -> HashAggregate (cost=6.18..8.18 rows=200 width=4) (actual time=0.174..0.198 rows=48 loops=1)
                  Buffers: shared read=2
                  -> Limit (cost=0.00..3.68 rows=200 width=4) (actual time=0.015..0.108 rows=200 loops=1)
                       Buffers: shared read=2
                       -> Seq Scan on virtual_ancestors (cost=0.00..87676.17 rows=4759617 width=4) (actual time=0.015..0.075 rows=200 loops=1)
                            Buffers: shared read=2
             -> Index Scan using articles_pkey on articles (cost=0.00..9.64 rows=1 width=4) (actual time=0.015..0.068 rows=1 loops=48)
                  Index Cond: (articles.context_key = virtual_ancestors.context_key)
                  Filter: articles.indexed
                  Buffers: shared hit=140 read=48
 Total runtime: 1213.138 ms
(21 rows)

But if I write the keys into the subquery inline, I get a very nice execution plan, all the way up to a tested maximum of about 50,000 keys:

production=> explain (analyze, buffers)
SELECT contexts.context_key
FROM contexts JOIN articles ON (articles.context_key = contexts.context_key)
WHERE contexts.context_key IN (2482612,2482612,...)
  AND articles.indexed;

 Nested Loop (cost=758.71..3418.40 rows=200 width=4) (actual time=0.621..1.089 rows=35 loops=1)
   Buffers: shared hit=826 read=1
   -> Bitmap Heap Scan on contexts (cost=752.58..1487.55 rows=200 width=4) (actual time=0.604..0.699 rows=48 loops=1)
        Recheck Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[]))
        Buffers: shared hit=639
        -> Bitmap Index Scan on contexts_pkey (cost=0.00..752.53 rows=200 width=0) (actual time=0.591..0.591 rows=200 loops=1)
             Index Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[]))
             Buffers: shared hit=600
   -> Bitmap Heap Scan on articles (cost=6.13..9.64 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=48)
        Recheck Cond: (articles.context_key = contexts.context_key)
        Filter: articles.indexed
        Buffers: shared hit=187 read=1
        -> Bitmap Index Scan on articles_pkey (cost=0.00..6.13 rows=1 width=0) (actual time=0.005..0.005 rows=1 loops=48)
             Index Cond: (articles.context_key = contexts.context_key)
             Buffers: shared hit=148
 Total runtime: 1.147 ms

Is this expected behavior, that writing the ids inline does much better than the subquery? I've been told that it's not, but this isn't the first time I've seen this, so I feel like I'm not understanding something.
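A workaround that's often suggested for exactly this (the planner falling back on a default row estimate for an uncorrelated subquery) is to materialize the keys and ANALYZE them, so the join sees real statistics; a sketch under that assumption:

    -- Materialize the subselect so the planner gets real row counts.
    CREATE TEMP TABLE ancestor_keys AS
        SELECT context_key FROM virtual_ancestors LIMIT 200;
    ANALYZE ancestor_keys;

    -- Join against the analyzed temp table (add DISTINCT above if the
    -- key list can contain duplicates, to keep IN semantics).
    SELECT contexts.context_key
    FROM contexts
    JOIN articles      ON articles.context_key = contexts.context_key
    JOIN ancestor_keys ON ancestor_keys.context_key = contexts.context_key
    WHERE articles.indexed;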
[GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)
My apologies, I'm sure this question has been asked before, but I couldn't find anything on the list that meant anything to me.

We have a table "contexts" with 1.6 million rows and a table "articles" with 1.4 million rows, where an article is a particular kind of context. We want to select from a join on those two tables like this:

SELECT COUNT(*)
FROM contexts JOIN articles ON (articles.context_key = contexts.context_key)
WHERE contexts.context_key IN (...);
/* and some combination of columns from articles and contexts */

If the IN (...) is a subquery, then this guy does a seq scan on the contexts table, even if the subquery is "select cola from kgtest" where kgtest has one row. If however I read the ids beforehand and write them into the query, a la IN (111,222,333...), then everything is happy, up to at least 20,000 values written into the SQL, at which point smaller machines will take 2-5 minutes to parse the query.

I can certainly write the ids inline into the SQL, but when I do that I get the distinct impression that I'm Doing It Wrong. Is this expected behavior? It seems surprising to me.

To demonstrate:

/* nothing up my sleeve */
# select * from kgtest;
  cola
---------
 1652729
(1 row)

/* inline, good query plan */
# explain (analyze, buffers) select count(*) from contexts
  JOIN articles ON (articles.context_key=contexts.context_key)
  where contexts.context_key in (1652729);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate (cost=3.82..3.83 rows=1 width=0) (actual time=0.188..0.189 rows=1 loops=1)
   Buffers: shared hit=7
   -> Nested Loop (cost=0.00..3.81 rows=1 width=0) (actual time=0.181..0.181 rows=0 loops=1)
        Buffers: shared hit=7
        -> Index Scan using contexts_pkey on contexts (cost=0.00..1.90 rows=1 width=4) (actual time=0.109..0.112 ro
             Index Cond: (context_key = 1652729)
             Buffers: shared hit=4
        -> Index Scan using articles_pkey on articles (cost=0.00..1.90 rows=1 width=4) (actual time=0.060..0.060 ro
             Index Cond: (articles.context_key = 1652729)
             Buffers: shared hit=3
 Total runtime: 0.324 ms
(11 rows)

/* subselect, query plan does seq scan on contexts */
# explain (analyze, buffers) select count(*) from contexts
  JOIN articles ON (articles.context_key=contexts.context_key)
  where contexts.context_key in (select cola from kgtest);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate (cost=118505.72..118505.73 rows=1 width=0) (actual time=0.274..0.275 rows=1 loops=1)
   Buffers: shared hit=5
   -> Hash Join (cost=12512.61..116661.91 rows=737524 width=0) (actual time=0.269..0.269 rows=0 loops=1)
        Hash Cond: (contexts.context_key = articles.context_key)
        Buffers: shared hit=5
        -> Seq Scan on contexts (cost=0.00..64533.03 rows=1648203 width=4) (actual time=0.009..0.009 rows=1 loops=1
             Buffers: shared hit=1
        -> Hash (cost=412.56..412.56 rows=737524 width=8) (actual time=0.110..0.110 rows=0 loops=1)
             Buckets: 4096  Batches: 32  Memory Usage: 0kB
             Buffers: shared hit=4
             -> Nested Loop (cost=40.00..412.56 rows=737524 width=8) (actual time=0.107..0.107 rows=0 loops=1)
                  Buffers: shared hit=4
                  -> HashAggregate (cost=40.00..42.00 rows=200 width=4) (actual time=0.069..0.071 rows=1 loops=1)
                       Buffers: shared hit=1
                       -> Seq Scan on kgtest (cost=0.00..34.00 rows=2400 width=4) (actual time=0.048..0.050 rows
                            Buffers: shared hit=1
                  -> Index Scan using articles_pkey on articles (cost=0.00..1.84 rows=1 width=4) (actual time=0.0
                       Index Cond: (articles.context_key = kgtest.cola)
                       Buffers: shared hit=3
 Total runtime: 0.442 ms
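A middle ground between a subselect and 20,000 inlined literals, for anyone reading along, is to pass the ids as a single array parameter: it parses in constant time, and the planner handles = ANY over an array well (the bitmap plans elsewhere in this thread show exactly that form). A sketch, with the array literal standing in for a bound parameter:

    -- One array parameter instead of thousands of inlined literals;
    -- equivalent to IN (111, 222, 333) but cheap to parse.
    SELECT count(*)
    FROM contexts
    JOIN articles ON articles.context_key = contexts.context_key
    WHERE contexts.context_key = ANY ('{111,222,333}'::integer[]);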