Re: [GENERAL] free RAM not being used for page cache

2014-09-04 Thread Kevin Goess
This is a super-interesting topic, thanks for all the info.

On Thu, Sep 4, 2014 at 7:44 AM, Shaun Thomas stho...@optionshouse.com
wrote:

 Check /proc/meminfo for a better breakdown of how the memory is being
 used. This should work:

 grep -A1 Active /proc/meminfo

 I suspect your inactive file cache is larger than the active set,
 suggesting an overly aggressive memory manager.


$ grep -A1 Active /proc/meminfo
Active: 34393512 kB
Inactive:   20765832 kB
Active(anon):   13761028 kB
Inactive(anon):   890688 kB
Active(file):   20632484 kB
Inactive(file): 19875144 kB

The inactive set isn't larger than the active set; they're about even. But
I'm still reading that as the memory manager being aggressive about marking
pages as inactive. Is that what it says to you too?

Interestingly, I just looked at the memory graph for our standby backup
database. It *normally* uses all the available RAM as page cache, which is
what I'd expect to see, but when it was the active database for a time in
April and May, the page cache size was reduced by about the same margin. So
it's the act of running an active postgres instance that causes the
phenomenon.

http://s76.photobucket.com/user/kgoesspb/media/db2-mem-historic.png.html



-- 
Kevin M. Goess
Software Engineer
Berkeley Electronic Press
kgo...@bepress.com

510-665-1200 x179
www.bepress.com

bepress: sustainable scholarly publishing


Re: [GENERAL] free RAM not being used for page cache

2014-09-03 Thread Kevin Goess
On Tue, Aug 5, 2014 at 8:27 AM, Shaun Thomas stho...@optionshouse.com
wrote:

 On 07/30/2014 12:51 PM, Kevin Goess wrote:

  A couple months ago we upgraded the RAM on our database servers from
 48GB to 64GB.  Immediately afterwards the new RAM was being used for
 page cache, which is what we want, but that seems to have dropped off
 over time, and there's currently actually like 12GB of totally unused RAM.


 What version of the Linux kernel are you using? We had exactly this
 problem when we were on 3.2. We've since moved to 3.8 and that solved this
 issue, along with a few others.


Debian squeeze, still on 2.6.32.


 If you're having the same problem, this is not a NUMA issue or in any way
 related to zone_reclaim_mode. The memory page aging algorithm in pre 3.7 is
 simply broken, judging by the traffic on the Linux Kernel Mailing List
 (LKML).


Darn, really? I just learned about the mysql swap insanity problem and
noticed that all the free memory is concentrated on one of the two nodes.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 32768 MB
node 0 free: 9105 MB
node 1 cpus: 1 3 5 7
node 1 size: 32755 MB
node 1 free: 259 MB

$ free
             total       used       free     shared    buffers     cached
Mem:      66099280   56565804    9533476          0      11548   51788624

I haven't been able to get any traction on what that means yet though.
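In case it's useful to anyone else digging into the same thing, here's a rough
sketch of how to get a fuller per-node breakdown (this assumes the numactl
package's numastat, and a version new enough to support -m; the /sys files are
there on any NUMA box):

$ numastat -m | egrep 'MemFree|FilePages|Inactive'    # per-node free memory, file cache, inactive pages
$ cat /sys/devices/system/node/node0/meminfo          # raw per-node meminfo
$ cat /sys/devices/system/node/node1/meminfo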




[GENERAL] free RAM not being used for page cache

2014-07-30 Thread Kevin Goess
A couple months ago we upgraded the RAM on our database servers from 48GB
to 64GB.  Immediately afterwards the new RAM was being used for page cache,
which is what we want, but that seems to have dropped off over time, and
there's currently actually like 12GB of totally unused RAM.

http://s76.photobucket.com/user/kgoesspb/media/db1-mem-historical.png.html

Is that expected?  Is there a setting we need to tune for that?  We have
400GB of databases on this box, so I know it's not all fitting in that
49.89GB.


Re: [GENERAL] free RAM not being used for page cache

2014-07-30 Thread Kevin Goess
Good suggestion, but nope, that ain't it:

$ cat /proc/sys/vm/zone_reclaim_mode
0
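(For the record, if it had come back non-zero, my understanding is the usual
fix is simply to turn it off, something like:

$ sudo sysctl -w vm.zone_reclaim_mode=0                            # takes effect immediately
$ echo 'vm.zone_reclaim_mode = 0' | sudo tee -a /etc/sysctl.conf   # persist across reboots

but since it's already 0, that's not what's biting us here.)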



On Wed, Jul 30, 2014 at 11:49 AM, Merlin Moncure mmonc...@gmail.com wrote:

 On Wed, Jul 30, 2014 at 12:51 PM, Kevin Goess kgo...@bepress.com wrote:
  A couple months ago we upgraded the RAM on our database servers from
 48GB to
  64GB.  Immediately afterwards the new RAM was being used for page cache,
  which is what we want, but that seems to have dropped off over time, and
  there's currently actually like 12GB of totally unused RAM.
 
 
 http://s76.photobucket.com/user/kgoesspb/media/db1-mem-historical.png.html
 
  Is that expected?  Is there a setting we need to tune for that?  We have
  400GB of databases on this box, so I know it's not all fitting in that
  49.89GB.

 could be a numa issue.  Take a look at:

 http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html

 merlin






Re: [GENERAL] Merge a sharded master into a single read-only slave

2014-06-02 Thread Kevin Goess
 So my conclusion is that for now, the best way to scale read-only queries
for a sharded master is to
 implement map-reduce at the application level.

That's the conclusion I would expect. It's the price you pay for sharding;
it's part of the deal.

But it's also the benefit you get from sharding.  Once your read traffic
grows to the point that it's too much for a single host, you're going to
have to re-shard it all again *anyway*.  The whole point of sharding is
that it allows you to grow outside the capacities of a single host.




Re: [GENERAL] are analyze statistics synced with replication?

2014-05-15 Thread Kevin Goess
On Thu, May 15, 2014 at 6:39 AM, Dorian Hoxha dorian.ho...@gmail.com wrote:

 If you don't do read queries on the slave then it will not have hot
 data/pages/rows/tables/indexes in ram like the primary ?


Yeah, that was the first thing we noticed: the cacti graph shows it took
two hours for the page cache to fill up our 64GB of RAM, but I/O didn't
stop sucking after that.


[GENERAL] are analyze statistics synced with replication?

2014-05-14 Thread Kevin Goess
We have a master/slave setup with replication.  Today we failed over to the
slave and saw disk I/O go through the roof.

Are the pg_statistic statistics synced along with streaming replication?
Are you expected to have to do a vacuum analyze after failing over?  That's
what we're trying now to see if it makes a difference.  Our next step will
be to fall back to the first host and see where this one went wrong
(society?  lax discipline at home? the wrong sort of friends?)
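(For the archives: what we're trying is essentially just a database-wide
vacuum analyze on the newly promoted host, something along these lines, so
the planner at least has fresh statistics:

-- run in each database on the newly promoted primary
VACUUM ANALYZE;

or the equivalent "vacuumdb --all --analyze" from the shell.)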


[GENERAL] upgrading from debian 6 to 7--do in place or wipe-and-install?

2014-03-28 Thread Kevin Goess
We're looking at upgrading our database hosts running postgres 9.2 from
debian 6/squeeze to debian 7/wheezy.  It seems to me that using apt to
upgrade in-place would be less work than wiping the boxes and reinstalling
debian 7 from scratch, but the latter way would be cleaner.

Does anybody have any experience with upgrading in-place?  Are there any
recommendations for one or the other?


Re: [GENERAL] pgpgout/s without swapping--what does it mean?

2014-03-18 Thread Kevin Goess
That makes sense: updates to rows that are already in memory, whether in
blocks in the kernel page cache or in the postgres buffer cache, would
trigger writes but not reads.  Thanks for the sanity check.
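For what it's worth, the block-device view in sar should show the same
one-sided pattern, since bread/s and bwrtn/s count disk blocks (sectors)
rather than the kilobytes that pgpgin/s and pgpgout/s report. A quick sketch
of that cross-check (the -f path is Debian's default sysstat location):

$ sar -b                             # today's tps, bread/s, bwrtn/s
$ sar -b -f /var/log/sysstat/sa17    # or the daily file for the 17th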


On Mon, Mar 17, 2014 at 11:04 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Monday, March 17, 2014, Kevin Goess kgo...@bepress.com wrote:

 We had a big increase in load, iowait, and disk i/o on a dedicated
 database host the other night.

 Looking at the sar logs, the problem shows itself in a big increase in
 pgpgout/s, which I believe is postgres paging out parts of itself to disk?

 02:15:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff


 ...


 However, there isn't a corresponding increase in pages *in*, so if
 postgres is writing portions of itself out to disk, they can't be very
 important.



 As far as I can tell, pgpgout/s includes all data written to disk, not
 just process memory being paged.  So it includes WAL and data files being
 written, for example due to bulk loads.  Seems like an odd name for that
 parameter, and I don't know how it differs from bwrtn/s, other than the
 different units.

 If it is a bulk load, that would explain why it is not being read back in.
  Also, it could be that the data is needed, but when it is needed it is
 still in cache and so doesn't lead to disk reads.  But it still needs to be
 written for durability reasons.

 Cheers,

 Jeff






[GENERAL] pgpgout/s without swapping--what does it mean?

2014-03-17 Thread Kevin Goess
We had a big increase in load, iowait, and disk i/o on a dedicated database
host the other night.

Looking at the sar logs, the problem shows itself in a big increase in
pgpgout/s, which I believe is postgres paging out parts of itself to disk?

02:15:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
02:25:01 AM     49.63    163.83    836.80      0.00    467.69      4.80      0.00      4.80    100.00
02:35:01 AM     44.85    230.29   1248.68      0.00    677.18     16.21      0.00     16.21    100.00
02:45:01 AM     47.67    190.82   1059.58      0.00    517.72      5.33      0.00      5.33    100.00
02:55:01 AM     59.00    175.58    986.87      0.00    514.08     18.13      0.00     18.13    100.00
03:05:01 AM     24.67    849.77   1382.60      0.00   1267.94     37.35      0.00     37.35    100.00
03:15:01 AM     28.67   1701.67    717.88      0.00   1231.48      0.00      0.00      0.00      0.00
03:25:02 AM     42.64  21342.02   4086.19      0.04   9701.70    415.92      0.00    414.44     99.64
03:35:01 AM     35.60  28290.69   4305.38      0.10  12906.73    623.89      0.00    615.85     98.71
03:45:01 AM     36.94  31119.30   3675.34      0.01  12456.54    527.55      0.00    521.61     98.87
03:55:01 AM     42.77  29020.72   3458.96      0.01  12165.57    557.57      0.00    553.10     99.20
Average:        41.25  11306.39   2175.59      0.02   5189.70    220.62      0.00    218.63     99.10

However, there isn't a corresponding increase in pages *in*, so if postgres
is writing portions of itself out to disk, they can't be very important.
 And there's no swapping going on:

02:15:01 AM  pswpin/s pswpout/s
02:25:01 AM  0.00  0.00
02:35:01 AM  0.00  0.00
02:45:01 AM  0.00  0.00
02:55:01 AM  0.00  0.00
03:05:01 AM  0.00  0.00
03:15:01 AM  0.00  0.00
03:25:02 AM  0.00  0.00
03:35:01 AM  0.00  0.00
03:45:01 AM  0.00  0.00
03:55:01 AM  0.00  0.00
   Average: 0.00  0.00

Can anybody help me understand what these statistics are suggesting, what's
actually going on on this box/in postgresql?  What is it writing to disk,
and why?  Is it just writing out new/changed rows, or what?
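For context, here's the kind of thing I can check on the postgres side to see
where buffer writes are coming from; a sketch using pg_stat_bgwriter's
cumulative counters (sampled before and after a window like the one above):

-- buffers written by checkpoints vs. the background writer vs. ordinary backends
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend, buffers_alloc
  FROM pg_stat_bgwriter;

A big jump in buffers_checkpoint or buffers_backend over that 3am window would
point at checkpoint and dirty-buffer writes rather than anything being paged
out.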



Re: [GENERAL] simple query with radically different plan after 9.0 - 9.2 upgrade

2013-11-17 Thread Kevin Goess
On Tue, Nov 12, 2013 at 2:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:

  That's right, we store 90 days and roll up data older than that into a
  different table.

 Ah-hah.  The default statistics target is 100, so indeed ANALYZE is going
 to be able to fit every date entry in the table into the
 most-common-values list.  In this situation, you'd rather that there were
 some uncertainty left.  Given that the distribution of the date column is
 (I assume) pretty level, you don't really need full information about this
 column.  I'd try backing off the stats target for the date column (and
 only the date column --- see ALTER TABLE SET STATISTICS) to 50 or even
 less.


That was it!  I set it to 50 on all the 90-day tables, and now we no longer
see that regular increase in disk reads between midnight of the new day and
the 1:37am re-analyze.

Thanks!
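(For anyone finding this in the archives later, roughly what that looked like
per table:

-- lower the statistics target for just the date column...
ALTER TABLE hits_user_daily_count ALTER COLUMN date SET STATISTICS 50;
-- ...then re-analyze so the smaller most-common-values list takes effect
ANALYZE hits_user_daily_count;

repeated for each of the 90-day tables.)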


Re: [GENERAL] simple query with radically different plan after 9.0 - 9.2 upgrade

2013-11-12 Thread Kevin Goess
Thanks for the reply!

Your analysis matches everything I see here, so what you say is probably
the case. As to why it changed for us with the 9.0 -> 9.2 upgrade, I also
don't know--the change was pretty dramatic though.  Since we've compensated
for it, and since you say the current behavior is actually what's expected,
I'm happy.

But since you went to the trouble to reply:

Now, the only way to get to a zero selectivity estimate for var = const
 is if the planner believes that the pg_stats most-common-values list
 for the column is complete, and the constant is nowhere in the list.
 So one plausible explanation for the change in behavior is that you
 jacked up the statistics target for the date column enough so that
 it includes all of the date values you keep in that column.


I'm not following you there, but I'm not a full-time database guy.
 Attached is the pg_stats for that column in case you find that interesting
or helpful.


 Am I right
 in guessing that you drop old data from this table?  How far back?


That's right, we store 90 days and roll up data older than that into a
different table.


# select * from pg_stats where tablename = 'hits_user_daily_count' and attname = 'date';
-[ RECORD 1 ]----------+---------------------------------------------------------------------
schemaname             | public
tablename              | hits_user_daily_count
attname                | date
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 91
most_common_vals       | 
{2013-11-11,2013-10-21,2013-10-07,2013-10-09,2013-10-03,2013-10-23,2013-11-06,2013-09-16,2013-10-02,2013-10-08,2013-09-23,2013-10-29,2013-10-15,2013-09-30,2013-10-22,2013-11-07,2013-10-28,2013-09-11,2013-11-05,2013-10-16,2013-10-30,2013-10-10,2013-11-04,2013-09-24,2013-09-17,2013-10-14,2013-10-01,2013-10-17,2013-11-08,2013-10-24,2013-09-09,2013-09-19,2013-09-10,2013-09-25,2013-10-04,2013-09-18,2013-10-31,2013-09-04,2013-09-26,2013-10-20,2013-08-29,2013-10-18,2013-08-27,2013-10-13,2013-09-12,2013-08-14,2013-09-13,2013-09-02,2013-10-25,2013-11-03,2013-08-19,2013-09-05,2013-09-27,2013-10-06,2013-10-11,2013-09-15,2013-09-03,2013-09-22,2013-10-27,2013-11-10,2013-08-28,2013-11-01,2013-08-26,2013-09-20,2013-10-19,2013-11-09,2013-10-12,2013-08-15,2013-08-30,2013-08-16,2013-08-25,2013-09-21,2013-09-28,2013-11-02,2013-10-05,2013-08-23,2013-09-08,2013-09-06,2013-09-29,2013-10-26,2013-09-07,2013-09-14,2013-09-01,2013-08-31,2013-08-20,2013-08-17,2013-08-24,2013-08-18,2013-08-22,2013-08-21,2013-11-12}
most_common_freqs      | 
{0.0144667,0.0137556,0.0136889,0.0135333,0.0134,0.0134,0.013,0.0133111,0.0132667,0.0132,0.0131556,0.0131333,0.0130667,0.013,0.0129778,0.0129333,0.0128889,0.0128,0.0127333,0.0126444,0.0126,0.0125778,0.0125111,0.0124889,0.0123778,0.0123778,0.0122667,0.0122667,0.0122,0.0121556,0.0120889,0.0119778,0.0119111,0.0118444,0.0118,0.0117333,0.0116,0.0114222,0.0114,0.0113556,0.0113111,0.0112889,0.011,0.011,0.0109556,0.0109333,0.0109111,0.0108889,0.0108444,0.0108444,0.0108222,0.0107333,0.0107333,0.0107333,0.0107333,0.0106889,0.0106,0.0105778,0.0105556,0.0105111,0.0104889,0.0104889,0.0104,0.0103778,0.010,0.0102667,0.0102,0.00995556,0.0098,0.00975556,0.0097,0.0097,0.0097,0.0096,0.0096,0.00948889,0.0094,0.0094,0.0092,0.0092,0.0086,0.00848889,0.00846667,0.0083,0.00828889,0.00826667,0.0081,0.0077,0.00717778,0.00497778,0.0018}
histogram_bounds       | 
correlation            | 0.276451
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   | 





[GENERAL] simple query with radically different plan after 9.0 - 9.2 upgrade

2013-09-20 Thread Kevin Goess
Given this table

articles_1= \d hits_user_daily_count;
  Table public.hits_user_daily_count
  Column   |  Type   | Modifiers
---+-+---
 userid| integer | not null
 date  | date| not null
 num_hits  | integer | default 0
 num_cover_page_hits   | integer | default 0
 num_additional_files_hits | integer | default 0
Indexes:
hits_user_daily_count_pkey PRIMARY KEY, btree (userid, date)
hits_user_daily_count_date btree (date)

whose data looks like this:

articles_1= select * from hits_user_daily_count limit 5;
 userid  |date| num_hits | num_cover_page_hits |
num_additional_files_hits
-++--+-+---
 1590185 | 2013-07-10 |3 |   4 |
  0
  391907 | 2013-07-10 |   16 |  12 |
  0
 1739541 | 2013-08-03 |1 |   0 |
  0
 1798435 | 2013-07-10 |0 |   1 |
  0
 1521468 | 2013-07-10 |2 |   0 |
  0

We noticed a big change after upgrading from 9.0 to 9.2. For *yesterday's*
date, the query plan is fine, like you'd expect

articles_1= explain (analyze, buffers) select 1
from hits_user_daily_count
where userid = 1590185 and date = '2013-07-30';

QUERY PLAN
--
 Index Only Scan using hits_user_daily_count_pkey on hits_user_daily_count  (cost=0.00..8.02 rows=1 width=0) (actual time=0.02
   Index Cond: ((userid = 1590185) AND (date = '2013-07-30'::date))
   Heap Fetches: 1
   Buffers: shared hit=5
 Total runtime: 0.044 ms

but for *today's* date it looks like it's reading all the rows for that
date, 15,277 buffers:

articles_1= explain (analyze, buffers) select 1
from hits_user_daily_count
where userid = 1590185 and date = '2013-08-01';

QUERY PLAN
--
 Index Scan using hits_user_daily_count_date on hits_user_daily_count  (cost=0.00..7.92 rows=1 width=0) (actual time=11.957..1
   Index Cond: (date = '2013-08-01'::date)
   Filter: (userid = 1590185)
   Rows Removed by Filter: 20149
   Buffers: shared hit=15277
 Total runtime: 17.924 ms

(The dates in the queries are old because I've had this email in draft for
a while, but the behavior is still identical).

We've addressed the problem by running 'analyze' on the table every day at
about 1:30am. Buffer hits on that table go from about 1,000/sec to
70,000/sec between midnight and that analyze job, and then go back down to
1,000/sec and stay flat until midnight rolls around again.

Before the 9.0 - 9.2 upgrade, the behavior would be flat all day.

Any ideas what would be causing that problem?


[GENERAL] process deadlocking on its own transactionid?

2013-07-23 Thread Kevin Goess
We're seeing a problem with some of our processes hanging on locks.  The
select below makes it look like it's *waiting* for a ShareLock on
transactionid, but it *has* an ExclusiveLock on the same value in
virtualxid.

That makes it look like the process has deadlocked on its own
transactionid.  Or are we reading the results of this query wrong, and this
is expected behavior, and our problem lies elsewhere? (Yes, the process is
doing a select for update on this context_objects table according to
pg_stat_activity)

production= select locktype, database, relname, relation, virtualxid,
virtualtransaction, pid, mode, granted
from pg_locks left outer join pg_class on pg_class.oid = pg_locks.relation
where pid = 2288;
   locktype    | database |       relname        | relation |  virtualxid  | virtualtransaction | pid  |      mode       | granted
---------------+----------+----------------------+----------+--------------+--------------------+------+-----------------+---------
 relation      |    41194 | context_objects_pkey | 95318843 |              | 123/45694692       | 2288 | AccessShareLock | t
 relation      |    41194 | context_objects      |    41553 |              | 123/45694692       | 2288 | RowShareLock    | t
 virtualxid    |          |                      |          | 123/45694692 | 123/45694692       | 2288 | ExclusiveLock   | t
 transactionid |          |                      |          |              | 123/45694692       | 2288 | ShareLock       | f
 tuple         |    41194 | context_objects      |    41553 |              | 123/45694692       | 2288 | ExclusiveLock   | t
(5 rows)
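If I'm reading pg_locks right, an ungranted ShareLock on a transactionid means
the backend is waiting for some *other* transaction's xid to finish (typically
a row-lock conflict), and the transactionid column, which I didn't select
above, would show whose. Here's a sketch of the self-join I'd use to name the
holder (my own query, not from the docs):

-- which backend holds the transactionid that pid 2288 is waiting on?
SELECT waiting.pid  AS waiting_pid,
       holder.pid   AS holding_pid,
       holder.mode  AS holding_mode
  FROM pg_locks waiting
  JOIN pg_locks holder
    ON holder.locktype = 'transactionid'
   AND holder.transactionid = waiting.transactionid
   AND holder.granted
 WHERE waiting.locktype = 'transactionid'
   AND NOT waiting.granted
   AND waiting.pid = 2288;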



Re: [GENERAL] Why are stored procedures looked on so negatively?

2013-07-23 Thread Kevin Goess
On Tue, Jul 23, 2013 at 5:40 PM, Adrian Klaver adrian.kla...@gmail.com wrote:

 On 07/23/2013 05:29 PM, Some Developer wrote:

 I'm in the middle of building a database and was going to make extensive
 use of stored procedures and trigger functions because it makes more
 sense for the actions to happen at the database layer rather than in the
 app layer.

 Should I use them or not?


 Personally I figure the arguments for and against are closely correlated
 with where on the development chain you are, and are tied in with job
 security. If you are an app developer then it is in your interest to have
 code in the app; if you are a database developer, in the database.


What he says is very true. But make sure to think about the things that may
already be set up to manage the application code: versioning, testing, unit
testing, packaging, release process, and documentation--how much of that is
in place for your stored procedures and triggers?  If a developer makes a
change to application code, it gets checked in to source control, unit
tested, code reviewed, run through some QA, and staged for the next roll to
production--will all of that happen for your stored procedures?  And
consider: there is already logic in the application, so now some of it will
live in the application and some in the database--does it make sense to
have it in two places?

I think those are the kind of concerns that make people shy about putting
too much logic in the database.  None of them are insurmountable, but you
should at least think about them.


Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)

2012-03-20 Thread Kevin Goess
Thanks for looking into it, Tom.  We're using 9.0.4, so that might indeed
be the problem. What additional data (if any) would you like to see?  If
you want to look into it further, I can give you the schema, though I
hesitate to spam the whole list.  I could also mock up some tables, find
the smallest data set that shows the problem, and send you that in a dump.

The fact that the behavior changes so radically when the limit on the
joined table goes from 199 to 200 rows does make me suspect something's
not behaving the way it should.

On Tue, Mar 20, 2012 at 4:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 I wrote:
  You've still got a nasty join-size estimation error:

  -  Nested Loop  (cost=6.18..1939.43 rows=411736 width=8) (actual
  time=0.203..3.487 rows=35 loops=1)

  It's not apparent why that's so far off ...

 What PG version is this, anyway?  It strikes me that this estimation
 error might have something to do with the eqjoinsel bugs that we repaired
 in 9.0.5.  I'm not having any luck reproducing such a bogus estimate
 with current code, either, though that may just mean you've omitted
 some critical info about how the tables are set up.

regards, tom lane






Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)

2012-03-19 Thread Kevin Goess
On Mon, Mar 19, 2012 at 9:24 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:

 That means that your statistics are not accurate.

 As a first measure, you should ANALYZE the tables involved and see if
 the problem persists.  If yes, post the new plans.


Aha, thanks, that explains why my test table with one row was so bad.  But
even with all freshly ANALYZE'd tables, I still see the query reverting to
a sequential scan on that big contexts table once the number of rows in the
subselect goes over 199.  Here's a simplified version that demonstrates the
problem.

production= explain (analyze, buffers) SELECT contexts.context_key FROM
contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE
contexts.context_key IN (SELECT context_key FROM virtual_ancestors limit
200) AND articles.indexed;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=7086.13..219322.15 rows=411736 width=4) (actual time=50.118..1213.046 rows=35 loops=1)
   Hash Cond: (contexts.context_key = articles.context_key)
   Buffers: shared hit=72539 read=100104
   ->  Seq Scan on contexts  (cost=0.00..190285.83 rows=1783283 width=4) (actual time=0.040..769.891 rows=1786074 loops=1)
         Buffers: shared hit=72399 read=100054
   ->  Hash  (cost=1939.43..1939.43 rows=411736 width=8) (actual time=3.510..3.510 rows=35 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2kB
         Buffers: shared hit=140 read=50
         ->  Nested Loop  (cost=6.18..1939.43 rows=411736 width=8) (actual time=0.203..3.487 rows=35 loops=1)
               Buffers: shared hit=140 read=50
               ->  HashAggregate  (cost=6.18..8.18 rows=200 width=4) (actual time=0.174..0.198 rows=48 loops=1)
                     Buffers: shared read=2
                     ->  Limit  (cost=0.00..3.68 rows=200 width=4) (actual time=0.015..0.108 rows=200 loops=1)
                           Buffers: shared read=2
                           ->  Seq Scan on virtual_ancestors  (cost=0.00..87676.17 rows=4759617 width=4) (actual time=0.015..0.075 rows=200 loops=1)
                                 Buffers: shared read=2
               ->  Index Scan using articles_pkey on articles  (cost=0.00..9.64 rows=1 width=4) (actual time=0.015..0.068 rows=1 loops=48)
                     Index Cond: (articles.context_key = virtual_ancestors.context_key)
                     Filter: articles.indexed
                     Buffers: shared hit=140 read=48
 Total runtime: 1213.138 ms
(21 rows)


But if I write the keys in the subquery inline, I get a very nice execution
plan, all the way up to a tested maximum of about 50,000 keys:


production= explain (analyze, buffers) SELECT contexts.context_key FROM
contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE
contexts.context_key IN (2482612,2482612,...) AND articles.indexed;

QUERY PLAN
-
 Nested Loop  (cost=758.71..3418.40 rows=200 width=4) (actual time=0.621..1.089 rows=35 loops=1)
   Buffers: shared hit=826 read=1
   ->  Bitmap Heap Scan on contexts  (cost=752.58..1487.55 rows=200 width=4) (actual time=0.604..0.699 rows=48 loops=1)
         Recheck Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[]))
         Buffers: shared hit=639
         ->  Bitmap Index Scan on contexts_pkey  (cost=0.00..752.53 rows=200 width=0) (actual time=0.591..0.591 rows=200 loops=1)
               Index Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[]))
               Buffers: shared hit=600
   ->  Bitmap Heap Scan on articles  (cost=6.13..9.64 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=48)
         Recheck Cond: (articles.context_key = contexts.context_key)
         Filter: articles.indexed
         Buffers: shared hit=187 read=1
         ->  Bitmap Index Scan on articles_pkey  (cost=0.00..6.13 rows=1 width=0) (actual time=0.005..0.005 rows=1 loops=48)
               Index Cond: (articles.context_key = contexts.context_key)
               Buffers: shared hit=148
 Total runtime: 1.147 ms

Is this expected behavior, that writing the ids inline does much better
than the subquery?  I've been told that it's not, but this isn't the first
time I've seen this, so I feel like I'm not understanding something.


[GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)

2012-03-15 Thread Kevin Goess
My apologies, I'm sure this question has been asked before but I couldn't
find anything on the list that meant anything to me.

We have a table contexts with 1.6 million rows, and a table articles
with 1.4 million rows, where an article is a particular kind of
context.  We want to select from a join on those two tables like this

SELECT COUNT(*)
FROM contexts
JOIN articles ON (articles.context_key=contexts.context_key)
WHERE contexts.context_key IN (...);
/* and some combination of columns from articles and contexts */

If IN(...) is a query, then this guy does a seq scan on the contexts
table, even if the subquery is select col_a from kgtest where kgtest has
one row.  If however I read the ids beforehand and write them into the
query, a la IN (111,222,333...), then everything is happy, up to at
least 20,000 values written into the sql, at which point smaller machines
will take 2-5 minutes to parse the query.

I can certainly write the ids inline into the SQL, but when I do that I get
the distinct impression that I'm Doing It Wrong.  Is this expected
behavior?  It seems surprising to me.


To demonstrate:

/* nothing up my sleeve */
# select * from kgtest;
  cola
-
 1652729
(1 row)

/* inline, good query plan */
# explain (analyze, buffers) select count(*) from contexts JOIN articles ON
(articles.context_key=contexts.context_key) where contexts.context_key in
(1652729);
                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3.82..3.83 rows=1 width=0) (actual time=0.188..0.189 rows=1 loops=1)
   Buffers: shared hit=7
   ->  Nested Loop  (cost=0.00..3.81 rows=1 width=0) (actual time=0.181..0.181 rows=0 loops=1)
         Buffers: shared hit=7
         ->  Index Scan using contexts_pkey on contexts  (cost=0.00..1.90 rows=1 width=4) (actual time=0.109..0.112 ro
               Index Cond: (context_key = 1652729)
               Buffers: shared hit=4
         ->  Index Scan using articles_pkey on articles  (cost=0.00..1.90 rows=1 width=4) (actual time=0.060..0.060 ro
               Index Cond: (articles.context_key = 1652729)
               Buffers: shared hit=3
 Total runtime: 0.324 ms
(11 rows)

  /* subselect, query plan does seq scan on contexts */


# explain (analyze, buffers) select count(*)from contexts JOIN articles ON
(articles.context_key=contexts.context_key) where contexts.context_key in
(select cola from kgtest);
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118505.72..118505.73 rows=1 width=0) (actual time=0.274..0.275 rows=1 loops=1)
   Buffers: shared hit=5
   ->  Hash Join  (cost=12512.61..116661.91 rows=737524 width=0) (actual time=0.269..0.269 rows=0 loops=1)
         Hash Cond: (contexts.context_key = articles.context_key)
         Buffers: shared hit=5
         ->  Seq Scan on contexts  (cost=0.00..64533.03 rows=1648203 width=4) (actual time=0.009..0.009 rows=1 loops=1
               Buffers: shared hit=1
         ->  Hash  (cost=412.56..412.56 rows=737524 width=8) (actual time=0.110..0.110 rows=0 loops=1)
               Buckets: 4096  Batches: 32  Memory Usage: 0kB
               Buffers: shared hit=4
               ->  Nested Loop  (cost=40.00..412.56 rows=737524 width=8) (actual time=0.107..0.107 rows=0 loops=1)
                     Buffers: shared hit=4
                     ->  HashAggregate  (cost=40.00..42.00 rows=200 width=4) (actual time=0.069..0.071 rows=1 loops=1)
                           Buffers: shared hit=1
                           ->  Seq Scan on kgtest  (cost=0.00..34.00 rows=2400 width=4) (actual time=0.048..0.050 rows
                                 Buffers: shared hit=1
                     ->  Index Scan using articles_pkey on articles  (cost=0.00..1.84 rows=1 width=4) (actual time=0.0
                           Index Cond: (articles.context_key = kgtest.cola)
                           Buffers: shared hit=3
 Total runtime: 0.442 ms
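
As an aside on the writing-the-ids-inline part: postgres turns a literal
IN (111,222,333,...) list into a single = ANY('{...}'::integer[]) condition
anyway, so a middle ground (a sketch, not something from the thread) is to
bind the whole list as one array parameter instead of interpolating thousands
of literals into the SQL:

-- the primary-key index can still be used for = ANY(...), and there's only one value to parse
SELECT count(*)
  FROM contexts
  JOIN articles ON (articles.context_key = contexts.context_key)
 WHERE contexts.context_key = ANY ('{1652729}'::integer[]);
-- from application code: ... WHERE contexts.context_key = ANY ($1::integer[]), binding the ids as an array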
