Richard Huxton wrote:
Adam Gundy wrote:
On Thu, Apr 10, 2008 at 8:52 AM, Adam Gundy <[EMAIL PROTECTED]> wrote:
Richard Huxton wrote:
How many distinct values do you have in groups.groupid and
group_access.group_id?
 for the small database (since it shows the same problem):

 group_access: 280/268
 groups: 2006/139

 for the large database:

 group_access: same
 groups: 1712647/140

the groupid key is an MD5 hash, so it should be uniformly distributed.
maybe that throws off the stats? but, again, char works, varchar doesn't :-(

OK, I'm thinking the varchar/char part is not the issue.

Good, because it's not :-)

hmm. unfortunately it did turn out to be (part of) the issue. I've discovered that mixing char and varchar in a stored procedure does not coerce the types, and it ends up doing seq scans all the time.

changing something like this:

create function x ( y char(32) ) returns setof groups as $$
begin
    return query select * from groups where groupid = y;
end;
$$ language plpgsql;

into this:

create function x ( y varchar(32) ) returns setof groups as $$
begin
    return query select * from groups where groupid = y;
end;
$$ language plpgsql;

and suddenly postgres does index lookups in the stored proc... way faster.
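
(for reference, one way to see which plan each parameter type gets, without going through the proc, is a prepared statement of each type -- the hash below is just a placeholder, substitute a real groupid:)

prepare p_char (char(32)) as select * from groups where groupid = $1;
explain execute p_char ('d41d8cd98f00b204e9800998ecf8427e');

prepare p_varchar (varchar(32)) as select * from groups where groupid = $1;
explain execute p_varchar ('d41d8cd98f00b204e9800998ecf8427e');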

I *think* that when I analyze using char instead of varchar, it is
recording a stat for the large group, but for some reason, with varchar,
it doesn't add a stat for that one.

so, the real question is, how do I fix this? I can turn the stats way
up to 1000, but that doesn't guarantee that I'll get a stat for the
large group :-(

Well, by default it will be tracking the 10 most common values (and how often they occur). As you say, this can be increased to 1000 (although it obviously takes longer to check 1000 rather than 10).
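
If you want to experiment, the per-column target can be raised and the table re-analyzed - something like this, assuming the column the planner is misjudging is groups.groupid:

ALTER TABLE groups ALTER COLUMN groupid SET STATISTICS 1000;
ANALYZE groups;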

We can have a look at the stats with something like:
SELECT * FROM pg_stats WHERE tablename='group_access' AND attname='uid';
You'll be interested in n_distinct, most_common_vals and most_common_freqs.

However, I think the problem may be that PG doesn't track cross-column stats, so it doesn't know that a particular uid implies one or more particular groupid values.

I doubt we could get stats stable enough for this. the number of groups will hopefully be much larger at some point.

it's a shame the index entries can't be used to provide information to the planner, e.g. a rough count of the number of entries for a given key (or subset). it would be nice to be able to create, say, a counted btree as a planner hint when you know you have this kind of data.

can I turn the statistics off completely for this column? I'm guessing
that if I can, that will mean it takes a guess based on the number of
distinct values in the groups table, which is still a large number of
records, possibly enough to trigger the seqscan anyway.

No - can't disable stats. Besides, you want it the other way around - index scans for all groups except the largest.

actually, disabling seqscan at the server level gives extremely good response times. I ended up rewriting a few queries that were scanning the whole group for no good reason, and bitmap index hashing seems to take care of things nicely.
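
(for reference, that's just the planner setting, set either per session for testing or globally in postgresql.conf followed by a reload - it only discourages seqscans, it doesn't forbid them:)

-- per session, eg from psql
SET enable_seqscan = off;

# or in postgresql.conf
enable_seqscan = off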

queries have gone from 30+ seconds to < 0.1 seconds.

does postgres have a way of building a 'counted index' that the
planner can use for its record counts? some way of forcibly
maintaining a stat for every group?

No, but let's see what's in pg_stats.

no real help there. either the stats happen to include the group being read, and it makes a good plan, or they don't, and it tries to seqscan (unless I disable it). even forcing the stats target to 1000 only bandaids the situation, given the number of groups will eventually exceed that...
