Re: [PERFORM] NOT LIKE much faster than LIKE?

2006-01-11 Thread Andrea Arcangeli
On Tue, Jan 10, 2006 at 10:46:53AM -0500, Tom Lane wrote:
 Not with that data, but maybe if you increased the statistics target for
 the column to 100 or so, you'd catch enough values to get reasonable
 results.

Sorry, I'm not expert with postgresql, could you tell me how to increase
the statistic target?

In another email you said you applied a patch to CVS, please let me know
if you've anything to test for me, and I'll gladly test it immediately
(I've a sandbox so it's ok even if it corrupts the db ;).

Thanks!

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] NOT LIKE much faster than LIKE?

2006-01-11 Thread Andrea Arcangeli
On Wed, Jan 11, 2006 at 09:07:45AM +, Simon Riggs wrote:
 I would suggest we do this only when all of these are true
 - when accessing more than one table, so the selectivity could effect a
 join result

FWIW my problem only happens if I join: on the main table where the
kernel_version string is stored (without joins), everything is always
blazing fast. So this requirement certainly sounds fine to me.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] NOT LIKE much faster than LIKE?

2006-01-11 Thread Andrea Arcangeli
On Wed, Jan 11, 2006 at 12:40:32PM -0600, Jim C. Nasby wrote:
 On Tue, Jan 10, 2006 at 02:44:47AM +0100, Andrea Arcangeli wrote:
  cooperative runs WHERE kernel_version NOT LIKE '%% PREEMPT %%', while
  preempt runs WHERE kernel_version LIKE '%% PREEMPT %%'. The only 
  difference
 
 One thing you could do is change the like to:
 
 WHERE position(' PREEMPT ' in kernel_version) != 0

That alone fixed it, with this I don't even need the index (yet). Thanks
a lot.

 And then create a functional index on that:
 
 CREATE INDEX indexname ON tablename ( position(' PREEMPT ' in kernel_version) 
 );

The index only helps the above query with = 0 and not the one with != 0,
but it seems not needed in practice.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] NOT LIKE much faster than LIKE?

2006-01-11 Thread Andrea Arcangeli
On Wed, Jan 11, 2006 at 09:39:47PM +0100, Andrea Arcangeli wrote:
 On Wed, Jan 11, 2006 at 12:40:32PM -0600, Jim C. Nasby wrote:
  On Tue, Jan 10, 2006 at 02:44:47AM +0100, Andrea Arcangeli wrote:
   cooperative runs WHERE kernel_version NOT LIKE '%% PREEMPT %%', while
   preempt runs WHERE kernel_version LIKE '%% PREEMPT %%'. The only 
   difference
  
  One thing you could do is change the like to:
  
  WHERE position(' PREEMPT ' in kernel_version) != 0
 
 That alone fixed it, with this I don't even need the index (yet). Thanks
 a lot.

The fix is online already w/o index:

http://klive.cpushare.com/?branch=allscheduler=preemptive

Of course I'm still fully available to test any fix for the previous
LIKE query if there's interest.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] NOT LIKE much faster than LIKE?

2006-01-10 Thread Andrea Arcangeli
On Tue, Jan 10, 2006 at 10:11:18AM -0500, Greg Stark wrote:
 
 Andrea Arcangeli [EMAIL PROTECTED] writes:
 
  Fixing this with proper stats would be great indeed. What would be the
  most common value for the kernel_version? You can see samples of the
  kernel_version here http://klive.cpushare.com/2.6.15/ .  That's the
  string that is being searched against both PREEMPT and SMP.
 
 Try something like this where attname is the column name and tablename is,
 well, the tablename:
 
 db= select most_common_vals from pg_stats where tablename = 'region' and 
 attname = 'province';
  most_common_vals 
 --
  {ON,NB,QC,BC}

Thanks for the info!

klive= select most_common_vals from pg_stats where tablename = 'klive' and 
attname = 'kernel_version';


most_common_vals



 {#1 Tue Sep 13 14:56:15 UTC 2005,#1 Fri Aug 19 11:58:59 UTC 2005,#7 SMP 
Fri Oct 7 15:56:41 CEST 2005,#1 SMP Fri Aug 19 11:58:59 UTC 2005,#2 Thu Sep 
22 15:58:44 CEST 2005,#1 Fri Sep 23 15:32:21 GMT 2005,#1 Fri Oct 21 
03:46:55 EDT 2005,#1 Sun Sep 4 13:45:32 CEST 2005,#5 PREEMPT Mon Nov 21 
17:53:59 EET 2005,#1 Wed Sep 28 19:15:10 EDT 2005}
(1 row)

klive= select most_common_freqs from pg_stats where tablename = 'klive' and 
attname = 'kernel_version';
 most_common_freqs  
   
---
 
{0.013,0.0116667,0.011,0.009,0.0073,0.0067,0.0063,0.006,0.006,0.0057}
(1 row)

klive= 

There's only one preempt near the end, not sure if it would work?

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PERFORM] NOT LIKE much faster than LIKE?

2006-01-09 Thread Andrea Arcangeli
Hello,

I've a performance problem with the planner algorithm choosen in a website.
See the difference between this:

http://klive.cpushare.com/?scheduler=cooperative

and this:

http://klive.cpushare.com/?scheduler=preemptive

(note, there's much less data to show with preemptive, so it's not because of
the amount of data to output)

The second takes ages to complete and it overloads the server as well at 100%
cpu load for several seconds.

cooperative runs WHERE kernel_version NOT LIKE '%% PREEMPT %%', while
preempt runs WHERE kernel_version LIKE '%% PREEMPT %%'. The only difference
is a NOT before LIKE. No other differences at all.

The planner does apparently a big mistake using the nested loop in the LIKE
query, it should use the hash join lik in the NOT LIKE query instead.

I guess I can force it somehow (I found some docs about it here:

http://www.postgresql.org/docs/8.1/static/runtime-config-query.html

) but it looks like something could be improved in the default mode too, so I'm
reporting the problem since it looks a performance bug to me. It just makes no
sense to me that the planner takes a difference decision based on a not. It
can't know if it's more likely or less likely, this is a boolean return, it's
*exactly* the same cost to run it. Making assumption without user-provided
hints looks a mistake. I never said to the db that not like is more or less
likely to return data in output than like.

Tried ANALYZE, including VACUUM FULL ANALYZE and it doesn't make a difference.

Perhaps it's analyze that suggests to use a different algorithm with not like
because there's much more data to analyze with not like than with like, but
that algorithm works much better even when there's less data to analyze.

Indexes don't make any visible difference.

postgres is version 8.1.2 self compiled from CVS 8.1 branch of yesterday.

[EMAIL PROTECTED]:~ psql -V
psql (PostgreSQL) 8.1.2
contains support for command-line editing
[EMAIL PROTECTED]:~ 

The problem is reproducible on the shell, I only need to remove explain. Of
course explain is wrong about the cost, according to explain the first query is
cheaper when infact it's an order of magnitude more costly.

klive= explain SELECT class, vendor, device, revision, COUNT(*) as nr FROM pci 
NATURAL INNER JOIN klive WHERE kernel_version LIKE '%% PREEMPT %%' GROUP BY 
class, vendor, device, revision;
  QUERY PLAN
--
 HashAggregate  (cost=1687.82..1687.83 rows=1 width=16)
   -  Nested Loop  (cost=235.86..1687.81 rows=1 width=16)
 -  Seq Scan on klive  (cost=0.00..1405.30 rows=1 width=8)
   Filter: ((kernel_version)::text ~~ '%% PREEMPT %%'::text)
 -  Bitmap Heap Scan on pci  (cost=235.86..282.32 rows=15 width=24)
   Recheck Cond: (pci.klive = outer.klive)
   -  Bitmap Index Scan on pci_pkey  (cost=0.00..235.86 rows=15 
width=0)
 Index Cond: (pci.klive = outer.klive)
(8 rows)

klive= explain SELECT class, vendor, device, revision, COUNT(*) as nr FROM pci 
NATURAL INNER JOIN klive WHERE kernel_version NOT LIKE '%% PREEMPT %%' GROUP BY 
class, vendor, device, revision;
   QUERY PLAN

 HashAggregate  (cost=3577.40..3612.00 rows=2768 width=16)
   -  Hash Join  (cost=1569.96..3231.50 rows=27672 width=16)
 Hash Cond: (outer.klive = inner.klive)
 -  Seq Scan on pci  (cost=0.00..480.73 rows=27673 width=24)
 -  Hash  (cost=1405.30..1405.30 rows=22263 width=8)
   -  Seq Scan on klive  (cost=0.00..1405.30 rows=22263 width=8)
 Filter: ((kernel_version)::text !~~ '%% PREEMPT %%'::text)
(7 rows)

klive= 

Hints welcome, thanks!


PS. All the source code of the website where I'm reproducing the problem is
available at the above url under the GPL.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] NOT LIKE much faster than LIKE?

2006-01-09 Thread Andrea Arcangeli
On Tue, Jan 10, 2006 at 10:29:05AM +0800, Christopher Kings-Lynne wrote:
  UNLIKELY string LIKE '%% PREEMPT %%'
 
 or:
 
  LIKELY string NOT LIKE '%% PREEMPT %%'
 
 You should be using contrib/tsearch2 for an un-anchored text search perhaps?

If I wanted to get the fastest speed possible, then I think parsing it
with python and storing true/false in a boolean like suggested before
would be better and simpler as well for this specific case.

However I don't need big performance, I need just decent performance, and it
annoys me that there heurisics where the LIKE query assumes little data
will be selected. There's no way to know that until proper stats are
recorded on the results of the query. The default should be good enough
to use IMHO, and there's no way to know if NOT LIKE or LIKE will return
more data, 50% should be assumed for both if no runtime information is
available IMHO.

IIRC gcc in a code like if (something) {a}  else {b} assumes that a is
more likely to be executed then b, but that's because it's forced to
choose something. Often one is forced to choose what is more likely
between two events, but I don't think the above falls in this case. I
guess the heuristic really wanted to speed up the runtime of LIKE, when
it actually made it a _lot_ worse. No heuristic is better than an
heuristic that falls apart in corner cases like the above LIKE.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] NOT LIKE much faster than LIKE?

2006-01-09 Thread Andrea Arcangeli
On Mon, Jan 09, 2006 at 09:54:44PM -0500, Tom Lane wrote:
 Extrapolating from the observation that the heuristics don't work well
 on your data to the conclusion that they don't work for anybody is not
 good logic.  Replacing that code with a flat 50% is not going to happen
 (or if it does, I'll be sure to send the mob of unhappy users waving
 torches and pitchforks to your door not mine ;-)).

I'm not convinced but of course I cannot exclude that some people may be
depending on this very heuristic. But I consider this being
bug-compatible, I've an hard time to be convinced that such heuristic
isn't going to bite other people like it did with me.

 I did just think of something we could improve though.  The pattern
 selectivity code doesn't make any use of the statistics about most
 common values.  For a constant pattern, we could actually apply the
 pattern test with each common value and derive answers that are exact
 for the portion of the population represented by the most-common-values
 list.  If the MCV list covers a large fraction of the population then
 this would be a big leg up in accuracy.  Dunno if that applies to your
 particular case or not, but it seems worth doing ...

Fixing this with proper stats would be great indeed. What would be the
most common value for the kernel_version? You can see samples of the
kernel_version here http://klive.cpushare.com/2.6.15/ .  That's the
string that is being searched against both PREEMPT and SMP.

BTW, I also run a LIKE '%% SMP %%' a NOT LIKE '%% SMP %%' but that runs
fine, probably because as you said in the first email PREEMPT is long
but SMP is short.

Thanks!

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org