[GENERAL] Q: Table scans on set difference

2006-07-14 Thread G. Ralph Kuntz, MD
What's happening here?
 
I have two tables, encounter_properties_table with about 10 rows and
xfiles with about 50 rows. The structures of these tables is as follows:
 
   Table public.encounter_properties_table
 Column |   Type   | Modifiers
+--+---
 timestamp  | timestamp with time zone | not null
 practice_id| integer  | not null
 patient_id | bigint   | not null
 properties | text |
 modified_by| bigint   | not null
 client_version | integer  |
 file_name  | character varying(255)   |
Indexes:
encounter_properties_table_pkey primary key, btree (patient_id)
fn_ix btree (file_name)
 
and
 
 Table public.xfiles
  Column  |  Type  | Modifiers
--++---
 filename | character varying(100) | not null
Indexes:
xfiles_ix1 btree (filename)
 
The following query shows that PostgreSQL 7.4 is doing table scans on both
tables:
 
explain select file_name from encounter_properties_table where file_name not
in (select filename from xfiles);
   QUERY PLAN


 Seq Scan on encounter_properties_table  (cost=0.00..1030610198.10
rows=85828 width=58)
   Filter: (NOT (subplan))
   SubPlan
 -  Seq Scan on xfiles  (cost=0.00..10755.44 rows=500944 width=59)
(4 rows)
 
I ran vacumm analyze on both tables.
 
We aborted this query when it had not finished after 4 hours.
 
We ran the same query on SQLServer 2005 with the same data and it took under
one second to finish.
 
Any ideas?

BEGIN:VCARD
VERSION:2.1
N:Kuntz;G. Ralph
FN:G. Ralph Kuntz ([EMAIL PROTECTED])
ORG:meridianEMR, Inc
TITLE:Chief Software Architect
TEL;WORK;VOICE:(973) 994-3220
TEL;HOME;VOICE:(973) 989-4392
TEL;CELL;VOICE:(973) 214-4464
TEL;WORK;FAX:(973) 994-0027
ADR;WORK:;;354 Eisenhower Parkway;Livingston;NJ;07039;United States
LABEL;WORK;ENCODING=QUOTED-PRINTABLE:354 Eisenhower Parkway=0D=0ALivingston, NJ 07039=0D=0AUnited States
EMAIL;PREF;INTERNET:[EMAIL PROTECTED]
REV:20051130T173408Z
END:VCARD


PGP.sig
Description: PGP signature


Re: [GENERAL] Q: Table scans on set difference

2006-07-14 Thread Alban Hertroys

G. Ralph Kuntz, MD wrote:

What's happening here?
 
I have two tables, encounter_properties_table with about 10 rows and

xfiles with about 50 rows. The structures of these tables is as follows:



 file_name  | character varying(255)   |



 Table public.xfiles
  Column  |  Type  | Modifiers
--++---
 filename | character varying(100) | not null


These columns are of different types, you're forcing a typecast on every 
row comparison; I think the varchar(100)'s will be upscaled to 
varchar(255) on comparison.


My advice: use the text type. It's more flexible (practically no size 
limit) and faster.



explain select file_name from encounter_properties_table where file_name not
in (select filename from xfiles);


What about:
explain select file_name from encounter_properties_table
where not exists (
select file_name from xfiles where filename = file_name);

I often even use select 1 - a constant - because I'm not interested in 
the value, but apparently selecting a column is marginally faster than 
selecting a constant. Testing will prove it, I thought I'd mention the 
possibilit.



I ran vacumm analyze on both tables.
 
We aborted this query when it had not finished after 4 hours.


Probably due to the type cast.
We used to run into this problem when using bigint index columns. We 
changed them into int (which was sufficient) and the speed went up a lot.
Later we determined - with input from this list - that the cause wasn't 
the size of the column but the type casting required to match the 
constant integer values in our queries. In our case explicit casting of 
our constant values helped.



We ran the same query on SQLServer 2005 with the same data and it took under
one second to finish.


--
Alban Hertroys
[EMAIL PROTECTED]

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Q: Table scans on set difference

2006-07-14 Thread Tom Lane
Alban Hertroys [EMAIL PROTECTED] writes:
 G. Ralph Kuntz, MD wrote:
 explain select file_name from encounter_properties_table where file_name not
 in (select filename from xfiles);

 What about:
 explain select file_name from encounter_properties_table
 where not exists (
   select file_name from xfiles where filename = file_name);

If you only need the file name, an EXCEPT would probably work much
better:

select file_name from encounter_properties_table
except
select filename from xfiles;

Another possibility is to abuse the outer join machinery:

select file_name, ... from 
encounter_properties_table l left join xfiles r
  on l.file_name = r.filename
where r.filename is null;

Generally speaking, NOT IN performance is going to suck unless the
sub-select is small enough to fit in a hashtable.  You could consider
increasing work_mem enough that it would fit, but with 500K filenames
needed, that's probably not going to win.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq