Re: [PERFORM] Advice about how to delete

2007-07-06 Thread Heikki Linnakangas

Arnau wrote:

  CREATE TABLE user_groups
  (
user_id   INT8 REFERENCES users(user_id),
group_id  INT8 REFERENCES groups(group_id),
CONSTRAINT pk PRIMARY KEY ( user_id, group_id )
  )

  CREATE INDEX idx_user_id ON user_groups( user_id );


The primary key implicitly creates an index on (user_id, group_id), so 
you probably don't need this additional index.
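
For illustration (a sketch, since the index is actually created
implicitly), the primary key above is enforced by a unique index
equivalent to:

  CREATE UNIQUE INDEX pk ON user_groups (user_id, group_id);

and because user_id is its leading column, it can already serve lookups
on user_id alone.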


  This works quite fast with small groups, but when the group has a 
large number of users it takes too much time. The delete_group action 
is fired from the user interface of the application.


It looks like you're not deleting rows from user_groups when a group is 
deleted. Perhaps the table definition you posted is missing ON DELETE 
CASCADE on the foreign key declarations?
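
For example, a sketch of the same table with cascading deletes:

  CREATE TABLE user_groups
  (
    user_id   INT8 REFERENCES users(user_id) ON DELETE CASCADE,
    group_id  INT8 REFERENCES groups(group_id) ON DELETE CASCADE,
    CONSTRAINT pk PRIMARY KEY ( user_id, group_id )
  );

With that in place, deleting a group automatically removes its
user_groups rows.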


I would implement this with triggers. Use ON DELETE CASCADE to take 
care of deleting rows from user_groups, and create an AFTER DELETE 
trigger on user_groups to delete orphan rows. Like this:


CREATE OR REPLACE FUNCTION delete_orphan_users () RETURNS trigger AS $$
  BEGIN
    -- does the user still belong to any group?
    PERFORM * FROM user_groups ug WHERE ug.user_id = OLD.user_id;
    IF NOT FOUND THEN
        DELETE FROM users WHERE users.user_id = OLD.user_id;
    END IF;

    RETURN NULL;
  END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS d_usergroup ON user_groups;
CREATE TRIGGER d_usergroup AFTER DELETE ON user_groups FOR EACH ROW 
EXECUTE PROCEDURE delete_orphan_users();


This might not be significantly faster, but it's easier to work with.
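
With the cascading foreign keys and the trigger in place, deleting a
group from the application becomes a single statement (10 is just an
example group id):

  DELETE FROM groups WHERE group_id = 10;

The cascade removes the group's user_groups rows, and the trigger then
cleans up any users left without a group.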


  Do you have any idea about how I could improve the performance of this?


Michael Glaesemann's idea of deleting all orphan users with a single 
statement is a good one, though you'll want to refine it a bit so that 
you don't do a full table scan every time. Perhaps like this, before 
deleting rows from user_groups:


-- 10 stands for the group being deleted
DELETE FROM users WHERE user_id IN (
  SELECT u.user_id FROM users u
  LEFT OUTER JOIN user_groups ug
    ON (u.user_id = ug.user_id AND ug.group_id <> 10)
  WHERE ug.group_id IS NULL
    AND u.user_id IN (SELECT user_id FROM user_groups WHERE group_id = 10)
);

Or maybe you could just leave the orphans in the table, and delete them 
later in batch?
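
For instance, a nightly job could run something like:

  DELETE FROM users u
  WHERE NOT EXISTS
    (SELECT 1 FROM user_groups ug WHERE ug.user_id = u.user_id);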


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] Direct I/O

2007-07-06 Thread Jignesh K. Shah
On Solaris you just look at the mount options on the file system and see
whether the forcedirectio option is enabled. Since PostgreSQL doesn't
use any special options to enable direct I/O itself, that's the usual
way to check on Solaris. At least on Solaris, performance over a
buffered filesystem is better for many workloads, but not always. You
also typically see a small reduction in (system) CPU usage and, of
course, memory use.

However, depending on the workload, you may see increased write latency.
Generally that's not a problem for many workloads, since it's the
multiple writes to the same file that work better with concurrentio
(a modified directio) on Solaris.

As for Linux, I will leave that to other experts...

-Jignesh


lai yoke hman wrote:
> Hello,
> How can I tell whether my PostgreSQL 8 is using direct I/O or buffered 
> I/O? If it is using buffered I/O, how can I enable direct I/O? What is 
> the performance difference between them?
> This is urgent. Thanks.



Re: [PERFORM] Advice about how to delete

2007-07-06 Thread Arnau

Hi Michael,

Michael Glaesemann wrote:


On Jul 6, 2007, at 9:42 , Arnau wrote:

  I have the following scenario: I have users and groups, where a user 
can belong to n groups and a group can have n users. A user must 
belong to at least one group, so when I delete a group I must make 
sure no orphan users are left. To do this I have something like this:




  IF v_count = 1 THEN
DELETE FROM users WHERE user_id = result.user_id;
v_deleted = v_deleted + 1;
  END IF;


Am I right in reading that you're deleting any users that would become 
orphans? If so, you can just delete the orphans afterwards rather than 
deleting them beforehand (untested):


-- delete the group's user_groups rows
DELETE FROM user_groups
WHERE group_id = p_group_id;

-- delete users that don't belong to any group
DELETE FROM users
WHERE user_id IN (
    SELECT u.user_id
    FROM users u
    LEFT JOIN user_groups ug ON u.user_id = ug.user_id
    WHERE ug.group_id IS NULL);

This should execute pretty quickly. You don't need to loop over any 
results. Remember, SQL is a set-based language, so if you can pose your 
question in a set-based way, you can probably find a pretty good, 
efficient solution.


  I have tested your solution and it's much worse than mine.

  My test database has about 254,000 users and about 30 groups. The 
test I ran was removing a group with 258 users: my solution took about 
3 seconds, while yours still hadn't finished after 20 seconds. Of 
course the test machine is an old Celeron with little RAM, but it does 
the job as a test machine.


Thank you very much
--
Arnau



Re: [PERFORM] Advice about how to delete

2007-07-06 Thread Michael Glaesemann


On Jul 6, 2007, at 9:42 , Arnau wrote:

  I have the following scenario: I have users and groups, where a user 
can belong to n groups and a group can have n users. A user must 
belong to at least one group, so when I delete a group I must make 
sure no orphan users are left. To do this I have something like this:




  IF v_count = 1 THEN
DELETE FROM users WHERE user_id = result.user_id;
v_deleted = v_deleted + 1;
  END IF;


Am I right in reading that you're deleting any users that would become 
orphans? If so, you can just delete the orphans afterwards rather than 
deleting them beforehand (untested):


-- delete the group's user_groups rows
DELETE FROM user_groups
WHERE group_id = p_group_id;

-- delete users that don't belong to any group
DELETE FROM users
WHERE user_id IN (
    SELECT u.user_id
    FROM users u
    LEFT JOIN user_groups ug ON u.user_id = ug.user_id
    WHERE ug.group_id IS NULL);

This should execute pretty quickly. You don't need to loop over any  
results. Remember, SQL is a set-based language, so if you can pose  
your question in a set-based way, you can probably find a pretty  
good, efficient solution.


Michael Glaesemann
grzm seespotcode net





[PERFORM] Advice about how to delete

2007-07-06 Thread Arnau

Hi all,

  I have the following scenario: I have users and groups, where a user 
can belong to n groups and a group can have n users. A user must 
belong to at least one group, so when I delete a group I must make 
sure no orphan users are left. To do this I have something like this:


  CREATE TABLE users
  (
    user_id    SERIAL8 PRIMARY KEY,
    user_name  VARCHAR(50)
  );

  CREATE TABLE groups
  (
    group_id    SERIAL8 PRIMARY KEY,
    group_name  VARCHAR(50)
  );

  CREATE TABLE user_groups
  (
    user_id   INT8 REFERENCES users(user_id),
    group_id  INT8 REFERENCES groups(group_id),
    CONSTRAINT pk PRIMARY KEY ( user_id, group_id )
  );

  CREATE INDEX idx_user_id ON user_groups( user_id );
  CREATE INDEX idx_group_id ON user_groups( group_id );

  CREATE OR REPLACE FUNCTION delete_group( INT8 ) RETURNS INTEGER AS $$
  DECLARE
    p_groupid ALIAS FOR $1;
    v_deleted INTEGER;
    v_count   INTEGER;
    result    RECORD;

  BEGIN
    v_deleted = 0;

    FOR result IN SELECT user_id FROM user_groups
                  WHERE group_id = p_groupid
    LOOP

      -- how many groups does this user belong to?
      SELECT INTO v_count COUNT(user_id) FROM user_groups
        WHERE user_id = result.user_id;

      -- if this group is the only one, the user would be orphaned
      IF v_count = 1 THEN
        DELETE FROM users WHERE user_id = result.user_id;
        v_deleted = v_deleted + 1;
      END IF;

    END LOOP;

    DELETE FROM groups WHERE group_id = p_groupid;

    RETURN v_deleted;
  END;
  $$ LANGUAGE plpgsql;


  This works quite fast with small groups, but when the group has a 
large number of users it takes too much time. The delete_group action 
is fired from the user interface of the application.


  Do you have any idea about how I could improve the performance of this?

Thanks all
--
Arnau



Re: [PERFORM] Direct I/O

2007-07-06 Thread Alvaro Herrera
lai yoke hman wrote:

> How can I tell whether my PostgreSQL 8 is using direct I/O or buffered
> I/O? If it is using buffered I/O, how can I enable direct I/O? What is
> the performance difference between them?

1. it is buffered
2. you can't
3. there isn't any because there isn't direct I/O

Unless you mess with the filesystem features, at which point I shut up
or they shut me down.

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PERFORM] improvement suggestions for performance design

2007-07-06 Thread Kalle Hallivuori

Hi Thomas & all,

2007/7/6, Thomas Finneid <[EMAIL PROTECTED]>:


Heikki Linnakangas wrote:
> ISTM that a properly normalized schema would look something like this:


[example of tables per attr referencing main table containing only primary key]


I agree that this is a way it could be done.


Indeed. Another way is postgresql-specific:

create table position (
  posX int not null,
  posY int not null
);

create table population ( population int not null ) inherits (position);
-- optionally:
create unique index population_idx on population (posX, posY, population);

This leads to each attr table inheriting posX, posY from position; you
never insert anything into position itself, but you can use it to list
all positions that have any attributes in any of the inherited tables
(in that sense it's a view over all its children).
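
For example, listing every position that has at least one attribute
becomes a plain query on the parent (a sketch, given the schema above):

  select distinct posX, posY from position;

since a select on the parent table also scans all inheriting tables.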


In any case, there is no point in having one table per attribute, as
some attributes are logically grouped and can therefore be grouped
together in one table. Since there are 5-10 groups of attributes, 5-10
tables would be enough.


This sounds very sensible. This way you would send only 1 table (or
procedure, or prepared statement) name instead of as many attr_types
as you have attributes in a group.

So instead of calling 'your_procedure(type, posX, posY, data_type,
value)' separately for each of the 5 values, you would call
'attrgroup_and_datatype_specific_procedure(posX, posY, value1, value2,
value3, value4, value5)' once. Inside the procedure, the inserts change
from 'insert into attribute_values values (type, posX, posY, data_type,
value)' to 'insert into attrgroup_and_datatype_specific_table values
(posX, posY, value1, value2, value3, value4, value5)' -- so you save
four inserts, and each inserted value carries 2/5 of an extra field
instead of 4. You are allowed to use shorter names for the tables and
procedures ;) A sketch follows below.
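
A minimal sketch of what that could look like (table, procedure and
column names here are made up for illustration):

  create table attrgroup1_int (
    posX int not null,
    posY int not null,
    value1 int, value2 int, value3 int, value4 int, value5 int
  );

  create or replace function insert_attrgroup1_int(
      p_posX int, p_posY int,
      p_v1 int, p_v2 int, p_v3 int, p_v4 int, p_v5 int)
  returns void as $$
  begin
    insert into attrgroup1_int
      values (p_posX, p_posY, p_v1, p_v2, p_v3, p_v4, p_v5);
  end;
  $$ language plpgsql;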

It should be trivial to hide this separation in the client; you could
even create new tables for new kinds of attribute-datatype combinations
automatically on the fly.


They have been treating their data this way for the last 20 years, and
there is nothing on the horizon that tells either them or me that it
will be any different in the next 10 years. So I am not sure I need to
plan for that.


Is it possible that normalization was originally skipped because of a
lack of resources, a lack of knowledge of the nature of the data to be
imported, or a lack of dynamism in the original tools (such as creating
type-specific tables on the fly)? Those constraints may no longer
apply, or may at least be worth a second look.


> A normalized schema will also take less space, which means less I/O and
> more performance,

That is what I am trying to find out: whether it is true for this
scenario as well.


Well, you're currently storing four extra ints with each value, when
you only need two per group of 5-10 values.

If you happen to save numerical data as the value in the text field
for some data_types, you are losing a lot more.


> because there's no need to store metadata like the
> data_type, attr_type on every row.

data_type and attr_type are not decorative metadata; they are actively
used as query parameters for each attribute. If they were not there, I
would not be able to perform the queries I need to do.


You can still express them as table or column names rather than extra
data per row.


> Before you start fiddling with functions, I'd suggest that you try
> batching the inserts with the JDBC PreparedStatement batch facility.

I have done that; now I need something to compare it against,
preferably a function written in plpgsql and one in C.
So, any other suggestions on how to efficiently bulk-transfer the data
to the db for insertion?


COPY is many times faster than INSERT:
http://www.postgresql.org/docs/8.1/interactive/sql-copy.html

If you can't just push the data straight into the final table with
COPY, push it into a temporary table that you go through with the
database procedure.
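
A sketch of that approach (the staging table and file path are just
placeholders):

  COPY attr_staging (posX, posY, value1, value2, value3, value4, value5)
      FROM '/path/to/incoming.data';

  -- then merge from attr_staging into the final tables in one pass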

Shameless plug: If you use Java and miss COPY functionality in the
driver, it's available at

http://kato.iki.fi/sw/db/postgresql/jdbc/copy/

I was able to practically nullify time spent inserting with that.


Well, it has been tested and shown to make postgres perform much
better, i.e. 100,000 inserts split between 4 threads completed much
faster than with a single thread alone.


Sounds interesting. The results should still end up written into the
same table, so are you sure this didn't just take the same total time
at the server end? And would that even matter to you?

We ended up having the best results with sequential batches of around
10,000 rows each.


> BTW, I concur with Y Sidhu that with data volumes as high as you have,
> partitioning is a good idea.

Yes, I will be looking into to it.


Depending on the distribution of your data, saving each attribute group
(or datatype, or both) to its own table will take you some way in the
same direction.

If you have no indexes and do no deletes (as seems to be the case for
you), the size of the table might not matter much