[GENERAL] Using PostgreSQL for NSS databases
I'm making this post here in hopes I may save someone from beating their head against the wall like I did... I am writing a custom Name Service Switch (NSS) module to take advantage of already-existing account information in a PostgreSQL database. Under certain circumstances, processes will hang due to non-recursive mutex locking during PG connection creation. It goes something like this:

/etc/nsswitch.conf:

    passwd: files mypgmod
    group:  files mypgmod

[process with euid not found in /etc/passwd, looking up another
 username not found in /etc/passwd]

    getpwnam_r(username, ...)
      // nss doesn't find user in files module (/etc/passwd),
      // so uses mypgmod
      _nss_mypgmod_getpwnam_r(username, ...)
        pthread_mutex_lock(...)  // to protect PG connection, then...
        PQconnectdb(...)
          // any number of reasons to look up info in calling user's
          // home directory, e.g., default password, ssl certs, etc.,
          // resulting in call to getpwuid_r() to find user's homedir.
          getpwuid_r(geteuid(), ...)
            // nss doesn't find user in files module (/etc/passwd),
            // so uses mypgmod
            _nss_mypgmod_getpwuid_r(uid, ...)
              pthread_mutex_lock(...)  // to protect PG connection

              /* HANG */

* * *

The fix, if you can call it that, is to run nscd, the NSS caching daemon. Since nscd typically runs as root (an account found in /etc/passwd), the second lookup will not generate a second connection attempt.

Also, many Linux distros carry a libnss-pgsql2 package which suffers from exactly the same problem (there are some scattered posts on the 'net about it). Same fix: run nscd.

Cheers,

Daniel Popowich
Re: [GENERAL] Schema version control
Glenn Maynard writes:

> - Adding a NOT NULL constraint (without adding a DEFAULT). You often
>   want to precede this with filling in any existing NULL values, so
>   the new constraint doesn't fail.
> - Updating triggers, functions and their effects. For example, when I
>   have an FTS index with a trigger to update an index column, and I
>   change the underlying trigger, I often do something like UPDATE
>   table SET column = column, to cause all of the update triggers to
>   fire and recalculate the index columns.
> - Creating a new column based on an old one, and removing the old
>   one; e.g., add a column n, run UPDATE ... SET n = i*j*2, and then
>   drop the old columns i and j.
> - Updating data from an external source, such as ORM model code; for
>   example, if you have a table representing external files, an update
>   may want to calculate and update the SHA-1 of each file.
> - For efficiency, dropping a specific index while making a large
>   update, and then recreating the index.
>
> In my experience, while generating schema updates automatically is
> handy, it tends to make nontrivial database updates more complicated.
> These sorts of things happen often and are an integral part of a
> database update, so I'm just curious how/if you deal with them.
>
> I've used Ruby's migrations, and for my Django databases I use my own
> migration system which is based in principle off of it: create
> scripts to migrate the database from version X to X+1 and X-1, and
> upgrade or downgrade by running the appropriate scripts in sequence.
> It's not ideal, since it can't generate a database at a specific
> version directly; it always has to run through the entire sequence of
> migrations to the version you want, and the migrations accumulate.
> However, it can handle whatever arbitrary steps are needed to update
> a database, and I don't need to test updates from every version to
> every other version.

I'm with Glenn on this point. I have found updating a db version to be far more complex (for reasons he illustrates, and which I find more common than not) than any automation tool can handle. And I wonder if the time spent developing such a tool (or writing changesets, xml, etc. for a given tool) actually saves development time.

FWIW, this is what I do:

1. I have a table in my database, meta, that contains exactly one row and holds configuration information. A minimal version of this table:

       CREATE TABLE meta (
           id    integer DEFAULT 1 NOT NULL CHECK (id = 1),
           major integer NOT NULL,
           minor integer NOT NULL,
           patch integer NOT NULL
       );

   The first column guarantees I only have one row. The next three provide a tuple for my version, e.g., 1.1.3. I add other columns for software configuration as needed.

2. My application configuration has a DB_VERSION variable which defines the version of the database the software depends on. If there's a mismatch, my programs can't connect to the database, thus guaranteeing db=software happiness.

3. Like Glenn, I have a script that can go forwards and backwards, one revision at a time. The effort in maintaining this script is minimal, actually: in my RCS, I keep the current schema, which is generated with `pg_dump -O -s`. Based on diffs between the current system and the last revision, I can manually generate the necessary DDL statements for the script (see the sketch after this post). If I have data as part of my schema, like the row in my meta table, those rows are simple enough to examine and add to the script with DML statements.

4. Whenever a revision changes the schema, I tag my repository, so I can easily check out that version.

5. As for branches: if production is at 1.1.3 and three branches create versions 1.1.4, another 1.1.4 and 1.2.0, and all three want to merge their various changes back into a 1.1.3 production database? I say, without being flip: don't let this happen. Here is where I think no software process can make anyone happy. It's a human process: declare someone the owner of the database schema, let them own the long-term development of the schema, and if anyone needs a change, they have to communicate and vet those changes with the db designer.

Dan
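As an illustration of step 3, here is a minimal sketch of what one forward migration script in this scheme might look like. The account table and its column are hypothetical examples; the version guard and bump assume the meta table defined above:

    -- upgrade-1.1.3-to-1.1.4.sql (hypothetical migration script)
    -- Run in a single transaction so a failure leaves the database
    -- untouched at the old version.
    BEGIN;

    -- Guard: divides by zero (and so aborts the transaction) unless
    -- the database is currently at version 1.1.3.
    SELECT 1 / count(*) FROM meta
     WHERE major = 1 AND minor = 1 AND patch = 3;

    -- The actual DDL for this revision (hypothetical example).
    ALTER TABLE account ADD COLUMN last_login timestamp;

    -- Record the new version.
    UPDATE meta SET patch = patch + 1;

    COMMIT;

The corresponding downgrade script would drop the column and decrement patch.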
Re: [GENERAL] Need help writing exclusion constraint
Jeff Davis writes:

> On Sat, 2011-01-15 at 21:32 +0100, Tomas Vondra wrote:
>> ALTER TABLE event ADD CONSTRAINT event_overlap
>>     CHECK (overlap_at_dest(destination_id, starts, ends));
>>
>> There's a race condition ... One way to fix this is locking
>
> I do not recommend locking. In fact, the primary reason that
> exclusion constraints exist is to prevent unnecessary locking for
> problems exactly like this. I included some links in my other reply
> that demonstrate how to avoid that excessive locking while still
> being safe from race conditions.

I totally understand the issues of race conditions. My original reply didn't address the issue... it should have. Of course, race conditions are only an issue for concurrent sessions... whether that matters depends on the total application architecture.

Anyway... Jeff, all your answers depend on using new features in 9.0. What would you recommend for folks still using 8.4? Without 9.0 exclusion constraints, what else can you do besides using functions in check constraints (or triggers) with appropriate locking (at some level of the overall application architecture)?

Dan
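For what it's worth, here is a minimal sketch of the pre-9.0 trigger-plus-locking approach the question alludes to, using the event table from this thread (the function and trigger names are hypothetical):

    CREATE OR REPLACE FUNCTION event_overlap_check() RETURNS trigger AS $$
    BEGIN
        -- Serialize writers so two sessions can't both pass the
        -- overlap test before either new row is visible to the other.
        LOCK TABLE event IN SHARE ROW EXCLUSIVE MODE;
        PERFORM 1 FROM event
         WHERE destination_id = NEW.destination_id
           AND (starts, ends) OVERLAPS (NEW.starts, NEW.ends);
        IF FOUND THEN
            RAISE EXCEPTION 'overlapping event for destination %',
                            NEW.destination_id;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER event_overlap_trig
        BEFORE INSERT ON event
        FOR EACH ROW EXECUTE PROCEDURE event_overlap_check();

An UPDATE trigger would additionally have to exclude the row being updated from the check, and acquiring the stronger lock inside the trigger (after the INSERT has already taken its own weaker lock on event) can deadlock under heavy concurrency, so it is often safer to take the LOCK TABLE at the top of the transaction. Either way, writers are fully serialized, which is precisely the cost 9.0's exclusion constraints were designed to avoid.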
Re: [GENERAL] plpythonu memory leak
Alex Hunsaker writes:

> FYI, if I don't use a slice copy here I can't get it to leak. (Find
> my test case at the end.) I don't know enough about python to know if
> that's a pl/python issue or python doing what it's told -- having
> never really written any python myself.
>
> -- leaks big time
> CREATE OR REPLACE FUNCTION pygaps_leak() RETURNS void
> LANGUAGE plpythonu AS $$
> results = plpy.execute("select generate_series(0, 100)")
> prev = results[0]
> for curr in results[1:]:
>     prev = curr
> return
> $$;
>
> -- does not leak
> CREATE OR REPLACE FUNCTION pygaps_no_leak() RETURNS void
> LANGUAGE plpythonu AS $$
> results = plpy.execute("select generate_series(0, 100)")
> prev = results[0]
> for curr in range(1, len(results)):
>     prev = curr
> return
> $$;

Alex,

Great find! Yes, it's definitely leaking when taking a slice. Something is hanging on to the reference to the slice object and/or the reference count is not properly managed: I modified your leak function and added explicit calls to the python garbage collector, with no result.

I'll hunt around in the source for the leak. Regardless of my findings, I'll submit a bug.

Thanks!

Dan Popowich
Re: [GENERAL] plpythonu memory leak
I found the bug and it has been reported. Bug #5842. Details here:

http://archives.postgresql.org/pgsql-bugs/2011-01/msg00134.php

Dan Popowich
Re: [GENERAL] Need help writing exclusion constraint
Matthew Wilson writes:

> I have a table like this:
>
>     create table event (
>         destination_id integer not null
>             references destination (destination_id),
>         starts timestamp,
>         ends timestamp
>     );
>
> I want to make sure that no two rows **with the same destination_id**
> overlap in time.
>
> I'm not sure how to write this exclusion constraint. I know how to
> make the constraint to prevent any two rows from overlapping, but I
> want to allow rows to overlap as long as they don't have the same
> destination_id.

Constraint expressions can only be simple boolean expressions, and so can refer only to the column(s) of the current row you're inserting or updating. To refer to other records (which you'll need to do to compare destination_ids) you need to create a function... something along the lines of this:

    CREATE OR REPLACE FUNCTION overlap_at_dest(dest integer,
                                               s timestamp,
                                               e timestamp)
    RETURNS boolean AS $_$
    DECLARE
        c bigint;
    BEGIN
        SELECT count(*) INTO c FROM event
         WHERE (destination_id = dest)
           AND ((starts, ends) OVERLAPS (s, e));
        RETURN c = 0;
    END;
    $_$ LANGUAGE plpgsql;

Then alter your table:

    ALTER TABLE event ADD CONSTRAINT event_overlap
        CHECK (overlap_at_dest(destination_id, starts, ends));

Cheers,

Dan Popowich
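As later replies in this thread point out, this CHECK-plus-function approach is subject to a race condition between concurrent sessions. For readers on a newer server: once range types arrived in PostgreSQL 9.2, the exclusion constraint the poster asked about can be written directly. A sketch, assuming the btree_gist extension is installed (it supplies GiST support for the integer equality test):

    CREATE EXTENSION btree_gist;

    ALTER TABLE event ADD CONSTRAINT event_no_overlap
        EXCLUDE USING gist (
            destination_id WITH =,
            tsrange(starts, ends) WITH &&
        );

Two rows conflict only when both the destination_ids are equal and the timestamp ranges overlap, and the constraint is enforced safely under concurrency without any explicit locking.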
[GENERAL] plpythonu memory leak
I am working with very large sets of time-series data. Imagine a table with a timestamp as the primary key. One question I need to ask of my data is: are there gaps of time greater than some interval between consecutive rows?

I wrote a function in plpgsql to answer this question and it worked great. Being a python zealot, I decided to rewrite the function in plpythonu to compare performance. While initial comparisons seemed inconclusive, after testing on large queries (over a million records) I discovered ever-increasing time to complete the exact same query, and massive memory growth in my postgres process to the point of memory starvation in under 15 queries.

I've reduced my schema to one table with one timestamp column, one type and two functions in a schema named plpythonu_bug, saved with `pg_dump -n plpythonu_bug -s -O bug.sql`. It is attached.

Here are some statistics on two separate psql sessions, one where I ran this plpgsql function several times:

    EXPLAIN ANALYZE SELECT count(*)
        from gaps('2008-01-01', '2010-01-01', '1 min');

Then a second session running the exact same query but with the plpythonu function, pygaps.

Note: I had over 273,000 rows in my table. The function returned 5103 rows each run. Memory usage is from `top` output; milliseconds, from the output of explain analyze. This is on an Ubuntu 10.04 system w/ 2GB RAM, postgres 8.4.6, python 2.6.5.

plpgsql function:

    Run #    Virt    Res    ms
    before   101m    3500   n/a
    1        103m    17m    584
    2        104m    17m    561
    3        104m    18m    579
    ...etc... (virtually no movement over several runs)

plpythonu function:

    Run #    Virt    Res    ms
    before   101m    3492   n/a
    1        213m    122m   1836
    2        339m    246m   1784
    3        440m    346m   2178

...and so on, about 100m or so increase with each run, such that in a dozen or so runs I had 1.5g in resident memory and single calls to the function taking over 45 seconds.

My schema is attached.
Thanks for any help and insight,

Dan Popowich

--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

--
-- Name: plpythonu_bug; Type: SCHEMA; Schema: -; Owner: -
--

CREATE SCHEMA plpythonu_bug;

SET search_path = plpythonu_bug, pg_catalog;

--
-- Name: timerange; Type: TYPE; Schema: plpythonu_bug; Owner: -
--

CREATE TYPE timerange AS (
    begin_ts timestamp without time zone,
    end_ts timestamp without time zone
);

--
-- Name: gaps(timestamp without time zone, timestamp without time zone, interval); Type: FUNCTION; Schema: plpythonu_bug; Owner: -
--

CREATE FUNCTION gaps(start_ts timestamp without time zone, end_ts timestamp without time zone, gap_length interval) RETURNS SETOF timerange
    LANGUAGE plpgsql
    AS $$
DECLARE
    prev timestamp;
    curr timestamp;
    tr   timerange;
BEGIN
    FOR curr IN SELECT ts FROM timeseries
                WHERE ts BETWEEN start_ts AND end_ts
                ORDER BY ts LOOP
        IF curr - prev > gap_length THEN
            tr.begin_ts := prev;
            tr.end_ts := curr;
            RETURN NEXT tr;
        END IF;
        prev := curr;
    END LOOP;
    RETURN;
END;
$$;

--
-- Name: pygaps(timestamp without time zone, timestamp without time zone, interval); Type: FUNCTION; Schema: plpythonu_bug; Owner: -
--

CREATE FUNCTION pygaps(start_ts timestamp without time zone, end_ts timestamp without time zone, gap_length interval) RETURNS SETOF timerange
    LANGUAGE plpythonu
    AS $$
# because pg passes date/time to python as strings I'm using pg to
# recompute values as seconds so I have numbers to do math
gap = plpy.execute("select extract(epoch from '%s'::interval) as sec" % gap_length)[0]['sec']

results = plpy.execute("select ts, extract(epoch from ts) as epoch from timeseries where ts between '%s' and '%s'" % (start_ts, end_ts))

if results.nrows() < 2:
    return

# prime the well by setting prev(ious) to the first tic and
# iterate starting with the second...
prev = results[0]
for curr in results[1:]:
    # yield timestamp pairs for gaps of timestamps greater than gap
    if curr['epoch'] - prev['epoch'] > gap:
        yield dict(begin_ts=prev['ts'], end_ts=curr['ts'])
    prev = curr
return
$$;

SET default_tablespace = '';
SET default_with_oids = false;

--
-- Name: timeseries; Type: TABLE; Schema: plpythonu_bug; Owner: -; Tablespace:
--

CREATE TABLE timeseries (
    ts timestamp without time zone
);

--
-- PostgreSQL database dump complete
--
Re: [GENERAL] range intervals in window function frames
Vincent Veyron writes:

>> What I would LIKE to do is this:
>>
>>     select *, avg(value) over (order by ts
>>                                range (interval '5 min') preceding)
>>     from sample
>>     order by ts;
>
> This?
>
>     select t1.ts, t1.value,
>            (select avg(t2.value)
>               from (select value from sample
>                      where (t1.ts - ts)::INTERVAL <= interval '5 minutes'
>                        and ts < t1.ts) as t2)
>     from sample t1;

Close. Your where clause needed to have (ts <= t1.ts). It can also be simplified to this:

    select t1.ts, t1.value,
           (select avg(t2.value)
              from sample t2
             where (t1.ts - t2.ts) <= interval '5 min'
               and t2.ts <= t1.ts)
    from sample t1
    order by t1.ts;

HOWEVER, the performance is horrible compared to using the avg_over_interval() function! On a table with 53,622 rows:

    EXPLAIN ANALYZE [the above select statement];
    ...
    Total runtime: 481235.867 ms

But this:

    EXPLAIN ANALYZE select ts, value,
                           avg_over_interval(ts, interval '5 min')
                    from sample order by ts;
    ...
    Total runtime: 3934.755 ms

8 minutes vs 4 seconds!

I'm imagining window functions would be even more efficient. Yes?

Can anyone answer when range intervals will be implemented for window functions, as in the quoted select at the top of this message?

Cheers,

Dan
Re: [GENERAL] range intervals in window function frames
Tom Lane writes:

> Daniel Popowich <danielpopow...@gmail.com> writes:
>> Close. Your where clause needed to have (ts <= t1.ts). It can also
>> be simplified to this:
>>
>>     select t1.ts, t1.value,
>>            (select avg(t2.value)
>>               from sample t2
>>              where (t1.ts - t2.ts) <= interval '5 min'
>>                and t2.ts <= t1.ts)
>>     from sample t1
>>     order by t1.ts;
>>
>> HOWEVER, the performance is horrible compared to using the
>> avg_over_interval() function!
>
> The reason for that is the WHERE clause got rewritten into a form
> that can't be used efficiently with the index on t2. Phrase it the
> same way as in the function, i.e.,
>
>     where (t1.ts - interval '5 min') <= t2.ts and t2.ts <= t1.ts
>
> and you'll probably get similar results.

Thanks, Tom, that explains it. EXPLAIN ANALYZE with the rewritten WHERE brings the inline version down to 8.5 seconds: still twice as slow, but that's a heck of a lot better than 122 times as slow! :)

> Of course, since this isn't anything except inlining the function
> into the query, it's probably not all that exciting to you.

Not terribly, but it's good to discover the function version is twice as fast. (Not to mention that the function is much easier to read.)

>> Can anyone answer when range intervals will be implemented for
>> window functions, as in the quoted select at the top of this
>> message?
>
> Nope. There was a patch submitted, it was rejected on a couple of
> grounds, and I don't know if anyone is actively working on the
> problem or not.

Bummer. I may go ask in hackers.

Thanks,

Dan
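Putting Tom's predicate back into the full query, the index-friendly inline version reads as follows (a sketch; it assumes the sample table and its sample_ts index from the original post):

    select t1.ts, t1.value,
           (select avg(t2.value)
              from sample t2
             where (t1.ts - interval '5 min') <= t2.ts
               and t2.ts <= t1.ts)
    from sample t1
    order by t1.ts;

The point is that t2.ts now stands alone on one side of each comparison, so the planner can evaluate both bounds as an index range scan instead of computing (t1.ts - t2.ts) for every row of t2.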
[GENERAL] range intervals in window function frames
Hello all!

I need to do moving averages over time-series data and was hoping window functions could solve the problem for me, but it doesn't look like the 8.4 or even 9.0 implementations are quite there yet.

Currently, if I have this table:

    create table sample (
        ts    timestamp,
        value integer
    );

    create index sample_ts on sample (ts);

and say I want a moving average of value over a fixed interval of five minutes (note that this could mean varying numbers of records in each frame), then I can do this:

    select *, avg_over_interval(ts, interval '5 min')
    from sample
    order by ts;

where avg_over_interval() is defined like this:

    create or replace function avg_over_interval(timestamp, interval)
    returns numeric as $$
        select avg(value) from sample
         where (($1 - $2) <= ts) and (ts <= $1);
    $$ language sql;

What I would LIKE to do is this:

    select *, avg(value) over (order by ts
                               range (interval '5 min') preceding)
    from sample
    order by ts;

which is way cleaner and, I assume, more efficient.

Questions:

1) Is there active work on window functions with frames over interval ranges?

2) If not, how can I help with that?

3) Until the functionality is in 9.x, can I make what I'm doing more efficient? Is there a better way to do this without window functions? (I tried to come up with a subquery in my select-list, but couldn't come up with legal syntax to reference the static value of ts for the current row in the subselect.)

Thanks all for your help.

Dan
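For readers who find this thread later: interval offsets in RANGE window frames were eventually implemented (in PostgreSQL 11), so on a modern server the wished-for query can be written directly; note the standard syntax differs slightly from the hypothetical form above:

    select ts, value,
           avg(value) over (order by ts
                            range between interval '5 min' preceding
                                      and current row)
    from sample
    order by ts;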
[GENERAL] order of trigger firing relative to column/table constraints
I am designing a DB where column/table constraints are not sufficient for data integrity (e.g., guaranteeing non-recursive tree graphs), so I'm writing my first complicated triggers, and I have been searching the docs and the archives of this list for detailed information on when triggers are fired relative to constraints created with CREATE TABLE and/or ALTER TABLE: for example, in what phase CHECK and FOREIGN KEY constraints (as well as NOT NULL and UNIQUE constraints) are checked relative to the order of triggers firing.

The documentation clearly outlines custom trigger order as:

    1. before-statement
    2. before-row [before-row ...]
       -- data now visible --
    3. after-row [after-row ...]
    4. after-statement

For example, it was at first surprising to discover in my before-row trigger that foreign key constraints had not yet been checked (which I had assumed I could take for granted in my trigger, since I had defined the column constraint). This means the foreign key constraint checking will have to be done twice: once in my custom trigger in the before-row phase (because my logic requires it there) and again when the foreign key column constraint is checked (whenever that is).

In summary, I have one general question and two specific questions:

General: Is the order of column/table constraint checking relative to custom triggers documented anywhere in detail?

Specific #1: Is there a way to control the order of column/table constraints relative to custom triggers?

Specific #2: If, say, CHECK and FOREIGN KEY constraints are checked in the after-row phase (or just before it, but after the before-row phase), and I therefore need to move my custom constraint logic to the after-row phase to take advantage of those constraints, is raising an error the only way for me to abort the insert or update? (If my logic is in the before-row phase, which is an intuitive place to put it, I can just return null, but that won't work in the after-row phase.)

BTW, if I can get decent info, I'd be willing to volunteer to contribute well-written documentation to give back to the cause.

Thanks,

Dan
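To illustrate the two abort mechanisms the last question contrasts, here is a minimal sketch; the node table and trigger names are hypothetical stand-ins for the tree-graph use case mentioned above, and in practice you would use one mechanism or the other, not both:

    CREATE TABLE node (
        id        integer PRIMARY KEY,
        parent_id integer REFERENCES node
    );

    -- In a BEFORE ROW trigger, returning NULL silently skips the row.
    CREATE OR REPLACE FUNCTION node_check_before() RETURNS trigger AS $$
    BEGIN
        IF NEW.parent_id = NEW.id THEN
            RETURN NULL;  -- quietly suppress this insert/update
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- In an AFTER ROW trigger the return value is ignored, so raising
    -- an error is the only way to abort the statement.
    CREATE OR REPLACE FUNCTION node_check_after() RETURNS trigger AS $$
    BEGIN
        IF NEW.parent_id = NEW.id THEN
            RAISE EXCEPTION 'node % may not be its own parent', NEW.id;
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER node_before BEFORE INSERT OR UPDATE ON node
        FOR EACH ROW EXECUTE PROCEDURE node_check_before();

    CREATE TRIGGER node_after AFTER INSERT OR UPDATE ON node
        FOR EACH ROW EXECUTE PROCEDURE node_check_after();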