[GENERAL] Using PostgreSQL for NSS databases

2012-11-01 Thread Daniel Popowich

I'm making this post here in hopes I may save someone from beating
their head against the wall like I did...

I am writing a custom Name Service Switch (NSS) module to take
advantage of already existing account information in a pg database.

Under certain circumstances, processes will hang due to non-recursive
mutex locking during PG connection creation.  It goes something like
this:


/etc/nsswitch.conf:

  passwd: files mypgmod
  group:  files mypgmod


  [process with euid not found in /etc/passwd, looking up another
  username not found in /etc/passwd]

  getpwnam_r(username, ...)

// nss doesn't find user in files module (/etc/passwd),
// so uses mypgmod

_nss_mypgmod_getpwnam_r(username, ...)

pthread_mutex_lock(...) //to protect PG connection, then...
PQconnectdb(...)

// any number of reasons to look up info in calling user's
// home directory, e.g., default password, ssl certs, etc.,
// resulting in call to getpwuid_r() to find user's homedir.

getpwuid_r(geteuid(), ...)

// nss doesn't find user in files module (/etc/passwd),
// so uses mypgmod

_nss_mypgmod_getpwuid_r(uid, ...)

pthread_mutex_lock(...) //to protect PG connection
/* HANG */



* * *

The fix, if you can call it that, is to run nscd, the NSS caching
daemon.  Since nscd typically runs as root (whose account info is in
/etc/passwd), the nested lookup is satisfied by the files module
inside nscd's own process and never re-enters mypgmod, so no second
connection attempt is made.
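
If it helps, the relevant bits of /etc/nscd.conf look something like
this (the directive names are stock nscd.conf; the TTL values are
arbitrary):

/etc/nscd.conf:

  enable-cache            passwd  yes
  positive-time-to-live   passwd  600
  enable-cache            group   yes
  positive-time-to-live   group   600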

Also, many Linux distros ship a libnss-pgsql2 package which suffers
from the exact same problem (there are some scattered posts on the
'net about it).  Same fix: run nscd.

Cheers,

Daniel Popowich




Re: [GENERAL] Schema version control

2011-02-11 Thread Daniel Popowich

Glenn Maynard writes:
 - Adding a NOT NULL constraint (without adding a DEFAULT).  You often want
 to precede this with filling in any existing NULL values, so the new
 constraint doesn't fail.
 - Updating triggers, functions and their effects.  For example, when I have
 an FTS index with a trigger to update an index column, and I change the
 underlying trigger, I often do something like UPDATE table SET column =
 column, to cause all of the update triggers to fire and recalculate the
 index columns.
 - Creating a new column based on an old one, and removing the old one; eg.
 add a column n, run UPDATE ... SET n = i*j * 2, and then drop the old
 columns i and j.
 - Updating data from an external source, such as ORM model code; for
 example, if you have a table representing external files, an update may want
 to calculate and update the SHA-1 of each file.
 - For efficiency, dropping a specific index while making a large update, and
 then recreating the index.
 
 In my experience, while generating schema updates automatically is handy, it
 tends to make nontrivial database updates more complicated.  These sorts of
 things happen often and are an integral part of a database update, so I'm
 just curious how/if you deal with them.
 
 I've used Ruby's migrations, and for my Django databases I use my own
 migration system which is based in principle off of it: create scripts to
 migrate the database from version X to X+1 and X-1, and upgrade or downgrade
 by running the appropriate scripts in sequence.
 
 It's not ideal, since it can't generate a database at a specific version
 directly; it always has to run through the entire sequence of migrations to
 the version you want, and the migrations accumulate.  However, it can handle
 whatever arbitrary steps are needed to update a database, and I don't need
 to test updates from every version to every other version.

I'm with Glenn on this point.  I have found updating a db version to
be far more complex (for reasons he illustrates and which I find more
common than not) than any automation tool can handle.  And I wonder if
the time spent developing such a tool (or writing changesets, xml,
etc. for a given tool) actually saves development time.

FWIW, this is what I do:

 1.  I have a table in my database, meta, that contains exactly one
 row, and holds configuration information.  A minimal version of this
 table:

CREATE TABLE meta (
id integer DEFAULT 1 NOT NULL CHECK (id = 1),
major integer NOT NULL,
minor integer NOT NULL,
patch integer NOT NULL
);

  The first column is to guarantee I only have one row.  The next
  three provide a tuple for my version, e.g., 1.1.3.  I add other
  columns for software configuration as needed.

 2.  My application configuration has a DB_VERSION variable which
 defines the version of the database the software depends on.  If
 there's a mismatch, my programs can't connect to the database, thus
 guaranteeing db<=>software happiness.  (See the sketch after this
 list.)

 3.  Like Glenn, I have a script that can go forwards and backwards,
 one revision at a time.  The effort in maintaining this script is
 minimal, actually: in my RCS, I have the current schema, which is
 generated with: `pg_dump -O -s`.  Based on diffs between current
 system and last revision I can manually generate the necessary DDL
 statements for the script.  If I have data as part of my schema,
 like the row in my meta table, those are simple enough to examine and
 add to the script with DML statements.

 4.  Whenever a revision changes the schema, I tag my repository,
 so I can easily check out that version.

 5.  As for branches: if production is at 1.1.3 and three branches
 create versions 1.1.4, another 1.1.4, and 1.2.0, and all three want
 to merge their various changes back into a 1.1.3 production
 database?  I say, without being flip, don't let this happen.  Here
 is where I think no software process can make anyone happy.  It's a
 human process: declare someone the owner of the database schema, let
 them own the long-term development of the schema, and if anyone
 needs a change, they have to communicate and vet those changes with
 the db designer.
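
To make steps 2 and 3 concrete, here is a sketch of one migration
step (the ALTER is a hypothetical example; the point is that the
schema change and the version bump commit atomically):

    BEGIN;

    -- the actual schema change for this revision
    ALTER TABLE widget ADD COLUMN sku text;

    -- bump the version my software checks at startup
    UPDATE meta SET minor = minor + 1, patch = 0;

    COMMIT;

And the step-2 check is just a comparison of DB_VERSION against:

    SELECT major, minor, patch FROM meta;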


Dan



Re: [GENERAL] Need help writing exclusion constraint

2011-01-19 Thread Daniel Popowich

Jeff Davis writes:
 On Sat, 2011-01-15 at 21:32 +0100, Tomas Vondra wrote:
   ALTER TABLE event ADD CONSTRAINT event_overlap
CHECK(overlap_at_dest(destination_id, starts, ends));
  
  There's a race condition 
 
 ...
 
  One way to fix this is locking 
 
 I do not recommend locking. In fact, the primary reason that exclusion
 constraints exist is to prevent unnecessary locking for problems exactly
 like this.
 
 I included some links in my other reply that demonstrate how to avoid
 that excessive locking while still being safe from race conditions.

I totally understand the issues of race conditions.  My original
reply didn't address the issue...it should have.  Of course, race
conditions are only an issue for concurrent sessions, and whether you
have those depends on the overall application architecture.

Anyway...Jeff, all your answers depend on using new features in 9.0.
What would you recommend for folks still using 8.4?  Without 9.0
exclusion constraints, what else can you do besides using functions
in check constraints (or triggers) with appropriate locking (at some
level of the overall application architecture)?
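
For anyone on 9.0, the feature Jeff describes makes the rule
declarative; with the range types that arrived later (9.2+) a sketch
of it looks like this:

    CREATE EXTENSION btree_gist;   -- allows = on integers in a gist index

    ALTER TABLE event ADD CONSTRAINT event_no_overlap
        EXCLUDE USING gist (destination_id WITH =,
                            tsrange(starts, ends) WITH &&);

On 9.0 itself the same shape uses the external period type instead of
tsrange.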

Dan



Re: [GENERAL] plpythonu memory leak

2011-01-18 Thread Daniel Popowich

Alex Hunsaker writes:
 FYI, if I don't use a slice copy here I can't get it to leak (find
 my test case at the end).  I don't know enough about Python to know
 if that's a pl/python issue or Python doing what it's told, having
 never really written any Python myself.
 
 ---
 -- leaks big time
 CREATE OR REPLACE FUNCTION pygaps_leak() RETURNS void
 LANGUAGE plpythonu
 AS $$
 results = plpy.execute("select generate_series(0, 100)")
 prev = results[0]
 for curr in results[1:]:
     prev = curr
 return
 $$;
 
 -- does not leak
 CREATE OR REPLACE FUNCTION pygaps_no_leak() RETURNS void
 LANGUAGE plpythonu
 AS $$
 results = plpy.execute("select generate_series(0, 100)")
 prev = results[0]
 for curr in range(1, len(results)):
     prev = curr
 return
 $$;

Alex,

Great find!  Yes, it's definitely leaking when taking a slice.
Something is hanging on to the reference to the slice object and/or
the reference count is not properly managed: I modified your leak
function and added explicit calls to the python garbage collector with
no result.
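
For the record, the experiment looked something like this (a sketch;
the function name is made up):

CREATE OR REPLACE FUNCTION pygaps_gc_probe() RETURNS void
    LANGUAGE plpythonu
    AS $$
import gc
results = plpy.execute("select generate_series(0, 100)")
prev = results[0]
for curr in results[1:]:    # the slice copy suspected of leaking
    prev = curr
del results, prev, curr
gc.collect()    # explicit collection; memory still grows afterwards,
                # which points at a refcount bug rather than cycles
return
$$;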

I'll hunt around in the source for the leak.  Regardless of my
findings, I'll submit a bug.

Thanks!

Dan Popowich




Re: [GENERAL] plpythonu memory leak

2011-01-18 Thread Daniel Popowich

I found the bug and it has been reported.  Bug #5842.

Details here:

   http://archives.postgresql.org/pgsql-bugs/2011-01/msg00134.php


Dan Popowich   



Re: [GENERAL] Need help writing exclusion constraint

2011-01-15 Thread Daniel Popowich

Matthew Wilson writes:
 I have a table like this:
 
 create table event(
 
 destination_id integer not null references destination
 (destination_id),
 
 starts timestamp,
 ends timestamp
 );
 
 I want to make sure that no two rows **with the same destination_id**
 overlap in time.
 
 I'm not sure how to write this exclusion constraint.  I know how to make
 the constraint to prevent any two rows from overlapping, but I want to
 allow rows to overlap as long as they don't have the same
 destination_id.

Constraint expressions can only be simple boolean expressions, so
they can refer only to the column(s) of the row you're inserting or
updating.  To refer to other records (which you'll need to do to
compare destination_ids), you have to wrap the query in a
function...something along the lines of this:


  CREATE OR REPLACE FUNCTION overlap_at_dest(dest integer,
                                             s timestamp,
                                             e timestamp)
  returns boolean as $_$

  DECLARE
      c bigint;
  BEGIN
      select count(*) into c from event
       where (destination_id = dest)
         and ((starts, ends) overlaps (s, e));
      return c = 0;
  END;

  $_$ LANGUAGE plpgsql;
  


Then alter your table:

ALTER TABLE event ADD CONSTRAINT event_overlap
 CHECK(overlap_at_dest(destination_id, starts, ends));
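
A quick sanity check with hypothetical data (assuming destinations 1
and 2 exist; values follow the column order of the table definition):

    INSERT INTO event VALUES (1, '2011-01-15 10:00', '2011-01-15 11:00');  -- ok
    INSERT INTO event VALUES (1, '2011-01-15 10:30', '2011-01-15 11:30');  -- fails the check
    INSERT INTO event VALUES (2, '2011-01-15 10:30', '2011-01-15 11:30');  -- ok: other destination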



Cheers,
 
Dan Popowich
 



[GENERAL] plpythonu memory leak

2011-01-14 Thread Daniel Popowich

I am working with very large sets of time-series data.  Imagine a
table with a timestamp as the primary key.  One question I need to ask
of my data is: Are there gaps of time greater than some interval
between consecutive rows?

I wrote a function in plpgsql to answer this question and it worked
great.  Being a python zealot I decided to rewrite the function in
plpythonu to compare performance.  While initial comparisons seemed
inconclusive, after testing on large queries (over a million records)
I discovered ever-increasing time to complete the exact same query and
massive memory growth in my postgres process to the point of memory
starvation in under 15 queries.

I've reduced my schema to one table with one timestamp column, one
type and two functions in a schema named plpythonu_bug and saved it
with: `pg_dump -n plpythonu_bug -s -O > bug.sql`.  It is attached.

Here are some statistics on two separate psql sessions, one where I
ran this plpgsql function several times:


EXPLAIN ANALYZE SELECT count(*) from gaps('2008-01-01', 
  '2010-01-01', '1 min');

Then a second session running the exact same query but with the
plpythonu function, pygaps.  

   Note: I had over 273,000 rows in my table.  The function returned
   5103 rows each run.  Memory usage is from `top` output.
   Milliseconds, from output of explain analyze.  This is on an Ubuntu
   10.04 system w/ 2GB RAM, postgres 8.4.6, python 2.6.5.


   plpgsql function
   ----------------

   Run #    Virt    Res     ms
   ------   ----    ----    ----
   before   101m    3500    n/a
   1        103m    17m     584
   2        104m    17m     561
   3        104m    18m     579

...etc... (virtually no movement over several runs)


   plpythonu function
   ------------------

   Run #    Virt    Res     ms
   ------   ----    ----    ----
   before   101m    3492    n/a
   1        213m    122m    1836
   2        339m    246m    1784
   3        440m    346m    2178

...and so on, about 100m or so increase with each run such that in a
dozen or so runs I had 1.5g in resident memory and single calls to the
function taking over 45 seconds.

My schema is attached.

Thanks for any help and insight,

Dan Popowich


--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

--
-- Name: plpythonu_bug; Type: SCHEMA; Schema: -; Owner: -
--

CREATE SCHEMA plpythonu_bug;


SET search_path = plpythonu_bug, pg_catalog;

--
-- Name: timerange; Type: TYPE; Schema: plpythonu_bug; Owner: -
--

CREATE TYPE timerange AS (
begin_ts timestamp without time zone,
end_ts timestamp without time zone
);


--
-- Name: gaps(timestamp without time zone, timestamp without time zone, interval); Type: FUNCTION; Schema: plpythonu_bug; Owner: -
--

CREATE FUNCTION gaps(start_ts timestamp without time zone,
                     end_ts timestamp without time zone,
                     gap_length interval) RETURNS SETOF timerange
LANGUAGE plpgsql
AS $$

DECLARE
  prev timestamp;
  curr timestamp;
  tr   timerange;
BEGIN
  FOR curr IN SELECT ts FROM timeseries
              WHERE ts BETWEEN start_ts AND end_ts
              ORDER BY ts
  LOOP
      IF curr - prev > gap_length THEN
          tr.begin_ts := prev;
          tr.end_ts := curr;
          RETURN NEXT tr;
      END IF;
      prev := curr;
  END LOOP;
  RETURN;
END;
$$;


--
-- Name: pygaps(timestamp without time zone, timestamp without time zone, interval); Type: FUNCTION; Schema: plpythonu_bug; Owner: -
--

CREATE FUNCTION pygaps(start_ts timestamp without time zone,
                       end_ts timestamp without time zone,
                       gap_length interval) RETURNS SETOF timerange
LANGUAGE plpythonu
AS $$

# because pg passes date/time to python as strings I'm using pg to
# recompute values as seconds so I have numbers to do math

gap = plpy.execute("""select extract(epoch from '%s'::interval) as sec"""
                   % gap_length)[0]['sec']

results = plpy.execute("""select ts, extract(epoch from ts) as epoch
                            from timeseries
                           where ts between '%s' and '%s'"""
                       % (start_ts, end_ts))
if results.nrows() < 2:
    return

# prime the well by setting prev(ious) to the first tic and
# iterate starting with the second...
prev = results[0]
for curr in results[1:]:
    # yield timestamp pairs for gaps of timestamps greater than gap
    if curr['epoch'] - prev['epoch'] > gap:
        yield dict(begin_ts=prev['ts'], end_ts=curr['ts'])

    prev = curr

return

$$;


SET default_tablespace = '';

SET default_with_oids = false;

--
-- Name: timeseries; Type: TABLE; Schema: plpythonu_bug; Owner: -; Tablespace: 
--

CREATE TABLE timeseries (
ts timestamp without time zone
);


--
-- PostgreSQL database dump complete
--



Re: [GENERAL] range intervals in window function frames

2010-12-14 Thread Daniel Popowich

Vincent Veyron writes:
  What I would LIKE to do is this:
  
  select *, avg(value) over(order by ts range (interval '5 min') preceding)
   from sample order by ts;
  
 
 This?
 
 
 select t1.ts, t1.value, (select avg(t2.value) from (select value from
 sample where (t1.ts-ts)::INTERVAL <= interval '5 minutes' and ts<t1.ts)
 as t2) from sample t1;

Close.  Your where clause needed to have (ts<=t1.ts).  It can also be
simplified to this:

select t1.ts, t1.value, (select avg(t2.value)
                           from sample t2
                          where (t1.ts - t2.ts) <= interval '5 min'
                            and t2.ts <= t1.ts)
  from sample t1 order by t1.ts;

HOWEVER, the performance is horrible compared to using the
avg_over_interval() function!  On a table with 53,622 rows:

   EXPLAIN ANALYZE [the above select statement];

 ...
 Total runtime: 481235.867 ms

But this:

   EXPLAIN ANALYZE select ts, value,
  avg_over_interval(ts, interval '5 min')
  from sample order by ts;

 ...  
 Total runtime: 3934.755 ms

8 minutes vs 4 seconds!

I'm imagining window functions would be even more efficient.  Yes?
Can anyone answer when range intervals will be implemented for window
functions, as in the quoted select at the top of this message?


Cheers,

Dan






Re: [GENERAL] range intervals in window function frames

2010-12-14 Thread Daniel Popowich

Tom Lane writes:
 Daniel Popowich danielpopow...@gmail.com writes:
  Close.  Your where clause needed to have (ts<=t1.ts).  It can also be
  simplified to this:
 
  select t1.ts, t1.value, (select avg(t2.value)
                             from sample t2
                            where (t1.ts - t2.ts) <= interval '5 min'
                              and t2.ts <= t1.ts)
    from sample t1 order by t1.ts;
 
  HOWEVER, the performance is horrible compared to using the
  avg_over_interval() function!
 
 The reason for that is the WHERE clause got rewritten into a form that
 can't be used efficiently with the index on t2.  Phrase it the same way
 as in the function, ie
 
 where (t1.ts - interval '5 min') <= t2.ts
   and t2.ts <= t1.ts
 
 and you'll probably get similar results.  

Thanks, Tom, that explains it.  EXPLAIN ANALYZE with the re-written
WHERE brings the inline version down to 8.5 seconds, still twice as
slow, but that's a heck of a lot better than 122 times as slow!  :)

 Of course, since this isn't anything except inlining the function
 into the query, it's probably not all that exciting to you.

Not terribly, but it's good to discover the function version is twice
as fast.  (not to mention that the function is much easier to read.)

  Can anyone answer when range intervals will be implemented for window
  functions, as in the quoted select at the top of this message?
 
 Nope.  There was a patch submitted, it was rejected on a couple of
 grounds, and I don't know if anyone is actively working on the problem
 or not.

Bummer.  I may go ask in hackers.

Thanks,

Dan



[GENERAL] range intervals in window function frames

2010-12-12 Thread Daniel Popowich

Hello all!

I need to do moving averages over time series data and was hoping
window functions could solve the problem for me, but it doesn't look
like 8.4 or even 9.0 implementations are quite there, yet.

Currently, if I have this table:

  create table sample (
      ts    timestamp,
      value integer
  );
  create index sample_ts on sample (ts);

and say I want a moving average of value over a fixed interval of five
minutes (note that this could mean varying numbers of records in each
frame), then I can do this:

  select *, avg_over_interval(ts, interval '5 min') from sample order by ts;
  
Where avg_over_interval() is defined like this:

  create or replace function avg_over_interval(timestamp, interval)
  returns numeric as $$
     select avg(value) from sample where (($1-$2) <= ts) and (ts <= $1);
  $$ language sql;

What I would LIKE to do is this:

  select *, avg(value) over(order by ts range (interval '5 min') preceding)
 from sample order by ts;

Which is way cleaner and, I assume, more efficient.
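
For the record, this frame type did eventually land: PostgreSQL 11
spells it like so (a sketch; not available in 8.4/9.0):

    select ts, value,
           avg(value) over (order by ts
                            range between interval '5 min' preceding
                                      and current row)
      from sample
     order by ts;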

Questions:

  1) Is there active work on window functions with frames over
 interval ranges?

  2) If not, how can I help with that?

  3) Until the functionality is in 9.x, can I make what I'm doing more
 efficient?  Is there a better way to do this without window
 functions?  (I tried to come up with a subquery in my
 select-list, but couldn't come up with legal syntax to reference
 the static value of ts for the current row in the subselect.)

Thanks all for your help.

Dan



[GENERAL] order of trigger firing relative to column/table constraints

2009-12-17 Thread Daniel Popowich

I am designing a DB where column/table constraints are not sufficient
for data integrity (e.g., guaranteeing non-recursive tree graphs) so
I'm writing my first complicated triggers and I have been searching
docs and archives of this list for detailed information on when
triggers are fired relative to constraints created with CREATE TABLE
and/or ALTER TABLE.


For example, in what phase are CHECK and FOREIGN KEY constraints (as
well as NOT NULL and UNIQUE constraints) checked relative to the
order of trigger firing?  Documentation clearly outlines custom
trigger order as:

  1.  before-statement

  2.  before-row
      [before-row ...]

  --data now visible--

  3.  after-row
      [after-row ...]

  4.  after-statement

For example, it was at first surprising to discover in my before-row
trigger that foreign key constraints had not yet been checked (I had
assumed I could take them for granted in my trigger, since I had
defined the column constraint).  This means the foreign key checking
has to be done twice: once in my custom trigger in the before-row
phase (because my logic requires it there) and again when the foreign
key column constraint is checked (whenever that is).

In summary, I have one general question and two specific questions:

General: is the order of column/table constraint checking relative
to custom triggers documented anywhere in detail?

Specific #1: Is there a way to control the order of column/table
constraints relative to custom triggers?

Specific #2: If, say, CHECK and FOREIGN KEY constraints are checked
in the after-row phase (or just before it, but after the before-row
phase), then I need to move my custom constraint logic to the
after-row phase to take advantage of them.  Is raising an error the
only way to abort the insert or update from there?  (If my logic is
in the before-row phase, which is an intuitive place to put it, I
could just return null, but that won't work in the after-row phase.)
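
For what it's worth, here is a sketch of the after-row shape this
implies, against a hypothetical node(id, parent_id) table:

    CREATE OR REPLACE FUNCTION enforce_no_cycles() RETURNS trigger AS $$
    BEGIN
        -- walk up from the new parent; if we ever reach the new row's
        -- own id, this insert/update would close a loop.  UNION (not
        -- UNION ALL) dedupes rows, so the walk terminates even if a
        -- cycle has already slipped into the data.
        IF EXISTS (
            WITH RECURSIVE up AS (
                SELECT parent_id FROM node WHERE id = NEW.parent_id
                UNION
                SELECT n.parent_id FROM node n JOIN up ON n.id = up.parent_id
            )
            SELECT 1 FROM up WHERE parent_id = NEW.id
        ) THEN
            RAISE EXCEPTION 'node % would create a cycle', NEW.id;
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER ROW triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER node_no_cycles
        AFTER INSERT OR UPDATE ON node
        FOR EACH ROW EXECUTE PROCEDURE enforce_no_cycles();

Raising the exception is the abort mechanism here; returning NULL
only suppresses the operation in the before-row phase.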


BTW, if I can get decent info, I'd be willing to volunteer to
contribute well written documentation to give back to the cause.


Thanks,

Dan
