Re: [HACKERS] W3C Specs: Web SQL

2010-11-09 Thread Sam Mason
On Mon, Nov 08, 2010 at 12:55:22PM -0300, Alvaro Herrera wrote:
 Excerpts from Charles Pritchard's message of sáb nov 06 23:20:13 -0300 2010:
 
  Simple async sql sub-set (the spec in trouble):
  http://dev.w3.org/html5/webdatabase/
 
 This is insane.  This spec allows the server to run arbitrary SQL
 commands on the client, AFAICT.  That seems like infinite joy for
 malicious people running webservers.  The more powerful the dialect of
 SQL the client implements, the more dangerous it is.

How is this different from the server asking the client to run an
infinite loop in javascript?

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Sam Mason
On Sat, Oct 30, 2010 at 01:01:44PM -0400, Tom Lane wrote:
 marcin mank marcin.m...@gmail.com writes:
  This is what boost does:
  http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html
 
 Hmm.  I am reminded of Knuth's famous dictum: never generate random
 numbers with a method chosen at random.  Is there any actual theory
 behind that algorithm, and if so what is it?  The combination of
 shifting with addition (not xor) seems more likely to lead to weird
 cancellations than any improvement in the hash behavior.

As far as I can tell the boost combiner comes from:

  http://goanna.cs.rmit.edu.au/~jz/fulltext/jasist-tch.pdf

Which seems to be solving a completely different problem; I can't see
how relevant it is to the design of a hash combiner.  The fact that
they get trivial details wrong, like 32-bit integers going up to 2^32
rather than 2^32-1, doesn't inspire confidence.
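
For concreteness, the combiner being discussed is small enough to
transliterate; a Python sketch of boost's hash_combine (with explicit
32-bit masking, since Python integers are unbounded) looks like this:

```python
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound

def hash_combine(seed, value_hash):
    # boost's combiner: golden-ratio constant plus shifted copies of the
    # seed, mixed in with addition and then xor'd into the seed
    mixed = (value_hash + 0x9E3779B9 + ((seed << 6) & MASK) + (seed >> 2)) & MASK
    return (seed ^ mixed) & MASK

def hash_sequence(hashes, seed=0):
    # fold the combiner over a sequence of element hashes
    for h in hashes:
        seed = hash_combine(seed, h)
    return seed
```

One observable property is that the combiner is order-sensitive:
hash_sequence([1, 2]) and hash_sequence([2, 1]) differ, which a plain
xor of the element hashes would not give you.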

One subtle point I noticed on the boost mailing list was that they
didn't want [[1],2] hashing to the same value as [1,2].  I'm pretty sure
that this sort of construction isn't possible to express in PG, but
thought I should mention it just in case.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Query optimization problem

2010-07-21 Thread Sam Mason
On Tue, Jul 20, 2010 at 09:57:06AM +0400, Zotov wrote:
  SELECT d1.ID, d2.ID
  FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
  WHERE (d1.ID=234409763) or (d2.ID=234409763)

You could try rewriting it to:

SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE d1.ID=234409763
  UNION
SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE d2.ID=234409763

This should have the same semantics as the original query.  I don't
believe PG knows how to do a rewrite like this at the moment.
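
The equivalence rests on distributing the OR over the join: the rows
satisfying `A OR B` are exactly the union of the rows satisfying `A`
and those satisfying `B` (UNION rather than UNION ALL, so the set
semantics line up).  A toy Python sketch over hypothetical data:

```python
# Hypothetical rows of DocPrimary as (id, based_on) pairs.
doc = [(1, None), (2, 1), (3, 1), (234409763, 3), (4, 234409763)]
target = 234409763

def self_join():
    # d1 JOIN d2 ON d2.based_on = d1.id, yielding (d1.id, d2.id) pairs
    return [(d1, d2) for (d1, _) in doc for (d2, based) in doc if based == d1]

# the original disjunctive WHERE clause
with_or = {row for row in self_join() if row[0] == target or row[1] == target}
# the rewritten form: union of the two single-condition queries
as_union = ({row for row in self_join() if row[0] == target}
            | {row for row in self_join() if row[1] == target})
```

with_or and as_union contain the same rows, which is why each UNION arm
could in principle be evaluated with its own index scan.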

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Specification for Trusted PLs?

2010-05-28 Thread Sam Mason
On Thu, May 27, 2010 at 11:09:26PM -0400, Tom Lane wrote:
 David Fetter da...@fetter.org writes:
  I don't know about a *good* idea, but here's the one I've got.
 
  1.  Make a whitelist.  This is what needs to work in order for a
  language to be a fully functional trusted PL.
 
 Well, I pretty much lose interest right here, because this is already
 assuming that every potentially trusted PL is isomorphic in its
 capabilities.

That's not normally a problem.  The conventional way would be to place
the interpreter in its own sandbox, similar to how Chrome runs each tab
in its own process.  Such a process is locked down so that the code
running inside it can't do any harm--e.g. via a ptrace jail[1].  This
is quite a change from existing PL implementations, and presents a
different set of performance/compatibility issues.

 If that were so, there'd not be very much point in
 supporting multiple PLs.  A good example here is R.  I have no idea
 whether PL/R is trusted or trustworthy, but in any case the main point
 of supporting that PL is to allow access to the R statistical library.
 How does that fit into a whitelist designed for some other language?
 It doesn't.

AFAIU, a trusted language should only be able to perform computation:
it should not be able to touch the local filesystem (beyond read-only
access to library code) nor see the network.  Policies like these are
easy to enforce in a ptrace jail, and would still allow a trusted PL/R
to do whatever it wants to get any pure calculation done.  As soon as
it needs to touch the file system the language becomes non-trusted.
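
As a toy illustration of the "computation only" policy (hypothetical
code, mine: a real sandbox would enforce this at the OS level as in the
ptrace-jail paper, since Python's eval is escapable and is NOT a secure
boundary), the shape of the whitelist idea is:

```python
# Hypothetical sketch only: "pure computation" expressed as a restricted
# evaluator whose environment contains nothing but whitelisted functions.
def eval_pure(expr):
    safe = {"__builtins__": {}, "abs": abs, "min": min, "max": max, "sum": sum}
    return eval(expr, safe)
```

eval_pure("abs(-3) + max(1, 2)") evaluates fine, while
eval_pure("open('/etc/passwd')") fails with a NameError because nothing
touching the filesystem is in the environment.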

  3.  (the un-fun part) Write tests which attempt to do things not in
  the whitelist.  We can start from the vulnerabilities so far
  discovered.
 
 And here is the *other* fatal problem: a whitelist does not in fact give
 any leverage at all for testing whether there is access to functionality
 outside the whitelist.  (It might be useful if you could enforce the
 whitelist at some sufficiently low level of the language implementation,
 but as a matter of testing, it does nothing for you.)  What you're
 suggesting isn't so much un-fun as un-possible.  Given a maze of twisty
 little subroutines all different, how will you find out if any of them
 contain calls of unwanted functionality?

A jail helps with a lot of this; the remainder is the usual fact that
testing can only demonstrate the presence of bugs, and you need formal
proof to establish their absence.

-- 
  Sam  http://samason.me.uk/
 
 [1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.5494



Re: [HACKERS] more support for various frame types of window functions

2009-11-09 Thread Sam Mason
On Mon, Nov 09, 2009 at 10:01:53AM -0500, Tom Lane wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  One idea is to take a copy of the state datum after each row. Then,
  instead of initializing the aggregate from scratch, you can roll back
  to an earlier copied state. It doesn't always help, but might be a part
  of the solution.
 
 That requires that you know how to copy the aggregate's state.  You do
 not.  (In some cases the aggregate function has extra state besides the
 nominal transition datum...)

Other ways of doing aggregates/folds involve exploiting a tree like
structure (mostly for reasons of parallelism, but these don't apply to
PG).  For example, instead of having just the triple (init,accum,final)
as we do at the moment, you could also have a merge function that takes
two intermediate states and combines them.  For COUNT and SUM the merge
function would be addition; for MIN and MAX it would be the least and
greatest functions respectively; and so on.

If this merge function exists then you can do the fancy stuff as
suggested; if it doesn't, fall back to less optimal methods.
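
Sketched in Python (the names are mine, not a proposed PG API): each
aggregate carries (init, accum, merge, final), and a tree-shaped fold
that combines intermediate states via merge reaches the same result as
the usual sequential fold:

```python
# An aggregate as the 4-tuple (init, accum, merge, final).
SUM   = (0, lambda s, x: s + x, lambda a, b: a + b, lambda s: s)
COUNT = (0, lambda s, _: s + 1, lambda a, b: a + b, lambda s: s)
MAX   = (float("-inf"), lambda s, x: max(s, x), max, lambda s: s)

def fold_seq(agg, items):
    # the current PG shape: feed every row through accum in order
    init, accum, merge, final = agg
    state = init
    for x in items:
        state = accum(state, x)
    return final(state)

def fold_tree(agg, items):
    # split, fold the halves independently, then merge the two states
    init, accum, merge, final = agg

    def go(chunk):
        if len(chunk) <= 1:
            state = init
            for x in chunk:
                state = accum(state, x)
            return state
        mid = len(chunk) // 2
        return merge(go(chunk[:mid]), go(chunk[mid:]))

    return final(go(items))
```

The merge step is what would let a window-frame implementation reuse
previously computed partial states instead of restarting from init.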

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Specific names for plpgsql variable-resolution control options?

2009-11-08 Thread Sam Mason
On Fri, Nov 06, 2009 at 07:09:46PM -0800, David E. Wheeler wrote:
 On Nov 6, 2009, at 6:57 PM, Tom Lane wrote:
 If we do that, presumably the per-function syntax would be
  #variable_conflict variable_first
 and so on, which is clear enough but might be thought a bit verbose
 for something people might be pasting into hundreds of functions.
 
 I suspect that most folks will set the GUC and few will actually use  
 it in functions.

Just to be clear about the semantics; what should happen if the user
doesn't specify a value for the function?  Should PG remember the GUC
value at creation time, or pull it in at invocation time?  I'd lean
towards fixing it at function creation time as it'd be one more caveat
for security definer functions otherwise.
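
The difference between the two choices is essentially closure capture
versus dynamic lookup; a small Python sketch (names hypothetical):

```python
# A stand-in for the GUC visible at function-creation/invocation time.
guc = {"variable_conflict": "error"}

def create_fn_capture():
    setting = guc["variable_conflict"]       # snapshot at creation time
    return lambda: setting

def create_fn_dynamic():
    return lambda: guc["variable_conflict"]  # read at invocation time

f_capture = create_fn_capture()
f_dynamic = create_fn_dynamic()
guc["variable_conflict"] = "use_variable"    # setting changes later
```

After the change, f_capture() still reports the creation-time value
while f_dynamic() follows the current setting; the security-definer
caveat is that the dynamic variant lets the caller's setting leak into
the function's behavior.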

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] half OOT, plv8js group created ^^

2009-10-27 Thread Sam Mason
On Tue, Oct 27, 2009 at 08:30:16AM -0400, Andrew Dunstan wrote:
 If someone is going to work on a JS engine for PostgreSQL (which I think 
 is a good idea, actually) I want them to work on one that is likely to 
 succeed.

The project (at the moment) just seems to be a set of pointers to code
I threw together a couple of years ago to experiment with extending
Postgres.  It's based on an oldish release of Spidermonkey, and it'll
be interesting to see where things go with this.  My priorities changed
and I lost personal interest in it, so it would be nice to see it
picked up by someone.

-- 
  Sam  http://samason.me.uk/



Re: [GENERAL] [HACKERS] libpq port number handling

2009-09-25 Thread Sam Mason
On Fri, Sep 25, 2009 at 09:29:24AM +0300, Peter Eisentraut wrote:
 On Thu, 2009-09-24 at 20:36 -0400, Tom Lane wrote:
  BTW, are port numbers still limited to 16 bits in IPv6?
 
 Port numbers are in TCP, not in IP.

I'd checked that it should work with IPv6, but I hadn't realized that
it was because ports were at a different level of abstraction.  This
mailing list is good for otherwise obscure details like that!

-- 
  Sam  http://samason.me.uk/



[HACKERS] libpq port number handling

2009-09-24 Thread Sam Mason
On Thu, Sep 24, 2009 at 07:57:55PM +0100, Sam Mason wrote:
  postg...@sussy:/root createuser -D -p ricky
 
 I don't think you want to be passing -p here; it's saying to use
 ricky as the port number, which fails (sounds like a bug if it doesn't
 complain about this) giving a port number of zero

Hum, why is PG doing an (unchecked) atoi on the user-specified port
rather than leaving it up to getaddrinfo to resolve the port?  That
would seem to require changing UNIXSOCK_PATH to accept a string as the
port number, which is probably too big a change.
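
For comparison, getaddrinfo accepts the port as a string and resolves
it itself; a Python sketch of the suggested approach:

```python
import socket

def resolve_port(service):
    # service is a string: "5432" is parsed numerically, while a name
    # such as "postgresql" would be looked up in the services database
    # (only if the local /etc/services defines it).
    infos = socket.getaddrinfo(None, service, type=socket.SOCK_STREAM,
                               flags=socket.AI_PASSIVE)
    return infos[0][4][1]  # port field of the first returned sockaddr
```

With this approach a non-numeric argument either resolves to a real
service or fails with a clear lookup error, rather than silently
becoming port zero.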

The included patch doesn't feel very nice, but is probably more
acceptable.

-- 
  Sam  http://samason.me.uk/
--- src/interfaces/libpq/fe-connect.c~	2009-06-11 15:49:13.0 +0100
+++ src/interfaces/libpq/fe-connect.c	2009-09-24 20:48:53.0 +0100
@@ -817,7 +817,16 @@
 
 	/* Set up port number as a string */
 	if (conn->pgport != NULL && conn->pgport[0] != '\0')
+	{
 		portnum = atoi(conn->pgport);
+		if (portnum < 1 || portnum > 65535)
+		{
+			appendPQExpBuffer(&conn->errorMessage,
+			  libpq_gettext("invalid port number \"%s\" specified\n"),
+			  conn->pgport);
+			goto connect_errReturn;
+		}
+	}
 	else
 		portnum = DEF_PGPORT;
 	snprintf(portstr, sizeof(portstr), "%d", portnum);



Re: [HACKERS] [BUGS] BUG #5053: domain constraints still leak

2009-09-15 Thread Sam Mason
On Mon, Sep 14, 2009 at 11:20:59PM -0400, Tom Lane wrote:
 There is some moderately interesting reading material in section
 4.17.4 Domain constraints of SQL:2008.

Not sure where to look for a copy of that, nor any particularly helpful
links :(

 In particular, it appears to
 me that the standard goes out of its way to NOT claim that every value
 that is of a domain type satisfies the domain's constraints.  It looks
 to me that the implementation they have in mind is that domain
 constraints are to be checked:
 
 (1) when a value is assigned to a *table* column having that domain type;

 (2) when a value is converted to that domain type by an *explicit*
 cast construct;
 
 (3) nowhere else.

I struggle to get any useful meaning out of the SQL specs, but that
sounds about right to me.

 If I'm reading this right, it sidesteps most of the concerns we have
 been worrying about here, at the cost of being perhaps more surprising
 and less useful than one would expect.

It means that domains are a world away from ADTs (abstract data types)
and just seem to function as quick templates for creating new columns.
PG seems to be treating domains as ADTs at the moment, which is the
abstraction that's proved to be more useful in larger programming
projects.

 It would also mean that a lot
 of our existing domain behavior is wrong.  I think there is ammunition
 here for an argument that, in effect, values in flight in expression
 or query evaluation should always be considered to be of base types,
 and domain constraints should only be checked when assigning to a
 persistent storage location such as a table field or plpgsql variable
 (plus the special case for CAST constructs).

Are you considering changing PG's behavior here?  And if so, what
would happen to existing behavior?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] [BUGS] BUG #5053: domain constraints still leak

2009-09-15 Thread Sam Mason
On Tue, Sep 15, 2009 at 05:13:21AM +0100, Andrew Gierth wrote:
 But there's a kicker: in Subclause 6.12, cast specification, in the
 General Rules is:
 
  a) If the cast operand specifies NULL, then the result of CS is
 the null value and no further General Rules of this Subclause
 are applied.
 
 That no further General Rules clause implies (assuming it's not a
 blatant mistake in the spec) that this rule is therefore skipped in
 the case of nulls:

I think the NOT NULL constraint is a PG-specific constraint; I can't
see how it's allowed in the spec.  Then again, I have trouble parsing
the spec so could well be wrong about this.

The NOT NULL constraint feels wrong as well; what are the semantics of:

  CREATE DOMAIN d AS INTEGER NOT NULL;
  SELECT a.n AS aa, b.n AS bb
  FROM (VALUES (CAST(1 AS d)),(2)) a(n)
LEFT JOIN (VALUES (CAST(1 AS d))) b(n) ON a.n = b.n;

in the presence of it?  I'm expecting aa and bb both to come out as
domain d, but this shouldn't work with what you're saying the current
semantics should be.
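
To spell out the problem in miniature (a toy model of mine, not PG's
actual checking machinery): a LEFT JOIN has to manufacture NULLs for
unmatched rows, so a NOT NULL domain check applied to every in-flight
value would reject rows that are never being stored anywhere:

```python
def check_domain_d(v):
    # the domain's NOT NULL constraint, applied to an in-flight value
    if v is None:
        raise ValueError("domain d does not allow NULL")
    return v

a_ids = [1, 2]
b_ids = [1]
# LEFT JOIN a to b on equality: unmatched rows get None on the b side.
joined = [(a, a if a in b_ids else None) for a in a_ids]
```

joined is [(1, 1), (2, None)], and running check_domain_d over the bb
column raises even though the query itself is perfectly reasonable.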

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] [BUGS] BUG #5053: domain constraints still leak

2009-09-15 Thread Sam Mason
On Tue, Sep 15, 2009 at 02:54:18PM +0100, Andrew Gierth wrote:
 the spec _does_ appear to allow CHECK(VALUE IS NOT NULL) as a
 domain constraint (in general the spec defines NOT NULL constraints
 this way),

Huh, that's a trivial rewrite, isn't it?  Not sure why it didn't occur
to me that it's just syntactic sugar.

 and the wording from 6.12 implies that that check is still
 skipped in the case of NULLs (so that constraint would stop you
 inserting a null into a table column (I think), but not from casting a
 null value to the domain type).

Explicitly ignoring NULL values in CAST expressions seems like a good
feature as well.  Although it gives me the feeling that domains are more
and more like a mis-designed feature.

  Sam == Sam Mason s...@samason.me.uk writes:
  Sam The NOT NULL constraint feels wrong as well, 

 I think that's just another example of Tom's initial comment about how
 broken domain not null constraints are currently.

Hum, given that it's just sugar for more general constraints I'm not
sure if it's the not null constraints that are broken or just the
current interpretation of them.  They would do the right thing if they
were only checked in a limited number of places that the user was aware
of, which the spec seems to imply is when the user explicitly asks for a
CAST to be performed or when writing into the table.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] COALESCE and NULLIF semantics

2009-09-11 Thread Sam Mason
On Fri, Sep 11, 2009 at 12:59:04PM -0400, Tom Lane wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
  I'm only proposing parse-time changes for conditional
  expressions -- the CASE predicate and its abbreviations.
 
 No, you are not; you are proposing run-time changes, specifically the
 need to coerce unknown to something else long after the point where
 the unknown is just a literal constant.

One thing I've just realized these discussions have pointed out is
that PG isn't doing the correct thing all the time with types.  When
is it ever valid to see an unknown after type checking?  AFAICT, it
shouldn't ever appear and hence doing:

  CREATE VIEW v AS SELECT 'foo';

Should be doing the normal default to TEXT type.  Is that right?  Or
does unknown have more meaning besides just being something that needs
to be fixed up during type checking?

 I've been wondering whether it would be sensible to make the
 composite-datum constructors check for all-null fields and generate
 a plain NULL if so.  If so then ROW(NULL,NULL) would be
 indistinguishable from NULL and the semantic gripes seem to largely
 go away.  It would be a problem for anyone who actually wanted to
 distinguish those two cases, but how much do we care?

I'd prefer these semantics; it would make it do the right thing in
more cases than now.
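
The proposed constructor semantics are easy to state as a sketch (my
names, just to pin the behavior down):

```python
def row(*fields):
    # proposed semantics: a composite whose fields are all NULL collapses
    # to plain NULL, making ROW(NULL, NULL) indistinguishable from NULL
    if fields and all(f is None for f in fields):
        return None
    return tuple(fields)
```

A composite with any non-NULL field survives unchanged, so only the
all-NULL case becomes indistinguishable from NULL.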

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] COALESCE and NULLIF semantics

2009-09-11 Thread Sam Mason
On Fri, Sep 11, 2009 at 06:24:22PM +0100, Sam Mason wrote:
 One thing I've just realized these discussions have pointed out is
 that PG isn't doing the correct thing all the time with types.  When
 is it ever valid to see an unknown after type checking?  AFAICT, it
 shouldn't ever appear and hence doing:
 
   CREATE VIEW v AS SELECT 'foo';
 
 Should be doing the normal default to TEXT type.  Is that right?  Or
 does unknown have more meaning besides just being something that needs
 to be fixed up during type checking?

Doh, sorry I shouldn't have sent that.  I wanted to spend some time to
see if I could find any other examples, but I hit send by accident.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] COALESCE and NULLIF semantics

2009-09-11 Thread Sam Mason
On Fri, Sep 11, 2009 at 12:26:45PM -0500, Kevin Grittner wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
  if that weren't true then we wouldn't be arguing about whether
  COALESCE is wrong.
  
 Yeah, I am.  When you have queries built based on which fields on a
 QBE window are filled by a user, it's not hard to come up with a
 clause like:
  
 AND (somedate > COALESCE(NULL, NULL) OR ...)
  
 We solved this by modifying our framework to pass down metadata about
 the values in addition to the values themselves.

You need a *much* more invasive change to fix this.  PG's type checker
only looks one level deep when choosing what types to replace unknown
with; what you want is full type inference, as only that will let you
track back up the layers and assign consistent types to arbitrary
expressions like the above.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] COALESCE and NULLIF semantics

2009-09-11 Thread Sam Mason
On Fri, Sep 11, 2009 at 01:37:00PM -0400, Tom Lane wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
  Yeah, I am.  When you have queries built based on which fields on a
  QBE window are filled by a user, it's not hard to come up with a
  clause like:
  
  AND (somedate > COALESCE(NULL, NULL) OR ...)
  
 Right.  The only real way to fix that is to propagate the later
 discovery that type 'date' would be preferred back to the inputs of the
 COALESCE, which is what Sam Mason has been on about (IIUC).

Yup; sounds right.  The fact that the inputs to COALESCE here are just
simple NULL literals is making the example seem too simple.  The general
case is that of them being an arbitrary expression and you somehow need
to get the DATE type all the way back up to the top literal and make
sure that no other branch uses it as anything else.  This was what my
example with using a sub-select was about, maybe this would be better
though:

  SELECT date '2001-1-1' > COALESCE(NULL,v)
  FROM (SELECT NULL) x(v)
  WHERE v = 10;

This should fail to type check; v is being treated as both a DATE and
an INT.  Getting a compiler to do this is pretty easy (store the type
constraints somewhere during type checking and then make sure they all
match at the end), but isn't the way PG works at the moment.

 I'm afraid
 that such a thing would make the behavior even more full of surprises
 than what we have now.  Resolving unknown from context is already
 action at a distance, as it were, and the longer the distance involved
 the more chance for unexpected behavior.  Not to mention the
 implementation difficulties.

Most of my experience says that type inference actually makes things
easier.  If you're dealing with dynamically *checked* languages then I
can see where your comment comes from; but for typed languages, where
everything has to be correct at compile time, this doesn't seem to be
true.  This is why people who use Haskell tend to be the ones saying
things like "when it type checks you can be pretty certain it's going
to work".  I'm not saying we should go that complicated, just that in
my experience more complicated type systems imply simpler and more
understandable runtime behavior.

Implementation is a bit harder, but it won't be much more complicated
than what PG already has.  It's already dealing with most of the issues
(in a somewhat ad-hoc way) and I'd expect that getting type-inference in
would help clean other things up a bit.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] COALESCE and NULLIF semantics

2009-09-11 Thread Sam Mason
On Fri, Sep 11, 2009 at 12:41:21PM -0500, Kevin Grittner wrote:
 Sam Mason s...@samason.me.uk wrote:
  
  what you you want is full type-inference as it's only that which
  will allow you to track back up the layers and assign consistent
  types to arbitrary expressions like the above.
  
 Well, obviously that would fix it; I'm not clear on why *only* that
 would fix it.

Because, I think, if you did come up with another solution and gave it
another name most type-theorists would call it type-inference anyway.

Type inference is just a general idea and is implemented in lots of
different ways depending on the specifics of the problem.  You could
argue that PG has a limited form of type inference already.

 It seemed to me that we wouldn't have to go back up
 like that if we deferred the assignment of a type in conditional
 expressions.  I've only scanned that part of the code, so it's well
 within the range of possibility that I misunderstood something, but I
 thought the type assigned to a CASE or COALESCE is used in the context
 of evaluating enclosing expressions on the way *down*, no?

Maybe we're using different terms; but when a literal is declared you
don't know what type it is, just that it needs at most one.  It's only
later on when the variable is actually used that you find out what its
type constraints are.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] RfD: more powerful any types

2009-09-09 Thread Sam Mason
On Wed, Sep 09, 2009 at 03:23:52PM -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  What we need is a system where base types are represented
  by an OID, but derived types (list and functional types) are built up
  using type constructors that take other types as arguments.
 
 This is SQL, not Haskell.  What you suggest seems about two orders of
 magnitude more complex than real-world applications could justify.

Parametric polymorphism has made it well into the mainstream now:
Java, C#, and many other mainstream languages have it; it's not just
ML and Haskell any more.  Dismissing it as being of esoteric academic
interest alone seems a little misplaced.

It would also tidy up a lot of the features that already exist in PG.
Arrays in PG already effectively have a type parameter; why not extend
this to normal users' code?  Even staying within the types in PG, I've
wanted to use the geometric functions parametrized over integer and
numeric types before; fixing them to double precision seems
unfortunate.
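
As a small illustration of the reuse at stake (Python standing in for
what a parametrized SQL type might give): one definition of a
geometric-style function usable at several numeric types, rather than
being fixed to double precision:

```python
from fractions import Fraction

def dot(a, b):
    # one parametric definition, usable at int, float, Fraction, ...
    return sum(x * y for x, y in zip(a, b))
```

The same code gives exact results at int and Fraction, and approximate
ones at float, without three separate implementations.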

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] COALESCE and NULLIF semantics

2009-09-09 Thread Sam Mason
On Wed, Sep 09, 2009 at 10:25:34AM -0400, Tom Lane wrote:
 Now admittedly there's probably not any major technical obstacle to
 making a runtime conversion happen --- it's merely delayed invocation of
 the destination type's input function.  But I find it really ugly from a
 theoretical point of view.  Doing calculations with unknown values
 just seems wrong.

It's pretty grim.  This seems to be some strange halfway house on the
way to real type inference, with broken semantics to boot.  How would
it prevent weirdos like:

  SELECT 'msg'||v, date_trunc('year',v), v+10
  FROM (SELECT 'hi ho') x(v);

If Kevin wants something like this it should either be real type
inference or be left as it is.  The problem is that each variable
should have exactly one type; type inference lets you delay finding
that type until later down the tree and then back-propagate it up
again.  Your suggestion seems to allow a variable to have as many
types as it wants.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] 8.5 release timetable, again

2009-08-27 Thread Sam Mason
On Thu, Aug 27, 2009 at 11:29:42AM -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  Well, I wasn't suggesting adding a lot more testing of things that
  we're already testing.  I was assuming that we would craft the
  additional tests to hit areas that we are not now covering well.  My
  point here is only to what Peter said upthread: we want to be able to
  get positive results rather than waiting for enough negative results
  (whatever that means).  To get positive results, you must have a test
  suite.  While letting beta testers test whatever they want has some
  value, testing things we think might be likely hiding places for bugs
  (such as WAL recovery) has merit, too.  Making those tests
  well-organized and easily repeatable is, IMHO, a Good Thing.
 
 The problem here is the easily repeatable bit.  Almost by definition,
 easily repeatable tests don't find hard-to-reproduce problems.  I don't
 mean to suggest that they're without value, but they are no substitute
 for beta testers doing unpredictable things.

I've wondered before about using a system emulator to snapshot the disk
on each write (I'd expect you could put some pretty low-level hooks in
with qemu and gdb) and then run each snapshot in another system to make
sure that the transaction is either rolled back or committed as
appropriate.  I guess this would take a while to run but may help catch
some obscure bugs.  Or is this an area that's well tested already?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Bug in date arithmetic

2009-08-24 Thread Sam Mason
On Mon, Aug 24, 2009 at 09:12:07AM -0700, David Fetter wrote:
 shac...@shackle:5432=# SELECT now() + '90';
?column?
 ---
  2009-09-03 19:03:43.195514-07
 (1 row)
 
 shac...@shackle:5432=# SELECT now() - '90';
 ERROR:  date/time field value out of range: 90
 HINT:  Perhaps you need a different datestyle setting.
 
 I'd be tempted to call it a flat-out bug in the first case.  Adding a
 string literal to a timestamptz should just result in an error, IMHO.

But you're not adding a string literal, you're adding an interval of
90 seconds, or 250 hours!  This is the weirdness resulting from the
lack of difference in syntax between string literals and other
literals, i.e.:

  SELECT timestamptz '2000-01-01';
= 2000-01-01 00:00:00+00
  SELECT timestamptz '2000-01-01' + '90';
=  2000-01-11 10:00:00+00
  SELECT timestamptz '2000-01-01' + interval '90';
=  2000-01-11 10:00:00+00
  SELECT timestamptz '2000-01-01' + 90;
= no operator matches (rhs is of type INT)
  SELECT timestamptz '2000-01-01' + text '90';
= no operator matches (rhs is of type TEXT)

It seems to be deciding the unknown type is of type date with
subtraction for some reason:

  SELECT timestamptz '2000-01-01' - '90';
= date/time field value out of range
  SELECT timestamptz '2000-01-01' - date '90';
= date/time field value out of range

Whereas you were expecting it to be using an interval as before:

  SELECT timestamptz '2000-01-01' - interval '90';
=  1999-12-21 14:00:00+00

A timestamptz + date operator doesn't exist though, so it wouldn't be
able to pick it, which is why you were seeing this odd behavior.

Not sure what good fixes would be; a couple of simple (and bad) ones
would be:

  1) assume literals are of type text, unless otherwise specified.

  2) require the type of the literal to be specified if there is any
  ambiguity

The second option is nicer, but defining "any ambiguity" doesn't seem
possible in general; worse, it would seem to change over time as
operators/types were added/removed, causing things that used to work to
start breaking.  Then again, they would now too, so I'm not sure why
this would be worse.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] DELETE syntax on JOINS

2009-08-24 Thread Sam Mason
On Mon, Aug 24, 2009 at 01:41:28PM -0400, Tom Lane wrote:
 The real problem with this is that all the
 good candidates for the reserved word are things people are probably
 already using as aliases, so we'd have a large risk of breaking existing
 queries.  We could avoid that with a sufficiently ugly choice like
 
 DELETE FROM target t USING DELETE_TARGET LEFT JOIN other_table ot ON ...
 
 but yech ...

  PRIMARY or TABLE?

Both are pretty grim, but I think they're reserved at the moment.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Slaying the HYPOTamus

2009-08-24 Thread Sam Mason
On Mon, Aug 24, 2009 at 07:07:13AM -0700, David Fetter wrote:
 These next two lines are a teensy bit baroque.  Is there some
 significant speed increase that would justify them?

Just noticed with your revised code that the following check:

 On Mon, Aug 24, 2009 at 11:14:19PM +1000, Paul Matthews wrote:
  if (x == 0.0)
  return 0.0;
  else {
  yx = y/x;

is preventing a divide by zero on the line above.  So it's not a
performance hack, it's just allowing it to remain correct as a result of
changing the maths around.

  return x*sqrt(1.0+yx*yx);
  }
  }

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Slaying the HYPOTamus

2009-08-24 Thread Sam Mason
On Mon, Aug 24, 2009 at 06:59:38PM +0100, Sam Mason wrote:
  On Mon, Aug 24, 2009 at 11:14:19PM +1000, Paul Matthews wrote:
   if (x == 0.0)
   return 0.0;
   else {
   yx = y/x;
 
 is preventing a divide by zero on the line above.  So it's not a
 performance hack, it's just allowing it to remain correct as a result of
 changing the maths around.

I've also just realized why it's safe to return zero here; y contains
the smaller number and so if x is zero, y must be zero as well.

-- 
  Sam  http://samason.me.uk/



[HACKERS] docs for random function

2009-08-16 Thread Sam Mason
Hi,

The docs for the function random() currently say:

  random value between 0.0 and 1.0, inclusive

this isn't a particularly useful invariant, what the code actually does
(and is commented as doing) is much more useful.  Any chance of changing
it to something like:

--- doc/src/sgml/func.sgml~ 2009-06-17 22:58:49.0 +0100
+++ doc/src/sgml/func.sgml  2009-08-16 13:28:18.814767110 +0100
@@ -821,7 +821,7 @@
   row
entryliteralfunctionrandom/function()/literal/entry
entrytypedp/type/entry
-   entryrandom value between 0.0 and 1.0, inclusive/entry
+   entryrandom value in the range [0.0 - 1.0)/entry
entryliteralrandom()/literal/entry
entry/entry
   /row

or maybe something more readable to non-maths people?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] GRANT ON ALL IN schema

2009-08-16 Thread Sam Mason
On Sun, Aug 16, 2009 at 03:57:08PM +0300, Peter Eisentraut wrote:
 On 2009-08-16 at 00:04 -0400, Andrew Dunstan wrote:
  SQL is not Lisp. Simple is  good. I didn't think Peter was really very 
  serious.
 
 Well, I don't know if we really need to call it lambda, but I fully
 expect to be able to use these ad hoc functions as part of other
 expressions.  So making DO or whatever a top-level command that does not
 integrate with anything else would not really satisfy me.

Wow, I didn't think you were serious either!

One thing that would make my life easier would be easier one-off custom
aggregations, this would seem to be a nice stepping stone towards that.
For instance the following agg function would have similar semantics
to fold, as found in functional languages.

  SELECT agg(LAMBDA (text,text) $$ SELECT $1||coalesce($2,''); $$,
 '', s)
  FROM (VALUES
('aa'),
('bb')) x(s);

I'd expect to get 'aabb' back.  In case I've done something wrong or
it's not obvious: the first parameter is like the SFUNC in CREATE
AGGREGATE, the second parameter ('') is the INITCOND, and the third
param (s) is what you want to aggregate.

You've now got two type variables in play and hence you'd want some
better support of parametric polymorphism than PG currently makes
easy.  The current AGGREGATE infrastructure seems to get away with it by
bundling this type knowledge into the aggregate itself.
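For comparison, the same fold can be had today with CREATE AGGREGATE and
a named helper function rather than an anonymous one; a sketch (the
names concat_step and concat_agg are mine, not from the thread):

```sql
-- Same fold as the LAMBDA example above, monomorphic at text.
CREATE FUNCTION concat_step(text, text) RETURNS text
  AS $$ SELECT $1 || coalesce($2, '') $$ LANGUAGE sql;

CREATE AGGREGATE concat_agg (text) (
  SFUNC    = concat_step,
  STYPE    = text,
  INITCOND = ''
);

SELECT concat_agg(s) FROM (VALUES ('aa'), ('bb')) x(s);
-- 'aabb' (assuming the VALUES rows are fed in order)
```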

Also, why isn't SQL the default language--plpgsql still needs to be
explicitly added, doesn't it?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] docs for random function

2009-08-16 Thread Sam Mason
On Sun, Aug 16, 2009 at 03:57:23PM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  The docs for the function random() currently say:
random value between 0.0 and 1.0, inclusive
 
 Yeah, it looks like Bruce changed this back in 8.2 and failed to fix
 the documentation to match.  I used this wording instead:
 
 !entryrandom value in the range 0.0 lt;= x lt; 1.0/entry

Cool, that should be easier on people.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] GRANT ON ALL IN schema

2009-08-15 Thread Sam Mason
On Sat, Aug 15, 2009 at 11:34:04PM +0200, Dimitri Fontaine wrote:
 Nitpicking dept, I think I prefer:
 
  DO [ [LANGUAGE] language] $$ ... $$;
  DO plperl $$ ... $$;
  DO language plpython $$ ... $$;
 
 language is optional and defaults to plpgsql.

Yup, sounds nicer.  The less globals the better!

Next all you need is to be able to PREPARE them (and somehow access the
parameters from execute) and you'll have nice local functions. :)

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] GRANT ON ALL IN schema

2009-08-15 Thread Sam Mason
On Sun, Aug 16, 2009 at 02:15:39AM +0300, Peter Eisentraut wrote:
 On 2009-08-15 at 23:31 +0100, Sam Mason wrote:
  Next all you need is to be able to PREPARE them (and somehow access the
  parameters from execute) and you'll have nice local functions. :)
 
 Yeah, rather than just making up some new command for execute this
 string, this could be generalized as lambda expressions that could be
 called whereever an expression is allowed.  E.g.
 
 SELECT LAMBDA $$ ... $$;
[..]
 SELECT (LAMBDA (x int, y text) $$ ... $$) (37, 'foo');

I can't quite tell if you're being serious or not, you realize that this
leaves open the possibility of doing:

  SELECT t.n, f.op, f.fn(t.n)
  FROM generate_series(1,10) t(n), (VALUES 
('id',LAMBDA (_x int) $$ BEGIN RETURN _x;   END; $$),
('*2',LAMBDA (_x int) $$ BEGIN RETURN _x*2; END; $$)) f(op,fn)

And of storing lambda abstractions in tables?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Patch for 8.5, transformationHook

2009-08-11 Thread Sam Mason
On Mon, Aug 10, 2009 at 03:43:45PM -0400, Tom Lane wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
  Tom Lane t...@sss.pgh.pa.us wrote: 
  Kevin Grittner kevin.gritt...@wicourts.gov writes:
  Still, it rates pretty high on my astonishment scale that a
  COALESCE of two untyped NULLs (or for that matter, any two values
  of unknown type) returns a text value.
  
  What would you have it do instead, throw an error?
  
  Return a value of unknown type.
  
 That would require doing actual computation on values of unknown type.

A better way would be to say it's of polymorphic type.  PG's support of
polymorphism is currently a bit ad-hoc, but this would be something I'd
love to change.  It would be quite a big change and I've not thought
through all the details yet.
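The current resolution is easy to observe directly (pg_typeof is
available from 8.4 on; behavior sketched from memory):

```sql
-- COALESCE over two untyped NULLs is resolved to text during parsing:
SELECT pg_typeof(COALESCE(NULL, NULL));   -- text
-- NULLIF over two unknown literals likewise compares them as text:
SELECT NULLIF('0', '00');                 -- '0', not NULL
```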

 In the specific case of COALESCE, we could theoretically do that,
 since the only computation it needs is IS NULL which is
 datatype-independent.

Yes, this would be the only valid operator I can see working.  COUNT
would work as an aggregate.

 In most situations, however, you can't evaluate
 the function without knowledge of the datatype semantics.  As an
 example, consider NULLIF('0', '00').  This gives different answers if
 you suppose the literals are text than if you suppose they are integers.

Yup, which is when it gets fun and I think would mean we'd end up
throwing out a few more queries as ambiguous if I had my way!

As long as there was *one* type in the above expression then it would
be OK, for example it would be unambiguous in either of the following
cases:

  SELECT NULLIF(INT '0', '00');
  SELECT NULLIF('0', INT '00');

and I'd also like the following to be OK:

  SELECT NULLIF('0', '00') + 5;
  SELECT n+5 FROM (SELECT NULLIF('0', '00')) x(n);

But PG currently throws these out as its type resolution (also known
as type unification) is too eager.  The same arguments would obviously
apply to any polymorphic function.  For example, I'd expect to be able
to do:

  SELECT ('{1,2}')[1] + 5;

and have PG figure out that the literal is of type INT[].  Not sure what
ambiguity is being prevented that causes PG to need the brackets, but
that's a side issue.

It also raises the issue of the fact that there's no general way
to ascribe types in PG.  You can cast (using a couple of different
syntaxes) but this isn't the same as type ascription.  For example, I'd
like to be able to do things like:

  SELECT NULLIF('0', '00')::INT + 5;

But I'm doing a cast here, I'm not saying that the NULLIF function
evaluates to a value of type INT which is what I want to be doing.  So
this currently results in 5 being returned and not NULL as I really
want.  The above obviously isn't the syntax to use as it would break
code, but the functionality would be useful.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 10:36:39AM +0200, Pierre Frédéric Caillaud wrote:
 Also, about compressed NTFS : it can give you disk-full errors on read().
 While this may appear stupid, it is in fact very good.

Is this not just because they've broken the semantics of read?

 As a side note, I have also tested lzjb (the ZFS compressor) and lzo is  
 much faster, and compresses much better (sometimes 2x better).

Disks are fast and cheap; a basic IDE disk runs at over 100MB/s now, and
it's doing this in the background while your CPU is doing other stuff.
If you're also decompressing stuff you're serializing even more and
you're doing so with a much power hungrier device (the CPU).

How fast is decompression (as that seems to be your selling point)?  Lzo
claims to run at about a third of main memory bandwidth, which is
nice; however, research projects found this to be far too slow and were
only getting positive results when decompression stayed in secondary
cache.  Basically decompression has to run at several GB/s for it to
have much measurable benefit.

 I made a quick check before implementing it, using python scripts to play  
 with sparse files on ext3 :
 
 - writing a sparse file is a bit (not much) slower than a regular file,
 - reading from a non-fragmented sparse file is as fast as reading a  
 regular file
 - holes do not eat OS disk cache (which is the most interesting point)
 - reading from cache is as fast as reading a regular file (and faster if  
 you don't read the holes because you know they are holes, which is the  
 case here)

Numbers?

 And, also, growing a sparse file by plugging the holes in it WILL allocate  
 blocks all over the place and render IO extremely inefficient.
 You can defrag it, of course (using fs-specific tools or just cpio), but  
 that's not high-availability...

That would not seem too difficult to solve.

 I forgot to talk about SSDs in the previous message. SSDs are quite  
 expensive, but seek really fast.

SSDs are about decreasing latency; if you're putting compression in
there you're pushing latency up as well.  If you don't care about
latency you get traditional rotating media.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 10:33:33AM +0100, Greg Stark wrote:
 2009/8/7 Pierre Frédéric Caillaud li...@peufeu.com:
  Also, about compressed NTFS : it can give you disk-full errors on read().
 
 I suspect it's unavoidable for similar reasons to the problems
 Postgres faces. When you issue a read() you have to find space in the
 filesystem cache to hold the data. Some other data has to be evicted.
 If that data doesn't compress as well as it did previously it could
 take more space and cause the disk to become full.
 
 This also implies that fsync() could generate that error...

If that's indeed how it works it seems like a broken file system that
needs to get its block accounting in order.

When you choose a compression algorithm you know how much space a worst
case compression will take (i.e. lzo takes up to 8% more for a 4kB block
size).  This space should be reserved in case of situations like the
above and the filesystem shouldn't over-commit on this.

Never had to think about this before though so I'm probably missing
something obvious.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 11:49:46AM +0100, Greg Stark wrote:
 On Fri, Aug 7, 2009 at 11:29 AM, Sam Masons...@samason.me.uk wrote:
  When you choose a compression algorithm you know how much space a worst
  case compression will take (i.e. lzo takes up to 8% more for a 4kB block
  size).  This space should be reserved in case of situations like the
  above and the filesystem shouldn't over-commit on this.
 
  Never had to think about this before though so I'm probably missing
  something obvious.
 
 Well most users want compression for the space savings. So running out
 of space sooner than without compression when most of the space is
 actually unused would disappoint them.

Note, that as far as I can tell for a filesystems you only need to keep
enough reserved for the amount of uncompressed dirty buffers you have in
memory.  As space runs out in the filesystem all that happens is that
the amount of (uncompressed?) dirty buffers you can safely have around
decreases.  In practical terms, this says that performance drops off
when there is less free space than the size of the filesystem's cache
and I think you have to reserve exactly one block to handle the base
case.  But there are so many problems associated with completely filling
a filesystem that I'm not sure if this would really matter.

 Also, I'm puzzled why it would the space increase would proportional
 to the amount of data and be more than 300 bytes. There's no reason it
 wouldn't be a small fixed amount. The ideal is you set aside one bit
 -- if the bit is set the rest is compressed and has to save at least
 one bit. If the bit is not set then the rest is uncompressed. Maximum
 bloat is 1-bit. In real systems it's more likely to be a byte or a
 word.

It'll depend on the compression algorithm; lz algorithms are dictionary
based so you'd have a single entry for the incompressible data and then
a pointer to the entry.

In PG's case, it would seem possible to do the compression and then
check whether the resulting size is greater than 4kB.  If it is, you
write the uncompressed data into the 4kB page.  Upon reading you do the
inverse: if it's 4kB then there's no need to decompress.  I believe
TOAST does this already.
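A minimal sketch of that store-raw-on-failure logic; the one-byte flag
and the stand-in "compressor" are mine, purely illustrative (real code
would call lzo or similar and work on fixed-size pages):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RAW        0
#define COMPRESSED 1

/* Stand-in "compressor" used only to exercise the logic below: it
 * reports the length with trailing zero bytes dropped. */
static size_t toy_compress(const unsigned char *src, size_t srclen,
                           unsigned char *dst)
{
    size_t n = srclen;

    while (n > 0 && src[n - 1] == 0)
        n--;
    memcpy(dst, src, n);
    return n;
}

/* Store whichever of compressed/raw is smaller behind a one-byte
 * flag; worst-case bloat over raw storage is that single byte. */
static size_t store_page(const unsigned char *src, size_t srclen,
                         unsigned char *dst)
{
    size_t clen = toy_compress(src, srclen, dst + 1);

    if (clen < srclen)
    {
        dst[0] = COMPRESSED;
        return 1 + clen;
    }
    /* incompressible: overwrite with the raw bytes */
    dst[0] = RAW;
    memcpy(dst + 1, src, srclen);
    return 1 + srclen;
}
```

The reader checks the flag byte and only decompresses when it says
COMPRESSED, mirroring the "if it's a full page, no need to decompress"
idea above.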

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 12:59:57PM +0100, Greg Stark wrote:
 On Fri, Aug 7, 2009 at 12:48 PM, Sam Masons...@samason.me.uk wrote:
  Well most users want compression for the space savings. So running out
  of space sooner than without compression when most of the space is
  actually unused would disappoint them.
 
  Note, that as far as I can tell for a filesystems you only need to keep
  enough reserved for the amount of uncompressed dirty buffers you have in
  memory.  As space runs out in the filesystem all that happens is that
  the amount of (uncompressed?) dirty buffers you can safely have around
  decreases.
 
 And when it drops to zero?

That was why I said you need to have one page left to handle the base
case.  I was treating the inductive case as the interesting common case
and considered the base case of lesser interest.

  In PG's case, it would seem possible to do the compression and then
  check to see if the resulting size is greater than 4kB.  If it is you
  write into the 4kB page size and write uncompressed data.  Upon reading
  you do the inverse, if it's 4kB then no need to decompress.  I believe
  TOAST does this already.
 
 It does, as does gzip and afaik every compression system.

It's still a case that needs to be handled explicitly by the code.  Just
for reference, gzip does not appear to do this when I test it:

  echo -n 'a' | gzip > tmp.gz
  gzip -l --verbose tmp.gz

says the compression ratio is -200% (an empty string results in
an infinite increase in size yet gets displayed as 0% for some
strange reason).  It's only when you hit six 'a's that you start to get
positive ratios.  Note that that this is taking headers into account;
the compressed size is 23 bytes for both 'aaa' and 'aa' but the
uncompressed size obviously changes.

gzip does indeed have a copy method, but it doesn't seem to be being
used.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 03:29:44PM +0200, Pierre Frédéric Caillaud wrote:
 vmstat output :

Sorry, I don't know enough of PGs internals to suggest anything here,
but iostat may give you more details as to what's going on.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 04:17:18PM +0200, Pierre Frédéric Caillaud wrote:
 I'm answering my own question : at the beginning of the run, postgres  
 creates a 800MB temporary file, then it fills the table, then deletes the  
 temp file.
 Is this because I use generate_series to fill the test table ?

Doh, yes.  A function's result is written to a temp location first and
then read back again once the function returns successfully.  You'll
have more luck if you do:

  SELECT now() + '1 sec'::INTERVAL, (1+random()*8),
random()*1000,n+random()*1, n+random()*1000, n
  FROM (
SELECT generate_series( 1, 6000 )) x(n);

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Fixing geometic calculation

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 09:49:41AM -0500, Kenneth Marshall wrote:
 On Fri, Aug 07, 2009 at 09:12:34AM -0500, Kenneth Marshall wrote:
  On Fri, Aug 07, 2009 at 11:29:47PM +1000, Paul Matthews wrote:
   We have two points with a finite separation in the x axis.
   Postgres thinks they are not the same point, nor one left of the
   other, nor to the right. This is clearly a both a physical and
   logical impossibility.

 Actually, quantum theory will allow this to happen. :)

I'm not a physicist, but I don't think it does.  QM defines the
probability distribution within which the particle will be found.  Once
you've actually observed both points you will know their physical
relation--you'll also have given them energy, so next time you look
they'll be somewhere else, but the act of observation causes the above
distribution to collapse.  This sidesteps the whole issue of the
fact that points in PG are defined in euclidean space and do indeed
have a definite location and can be compared at all times---they don't
arbitrarily go jumping off millions of miles away or being annihilated
by their anti-particle just because it's possible.

I would agree with Paul that EPSILON is a hack and probably should be
removed.  However it will cause user visible changes so it's not quite
as simple as that to change.  I don't have anything really very useful
to add apart from saying that maybe the default should be the other way
around?
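To make the hack concrete: the geometry code's comparisons are
epsilon-based, roughly as below (reconstructed from memory of
src/include/utils/geo_decls.h; treat the exact macro text as an
assumption).  Among other problems, this "equality" is not transitive:

```c
#include <assert.h>
#include <math.h>

/* Roughly PG's geometric fuzzy comparison (sketch, from memory). */
#define EPSILON 1.0E-06
#define FPeq(A, B) (fabs((A) - (B)) <= EPSILON)
```

Two points can each be "equal" to a third while being unequal to each
other, which is exactly the sort of logical impossibility Paul hit.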

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Review: Revise parallel pg_restore's scheduling heuristic

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 10:19:20AM -0500, Kevin Grittner wrote:
 Sam Mason s...@samason.me.uk wrote: 
  
  What do people do when testing this?  I think I'd look to something
  like Student's t-test to check for statistical significance.  My
  working would go something like:
  
I assume the variance is the same because it's being tested on the
same machine.
  
samples = 20
stddev  = 144.26
avg1= 4783.13
avg2= 4758.46
t   = 0.54  ((avg1 - avg2) / (stddev * sqrt(2/samples)))
  
  We then have to choose how certain we want to be that they're
  actually different, 90% is a reasonably easy level to hit (i.e. one
  part in ten, with 95% being more commonly quoted).  For 20 samples
  we have 19 degrees of freedom--giving us a cut-off[1] of 1.328. 
  0.54 is obviously well below this allowing us to say that there's no
  statistical significance between the two samples at a 90% level.
  
 Thanks for the link; that looks useful.  To confirm that I understand
 what this has established (or get a bit of help putting in in
 perspective), what this says to me, in the least technical jargon I
 can muster, is With this many samples and this degree of standard
 deviation, the average difference is not large enough to have a 90%
 confidence level that the difference is significant.  In fact,
 looking at the chart, it isn't enough to reach a 75% confidence level
 that the difference is significant.  Significance here would seem to
 mean that at least the given percentage of the time, picking this many
 samples from an infinite set with an average difference that really
 was this big or bigger would generate a value for t this big or
 bigger.
  
 Am I close?

Yes, all that sounds as though you've got it.  Note that running the
test more times will tend to reduce the standard deviation a bit as
well, so it may well become significant.  In this case it's unlikely to
affect it much though.

 I like to be clear, because it's easy to get confused and take the
 above to mean that there's a 90% confidence that there is no actual
 significant difference in performance based on that sampling.  (Given
 Tom's assurance that this version of the patch should have similar
 performance to the last, and the samples from the prior patch went the
 other direction, I'm convinced there is not a significant difference,
 but if I'm going to use the referenced calculations, I want to be
 clear how to interpret the results.)

All we're saying is that we're less than 90% confident that there's
something significant going on.  All the fiddling with standard
deviations and sample sizes is just easiest way (that I know of) that
statistics currently gives us of determining this more formally than a
hand-wavy it looks OK to me.  Science tells us that humans are liable
to say things are OK when they're not, as well as vice versa; statistics
gives us a way to work past these limitations in some common and useful
situations.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Fixing geometic calculation

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 10:29:27AM -0500, Kenneth Marshall wrote:
 On Fri, Aug 07, 2009 at 04:16:56PM +0100, Sam Mason wrote:
  points in PG [..] don't
  arbitrarily go jumping off millions of miles away or being annihilated
  by their anti-particle just because it's possible.

 It was definitely a tongue-in-cheek response since QT is not really
 a topic for this mailing list.

Yup, I know.  Hence my somewhat over the top examples.

  I would agree with Paul that EPSILON is a hack and probably should be
  removed.  However it will cause user visible changes so it's not quite
  as simple as that to change.  I don't have anything really very useful
  to add apart from saying that maybe the default should be the other way
  around?

 However, removing EPSILON completely
 is not a good idea for the exact reason it was included originally.

Hum, I think it's good in some limited situations but not by default.  I
personally think that PG should be exposing rawer access here, mainly
because FP math is hard to get right and the more we fiddle trying to
make it easier to appear to do the right thing in the common case the
more general cases become impossible.  It's similar to the auto TEXT
casting thing that was changed in 8.3, but at least you get a nice error
when things aren't automatically cast to TEXT.

There are also much more reliable ways of solving the inaccuracies than
what's done now by just relying on a simple test; interval arithmetic
is my favorite at the moment but it is slower and can make things more
complicated.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Review: Revise parallel pg_restore's scheduling heuristic

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 10:39:19AM -0500, Kevin Grittner wrote:
 Sam Mason s...@samason.me.uk wrote: 
  Yes, all that sounds as though you've got it.
  
 Thanks.  I read through it carefully a few times, but I was still only
 80% confident that I had it more-or-less right.  ;-)

And which method did you use to determine that you're 80% confident? :)

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Fixing geometic calculation

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 11:40:58AM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  I would agree with Paul that EPSILON is a hack and probably should be
  removed.
 
 It's a hack but it's dealing with an extremely real problem, namely
 the built-in inaccuracy of floating-point arithmetic.  You can't just
 close your eyes to that and hope that everything will be okay.

Yes, I know it's a fiddle to get right.  Choosing the right primitives
is generally the most difficult part.

 A quick look through the geometry sources says that we might not be
 critically dependent on anything except the assumption that two values
 that aren't FPeq() will have a nonzero difference.

Sorry, I'm struggling to parse that.  I think it's all the double
negatives.  Are you saying that HYPOT() should really return zero when
what it's currently giving back would be FPzero?

 (If you think this
 is a tautology, you don't know enough about floating point arithmetic
 to be qualified to offer an opinion here...)

I think I have a reasonable idea about FP arithmetic, I have had to
worry about rounding modes and such like before.  Never tried to write a
FP emulator though so I'm sure I could know more.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Fixing geometic calculation

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 12:50:39PM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  Sorry, I'm struggling to parse that.  I think it's all the double
  negatives.  Are you saying that HYPOT() should really return zero when
  it's currently giving back would be FPzero?
 
 No, I'm worried about code that supposes that it can divide by (x - y)
 after testing that FPeq(x,y) is not true.  point_sl() for instance.

OK, but I'm still not sure what you're getting at.  If it's infinities
and NaNs then they shouldn't matter and will be taken care of by the
normal FP rules anyway.

 We could perhaps fix those specific issues by testing the difference
 explicitly instead of doing it like that.  But there's still the overall
 problem of error accumulation ...

Errors will accumulate whatever happens; that's why things like interval
arithmetic exist that usefully track those errors, and why I said
testing against EPSILON isn't useful.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Fixing geometic calculation

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 02:13:26PM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  On Fri, Aug 07, 2009 at 12:50:39PM -0400, Tom Lane wrote:
  No, I'm worried about code that supposes that it can divide by (x - y)
  after testing that FPeq(x,y) is not true.  point_sl() for instance.
 
  OK, but I'm still not sure what you're getting at.
 
 Underflow.  x!=y does not imply (x-y) != 0, if x and y are sufficiently
 small and close together.  The difference could underflow to zero.

I've just realized why this discussion hasn't been making any sense.
I thought you were talking about correctness of the code with EPSILON
still there and not about what would happen if EPSILON was removed.
Thanks for the patience.

If EPSILON is indeed removed then yes, this will become a problem and
the easiest fix would seem to be to calculate the difference first and
test it explicitly.

The error accumulation comment also makes sense now!  Does anyone
know the original use case for using the EPSILON (need some shorthand
for that, a mail client that supports Unicode?) based comparisons so
liberally?  It only makes sense to me if they're done right at the end
of all the calculations, not all the way through.  What defines the end
seems up to the user as well, or am I missing something?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Fixing geometic calculation

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 07:48:15PM +0100, Greg Stark wrote:
 On Fri, Aug 7, 2009 at 7:13 PM, Tom Lanet...@sss.pgh.pa.us wrote:
  Underflow.  x!=y does not imply (x-y) != 0, if x and y are sufficiently
  small and close together.  The difference could underflow to zero.
 
 Actually I don't think subtraction can underflow with IEEE floats but
 I don't think we want to count on IEEE floats everywhere. Even if we
 did there's the risk on intel that FPeq() gets called on values which
 have just been calculated and are still in registers but then get
 spilled to RAM and lose precision before the division happens.

If it does one subtraction in registers you can be reasonably certain
the other will be too; either way, just doing the subtraction and
explicitly testing if it's zero will do the right thing--the semantics
of C are bad but not that bad.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Review: Revise parallel pg_restore's scheduling heuristic

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 02:08:21PM -0500, Kevin Grittner wrote:
 With the 20 samples from that last round of tests, the answer (rounded
 to the nearest percent) is 60%, so probably noise is a good summary.
 Combined with the 12 samples from earlier comparable runs with the
 prior version of the patch, it goes to a 90% probability that noise
 would generate a difference at least that large, so I think we've
 gotten to almost certainly noise.  :-)
  
 To me, that seems more valuable for this situation than saying we
 haven't reached 90% confidence that it's a real difference.  I used
 the same calculations up through the t-statistic.

The stats people in our group just tend to say that things are
significant or not at a specific level; I've never bothered to find out
why, but I'll ask someone when I get a chance.

 The one question I have left for this technique is why you went with
  
 ((avg1 - avg2) / (stddev * sqrt(2/samples)))
 instead of
 ((avg1 - avg2) / (stddev / sqrt(samples)))

I was just doing a literal translation of what was on the Wikipedia
page:

  http://en.wikipedia.org/wiki/Student's_t-test#Independent_two-sample_t-test

If you really want to find out, there should be much better
implementations in the pl/r language already in PG.  I'd trust R much
more than Wikipedia, but for things like this Wikipedia is reasonable.

 I assume that it's because the baseline was a set of samples rather
 than a fixed mark, but I couldn't pick out a specific justification
 for this in the literature (although I might have just missed it), so
 I'd feel more comfy if you could clarify.

Sorry, that's about my limit!  I've never studied stats, I'm a computer
science person who just happens to be around people who use stats on a
day-to-day basis and think it needs more use in the software world.  I
think you're right: you're aggregating the errors from two (assumed
independent) datasets, hence you want to keep a bit more of the error in
there.  As to the formal justification (and probably proof) I've no real
idea.

 Given the convenience of capturing benchmarking data in a database,
 has anyone tackled implementation of something like the spreadsheet
 TDIST function within PostgreSQL?

Again, pl/r is what you want!

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Review: Revise parallel pg_restore's scheduling heuristic

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 03:18:54PM -0400, Robert Haas wrote:
 On Fri, Aug 7, 2009 at 3:08 PM, Kevin Grittner kevin.gritt...@wicourts.gov 
 wrote:
  With the 20 samples from that last round of tests, the answer (rounded
  to the nearest percent) is 60%, so probably noise is a good summary.
 
 So should we give up on this patch?

That's the joy of stats: it only tells you *very* precisely about the
*exact* thing you've chosen to test.  Interpreting the result is still
awkward, but it does remove one problem!

If you think the tests that've been done cover the use cases that the
new code was designed to help with and you're not showing any
benefit, I'd probably give up and put it down to a learning experience.
Sorry, but I've not been following enough to comment on this much more.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] the case for machine-readable error fields

2009-08-06 Thread Sam Mason
On Wed, Aug 05, 2009 at 08:57:14PM +0200, Pavel Stehule wrote:
 2009/8/5 Tom Lane t...@sss.pgh.pa.us:
  Peter pointed out upthread that the SQL standard already calls out some
  things that should be available in this way --- has anyone studied that
  yet?
 
 yes - it's part of GET DIAGNOSTICS statement
 
 http://savage.net.au/SQL/sql-2003-2.bnf.html#condition%20information%20item%20name

Just out of interest, how is this supposed to be used?  Also, how many
other SQL statements can be run when a transaction has been aborted?  I
would've thought that only COMMIT or ROLLBACK (and their synonyms) make
sense and GET DIAGNOSTICS seems wrong for this purpose.

I (and most code I've seen) normally structure client calls off to the
database as follows:

  db.execute("BEGIN;
              INSERT INTO foo (a,b) VALUES ($1,$2);
              INSERT INTO bar (c,d) VALUES ($3,$4);
              SELECT frub($5,$6);
              COMMIT;", a,b,c,d,e,f);

Where would a call to GET DIAGNOSTICS sensibly go?  Or is it defined
to return information about the last executed transaction?  I can't find
much in the above page or in anything Google gives back about it.

Supporting it is fine from a standards point of view; from the calling
code's correctness point of view it seems much better to send the info
back at a protocol level.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] the case for machine-readable error fields

2009-08-06 Thread Sam Mason
On Thu, Aug 06, 2009 at 11:41:55AM +0200, Pavel Stehule wrote:
 typically in SQL/PSM (stored procedures - look on GET DIAGNOSTICS
 statement in plpgsql doc), maybe in ecpg. Other's environments raise
 exception - so you can get some data from exception or from special
 structures related to environment - php, ruby, .NET etc

Sorry, I should have said that I saw how it was used in stored
procedures.  My interest was in getting the client to do something
interesting; if you've already got the complexity of a stored procedure
it shouldn't be too hard to teach it where the problem is.


One thing I didn't see any comment on was on the fact that I think
CREATE UNIQUE INDEX is really creating a constraint--it's just not
showing up as one.  For the constraint name to be sent back in the case
of an error I think this needs to be changed.

Triggers (and other domain specific code) seem less important here as
they can always fail with whatever error is appropriate.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] the case for machine-readable error fields

2009-08-05 Thread Sam Mason
On Tue, Aug 04, 2009 at 01:12:10PM -0400, Alvaro Herrera wrote:
 First we need several new error message fields: table name, function
 name, constraint name, and so on.  One possible way to go about this
 would be to give each new field its own start letter (see
 http://www.postgresql.org/docs/8.4/static/protocol-error-fields.html);
 say T for table, f for function (F is taken), c for constraint (C
 is taken), and so on.  Another possibility would be to use a single
 letter, say N, and add a subtype to it; so table name would be NT
 followed by the table name, NF for functions, etc.

As pointed out downstream this seems somewhat open-ended and arbitrary;
I would start with just making the constraint name easy to get to--I
assume this doesn't happen already, as I can't see anything obvious.

My rationale is that everything (short of syntax errors and strange
things in the procedural languages) is already associated with a
constraint.

Syntax errors seem awkward to get standardized reporting for: the syntax
keeps changing, meaning that reporting anything more than what we do now
doesn't seem practically useful.  The calling code isn't going to be
able to generate different SQL depending on the error messages we give
back; a human is needed there and can still interpret the text as well
as we've always done.

Constraints failing are a useful thing that calling code can do useful
things with and it makes sense to give this back.  These would seem to
capture everything you mentioned elsewhere except UNIQUE indexes that
weren't created as a constraint.  Maybe this could be fixed by turning
them into a constraint, as they seem like one to me?
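
As a sketch of what calling code could do with just a constraint name
(the constraint names and messages here are invented for illustration):

```python
# hypothetical map from constraint name, as the server would report it,
# to a message the application can show its users
FRIENDLY = {
    "customer_email_key": "That email address is already registered.",
    "order_total_check": "An order's total must be positive.",
}

def explain_violation(constraint_name):
    # fall back to something generic for constraints we don't know about
    return FRIENDLY.get(constraint_name,
                        "A database constraint (%s) was violated." % constraint_name)

print(explain_violation("customer_email_key"))
```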

What are people doing with parsing error messages for column names for
datatype mismatches?  I can't imagine any of my code being able to do
anything sensible in such a case.  If it's things like people giving
dates to the database in an incorrect format then that's what they get
for not doing input validation isn't it?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] the case for machine-readable error fields

2009-08-05 Thread Sam Mason
On Wed, Aug 05, 2009 at 01:29:13PM +0200, Pavel Stehule wrote:
  What are people doing with parsing error messages for column names for
  datatype mismatches? I can't imagine any of my code being able to do
  anything sensible in such a case. If it's things like people giving
  dates to the database in an incorrect format then that's what they get
  for not doing input validation isn't it?
 
 When you have a full set of constraint, then you don't need to
 validate input. Just you will execute statement.

OK, then we mean different things when we say validate input.  I was
just meaning simple things like checking dates are well formed and
that you're not passing things like 'sam's test' into the database
(i.e. that you're actually escaping things correctly).  Constraints
are different from input validation as they rely on state that the
database's client by definition doesn't have (otherwise it would be able
to do the constraint checking just as well as the database).

 When execution is
 correct, then all is ok, when not, then you have to recheck message,
 err code, ... and you have to verify, so some exception is expected or
 not. This is programming based on exceptions. Some better structured
 information helps. And what's more - this should be in conformity with
 ANSI SQL.

Humans can interpret the current error messages just fine; I don't
believe code could do much with better structured information.

It would be possible to have the *Params libpq functions (not sure where
this lives in the underlying protocols) give back errors when its inputs
can't be parsed, but that seems like a different problem.

Describing where problems are in a machine readable format from
arbitrary code seems very fragile.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] the case for machine-readable error fields

2009-08-05 Thread Sam Mason
On Wed, Aug 05, 2009 at 09:30:02AM -0500, Kevin Grittner wrote:
 Anyway, the upshot is -- I think that it would be beneficial to allow,
 to the extent we can confirm it's not a violation of any applicable
 standard, a user-defined SQLSTATE to be associated with a constraint.
 I also think that it would be valuable to provide a mechanism for
 PostgreSQL-specific application code to be able to pick off one or two
 table names related to a standard constraint violation.  I'm less
 convinced at the column or data value level, but I can see where it
 might be useful.

Not sure if overloading SQLSTATE is the right way of doing this, is it?
It already has things like 23514 for a check violation and any other
client code relying in this would break if it started getting different
things back.

-- 
  Sam  http://samason.me.uk/

p.s. I think you were agreeing with everything else I was saying, even
if I didn't explain myself well enough for you to understand me!



Re: [HACKERS] the case for machine-readable error fields

2009-08-05 Thread Sam Mason
On Wed, Aug 05, 2009 at 11:32:06AM -0500, Kevin Grittner wrote:
 Sam Mason s...@samason.me.uk wrote: 
  Not sure if overloading SQLSTATE is the right way of doing this is
  it?  It already has things like 23514 for a check violation and any
  other client code relying in this would break if it started getting
  different things back.
  
 If that's the standard SQLSTATE, I agree -- it suggests a need for
 some user-controllable field which could be set to a value to indicate
 a particular problem.  Does the standard have anything like that, or
 would that be an extension?

Not sure how standard it is, but the docs[1] would suggest that it's
trying to follow something.  Microsoft's MSDN docs on ODBC[2] show a
reasonable similarity, and the first Oracle doc I found[3] was similar as
well.

It just looks like a fixed set of numbers for a fixed set of conditions,
can't find any canonical definition about what it's really for though.

-- 
  Sam  http://samason.me.uk/

 [1] http://www.postgresql.org/docs/current/static/errcodes-appendix.html
 [2] http://msdn.microsoft.com/en-us/library/ms714687(VS.85).aspx
 [3] http://download.oracle.com/docs/cd/B19306_01/appdev.102/a58231/appd.htm

I think I prefer PG's urls!



Re: [HACKERS] the case for machine-readable error fields

2009-08-05 Thread Sam Mason
On Wed, Aug 05, 2009 at 12:41:30PM -0500, Kevin Grittner wrote:
 From the spec:
 
 The character string value returned in an SQLSTATE parameter
 comprises a 2-character class value followed by a 3-character subclass
 value, each with an implementation-defined character set that has a
 one-octet character encoding form and is restricted to digits and
 simple Latin upper case letters. Table 32, *SQLSTATE class and
 subclass values*, specifies the class value for each condition and
 the subclass value or values for each class value.
  
 and:
  
 If a subclass value is not specified for a condition, then either
 subclass '000' or an implementation-defined subclass is returned.

Thanks, I'd not found that specified--it matches up with what I'd found
PG and other databases doing.  Still doesn't really describe the
engineering rationale behind it though.

 From the table, the 23xxx series is for integrity constraint
 violations, but they appear not to have gotten too specific about
 breaking that down; thereby leaving it as an implementation choice:
  
 integrity constraint violation 23 
   (no subclass)  000
   restrict violation 001

Yes; but somewhere along the line we've got exactly the same integrity
constraint violation sqlcodes as DB2 (and Derby, but that's not very
surprising as they're both IBM).  Can't find anybody else trying very
hard though.
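
The class/subclass structure quoted above is mechanical to take apart; a
quick sketch (the condition names are taken from the quoted table and
PG's errcodes appendix):

```python
def split_sqlstate(code):
    # an SQLSTATE is a 2-character class followed by a 3-character subclass
    assert len(code) == 5
    return code[:2], code[2:]

assert split_sqlstate("23000") == ("23", "000")  # integrity constraint violation
assert split_sqlstate("23001") == ("23", "001")  # restrict violation
assert split_sqlstate("23514")[0] == "23"        # PG's check_violation shares the class
```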

 Anyway, it was a bad suggestion that we provide a way to specify a
 SQLSTATE to use for a constraint failure.  I do think that some field
 which could be used for that purpose would be good.  Preferably
 something which could be specified in the declaration of the
 constraint.

I still stand by my assertion that the constraint name is sufficient for
the original purpose.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Review: Revise parallel pg_restore's scheduling heuristic

2009-08-04 Thread Sam Mason
On Mon, Aug 03, 2009 at 10:03:47AM -0500, Kevin Grittner wrote:
 That's about 0.52% slower with the patch.  Because there was over 10%
 variation in the numbers with the patch, I tried leaving out the four
 highest outliers on both, in case it was the result of some other
 activity on the system (even though this machine should have been
 pretty quiet over the weekend) and the difference fell to 0.09%.

What do people do when testing this?  I think I'd look to something like
Student's t-test to check for statistical significance.  My working
would go something like:

  I assume the variance is the same because it's being tested on the
  same machine.

  samples = 20
  stddev  = 144.26
  avg1= 4783.13
  avg2= 4758.46
  t   = 0.54  ((avg1 - avg2) / (stddev * sqrt(2/samples)))

We then have to choose how certain we want to be that they're actually
different; 90% is a reasonably easy level to hit (i.e. one part in ten,
with 95% being more commonly quoted).  For 20 samples we have 19 degrees
of freedom--giving us a cut-off[1] of 1.328.  0.54 is obviously well
below this allowing us to say that there's no statistical significance
between the two samples at a 90% level.
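
For anyone wanting to reproduce the arithmetic, the working above
translates directly (the numbers are the ones quoted in this thread):

```python
from math import sqrt

samples = 20
stddev = 144.26
avg1, avg2 = 4783.13, 4758.46

# equal-sized two-sample t statistic, per the Wikipedia formula referenced above
t = (avg1 - avg2) / (stddev * sqrt(2 / samples))
assert round(t, 2) == 0.54

# one-tailed 90% critical value for 19 degrees of freedom, from the linked table
assert t < 1.328  # so: no significant difference at the 90% level
```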

-- 
  Sam  http://samason.me.uk/

 [1] 
http://en.wikipedia.org/wiki/Student's_t-distribution#Table_of_selected_values



Re: [HACKERS] Review: Revise parallel pg_restore's scheduling heuristic

2009-08-04 Thread Sam Mason
On Tue, Aug 04, 2009 at 10:45:52AM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
t   = 0.54  ((avg1 - avg2) / (stddev * sqrt(2/samples)))
 
  We then have to choose how certain we want to be that they're actually
  different, 90% is a reasonably easy level to hit (i.e. one part in ten,
  with 95% being more commonly quoted).  For 20 samples we have 19 degrees
  of freedom--giving us a cut-off[1] of 1.328.  0.54 is obviously well
  below this allowing us to say that there's no statistical significance
  between the two samples at a 90% level.
 
 Hmm, so what about 95% or 99% confidence?

The cut-off goes up to 1.729 for 95% and to 2.539 for 99%.  These values
are only really valid for 20 samples with the above calculation; the link
I gave above gives a nice table for different values.

I've also realized that I did the standard deviation all wrong.  I
should have calculated them independently and then got the mean:

  stddev1 = 159.9699
  stddev2 = 129.6466
  stddev  = 144.8083  ((stddev1+stddev2) / 2)

Here it makes absolutely no difference, but when they were really
different distributions it would.
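
Redoing the numbers with the corrected standard deviation (note the
usual pooled estimate for equal sample sizes is sqrt((s1**2 + s2**2)/2),
which here comes out at about 145.6 and so also changes nothing):

```python
from math import sqrt

stddev1, stddev2 = 159.9699, 129.6466

# mean of the two independently calculated standard deviations, as above
stddev = (stddev1 + stddev2) / 2
assert abs(stddev - 144.8083) < 1e-4

# the conclusion is unchanged: t stays well below the 90% cut-off of 1.328
t = (4783.13 - 4758.46) / (stddev * sqrt(2 / 20))
assert round(t, 2) == 0.54
```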

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-28 Thread Sam Mason
On Mon, Jul 27, 2009 at 01:53:07PM -0400, Chris Browne wrote:
 s...@samason.me.uk (Sam Mason) writes:
  On Sun, Jul 26, 2009 at 01:42:32PM +0900, KaiGai Kohei wrote:
  Robert Haas wrote:
  In some cases, the clearance of information may be changed. We often
  have some more complex requirements also.
 
  OK, so there is some other trusted entity that has unfettered access to
  both databases and its job is to manage these requirements.
 
 No, that's not what this implies.

 What this implies is along the following lines...
 
  If a user at the more secret level updates some data that had been
  classified at a lower level, then that data gets reclassified at the
  higher level.

I still think it does; but maybe there are other ways of arranging
things.  The problem seems to be that if each user only has write access
to their own level then there is no bound on how far the two databases
will get out of sync with each other.  Some way has to exist of
declassifying data and so bounding the amount of difference between the
two.  This declassification cannot be done by a normal user, as they can
only write at their own level.  This trusted entity has to exist to
punch a hole in the security to do something that wouldn't otherwise be
allowed to happen; information normally only flows up the hierarchy.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-26 Thread Sam Mason
On Sun, Jul 26, 2009 at 01:42:32PM +0900, KaiGai Kohei wrote:
 Robert Haas wrote:
 Sam Mason wrote:
 The traditional approach would be to maintain multiple physically
 separate databases; in this setup it's obvious that when you perform a
 backup of one of these databases you're only seeing a subset of all of
 the objects.  Isn't SE-PG just allowing you to do this within a single
 PG database?
 
 Partly.  There's also a concept called read down, which is
 important.  It allows you to have, say, secret and classified data in
 the same database, and let the secret users see both types but the
 classified users see only the classified stuff, not the secret stuff.
 If you want to store intelligence data about the war in Iraq and
 intelligence data about the war in Afghanistan, it might not be too
 bad to store them in separate databases, though storing them in the
 same database might also make things simpler for users who have access
 to both sets of data.  But if you have higher and lower
 classifications of data it's pretty handy (AIUI) to be able to let the
 higher-secrecy users read the lower-secrecy data - if you used
 separate databases to simulate read-down, you'd have to replicate data
 between them, and also have some manual mechanism for tracking which
 level of secrecy applied to which data.
 
 It seems a correct description.
 
 In addition, we also need to prevent that higher-secrecy users writes
 anything to the lower-secrect objects to prevent information leaks.

OK, so to bulk out this physical analogy we'd have two physical servers:
one that stores higher-secrecy stuff and one for lower-secrecy
stuff.  Users with higher clearance are able to read/write the higher
secrecy database but only read the lower secrecy database.  Users with
lower clearance can only read/write the lower secrecy database, ideally
they aren't even aware of the existence of the higher secrecy one.
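
The read/write rules in that analogy can be sketched as a Bell-LaPadula
style toy model (the level names and numbers below are invented purely
for illustration):

```python
LEVELS = {"lower": 1, "higher": 2}  # invented clearance/secrecy levels

def can_read(subject, obj):
    # "read down": a subject may read objects at or below its own level
    return LEVELS[subject] >= LEVELS[obj]

def can_write(subject, obj):
    # "no write down": writing below your own level could leak information
    return LEVELS[subject] <= LEVELS[obj]

assert can_read("higher", "lower")       # higher clearance reads the lower database
assert not can_read("lower", "higher")   # lower clearance can't even see it
assert not can_write("higher", "lower")  # and higher users can't write downwards
```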

 In some cases, the clearance of information may be changed. We often
 have some more complex requirements also.

OK, so there is some other trusted entity that has unfettered access to
both databases and its job is to manage these requirements.

 Thus, it is necessary a capability to store and manage data objects
 with different security labels in a single database instance here.
 (If we don't want to use commercial solutions instead.)

SE-PG is about doing the above in one database and allowing more
rigorous checks to be done?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-26 Thread Sam Mason
On Sun, Jul 26, 2009 at 12:27:12PM +0900, KaiGai Kohei wrote:
 Indeed, the draft used the term of security context with minimum
 introductions, but not enough friendliness for database folks.
 
 The purpose of security context is an identifier of any subject and
 object to describe them in the security policy. Because the security
 policy is common for operating system, databases, x-window and others,
 any managed database objects needs its security context.
 
 Anyway, I need to introduce them in the security model section.

I'm coming to the conclusion that you really need to link to external
material here; there must be good (and canonical) definitions of these
things elsewhere, and because SE-PG isn't self-contained I really think
you need to link to them.

This will be somewhat of a break from normal PG documentation because so
far everything has been self-contained: PG has chosen its own
interpretation of the SQL standard and it documents that itself.  SE-PG
will be interacting with much more code from outside, and showing which
parts are PG specific vs. which parts are common to all of SELinux seems
important.

If you try to document *everything* you're going to be writing for years
and give the impression that everything is implemented in SE-PG.  A
dividing line needs to be drawn between what is PG specific and what is
SELinux (why not SEL?).

 For the security policy, I introduce it at the security model section:
 
 | Access control is conceptually to decide a set of allowed (or denied)
 | actions between a certain subject (such as a database client) and an
 | object (such as a table), and to apply the decision on user's requests.
 | At the database privilege system, ACL stored in database objects itself
 | holds a list of allowed actions to certain database roles, and it is
 | applied on the user's request.
 | SELinux also holds massive sets of allowed actions between a certain
 | subject and a certain object, we call them security policy.
 
 Is it obscure?

I find that somewhat opaque as well, sorry!

 At this point, the SELinux user's guide in Fedora is the most comprehensive
 documentation. It is described from the viewpoint of SELinux users, not
 experts or developers.
 
   http://docs.fedoraproject.org/selinux-user-guide/

OK, thanks!

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] When is a record NULL?

2009-07-26 Thread Sam Mason
On Sun, Jul 26, 2009 at 12:49:32PM -0500, Kevin Grittner wrote:
 Codd, E.F. (1990). The Relational Model for Database Management
 (Version 2 ed.). Addison Wesley Publishing Company.
 ISBN 0-201-14192-2.

Looks as though I've got some reading to do then--somewhat annoying that
only second-hand copies are available from the US, but never mind!

 I believe that he puts forward a list of about 200 things he feels
 should be true of a database in order for him to consider it a
 relational database.  Since he was first and foremost a mathematician,
 and was something of a perfectionist, I don't think some of these are
 achievable (at least in the foreseeable future) without tanking
 performance, but it makes for an interesting read.  I find most of it
 to be on target, and it gives a unique chance to see things from the
 perspective of the inventor of relational model for database
 management.

Yup, I've heard lots and read a few smaller articles but don't think
I've got around to any of his books.

 I don't, of course, agree with him on everything.  If you think that
 the SQL standard date handling is weird, wait until you see how a
 perfectionist mathematician attempts to deal with it.  :-)  Also, I
 find the requirement that, to be considered a relational database, it
 must be impossible to write two queries which can be shown to be
 logically equivalent but which optimize to different access plans to
 be, well, a bit ivory tower.

Sounds as though he's using a different definition than what I would
use, but I'm sure I'll find out.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] When is a record NULL?

2009-07-26 Thread Sam Mason
On Sun, Jul 26, 2009 at 03:46:19PM -0700, David E. Wheeler wrote:
 And something I'd like to be able to handle in a while loop, as I'm  
 actually fetching one row at a time from two cursors and need to be  
 able to tell when I've reached the end of a cursor.

I'm sure I'm missing something obvious, but why doesn't the FOUND
magic variable tell you what you want?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-25 Thread Sam Mason
On Sat, Jul 25, 2009 at 10:43:05AM +0900, KaiGai Kohei wrote:
 Sam Mason wrote:
 This would seem to imply that all user defined trusted code has to
 perform its own permission checks.  How is MAC any different from DAC in
 the presence of code such as:
 
 CREATE OR REPLACE FUNCTION show_customers () RETURNS SETOF RECORD
 LANGUAGE 'sql'
 SECURITY_LABEL = 'system_u:object_r:sepgsql_trusted_proc_exec_t:s0'
   AS 'SELECT * FROM customer';
 
 In this case, confined users cannot create a function labeled as
 'system_u:object_r:sepgsql_trusted_proc_exec_t:s0', because it is
 controlled by db_procedure:{create} permission.

Yes, that seems reasonable.  The fact that you're still talking about
confined users is slightly worrying and would seem to imply that
there is still a superuser/normal user divide--it's probably just a
terminology thing though.

One thing I know I don't understand is what the security labels actually
mean; I've had a couple of searches through your pages now and can't see
anything described nor pointers to external documentation.

 Confined user can create a function with user_sepgsql_proc_exec_t
 (which is the default one for confined users), but it is not a trusted
 procedure, so the SELECT * FROM customer is executed with confined
 user's privileges as is, then it will be failed due to the lack of
 permission on the customer.credit.

So an unconfined user (whatever that means??) is basically working
with DACs then?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-25 Thread Sam Mason
On Sat, Jul 25, 2009 at 09:50:08PM +0900, KaiGai Kohei wrote:
 Sorry for using the undefined terminology.

I think this is the largest missing part of the docs at the moment;
there is a whole new world of definitions that need to be understood
before the SE-PG stuff is understandable/usable by anyone and none of it
is explained in a way I can understand.  External links are fine at the
moment (I think) but descriptions will need to exist.

For example, you currently define a security context as a formatted
short string--how does that tell me why I would want one or what it
does?  As an example, PG currently has the following to describe what a
role is:

  http://www.postgresql.org/docs/current/static/database-roles.html

I'd expect a similar definition for each of the major terms in SE-PG;
at the moment these seem to be security label, security context,
security policy and others?  What do others think?

Hope that helps explain my confusion!  If you point me at some docs I'll
be happy to write/edit things to make them more relevant to PG.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-25 Thread Sam Mason
On Sat, Jul 25, 2009 at 11:06:37AM -0400, Tom Lane wrote:
 There had better still be superusers.  Or do you want the correctness
 of your backups to depend on whether your SELinux policy is correct?

I thought the whole point of MAC was that superusers don't exist any
more--at least not with the power they currently do.  Organizations may
well not trust specific parts of their database to certain types of
backups, SE-PG should allow this to be controlled somewhat.

 The first time somebody loses critical data because SELinux suppressed
 it from their pg_dump output, they're going to be on the warpath.

That should be solved by different methods; as A.M said pg_dump can
complain if it doesn't see everything it expected to (which should
handle the naive user case) and backdoors can be put in the scheme
that will (by default?) initially allow a backup subject unfettered
read-only access to each object.  I'm expecting that this access can be
revoked as needed from sensitive tables.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-25 Thread Sam Mason
On Sat, Jul 25, 2009 at 04:39:29PM -0400, Robert Haas wrote:
 On Sat, Jul 25, 2009 at 4:27 PM, Sam Mason s...@samason.me.uk wrote:
  I thought the whole point of MAC was that superusers don't exist any
  more--at least not with the power they currently do.
 
 It's been billed that way, but it's not really accurate.  A more
 accurate statement would be that it's possible to create a system in
 which there is no unconfined role.

Yes, that sounds more precise!

I'm still unsure of terminology; what's an unconfined role?  I guess
the layman's description is similar to a superuser, but I'm sure
there's a more refined definition somewhere.  Hum, I've just found
Fedora's guide, is the following considered a reasonable picture:

  
http://docs.fedoraproject.org/selinux-user-guide/f10/en-US/chap-Security-Enhanced_Linux-Targeted_Policy.html

  Organizations may
  well not trust specific parts of their database to certain types of
  backups, SE-PG should allow this to be controlled somewhat.
 
 I imagine it would be possible to run pg_dump on a database where you
 couldn't see all of the objects, and get a dump of just those, but
 that's only tangentially related to whether such things as superusers
 exist.

I'm not sure what point you're trying to make; in my understanding
superusers can see and do anything--hence they can make a backup.

 If superusers DON'T exist, that would be making the opposite
 statement, namely, that there isn't ANY WAY to get a backup that you
 can be sure DOES contain all of the objects.

The traditional approach would be to maintain multiple physically
separate databases; in this setup it's obvious that when you perform a
backup of one of these databases you're only seeing a subset of all of
the objects.  Isn't SE-PG just allowing you to do this within a single
PG database?

 And while I believe
 SE-Linux/SE-PostgreSQL would allow you to configure such a system, you
 might want to think carefully before you decide to do so, and the
 system certainly shouldn't (and can't) force you to set it up that
 way.

I agree that this would seem to make the resulting system easier to
manage, however I can also imagine scenarios where the converse would
be true.  This is a fuzzy engineering decision of the sort that I don't
like making without a use case---and it would be nice to have several
here.

  pg_dump can complain if it doesn't see everything it expected to
 
 If pg_dump can tell that there is information missing, the system
 hasn't done a very good job of hiding its existence, which is surely
 the whole point here.

Hum, good point--scratch that idea then!

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] When is a record NULL?

2009-07-25 Thread Sam Mason
On Sat, Jul 25, 2009 at 11:42:04AM -0500, Kevin Grittner wrote:
 Sam Mason s...@samason.me.uk wrote: 
  In this dichotomy a NULL is most definitely a value and with my
  current experience I don't understand the distinction you're trying
  to draw.
  
 There can be a place where a value *could* go which does not contain a
 value.  Codd considered it crucial, from a mathematical correctness
 point of view, that the absence of a value not be indicated by some
 special magic value, but rather by some other technique which
 indicates that there *is* no value there.  In SQL this is done with
 NULL.  Based on reading his books, it seems to me that Codd always
 seemed uncomfortable with this, since it made it appear to be some
 special value, which he was adamant that it is *not*.

Have you used more academic languages like ML or Haskell?  Their
option and Maybe types, respectively, provide the nicest practical
treatment I've seen of this.

 It seems he
 would have preferred a relational language use a term like FLAGGED AS
 MISSING rather than IS NULL.  It also would have allowed the
 flexibility to differentiate various types of missing values, such as
 FLAGGED AS UNKNOWN or FLAGGED AS NOT APPLICABLE.

Those sorts of lists get domain specific *very* quickly.  I think
languages are better off providing a type system of enough complexity to
express maybe types and letting users invent whatever is most useful for
the job at hand.

 I do think
 that set logic in relational data involves some slightly different
 twists on things than most language have.

You must live in a very different world from me then! :)

 I tend, for better or
 worse, to come down in agreement with the positions Codd espoused on
 most of these things.

I've not read much of his writings, any canonical references for this
sort of discussion?

  [PG] ... internally knows there is a distinction
  between the two but it doesn't like to expose this.
  
 Well, to some extent I think it's a tough problem, since the set logic
 of a relational database is implemented in C, which doesn't have the
 same concepts.  There's got to be a little sleight of hand in there
 somewhere.

That's a pretty bad excuse; everything ends up as machine code
in the end.  Many languages expose very abstract and consistent
views of things, some of them a lot more awkward than that of a
relational database.  PG is admittedly hampered by a desire to follow a
particularly innovative standard and correctness for PG commonly means
keeping data safe.  Being logically consistent normally takes a minor
role, although there is quite a lot of overlap between the two.

  If your model is correct then when the IS DISTINCT FROM operator
  works on RECORDs the following should return FALSE for all of the
  following:
  
SELECT NULL  IS DISTINCT FROM ROW(NULL);
SELECT NULL  IS DISTINCT FROM ROW(NULL,NULL);
SELECT NULL  IS DISTINCT FROM ROW(NULL,ROW(NULL,NULL));
SELECT ROW(NULL) IS DISTINCT FROM ROW(NULL,ROW(NULL,NULL));
  
  i.e. there is *no* difference between a NULL record and a record
  consisting entirely of NULLs.
  
 Well, on that I would go with whatever the SQL standard says, and hope
 it's not too ambiguous.  (I haven't tried to sort though this one in
 the standard, so far.)

I've had a reasonable browse around an old copy of SQL'08 I found, but
couldn't find much helpful.  The definition of distinct (3.1.6.8 in my
copy) seems particularly vacuous; it seems to leave it down to how
rows are constructed and I can't find many details of that.

 I was going into the theory both because it is
 the basis for some of the seemingly odd aspects of SQL, and because at
 least half the time I see someone put the word NULL immediately in
 front of the word VALUE, they are wandering into confusion on these
 issues.  (I will admit that using such technically incorrect language
 is sometimes hard to avoid without sounding stilted, even if all
 parties to the conversation know that NULL is *not* a value.)

I think that depends on what definition of value you're using.  If
you're considering it to be a value like a NULL pointer then I'd agree
as this is just a convention to treat pointers with a value of zero
specially.  If you treat values as members of a set with the set defined
as their type and a sub-type relation existing between types then a NULL
value is the only member of an unnamed type (in SQL) that's the subtype
of all other types.  There are other ways of formalizing this, and I've
probably explained it badly here, but it's a rough sketch of how I think
about it.

 I know that Codd was insistent that any relation (which included the
 result of any query) which could contain duplicate rows should be
 called a corrupted relation.  (In fact, in one of his books I think
 he averaged a comment on this point about once every two pages.)  So I
 shudder to think what his reaction would be to a relation with a row
 which contained no values.  I have a really

Re: [HACKERS] When is a record NULL?

2009-07-24 Thread Sam Mason
On Fri, Jul 24, 2009 at 09:31:13AM -0400, Tom Lane wrote:
 Brendan Jurd dire...@gmail.com writes:
  So it is wrong to talk about ROW(NULL, NULL) being NULL.  It doesn't
  have the property of being NULL or not NULL, because it is a composite
  value.  ROW(NULL, NULL) IS NULL returns true, but that is not the
  same as saying that it actually is NULL, because of the different
  semantics above.
 
 It's worse than that, because there actually is also such a thing as
 the row value being NULL --- ie, there's no row structure at all.
 At least internally, that's a completely different thing from having
 a row all of whose fields are null.

Hope nobody minds if I go off on a somewhat pedagogic bent here!

Not quite sure what you mean by the above; but I think maybe something
like:

  SELECT NULL::RECORD AS r;

PG may well treat this internally as a special case, but from a type
level I don't see any difference between the above and, say:

  SELECT ROW(1)::RECORD AS r;

In both cases we get a result that has exactly one column and this
column is of type RECORD (or ROW depending how you spell things).
The fact that it happens to be a NULL *value* in one case shouldn't
affect things at the level of *types*--unless PG has suddenly become
dependently-typed which I don't believe it wants to be.

I'm also aware that PG's handling of types with ROW values is somewhat
inconsistent when compared to other values, for example:

  SELECT (r).a
  FROM (
SELECT a,b
FROM (VALUES
  (1,2),
  (2,3)) x(a,b)) r;

here, we can look inside the RECORD named by r and pull out the value
associated with attribute a, but inside:

  SELECT (r).a
  FROM (VALUES
(ROW(1,2)),
(ROW(2,3))) x(r);

we get a message saying that the record type has not been registered
when I'd expect to get an error saying that it doesn't know which
attribute a is.  We also fail to get an error in the following case:

SELECT r
FROM (VALUES
  (ROW(1,2)),
  (ROW('a','b'))) x(r);

which (to me) seems wrong.  The (allegedly) static types in PG appear
dynamic when it comes to RECORDs.


I'd be interested in fixing this behavior but every time I've started
trying to fix this I've tended to get horribly lost in the code.

 SQL doesn't provide a test for this case that's separate from the test
 involving null-ness of individual fields.  Not much we can do about
 it though.  I'm not entirely sure that exposing the distinction would
 be helpful anyway ...

I think it would; I tend to write the following and have just realized
that it doesn't do what I thought it did:

  SELECT a.*
  FROM tbla a
LEFT JOIN tblb b ON a.id = b.id
  WHERE b IS NULL;

The intuition being that the row valued b would only be considered to
be NULL (i.e. IS NULL returned TRUE) when the LEFT JOIN failed.  The
standard way to write this is of course to write WHERE b.id IS NULL,
but I'm unsure why it's necessary to look inside the record b to get
out attribute id to see if it's NULL when it should just be possible
to look at b directly.
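
For what it's worth, here's a sketch (hypothetical two-column tables,
untested) of the case that caught me out: joining with IS NOT DISTINCT
FROM lets a genuine, matched row in tblb consist entirely of NULLs, and
then the IS NULL tests can't tell it apart from a failed join:

```sql
CREATE TABLE tbla (id integer);
CREATE TABLE tblb (id integer);
INSERT INTO tbla VALUES (1), (NULL);
INSERT INTO tblb VALUES (NULL);

-- The a.id IS NULL row really does find a partner in tblb, but since
-- that partner's only column is NULL, "b IS NULL" is TRUE anyway and
-- the matched row slips through the anti-join filter.
SELECT a.*
FROM tbla a
  LEFT JOIN tblb b ON a.id IS NOT DISTINCT FROM b.id
WHERE b IS NULL;
-- "WHERE b.id IS NULL" has the same problem here; only a separate
-- "did the join fail?" test would distinguish the two cases.
```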

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] When is a record NULL?

2009-07-24 Thread Sam Mason
On Fri, Jul 24, 2009 at 12:30:39PM -0500, Kevin Grittner wrote:
 Sam Mason s...@samason.me.uk wrote:
  The fact that it happens to be a NULL *value* in one case
  
 Well, according to Codd (and I tend to go with him on this) there is
 no such thing.  NULL is a way to flag a place where a value could be
 stored, but is not -- because is unknown or is not applicable in that
 context.  (He seemed to feel it was a big weakness of SQL that it
 didn't differentiate between these two conditions, but that's another
 argument.)  NULL value is an oxymoron.

I think then maybe we're talking about different things; I was trying
to draw attention to the distinction between types and values---types
allow some invariants of the code to be automatically checked before it
is run, a value only has meaning at run time with the set of possible
values an expression is defined over being constrained by its type.

In this dichotomy a NULL is most definitely a value and with my current
experience I don't understand the distinction you're trying to draw.

  SQL doesn't provide a test for this case that's separate from the
  test involving null-ness of individual fields.  Not much we can do
  about it though.  I'm not entirely sure that exposing the
  distinction would be helpful anyway ...
  
  I think it would
  
 The distinction between not having a tuple and having a tuple for
 which you don't know any applicable values seems thin.  I'm not sure
 what that would really mean.

Other languages/type systems do define this precisely.  For example,
in object orientated languages there's a big difference between a
reference to an object being NULL and some member of an object being
NULL.  Databases obviously have their own semantics, but the distinction
is well defined.

Any implementation that tries to be faithful to a standard has its hands
somewhat tied and PG is no exception.  PG currently seems to be some
hybrid halfway between: it internally knows there is a distinction
between the two but doesn't like to expose it.  For example (and
this appears particularly awkward because of annoying limitations in the
syntax PG accepts):

  SELECT y
  FROM (SELECT 1) x(a)
LEFT JOIN (SELECT 1,2) y(a,b) ON FALSE;

I think it should be valid to express this as:
  SELECT (SELECT 1,2 WHERE FALSE);
but PG doesn't like sub-queries returning two columns--but this is
material for another discussion.

This returns a single row whose only attribute is NULL (i.e. it's
rendered as '') and not as a record whose attributes are all NULL (i.e.
rendered as '(,)').  The fact that PG does the former suggests that
your mental model isn't congruent with PG's behavior.

If your model is correct then when the IS DISTINCT FROM operator works
on RECORDs the following should return FALSE for all of the following:

  SELECT NULL  IS DISTINCT FROM ROW(NULL);
  SELECT NULL  IS DISTINCT FROM ROW(NULL,NULL);
  SELECT NULL  IS DISTINCT FROM ROW(NULL,ROW(NULL,NULL));
  SELECT ROW(NULL) IS DISTINCT FROM ROW(NULL,ROW(NULL,NULL));

i.e. there is *no* difference between a NULL record and a record
consisting entirely of NULLs.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-24 Thread Sam Mason
On Sat, Jul 25, 2009 at 07:23:22AM +0900, KaiGai Kohei wrote:
 Thanks, but I found an incorrect change at the trusted procedure section.
 
 Old)
   CREATE TABLE customer (
   cid integer primary key,
   cname   varchar(32),
   credit  varchar(32)
 - SECURITY_LABEL = 'system_u:object_r:sepgsql_secret_table_t:s0'
 - );
 
 New)
   CREATE TABLE customer (
   cid integer primary key,
   cname   varchar(32),
   credit  varchar(32)
 + ) SECURITY_LABEL = 'system_u:object_r:sepgsql_secret_table_t:s0';
 
 This example intends to assign secret label on the credit column,
 not whole of the table. Note that the default security context shall
 be assigned on the table and rest of columns in this case.

The show_credit() function in this section would seem to leak authority
as well; it seems possible to determine whether customers exist that
would otherwise be hidden.  For example, imagine we have a row in the
customer table with cid=1 whose security label would normally prevent
us from seeing it.  We can perform:

  SELECT show_credit(1);

and, as far as I can tell, this call would succeed.


-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] SE-PostgreSQL Specifications

2009-07-24 Thread Sam Mason
On Sat, Jul 25, 2009 at 09:16:47AM +0900, KaiGai Kohei wrote:
 Sam Mason wrote:
 The show_credit() function in this section would seem to leak authority
 as well; it seems possible to determine whether customers exist that
 would otherwise be hidden.  For example, imagine we have a row in the
 customer table with cid=1 whose security label would normally prevent
 us from seeing it.  We can perform:
 
   SELECT show_credit(1);
 
 and, as far as I can tell, this call would succeed.
 
 This example shows that confined client cannot read credit card number
 without using trusted procedure, but trusted procedure returns masked one.
 It does not intend to hide existence of entries within customer table.

This would seem to imply that all user defined trusted code has to
perform its own permission checks.  How is MAC any different from DAC in
the presence of code such as:

CREATE OR REPLACE FUNCTION show_customers () RETURNS SETOF RECORD
LANGUAGE 'sql'
SECURITY_LABEL = 'system_u:object_r:sepgsql_trusted_proc_exec_t:s0'
  AS 'SELECT * FROM customer';

(I hope I've modified the example correctly!)

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Using a C++ library in PostgreSQL

2009-07-19 Thread Sam Mason
On Sun, Jul 19, 2009 at 10:13:35PM +0800, Xin Wang wrote:
 My research project is based on PostgreSQL. I want to get the
 Post-Schema-Validation-Infoset (PSVI) of an XML document being
 stored. However, the current libxml2 library does not support PSVI at
 all. So I have to use the Xerces C++ library in PostgreSQL.

I've never had to do anything like this before, so these will just be
general recommendations that I believe I've seen made before on this
list.

 But I wonder if it is possible to call a C++ API in a C code base.

In most common cases, yes this is perfectly possible.  Just make sure
that you catch all exceptions before they can propagate into the C
code, and declare every function that has to be callable from plain C
as extern "C".

I have a feeling that PG does things that interact badly with the way
C++ handles exceptions.  As far as I can tell[1], this only applies when
you're calling back into Postgres' C code from C++.  There are various
other annoyances (i.e. the sort of thing you always get when trying
to get code talking together that wasn't expecting to be) to do with
Postgres, like things in the headers using names that are reserved in
C++ and not in C.

 Could anyone here give me some guidelines to do it? Can the recently
 discussed Mostly Harmless C++ patch be helpful to what I want to do?

Isolate all your C++ code away in one part of your code and have some C
code in the middle to do the translations between the two worlds.  As
far as PG should be concerned it will just be calling C code.

If you can't get that to work and performance doesn't matter too much;
put your C++ code into a separate program and have the C code run said
program when needed.

-- 
  Sam  http://samason.me.uk/

 [1] http://wiki.postgresql.org/wiki/C++_Compatibility



Re: [HACKERS] Predicate migration on complex self joins

2009-07-14 Thread Sam Mason
On Mon, Jul 13, 2009 at 07:06:40PM +0100, Simon Riggs wrote:
 On Mon, 2009-07-13 at 13:33 -0400, Tom Lane wrote:
  It's hard to believe
  this sort of case comes up often enough to justify the cycles that would
  be expended (on *every* join query) to try to recognize it.
 
 Yeh, damn ORMs seem to spring up faster than vines.
 
 Not just because of this but I wonder if we might benefit from an
 optimizer setting specifically aimed at the foolishnesses of
 automatically generated SQL.

The best suggestion I heard was to carry on optimizing until the plan
looked cheap enough or all the options had been exhausted.

In practical terms; I think this means doing the planning in two stages,
try with all the simple optimizations and see if it results in less
than n page accesses, if it's above this level then try again with all
the optimizations turned on.
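
Roughly, as a sketch (the cost numbers and function names are invented
for illustration and bear no relation to the real planner):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    cost: float

def optimize(query, level):
    # Toy cost model: the "aggressive" pass knows tricks (say, self-join
    # elimination) the cheap pass doesn't, so it finds cheaper plans at
    # the price of extra planning cycles.
    njoins = len(query.split("JOIN"))
    return Plan(cost=njoins * (10 if level == "aggressive" else 1000))

def plan_query(query, threshold=1500):
    plan = optimize(query, level="simple")       # stage 1: cheap pass only
    if plan.cost <= threshold:
        return plan                              # already cheap enough
    return optimize(query, level="aggressive")   # stage 2: full effort
```

Hand-written queries mostly stop at stage 1; the convoluted ORM output
pays for the second pass out of the execution time it saves.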

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] New types for transparent encryption

2009-07-13 Thread Sam Mason
On Mon, Jul 13, 2009 at 01:22:30PM +0900, Itagaki Takahiro wrote:
 Sam Mason s...@samason.me.uk wrote:
  As others have said, handling encryption client side would seem to offer
  many more benefits (transparently within libpq offering easy adoption).
 
 Libpq is not the only driver. Do we need to develop transparent decryption
 for each drivers? (libpq, JDBC, npgsql, py-postgresql, ...)

Yes, by definition if it's client side then it can do whatever is
most appropriate for that specific client.  Depending on the language a
given library is written in, this may be more or less work.  For
example, adding transparent encryption appears to be somewhat trivial to
something like haskelldb[1] because it has enough information exposed to
it already to do the right thing.  Doing the right thing in languages
that don't have this level of detail already exposed, i.e. libpq and
most other libraries, is going to be different/harder.  Note that I
wasn't thinking of haskelldb when I proposed the idea, I just remembered
it when trying to think of a good example.

 Also, if you disallow server-side decode, you cannot create indexes on
 encrypted values because the same values are not always encrypted to the
 same codes. (Indexes will sort keys based on order of decoded values.)

Yes, that's the point.  Client side encryption is about protecting the
plaintext from an attacker who has full control of the server.
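
As a toy illustration of the index point (not real cryptography: the
keystream construction below is made up for the example and only
handles plaintexts up to 32 bytes), a deterministic scheme lets the
server index and equality-match ciphertexts, while a randomized one
makes every encryption of the same value distinct:

```python
import hashlib
import hmac
import secrets

KEY = b"client-side key; the server never sees this"

def deterministic(plaintext: bytes) -> bytes:
    # Same input -> same output: the server can build an index over the
    # ciphertexts and answer equality lookups, at the price of revealing
    # which rows hold equal values.
    return hmac.new(KEY, plaintext, hashlib.sha256).digest()

def randomized(plaintext: bytes) -> bytes:
    # A fresh nonce makes every encryption of the same value distinct,
    # so a server-side index over these values is useless for lookups.
    nonce = secrets.token_bytes(16)
    keystream = hashlib.sha256(KEY + nonce).digest()
    return nonce + bytes(a ^ b for a, b in zip(plaintext, keystream))
```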

This may or may not be the problem you're trying to solve, but if the
attacker doesn't have control over the server then encryption doesn't
seem necessary--just use normal permissions and/or views to ensure your
security properties.

 I think there is no difference between client-side decryption and
 client-supplied password as far as client-server communication is
 encrypted (i.e, SSL login).

As tomas said, there's a massive difference here.  In one, the server is
part of your trusted computing base, in another it doesn't have to be.

  Should the password be this widely shared? it would seem to make more
  sense if it was a write-only variable and never exposed outside the
  crypto module.
 
 We can use an user-defined GUC variables as a write-only variable.
 When we supply some show_hook function to GUC variable,
 SET still works but SHOW only shows '' and hides real passwords.

OK, but I still don't understand what threat model you're trying to
protect against.  Which bits of the system are trusted and which bits
are assumed malicious?  (Trusted is a technical term from the
security world meaning that if a trusted component goes wrong the
security of the whole system fails--for example, you trust your doctor
to respect your privacy and not to kill you.  Beware that PG uses the
term differently for its embedded languages.)

It appears the main reason you've proposed this is to make it easier
for new users to use crypto with their PG database---this seems to
be the wrong way around.  In my experience, security is hard because
determining what's trusted and what's not is difficult.  Once you've
determined this it's reasonably easy to work the details out and get it
implemented.

-- 
  Sam  http://samason.me.uk/

 [1] http://haskelldb.sourceforge.net/haskelldb.pdf



Re: [HACKERS] New types for transparent encryption

2009-07-12 Thread Sam Mason
On Tue, Jul 07, 2009 at 05:35:28PM +0900, Itagaki Takahiro wrote:
 Our manual says we can use pgcrypto functions or encrypted filesystems
 for data encryption.
 http://www.postgresql.org/docs/8.4/static/encryption-options.html
 
 However, they are not always the best approaches in some cases.
 
 For pgcrypto functions, user's SQL must contain keyword strings
 and they need to consider which column is encrypted. Users complain
 that they want to treat encrypted values as if not-encrypted.

As others have said, handling encryption client side would seem to offer
many more benefits (transparently within libpq offering easy adoption).

 password() and options() are SQL functions and we can re-define them
 if needed. The default implementations are to refer custom GUC variables
 (pgcrypto.password and pgcrypto.options) so that encryption are done
 only in database server and applications don't have to know the details.

Should the password be this widely shared? it would seem to make more
sense if it was a write-only variable and never exposed outside the
crypto module.  You wouldn't even need to be a super-user to collect all
the passwords otherwise, just create a function that has the name of
something common and have it stash the password away somewhere.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Patch: AdjustIntervalForTypmod shouldn't discard high-order data

2009-06-01 Thread Sam Mason
On Sun, May 31, 2009 at 06:32:53PM -0400, Tom Lane wrote:
 regression=# select '999'::interval second;
 The correct interpretation of the input value is certainly 999 seconds.

Agreed; silent truncation like this is confusing and will lead to
unnecessary bugs in users' code.

 the attached patch we would render as
 regression=# select '1 day 1 hour'::interval hour;
  1 day 01:00:00
 
 There is some case to be made that we should throw error here,
 which we could do by putting error tests where the attached patch
 has comments suggesting an error test.

With things as they are I think it would be useful to throw an error
here; if the user means 25 hours they should say 25 hours!

 However I'm inclined to think
 that doing that would expose an implementation dependency rather more
 than we should.  It is usually not clear to novices that '1 day 1 hour'
 is different from '25 hours', and it would be even less clear why the
 latter would be acceptable input for an INTERVAL HOUR field when the
 former isn't.  So I'm proposing the patch as-is rather than with the
 extra error tests, but am open to being convinced otherwise.

It would only be different when the interval is used with values of type
timestamptz, or am I missing something?  How much sense does it make to
have a timezone-aware interval where this distinction holds and leave
the current interval timezone-naive?  Not sure if that would help to
clean up the semantics at all or if it's just adding more unnecessary
complexity.  I have a feeling it's probably the latter, but thought it
may help things.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] search_path improvements

2009-06-01 Thread Sam Mason
On Mon, Jun 01, 2009 at 08:05:33PM +0100, Greg Stark wrote:
 As I said earlier I doubt pop or delete is ever going to actually
 be what you want. I suspect you're far more likely to want to restore
 it to what it was before you started altering it.
 
 As support I'll point out this is what our C api has. There's no short
 cut to strip out a single element of the path but the normal calling
 pattern is to set aside a copy of the old path, add modify it in some
 way -- often adding a schema to the head -- then restore the old path.

Without reading much of what's been said here (I've read maybe ten of
the posts in this thread) I'll say it sounds a lot like lexical closures
are needed.  Code is free to define and use generally use whatever is
in their closure, but can't affect what's outside it unless explicitly
granted.

I saw these mentioned in another post by David Wheeler[1] but my client
says it wasn't directly responded to.  He calls it lexical scoping
but I think closing over the environment seems more suitable---mainly
because it'll go wrong less often in the presence of functions defined
as security definer.

-- 
  Sam  http://samason.me.uk/

 [1] 
http://archives.postgresql.org/message-id/5a1fe6b1-9857-454c-a385-ba061ded3...@kineticode.com



Re: [HACKERS] Python 3.0 does not work with PL/Python

2009-05-29 Thread Sam Mason
On Thu, May 28, 2009 at 09:06:14PM -0400, Andrew Dunstan wrote:
 Does Python 3 have some sort of usable sandbox that would mean we could 
 have a trusted plpython?

Not sure if people are aware of object-capability based approaches to
security.  A guy called Tav has come up with some code that constrains
python (i.e. you could build a sandbox out of it) and punch holes in
it where needed (i.e. you want to be able to execute queries in the
database but otherwise not, say, touch the filesystem).  The most recent
description I've found is:

  http://tav.espians.com/paving-the-way-to-securing-the-python-interpreter.html
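
To give a flavour of the object-capability style (this is only an
illustration of the idea--a bare exec() sandbox like this is famously
escapable in CPython, and Tav's proposal involves much more than this):

```python
def make_sandbox(grants):
    # Untrusted code runs with an empty builtins table, so the only
    # names it can reach are the capabilities we explicitly grant.
    env = {"__builtins__": {}}
    env.update(grants)
    return lambda code: exec(code, env)

log = []
# Punch exactly one hole: a query capability, but no filesystem etc.
run = make_sandbox({"run_query": lambda sql: log.append(sql)})

run("run_query('SELECT 1')")      # allowed: granted capability
try:
    run("open('/etc/passwd')")    # fails: open() isn't reachable
except NameError:
    log.append("blocked")
```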

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Fast ALTER TABLE ... ADD COLUMN ... DEFAULT xxx?

2009-05-22 Thread Sam Mason
On Thu, May 21, 2009 at 12:26:08PM -0400, Tom Lane wrote:
 I'd envision it
 as an extra column in pg_attribute, and it would work for any column(s).
 There's nothing to be gained by restricting it.

Yes, when I said first I meant the only thing that needs to be stored
is the first default value for the column.  The code currently assumes
that this first default value is NULL and is hence able to optimize the
case where you adding a NULLable column.  Adding this first default
value to pg_attribute would allow you to break this assumption and
allow the user to have non-NULL default values without complete table
rewrites.

I think the discussion covered this, but I don't think I was
particularly clear in my first message and thought I should attempt to
explain what I was saying.  Other comments about only allowing STABLE
expressions and non-toastable values make sense and were the sorts of
things I thought I would miss.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Fast ALTER TABLE ... ADD COLUMN ... DEFAULT xxx?

2009-05-21 Thread Sam Mason
On Thu, May 21, 2009 at 12:06:29PM +0400, Dmitry Koterov wrote:
 ALTER TABLE ... ADD COLUMN ... NULL;
 
 (nullable without a default value). This is because of NULL bitmap in
 tuples. And it's greatest feature for a developer!

I don't think this is because of the NULL bitmap.  PG just never needs
to flush the changes to every tuple because it knows that all old
tuples (i.e. ones that were created before this column was added) are
supposed to be NULL.

 But another very common-case query like
 
 ALTER TABLE ... ADD COLUMN ... BOOLEAN NOT NULL DEFAULT false;
 or
 ALTER TABLE ... ADD COLUMN ... INT NOT NULL DEFAULT 0;

 So, are there plans to optimize such kind of queries? This could be done by
 many ways:

I think this hasn't been done before because it's been considered too
difficult to keep track of everything, but I've just tried to come up
with an example of why it's difficult and failed.  If I'm interpreting
things correctly it's not nearly as difficult as I thought it should
be.  All that needs to be tracked is the first default value (this is
currently assumed to be NULL).  All subsequent INSERTs will have this
value in the tuple and things should just work out.

  CREATE TABLE t ( i INTEGER PRIMARY KEY );
  INSERT INTO t (i) VALUES (1);
  ALTER TABLE t ADD COLUMN j INTEGER DEFAULT 1;
  INSERT INTO t (i) VALUES (2);
  ALTER TABLE t ALTER j SET DEFAULT 2;
  INSERT INTO t (i) VALUES (3);
  ALTER TABLE t ALTER j DROP DEFAULT;
  INSERT INTO t (i) VALUES (4);

After this we will have the following tuples:

  (1)
  (2,1)
  (3,2)
  (4,NULL)

All that needs to be done is to fill in column j for the i=1 tuple with
the first default (i.e. the integer 1) and everything is done.
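To illustrate the idea, here's a hypothetical sketch (in Python, with
invented names; PostgreSQL's actual tuple-deforming code is in C) of how
the read path could fill in a missing attribute from a stored first
default:

```python
# Hypothetical sketch of the proposed read path: tuples written before
# the column existed are shorter than the current row descriptor, and
# the missing attribute is filled in from the recorded "first default"
# (PostgreSQL currently hard-codes this to NULL).
def read_tuple(stored, columns, first_defaults):
    out = list(stored)
    for col in columns[len(stored):]:
        # None models SQL NULL; an absent entry behaves like today's code
        out.append(first_defaults.get(col))
    return tuple(out)

rows = [(1,), (2, 1), (3, 2), (4, None)]   # the tuples from the example
cols = ['i', 'j']
firsts = {'j': 1}                          # first default recorded for j
print([read_tuple(r, cols, firsts) for r in rows])
# → [(1, 1), (2, 1), (3, 2), (4, None)]
```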

Who wants to tell me what I've missed?

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFE: Transparent encryption on all fields

2009-04-27 Thread Sam Mason
On Mon, Apr 27, 2009 at 09:24:55AM +0100, Sam Halliday wrote:
 Not looking for a Windows solution. Must be cross platform and work  
 for headless machines, laptops and desktops. Encrypted drive solutions  
 fall short of these requirements. Other considerations which rule out  
 encrypted drives have been discussed earlier in the thread.

Just for reference; TrueCrypt is available for Mac OS X and Linux.  Never
tried it on them, but it's supposed to work!

 For the record, I have a working solution at the moment that involves  
 using an encrypted drive and a manual per-user startup sequence. *I am  
 not looking for user advice*, this is an RFE for an additional server- 
 side encryption approach to security.

OK, I've just re-read your original messages--my mail client had decided
to break the thread into several pieces and I think I've put it all back
together now and have an idea what you're after.

As far as I can tell, it would be somewhat hard with PG the way it's
currently implemented.  It assumes shared access to various catalogue
tables and I don't think these could be encrypted (I believe they're
needed during database startup).

One possible arrangement would be if each user/encryption key had its
own database cluster.  If that's OK, then maybe you could hack pg-pool
around so that once it received the secret it would be able to run off,
mount the appropriate partitions, and start the database engine before
connecting to it.  I've not used pg-pool before, but have a feeling
that it can re-write queries on the fly so must have some non-trivial
protocol knowledge--I may be wrong about that though.

Allowing multiple users/encryption keys to access the same database seems
problematic; how would you allow catalogue access and enforce unique or
other constraints if the server couldn't look to see what's there?  Not
sure what you're after here though.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFE: Transparent encryption on all fields

2009-04-26 Thread Sam Mason
On Mon, Apr 27, 2009 at 07:03:32AM +1200, Andrej wrote:
 2009/4/26 Sam Halliday sam.halli...@gmail.com:
  I'm still talking about theft of machines (particularly laptops) as that is
  a major threat. One need only read the British newspapers to discover story
  after story of articles where sensitive information was on a laptop which
  was stolen. As pointed out elsewhere, psql + encrypted drive is entirely
  unpractical as no OS is setup to ask for an encrypted drive password on boot
  (similarly for headless machines, user interaction is required). A practical
  solution that accomplishes the same goals as the encrypted drive is
  necessary.

 Buy a higher end thinkpad, it uses a BIOS password and an ASIC
 to encrypt the data in hardware, w/o impact on performance.

There are various tools that allow you to do this without specialised
hardware, TrueCrypt[1] is one I've used in the past and is very easy for
naive users to get their heads around.

-- 
  Sam  http://samason.me.uk/

 [1] http://www.truecrypt.org/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-17 Thread Sam Mason
On Thu, Apr 16, 2009 at 12:08:37PM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  I've failed to keep up with the discussion so I'm not sure where this
  conversation has got to!  Is the consensus for 8.4 to enable SQL2003
style U&'lit' escaped literals if and only if standard_conforming_strings
  is set?
 
 That was Peter's proposal, and no one's shot a hole in it yet ...

Just noticed that the spec only supports four hex digits; this would
imply that support for anything outside the BMP would have to be done
by encoding the character as a surrogate pair.  If the code doesn't
do this already (the original patch didn't seem to) these should be
normalised back to a single character in a similar manner to Marko's
recent patch[1].

 I think the discussion about whether/how to add a Unicode extension to
 E''-style literals is 8.5 material.  We are in beta so now is not
 the time to add new features, especially ones that weren't even on the
 TODO list before.

OK, sounds reasonable.

-- 
  Sam  http://samason.me.uk/

 [1] http://archives.postgresql.org//pgsql-hackers/2009-04/msg00904.php

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-17 Thread Sam Mason
On Fri, Apr 17, 2009 at 10:15:57AM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  Just noticed that the spec only supports four hex digits;
 
 Better read it again.

You're right of course.  My ability to read patches seems not to be very
good.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Sam Mason
On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote:
 On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:
  Btw, is there any good reason why we don't reject \000, \x00
  in text strings?
 
 Why forbid nulls in text strings?

As far as I know, PG assumes, like most C code, that strings don't
contain embedded NUL characters.  The manual[1] has this to say:

  The character with the code zero cannot be in a string constant.

I believe you're supposed to use values of type bytea when you're
expecting to deal with NUL characters.

-- 
  Sam  http://samason.me.uk/
 
 [1] 
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Sam Mason
On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
 On 4/15/09, Tom Lane t...@sss.pgh.pa.us wrote:
  Given Martijn's complaint about more-than-16-bit code points, I think
   the \u proposal is not mature enough to go into 8.4.  We can think
   about some version of that later, if there's enough interest.
 
 I think it would be good idea. Basically we should pick one from
 couple of pre-existing sane schemes.  Here is quick summary
 of Python, Perl and Java:
 
 Python [1]:
 
   \u - 16-bit codepoint
   \U - 32-bit codepoint
   \N{char-name}  - Character by name

Microsoft have also gone this way in C#; named code points are not
supported, however.

 Perl [2]:
 
   \x{..} - {} contains hexadecimal codepoint
   \N{char-name}  - Unicode char name

Looks OK, but the 'x' seems somewhat redundant.  Why not just:

  \{}

This would be following the BitC[2] project, especially if it was more
like:

  \{U+}

e.g.

  \{U+03BB}

would be the lowercase lambda character.  Added appeal is in the fact
that this (i.e. U+03BB) is how the Unicode consortium spells code
points.

 Java [3]:
 
   \u - 16-bit codepoint

AFAIK, Java isn't the best reference to choose; it assumed from an early
point in its design that Unicode characters were at most 16bits and
hence had to switch its internal representation to UTF-16.  I don't
program much Java these days to know how it's all worked out, but it
would be interesting to hear from people who regularly have to deal with
characters outside the BMP (i.e. code points greater than 65535).

-- 
  Sam  http://samason.me.uk/

 [1] http://msdn.microsoft.com/en-us/library/aa664669(VS.71).aspx
 [2] http://www.bitc-lang.org/docs/bitc/spec.html#stringlit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote:
 On 4/16/09, Sam Mason s...@samason.me.uk wrote:
  Microsoft have also gone this way in C#, named code points are not
  supported however.
 
 And it handles also non-BMP codepoints with \u escape similarly:
 
   http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
 
 This makes it even more standard.

I fail to see what you're pointing out here; as far as I understand it,
\u is for BMP code points and \U extends the range out to 32bit code
points.  I can't see anything about non-BMP and \u in the above link,
you appear free to write your own surrogate pairs but that seems like an
independent issue.

I'd not realised before that C# is specified to use UTF-16 as its
internal encoding.

   This would be following the BitC[2] project, especially if it was more
   like:
 
\{U+}
 
 We already got yet-another-unique-way-of-escaping-unicode with U&.
 
 Now let's try to support some actual standard also.

That comes across *very* negatively; I hope it's just a language issue.

I read your parent post as soliciting opinions on possible ways to
encode Unicode characters in PG's literals.  The U&'lit' was criticised,
you posted some suggestions, I followed up with what I hoped to be a
useful addition.  It seems useful here to separate de jure from de
facto standards; implementing U&'lit' would be following the de jure
standard, anything else would be de facto.

A survey of existing SQL implementations would seem to be more appropriate
as well:

Oracle: UNISTR('string-literal') and \

  It looks as though Oracle originally used UCS-2 internally (i.e. BMP
  only) but more recently Unicode support has been improved to allow
  other planes.

MS-SQL Server: 

  can't find anything remotely useful; best seems to be to use
  NCHAR(integer-expression) which looks somewhat unmaintainable.

DB2: U&'string-literal' and \xx

  i.e. it follows the SQL-2003 spec

FireBird:

  can't find much either; support looks somewhat low on the ground

MySQL:

  same again, seems to assume query is encoded in UTF-8

Summary seems to be that either I'm bad at searching or support for
Unicode doesn't seem very complete in the database world and people work
around it somehow.

 You did not read my mail carefully enough - the Java and also Python/C#
 already support non-BMP chars with '\u' and exactly the same (utf16) way.

Again, I think this may be a language issue; if not then more verbose
explanations help, maybe something like "sorry, I obviously didn't
explain that very well".  You will of course feel you explained it
perfectly well, but everybody enters a discussion with different
intuitions and biases; email has a nasty habit of accentuating these
differences and compounding them with language problems.

I'd never heard of UTF-16 surrogate pairs before this discussion and
hence didn't realise that it's valid to have a surrogate pair in place
of a single code point.  The docs say that D800 DF02 corresponds to
U+10302, Python would appear to follow my intuitions in that:

  ord(u'\uD800\uDF02')

results in an error instead of giving back 66306, as I'd expect.  Is
this a bug in Python, my understanding, or something else?
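For reference, the arithmetic relating a supplementary code point to its
surrogate pair is fixed by the UTF-16 definition; a quick sketch:

```python
# UTF-16 surrogate-pair arithmetic as defined by Unicode: code points
# above the BMP are split into a high (D800-DBFF) and a low (DC00-DFFF)
# surrogate.
def to_surrogate_pair(cp):
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogate_pair(high, low):
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(['%04X' % u for u in to_surrogate_pair(0x10302)])  # ['D800', 'DF02']
print('%04X' % from_surrogate_pair(0xD800, 0xDF02))      # 10302
```

which agrees with the D800 DF02 / U+10302 correspondence above.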

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 10:54:16AM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  I'd never heard of UTF-16 surrogate pairs before this discussion and
  hence didn't realise that it's valid to have a surrogate pair in place
  of a single code point.  The docs say that D800 DF02 corresponds to
  U+10302, Python would appear to follow my intuitions in that:
 
ord(u'\uD800\uDF02')
 
  results in an error instead of giving back 66306, as I'd expect.  Is
  this a bug in Python, my understanding, or something else?
 
 I might be wrong, but I think surrogate pairs are expressly forbidden in
 all representations other than UTF16/UCS2.  We definitely forbid them
 when validating UTF-8 strings --- that's per an RFC recommendation.
 It sounds like Python is doing the same.

OK, that's good.  I thought I was missing something.  A minor point is
that in UCS2 each 16bit value is exactly one character and characters
outside the BMP aren't supported, hence the need for UTF-16.

I've failed to keep up with the discussion so I'm not sure where this
conversation has got to!  Is the consensus for 8.4 to enable SQL2003
style U&'lit' escaped literals if and only if standard_conforming_strings
is set?  This seems easiest for client code as it can use this
exclusively for knowing what to do with backslashes.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 06:34:06PM +0300, Marko Kreen wrote:
 Which hints that you can as well enter the pairs directly: \uxx\uxx.
 If I'd be language designer, I would not see any reason to disallow it.
 
 And anyway, at least mono seems to support it:
 
 using System;
 public class HelloWorld {
 public static void Main() {
 Console.WriteLine("\uD800\uDF02\n");
 }
 }
 
 It will output single UTF8 character.  I think this should settle it.

I don't have any .net stuff installed so can't test; but C# is defined
to use UTF-16 as its internal representation so it would make sense if
the above gets treated as a single character internally.  However, if it
used any other encoding the above should be treated as an error.

 The de-facto about Postgres is stdstr=off.  Even if not, E'' strings
 are still better for various things, so it would be good if they also
 acquired unicode-capabilities.

OK, this seems independent of the U&'lit' discussion that started the
thread.  Note that PG already supports UTF8; if you want the character
I've been using in my examples up-thread, you can do:

  SELECT E'\xF0\x90\x8C\x82';

I have a feeling that this is predicated on the server_encoding being
set to utf8 and this can only be done at database creation time.
Another alternative would be to use the convert_from function, i.e:

  SELECT convert_from(E'\xF0\x90\x8C\x82', 'UTF8');

Never had to do this though, so there may be better options available.
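For what it's worth, the byte string used above is just the UTF-8
encoding of U+10302, which is easy to check (Python here):

```python
# U+10302 encoded as UTF-8 gives the four bytes used in the
# E'\xF0\x90\x8C\x82' literal above.
encoded = '\U00010302'.encode('utf-8')
print(' '.join('%02X' % b for b in encoded))  # F0 90 8C 82
```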

 Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
 that is - plain 16 or 32-bit values.  Seems your python is compiled with
 UCS2, not UCS4.

Cool, I didn't know that.  I believe mine is UCS4 as I can do:

  ord(u'\U00010302')

and I get 66306 back rather than an error.

 As I understand, in UCS2 mode it simply takes surrogate
 values as-is.

UCS2 doesn't have surrogate pairs, or at least I believe it's considered
a bug if you don't get an error when you present it with one.

 From ord() docs:
 
   If a unicode argument is given and Python was built with UCS2 Unicode,
   then the character’s code point must be in the range [0..65535]
   inclusive; otherwise the string length is two, and a TypeError will
   be raised.
 
 So only in UCS4 mode it detects surrogates and converts them to internal
 representation.  (Which in Postgres case would be UTF8.)

I think you mean UTF-16 instead of UCS4; but otherwise, yes.

 Or perhaps it is partially UTF16 aware - eg. I/O routines do unterstand
 UTF16 but low-level string routines do not:
 
   print %s % u'\uD800\uDF02'
 
 seems to handle it properly.

Yes, I get this as well.  It's all a bit weird, which is why I was
asking if this is a bug in Python, my understanding, or something else.

When I do:

  python <<EOF | hexdump -C
  print u"\uD800\uDF02"
  EOF

to see what it's doing I get an error which I'm not expecting, hence I
think it's probably my understanding.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 08:48:58PM +0300, Marko Kreen wrote:
 Seems I'm bad at communicating in english,

I hope you're not saying this because of my misunderstandings!

 so here is C variant of
 my proposal to bring \u escaping into extended strings.  Reasons:
 
 - More people are familiar with \u escaping, as it's standard
   in Java/C#/Python, probably more..
 - U& strings will not work when stdstr=off.
 
 Syntax:
 
   \u  - 16-bit value
   \U  - 32-bit value
 
 Additionally, both \u and \U can be used to specify UTF-16 surrogate
 pairs to encode characters with value > 0xFFFF.  This is exact behaviour
 used by Java/C#/Python.  (except that Java does not have \U)

Are you sure that this handling of surrogates is correct?  The best
answer I've managed to find on the Unicode consortium's site is:

  http://unicode.org/faq/utf_bom.html#utf16-7

it says:

  They are invalid in interchange, but may be freely used internal to an
  implementation.

I think this means they consider the handling of them you noted above,
in other languages, to be an error.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 03:04:37PM -0400, Andrew Dunstan wrote:
 Sam Mason wrote:
 Are you sure that this handling of surrogates is correct?  The best
 answer I've managed to find on the Unicode consortium's site is:
 
   http://unicode.org/faq/utf_bom.html#utf16-7
 
 it says:
 
   They are invalid in interchange, but may be freely used internal to an
   implementation.
 
 It says that about non-characters, not about the use of surrogate pairs, 
 unless I am misreading it.

No, I think you're probably right and I was misreading it.  I went
back and forth several times to explicitly check I was interpreting
this correctly and still failed to get it right.  Not sure what I was
thinking and sorry for the hassle Marko!

I've already asked on the Unicode list about this (no response yet), but
I have a feeling I'm getting worked up over nothing.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Sam Mason
On Tue, Apr 14, 2009 at 04:01:48PM +0300, Peter Eisentraut wrote:
 On Saturday 11 April 2009 18:20:47 Sam Mason wrote:
  I can't see much support in the other database engines; searched for
  Oracle, MS-SQL, DB2 and Firebird.  MySQL has it planned for 7.1, so not
  for a while.
 
 DB2 supports it, as far as I know.

Doh, yes it does, doesn't it.  Sorry I searched for a bit and failed to
find anything before.  Looks as though the signal to noise ratio was far
too low as I've just searched again and found a (single) reference to
their docs describing the feature[1].

I've also just noticed that the MySQL todo item points to several other
implementations and how they handle Unicode escape sequences.  The most
common option (bearing in mind that this is a sample of mainly FOSS
databases) seems to be doing some variant of '\u0123', as in the style
of Python.  This is only supported for literals and no support for
identifiers appears to be provided.

-- 
  Sam  http://samason.me.uk/

 [1] 
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.sql.ref.doc/doc/r731.html

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-11 Thread Sam Mason
On Fri, Apr 10, 2009 at 05:54:25PM -0400, Tom Lane wrote:
 It gets worse though: I have seldom seen such a badly designed piece of
 syntax as the Unicode string syntax --- see
 http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE
 
 I think we need to give very serious consideration to ripping out that
 feature.

I'd agree it's comically bad syntax; I just had a look in the archives
and it was only put in a few months ago:

  http://archives.postgresql.org/pgsql-hackers/2008-10/msg01169.php

I can't see much support in the other database engines; searched for
Oracle, MS-SQL, DB2 and Firebird.  MySQL has it planned for 7.1, so not
for a while.

  http://forge.mysql.com/worklog/task.php?id=3529

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Closing some 8.4 open items

2009-04-11 Thread Sam Mason
On Sat, Apr 11, 2009 at 11:13:59AM -0400, Tom Lane wrote:
 My own take on it is that actually I'd prefer one command for all of
 these.  If I say \df sum it would be good if the output included the
 sum() aggregates; the reason being that I might be wondering if I can
 create a plain function named sum.  If I have to check not only \df and
 \da but also \dw for conflicts, that's going to be a real PITA.  Also,
 pity the poor newbie who is unclear on the distinctions between these
 different function-looking animals, and is just trying to find some
 documentation on rank().
 
 If we were designing in a green field I think you could make a real
 strong case for a single \df command with an output column type having
 the alternatives regular, aggregate, window, and maybe trigger.

What would it do for triggers?

Sounds like a general identifier search; there seem to be two big
namespaces in PG at the moment, that of things that look like function
calls and that of relations (and their types).

  CREATE TABLE foo ( i int, t text );

and

  CREATE TYPE foo AS ( t text);

both go into the same namespace so would appear to be a similar symptom
as above.  I have a feeling this is going a bit further than you're
thinking above.

Not sure about the newbie argument; I'd expect them to be using google
and wouldn't know much about the backslash commands in psql.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NaN support in NUMERIC data type

2009-04-09 Thread Sam Mason
On Wed, Apr 08, 2009 at 08:16:52PM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  On Wed, Apr 08, 2009 at 06:11:59PM -0400, Tom Lane wrote:
  Anyway, I revised this a bit and applied to HEAD.
 
  I've not tested; but your changes look as though they will break:
SELECT 'Infinity'::float::numeric;
 
 That gives an error now, just as it did before, so I'm not sure what you
 think is broken about it ...

I shouldn't have responded last night; didn't realise how little of my
code was left so the semantics I was assuming weren't valid.


-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NaN support in NUMERIC data type

2009-04-08 Thread Sam Mason
On Tue, Apr 07, 2009 at 12:51:21PM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
SELECT 'NaN'::float8;
SELECT ' NaN'::float8;
SELECT 'NaN '::float8;
SELECT '+NaN'::float8;
SELECT '-NaN'::float8;
 
 Well, the +- part must be an artifact of your strtod() implementation;
 our own code isn't doing anything to accept that.  I think it's pretty
 bogus --- NaNs do not have signs.

OK, didn't actually check the code with that; no point worrying about it
then.

 IIRC, the explicit support for leading/trailing spaces is something that
 we added in float8in long after numeric_in was written, and I think just
 nobody thought about numeric at the time.  But it's clearly inconsistent
 to allow spaces around a regular value but not a NaN.

Good, I wanted to make sure it wasn't a deliberate thing before doing
too much.

 Possibly the logic for leading/trailing spaces could be pulled out
 of set_var_from_str and executed in numeric_in instead?

Yes, I didn't want to do this before because it's called from a
couple of other places.  I looked again and realised that we're
generating those in very fixed formats so don't need to worry about
leading/trailing spaces and hence can move the code up to numeric_in.

The attached patch gives set_var_from_str an endptr similar to strtod
so handling is closer to the float[48]in code.  I moved error reporting
code outside as well to cut down on the multiple identical calls.

The included patch was generated against 8.3.5 (because that's what I
had lying around when I started playing) but applies with a little fuzz
to the latest snapshot and does the right thing for me in both versions.

-- 
  Sam  http://samason.me.uk/
*** src/backend/utils/adt/numeric.c	2009-04-07 16:49:08.0 +0100
--- src/backend/utils/adt/numeric.c	2009-04-08 11:53:13.0 +0100
***
*** 238,244 
  static void free_var(NumericVar *var);
  static void zero_var(NumericVar *var);
  
! static void set_var_from_str(const char *str, NumericVar *dest);
  static void set_var_from_num(Numeric value, NumericVar *dest);
  static void set_var_from_var(NumericVar *value, NumericVar *dest);
  static char *get_str_from_var(NumericVar *var, int dscale);
--- 238,244 
  static void free_var(NumericVar *var);
  static void zero_var(NumericVar *var);
  
! static void set_var_from_str(const char *str, const char ** endptr, NumericVar *dest);
  static void set_var_from_num(Numeric value, NumericVar *dest);
  static void set_var_from_var(NumericVar *value, NumericVar *dest);
  static char *get_str_from_var(NumericVar *var, int dscale);
***
*** 310,315 
--- 310,317 
  numeric_in(PG_FUNCTION_ARGS)
  {
  	char	   *str = PG_GETARG_CSTRING(0);
+ 	char	   *orig_str;
+ 	const char *endptr;
  
  #ifdef NOT_USED
  	Oid			typelem = PG_GETARG_OID(1);
***
*** 318,335 
  	NumericVar	value;
  	Numeric		res;
  
! 	/*
! 	 * Check for NaN
  	 */
! 	if (pg_strcasecmp(str, "NaN") == 0)
! 		PG_RETURN_NUMERIC(make_result(&const_nan));
  
  	/*
  	 * Use set_var_from_str() to parse the input string and return it in the
  	 * packed DB storage format
  	 */
  	init_var(&value);
! 	set_var_from_str(str, &value);
  
  	apply_typmod(&value, typmod);
  
--- 320,369 
  	NumericVar	value;
  	Numeric		res;
  
! 	/* 
! 	 * To allow us to generate sensible error messages we tuck the
! 	 * original start of the string away so we can use it later.
  	 */
! 	orig_str = str;
! 
! 	/* skip leading spaces */
! 	while (isspace((unsigned char) *str))
! 		str++;
  
  	/*
  	 * Use set_var_from_str() to parse the input string and return it in the
  	 * packed DB storage format
  	 */
  	init_var(&value);
! 	set_var_from_str(str, &endptr, &value);
! 
! 	/*
! 	 * we didn't see anything that looked like a numeric value
! 	 */
! 	if (str == endptr) {
! 		/*
! 		 * Check for NaN
! 		 */
! 		if (pg_strncasecmp(str, "NaN", 3) == 0) {
! 			value = const_nan;
! 			endptr = str + 3;
! 		} else
! 			ereport(ERROR,
! 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
! 					 errmsg("invalid input syntax for type numeric: \"%s\"",
! 							orig_str)));
! 	}
! 
! 	/* skip trailing spaces */
! 	while (isspace((unsigned char) *endptr))
! 		endptr++;
! 
! 	if (*endptr != '\0') {
! 		ereport(ERROR,
! 				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
! 				 errmsg("invalid input syntax for type numeric: \"%s\"",
! 						orig_str)));
! 	}
  
  	apply_typmod(&value, typmod);
  
***
*** 2056,2061 
--- 2090,2096 
  	Numeric		res;
  	NumericVar	result;
  	char		buf[DBL_DIG + 100];
+ 	const char *endptr;
  
  	if (isnan(val))
  		PG_RETURN_NUMERIC(make_result(&const_nan));
***
*** 2064,2070 
  
  	init_var(&result);
  
! 	set_var_from_str(buf, &result);
  	res = make_result(&result);
  
  	free_var(&result);
--- 2099,2111 
  
  	init_var(&result);
  
! 	set_var_from_str(buf, &endptr, &result);
! 	if (endptr == buf) {
! 		ereport(ERROR,
! 				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION

Re: [HACKERS] NaN support in NUMERIC data type

2009-04-08 Thread Sam Mason
On Wed, Apr 08, 2009 at 06:11:59PM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  On Tue, Apr 07, 2009 at 12:51:21PM -0400, Tom Lane wrote:
  IIRC, the explicit support for leading/trailing spaces is something that
  we added in float8in long after numeric_in was written, and I think just
  nobody thought about numeric at the time.  But it's clearly inconsistent
  to allow spaces around a regular value but not a NaN.
 
  The included patch was generated against 8.3.5 (because that's what I
  had lying around when I started playing) but applies with a little fuzz
  to the latest snapshot and does the right thing for me in both versions.
 
 Hmm, did it do the right thing for NaN with a typmod?  I doubt
 apply_typmod is safe for a NaN.

Oops, good catch.  Didn't think to check for that.

 Anyway, I revised this a bit and applied to HEAD.

I've not tested; but your changes look as though they will break:

  SELECT 'Infinity'::float::numeric;

I think you'll get '0' back instead of an error; either that or it's too
late for me to be thinking straight!

 I'm disinclined to back-patch; it's not really a bug
 but a definition change, and we get flack when we put that sort of
 change into stable branches ...

OK


Out of personal interest; why did you translate:

  while(isspace(*p)) p++;

into more verbose forms?  Is it so it fits in with the style in rest of
the code, or does it actually do something different that I'm missing?

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] NaN support in NUMERIC data type

2009-04-07 Thread Sam Mason
Hi,

I've just noticed that the NUMERIC input function special cases NaN
values differently to the floating point input functions.  For example
the following are all accepted (on my system anyway):

  SELECT 'NaN'::float8;
  SELECT ' NaN'::float8;
  SELECT 'NaN '::float8;
  SELECT '+NaN'::float8;
  SELECT '-NaN'::float8;

whereas only the first is OK for numeric.  Is this deliberate?

A quick check of utils/adt/numeric.c would suggest that it's been
special cased as an optimisation so we don't allocate a numeric value
in set_var_from_str() unless we need to.

As a side note: I'm only really interested in the leading/trailing
spaces.  I only noticed the leading plus/minus sign when reading the
code, and I think it's probably better if a NaN is rejected when it has
a leading sign.  I think it would be better if it were consistent with
FLOAT; I'm not sure how to do this in a platform-independent way though.

I could submit a patch if you want; I'm unsure whether it's better to
duplicate code in numeric_in, do slightly more work and allocate memory
for NaN's when not strictly needed, or remove knowledge of NumericVar
from numeric_in altogether and push code into set_var_from_str.
Comments?

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-02 Thread Sam Mason
On Thu, Apr 02, 2009 at 02:04:41PM -0400, Tom Lane wrote:
 A correct fix
 outside-the-function would look more like
 
 case when str = '' then '{}'::text[] else string_to_array(str, ',') end
 
 which should correctly yield NULL for NULL input and an empty array
 for empty input.  Similarly, if someone wanted to force the
 single-empty-string result, they should do
 
 case when str = '' then '{""}'::text[] else string_to_array(str, ',') end
 
 which also still yields NULL if str is NULL.
 
 Right at the moment, if we stick with the historical definition
 of the function, *both* camps have to write out their choice of
 the above.  Seems like this is the worst of all possible worlds.
 We should probably pick one or the other.

Yes, I'd be tempted to pick one and go with it.  It seems a completely
arbitrary choice one way or the other, but the current behaviour is
certainly wrong.

I'd go with returning a zero-element array because it would do
the right thing more often when paired with array_to_string.
I've also been through the first few pages of a Google search for
array_to_string and it seems to do the right thing for the majority
of cases.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-01 Thread Sam Mason
 [ oops, forgot to send this to -hackers before ]

On Tue, Mar 31, 2009 at 05:08:45PM +0100, Greg Stark wrote:
 Both interpretations are clearly consistent but it depends on whether
 you think it's a bunch of text strings concatenated together or if
 it's a list of objects.
 
 The example of string_to_array('',',')::int[] is relevant to this
 point. The whole there's one empty element only makes sense if
 you're thinking in terms of string processing. If it's a list of any
 other kind of object it probably doesn't make sense; you can't say
 there's one empty integer or one empty composite object or one empty
 anything else.

I think this is about the only sensible option, but my reasoning is
somewhat different.

My original intuition was that array_to_string and string_to_array
should be (perfect) inverses of each other.  Unfortunately I can't see
any way to get this to happen; zero length arrays or NULL elements
in the array mean that array_to_string isn't injective.  This means
that the composition of the two functions won't result in an injective
function and my original premise is false.  Note that as far as I can
tell string_to_array is injective.  I'm assuming that the delimiter
won't appear as part of an element of the array; e.g. an array of
integers and space as a delimiter is OK, but using the same delimiter
with unconstrained text is not OK; a blank delimiter is never OK, as it's
always part of a string.

Not being injective means there exists more than one array that encodes
to the same string.  The obvious examples are how you sanely encode '{}'
and '{NULL}' in a unique way; '{""}' is a bad example because it's just
an artifact of how strings are represented.  The complications needed
to allow this to happen would make it a function very similar to the
array's normal output function and hence it wouldn't serve a
useful purpose.  All of this implies that we have to make a compromise
somewhere.
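To illustrate the collision (a Python stand-in for the SQL function; the
naming and None-for-NULL convention are mine):

```python
def array_to_string(arr, delim):
    """Stand-in for SQL array_to_string: drop NULLs (None), join the rest."""
    return delim.join(str(x) for x in arr if x is not None)

# Distinct arrays collapse to the same string, so no inverse can exist:
assert array_to_string([], ",") == array_to_string([None], ",") == ""
assert array_to_string(["a"], ",") == array_to_string(["a", None], ",") == "a"
```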

The semantics that most closely match the existing behaviour would be;
for array_to_string:

  1) remove NULL values from input array
  2) call output_function on remaining elements
  3) intersperse[1] the delimiter between the remaining elements
  4) concatenate the resulting array

for string_to_array:

  1) check if input is zero length; return empty array
  2) split array based on delimiter and return

Note that both functions are STRICT; i.e. a NULL for either parameter
should cause the function to return NULL.  Arguably in string_to_array
it could be non-strict if the input string is empty, but it's probably
worth keeping it strict to simplify the semantics.

Here are some examples:

  array_to_string('{}'::TEXT[], ',')          = ''
  array_to_string('{NULL}'::TEXT[], ',')      = ''
  array_to_string('{NULL,NULL}'::TEXT[], ',') = ''
  array_to_string('{a,NULL}'::TEXT[], ',')    = 'a'
  array_to_string('{NULL,a}'::TEXT[], ',')    = 'a'
  array_to_string('{a,b}'::TEXT[], ',')       = 'a,b'
  array_to_string('{a,NULL,b}'::TEXT[], ',')  = 'a,b'

  string_to_array('', ',')    = '{}'
  string_to_array(' ', ',')   = '{" "}'
  string_to_array(',', ',')   = '{"",""}'
  string_to_array('a', ',')   = '{a}'
  string_to_array('a,', ',')  = '{a,""}'
  string_to_array(',a', ',')  = '{"",a}'
  string_to_array('a,b', ',') = '{a,b}'
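The proposed semantics can be sketched in Python (names mirror the SQL
functions and None stands in for NULL, but this is only an illustration,
not the PostgreSQL implementation):

```python
def array_to_string(arr, delim):
    # STRICT: NULL (None) input yields NULL
    if arr is None or delim is None:
        return None
    # 1) remove NULLs, 2) stringify, 3)+4) intersperse and concatenate
    return delim.join(str(x) for x in arr if x is not None)

def string_to_array(s, delim):
    # STRICT: NULL (None) input yields NULL
    if s is None or delim is None:
        return None
    # 1) zero-length input gives the empty array
    if s == "":
        return []
    # 2) otherwise split on the delimiter
    return s.split(delim)

assert array_to_string(["a", None, "b"], ",") == "a,b"
assert string_to_array("", ",") == []
assert string_to_array(",", ",") == ["", ""]
assert string_to_array("a,", ",") == ["a", ""]
```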

My thinking before was that it should be doing:

  string_to_array('',',') = '{""}'

instead, but I now think that Greg has a point and these are nicer/more
generally useful semantics.

Hum, that all got a bit more verbose than I was expecting.  Ah well, I
hope it's somewhat useful.

-- 
  Sam  http://samason.me.uk/
 
 [1] as in the intersperse function in Haskell
   http://www.haskell.org/onlinereport/list.html#sect17.3
   intersperse "#" ["a", "bar"] == ["a", "#", "bar"]
 note that here we're working with arrays of strings, rather than
 arrays of characters as in the report.



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-01 Thread Sam Mason
On Wed, Apr 01, 2009 at 10:23:18AM -0700, David E. Wheeler wrote:
 On Apr 1, 2009, at 10:05 AM, justin wrote:
 string_to_array('',',')::INT[]  works as proposed
 
 But
 string_to_array(',,,', ',' )::INT[]   Fails
 or
 string_to_array('1,2,,4', ',' )::INT[] Fails .
 
 
 I'm trying to understand the difference between an empty string and a
 string with many blank entries between the delimiters.
 Consider ',,' = '' once the delimiter is removed.  Yet
 seven zero-length entries were passed.  How is that going to be
 handled?
 
 Right, it's making a special case of '', which does seem rather  
 inconsistent to me.

Yes it is; but it's a useful special case because it allows:

  string_to_array(array_to_string(col,','),',')

to do the right thing however many elements the array contains.  With
the current implementation you get a NULL back in the case of zero
elements and the expected array back the rest of the time.

To me, it doesn't really matter whether:

  string_to_array(',', ',' )::INT[]

fails or not; because array_to_string will never generate a string that
looks like this.
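A quick sketch of the round trip the special case buys, assuming the
proposed empty-string behaviour (Python stand-in, not the SQL
implementation; NULL-free input assumed):

```python
def round_trip(col):
    """string_to_array(array_to_string(col, ','), ',') with the
    proposed '' -> empty-array special case."""
    s = ",".join(col)                       # array_to_string
    return [] if s == "" else s.split(",")  # string_to_array

assert round_trip([]) == []
assert round_trip(["a"]) == ["a"]
assert round_trip(["a", "b"]) == ["a", "b"]
```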

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-01 Thread Sam Mason
On Wed, Apr 01, 2009 at 07:40:16PM +0100, Greg Stark wrote:
 The existing behaviour of returning NULL is the only consistent
 choice since the correct value is unknown. And one could argue that
 it's easier to replace NULL with the correct value (if the programmer
 knows, using coalesce) than it is to replace either '' or '{}'.

Couldn't a similar argument be applied for division by zero?  Since it's
not known whether the user wants to get a divide by zero exception or
infinity PG should return NULL and punt the choice to the user.

I think everybody would agree that this would be a bad thing to do!

 But I'm
 still leaning to thinking that using an arbitrary choice that at least
 gets most users intentions is better.

I'd agree; returning NULL and not forcing the user to make a choice
is a bad design decision---the user doesn't need to put a coalesce
in and hence their code will probably break in strange ways when
they're not expecting it.  Nobody suggest adding a third parameter to
string_to_array, please!

The general mantra that seems to apply here is "one good option is
better than two bad ones".

-- 
  Sam  http://samason.me.uk/



  1   2   >