Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Daniel Farina
On Thu, Mar 29, 2012 at 10:55 PM, Dobes Vandermeer dob...@gmail.com wrote:
 Virtual hosts. Same port.

 In that case, the frontend would not be tied to a specific PostgreSQL
 server, then?  I think initially this might complicate things a bit, and you
 could solve it by putting an HTTP proxy in front to do the virtual hosts for
 you.

I think these problems are treatable, as you mention, and if there is
to be any value at all it will come from using mostly off-the-shelf
components (which, at this time, are rarer for the new-generation
stuff).  That's the draw, for me.

 Ah, I get it - you want a way to add some extra information to the protocol
 in a backwards compatible way.  HTTP (and SPDY) provides a standard way to
 do that.  Makes sense.

 I'll make a note of these and hash out the details a bit more once there's
 something working to add them to.

A lot of them are old ideas, but it would be nice to encourage
experimentation by getting over some of the
small-matter-of-programming and backwards-compatibility issues.

 I think SPDY or like-protocols [...] give a crisp treatment to
 interactive, stateful workloads involving

 back-and-forth between client and server with multiplexing, fixing
 some problems with the hacks in HTTP-land from before.

 It sounds like at some level you're really talking about replacing the
 built-in protocol with SPDY because SPDY is possibly a better baseline than
 updating the existing protocol.  That's an interesting idea, I think this
 project could evolve in that direction if there's demand for it.

If only so there is a smaller set of arbitrary decisions to make about
how to delimit messages...but if SPDY doesn't get widely deployed, or
exacts an unacceptable performance penalty, it is game over.  The
worst possible outcome is that the thing that becomes common is also
impractical for session-oriented, sequential SQL execution, but I am
hopeful that given the use cases driving this standardization process
that this is less likely to happen.  FEBE's framing/message format
doesn't seem like an area where people are itching to try wild and
crazy changes unique to Postgres (maybe someone has...) so using a
more commonly seen delimitation format seems like a concession that
opens a lot more useful and interesting doors.

This discussion is somewhat premature because HTTP 2.0 isn't
standardized, nor has SPDY become a widespread de facto format
(although the percentage of well-known web-infrastructure projects
implementing it has grown impressively, both recently and overall),
and there are not even straw-man bandwidth/latency measurements
(versus FEBE), but as long as someone is bringing up HTTP, I thought
it worth discussing in a little more depth, because it's something I
mentally poll regularly, looking for a
sign that It's Time.  It wasn't ready enough for me to start a thread,
but clearly I couldn't quite resist replying to one...

-- 
fdr



Re: [HACKERS] Command Triggers patch v18

2012-03-30 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 I'm thinking of things like extension whitelisting.  When some
 unprivileged user says CREATE EXTENSION harmless, and harmless is
 marked as superuser-only, we might like to have a hook that gets
 called *at permissions-checking time* and gets to say, oh, well, that
 extension is on the white-list, so we're going to allow it.  I think
 you can come up with similar cases for other commands, where in
 general the operation is restricted to superusers or database owners
 or table owners but in specific cases you want to allow others to do
 it.

I did that another way in previous incarnations of the patch, which was
to allow for an INSTEAD OF event trigger backed by a SECURITY DEFINER
function. When the extension is whitelisted, guard against recursion,
run CREATE EXTENSION in the security definer function, then signal that
the execution of the original command should now be aborted.

That was too dangerous given the lack of policy about where exactly the
user code is fired, but I think we could now implement that for some of
the event timing specs we're listing. Only some of them, I guess only
those that are happening before we lock the objects.

I would then prefer using the INSTEAD OF keywords, which are much
easier to grasp than AT.
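
To make that concrete, here is a rough sketch against the proposed
syntax (none of which is committed); whitelist_create_extension() is a
made-up SECURITY DEFINER function that would check the whitelist, guard
against recursion, re-run CREATE EXTENSION as its owner, and then
signal that the original execution should be aborted:

  create event trigger whitelisted_extensions
      instead of timing spec    -- placeholder, like the “preceding timing spec” example below
           when tag in ('CREATE EXTENSION')
        execute procedure whitelist_create_extension();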

 CREATE EVENT TRIGGER name ON event_name (event_subtype_name [, ...])
 EXECUTE PROCEDURE function_name(args);

  create event trigger prohibit_some_ddl
     preceding timing spec
          when tag in ('CREATE TABLE', 'ALTER TABLE')
       execute procedure throw_an_error();

 I guess that would make sense if you think there would ever be more
 than one choice for trigger variable.  I'm not immediately seeing a
 use case for that, though - I was explicitly viewing the syntax foo

So, the variables in question are tag, objectid, objectname, schemaname
and, from a very recent email, context. On reflection, I think the
variable here would only be either tag or context, and that's it.

 More generally, my thought on the structure of this is that you're
 going to have certain toplevel events, many of which will happen at
 only a single place in the code, like an object got dropped or a
 DDL command started or a DDL command ended.  So we give those
 names, like sql_drop, ddl_command_start, and ddl_command_end.  Inside

I really dislike mixing sql_drop and ddl_command_start as being the same
kind of objects here, even if I can bend my head at the right angle and
see that it's a fair view when looking at how it's implemented. I can't
see a way to explain that to users without having to explain to them how
drop cascade is implemented.

So my proposal here is to “fake” a “proper” subcommand thanks to the new
context variable. If you DROP TYPE foo CASCADE and that in turn drops a
function foo_in(), then an event trigger is fired with

   context = 'DROP TYPE'
   tag = 'DROP FUNCTION'

Same idea when you DROP TABLE … CASCADE and a SEQUENCE and a bunch of
indexes need to disappear too: you get a usual event trigger fired with
the context set to 'DROP TABLE' this time.
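
To make the idea concrete, reusing the example syntax quoted above (the
timing spec is still a placeholder and none of this is final), a
trigger that reacts to cascaded drops could look like:

   create event trigger react_to_cascaded_drops
       preceding timing spec
            when tag in ('DROP FUNCTION', 'DROP SEQUENCE')
         execute procedure react_to_drop();

where react_to_drop() is a hypothetical function that would see
context = 'DROP TYPE' or 'DROP TABLE' in the cascaded cases above, and
context IS NULL when the user issued the DROP directly.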

I don't think we need to arrange for explicitly publishing the context
specific information here. If we need to, we have to find the right
timing spec where we can guarantee still being in the top level command
and where we already have the details filled in, then users can attach a
trigger here and register the information for themselves.

 your trigger procedure, the set of magic variables that is available
 will depend on which toplevel event you set the trigger on, but
 hopefully all firings of that toplevel event can provide the same
 magic variables.  For example, at ddl_command_start time, you're just
 gonna get the command tag, but at ddl_command_end time you will get
 the command tag plus maybe some other stuff.

With my proposal above, you could get the same set of information when
being called as a toplevel event or a subevent (one where the context is
not null). That would mean adding object name and schema name lookups
in the drop cascade code, though. We can also decide not to do that
extra lookup and just publish the object id, which we certainly do have.

This way, the timing spec of a sub-event can still be of the same kind
as the top-level event ones, we still have before and after lock entry
points, same with lookup if we add that feature, etc.

 Now, we COULD stop there.  I mean, we could document that you can
 create a trigger on ddl_command_start and every DDL command will fire
 that trigger, and if the trigger doesn't care about some DDL
 operations, then it can look at the command tag and return without
 doing anything for the operations it doesn't care about.  The only
 real disadvantage of doing it that way is speed, and maybe a bit of
 code complexity within the trigger.  So my further thought was that

Compared with my “context proposal”, you also lose the ability to refer
to sub-events as plain events with a context, which I find so much cleaner.

 we'd allow you to specify an optional filter 

Re: [HACKERS] poll: CHECK TRIGGER?

2012-03-30 Thread Pavel Stehule
Hello

2012/3/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
 Ok, seems that the API issue is settled, so I'm now looking at the code
 actually doing the checking. My first impression is that this is a lot of
 code. Can we simplify it?


I played with this, and it cannot be reduced without obscuring the
current code of pl_exec.c. So I moved the code related to checking into
a new file, pl_check.c. This code is relatively large - about 50 kB -
but it is relatively simple and I hope it is readable. I'm afraid it
cannot be reduced (significantly) by sharing code with pl_funcs.c. The
recursive iteration over nodes is a relatively small part of this patch
(~25%), and a general statement walker doesn't help too much.

 Since this is deeply integrated into the PL/pgSQL interpreter, I was
 expecting that this would run through the normal interpreter, in a special
 mode that skips all the actual execution of queries, and shortcuts all loops
 and other control statements so that all code is executed only once. That
 would mean sprinkling some if (check_only) code into the normal exec_*
 functions. I'm not sure how invasive that would be, but it's worth
 considering. I think you would be able to more easily catch more errors that
 way, and the check code would stay better in sync with the execution code.

-1

there are a few places that are very difficult: basic blocks with
exception handling (exception handlers), CASE statements, ...  Another
issue is the increased complexity of some routines like exec_eval*.


 Another thought is that check_stmt() and all its subroutines are very
 similar to the plpgsql_dumptree() code. Would it make sense to merge those?
 You could have an output mode, in addition to the xml and plain-text
 formats, that would just dump the whole tree like plpgsql_dumptree() does.


It is difficult now - without changes to plpgsql_stmt_if,
plpgsql_stmt_case and plpgsql_stmt_block it is not possible to write a
general walker that is usable for both checking and dumptree. It would
need a redesign of these nodes first.

 In prepare_expr(), you use a subtransaction to catch any ERRORs that happen
 during parsing the expression. That's a good idea, and I think many of the
 check_* functions could be greatly simplified by adopting a similar
 approach. Just ereport() any errors you find, and catch them at the
 appropriate level, appending the error to the output string. Your current
 approach of returning true/false depending on whether there was any errors
 seems tedious.

It cannot be implemented in an AST interpreter without removing some
requested functionality - fatal_errors.


 If you create a function with an invalid body (ie. set
 check_function_bodies=off; create function ... $$ bogus $$;) ,
 plpgsql_check_function() still throws an error. It's understandable that it
 cannot do in-depth analysis if the function cannot be parsed, but I would
 expect the syntax error to be returned as a return value like other errors
 that it complains about, not thrown as a hard ERROR. That would make it more
 useful to bulk-check all functions in a database with something like select
 plpgsql_check_function(oid) from pg_class. As it is, the checking stops at
 the first invalid function with an error.

done

postgres=# select plpgsql_check_function('sss'::regproc, 0);
                   plpgsql_check_function
------------------------------------------------------------
 error:42601:syntax error at or near "adasdfsadf"
 Query:  adasdfsadf
 --      ^
 Context: compilation of PL/pgSQL function "sss" near line 1
(4 rows)
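
With this, bulk-checking all PL/pgSQL functions in a database should
work with something along these lines (only a sketch, using the
two-argument signature shown above; 0 means the function is not tied to
a relation, so trigger functions would presumably need a relation OID
there instead):

  select p.oid::regprocedure as function,
         plpgsql_check_function(p.oid::regproc, 0) as message
    from pg_catalog.pg_proc p
         join pg_catalog.pg_language l on l.oid = p.prolang
   where l.lanname = 'plpgsql';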



 PS. I think plpgsql_check_function() belongs in pl_handler.c

done

Regards

Pavel Stehule


 --
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


plpgsql_check_function2-2012-03-30.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] Finer Extension dependencies

2012-03-30 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 That's how I did it first, but Alvaro was opposed to that because it allows
 for more than one extension to provide for the same feature name.
   http://archives.postgresql.org/pgsql-hackers/2012-03/msg01425.php

 Right, but the question that has to be considered is how often would
 that be intentional as opposed to an undesirable name collision.
 I think Hitoshi was right upthread that it will seldom if ever be
 the case that somebody is independently reimplementing somebody
 else's API, so the use-case for intentional substitution seems thin.

I reverted that change and we're now back to:

     Table "pg_catalog.pg_extension_feature"
   Column   | Type | Modifiers 
------------+------+-----------
 extoid     | oid  | not null
 extfeature | name | not null
Indexes:
    "pg_extension_feature_index" UNIQUE, btree (extoid, extfeature)
    "pg_extension_feature_oid_index" UNIQUE, btree (oid)
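
With the catalog back in this shape, listing which features each
installed extension provides is a simple join (this only works with the
patch applied, of course):

  select e.extname, f.extfeature
    from pg_catalog.pg_extension e
         join pg_catalog.pg_extension_feature f on f.extoid = e.oid
   order by 1, 2;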

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



extension-provides.v7.patch.gz
Description: Binary data



[HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Arun Chaitanya
Hi,

I wanted to take up this as a GSOC 2012 project.

SQL supports nested queries. When the inner query contains a
correlation variable the present optimizer takes an iterative
execution plan. If the inner query scans over a relation, the
iterative plan chosen can be sub-optimal.

The goal of this project is to enable De-correlation for all possible
cases. The iterative execution plan can be converted to a set oriented
plan by taking a join over the base relations involved. Sufficient
work has already been done in this area and has been implemented in
SQL Server.

The changes required to incorporate the above-mentioned strategy are,
to the best of my knowledge, in the rewriting phase. The key idea is
to introduce the APPLY operator in the raw parse tree. In the
above-mentioned papers, the author describes the process of removing
the Apply operator and proposes a set of rules which allow us to
achieve the goal. The PostgreSQL community has already done some work
along these lines in the optimizer, for simple subqueries involving
=, <, > in the predicates [ I observed it by looking at the EXPLAIN
output for relevant queries ]. The optimization is not done for
subqueries containing aggregates, or for existential and containment
queries.

An example query from TPCH benchmark discussed by the author:

select c_custkey
from customer
where 100 < (select sum(o_totalprice)
from orders
where o_custkey = c_custkey)

In the above case, c_custkey is a correlation variable (parameterized)
coming from the outer query. Hence in the present system, the inner
query is executed as many times as the tuples in the customer
relation. As the subQuery involves a scan over orders relation, the
total I/O cost involved is pretty high.

Using the transformation proposed by the author, the query can be re-written as

select c_custkey
from customer left outer join
orders on o_custkey = c_custkey
group by c_custkey
having 100 < sum(o_totalprice)

This allows the optimizer to choose a better plan for the left outer join
and avoid iterative execution. The idea here is not to get a rewritten
query as output but to generate a plan tree that does the same as the
above query.
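
To see the difference, it is enough to compare EXPLAIN output for the
two forms above (this assumes a TPC-H style schema with customer and
orders loaded): the correlated form shows a SubPlan that is re-executed
for every customer row, while the rewritten form can use an ordinary
hash or merge left join.

explain
select c_custkey
from customer
where 100 < (select sum(o_totalprice)
             from orders
             where o_custkey = c_custkey);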

Regards,
Arun



Re: [HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 7:33 AM, Arun Chaitanya chaitan64a...@gmail.com wrote:
 I wanted to take up this as a GSOC 2012 project.

This would be a great query planner optimization but the chances of
getting it done in one summer as a GSoC project seem to me to be nil.
You've never had a patch accepted before; picking a massive
reorganization of the query planner as your first project is not going
to work out well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Re: [COMMITTERS] pgsql: pg_test_timing utility, to measure clock monotonicity and timing

2012-03-30 Thread Robert Haas
On Wed, Mar 28, 2012 at 10:43 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Mar 28, 2012 at 9:19 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Mar 27, 2012 at 10:10 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Mar 28, 2012 at 5:17 AM, Robert Haas rh...@postgresql.org wrote:
 pg_test_timing utility, to measure clock monotonicity and timing cost.

 When I compiled this, I got a compiler warning. Attached patch
 silences the warning.

 Unfortunately, that *produces* a warning on my machine.  Normally, I
 think we handle this using INT64_FORMAT, but the fact that it's %10ld
 here and not just %lld makes that awkward.  I guess we maybe need to
 insert some kludgy workaround here - write it into a separate buffer,
 and then blank-pad it, or something like that.

 This seems a simplest workaround. How about attached patch?

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Merlin Moncure
On Thu, Mar 29, 2012 at 10:04 AM, Andrew Dunstan and...@dunslane.net wrote:
 1. I've been in discussion with some people about adding simple JSON extract
 functions. We already have some (i.e. xpath()) for XML.

I've built a couple of applications that push data in and out of XML
via manual composition going out and xpath coming in.  TBH, I found
this to be a pretty tedious way of developing a general application
structure and a couple of notches down from the more SQL-driven
approach.  Not that jsonpath/xpath aren't wonderful functions -- but I
think for general information passing there's a better way.

Your json work is a great start in marrying document-level database
features with a relational backend.  My take is that storing rich data
inside the database in json format, while tempting, is generally a
mistake.  Unless the document is a black box, it should be decomposed
and stored relationally and marked back up into a document as it goes
out the door.  This is why brevity and flexibility of syntax is so
important when marshaling data in and out of transport formats.  It
encourages people to take the right path and get the best of both
worlds -- a rich backend with strong constraints that can speak
natively, such that writing data-driven web services is easy.

What I'm saying is that jsonpath probably isn't the whole story:
another way of bulk moving json into native backend structures without
parsing would also be very helpful.  For example, being able to cast a
json document into a record or a record array would be just amazing.

merlin



Re: [HACKERS] Command Triggers patch v18

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 4:32 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 I did that another way in previous incarnations of the patch, which was
 to allow for INSTEAD OF event trigger backed by a SECURITY DEFINER
 function. When the extension is whitelisted, prevent against recursion
 then CREATE EXTENSION in the security definer function, then signal that
 the execution should now be aborted.

 That was too dangerous given the lack of policy about where exactly the
 user code is fired, but I think we could now implement that for some of
 the event timing specs we're listing. Only some of them, I guess only
 those that are happening before we lock the objects.

Oh, right: I remember that now.  I still think it's a bad way to do
it, because the trigger potentially has a lot of work to do to
reconstruct a working command string, and it still ends up getting
executed by the wrong user.  For CREATE EXTENSION it's not that bad,
because the arguments to the command are so simple, but of course any
time we extend the CREATE EXTENSION syntax, the trigger needs to know
about it too, whether it's security-relevant or not, and doing
something similar with, say, ALTER TABLE would be ridiculously
complicated.  I think there is a use case for what you called an
INSTEAD OF trigger, but I don't believe in this one.  It seems to me
that there's a lot of power in being able to *just* intercept the
security decision and then let the rest of the command go about its
business.  Of course, you have to avoid conflating security checks (like,
you must own the table in order to drop a column) with integrity
checks (like, you can't drop a column from pg_class), but I think
that's not very hard to get right.

 More generally, my thought on the structure of this is that you're
 going to have certain toplevel events, many of which will happen at
 only a single place in the code, like an object got dropped or a
 DDL command started or a DDL command ended.  So we give those
 names, like sql_drop, ddl_command_start, and ddl_command_end.  Inside

 I really dislike mixing sql_drop and ddl_command_start as being the same
 kind of objects here, even if I can bend my head at the right angle and
 see that it's a fair view when looking at how it's implemented. I can't
 see a way to explain that to users without having to explain to them how
 drop cascade is implemented.

 So my proposal here is to “fake” a “proper” subcommand thanks to the new
 context variable. If you DROP TYPE foo CASCADE and that in turn drops a
 function foo_in(), then an event trigger is fired with

   context = 'DROP TYPE'
   tag = 'DROP FUNCTION'

 Same idea when you DROP TABLE … CASCADE and a SEQUENCE and a bunch of
 indexes need to disappear too: you get a usual event trigger fired with
 the context set to 'DROP TABLE' this time.

 I don't think we need to arrange for explicitly publishing the context
 specific information here. If we need to, we have to find the right
 timing spec where we can guarantee still being in the top level command
 and where we already have the details filled in, then users can attach a
 trigger here and register the information for themselves.

I'm not sure I understand how you're using the words context and
tag.  I think for a drop trigger I would want the function to
receive this information: type of object dropped, OID of object
dropped, column number in the case of a column drop, flag indicating
whether it's a toplevel drop or a cascaded drop.  I wouldn't object to
also making the currently-in-context toplevel command tag available,
but I think most drop triggers wouldn't really care, so I wouldn't
personally spend much implementation effort on it if it turns out to
be hard.

But in general, I don't really know what a proper subcommand is or
why some subcommands should be more proper than others, or why we
should even be concerned about whether something is a subcommand at
all.  I think it's fine and useful to have triggers that fire in those
kinds of places, but I don't see why we should limit ourselves to
that.  For applications like replication, auditing, and enhanced
security, the parse tree and subcommand/non-subcommand status of a
particular operation are irrelevant.  What you need is an exact
description of the operation that got performed (e.g. the default on
table X column Y got dropped); you might be able to reverse-engineer
that from the parse tree, but it's much better to have the system pass
you the information you need more directly.  Certainly, there are
cases where you might want to have the parse tree, or even the raw
command text, available, but I'm not even convinced that those
cases will be the most commonly used ones unless, of course, they're
the only ones we offer, in which case everyone will go down that path
by necessity.

 your trigger procedure, the set of magic variables that is available
 will depend on which toplevel event you set the trigger on, but
 hopefully all firings of that toplevel event can 

Re: [HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Arun Chaitanya
Thanks a lot Heikki.
I have already posted an example in the mail.
The link to the paper is
http://www.iith.ac.in/~ravig/courses/cs5050/papers/decorrelation-cesar.pdf

Hope this helps,
Arun

On Fri, Mar 30, 2012 at 5:32 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 (off-list)

 You'll want to post a link to the paper, otherwise people who are not GSoC
 mentors will have no idea what you're talking about ;-). Posting an example
 query and access plans would illustrate the point, too.


 On 30.03.2012 14:33, Arun Chaitanya wrote:

 Hi,

 I wanted to take up this as a GSOC 2012 project.

 SQL supports nested queries. When the inner query contains a
 correlation variable the present optimizer takes an iterative
 execution plan. If the inner query scans over a relation, the
 iterative plan chosen can be sub-optimal.

 The goal of this project is to enable De-correlation for all possible
 cases. The iterative execution plan can be converted to a set oriented
 plan by taking a join over the base relations involved. Sufficient
 work has already been done in this area and has been implemented in
 SQL Server.

 The changes required to incorporate the above-mentioned strategy are,
 to the best of my knowledge, in the rewriting phase. The key idea is
 to introduce the APPLY operator in the raw parse tree. In the
 above-mentioned papers, the author describes the process of removing
 the Apply operator and proposes a set of rules which allow us to
 achieve the goal. The PostgreSQL community has already done some work
 along these lines in the optimizer, for simple subqueries involving
 =, <, > in the predicates [ I observed it by looking at the EXPLAIN
 output for relevant queries ]. The optimization is not done for
 subqueries containing aggregates, or for existential and containment
 queries.

 An example query from TPCH benchmark discussed by the author:

 select c_custkey
 from customer
 where 100 < (select sum(o_totalprice)
 from orders
 where o_custkey = c_custkey)

 In the above case, c_custkey is a correlation variable (parameterized)
 coming from the outer query. Hence in the present system, the inner
 query is executed as many times as the tuples in the customer
 relation. As the subQuery involves a scan over orders relation, the
 total I/O cost involved is pretty high.

 Using the transformation proposed by the author, the query can be
 re-written as

 select c_custkey
 from customer left outer join
 orders on o_custkey = c_custkey
 group by c_custkey
 having 100 < sum(o_totalprice)

 This allows the optimizer to choose a better plan for the left outer join
 and avoid iterative execution. The idea here is not to get a rewritten
 query as output but to generate a plan tree that does the same as the
 above query.

 Regards,
 Arun



 --
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Arun Chaitanya
Thanks Robert,

Yes. I think I am being over-ambitious, as I have never had any open
source development experience.
Anyway, please go through the idea. I have posted the link to the
paper in one of the replies.

Please suggest other options which I could take up as a GSOC 2012 project.

On Fri, Mar 30, 2012 at 5:34 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Mar 30, 2012 at 7:33 AM, Arun Chaitanya chaitan64a...@gmail.com 
 wrote:
 I wanted to take up this as a GSOC 2012 project.

 This would be a great query planner optimization but the chances of
 getting it done in one summer as a GSoC project seem to me to be nil.
 You've never had a patch accepted before; picking a massive
 reorganization of the query planner as your first project is not going
 to work out well.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Tom Lane
Arun Chaitanya chaitan64a...@gmail.com writes:
 The link to the paper is
 http://www.iith.ac.in/~ravig/courses/cs5050/papers/decorrelation-cesar.pdf

Given the authorship of that paper, I'd have to wonder whether Microsoft
has filed for any patents regarding these ideas.

regards, tom lane



Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Andrew Dunstan



On 03/30/2012 09:57 AM, Merlin Moncure wrote:
What I'm saying is that jsonpath probably isn't the whole story: 
another way of bulk moving json into native backend structures without 
parsing would also be very helpful. For example, being able to cast a 
json document into a record or a record array would be just amazing.


It should be possible. After all, hstore's populate_record() does that. 
If we married that logic up with the json parser we should be half way 
there.
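
For reference, the hstore route looks like this today (it needs the
hstore extension, and the composite type here is made up just for the
example); a json equivalent would presumably end up looking much the
same:

  create extension hstore;
  create type customer_t as (id int, name text, balance numeric);

  select *
    from populate_record(row(0, '', 0)::customer_t,
                         'id=>42, name=>ACME, balance=>99.50'::hstore);
  -- should yield (42, ACME, 99.50)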


cheers

andrew



Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Dobes Vandermeer
On Fri, Mar 30, 2012 at 4:30 PM, Daniel Farina dan...@heroku.com wrote:

 On Thu, Mar 29, 2012 at 10:55 PM, Dobes Vandermeer dob...@gmail.com
 wrote:
  Virtual hosts. Same port. I think SPDY or like-protocols [...] give a
 crisp treatment to
  interactive, stateful workloads involving
 
  back-and-forth between client and server with multiplexing, fixing
  some problems with the hacks in HTTP-land from before.
 
  It sounds like at some level you're really talking about replacing the
  built-in protocol with SPDY because SPDY is possibly a better baseline
 than
  updating the existing protocol.  That's an interesting idea, I think this
  project could evolve in that direction if there's demand for it.

 If only so there is a smaller set of arbitrary decisions to make about
 how to delimit messages...but if SPDY doesn't get widely deployed, or
 exacts an unacceptable performance penalty, it is game over.


Well, in our case HTTP is a clear win (but not replacement) and SPDY a
potential one (even as a replacement).  Even if SPDY is not widely adopted
it could still replace FEBE if there's a clear advantage to using it, I
don't know enough to make the call right now.


Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Dobes Vandermeer
On Fri, Mar 30, 2012 at 9:57 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Thu, Mar 29, 2012 at 10:04 AM, Andrew Dunstan and...@dunslane.net
 wrote:
  1. I've been in discussion with some people about adding simple JSON
 extract
  functions. We already have some (i.e. xpath()) for XML.

 Your json work is a great start in marrying document level database
 features with a relational backend.  My take is that storing rich data
 inside the database in json format, while tempting, is generally a
 mistake.  Unless the document is black box it should be decomposed and
 stored relationally and marked back up into a document as it goes out
 the door.


I found storing documents (with no schema) in CouchDB very sexy.  For
analytics purposes it's very handy to throw whatever data you've got into a
document and save it to the database and sort out the schema later, when
you have a need for it.  It could be you want to allow hundreds of
different fields in the document, and the document has a hierarchical
structure, and so on ... and yet there's no interest in spending time
figuring out how to map all that effectively into a SQL schema.  Some
things might start out a black box but become interesting for reporting
purposes later.

While I was doing this I always thought this would have been a better
approach for my previous project, an accounting application.  If I could
just have stored entities like invoice & customer as a single document that
is inserted, updated, etc. atomically it would be a lot simpler and faster
than having to break things out into columns and rows spread over various
tables.  Some fields are just information that isn't searched for or used
relationally but ends up getting a field (or a whole bunch of fields, like
your address) in the DB schema anyway.  Adding, removing, and changing
columns is always scary to me in the SQL database - when there's no schema
you can fix these things as part of the definition of your view instead.

So I kind of think the document database kind of bridges the gap between an
OODBMS and the RDBMS because the document is like a little cluster of
objects and then you populate your relations by mapping a view of those
documents and use that for analysis and search.

What I'm saying is that jsonpath probably isn't the whole story:
 another way of bulk moving json into native backend structures without
 parsing would also be very helpful.  For example, being able to cast a
 json document into a record or a record array would be just amazing.


Hmm whatever that exactly is, it does sound cool!


Re: [HACKERS] Command Triggers patch v18

2012-03-30 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 Oh, right: I remember that now.  I still think it's a bad way to do
 it, because the trigger potentially has a lot of work to do to
 reconstruct a working command string, and it still ends up getting
 executed by the wrong user.  For CREATE EXTENSION it's not that bad,

That's true, I'm only kind of saying that the INSTEAD OF keyword still
makes sense (instead of security_checks). I agree that the feature is
simpler to use the way you propose it.

   context = 'DROP TYPE'
   tag = 'DROP FUNCTION'

 I'm not sure I understand how you're using the words context and

Yeah, context is not explicit; we could call that toplevel: the command
tag of the command that the user typed. When toplevel is null, the event
trigger is fired on a command the user sent; when it's not null, the
trigger is fired on some inner command operation.

 tag.  I think for a drop trigger I would want the function to
 receive this information: type of object dropped, OID of object
 dropped, column number in the case of a column drop, flag indicating
 whether it's a toplevel drop or a cascaded drop.  I wouldn't object to
 also making the currently-in-context toplevel command tag available,
 but I think most drop triggers wouldn't really care, so I wouldn't
 personally spend much implementation effort on it if it turns out to
 be hard.

I'm not sure it would be hard as I'm only seeing a single depth possible
here, so a single per-backend string static variable would do.

 But in general, I don't really know what a proper subcommand is or
 why some subcommands should be more proper than others, or why we
 should even be concerned about whether something is a subcommand at
 all.  I think it's fine and useful to have triggers that fire in those
 kinds of places, but I don't see why we should limit ourselves to
 that.  For applications like replication, auditing, and enhanced
 security, the parse tree and subcommand/non-subcommand status of a
 particular operation are irrelevant.  What you need is an exact

Not really. When replicating, you could perfectly well say that you only
replicate the toplevel DROP, because the replica will also do the cascade
dance, and you might have decided not to replicate all related objects on
the other side.

The information you really don't want to miss is when only the
cascaded object is part of the replication, not the main one. That was
not covered by my previous patch, but now we have a way to cover it.

 description of the operation that got performed (e.g. the default on
 table X column Y got dropped); you might be able to reverse-engineer
 that from the parse tree, but it's much better to have the system pass
 you the information you need more directly.  Certainly, there are
 cases where you might want to have the parse tree, or even the raw
 command text, available, but I'm not even convinced that that those
 cases will be the most commonly used ones unless, of course, they're
 the only ones we offer, in which case everyone will go down that path
 by necessity.

There are far too many variants and cases of our commands to be able to
extract their parameters in a flat way (a bunch of variables compared to
a nested description à la JSON or XML), and I don't think such a flat
representation is going to be much better than the parse tree.

Now, we will later be able to offer a normalized rewritten command
string from the parse tree to the user, but I don't see us adding support
for that for cascaded drops, which is one other reason why I like to
expose them as subcommands.

 Again, I'm not understanding the distinction between toplevel events
 and sub-events.  I don't see any need for such a distinction.  I think
 there are just events, and some of them happen at command start/end
 and others happen somewhere in the middle.  As long as it's a safe and
 useful place to fire a trigger, who cares?

I guess your head is too heavily in the code side of things, as opposed
to the SQL user's viewpoint. Maybe my attempt to reconcile both views is
not appropriate, but I really do think it is.

 Given the scope of this mini expression language, we can easily bypass
 calling the executor in v1 here, and reconsider later if we want to
 allow calling a UDF in the WHEN clause… I don't think it's an easy
 feature to add in, though.

 Or a necessary one.  AFAICS, the main benefit of WHEN clauses on

Exactly.

 regular triggers is that you can prevent the AFTER trigger queue from
 getting huge, and maybe a save a little on the cost of invoking a
 trigger function just to exit again.  But neither of those should be
 relevant here - nobody does that much DDL, and anybody writing command
 triggers should understand that this is advanced magic not intended
 for beginners.  Wizards below level 10 need not apply.

Here it's only a facility to manage your event trigger code
organization, and I insisted on having it in the syntax because in the
event trigger grammar I don't see another place where to stuff 

Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer dob...@gmail.com wrote:
 Well, in our case HTTP is a clear win (but not replacement) and SPDY a
 potential one (even as a replacement).  Even if SPDY is not widely adopted
 it could still replace FEBE if there's a clear advantage to using it, I
 don't know enough to make the call right now.

I can see that there are some advantages to having an HTTP interface
to the database, but I think throwing our existing protocol out the
window or relegating it to the status of a second-class citizen would
be foolish.  HTTP is a non-trivial protocol that tends to impose lots
of escaping and de-escaping overhead which is unnecessary for people
who just want to connect to the database and run queries.  I can
completely understand that someone might want the ability to do GET
/db/table/pk and have that return an answer very, very quickly, by
bypassing the usual parser and planner and just firing off an
index-scan and returning the results as JSON or somesuch.  But I think
it would be a serious mistake to assume that GET /q?q=myquery is going
to come out better than what we have now in the general case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Thu, Mar 29, 2012 at 06:56:30PM -0400, Tom Lane wrote:
 Marko Kreen mark...@gmail.com writes:
  My conclusion is that row-processor API is low-level expert API and
  quite easy to misuse.  It would be preferable to have something more
  robust as end-user API, the PQgetRow() is my suggestion for that.
  Thus I see 3 choices:
 
  1) Push row-processor as main API anyway and describe all dangerous
 scenarios in documentation.
  2) Have both PQgetRow() and row-processor available in libpq-fe.h,
 PQgetRow() as preferred API and row-processor for expert usage,
 with proper documentation what works and what does not.
  3) Have PQgetRow() in libpq-fe.h, move row-processor to libpq-int.h.
 
 I still am failing to see the use-case for PQgetRow.  ISTM the entire
 point of a special row processor is to reduce the per-row processing
 overhead, but PQgetRow greatly increases that overhead.

No, decreasing CPU overhead is a minor win.  I guess in a realistic
application, like dblink, you can't even measure the difference.

The *major* win comes from avoiding buffering of all rows in PGresult.
Of course, this is noticeable only with bigger resultsets.
I guess such buffering pessimizes memory usage: the code always
works on a cold cache.  And simply keeping RSS low is good for the
long-term health of a process.

The second major win is avoiding the need to use a cursor with small
chunks to access a resultset of unknown size, which stalls the
application until the next block arrives from the network.


The PGresult *PQgetRow() is for applications that do not convert
rows immediately to some internal format, but keep using PGresult.
So they can be converted to row-by-row processing with minimal
changes to actual code.

Note that PGrowValue is a temporary struct that the application *must*
move data away from.  If an app internally uses PGresult, then it's
pretty annoying to invent a new internal format for long-term
storage.

But maybe I'm overestimating the number of such applications.

 And it doesn't
 reduce complexity much either IMO: you still have all the primary risk
 factors arising from processing rows in advance of being sure that the
 whole query completed successfully.

It avoids the complexity of:

* How to pass error from callback to upper code

* Needing to know how exceptions behave

* How to use early exit to pass rows to upper code one-by-one,
  (by storing the PGresult and PGrowValue in temp place
   and later checking their values)

* How to detect that new resultset has started. (keeping track
  of previous PGresult or noting some quirky API behaviour
  we may invent for such case)

* Needing to make sure the callback does not leak to call-sites
  that expect regular libpq behaviour.
  (Always call PQregisterRowProcessor(db, NULL, NULL) after query finishes )
  [But now I'm in exception handler, how do I find the connection?]


I've now reviewed the callback code and even done some coding with it
and IMHO it's too low-level to be an end-user API.

Yes, the query-may-still-fail complexity remains, but that's not unlike
the usual multi-statement-transaction-is-not-guaranteed-to-succeed
complexity.

Another complexity that remains is how-to-skip-current-resultset,
but that is a problem only on sync connections and the answer is
simple - call PQgetResult().  Or call PQgetRow/PQrecvRow if
user wants to avoid buffering.

 Plus it conflates no more data
 with there was an error receiving the data or there was an error on
 the server side.

Well, the current PQgetRow() is written in this style: return only a
single-row PGresult; to see errors the user must call PQgetResult().
Basically, the user is forced to fall back to the familiar libpq usage
pattern.

It can be changed, so that PQgetRow() returns also errors.

Or we can drop it and just keep PQrecvRow().

 PQrecvRow alleviates the per-row-overhead aspect of
 that but doesn't really do a thing from the complexity standpoint;
 it doesn't look to me to be noticeably easier to use than a row
 processor callback.

 I think PQgetRow and PQrecvRow just add more API calls without making
 any fundamental improvements, and so we could do without them.  There's
 more than one way to do it is not necessarily a virtue.

Please re-read the above list of problematic situations that this API
fixes.  Then, if you still think that PQrecvRow() is pointless, sure,
let's drop it.

We can also postpone it to 9.3, to poll users on whether they want an
easier API or whether maximum performance is more important.  (PQrecvRow()
*does* have a few cycles of overhead compared to callbacks.)

The only option that we have on the table for 9.2 but not later
is moving the callback API to libpq-int.h.

  Second conclusion is that current dblink row-processor usage is broken
  when user uses multiple SELECTs in SQL as dblink uses plain PQexec().
 
 Yeah.  Perhaps we should tweak the row-processor callback API so that
 it gets an explicit notification that this is a new resultset.
 Duplicating PQexec's behavior would then involve having the dblink row
 

Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Thu, Mar 29, 2012 at 06:56:30PM -0400, Tom Lane wrote:
 Marko Kreen mark...@gmail.com writes:
 Second conclusion is that current dblink row-processor usage is broken
 when user uses multiple SELECTs in SQL as dblink uses plain PQexec().

 Yeah.  Perhaps we should tweak the row-processor callback API so that
 it gets an explicit notification that this is a new resultset.
 Duplicating PQexec's behavior would then involve having the dblink row
 processor throw away any existing tuplestore and start over when it
 gets such a call.
 
 There's multiple ways to express that but the most convenient thing
 from libpq's viewpoint, I think, is to have a callback that occurs
 immediately after collecting a RowDescription message, before any
 rows have arrived.  So maybe we could express that as a callback
 with valid res but columns set to NULL?
 
 A different approach would be to add a row counter to the arguments
 provided to the row processor; then you'd know a new resultset had
 started if you saw rowcounter == 0.  This might have another advantage
 of not requiring the row processor to count the rows for itself, which
 I think many row processors would otherwise have to do.

 Try to imagine how final documentation will look like.

 Then imagine documentation for PGrecvRow() / PQgetRow().

What's your point, exactly?  PGrecvRow() / PQgetRow() aren't going to
make that any better as currently defined, because there's noplace to
indicate this is a new resultset in those APIs either.

regards, tom lane



Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Fri, Mar 30, 2012 at 11:59:12AM -0400, Tom Lane wrote:
 Marko Kreen mark...@gmail.com writes:
  On Thu, Mar 29, 2012 at 06:56:30PM -0400, Tom Lane wrote:
  Marko Kreen mark...@gmail.com writes:
  Second conclusion is that current dblink row-processor usage is broken
  when user uses multiple SELECTs in SQL as dblink uses plain PQexec().
 
  Yeah.  Perhaps we should tweak the row-processor callback API so that
  it gets an explicit notification that this is a new resultset.
  Duplicating PQexec's behavior would then involve having the dblink row
  processor throw away any existing tuplestore and start over when it
  gets such a call.
  
  There's multiple ways to express that but the most convenient thing
  from libpq's viewpoint, I think, is to have a callback that occurs
  immediately after collecting a RowDescription message, before any
  rows have arrived.  So maybe we could express that as a callback
  with valid res but columns set to NULL?
  
  A different approach would be to add a row counter to the arguments
  provided to the row processor; then you'd know a new resultset had
  started if you saw rowcounter == 0.  This might have another advantage
  of not requiring the row processor to count the rows for itself, which
  I think many row processors would otherwise have to do.
 
  Try to imagine how final documentation will look like.
 
  Then imagine documentation for PGrecvRow() / PQgetRow().
 
 What's your point, exactly?  PGrecvRow() / PQgetRow() aren't going to
 make that any better as currently defined, because there's noplace to
 indicate this is a new resultset in those APIs either.

Have you looked at the examples?  PQgetResult() is a pretty good hint
that one resultset has finished...

-- 
marko




Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Andrew Dunstan



On 03/30/2012 11:41 AM, Robert Haas wrote:

On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer dob...@gmail.com wrote:

Well, in our case HTTP is a clear win (but not replacement) and SPDY a
potential one (even as a replacement).  Even if SPDY is not widely adopted
it could still replace FEBE if there's a clear advantage to using it, I
don't know enough to make the call right now.

I can see that there are some advantages to having an HTTP interface
to the database, but I think throwing our existing protocol out the
window or relegating it to the status of a second-class citizen would
be foolish.



Right, I can't imagine it happening. And I wouldn't really be keen to 
add an alternative protocol either.


I could imagine a client which presented a SPDY interface to the world 
and translated it into standard calls, possibly via libpq.


It's well to remember that we are not a green fields project here.



HTTP is a non-trivial protocol that tends to impose lots
of escaping and de-escaping overhead which is unnecessary for people
who just want to connect to the database and run queries.  I can
completely understand that someone might want the ability to do GET
/db/table/pk and have that return an answer very, very quickly, by
bypassing the usual parser and planner and just firing off an
index-scan and returning the results as JSON or somesuch.  But I think
it would be a serious mistake to assume that GET /q?q=myquery is going
to come out better than what we have now in the general case.



Indeed.

cheers

andrew




[HACKERS] tracking context switches with perf record

2012-03-30 Thread Robert Haas
I've been playing around with perf record on the IBM POWER7 machine,
and it's pretty cool.  One of the things I don't like is that some of
the tools don't produce very nice text reports that you can cut and
paste into an email - it's kind of a text-based GUI where you can zoom
around and look at things.  Still, it's nice.

One of the neat things I've just discovered is that you can use a
call-graph report on context-switches to find out what user code is
resulting in context-switching.  For example, I ran a 32-client
pgbench test at scale factor 300 for half an hour and captured data
for 10 minutes relatively near the beginning.  The partially expanded
call tree for postgres looks like this:

   - 87.79% .schedule_timeout
      + 99.99% .unix_stream_recvmsg
   - 11.63% .SyS_semtimedop
        .SyS_ipc
        syscall_exit
        semop
      - PGSemaphoreLock
         - 98.48% LWLockAcquire
            - 53.11% XLogInsert
               + 48.39% heap_update
               + 18.15% heap_insert
               + 16.36% RecordTransactionCommit
               + 15.67% log_heap_clean
               + 0.98% _bt_insertonpg
            + 18.71% ProcArrayEndTransaction
            - 14.77% SimpleLruReadPage_ReadOnly
                 TransactionIdGetStatus
                 TransactionLogFetch
               - TransactionIdDidCommit
                  + 53.52% HeapTupleSatisfiesVacuum
                  + 45.00% HeapTupleSatisfiesMVCC
                  + 0.82% HeapTupleHeaderAdvanceLatestRemovedXid
            + 6.08% TransactionIdSetPageStatus
            + 2.63% LockBuffer
            + 1.92% GetSnapshotData
            + 1.51% SimpleLruReadPage
         + 1.52% ProcSleep
   + 0.50% do_work

So what you can see here is that all the context switches that
happened in postgres were the result of either a .schedule_timeout or
a .SyS_semtimedop.  87.79% of the context switches came from
.schedule_timeout, and 99.99% of those came from unix_stream_recvmsg.
If you expand that branch of the call tree, you find that all of them
are coming eventually from secure_read; the server is waiting for a
new query from the client.  This is, obviously, overhead we can't
eliminate from this test; waiting for the client is part of the job.
The other 11.63% of the context switches result from PGSemaphoreLock
calling semop.  Of those, 98.48% come from LWLockAcquire and 1.52%
come from ProcSleep.  The ProcSleep calls are all from blocking while
attempting to acquire a heavyweight lock; and nearly all of them are
from conflicting tuple locks.  However, they account for such a small
portion of the overall context switches that they're probably not
worth worrying about.

The more interesting waits, in my view anyway, are the ones that come
from LWLockAcquire, which account for nearly all of the semaphore
sleeps.  As you can see, XLogInsert accounts for over half of those,
and ProcArrayEndTransaction and SimpleLruReadPage_ReadOnly account for
most of the rest.  Just out of curiosity, I expanded a couple of those
another level, and you can see what's going on there, too.  I was
slightly surprised by the number of calls to log_heap_clean, since I
wouldn't have expected vacuuming to generate that much work; it turns
out that 99.9% of those are HOT pruning events happening during query
execution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Fri, Mar 30, 2012 at 7:04 PM, Marko Kreen mark...@gmail.com wrote:
 Have you looked at the examples?  PQgetResult() is pretty good hint
 that one resultset finished...

Ok, the demos are scattered around this long thread and hard to find,
so here is a summary of links:

Original design mail:

  http://archives.postgresql.org/message-id/20120224154616.ga16...@gmail.com

First patch with quick demos:

  http://archives.postgresql.org/message-id/20120226221922.ga6...@gmail.com

Demos as diff:

  http://archives.postgresql.org/message-id/20120324002224.ga19...@gmail.com

Demos/experiments/tests (bit messier than the demos-as-diffs):

  https://github.com/markokr/libpq-rowproc-demos

Note - the point is that the user *must* call PQgetResult() when a
resultset ends.  Hence also the decision that PQgetRow() does not
return errors.

I'll put this mail on the commitfest page too; it seems I've forgotten
to put some mails there.

-- 
marko



[HACKERS] checkpointer code behaving strangely on postmaster -T

2012-03-30 Thread Alvaro Herrera

I noticed while doing some tests that the checkpointer process does not
recover very nicely after a backend crashes under postmaster -T (after
all processes have been kill -CONTd, of course, and postmaster told to
shut down via Ctrl-C on its console).  For some reason it seems to get
stuck in a loop doing sleep(0.5s).  In another case I caught it trying to
do a checkpoint, but it was progressing by a single page each time and then
sleeping.  In that condition, the checkpoint took a very long time to
finish.

Pressing Ctrl-C in the postmaster console at that point does not have
any effect.

-- 
Álvaro Herrera alvhe...@alvh.no-ip.org

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Uppercase tab completion keywords in psql?

2012-03-30 Thread Peter Eisentraut
On fre, 2012-03-23 at 07:52 -0700, David Fetter wrote:
 On Thu, Mar 22, 2012 at 06:05:30PM -0400, Andrew Dunstan wrote:
  On 03/22/2012 05:49 PM, Bruce Momjian wrote:
  Robert Haas and I are disappointed by this change.  I liked the
  fact that I could post nice-looking SQL queries without having to
  use my capslock key (which I use as a second control key).  Any
  chance of reverting this change?
  
  
  Should it be governed by a setting?
 
 Something like (upper|lower|preserve) ?

How about this patch then?  (There are actually four possible settings,
see patch.)

diff --git i/doc/src/sgml/ref/psql-ref.sgml w/doc/src/sgml/ref/psql-ref.sgml
index b849101..be9d37d 100644
--- i/doc/src/sgml/ref/psql-ref.sgml
+++ w/doc/src/sgml/ref/psql-ref.sgml
@@ -2652,6 +2652,22 @@ bar
   </varlistentry>
 
   <varlistentry>
+<term><varname>COMP_KEYWORD_CASE</varname></term>
+<listitem>
+<para>
+Determines which letter case to use when completing an SQL key word.
+If set to <literal>lower</literal> or <literal>upper</literal>, the
+completed word will be in lower or upper case, respectively.  If set
+to <literal>preserve-lower</literal>
+or <literal>preserve-upper</literal> (the default), the completed word
+will be in the case of the word already entered, but words being
+completed without anything entered will be in lower or upper case,
+respectively.
+</para>
+</listitem>
+  </varlistentry>
+
+  <varlistentry>
 <term><varname>DBNAME</varname></term>
 <listitem>
 <para>
diff --git i/src/bin/psql/tab-complete.c w/src/bin/psql/tab-complete.c
index 6f481bb..00d87d5 100644
--- i/src/bin/psql/tab-complete.c
+++ w/src/bin/psql/tab-complete.c
@@ -682,7 +682,7 @@ static char **complete_from_variables(char *text,
 		const char *prefix, const char *suffix);
 static char *complete_from_files(const char *text, int state);
 
-static char *pg_strdup_same_case(const char *s, const char *ref);
+static char *pg_strdup_keyword_case(const char *s, const char *ref);
 static PGresult *exec_query(const char *query);
 
 static void get_previous_words(int point, char **previous_words, int nwords);
@@ -3048,7 +3048,7 @@ create_or_drop_command_generator(const char *text, int state, bits32 excluded)
 	{
 		if ((pg_strncasecmp(name, text, string_length) == 0) &&
 			!(words_after_create[list_index - 1].flags & excluded))
-			return pg_strdup_same_case(name, text);
+			return pg_strdup_keyword_case(name, text);
 	}
 	/* if nothing matches, return NULL */
 	return NULL;
@@ -3335,9 +3335,9 @@ complete_from_list(const char *text, int state)
 			if (completion_case_sensitive)
 				return pg_strdup(item);
 			else
-				/* If case insensitive matching was requested initially, return
-				 * it in the case of what was already entered. */
-				return pg_strdup_same_case(item, text);
+				/* If case insensitive matching was requested initially, adjust
+				 * the case according to setting. */
+				return pg_strdup_keyword_case(item, text);
 		}
 	}
 
@@ -3374,9 +3374,9 @@ complete_from_const(const char *text, int state)
 		if (completion_case_sensitive)
 			return pg_strdup(completion_charp);
 		else
-			/* If case insensitive matching was requested initially, return it
-			 * in the case of what was already entered. */
-			return pg_strdup_same_case(completion_charp, text);
+			/* If case insensitive matching was requested initially, adjust the
+			 * case according to setting. */
+			return pg_strdup_keyword_case(completion_charp, text);
 	}
 	else
 		return NULL;
@@ -3484,27 +3484,48 @@ complete_from_files(const char *text, int state)
 
 
 /*
- * Make a pg_strdup copy of s and convert it to the same case as ref.
+ * Make a pg_strdup copy of s and convert the case according to
+ * COMP_KEYWORD_CASE variable, using ref as the text that was already entered.
  */
 static char *
-pg_strdup_same_case(const char *s, const char *ref)
+pg_strdup_keyword_case(const char *s, const char *ref)
 {
 	char *ret, *p;
 	unsigned char first = ref[0];
+	int		tocase;
+	const char *varval;
+
+	varval = GetVariable(pset.vars, "COMP_KEYWORD_CASE");
+	if (!varval)
+		tocase = 0;
+	else if (strcmp(varval, "lower") == 0)
+		tocase = -2;
+	else if (strcmp(varval, "preserve-lower") == 0)
+		tocase = -1;
+	else if (strcmp(varval, "preserve-upper") == 0)
+		tocase = +1;
+	else if (strcmp(varval, "upper") == 0)
+		tocase = +2;
+	else
+		tocase = 0;
 
-	if (isalpha(first))
-	{
-		ret = pg_strdup(s);
-		if (islower(first))
-			for (p = ret; *p; p++)
-				*p = pg_tolower((unsigned char) *p);
-		else
-			for (p = ret; *p; p++)
-				*p = pg_toupper((unsigned char) *p);
-		return ret;
-	}
+	/* default */
+	if (tocase == 0)
+		tocase = +1;
+
+	ret = pg_strdup(s);
+
+	if (tocase == -2
+		|| ((tocase == -1 || tocase == +1) && islower(first))
+		|| (tocase == -1 && !isalpha(first))
+		)
+		for (p = ret; *p; p++)
+			*p = pg_tolower((unsigned char) *p);
 	else
-		return pg_strdup(s);
+		for (p = ret; *p; p++)
+			*p = pg_toupper((unsigned char) *p);

Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Daniel Farina
On Fri, Mar 30, 2012 at 9:11 AM, Andrew Dunstan and...@dunslane.net wrote:


 On 03/30/2012 11:41 AM, Robert Haas wrote:

 On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer dob...@gmail.com
  wrote:

 Well, in our case HTTP is a clear win (but not replacement) and SPDY a
 potential one (even as a replacement).  Even if SPDY is not widely
 adopted
 it could still replace FEBE if there's a clear advantage to using it, I
 don't know enough to make the call right now.

 I can see that there are some advantages to having an HTTP interface
 to the database, but I think throwing our existing protocol out the
 window or relegating it to the status of a second-class citizen would
 be foolish.



 Right, I can't imagine it happening. And I wouldn't really be keen to add an
 alternative protocol either.

 I could imagine a client which presented a SPDY interface to the world and
 translated it into standard calls, possibly via libpq.

Any enhancement here that can't be used with libpq via, say, drop-in
.so seems unworkable to me, and that's why any solution that is
basically proxying to the database is basically a non-starter outside
the very earliest prototyping stages.  The tuple scanning and protocol
semantics can and even should remain the same, especially at first.

What I really want is for a mechanism to push FEBE messages down into
a transport in an extendable way (not unlike SSL, except this time
sensitive to FEBE message boundaries), taking unmodified libpq
binaries (but with a hook, of course...) into consideration.
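
Concretely (and purely hypothetically -- nothing like this exists in libpq
today, and every name below is invented), the kind of hook I have in mind
looks something like:

  #include <stddef.h>
  #include <sys/types.h>

  /* Hypothetical sketch: FEBE message semantics stay untouched, but an
   * alternate transport can take over framing, e.g. by wrapping each
   * message in a SPDY data frame. */
  typedef struct PQtransportHooks
  {
      /* send one complete FEBE message (type byte + payload) */
      int     (*send_message) (void *private_data, char msg_type,
                               const char *payload, size_t len);

      /* receive the next complete FEBE message; returns length or -1 */
      ssize_t (*recv_message) (void *private_data, char *msg_type,
                               char *buf, size_t bufsize);

      void   *private_data;     /* e.g. per-connection SPDY stream state */
  } PQtransportHooks;

  /* and, hypothetically, a registration call by analogy with how SSL is
   * layered in:
   *   int PQsetTransportHooks(PGconn *conn, const PQtransportHooks *hooks);
   */

Again: a sketch of the shape of the thing, not a concrete proposal for the
libpq API.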

 It's well to remember that we are not a green fields project here.

Perhaps not, which is a good reason to not put cart before horse --
I've tried to be very careful to suggest that real useful
caching/proxy/routing software needs to appear that can work,
unmodified, with hypothetical FEBE-over-SPDY communications, and
accomplish some interesting use cases, and also very careful to
suggest this is a *transport*-level feature.  I think it would be fair
to demand: "Okay, when Apache or NGINX or a credible embedded web
server in some language can accomplish something useful for us if we
absorb this complexity, let's put this on the table."

To have to continuously write poolers, cachers and redirectors
(for any reason: HA, partitioning, or even just clean-looking domain
names...) is a burden on community resources. The criterion for success
is to expand the number of projects attacking these problems and
decrease the cost of writing one, and I think demanding a straw-man
implementation of, say, a query cacher or pooler and seeing how that
implementation feels is not unreasonable along with accepting the
burden of maintenance.

Robert Haas wrote:
 HTTP is a non-trivial protocol that tends to impose lots
 of escaping and de-escaping overhead which is unnecessary for people
 who just want to connect to the database and run queries.  I can
 completely understand that someone might want the ability to do GET
 /db/table/pk and have that return an answer very, very quickly, by
 bypassing the usual parser and planner and just firing off an
 index-scan and returning the results as JSON or somesuch.  But I think
 it would be a serious mistake to assume that GET /q?q=myquery is going
 to come out better than what we have now in the general case.

I don't think this holds for SPDY, which is the only way I even
entertain thoughts of doing this.  I want FEBE, at least at first and
probably even then for quite a long while, just with framing and
headers (during start-up) that other software will understand.

Control frames are like this:

+----------------------------------+
|C| Version(15bits) | Type(16bits) |
+----------------------------------+
| Flags (8)  |  Length (24 bits)   |
+----------------------------------+
|               Data               |
+----------------------------------+

Data frames are like this:

+----------------------------------+
|C|       Stream-ID (31bits)       |
+----------------------------------+
| Flags (8)  |  Length (24 bits)   |
+----------------------------------+
|               Data               |
+----------------------------------+
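
To make the byte layout concrete, here is a small C sketch of decoding that
common 8-byte header; the struct and function names are invented for
illustration and are not from libpq or any SPDY library:

  #include <stdint.h>
  #include <string.h>

  typedef struct SpdyFrameHeader
  {
      int         is_control;   /* the leading 'C' bit                */
      uint16_t    version;      /* control frames only, 15 bits       */
      uint16_t    type;         /* control frames only                */
      uint32_t    stream_id;    /* data frames only, 31 bits          */
      uint8_t     flags;
      uint32_t    length;       /* 24-bit payload length              */
  } SpdyFrameHeader;

  static void
  spdy_decode_header(const uint8_t buf[8], SpdyFrameHeader *hdr)
  {
      /* both words are big-endian on the wire */
      uint32_t    word1 = ((uint32_t) buf[0] << 24) | (buf[1] << 16) |
                          (buf[2] << 8) | buf[3];
      uint32_t    word2 = ((uint32_t) buf[4] << 24) | (buf[5] << 16) |
                          (buf[6] << 8) | buf[7];

      memset(hdr, 0, sizeof(*hdr));
      hdr->is_control = (word1 >> 31) & 1;
      if (hdr->is_control)
      {
          hdr->version = (word1 >> 16) & 0x7FFF;
          hdr->type = word1 & 0xFFFF;
      }
      else
          hdr->stream_id = word1 & 0x7FFFFFFF;
      hdr->flags = (word2 >> 24) & 0xFF;
      hdr->length = word2 & 0x00FFFFFF;
  }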

The (XML, not HTML, I read this with view source, or try curl) RFC
draft is here:

http://mbelshe.github.com/SPDY-Specification/draft-mbelshe-spdy-00.xml


-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Greg Stark
On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas robertmh...@gmail.com wrote:
 If you expand that branch of the call tree, you find that all of them
 are coming eventually from secure_read; the server is waiting for a
 new query from the client.  This is, obviously, overhead we can't
 eliminate from this test; waiting for the client is part of the job.

Fwiw this isn't necessarily true. How does the absolute number of
these events compare with the number of pg_bench operations done? If
it's significantly more the server could be reading on sockets while
there are partial commands there and it might be more efficient to
wait until the whole command is ready before reading. It may be that
this indicates that pg_bench is written in an inefficient way and it
should pipeline more commands but of course optimizing pg_bench is
kind of missing the point.

Also incidentally context switches is one of the things getrusage
shows so I'm still hoping to have that per-plan-node though that's
orthogonal to what this tool gives you with the call graph.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Daniel Farina
On Fri, Mar 30, 2012 at 10:21 AM, Daniel Farina dan...@heroku.com wrote:
 Any enhancement here that can't be used with libpq via, say, drop-in
 .so seems unworkable to me, and that's why any solution that is
 basically proxying to the database is basically a non-starter outside
 the very earliest prototyping stages.  The tuple scanning and protocol
 semantics can and even should remain the same, especially at first.

I should add: proxying could work well if libpq had all the right
hooks. The server could remain ignorant.  Regardless, upstream changes
result.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas robertmh...@gmail.com wrote:
 If you expand that branch of the call tree, you find that all of them
 are coming eventually from secure_read; the server is waiting for a
 new query from the client.  This is, obviously, overhead we can't
 eliminate from this test; waiting for the client is part of the job.

 Fwiw this isn't necessarily true. How does the absolute number of
 these events compare with the number of pg_bench operations done? If
 it's significantly more the server could be reading on sockets while
 there are partial commands there and it might be more efficient to
 wait until the whole command is ready before reading. It may be that
 this indicates that pg_bench is written in an inefficient way and it
 should pipeline more commands but of course optimizing pg_bench is
 kind of missing the point.

Well, that would be on libpq's head if it were true, but I believe we're
fairly careful to not flush the output buffer until we're sending a
complete message.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 1:30 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas robertmh...@gmail.com wrote:
 If you expand that branch of the call tree, you find that all of them
 are coming eventually from secure_read; the server is waiting for a
 new query from the client.  This is, obviously, overhead we can't
 eliminate from this test; waiting for the client is part of the job.

 Fwiw this isn't necessarily true. How does the absolute number of
 these events compare with the number of pg_bench operations done?

Since perf is a stochastic profiler, I don't think there's any way to
measure that using this tool.

 If
 it's significantly more the server could be reading on sockets while
 there are partial commands there and it might be more efficient to
 wait until the whole command is ready before reading. It may be that
 this indicates that pg_bench is written in an inefficient way and it
 should pipeline more commands but of course optimizing pg_bench is
 kind of missing the point.

Well, to some degree.  I think most clients are going to send commands
one at a time and wait for the responses.  OTOH, if, for example, libpq
is sending the data inefficiently in some way, that would be worth
fixing; lots of people use libpq.

 Also incidentally context switches is one of the things getrusage
 shows so I'm still hoping to have that per-plan-node though that's
 orthogonal to what this tool gives you with the call graph.

Yeah.

Actually, what is really bugging me is that I cannot find any way of
getting a profile that reflects the *time* spent waiting rather than
merely the *number* of waits.  That seems like an obvious thing to
want, and I cannot find a way to get it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] psql: tab-completion fails SET var=

2012-03-30 Thread Erik Rijkers
(in hopes that the current changes to tab-completion will help to get this 
fixed)

tab-completion goes wrong on SET setting=...

example:

If you want to input set search_path=myschema; without spaces around '=',
and you try tab-completion halfway through the schema name:

set search_path=mysch
  ^tab

then the initial part of the schema name ('mysch') gets removed and replaced 
with 'TO'.

So that you now have:

set search_path=TO <cursor>

You lose the part of the schema name that was already input.

With spaces that doesn't happen, but after all tab-completion is about avoiding 
keystrokes and errors.

Tab-completion is great and this search_path-annoyance happens to me all the 
time; my fingers
can't seem to learn this exception.

(pgsql 9.2devel, problem also in latest 9.1.3)


Thanks,


Erik Rijkers




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade incorrectly equates pg_default and database tablespace

2012-03-30 Thread Jeff Davis
On Thu, 2012-03-22 at 14:55 +0200, Ants Aasma wrote:
 Hi,
 
 while working on a support case I stumbled upon a bug in pg_upgrade.
 Upgrade fails with No such file or directory when a database is
 moved to a non-default tablespace and contains a table that is moved
 to pg_default. The cause seems to be that the following test
 incorrectly equates empty spclocation with database tablespace:
 
 tblspace = PQgetvalue(res, relnum, i_spclocation);
 /* if no table tablespace, use the database tablespace */
 if (strlen(tblspace) == 0)
 tblspace = dbinfo->db_tblspace;
 
 Patch to fix this is attached.

I confirmed this bug upgrading 9.1 to master, and that this patch fixes
it. Thank you for the report!

Patch looks good to me as well, with one very minor nitpick: the added
comment is missing an apostrophe.

Bruce, can you take a look at this?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Wed, Mar 07, 2012 at 03:14:57PM +0900, Kyotaro HORIGUCHI wrote:
 My suggestion - check in getAnotherTuple whether resultStatus is
 already error and do nothing then.  This allows internal pqAddRow
 to set regular out of memory error.  Otherwise give generic
 row processor error.

 Current implement seems already doing this in
 parseInput3(). Could you give me further explanation?

 The suggestion was about getAnotherTuple() - currently it sets
 always error in row processor.  With such check, the callback
 can set the error result itself.  Currently only callbacks that
 live inside libpq can set errors, but if we happen to expose
 error-setting function in outside API, then the getAnotherTuple()
 would already be ready for it.

I'm pretty dissatisfied with the error reporting situation for row
processors.  You can't just decide not to solve it, which seems to be
the current state of affairs.  What I'm inclined to do is to add a
char ** parameter to the row processor, and say that when the
processor returns -1 it can store an error message string there.
If it does so, that's what we report.  If it doesn't (which we'd detect
by presetting the value to NULL), then use a generic "error in row
processor" message.  This is cheap and doesn't prevent the row processor
from using some application-specific error reporting method if it wants;
but it does allow the processor to make use of libpq's error mechanism
when that's preferable.
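
To make the convention concrete, here is a sketch of a conforming row
processor.  Apart from the "const char **" slot and the return -1 behaviour
described above, the parameter list and the RowBuffer type are assumptions
made up for the sake of a self-contained example:

  #include <libpq-fe.h>

  /* hypothetical application-side state passed through the void *param slot */
  typedef struct RowBuffer
  {
      char      **rows;
      int         nrows;
      int         maxrows;
  } RowBuffer;

  static int
  my_row_processor(PGresult *res, void *param, const char **errmsgp)
  {
      RowBuffer  *buf = (RowBuffer *) param;

      (void) res;               /* a real processor would copy the row out */

      if (buf->nrows >= buf->maxrows)
      {
          /* constant string only: no allocation while already in trouble */
          *errmsgp = "row buffer full";
          return -1;
      }

      buf->nrows++;
      return 1;                 /* success */
  }

Leaving *errmsgp set to NULL on a -1 return would then produce the generic
message instead.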

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Fri, Mar 30, 2012 at 05:18:42PM -0400, Tom Lane wrote:
 Marko Kreen mark...@gmail.com writes:
  On Wed, Mar 07, 2012 at 03:14:57PM +0900, Kyotaro HORIGUCHI wrote:
  My suggestion - check in getAnotherTuple whether resultStatus is
  already error and do nothing then.  This allows internal pqAddRow
  to set regular out of memory error.  Otherwise give generic
  row processor error.
 
  Current implement seems already doing this in
  parseInput3(). Could you give me further explanation?
 
  The suggestion was about getAnotherTuple() - currently it sets
  always error in row processor.  With such check, the callback
  can set the error result itself.  Currently only callbacks that
  live inside libpq can set errors, but if we happen to expose
  error-setting function in outside API, then the getAnotherTuple()
  would already be ready for it.
 
 I'm pretty dissatisfied with the error reporting situation for row
 processors.  You can't just decide not to solve it, which seems to be
 the current state of affairs.  What I'm inclined to do is to add a
 char ** parameter to the row processor, and say that when the
 processor returns -1 it can store an error message string there.
 If it does so, that's what we report.  If it doesn't (which we'd detect
 by presetting the value to NULL), then use a generic error in row
 processor message.  This is cheap and doesn't prevent the row processor
 from using some application-specific error reporting method if it wants;
 but it does allow the processor to make use of libpq's error mechanism
 when that's preferable.

Yeah.

But such an API seems to require specifying an allocator, which seems ugly.
I think it would be better to just use Kyotaro's original idea
of PQsetRowProcessorError(), which is nicer to use.

Few thoughts on the issue:

--

As libpq already provides quite good coverage of PGresult
manipulation APIs, then how about:

  void PQsetResultError(PGresult *res, const char *msg);

that does:

  res->errMsg = pqResultStrdup(msg);
  res->resultStatus = PGRES_FATAL_ERROR;

that would also cause minimal fuss in getAnotherTuple().



I would actually like even more:

  void PQsetConnectionError(PGconn *conn, const char *msg);

that does full-blown libpq error logic.  Thus it would be 
useful everywhere in libpq.  But it seems a bit too disruptive,
so I would like an ACK from somebody who knows libpq better.
(well, from you...)

---

Another thought - if we have an API to set the error from *outside*
of the row-processor callback, that would immediately solve the
"how to skip an incoming resultset without buffering it" problem.

And it would be usable for PQgetRow()/PQrecvRow() too.

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Fri, Mar 30, 2012 at 05:18:42PM -0400, Tom Lane wrote:
 I'm pretty dissatisfied with the error reporting situation for row
 processors.  You can't just decide not to solve it, which seems to be
 the current state of affairs.  What I'm inclined to do is to add a
 char ** parameter to the row processor, and say that when the
 processor returns -1 it can store an error message string there.

 But such API seems to require specifying allocator, which seems ugly.

Not if the message is a constant string, which seems like the typical
situation (think "out of memory").  If the row processor does need a
buffer for a constructed string, it could make use of some space in its
void *param area, for instance.

 I think it would be better to just use Kyotaro's original idea
 of PQsetRowProcessorError() which nicer to use.

I don't particularly care for that idea because it opens up all sorts of
potential issues when such a function is called at the wrong time.
Moreover, you have to remember that the typical situation here is that
we're going to be out of memory or otherwise in trouble, which means
you've got to be really circumspect about what you assume will work.
Row processors that think they can do a lot of fancy message
construction should be discouraged, and an API that requires
construction of a new PGresult in order to return an error is right out.
(This is why getAnotherTuple is careful to clear the failed result
before it tries to build a new one.  But that trick isn't going to be
available to an external row processor.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Andres Freund
On Friday, March 30, 2012 06:27:36 PM Robert Haas wrote:
 I've been playing around
 with perf record on the IBM POWER7 machine, and it's pretty cool.  One
 of the things I don't like is that some of the tools don't produce very
 nice text reports that you can cut and paste into an email - it's kind of
 a text-based GUI where you can zoom around and look at things.  Still,
 it's nice.
You can use --stdio to force a more plain-text interface (you need to run perf 
annotate manually if you want the line-by-line view).  Especially if 
you pipe it through cat or so.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Sat, Mar 31, 2012 at 1:13 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
 On Fri, Mar 30, 2012 at 05:18:42PM -0400, Tom Lane wrote:
 I'm pretty dissatisfied with the error reporting situation for row
 processors.  You can't just decide not to solve it, which seems to be
 the current state of affairs.  What I'm inclined to do is to add a
 char ** parameter to the row processor, and say that when the
 processor returns -1 it can store an error message string there.

 But such API seems to require specifying allocator, which seems ugly.

 Not if the message is a constant string, which seems like the typical
 situation (think out of memory).  If the row processor does need a
 buffer for a constructed string, it could make use of some space in its
 void *param area, for instance.

If it's specified as a string that libpq does not own, then I'm fine with it.

 I think it would be better to just use Kyotaro's original idea
 of PQsetRowProcessorError() which nicer to use.

 I don't particularly care for that idea because it opens up all sorts of
 potential issues when such a function is called at the wrong time.
 Moreover, you have to remember that the typical situation here is that
 we're going to be out of memory or otherwise in trouble, which means
 you've got to be really circumspect about what you assume will work.
 Row processors that think they can do a lot of fancy message
 construction should be discouraged, and an API that requires
 construction of a new PGresult in order to return an error is right out.
 (This is why getAnotherTuple is careful to clear the failed result
 before it tries to build a new one.  But that trick isn't going to be
 available to an external row processor.)

Kyotaro's original idea was to assume out-of-memory if error
string was not set, thus the callback needed to set the string
only when it really had something to say.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Sat, Mar 31, 2012 at 1:13 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Not if the message is a constant string, which seems like the typical
 situation (think out of memory).  If the row processor does need a
 buffer for a constructed string, it could make use of some space in its
 void *param area, for instance.

 If it's specified as string that libpq does not own, then I'm fine with it.

Check.  Let's make it const char ** in fact, just to be clear on that.

 (This is why getAnotherTuple is careful to clear the failed result
 before it tries to build a new one.  But that trick isn't going to be
 available to an external row processor.)

 Kyotaro's original idea was to assume out-of-memory if error
 string was not set, thus the callback needed to set the string
 only when it really had something to say.

Hmm.  We could still do that in conjunction with the idea of returning
the string from the row processor, but I'm not sure if it's useful or
just overly cute.

[ thinks... ]  A small advantage of assuming NULL means that is that
we could postpone the libpq_gettext("out of memory") call until after
clearing the overflowed PGresult, which would greatly improve the odds
of getting a nicely translated result and not just ASCII.  Might be
worth it just for that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-03-30 Thread Albe Laurenz
I wrote:
 How about getting # of rows estimate by executing EXPLAIN for
 fully-fledged remote query (IOW, contains pushed-down WHERE clause), and
 estimate selectivity of local filter on the basis of the statistics
 which are generated by FDW via do_analyze_rel() and FDW-specific
 sampling function?  In this design, we would be able to use quite
 correct rows estimate because we can consider filtering stuffs done on
 each side separately, though it requires expensive remote EXPLAIN for
 each possible path.
 
 That sounds nice.

... but it still suffers from the problems of local statistics
for remote tables I pointed out.

I think that these shortcomings are not justified by the gain of
one client-server round trip less during planning.  I'd prefer
if pgsql_fdw were not dependent on remote statistics stored in the
local database.

Yours,
Laurenz Albe


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standbys, txid_current_snapshot, wraparound

2012-03-30 Thread Andrey Velikoredchanin
2012/3/29 Marko Kreen mark...@gmail.com

 I can't find a place where WAL replay updates values under XLogCtl.
 If that really does not happen, that would explain why standbys can
 see wrong epoch.

 No clue yet how master can get broken.


Details about environment:

Debian

Linux db 2.6.32-5-amd64 #1 SMP Fri Sep 9 20:23:16 UTC 2011 x86_64 GNU/Linux

# SELECT version();
 PostgreSQL 9.1.1 on x86_64-unknown-linux-gnu, compiled by gcc-4.4.real
(Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit


Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed

2012-03-30 Thread Jay Levitt

Tom Lane wrote:

Ants Aasma a...@cybertec.at writes:

A user complained on pgsql-performance that SELECT col FROM table
GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
to return tuples from hash-aggregate as they are found when no
aggregate functions are in use. Attached is a first shot at that.


As I commented in the other thread, the user would be a lot better off
if he'd had an index on the column in question. I'm not sure it's worth
complicating the hashagg logic when an indexscan + groupagg would
address the case better.


Would this patch help in the case where "table" is actually a set-returning 
function, and thus can't have an index? (I don't yet know enough about the 
tree to know when hashaggs get used.) I'm wondering if this is a useful 
exception to the "restrictions can't get pushed down through GROUP BYs" rule.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Jeff Janes
On Fri, Mar 30, 2012 at 9:27 AM, Robert Haas robertmh...@gmail.com wrote:

 The more interesting waits, in my view anyway, are the ones that come
 from LWLockAcquire, which account for nearly all of the semaphore
 sleeps.  As you can see, XLogInsert accounts for over half of those,
 and ProcArrayEndTransaction and SimpleLruReadPage_ReadOnly account for
 most of the rest.  Just out of curiosity, I expanded a couple of those
 another level, and you can see what's going on there, too.  I was
 slightly surprised by the number of calls to log_heap_clean, since I
 wouldn't have expected vacuuming to generate that much work; it turns
 out that 99.9% of those are HOT pruning events happening during query
 execution.

I've generally seen a one to one correspondence between log_heap_clean
and updates to pgbench_accounts (not by sampling waits, but by
wal_debug=1).

I figured this was because the pgbench_accounts blocks are packed to
the gills by default, so the only way to make room for the next HOT
update is to prune the dead tuple left over from the previous HOT
update in that same block.

It might be interesting to see if those go away if run with pgbench -i
-F90 or some other value less than 100.

Which I've now done, and it actually goes the other way.  with -F50,
not only do almost all pgbench_accounts updates still lead to a clean,
but a good chunk of the updates to pgbench_tellers lead to a clean as
well.  If cleans are going to happen so often, would it make sense to
write a combined record for clean+hot_update?

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 10:07 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Which I've now done, and it actually goes the other way.  with -F50,
 not only do almost all pgbench_accounts updates still lead to a clean,
 but a good chunk of the updates to pgbench_tellers lead to a clean as
 well.  If cleans are going to happen so often, would it make sense to
 write a combined record for clean+hot_update?

I think that'd be hard to arrange.  You might think that the prune
happens when we go to update and find that there isn't room on the
page, but that's actually not how it works at all:

# Ideally we would do defragmenting only when we are about to attempt
# heap_update on a HOT-safe tuple.  The difficulty with this approach
# is that the update query has certainly got a pin on the old tuple, and
# therefore our attempt to acquire a buffer cleanup lock will always fail.
# (This corresponds to the idea that we don't want to move the old tuple
# out from under where the query's HeapTuple pointer points.  It might
# be possible to finesse that, but it seems fragile.)

So what we do instead is:

# The currently planned heuristic is to prune and defrag when first accessing
# a page that potentially has prunable tuples (as flagged by the pd_prune_xid
# page hint field) and that either has free space less than MAX(fillfactor
# target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to
# find enough free space to store an updated tuple version.  (These rules
# are subject to change.)

So, in other words, the prune happens before we have any idea that
we're going to attempt an update.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] measuring lwlock-related latency spikes

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 1:55 PM, Robert Haas robertmh...@gmail.com wrote:
 Actually, what is really bugging me is that I cannot find any way of
 getting a profile that reflects the *time* spent waiting rather than
 merely the *number* of waits.  That seems like an obvious thing to
 want, and I cannot find a way to get it.

I ended up taking a stab at writing some userspace code for this - see
attached.  It measures the amount of time taken for each contended
lwlock acquisition and prints a crude histogram of the results at
postmaster-shutdown time.  At least on the IBM POWER7 machine, running
pgbench with 32 threads, where pg_test_timing shows that getting the
time takes less than a microsecond 96%+ of the time, this seemed to
have no real impact on the tps numbers - perhaps because the workload
is I/O bound.  Risking the possible ire of people who object to large
attachments, I'm attaching the results this generated on a 30-minute,
32-thread pgbench run at scale factor 300.  To minimize said ire, I've
run the file through bzip2.

What's interesting about this is that, while there is plenty of
waiting for the usual suspects - ProcArrayLock (4), WALInsertLock (7),
and CLogControlLock (11), the waits are all pretty short:

2012-03-31 02:33:25 UTC [50305] LOG:  lock 0: 2:520, 3:
2012-03-31 02:33:25 UTC [50305] LOG:  lock 3: 1:9, 2:36838
2012-03-31 02:33:25 UTC [50305] LOG:  lock 4: 1:33, 2:216964
2012-03-31 02:33:25 UTC [50305] LOG:  lock 7: 1:39, 2:406249
2012-03-31 02:33:25 UTC [50305] LOG:  lock 8: 1:4, 2:34
2012-03-31 02:33:25 UTC [50305] LOG:  lock 11: 1:99, 2:374559
2012-03-31 02:33:25 UTC [50305] LOG:  lock 17: 2:24, 3:24

That's saying that there were over 400,000 contended acquisitions of
WALInsertLock, but the longest one had fls(wait_time_in_us) = 2, or in
other words it took less than 4us to get the lock.  So what happens if
we grep the log file for the biggest offenders?

2012-03-31 02:33:25 UTC [50305] LOG:  lock 204610: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 272958: 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 325412: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 325784: 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 360016: 5:1, 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 444886: 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 499890: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 533418: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 610484: 6:1, 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 897798: 22:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1027128: 7:1, 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1074256: 5:1, 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1132586: 5:1, 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1312178: 16:1, 22:1

If I'm reading this right, fls(wait_time_in_us) = 20 means a 1 second
delay, which I think means that a couple of those waits were >= *8
seconds*.  gdb reveals that in the test configuration, all odd
numbered locks between 169 and 2097319 are some buffer's
io_in_progress_lock, and all even numbered locks between 170 and
2097320 are some buffer's content_lock, which means, if I'm not
confused here, that every single lwlock-related stall > 1s happened
while waiting for a buffer content lock.  Moreover, each event
affected a different buffer.  I find this result so surprising that I
have a hard time believing that I haven't screwed something up, so if
anybody can check over the patch and this analysis and suggest what
that thing might be, I would appreciate it.  I would be a lot less
surprised if the waits revolved around the IO-in-progress locks, since
it's not that hard to imagine an I/O taking a really long time on a
busy system.  But I didn't think we were ever supposed to hold content
locks for that long.
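
For anyone decoding the bucket numbers: the histogram key is
fls(wait_time_in_us), so bucket n covers waits in the range [2^(n-1), 2^n)
microseconds.  A standalone sketch of the arithmetic (my_fls here is just a
portable stand-in, not code from the patch):

  #include <stdio.h>

  /* number of the highest set bit, 1-indexed; 0 for x == 0 */
  static int
  my_fls(unsigned long x)
  {
      int         n = 0;

      while (x)
      {
          n++;
          x >>= 1;
      }
      return n;
  }

  int
  main(void)
  {
      int         buckets[] = {2, 20, 23};

      for (int i = 0; i < 3; i++)
      {
          int         n = buckets[i];
          double      lo = (1UL << (n - 1)) / 1e6;
          double      hi = (1UL << n) / 1e6;

          printf("bucket %d: %.6f s <= wait < %.6f s\n", n, lo, hi);
      }
      printf("fls(1000000) = %d\n", my_fls(1000000));  /* 1 second -> 20 */
      return 0;
  }

So bucket 20 is roughly the half-second-to-one-second range, and bucket 23
tops out around 8.4 seconds.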

The other thing that baffles me about these numbers is that they don't
provide any fodder for explaining the periodic drops in throughput
that happen when the system checkpoints.  I had assumed they would
show up as long lwlock waits, like somebody hanging on to
WALInsertLock while everybody else in the system piles up behind them.
 That's not reflected in these numbers - the few very long waits show
just ONE guy waiting a really long time for the lock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


lwhistogram.patch
Description: Binary data


log.wp.lwhistogram.32.300.1800.bz2
Description: BZip2 compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers