[HACKERS] Arbitrary file size limit in twophase.c

2008-05-13 Thread Gavin Sherry
There's an apparently arbitrary limit of 10,000,000 bytes in twophase.c
on the size of a two-phase commit file. I can't see why this limit
exists.

I hit this limit by creating a prepared transaction which included
dropping a schema with about 25,000 objects in it and then trying to
commit it. The result is that ReadTwoPhaseFile() returns NULL and
FinishPreparedTransaction() says it detected a corrupted file.

Thoughts?

Thanks,

Gavin



Re: [HACKERS] Arbitrary file size limit in twophase.c

2008-05-13 Thread Gavin Sherry
On Tue, May 13, 2008 at 10:34:23AM -0400, Tom Lane wrote:
 Gavin Sherry [EMAIL PROTECTED] writes:
  There's an apparently arbitary limit of 10,000,000 bytes in twophase.c
  on the size of a two phase commit file. I can't see why this limit
  exists.
 
 The comment seems perfectly clear about why the limit exists:
 
  * Check file length.  We can determine a lower bound pretty easily. We
  * set an upper bound mainly to avoid palloc() failure on a corrupt file.

Oops. Where was my brain?

 Perhaps it'd be better to use malloc() than palloc(), so that we'd not
 lose control on out-of-memory, and then deem the file too big only
 if we couldn't malloc the space.

That seems better.

Thanks,

Gavin



Re: [HACKERS] Declarative partitioning grammar

2008-01-17 Thread Gavin Sherry
On Mon, Jan 14, 2008 at 10:45:28PM -0500, Tom Lane wrote:
 Jeff Cohen [EMAIL PROTECTED] writes:
  In the proposed solution, hash and list partitions work for all types  
  that support an equality operator, and range partitions work for all  
  types that support fully-ordered comparison.
 
 Surely a hashing method would require a *hashable* equality operator,
 ie a hash opclass; likewise range partitions would demand a matching
 btree opclass.  You could do list partitions with an equality operator
 of either kind.

Right.

 
 Essentially all of the system's current knowledge about the properties
 of specific operators is encoded as operator classes for one of these
 two built-in index types.  If you want to make assumptions about the
 behavior of an operator, it really needs to be founded on these types of
 opclasses --- or else you're buying into inventing a comparable amount
 of infrastructure for some other organizational concept.

Right, we obviously don't want to do that.

 
 I think Peter's point was that you might want to think about
 generalizing your concepts so that other kinds of operator classes could
 someday serve as the foundations for other kinds of partitioning rules.

Let me see if I've understood: certain operator classes either describe
or allow certain kinds of partitioning: hash is obvious, btree allows
equality and range based approaches, gist allows users a whole range of
possibilities. So, a truly extensible system would define the
partitioning type in the catalog?

That's an interesting idea. It presents problems, I think, for the
grammar I've proposed because some grammatical constructs are tied to
range, some to hash, some to list. Any insights into how we could
achieve both?
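
For concreteness, the knowledge Tom refers to can already be seen in the
catalogs -- which operator classes exist for a given type under each
index access method. An untested sketch (column names assume a recent
pg_opclass layout):

SELECT am.amname, opc.opcname
FROM pg_opclass opc
JOIN pg_am am ON am.oid = opc.opcmethod
WHERE opc.opcintype = 'date'::regtype;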

Thanks,

Gavin

PS: Heading off into the French country side for a little while and responses
may be a little slow.



Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Gavin Sherry
On Tue, Jan 15, 2008 at 10:36:17AM -0500, Tom Lane wrote:
 Markus Schiltknecht [EMAIL PROTECTED] writes:
  Jeff Cohen wrote:
  If you don't define a default partition to handle outliers,  the 
  insert should fail with an error.
 
  IMO, you should always have a default partition, then, so as not to 
  violate the constraints (by rejecting tuples which are correct according 
  to the constraints).
 
 I don't agree with that at all.  I can imagine plenty of situations
 where a tuple falling outside the range of available partitions *should*
 be treated as an error.  For instance, consider timestamped observations
 --- data in the future is certainly bogus, and data further back than
 you want to deal with must be an entry error as well.
 
 I agree that there needs to be a way to have a default partition,
 but there needs to be a way to not have one, too.

Jeff and I discussed this and we came to the same conclusion. We will
propose grammar for handling it. Many users we talk to would fall into the
class of people who would want an error if the data fell outside the
defined partitions.
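
To give a flavour of what we have in mind (a hypothetical sketch only,
not the final grammar), a catch-all could be declared explicitly, and
simply omitted when out-of-range data should raise an error:

...   PARTITION BY RANGE (order_date)
  (
PARTITION y2007 start (date '2007-01-01') end (date '2008-01-01'),
DEFAULT PARTITION other_orders
  );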

Thanks,

Gavin



Re: [HACKERS] Declarative partitioning grammar

2008-01-14 Thread Gavin Sherry
On Sat, Jan 12, 2008 at 04:01:19PM +0530, NikhilS wrote:
 Hi,
 
  We did look at allowing general functions for partitioning and this
  was one concern.  The other is that we want to enforce that a row
  only gets inserted into a single partition, so we wanted a
  declarative syntax where it was relatively easy to check that range
  and list specifications don't overlap.
 
 
 Detection of mutually exclusive ranges might not turn out to be so easy
 afterall. I think there is some code in the constraint_exclusion area which
 might help out in this.

In some prototyping code it didn't seem too difficult but if we've made
a mistake we might have to look at the CE code.

Thanks,

Gavin




Re: [HACKERS] Declarative partitioning grammar

2008-01-12 Thread Gavin Sherry
On Sat, Jan 12, 2008 at 05:47:30PM +, Simon Riggs wrote:
 On Sat, 2008-01-12 at 01:59 +0100, Gavin Sherry wrote:
  The syntax is half the problem, performance is the other.
 
 The syntax looks great to me, but I think it is about 5% of the problem,
 maybe less. I don't really have any questions about the syntax, but I
 may have thoughts when the implementation details emerge.

Yes, that's for another thread. Since the discussion was about using
grammar to control partitions I wanted to get some grammar out. More
details on other stuff soon.

 
 I'm not sure you'll be able to use PARTITION BY since its part of the
 SQL Standard for Windowed grouping, which we do hope to implement one
 day. It will be confusing to have two completely separate meanings for
 the one phrase in our grammar.

I think it's fine. It doesn't cause conflicts in the grammar (in fact,
the Greenplum grammar implements both meanings right now with no
confusion).

Thanks,

Gavin



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-11 Thread Gavin Sherry
On Fri, Jan 11, 2008 at 08:07:18AM +, Simon Riggs wrote:
 On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:
  On Thu, Jan 10, 2008 at 09:30:10PM +, Simon Riggs wrote:
     We cannot perform partition exclusion using this type of WHERE clause at
     planning time because the CURRENT DATE function is STABLE.

    We can do the exact same thing -- if it's a direction people want to
    take. In fact, we can do it better/faster because once we've evaluated one
    partition we know that there are no others to evaluate.

   Lost you completely here. I'm explaining to you that *nobody* can solve
   those problems solely at planning time, by definition, so it has to be
   done at execution time. I'm not saying anything about your way, my way.
  
  Sorry, I wasn't clear enough. I was trying to say, if we're going to do
  something in the executor (for right or wrong) the declarative approach
  can do it too. Since there will be partition bounding information
  available, we can do partition selection in the executor (maybe the
  planner should tell us to do it).
 
 Of course. It's an identical situation for both. Regrettably, none of
 your comments about dynamic partitioning and planning were accurate as a
 result. 

That's not true. We will still have planning drive the partition
selection when the predicate is immutable, thus having more accurate
plans. Some but not all plans use current_date and similar.

 
  Okay. As I said above, nothing in declarative partitioning rules out
  partition selection with stable functions. So, we lets do it, assuming
  everyone else thinks it is a good idea.
 
 If you check the archives this was long ago been identified as a
 requirement. And I said exactly the things you said, BTW, when trying to
 say it didn't matter.
 
 I've kept a list of requests for improvement that I can share with you;
 I've always been loathe to publish a list of bad points.

Why, please send them.

Thanks,

Gavin



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-11 Thread Gavin Sherry
On Fri, Jan 11, 2008 at 11:49:50AM +, Simon Riggs wrote:
 On Fri, 2008-01-11 at 10:25 +0100, Gavin Sherry wrote:
   
   Of course. It's an identical situation for both. Regrettably, none of
   your comments about dynamic partitioning and planning were accurate as a
   result. 
  
  That's not true. We will still have planning drive the partition
  selection when the predicate is immutable, thus having more accurate
  plans. 
 
 Not really.
 
 The planner already evaluates stable functions at plan time to estimate
 selectivity against statistics. It can do the same here.
 
 The boundary values can't be completely trusted at plan time because
 they are dynamic, but they're at least as accurate as ANALYZE statistics
 (and probably derived at identical times), so can be used as estimates.
 So I don't see any reason to imagine the plans will be badly adrift.

Okay, it's good that you want the planner to look at those. Did you
consider the point I made about the sheer amount of data the planner
would have to consider for large cases?

 We're back to saying that if the visibility map is volatile, then SE
 won't help you much. I agree with that and haven't argued otherwise.
 Does saying it make us throw away SE? No, at least, not yet and not for
 that reason.

Yes, I'm not against SE; I just think that having only it would mean a
serious regression for larger users. Personally, I think SE would be a
great idea for append-only tables, since it removes the thing I'm most
worried about: the need to VACUUM to 'turn it on'.

 
 SE does what I was looking for it to do, but doesn't do all of what
 you'd like to achieve with partitioning, because we're looking at
 different use cases. I'm sure you'd agree that all large databases are
 not the same and that they can have very different requirements. I'd
 characterise our recent positions on this that I've been focused on
 archival requirements, whereas you've been focused on data warehousing.

I think that sums it up, although I'd also say that declarative
partitioning is suitable for anyone with largish amounts of data who
knows how they want it stored. This points to another case that SE
suits: those who don't know how to -- or (maybe more importantly) don't
care to -- manage their data.

I'll go back to what I said above. SE looks like a good performance
boost for archival read-only data. If we tighten up the definitions of
how some tables can be used -- append only -- then we can remove the
vacuum requirement and also change other characteristics of the storage:
for example, reduced visibility information, compression, etc. These are
hot topics for people with that kind of data.

 The difference really lies in how much activity and of what kind occurs
 on a table. You don't unload and reload archives regularly, nor do you

I couldn't agree more.

 perform random updates against the whole table. I'm sure we'd quickly
 agree that many of the challenges you've seen recently at Greenplum
 would not be possible with core Postgres, so I'm not really trying too
 hard to compete.

There we diverge. Yes, Greenplum produces systems for very large amounts
of data, petabyte range in fact. However, the architecture --
individual postmasters on a single CPU with their own storage -- means
that we see the problems of non-distributed database users when we look
at data at the node level. This is why I say that VACUUMing such
systems, under the SE model, after you do a load of data is just
impossible (I know that over time the cost of vacuum will stabilise, but
there's always the initial data load).

Thanks,

Gavin



[HACKERS] Declarative partitioning grammar

2008-01-11 Thread Gavin Sherry
Hi all,

Many of you will have read the dynamic partitioning thread here:

http://archives.postgresql.org/pgsql-hackers/2008-01/msg00028.php

I've proposed an alternative, grammar-based approach, which we've called
declarative partitioning. The grammar was developed by Jeff Cohen at
Greenplum with some assistance from myself. It is to be completely open
source.

The grammar will need some refinement and we'd appreciate comments,
suggestions, etc. The grammar is designed to address the range of use
cases we have out there.

Basics
--

CREATE TABLE is modified to accept a PARTITION BY clause. This clause
contains one or more partition declarations. The syntax is as follows:

PARTITION BY {partition_type} (column_name[, column_name...])
[PARTITIONS number]
  (
   partition_declaration[, partition_declaration...]
  )

The partition type can be one of HASH, RANGE or LIST. The column names
are the partitioning key. The PARTITIONS sub-clause instructs the system
on the number of partitions to create if we're doing HASH or, in the
case of LIST or RANGE, can act as a safeguard for users who want to
ensure that they do not generate more than a certain number of
partitions.
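
For example, a complete command using the clause might look like this
(table and column names are invented for illustration):

CREATE TABLE orders (
order_id   bigint,
order_date date,
state  text
)
PARTITION BY RANGE (order_date)
  (
PARTITION y2007 start (date '2007-01-01') end (date '2008-01-01'),
PARTITION y2008 start (date '2008-01-01') end (date '2009-01-01')
  );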

We have discussed adding the partition type REFERENCE which is akin to
the LIKE clause in CREATE TABLE. That is, it duplicates the partition
configuration of the specified table. Input would be appreciated.

Partition declarations
--

Hash


...   PARTITION BY HASH(order_date) PARTITIONS 5;

This will create 5 partitions on the column order_date. Inserts will be
distributed roughly evenly across the 5 partitions.

List


...PARTITION BY LIST (state)
   (PARTITION q1_northwest VALUES ('OR', 'WA'),
PARTITION q1_southwest VALUES ('AZ', 'UT', 'NM'),
PARTITION q1_northeast VALUES ('NY', 'VT', 'NJ'),
PARTITION q1_southeast VALUES ('FL', 'GA'),
PARTITION q1_northcentral VALUES ('SD', 'WI'),
PARTITION q1_southcentral VALUES ('OK', 'TX'));

Here, we produce 6 different partitions. The first partition groups
states in the North West of the USA. We introduce here the named
partition concept for clarity.

Range
-

Range has the most expressive grammar. I'll introduce it in steps:

...PARTITION BY RANGE (b)
(  
PARTITION aa start (date '2007-01-01') end (date '2008-01-01'),
PARTITION bb start (date '2008-01-01') end (date '2009-01-01') 
);

Here, we create 2 partitions: aa and bb. Partition aa has the range
2007-01-01 to 2008-01-01; partition bb has the range 2008-01-01 to
2009-01-01. Intervals always pose the question: are the bounds included
in the range? To deal with this we define that the start of the range is
included and the ending bound is not. This can be modified with the
keywords INCLUSIVE and EXCLUSIVE, which change this property on a
rule-by-rule basis.
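
For example (a sketch only), to make a partition's upper bound inclusive:

...PARTITION BY RANGE (b)
   (PARTITION aa start (date '2007-01-01') end (date '2007-12-31') INCLUSIVE,
PARTITION bb start (date '2008-01-01') end (date '2009-01-01')
   );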

It is common for these partitions to follow a pattern, such as one
partition per week, month or year. So, we support the following
specification:

...   PARTITION BY RANGE(order_date)
  (
START (date '2005-12-01') end (date '2007-12-01') 
   EVERY(interval '2 months')
  );

If we like, we can mix the specification a little:

...   PARTITION BY RANGE(order_date)
 ( PARTITION Q1_2005 end (date '2005-04-01'),
   PARTITION Q2_2005 end (date '2005-07-01'),
   PARTITION Q3_2005 end (date '2005-10-01'),
   PARTITION Q4_2005 end (date '2006-01-01'),
   START (date '2006-02-01') end (date '2008-04-01') 
 EVERY (interval '2 weeks')
 );

An interesting result of the flexibility of the grammar we've come up
with is that you can do something like this:

...   PARTITION BY RANGE(order_date)
  ( PARTITION minny end (date '2004-12-01'),
    end (date '2006-12-01'),
    PARTITION maxny start (date '2006-12-01')
  );

Here, when order_date is less than 2004-12-01, we put the data in minny;
when it is between 2004-12-01 and 2006-12-01 we put it in an unnamed
partition; and after that we put it in maxny.

Tablespaces
---

We allow inline tablespace specification, such as:

...   PARTITION BY RANGE(order_date)
  (
PARTITION minny TABLESPACE compress,
start (date '2004-12-01') end (date '2006-12-01') TABLESPACE hot,
PARTITION maxny TABLESPACE compress
  );

I've used the term compress here intentionally. A number of operating
systems now ship file systems which can compress data. Users concerned
about the amount of data they want to keep online can delay the point at
which they need new storage by compressing less regularly used data with
this technique, for a performance cost. Data being used heavily can live
on an uncompressed file system, unaffected.
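
The tablespaces referenced above are just ordinary tablespaces that
happen to sit on different file systems, e.g. (paths invented):

CREATE TABLESPACE compress LOCATION '/mnt/compressed_fs/pgdata';
CREATE TABLESPACE hot LOCATION '/mnt/fast_fs/pgdata';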

Multi-column support


We can have multi-column partitions.

...  PARTITION BY LIST (state, deptno)
  ( 
VALUES ('OR', 1, 

Re: [HACKERS] Declarative partitioning grammar

2008-01-11 Thread Gavin Sherry
On Fri, Jan 11, 2008 at 07:46:36PM -0500, Merlin Moncure wrote:
 On Jan 11, 2008 6:19 PM, Gavin Sherry [EMAIL PROTECTED] wrote:
  Hi all,
 
  Many of you will have read the dynamic partitioning thread here:
 
  http://archives.postgresql.org/pgsql-hackers/2008-01/msg00028.php
 
  I've proposed an alternative approach, which we've called declarative
  partitioning which is grammar based. This grammar was developed by Jeff
  Cohen at Greenplum with some assistance from myself. It is to be
  completely open source.
 
 
 I like the syntax, but what I really want to know is how well this is
 going to work with the query planner.  The current 8.x table
 partitioning mechanism has a lot of issues...it causes more problems
 than it solves unless you are willing to strictly control how you
 query the tables...I usually end up rolling my own.

The syntax is half the problem, performance is the other. I will bring
the performance issues up in another thread. Yes, we are confident that
we can address the performance issues that rule out the existing
partitioning for many applications. We need it for our own stuff! :P

Thanks,

Gavin



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-10 Thread Gavin Sherry
On Thu, Jan 10, 2008 at 07:25:00AM +, Simon Riggs wrote:
 On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
  If the exclusion is executor driven, the planner cannot help but
  create a seq scan plan. The planner will think you're returning 100X
  rows when really you end up returning X rows. After that, all
  decisions made by the planner are totally bogus.
 
 One of the most important queries on large tables is handling
   WHERE Eventdate = CURRENT DATE;

Really? Well, this isn't handled by the current partitioning approach
(i.e., constraint exclusion) so users have worked around it.

 
 We cannot perform partition exclusion using this type of WHERE clause at
 planning time because the CURRENT DATE function is STABLE. 

We can do the exact same thing -- if it's a direction people want to
take. In fact, we can do it better/faster because once we've evaluated one 
partition we know that there are no others to evaluate.

 So it seems clear that we need to make partition exclusion work at
 executor time, whatever else we do.

One example doesn't make the rule. Most people doing range partitioning
are going to be doing either specific dates or date ranges -- i.e.,
constants or things that can be folded to constants by the planner. At
least, that's what I've seen.
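
That is (using an imaginary orders table), the common case looks like
the first query below, which constraint exclusion can already prune at
plan time because the predicate is a constant; the second cannot be
pruned at plan time because current_date is only STABLE:

SELECT count(*) FROM orders WHERE order_date = date '2008-01-10';

SELECT count(*) FROM orders WHERE order_date = current_date;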

Thanks,

Gavin



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-10 Thread Gavin Sherry
On Thu, Jan 10, 2008 at 04:51:04PM +, Simon Riggs wrote:
 On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
 
   If people with large tables like partitioning why is Oracle moving
   towards automated partitioning in 11g? Automated partitioning was one of
  
  Have you used Oracle's partitioning? 
 
 Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

It was just a joke.

 
  What about consistency over time? High levels of control for users if
  they want it? Even simple things like tablespace support? High
  performance unload and load performance much better than now?
  Reliable and consistent query plans? Support for datatypes other than
  those relating to time.
 
 Most of these comments seem like emotional appeals, so refuting them one
 by one is just going to look and smell like an argument, and a fruitless
 one as well since we agree on so many things.

Sigh. You can call those questions what you like, but your approach
doesn't deal with them and the current partitioning approach, for all its
problems, does.

 So, I get the message that you really want the DDL approach and agree
 that you've demonstrated there are use cases that need it that you are
 interested in. That's fine by me as long as we can each work on parts of
 it to get it done. Will it be possible for you to do that?

I assured you offlist that it was. This is something Greenplum needs
right now and I'm being paid to deliver it.

 I feel I can say that because AFAICS we can in principle have both
 dynamic and declarative techniques, even on the same table. Around half
 of the mechanics would be identical anyway.
 
 So from here I say lets work together to get the basic mechanics right.
 My experience with PostgreSQL is that the team/list always comes up with
 a better idea in the end and I'm sure we will this time too.

I agree. Scrutiny of ideas almost always leads to better ideas.

 I have one main technical concern which is the storage model, which is
 the source of the difference between the two types of partitioning we've
 been discussing. I see difficulties originating there, so I will start
 another thread to examine just those items in detail.

I look forward to it. As promised, I'll soon post some syntax of what we
have been working on.

Thanks,

Gavin



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-10 Thread Gavin Sherry
On Thu, Jan 10, 2008 at 09:30:10PM +, Simon Riggs wrote:
   We cannot perform partition exclusion using this type of WHERE clause at
   planning time because the CURRENT DATE function is STABLE. 
  
  We can do the exact same thing -- if it's a direction people want to
  take. In fact, we can do it better/faster because once we've evaluated one 
  partition we know that there are no others to evaluate.
 
 Lost you completely here. I'm explaining to you that *nobody* can solve
 those problems solely at planning time, by definition, so it has to be
 done at execution time. I'm not saying anything about your way, my way.

Sorry, I wasn't clear enough. I was trying to say, if we're going to do
something in the executor (for right or wrong) the declarative approach
can do it too. Since there will be partition bounding information
available, we can do partition selection in the executor (maybe the
planner should tell us to do it).

 
   So it seems clear that we need to make partition exclusion work at
   executor time, whatever else we do.
  
  One example doesn't make the rule. Most people doing range partitioning
  are going to be doing either specific dates or date ranges -- i.e.,
  constants or things that can be folded to constants by the planner. At
  least, that's what I've seen.
 
 It's not always true that planning time = execution time.
 
 Using CURRENT DATE to access current day, week, month is common in many
 applications, as is parameterised SQL for higher transaction rate apps.
 
 We we can't re-write all the SQL, so we need to make it work.
 
 I think if you demand a full function implementation of partitioning,
 you'd better take into account *all* of the requirements.

Okay. As I said above, nothing in declarative partitioning rules out
partition selection with stable functions. So, let's do it, assuming
everyone else thinks it is a good idea.

Thanks,

Gavin



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-09 Thread Gavin Sherry
On Wed, Jan 09, 2008 at 11:47:31AM -0500, Chris Browne wrote:
 [EMAIL PROTECTED] (Simon Riggs) writes:
  I think we have an opportunity to bypass the legacy-of-thought that
  Oracle has left us and implement something more usable.
 
 This seems like a *very* good thing to me, from a couple of
 perspectives.
 
[snip]
 2.  Blindly following what Oracle does has always been a dangerous
 sort of thing to do.
 
 There are two typical risks:
 
   a) There's always the worry that they may have patented some
  part of how they implement things, and if you follow too
  closely, There Be Dragons...

I think that could be equally said of the dynamic partitioning approach.
In fact, it might be more likely since declarative partitioning has been
around for eons.

   b) They have enough billion$ of development dollar$ and
  development re$ource$ that they can follow strategies that
  are too expensive for us to even try to follow.

I don't see that as an argument against the declarative approach.
Reading the details on this thread so far, I think Simon's approach is
probably more complex from an implementation POV.

 
 3.  If, rather than blindly following, we create something at least
 quasi-new, there is the chance of doing fundamentally better.
 
 This very thing happened when it was discovered that IBM had a
 patent on the ARC cacheing scheme; the clock system that emerged
 was a lot better than ARC ever was.

Well, I don't think I'm proposing we /blindly follow/ others. I propose
we choose a grammar which takes the best of what others have tried to
do. Oracle's grammar is hideous, IBM's is too restrictive, for example.

  One major advantage of the dynamic approach is that it can work on
  multiple dimensions simultaneously, which isn't possible with
  declarative partitioning. For example if you have a table of Orders then
  you will be able to benefit from Segment Exclusion on all of these
  columns, rather than just one of them: OrderId, OrderDate,
  RequiredByDate, LastModifiedDate. This will result in some sloppiness
  in the partitioning, e.g. if we fill 1 partition a day of Orders, then
  the OrderId and OrderData columns will start out perfectly arranged. Any
  particular RequiredByDate will probably be spread out over 7 partitions,
  but thats way better than being spread out over 365+ partitions.
 
 I think it's worth observing both the advantages and demerits of this
 together.
 
 In effect, with the dynamic approach, Segment Exclusion provides its
 benefits as an emergent property of the patterns of how INSERTs get
 drawn into segments.
 
 The tendancy will correspondly be that Segment Exclusion will be able
 to provide useful constraints for those patterns that can naturally
 emerge from the INSERTs.

Many people, in my experience, doing the kind of data processing which
benefits from partitioning are regularly loading data, rather than
collecting it in an OLTP fashion. Let's take the easily understandable
example of processing web site traffic. If the amount of data is large
enough to benefit from partitioning, they probably have multiple web
servers and therefore almost certainly multiple log files. If these
files are not sorted into a single file, the records will not have a
naturally progressing chronology: with every file we go back to the
beginning of the period of time the load covers. If you add parallelism
to your load, things look even more different. This means you could end
up with a bunch of partitions, under the dynamic model, which all cover
the same time range.

Then there's the way that really big databases are used (say, up around
Simon's upper bound of 16 TB). It is expensive to keep data online, so
people don't. They're loading and unloading data all the time, to
perform different forms of analysis. A common scenario in the example
above might be to unload all but the current month's data and then load
the same month from the previous year. The unload step must not be
costly (i.e., it should be a TRUNCATE). Then, there's no guarantee that
what they're interested in is the date range at all. They may want to
compare user agents (to look at bot activity). In this case, the
partitioning is across a small list of strings (well, numbers most
likely). Here, the partitions would all have the same range, and
partitioning would be useless.

 We can therefore expect useful constraints for attributes that are
 assigned in some kind of more or less chronological order.  Such
 attributes will include:
 
  - Object ID, if set by a sequence
  - Processing dates
 
 There may be a bit of sloppiness, but the constraints may still be
 useful enough to exclude enough segments to improve efficiency.

I really like the idea of the system being able to identify trends and
patterns in the data (actually useful at the user level) but it's the
uncertainty of how the partitioning will work for different kinds of
data that I don't like.

 
 _On The Other Hand_, there will be 

Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-09 Thread Gavin Sherry
On Wed, Jan 09, 2008 at 02:38:21PM -0500, Chris Browne wrote:
 Ron Mayer [EMAIL PROTECTED] writes:
  Or am I missing something?
 
 Well, this can head in two directions...
 
 1.  Suppose we're not using an organize in CLUSTER order approach.
 
 If the data is getting added in roughly by order of insertion order,
 then there's no reason to expect San Francisco data to be clustered
 together.
 
 Ditto for John Smiths; we can expect them to be as scattered as
 their dates of creation.
 
 1.  Suppose we *are* using an organize in CLUSTER order approach.
 
 In that case, yes, indeed, you can expect segments to contain specific
 ranges of the data.
 
 However, in that case, the ONLY dimension under which the Segment
 Exclusion may be expected to be useful is that of the first column of
 the index being used.  Smith should be useful to SE, but not John,
 because, as a secondary criteria, the first name values will be
 scattered across all segments.

One of the reasons for using partitions is that the usefulness of
indexes breaks down because of the low cardinality of the indexed
column. Also, the random IO can kill your performance.

In my opinion, CLUSTER (and VACUUM, mentioned elsewhere) are not data
management tools.

Thanks,

Gavin



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-09 Thread Gavin Sherry
On Wed, Jan 09, 2008 at 08:17:41PM +, Simon Riggs wrote:
 On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote:
 
  I think Simon's approach is
  probably more complex from an implementation POV.
 
 Much of the implementation is exactly the same, and I'm sure we agree on
 more than 50% of how this should work already. We just need to close in
 on the remainder.
 
 My current opinion on the SE approach is the opposite of the one above,
 though it gets us nowhere just to state it. I'm trying to avoid opinion
 and look at the details, which is the reason my viewpoint recently
 changed in favour of the dynamic approach as the main thrust for
 implementation.
 
 I've written a detailed email this morning to explain where and how the
 problems lie, which are nowhere near the syntax level. I haven't ruled
 out a declarative approach yet, but I need some detailed technical
 review of the issues. I hope you'll be replying to that?

I'm responding to that email separately. It's just a matter of
addressing all the points in the detail they deserve.

 
   3.  If, rather than blindly following, we create something at least
   quasi-new, there is the chance of doing fundamentally better.
   
   This very thing happened when it was discovered that IBM had a
   patent on the ARC cacheing scheme; the clock system that emerged
   was a lot better than ARC ever was.
  
  Well, I don't think I'm proposing we /blindly follow/ others. I propose
  we choose a grammar which takes the best of what others have tried to
  do. Oracle's grammar is hideous, IBM's is too restrictive, for example.
 
 I assume the new grammar is good and if we do go that way, it sounds
 like the right starting place. 
 
One major advantage of the dynamic approach is that it can work on
multiple dimensions simultaneously, which isn't possible with
declarative partitioning. For example if you have a table of Orders then
you will be able to benefit from Segment Exclusion on all of these
columns, rather than just one of them: OrderId, OrderDate,
RequiredByDate, LastModifiedDate. This will result in some sloppiness
in the partitioning, e.g. if we fill 1 partition a day of Orders, then
the OrderId and OrderData columns will start out perfectly arranged. Any
particular RequiredByDate will probably be spread out over 7 partitions,
but thats way better than being spread out over 365+ partitions.
   
   I think it's worth observing both the advantages and demerits of this
   together.
   
   In effect, with the dynamic approach, Segment Exclusion provides its
   benefits as an emergent property of the patterns of how INSERTs get
   drawn into segments.
   
   The tendancy will correspondly be that Segment Exclusion will be able
   to provide useful constraints for those patterns that can naturally
   emerge from the INSERTs.
  
  Many people, in my experience, doing the kind of data processing which
  benefits from partitioning are regularly loading data, rather than
  collecting it in an OLTP fashion. Lets take the easily understandable
  concept of processing web site traffic. If the amount of data is large
  enough to benefit from partitioning, they probably have multiple web
  servers and therefore almost certainly multiple log files. If these
  files are not sorted into a single file, the records will not have a
  naturally progressing chronology: every file we go back to the beginning
  of the period of time the load covers. If you add parallelism to your load,
  things look even more different. This means you could end up with a
  bunch of partitions, under the dynamic model, which all cover the same
  time range.
 
 Depends how big we make the partitions and how sloppy this is as to
 whether that is a problem or not. We might still expect a x100 gain from
 using the SE approach depending upon the data volume.

This comes back to my problem with the dynamic approach: the user might
or might not get the behaviour they expect, whereas with the declarative
approach the expectation matches the result.

 
  Then there's the way that really big databases are used (say, up around
  Simon's upper bound of 16 TB). It is expensive to keep data online so
  people aren't. They're loading and unloading data all the time, to
  perform different forms of analysis. 
 
 That isn't my experience. That sounds very time consuming.

I cannot offer any evidence but it's what we see companies doing.

 The storage cost issue was the reason Andrew wanted offline segments,
 and why I have been talking about hierarchical storage.

The thing is, people with lots of data usually have good hierarchical
storage systems. Can we really do better?

 
  A common scenario in the example
  above might be to unload all but the current month's data and then load
  the same month from the previous year. The unload step needs to be
  costly (i.e., TRUNCATE). Then, there's no guarantee that what they're
  interested in is the date range at all. They may want

Re: [HACKERS] Named vs Unnamed Partitions

2008-01-09 Thread Gavin Sherry
On Wed, Jan 09, 2008 at 08:51:30PM +, Simon Riggs wrote:
  That's what I would have done if it was easier to do with constraint 
  exclusion 
  (did only date partitioning), as the reporting queries will always have 
  some 
  server (stats by services, each service being installed on 1 or more 
  servers) 
  and date restrictions.
 
 Hmm, well if you found declaring the partitions a problem with
 constraint exclusion it's not going to be any easier using other
 declarative approaches.

I disagree (although it is unreasonable for me to do so without posting
syntax -- it's coming). Proper grammar for partition support means
running a single DDL command. The user does not have to line up table
generation with rules (or triggers) and check constraints. As such, I
believe it to be much, much easier.
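
For comparison, with the current approach even a single range partition
means something like the following (a rough sketch, names invented),
repeated for every partition added:

CREATE TABLE measurement (logdate date, peaktemp int);

CREATE TABLE measurement_y2008m01 (
CHECK (logdate >= date '2008-01-01' AND logdate < date '2008-02-01')
) INHERITS (measurement);

CREATE RULE measurement_insert_y2008m01 AS
ON INSERT TO measurement
WHERE (logdate >= date '2008-01-01' AND logdate < date '2008-02-01')
DO INSTEAD
INSERT INTO measurement_y2008m01 VALUES (NEW.*);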

Thanks,
Gavin



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-09 Thread Gavin Sherry
Hi Simon,

On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote:
 Segment Exclusion
 -
 
 After we note that a segment is read-only we can scan the segment and
 record min/max values for all columns. These are then implicit
 constraints, which can then be used for segment exclusion in a similar
 manner as we do with the current constraint exclusion mechanism.

What about columns which are varlen? The maximum value could be very long -- 1
GB in fact. Say we decide to not support those. Well, what about
numeric? What about user defined types?

 This would also allow a Merge Join to utilise exclusion for the inner
 plan at execution time. Instead of scanning the whole inner table plan
 from the start, we would be able to start the scan from the appropriate
 segment. This would require us to pass the current value of the outer
 plan down into the inner plan. The typical executor nodes on the inner
 plan would be a SortNode and below that a SeqScan. In that case the
 executor would need to pass the outer value from the MergeJoinNode down
 thru the SortNode to the SeqScan node. The SeqScan node could then
 perform partition exclusion, reducing the time for that scan and also
 reducing the time for the resulting sort. This sounds like it might be
 worth doing in normal cases also. It might turn out that the potentially
 applicable cases are already excluded during planning, I haven't thought
 about that aspect in enough detail yet.

I don't think that would work at all. Say you have skew of data
across the partitions. You may end up doing a seq scan for every outer
tuple. It would look like a really expensive nested loop.

 If we collect data for all columns then many of our implicit constraints
 would be useless. e.g. if a column only has a few values and these are
 present in all segments. Matching our predicate against all constraints
 would be expensive, so we must weed out poor constraints. We would do
 this by removing any constraint that overlapped more than 10% of other
 segments. Various heuristics would likely need to be discovered during
 development to make this work without resorting to manual commands.
 
 Note that all of this exclusion is happening in the executor, not the
 planner. That allows this form of exclusion to work with stable
 functions and parameters without problem.

So, in that case, you've totally undercut the planner. All the steps up
the tree would be based on the stats for scanning the entire table but
you've only returned part of it.

To solve that, you could put the min/max values in a catalog somewhere.
For a 16 TB table, that's 16000 min/max values. That's pretty expensive.
And how do we handle making things non-read-only? You'd have to wait for
all current queries to finish... that could take a long time.

How do we handle crashes during setting of min/max? I presume it needs
WAL support.

What happens if the free space map gives us a free block in a read-only
segment? Then we need to look at concurrency and transactionality of
min/max values, don't we? If changing it makes the segment not read-only
(which seems to be the case) it would be trivial to totally degrade your
partitioning scheme to a full seq scan. On a 16 TB table. And getting the
performance back might involve a day or more of VACUUMing. Impossible.

 Visibility Map
 --
[snip]
 No dynamic shared memory cache is required because any concurrent
 changes to the table would be ignored by a Scan anyway, so it doesn't
 matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
 new scans that start will attempt to lock the table and then perform a
 rel cache check before continuing. So the visibility will be set
 correctly for *that* scan at least.

Is that true in the presence of READ COMMITTED?

 
 In most cases the visibility map can be summarised as a single boolean
 to show whether any 100% visible segments exist. That makes accessing
 the map very cheap in the common, unset case.

What do we do in failure scenarios? Does this go into WAL? Is it
replicated via log shipping techniques? You mention Slony support below.
I don't know that Slony's target is Very Big Tables (up to 16 TB) but
even if it were being used for a 'small' system of a few hundred GB, when
you fail over the system will have degraded if you aren't vacuuming it.
Moreover, the layout of data may be different and therefore so will the
performance of the segment exclusion. That's a bit of a surprise.

 
 Setting the Visibility Map
 --
 
 VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
 them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
 all current backends. Marking the map will cause a rel cache
 invalidation.

Argh. Remember, we're talking about 16 TB tables here. With the best of
tuning, that takes all day. So, day 1, load data; day 2 VACUUM it? With
the existing partitioning approach, there's no such thing.

 
 We would never mark the highest numbered 

Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-09 Thread Gavin Sherry
Hi Simon,

On Wed, Jan 09, 2008 at 03:08:08PM +, Simon Riggs wrote:
 Do people really like running all that DDL? There is significant
 manpower cost in implementing and maintaining a partitioning scheme,
 plus significant costs in getting it wrong. 

Well... that's impossible for me to say. Most of the people who use lots
of data like using MDX (a large and quite complex reporting and
modelling language from MS, which is more complex than SQL IMHO) so...

 
 If people with large tables like partitioning why is Oracle moving
 towards automated partitioning in 11g? Automated partitioning was one of

Have you used Oracle's partitioning? I'm not surprised they're moving
away ;-).

More seriously, I didn't know that. Do you have a URL?

 the major goals for this next set of Postgres partitioning functionality
 also, whether or not we have declarative partitioning. My thinking is
 lets go for the best ideas and skip over the stuff that will be (is)
 obsolete before its left the design stage.

Well, you're assuming that because declarative partitioning has been
around a while, it's obsolete. I don't think that's the case. Postgres
has been around about as long...

 I see many more systems in the Terabyte range now than I did 10 years
 ago, but I see about the same number of DBAs. We'll always need experts,
 but I feel we should be using our expertise to simplify the standard
 cases, not just maintain the same level of difficulty in managing them.
 One of the big benefits of the dynamic partitioning approach is that it
 needs no DDL. So it will work out-of-the-box, for anybody.

It's fine to say that, but people have already started talking about
grammar extensions for dynamic partitions. People want the tools to
manage it. There's difficulty in learning how and when to use the
different VACUUM options that have been discussed.

 Deleting older data would be optimised under the proposed scheme, so
 that's not really a problem. Loading data is actually slightly harder
 and slower with declarative partitioning (see below).
 
 Merging and splitting partitions are tools for fine tuning a very
 complex partitioning scheme. They do also allow a non-linear
 segmentation scheme, which might aid performance in some cases.

What about adding partitions into an existing set of partitions? Greg's
post on that was dismissed, and it might not be representative, but we
have users doing that all the time.

 One major advantage of the dynamic approach is that it can work on
 multiple dimensions simultaneously, which isn't possible with
 declarative partitioning. For example if you have a table of Orders then
 you will be able to benefit from Segment Exclusion on all of these
 columns, rather than just one of them: OrderId, OrderDate,
 RequiredByDate, LastModifiedDate. This will result in some sloppiness
 in the partitioning, e.g. if we fill 1 partition a day of Orders, then
 the OrderId and OrderData columns will start out perfectly arranged. Any
 particular RequiredByDate will probably be spread out over 7 partitions,
 but thats way better than being spread out over 365+ partitions.
 
 When we look at the data in the partition we can look at any number of
 columns. When we declaratively partition, you get only one connected set
 of columns, which is one of the the reasons you want multi-dimensional
 partitioning in the first place.
 
  To this end, we (well, Jeff Cohen) looked at the syntax and semantics of
  partitining in leading databases (Oracle, Informix, DB2) and came up
  with a highly expressive grammar which takes the best of each I think
  (I'll post details on the grammar in a seperate thread). The idea is
  that range (for example, a date range), list (a list of distinct values)
  and hash partitioning be supported on multiple columns. Partitions can
  be added, merged, truncated. Partitions can reside on different
  tablespaces. The existing issues with the rewriter, COPY support and the
  like go away by smartening up the backend.
 
  To explore the grammar and semantics Jeff and I (to a small extent) have
  implemented the parsing of such a grammar in the backend. At the moment,
  this just results in the generation of a bunch of rules (i.e., it
  produces the current behaviour, as if someone had written rules
  themselves) and debugging output.
  
  The fully fledged approach will see partitioning rules stored in
  a new system catalog which can be interrogated by the planner or COPY to
  determine which partition a given tuple belongs in or which partition(s)
  a WHERE clause matches.
 
 When loading data we would need to examine each incoming tuple and
 distribute it to the correct place in the partitioned heap. That's a
 non-trivial overhead on data loading that I would like to avoid, since
 if we rely on the natural locality of loaded data then that overhead is
 zero.

Yes, but it's a lot faster than what we have at the moment and the
dynamic approach is not without its costs. I've pointed it out before,
but I will 

Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-08 Thread Gavin Sherry
On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote:
 This technique would be useful for any table with historical data keyed
 by date or timestamp. It would also be useful for data where a
 time-of-insert component is implicit, such as many major entity tables
 where the object ids are assigned by a sequence. e.g. an Orders table
 with an OrderId as PK. Once all orders placed in a period have been
 shipped/resolved/closed then the segments will be marked read-only.
 
 Its not really going to change the code path much for small tables, yet
 seems likely to work reasonably well for large tables of all shapes and
 sizes. If a segment is being updated, we leave it alone, and maybe never
 actually set the visibility map at all. So overall, this idea seems to
 cover the main use case well, yet with only minimal impact on the
 existing use cases we support.
 
 
 As before, I will maintain this proposal on the PG developer Wiki, once
 we get down to detailed design.
 
 
 
 Like it?

Simon,

A novel approach to the problem. For me, this proposal addresses some
of the other problems in postgres (visibility in the heap vs. index)
rather than the problems with partitioning itself.

It might seem otherwise, but for me partitioning is a tool for managing
large volumes of data. It allows the user to encode what they know about
the nature of their data, and that's often about management. In this
way, your proposal actually makes partitioning too smart: the user wants
to tell the system how to organise the data, as opposed to the other way
around.

At Greenplum, we've been discussing this in depth. Interestingly, we
also felt that the storage layer could be much smarter with large tables
with predictable layouts and predictable data patterns. But the thing
is, people with large tables like partitioning, putting different
partitions on different storage; they like the ability to merge and
split partitions; they like truncating old partitions. In a word, they
seem to like the managability partitions give them -- as well as the
performance that comes with this.

To this end, we (well, Jeff Cohen) looked at the syntax and semantics of
partitioning in leading databases (Oracle, Informix, DB2) and came up
with a highly expressive grammar which takes the best of each, I think
(I'll post details on the grammar in a separate thread). The idea is
that range (for example, a date range), list (a list of distinct values)
and hash partitioning be supported on multiple columns. Partitions can
be added, merged, truncated. Partitions can reside on different
tablespaces. The existing issues with the rewriter, COPY support and the
like go away by smartening up the backend.

To explore the grammar and semantics Jeff and I (to a small extent) have
implemented the parsing of such a grammar in the backend. At the moment,
this just results in the generation of a bunch of rules (i.e., it
produces the current behaviour, as if someone had written rules
themselves) and debugging output.

The fully fledged approach will see partitioning rules stored in
a new system catalog which can be interrogated by the planner or COPY to
determine which partition a given tuple belongs in or which partition(s)
a WHERE clause matches.

Yes, this is the traditional approach but I think it still has merit. We
shall continue working on this approach because it is what our customers
have asked for. We would also like to see it get into postgres too.

Thanks,

Gavin



Re: [HACKERS] VLDB Features

2007-12-12 Thread Gavin Sherry
On Wed, Dec 12, 2007 at 08:26:16PM +0100, Markus Schiltknecht wrote:
 Isn't Gavin Sherry working on this? Haven't read anything from him
 lately...
 
 Me neither.  Swallowed by Greenplum and France.
 
 Hm.. good for him, I guess!

Yes, I'm around -- just extremely busy with a big release at Greenplum as 
well as other Real Life stuff.

Thanks,

Gavin



Re: [HACKERS] Adjusting index special storage for pg_filedump's convenience

2007-04-09 Thread Gavin Sherry
On Mon, 9 Apr 2007, Tom Lane wrote:

 We put in a workaround a long time ago to make it possible to tell the
 difference between btree and hash special space, which are also the same
 size: there's an unused 16 bits in hash special space that we fill with
 a specific value.  As of 8.2 this doesn't work as well as it used to,
 because the corresponding space in a btree page is now used for a vacuum
 cycle ID and so there's 1 chance in 65536 of a false match.  Still, it's
 a lot better than nothing.

Sounds... reasonable. Especially if you add the flags test below.


 I'd like to tweak things for 8.3 so that pg_filedump can work reasonably
 well again.  It looks like the hash solution would work for gist, gin,
 and bitmap: rearranging fields would allow us to put in a 16-bit ID
 field in all three cases.  (For bitmap, I'm assuming that
 bm_hrl_words_used could be reduced to 16 bits without problem --- it is
 a per-page count not something larger, right?)

Yes, I've reduced this already but hadn't in previous patches, from
memory. I'd add a filler of uint16 now. Got a number I should use?

 One problem with that is that with four special values, there'd be 1
 chance in 16384 of misidentifying a btree page because of chance values
 of the vacuum cycle ID.  This can be improved a bit if we put the flags
 fields (for those index types that have 'em) in a consistent place too:
 we can disbelieve that an index is of type X if it doesn't have a flags
 value that fits.  I don't see any way to make it completely bulletproof
 without enlarging the special space, which seems an unreasonable price
 to pay.  But even one chance in 16K is way better than the current
 situation.

Sounds like the only workable approach.

Thanks,

Gavin



Re: [HACKERS] Bitmap index thoughts (another segfault)

2007-04-08 Thread Gavin Sherry

 I'm seeing a segfault on a TPC-H size 10 database. The patch and
 code are:
 - bitmap patch from 12 Mar
 - 8.3 dev from 27 Mar

Thanks Mark. I tracked this down. I'll post a new patch soon.

Gavin



Re: [HACKERS] Bitmap index thoughts (another segfault)

2007-04-08 Thread Gavin Sherry
On Sat, 7 Apr 2007, Mark Kirkwood wrote:

 Mark Kirkwood wrote:
 bitmap=# SELECT count(*) FROM bitmaptest
 WHERE val1 in (1,7)
 AND val0 IN (4,3)
 ;

 ERROR:  XX000: unknown stream type 2
 LOCATION:  stream_add_node, tidbitmap.c:1033

Thanks. Turned out to be a typo in stream_add_node()! I'll post a new
patch soon. Thanks for the test kit and your testing!

Gavin



Re: [HACKERS] Many unfinished patches

2007-04-02 Thread Gavin Sherry
I am currently finishing off an improved VACUUM implementation for
bitmaps. The rest of the patch is ready for review.

I will try and post a patch within 24 hours.

Gavin



Re: [HACKERS] Bitmapscan changes - Requesting further feedback

2007-03-20 Thread Gavin Sherry
On Tue, 20 Mar 2007, Joshua D. Drake wrote:

 Hackers et al... I was wondering if there are any outstanding issues
 that need to be resolved in terms of the clustered index/bitmap changes?

 From the testing that I have done, plus a couple of others it is a net
 win (at least from DBA space).

Not sure if you're talking about bitmap indexes here. If so, I'm working
on VACUUM support.

Gavin



Re: [HACKERS] Why is this allowed?

2007-03-10 Thread Gavin Sherry
On Sat, 10 Mar 2007, Chuck McDevitt wrote:

 Ok...

 Just to be clear, the ISO SQL spec says that  INTERVAL '1' DAY is the
 correct way to specify a one-day interval.
 That's why it is surprising that PostgreSQL treats it differently, with
 no error or warning.

 The PostgreSQL syntax INTERVAL '1 DAY' is non-standard.

 Is fixing this on the TODO list?

Yes: Add ISO INTERVAL handling

There's detail there about the nature of support required. Personally, I
think the syntax is awful. I don't understand why intervals are handled so
differently in the spec as compared to other data types. This syntax is
well supported by Oracle and DB2 though. I guess it should be completed in
postgres.
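
For reference, the two spellings in question are:

-- SQL-standard form (currently accepted but not interpreted per the spec):
SELECT INTERVAL '1' DAY;

-- PostgreSQL's traditional, non-standard form:
SELECT INTERVAL '1 day';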

Thanks,

Gavin



Re: [HACKERS] Stream bitmaps

2007-03-08 Thread Gavin Sherry
On Thu, 8 Mar 2007, Heikki Linnakangas wrote:

 Hi Gavin,

 Any progress?


Really busy at the moment, but it's on my TODO list for today.

Thanks,

Gavin



Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-07 Thread Gavin Sherry
On Wed, 7 Mar 2007, Josh Berkus wrote:

 Approximate queries is something with DSS users *want*.  Jim Grey addressed
 this in his ACM editiorial on the databases of the future.  It's something
 that *I* want, and if the Greenplum people aren't speaking up here, it's
 because they're not paying atttention.

 Now, I don't know if this Skyline patch is our answer for approximate queries.
 Maybe I should pester Meredith about getting QBE free of its IP issues; it
 certainly looked more flexible than Skyline.  In either case, the code
 probably needs a complete refactor.

What people want from approximate queries is different from this: the
desire is usually to balance run time against the level of accuracy/quality
(sometimes the desire is to have accurate results as well as similar
results). Neither skyline nor QBE is about this. The only thing in the spec
which addresses this is 'tablesample'.
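
For reference, the spec syntax looks roughly like this (not something
PostgreSQL implements at the time of writing; the table name is just
illustrative):

SELECT avg(price)
  FROM sales TABLESAMPLE BERNOULLI (1);  -- consider roughly 1% of the rows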

Thanks,

Gavin



Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-06 Thread Gavin Sherry
On Tue, 6 Mar 2007, Alvaro Herrera wrote:

 Also, keep in mind that there were plenty of changes in the executor.
 This stuff is not likely to be very easy to implement efficiently using
 our extant executor machinery; note that Ranbeer mentioned
 implementation of block nested loop and other algorithms.  Not sure
 how easy would be to fold that stuff into the optimizer for multi-input
 aggregates, instead of hardwiring it to the SKYLINE OF syntax.


Yes, there's been a lot of work on calculating skylines efficiently,
with different sorting techniques and so on. This is the most interesting
part of the idea. You could calculate the query Ranbeer gave using pure
SQL already, perhaps with some covariance aggregates or something similar.
Of course, it gets harder when you want to calculate across many
dimensions.

Personally, I'd love to see some of these newer data analysis
capabilities added to PostgreSQL -- or at least put out there as
interesting patches.

Thanks,

Gavin



Re: [HACKERS] Stream bitmaps

2007-03-05 Thread Gavin Sherry
Heikki,

On Mon, 5 Mar 2007, Heikki Linnakangas wrote:

 Hi all,

 I'd like to see the indexam API changes needed by the bitmap indexam to
 be committed soon. Has anyone looked at the proposed API in the latest
 patch? Any thoughts?

Thanks for looking at it!


 I'm quite happy with it myself, with a few reservations:

 - All the getbitmap implementations except the new bitmap indexam are
 just boilerplate. How about making getbitmap-function optional, and
 having a generic implementation that fills in a hash bitmap using the
 traditional getnext function?

 - getbitmap is passed an existing bitmap as argument, and the
 implementation needs to OR the existing bitmap with new tuples. How
 about AND? An indexam could be smart about ANDing with an existing
 bitmap, for example skipping to the first set bit in the existing bitmap
 and starting the scan from there.

 - I'd like to have support to return candidate matches with both
 getbitmap and getnext. A simple flag per page of results would be enough
 for getbitmap, I think.

 - StreamBitmap and HashBitmap are separate node types, but OpStream is
 not. opaque-field in the StreamBitmap struct is not really that opaque,
 it needs to be a StreamNode. I drew a UML sketch of what I think the
 class-hierarchy is
 (http://community.enterprisedb.com/streambitmaps.png). This is
 object-oriented programming, we're just implementing classes and
 inheritance with structs and function pointers. The current patch mixes
 different techniques, and that needs to be cleaned up.

All good ideas, I'll look at them more closely in the morning.

 I'd like to see a separate patch that contains just the API changes.
 Gavin, could you extract an API-only patch from the bitmap index patch?
 I can work on it as well, but I don't want to step on your toes.

Okay, I can do that.

Thanks,

Gavin



Re: [HACKERS] [PATCHES]

2007-03-05 Thread Gavin Sherry
On Wed, 28 Feb 2007, Tom Lane wrote:

 AFAICT, the footer in question tries to make it illegal for us even to
 have the message in our mail archives.  If I were running the PG lists,
 I would install filters that automatically reject mails containing such
 notices, with a message like Your corporate lawyers do not deserve to
 have access to the internet.  Go away until you've acquired a clue.


I just noticed Bruce's email suggesting that this is going ahead:

http://archives.postgresql.org/pgsql-performance/2007-03/msg00098.php

I do not think this is a good idea. These disclaimer messages are
generally tacked on by the MTA of the company at which the person works,
as others have noted. There seem to be about 10 such emails to -general per
month; I'm not sure about other lists. FWIW, usually it is just one or two
offenders.

We do not suffer this problem in isolation. I think the Debian project has
tried to address this partly with this:

http://www.debian.org/MailingLists/disclaimer

Couldn't we place a disclaimer on the subscription page and in the welcome
email which covers some of the interesting points in the Debian disclaimer
and states that code contributions are implicitly BSD licensed? I have no
idea about the legal issues involved.

Another way of looking at this is that we cannot be 100% thorough. What
about disclaimers in German? What about false positives?

Another concern I've had is the assumption that those wanting to
contribute (code) will just have to go and register some throw-away Hotmail
or Yahoo account. I think this will make tracing the origin of code more
difficult in the long term.

Just my thoughts.

Gavin



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Gavin Sherry
On Mon, 5 Mar 2007, Mark Kirkwood wrote:

 To add a little to this - forgetting the scan resistant point for the
 moment... cranking down shared_buffers to be smaller than the L2 cache
 seems to help *any* sequential scan immensely, even on quite modest HW:

 e.g: PIII 1.26Ghz 512Kb L2 cache, 2G ram,

 SELECT count(*) FROM lineitem (which is about 11GB) performance:

 Shared_buffers  Elapsed
 --  ---
 400MB   101 s
 128KB74 s

 When I've profiled this activity, I've seen a lot of time spent
 searching for/allocating a new buffer for each page being fetched.
 Obviously having less of them to search through will help, but having
 less than the L2 cache-size worth of 'em seems to help a whole lot!

Could you demonstrate that point by showing us timings for shared_buffers
sizes from 512K up to, say, 2 MB? The two numbers you give there might
just have to do with managing a large buffer.

Thanks,

Gavin



Re: [HACKERS] [PATCHES] - WIP Patch Updatable Cursor

2007-02-27 Thread Gavin Sherry
On Wed, 28 Feb 2007, John Bartlett wrote:

 Hi,

 A list of ctids is stored in the file.

I would have thought these would be stored in memory. If the set got
large, you'd overflow to a temporary file the way other systems that spill
to disk do?


 The file is used to store the ctids during an updatable cursor transaction.

 It is set up as a permanent file as it has a potential lifetime of
 preserving data between crashes of the backend. Temporary files tend to be
 used for data that is defined within a single command. In this case the file
 needs to exist within a transaction and across backend processes.

It does not. Cursors are implicitly closed when a session is closed. A
backend crash or system restart closes all open sessions.


 The file gram.y has been corrected in my version.

 The files ctidListStore.c and ctidListStore.h were pasted into the patch
 file, as the diff -N command produced a file of several hundred thousand
 lines.

Edit the file with a text editor. If you know which files should be
excluded (like tags files), use diff --exclude=pattern.

Thanks,

Gavin



Re: [HACKERS] Bitmap index stuff

2007-02-26 Thread Gavin Sherry
On Mon, 26 Feb 2007, Heikki Linnakangas wrote:

 Hi,

 How are you doing with the bitmap indexes?

I need to send off a patch fixing the last bug you pointed out. The code
needs to be merged with HEAD.


 I've been trying to get my head around the patch a couple of times to
 add the vacuum support, but no matter how simple I try to keep it, I
 just always seem to get stuck.

 It looks like vacuum support would need:

 - something similar to read_words, let's call it iterate_set_bits, that
 returns each set bit from a bitmap vector, keeping the buffer locked
 over calls.
 - ability to clear the bit returned by iterate_set_bits. If normal index
 scans also used this, the same functions could be used to support the
 kill_prior_tuple thingie.

Okay.


 The above also needs to be able to recompress a page if it gets
 fragmented by repeated setting and clearing of bits.

Yes.

 I still feel that the data structures are unnecessarily complex. In
 particular, I'd like to get rid of the special-cased last_word and
 last_comp_word in the lov item. Perhaps we could instead embed a normal,
 but smaller, BMBitmapData structure in the lov item, and just add a
 length field to that?

I'm not sure that this really simplifies the code. I agree things could be
simpler though.


 You have a lot of code to support efficient building of a bitmap index.
 I know you've worked hard on that, but do we really need all that? How
 did the bitmap build work in the previous versions of the patch, and how
 much faster is the current approach?

I thought I included details in a previous email. Basically, in cases
where the data is distributed as follows:

1 1 1 1 1 1 1  2 2 2 2 2 2 2  3 3 3 3 3 3 3 3 ...

We're very fast in both versions. If the data is distributed as:

1 2 3 4 5 6  1 2 3 4 5 6

In the original version(s), we were terribly slow (in my test, 7 times
slower than btree). Considering the kind of data sets bitmap suits, this
made bitmap unusable. With the rewrite, we're much faster (in my test,
faster than btree).

The test case was: a table with 600M rows with 100,000 distinct keys to be
indexed.
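
For reference, the two distributions can be generated with something like
the following (the table and column names are made up here, and the row
counts are scaled well down from the 600M-row test):

CREATE TABLE bmbuildtest (key int);

-- clustered distribution: long runs of equal keys (0 0 0 ... 1 1 1 ...)
INSERT INTO bmbuildtest
  SELECT i / 6000 FROM generate_series(0, 5999999) AS i;

-- or, unclustered: the same keys cycling (0 1 2 ... 999 0 1 2 ...)
-- INSERT INTO bmbuildtest
--   SELECT i % 1000 FROM generate_series(0, 5999999) AS i;

-- then, with the patch applied:
CREATE INDEX bmbuildtest_key ON bmbuildtest USING bitmap (key);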

 BTW: It occured to me that since we're piggybacking on b-tree's strategy
 numbers, comparison operators etc, conceivably we could also use any
 other indexam. For example, a bitmap GiST would be pretty funky. We'll
 probably leave that for future versions, but have you given that any
 thought?

True. I haven't given it any thought though. Interesting... I'd have to
think of some interesting data sets which would suit the capabilities
(operators) we have with GiST.

Thanks,

Gavin



Re: [HACKERS] SCMS question

2007-02-22 Thread Gavin Sherry
On Thu, 22 Feb 2007, Tom Lane wrote:

 Gregory Stark [EMAIL PROTECTED] writes:
  If we want to minimize the pain of changing and keep the same mode of
  operation Subversion is definitely the right choice. Its goal was to provide
  the same operational model as CVS and fix the implementation and 
  architectural
  problems.

 Erm ... but this is not an argument in favor of changing.

 AFAIR the only real disadvantage of CVS that we've run up against is
 that it's hard to shuffle files around to different directories without
 losing their change history (or more accurately, making the history
 harder to find).  Now that is a pretty considerable annoyance on some
 days, but it's not sufficient reason to change to something else.
 I have no doubt that every other SCMS has annoyances of its own.

It's not a problem for the project, but I personally experience pain with
CVS. I often want to take a bunch of commits and merge them into separate
trees (like Greenplum DB or my private bitmap index tree). This is a lot
easier with the patch-set-based SCMs like darcs/monotone/git/etc.

Just my thoughts.

Gavin



Re: [HACKERS] Status of Hierarchical Queries

2007-02-21 Thread Gavin Sherry
On Wed, 21 Feb 2007, Jonah H. Harris wrote:

 As was discussed in several threads, I'd handed over the
 responsibility of hierarchical queries to Greg Stark several weeks
 ago.  He posted a preliminary patch which I don't believe anyone
 looked at.  For 8.3's sake, I wanted to make sure we get the status of
 this out on the table so there won't be any surprises like those
 related to 8.2.

 Where are we at?  Has anyone reviewed the preliminary work?  Any
 comments, suggestions, etc?

Yes, I looked at it.

The WITH support seems okay. I guess I'd thought it might be represented
differently internally (not as a subquery), but the approach Greg has taken
is probably more straightforward (in that you get a lot of proven code for
free). It should work fine for recursive queries too, if you just re-seed
the param keys for every scan of the 'sub-query'.

I wonder if anyone can think of a good way to cost the recursive side of
the query. I'm still pre-coffee and it hurts my head :).

Gavin



Re: [HACKERS] Status of Hierarchical Queries

2007-02-21 Thread Gavin Sherry
On Wed, 21 Feb 2007, Gregory Stark wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:

  The WITH support seems okay. I guess I'd thought it might be represented
  different internally (not a sub query) but the approach Greg has taken is
  probably more straight forward (in that you get a lot of proven code for
  free). It should work fine for recursive queries too, if you just re-seed
  the param keys for every scan of the 'sub-query'.

 I don't think it works for recursive queries. Since you can't have the same
 executor plan in motion for two different sets of parameters simultaneously.
 That's why I was talking about a Memoize node.

Can you elaborate on the 'two different sets of parameters' bit? I'm still
without coffee.

 It is sufficient for the non-recursive case which might make it worthwhile
 putting it in 8.3. But even there user's expectations are probably that the
 reason they're writing it as a cte is precisely to avoid duplicate execution.

I wonder if the planner should decide that?

Thanks,

Gavin



Re: [HACKERS] Status of Hierarchical Queries

2007-02-21 Thread Gavin Sherry
On Thu, 22 Feb 2007, Gregory Stark wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:

  Can you elaborate on the 'two different sets of parameters' bit? I'm still
  without coffee.

 The spec allows for arbitrarily complex recursive query structures. Including
 mutually recursive queries, and even non-linearly recursive queries. I found
 grokking them requires far stronger brews than coffee :)

Hehe.

 But in a simple recursive tree search you have a node which wants to do a join
 between the output of tree level n against some table to produce tree level
 n+1. It can't simply execute the plan to produce tree level n since that's the
 same tree it's executing itself. If it calls the Init method on itself it'll
 lose all its state.

 There's another reason it can't just execute the previous node. You really
 don't want to recompute all the results for level n when you go to produce
 level n+1. You want to keep them around from the previous iteration. Otherwise
 you have an n^2 algorithm.

Right. When I've spent some idle cycles thinking through this in the past
I figured that in a non-trivial query, we'd end up with a bunch of
materialisations, one for each level of recursion. That sounds very ugly.

  It is sufficient for the non-recursive case which might make it worthwhile
  putting it in 8.3. But even there user's expectations are probably that the
  reason they're writing it as a cte is precisely to avoid duplicate 
  execution.
 
  I wonder if the planner should decide that?

 That's one option. For the non-recursive case we could inline the cte subquery
 everywhere it's referenced and then add smarts to the planner to find
 identical subqueries and have a heuristic to determine when it would be
 advantageous to calculate the result once.

 The alternative is to retain them as references to a single plan. Then have a
 heuristic for when to inline them.

 In neither case is a heuristic going to be particularly good. The problem is
 that for any reasonably complex plan it'll be cheaper to execute it only once
 than multiple times. Unless there's an advantage to be gained by inlining it
 such as being able to push conditions down into it. But the only way to find
 out if that will be possible would be to try planning it and see.

Pushing down predicates was the exact idea I had in mind.

Thanks,

Gavin



Re: [HACKERS] infinity and DATE type

2007-02-21 Thread Gavin Sherry
On Thu, 22 Feb 2007, Warren Turkal wrote:

 On Thursday 22 February 2007 00:20, Tom Lane wrote:
  Nobody got round to it.  I believe it's listed in the TODO file ...

 It's not at [1]. Should someone add it to the TODO?

 [1]http://www.postgresql.org/docs/faqs.TODO.html

Yes it is.

Allow infinite dates and intervals just like infinite timestamps
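
For context, this is the asymmetry the TODO item is about (behaviour as of
releases of this vintage):

SELECT 'infinity'::timestamp;   -- already works
SELECT 'infinity'::date;        -- rejected; the TODO item would allow it
SELECT 'infinity'::interval;    -- likewise rejected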

Gavin



Re: [HACKERS] TopPlan, again

2007-02-18 Thread Gavin Sherry
On Sun, 18 Feb 2007, Tom Lane wrote:

 We've repeatedly discussed getting rid of execution-time access to the
 Query structure --- here's one old message about it:
 http://archives.postgresql.org/pgsql-hackers/1999-02/msg00388.php
 and here's a recent one:
 http://archives.postgresql.org/pgsql-hackers/2006-08/msg00734.php
 I think it's time to bite the bullet and do that.

Great.

 The other big problem is the rangetable (rtable): currently it contains
 Query trees for subqueries (including views) so unless we clean that up
 we aren't going to be all that far ahead in terms of reducing the
 overhead.  I'm envisioning creating a compact rangetable entry struct
 with just the fields the executor needs:

   rtekind
   relid
   eref(might only need the table alias name not the column names)
   requiredPerms
   checkAsUser

 and flattening subquery rangetables into the main list, so that there's
 just one list and rangetable indexes are unique throughout a plan tree.
 That will allow subqueries to execute with the same EState as the main
 query and thus simplify nodeSubplan and nodeSubqueryScan.  This list
 will also provide a simple way for the plan cache module to know which
 relations to lock before determining whether the plan has been invalidated.

Cool.

 Comments, objections?  Also, any thoughts about the names to use for
 these new node types?  As I commented last year, I'm not completely
 happy with TopPlan because it won't actually be a subtype of Plan,
 but I don't have a better idea.  Also I'm unsure what to call the
 cut-down RangeTblEntry struct; maybe RunTimeRangeTblEntry?

I think TopPlan is misleading. What about MetaPlan instead of TopPlan? I
think RunTimeRangeTblEntry is okay, though long. ExecRangeTblEntry?

Thanks,

Gavin



Re: [HACKERS] Bitmap index thoughts

2007-02-08 Thread Gavin Sherry
On Thu, 8 Feb 2007, Heikki Linnakangas wrote:

 Gavin Sherry wrote:
  I will update the code tomorrow. The focus will be cleaning up the
  executor modifications. Please look else where for now.

 I'm getting a segfault with this test script:

 
 CREATE TABLE bmtest (i int);

 INSERT INTO bmtest SELECT 1 FROM generate_series(1,10) a;
 INSERT INTO bmtest SELECT 10 FROM generate_series(1,100) a;
 DELETE FROM bmtest WHERE i = 1;
 VACUUM bmtest;

 CREATE INDEX i_bmtest ON bmtest USING bitmap (i);

 INSERT INTO bmtest SELECT 10 FROM generate_series(1,10) a;
 


Hmm... this triggers a bug in the newly rewritten update code I think.
I'll post a fix soon.

Thanks for testing!

Gavin



Re: [HACKERS] --enable-debug does not work with gcc

2007-02-02 Thread Gavin Sherry
On Fri, 2 Feb 2007, NikhilS wrote:

 Hi,

 configure with --enable-debug does not seem to add -g to CFLAGS while
 compiling with gcc. Guess we will need to change configure.in as below:

Erm... works for me and everyone else... AFAIK.

Thanks,

Gavin



Re: [HACKERS] --enable-debug does not work with gcc

2007-02-02 Thread Gavin Sherry
On Fri, 2 Feb 2007, NikhilS wrote:

 Hi,

 Indeed it does, apologies for not doing the entire groundwork. But what it
 also does is that it adds -O2 by default for gcc even when --enable-debug is
 specified. gdb is not able to navigate the stack traces properly with this
 optimization in place. Especially tracing of static functions becomes
 difficult. Has this issue been faced by anybody else? If so can try out a
 patch to avoid using O2 with enable-debug.

Yes, this is known. The thing with gcc is, it only emits some warnings at
-O2. I'm not certain that this is why we do not set optimisation to 0, but
I have long assumed it to be the case. I imagine it's fairly standard
practice for people doing debugging to pass CFLAGS=-O0 as an argument to
configure.




Re: [HACKERS] Bitmap index thoughts

2007-02-02 Thread Gavin Sherry
On Thu, 1 Feb 2007, Bruce Momjian wrote:


 Where are we on this patch?  Does it have performance tests to show
 where it is beneificial?  Is it ready to be reviewed?

Here's an updated patch:

http://www.alcove.com.au/~swm/bitmap-2007-02-02.patch

In this patch, I rewrote the index build system. It was fast before for
well-clustered data, but for poorly clustered data it was very slow. Now
it is pretty good for both distribution types.

I have various test cases, but the one which showed bitmap in a poor light
was a table of 600M rows. The key to the table had a cardinality of 100,000.
When the table was loaded with keys clustered, the build time was 1000
seconds with bitmap (2200 with btree). With poorly clustered data (e.g.,
the index key was (1, 2, 3, ..., 6000, 1, 2, 3, ...)), the build time for
bitmap was 14000 seconds!

So, I rewrote this to compress the build data using HRL encoding (the same
scheme we use in the bitmap AM itself). Now, clustered data is just as fast
and unclustered data takes 2000 seconds.

The select performance at a cardinality of 100,000 is similar to btree but
faster with lower cardinalities.

Jie also contributed a rewrite of the WAL code to this patch. Not only is
the code faster now, but it handles the notion of incomplete actions --
like btree and friends do. The executor code still needs some work from me
-- Jie and I have dirtied things up while experimenting -- but we would
really like some review of the code so that this can get squared away
well before the 8.3 feature freeze.

One of the major deficiencies remaining is the lack of VACUUM support.
Heikki put his hand up for this and I'm holding him to it! ;-)

I will update the code tomorrow. The focus will be cleaning up the
executor modifications. Please look elsewhere for now.

Thanks,

Gavin



Re: [HACKERS] Bitmap index thoughts

2007-02-01 Thread Gavin Sherry
On Thu, 1 Feb 2007, Bruce Momjian wrote:


 Where are we on this patch?  Does it have performance tests to show
 where it is beneificial?  Is it ready to be reviewed?

I've got an updated patch which adds significant performance improvements
for worst-case data distributions. It also contains a rewrite of the WAL
code to handle incomplete actions.

I haven't worked on the stuff discussed below with Heikki. It's a lot of
work and probably more suitable for a second generation.

I've just got to finish testing the merge of Tom's operator family stuff
and then I'll send off the patch and performance figures.

Thanks,

Gavin



Re: [HACKERS] Data archiving/warehousing idea

2007-01-31 Thread Gavin Sherry
On Thu, 1 Feb 2007, Chris Dunlop wrote:

 G'day hackers,

G'Day Chris,

 already - I couldn't find anything in the mail archives, but
 that doesn't mean it's not there...)

There has been a lot of discussion about this kind of thing over the
years.

 The main idea is that, there might be space utilisation and
 performance advantages if postgres had hard read-only tables,
 i.e. tables which were guaranteed (by postgres) to never have
 their data changed (insert/update/delete).

 This could potentially save disk space by allowing book
 keeping elements in the page/tuple headers to be removed, e.g.
 visibility information etc.  Also, some indexes could
 potentially be packed tighter if we know the data will never
 change (of course this is already available using the fillfactor
 control).

Well, there is also CPU overhead in doing MVCC, but there are a few
fundamental problems that must be overcome. The most significant is that
no useful table is always read only, otherwise you could never load it.
What do we do in the presence of a failure during the load, or a
user-issued ABORT? I guess we'd truncate the table... What about replay
after a crash?

Another way of looking at it is that we use the 'bookkeeping' information
in the tuple header for concurrency and for handling transaction aborts.

 The idea would be to introduce a statement something like:

   ALTER TABLE foo SET ARCHIVE;

I'd not thought of that approach. There are two problems: first, some
archive tables are so large that loading them and then reprocessing them
isn't appealing. Secondly, we'd be rewriting the binary structure of the
table, and this does not suit the non-overwriting nature of Postgres's
storage system.

A different approach discussed earlier involves greatly restricting the
way in which the table is used. The table could only be written to while
an exclusive lock is held; on error or ABORT, the table is truncated.

The problem is that a lot of this looks like a hack, and I haven't seen a
very clean approach which has gone beyond a basic brain dump.

Thanks,

Gavin



Re: [HACKERS] Questions about warnings

2007-01-25 Thread Gavin Sherry
On Thu, 25 Jan 2007, Magnus Hagander wrote:

 I'm looking over the VC build trying to eliminate what warnings are
 left. One thing that appears in a couple of places is stuff like:

 .\src\bin\psql\print.c(2014): warning C4090: 'function' : different
 'const' qualifiers

Seems like other projects have encountered this problem. Looks like a
simple type difference. Casting to (void *) should fix it.

http://mirror.ethereal.com/lists/ethereal-dev/200502/msg00170.html

Gavin



Re: [HACKERS] New feature proposal

2007-01-24 Thread Gavin Sherry
On Wed, 24 Jan 2007, Sorin Schwimmer wrote:

 Dear Developers,

 I would like to suggest the inclusion of an extension
 in PostgreSQL. There are instances, I found, when one
 needs to INSERT several times the same record in a
 table. The front-end application can do it easy in a
 loop of a sort, but on remote servers (and that's the
 norm these days) it creates unnecessary network
 traffic.

 My suggestion is to allow INSERT to do it REPEAT x.
 This should allow, in my view, the followings:
 a) INSERT INTO my_table (field1, field2, field3)
VALUES (value1, value2, value3) REPEAT 5;

postgres=# create table baz (i int, j text);
CREATE TABLE
postgres=# insert into baz (i, j) select 1, 'hello' from
generate_series(1, 5);
INSERT 0 5
postgres=# select * from baz;
 i |   j
---+---
 1 | hello
 1 | hello
 1 | hello
 1 | hello
 1 | hello
(5 rows)

 b) INSERT INTO my_table (field1, field2, field3)
VALUES (x, value2/x, value3) REPEAT (x=3);
 should insert the followings:
 1, value2, value3
 2, value2/2, value3
 3, value2/3, value3

Yuk! Besides, it can be done similarly to the above.
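
For completeness, case (b) falls out of the same trick -- with value2 and
value3 standing in for the actual values, something like:

insert into my_table (field1, field2, field3)
  select x, value2 / x, value3
    from generate_series(1, 3) as x;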

 This suggestion comes for a practical project that I
 have.

Well, the good thing is, you can use generate_series() now! :-)

Thanks,

Gavin



Re: [HACKERS] Planning aggregates which require sorted or distinct

2007-01-20 Thread Gavin Sherry
On Sat, 20 Jan 2007, Simon Riggs wrote:

 On Sat, 2007-01-20 at 15:58 +1100, Gavin Sherry wrote:
  On Fri, 19 Jan 2007, Tom Lane wrote:
 
   Gavin Sherry [EMAIL PROTECTED] writes:
On Fri, 19 Jan 2007, Tom Lane wrote:
Er, what primary key would that be exactly?  And even if you had a key,
I wouldn't call joining on it trivial; I'd call it expensive ...
  
I should have used slightly different language. What I meant to say was,
both sets are primarily sorted by saledate so they can be merged back
together. This is why I said it was trivial.

 Yes, your fan out plan sounds best, together with the optimisation to
 remove whatever you call the individual strands of the fan-out. Think of
 a good name now, so we can discuss it more easily...

Hah. Yes. I've been using the term 'subplan' but it isn't a good one. Can
we just choose something 'cool' like iPlan2006? ;)

  Yep. I was thinking about this all morning. I think I've over engineered
  the problem in my head. Window function input just looks like a slightly
  more complex distinct aggregate input. I'll think on it more though.

 I'm working on modifying Tuplestore that will allow you to store the
 last N tuples, rather than all previous input. This is specifically
 designed to allow Merge Join to do Mark/Restore without materializing
 the whole sort set. This can also be used to materialize Windows (i.e.
 window clause in SQL:2003), so you can store the current row plus n
 PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the
 Window would then be similar-ish to doing a Mark/Restore pair, which we
 can rename to MarkWindowStart and ReturnToWindowStart.

Wow. What a coincidence! Windows are slightly more complex though. As you
probably know, there are two ways of specifying the window frame: by an
absolute number of rows (ROWS N PRECEDING, for example); or by a 'range'
(RANGE N PRECEDING), where the range, in the case of 'preceding', is
determined by subtracting the range parameter from the value of the current
row's ordering field -- i.e., the window attribute.
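
To make that concrete with the proposed syntax (illustrative only -- none
of this is implemented yet), using the earlier sales example:

-- physical frame: the current row plus the two rows sorted before it
sum(price) OVER (ORDER BY saledate ROWS 2 PRECEDING)

-- logical frame: every row whose saledate lies within the two days
-- preceding the current row's saledate
sum(price) OVER (ORDER BY saledate RANGE INTERVAL '2' DAY PRECEDING)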

This presents a problem in terms of managing the size of the buffer. If
you have data clustered around a value, then the amount of data input into
the window function explodes when we cross this value. The tuplestore code
can or will deal with this, but it is currently not designed for this kind
of usage pattern. That is, every time we advance by a row we must (a)
recalculate the multiset to be input to the window function and (b) generate
the value from the window function by passing the input to it. The problem
arises when the window contains more data than can be stored in work_mem.
Then, each time we need to recalculate the window function, we'll churn
data through the buffer and get no benefit from the buffering itself.

A lot of the research around window functions recommends offsetting this
effect by buffering data 'pre-aggregated' for each distinct value in the
buffer. Most of the research, however, relies on a trick we don't have
available in the SQL window function spec: windows in SQL can have a
partition (irrelevant here), a data sort order and a range; windows in the
world of streaming-data window function research have all this plus a
'slide'. The slide is the interval at which the aggregate is regenerated,
and in SQL the value is regenerated for every new row. The research
concedes that at this level, preaggregation either stops making sense or is
very complex.

I've come up with a way to be able to do it, but not for all aggregates
(median() for example). I'll discuss this in the proposal to be sent
through soon. The problem is, the requirements we have for buffering data
around window functions could be very complex.

 I'll present the prototype shortly, I've got a few bugs, but the basic
 principle is working already. I'm happy to craft that to your exact
 needs, so that you'll be able to press ahead with the rest of the
 implementation of Windowed functions.

Awesome. I will get the proposal off so that we can discuss the
requirements at length.

 The Materialize node is almost unchanged, but I was thinking of renaming
 it when used in this way to make the EXPLAIN more readable. Perhaps we
 should call it a Materialized Window for both the MJ and Window function
 cases.

I think that would be confusing in the case of MJ.

 This won't help with UNBOUNDED window definitions, but I imagine that
 these are best handled by running aggregates which would be an O(N)
 operation, rather than recalculating everything each time, which would
 be O(N^2).

Correct. You only need to recalculate the aggregate if it has a moving
window frame.

  To bring out a slightly different point -- and I know this is putting the
  cart before the horse -- but window functions will (potentially) output
  rows in the wrong order. I made a passing reference to this earlier. For
  example, say we have a table employees with the following data:
 
  empno | salary | age
  
  1 | 2000   | 50

Re: [HACKERS] Planning aggregates which require sorted or distinct

2007-01-20 Thread Gavin Sherry
On Sat, 20 Jan 2007, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  We want to answer the following: for each employee: what is their rank in
  terms of salary and what is their rank in terms of age. This query
  answers that:

  select empno, rank() over (order by salary) as srank,
rank() over (order by age) as arank
from employees order by empno;

 Eeek.  This seems like the worst sort of action-at-a-distance.  How does
 rank() know what value it's supposed to report the rank of?

This is a frustratingly inconsistent bit of the spec. Rank is defined as
follows:

RANK() OVER WNS is equivalent to:
   ( COUNT (*) OVER (WNS1 RANGE UNBOUNDED PRECEDING)
   - COUNT (*) OVER (WNS1 RANGE CURRENT ROW) + 1 )

Say the salary column has the following values: {100, 200, 200, 300}. This
would give the following output: {1, 2, 2, 4}. DENSE_RANK() would give:
{1, 2, 2, 3}.
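
Spelled out as a query over a salary column holding those four values (the
table name is made up, and the syntax is of course only proposed at this
point):

select salary,
       rank() over (order by salary) as rnk,
       dense_rank() over (order by salary) as dense_rnk
  from salaries;

 salary | rnk | dense_rnk
--------+-----+-----------
    100 |   1 |         1
    200 |   2 |         2
    200 |   2 |         2
    300 |   4 |         3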

These functions are pretty ugly (if you think about them in terms of our
existing aggregates). However, they are by far the most heavily used
window functions (along with ROW_NUMBER()).

Thanks,


Gavin



Re: [HACKERS] Planning aggregates which require sorted or distinct

2007-01-20 Thread Gavin Sherry
On Sat, 20 Jan 2007, Gregory Stark wrote:

 However for RANGE UNBOUNDED PRECEDING we can apply a different plan. Keep the
 state variable for each window aggregate around for the entire time. For each
 record apply the state transition function then apply the FINAL function to
 generate the result for that record but keep the state variable as it was for
 the next record.

Yes, I made reference to this case elsewhere in the email (or was it a
different email in the thread?).

Thanks,

Gavin




Re: [HACKERS] Planning aggregates which require sorted or distinct

2007-01-19 Thread Gavin Sherry
On Fri, 19 Jan 2007, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  What we want to do is have a kind of 'sub plan' for each aggregate. In
  effect, the plan might start looking like a directed graph.  Here is part
  of the plan as a directed graph.

 GroupAggregate
/-^---...
| |
| |
^ |
|   Unique
| ^
| |
  Sort   Sort
(saledate)(saledate,prodid)
^ ^
| |
-- Fan Out ...
  ^
  |
 Scan

  This idea was presented by Brian Hagenbuch at Greenplum. He calls it a
  'Fan Out' plan. It is trivial to rejoin the data because all data input to
  the aggregates is sorted by the same primary key.

 Er, what primary key would that be exactly?  And even if you had a key,
 I wouldn't call joining on it trivial; I'd call it expensive ...

I should have used slightly different language. What I meant to say was,
both sets are primarily sorted by saledate so they can be merged back
together. This is why I said it was trivial.

Thanks,

Gavin



Re: [HACKERS] Planning aggregates which require sorted or distinct

2007-01-19 Thread Gavin Sherry
On Sat, 19 Jan 2007, Karen Hill wrote:

 Gavin Sherry wrote:
  Recenly, I've been researching and putting together a proposal for window
  functions.

 Implementing NULLS FIRST and NULLS LAST appears like another
 challenging step to getting window functions wrapped up.  Has your
 research lead you to any ideas on what your strategy for NULLS FIRST
 and NULLS LAST likely would be?

Explicit handling of NULL is only in the Oracle extension to window
functions. I will look at the core SQL functionality first -- maybe
extended with 'LEAD' and 'LAG', because they are heavily utilised by
those using window functions.

Besides, Tom's been working on NULLS FIRST/LAST already.

gavin



Re: [HACKERS] Planning aggregates which require sorted or distinct

2007-01-19 Thread Gavin Sherry
On Fri, 19 Jan 2007, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  On Fri, 19 Jan 2007, Tom Lane wrote:
  Er, what primary key would that be exactly?  And even if you had a key,
  I wouldn't call joining on it trivial; I'd call it expensive ...

  I should have used slightly different language. What I meant to say was,
  both sets are primarily sorted by saledate so they can be merged back
  together. This is why I said it was trivial.

 Ah, my misunderstanding.  Then isn't this basically isomorphic to what
 I was thinking of, ie, somewhat-smarter Aggref nodes attached to the
 existing GroupAggregate plan node?

Yep. I was thinking about this all morning. I think I've over-engineered
the problem in my head. Window function input just looks like a slightly
more complex distinct aggregate input. I'll think on it more though.

To bring out a slightly different point -- and I know this is putting the
cart before the horse -- window functions will (potentially) output
rows in the wrong order. I made a passing reference to this earlier. For
example, say we have a table employees with the following data:

 empno | salary | age
-------+--------+-----
     1 |   2000 |  50
     2 |   6000 |  30
     3 |   3000 |  20

We want to answer the following: for each employee: what is their rank in
terms of salary and what is their rank in terms of age. This query
answers that:

select empno, rank() over (order by salary) as srank,
  rank() over (order by age) as arank
  from employees order by empno;

The result will be:

 empno | srank | arank
-------+-------+-------
     1 |     1 |     3
     2 |     3 |     2
     3 |     2 |     1

Both window functions produce results based on the order of their input.
So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
arank will output in this order: empno = 3, 2, 1. We need to glue these
back together, and the only way I can think of doing that is via a
synthetic key. Ideally, the planner would have some clue about how large
the result set will be and about the orderings coming out of the window
functions, so that it can decide whether to use a nested loop, merge join
or hash join to do it.
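
In SQL terms, the glue-back-together step is conceptually something like
the following (just a sketch of what the executor would be doing; here
empno happens to work as the key, but in general a synthetic one would
have to be generated):

select s.empno, s.srank, a.arank
  from (select empno, rank() over (order by salary) as srank
          from employees) s
  join (select empno, rank() over (order by age) as arank
          from employees) a on s.empno = a.empno
 order by s.empno;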

Can you think of a different approach?

Thanks,

Gavin



[HACKERS] Planning aggregates which require sorted or distinct input

2007-01-18 Thread Gavin Sherry
Recently, I've been researching and putting together a proposal for window
functions. I have not finished it yet, but when I do, I will post it. A nice
list of examples can be found here[1].

Rather than spend a lot of time talking about the problems window
functions present to the planner and executor, I'd like to bring up the
topic of an existing piece of SQL which is well understood and presents a
similar problem. Consider the following query:

select saledate, count(distinct prodid), count(distinct sellerid),
   sum(price) from sales group by 1;

The point here is that aggregates usually just receive the input that the
lower level of the plan generates. When qualified with distinct, however,
that changes. Notice that the counts on prodid and sellerid receive
unique input while the sum on price does not.

We do not create separate plan nodes to achieve this. Rather, it is a
property of the aggregation code inside the executor[2].

It seems to me that by creating actual plan nodes for this distinct step
we can improve the range of options for executing these types of queries.
For example, consider a more complex query than the above that
groups over a join using a join key of saledate, prodid (and that the
planner implements with a merge join). This means that the sort order is
preserved and count(distinct prodid) will receive sorted input. As far as
I can tell, however, the executor doesn't know this and but the planner
does. That is, the sort step inside the aggregate code is redundant.

Another way it could be improved is if we ever introduce a 'hash distinct'
execution node. This has been discussed before.

My interest here is not so much to do with distinct aggregates but,
rather, window functions. Window functions have this same problem as the
input to the functions is generally sorted by different keys.

So, hypothetically, let's say we wanted to create a plan for the above
query which had an explicit Sort - Unique 'branch' for each of the
distinct aggregates. This is actually difficult to represent with the
existing plan constructs, as it turns out.

Currently, the plan looks something like:


    GroupAggregate
          ^
          |
       Sort Op
          ^
          |
        Scan


What we want to do is have a kind of 'sub plan' for each aggregate. In
effect, the plan might start looking like a directed graph.  Here is part
of the plan as a directed graph.

          GroupAggregate
         /-------^-------- ...
         |               |
         |               |
         ^            Unique
         |               ^
         |               |
       Sort            Sort
    (saledate)   (saledate,prodid)
         ^               ^
         |               |
         ------ Fan Out ------ ...
                  ^
                  |
                Scan

This idea was presented by Brian Hagenbuch at Greenplum. He calls it a
'Fan Out' plan. It is trivial to rejoin the data because all data input to
the aggregates is sorted by the same primary key. Also, we could/would
optimize this to perform a single sort for those columns which require the
same sort order (saledate and sum(price) in the example query). An extra
step would be required for (some) window functions because they wont
necessarily preserve order. That's not important now.

An alternative approach would be a 'pipeline' design. This would fit into
the existing plan structure better. In this approach, each aggregate
would be a distinct step in the plan.


     Finalize (like Result)
               ^
               |
        Agg (sum(price))
               ^
               |
         Sort (saledate)
               ^
               |
      Agg (count(sellerid))
               ^
               |
 Sort (saledate, sellerid)/Unique
               ^
               |
       Agg (count(prodid))
               ^
               |
  Sort (saledate, prodid)/Unique
               ^
               |
             Scan

Now, this would actually work as follows: the input would be received from
the Scan node. We sort the input by the key (saledate, prodid) and produce
a unique result. That is input to the aggregate, and we produce a result for
a distinct grouping expression. We then pass the output of the aggregate
for this grouping expression up the tree along with a pointer to the scan
data. We do not discard the 

Re: [HACKERS] Design notes for EquivalenceClasses

2007-01-17 Thread Gavin Sherry
On Wed, 17 Jan 2007, Tom Lane wrote:

 strict.  However, we also allow equivalence clauses that appear below the
 nullable side of an outer join to form EquivalenceClasses; for these
 classes, the interpretation is that either all the values are equal, or
 all (except pseudo-constants) have gone to null.  (This requires a
 limitation that non-constant members be strict, else they might not go
 to null when the other members do.)  Consider for example

   SELECT *
 FROM a LEFT JOIN
  (SELECT * FROM b JOIN c ON b.y = c.z WHERE b.y = 10) ss
  ON a.x = ss.y
 WHERE a.x = 42;

 We can form the below-outer-join EquivalenceClass {b.y c.z 10} and thereby
 apply c.z = 10 while scanning c.  (The reason we disallow outerjoin-delayed
 clauses from forming EquivalenceClasses is exactly that we want to be able
 to push any derived clauses as far down as possible.)  But once above the
 outer join it's no longer necessarily the case that b.y = 10, and thus we
 cannot use such EquivalenceClasses to conclude that sorting is unnecessary
 (see discussion of PathKeys below).  In this example, notice also that
 a.x = ss.y (really a.x = b.y) is not an equivalence clause because its
 applicability to b is restricted by the outer join; thus we do not make
 the mistake of concluding b.y = 42, even though we do have an equivalence
 class for {a.x 42}.

This seems to be correct. A nice improvement.

 With a little further thought, it becomes apparent that nestloop joins
 can also produce sorted output.  For example, if we do a nestloop join
 between outer relation A and inner relation B, then any pathkeys relevant
 to A are still valid for the join result: we have not altered the order of
 the tuples from A.  Even more interesting, if there was an equivalence clause
 A.X=B.Y, and A.X was a pathkey for the outer relation A, then we can assert
 that B.Y is a pathkey for the join result; X was ordered before and still
 is, and the joined values of Y are equal to the joined values of X, so Y
 must now be ordered too.  This is true even though we used neither an
 explicit sort nor a mergejoin on Y.  (Note: hash joins cannot be counted
 on to preserve the order of their outer relation, because the executor
 might decide to batch the join, so we always set pathkeys to NIL for
 a hashjoin path.)  Exception: a RIGHT or FULL join doesn't preserve the
 ordering of its outer relation, because it might insert nulls at random
 points in the ordering.

I was thinking about this, but in relation to hash joins. A hash join
cannot be guaranteed to produce output sorted according to the pathkey of
the outer relation (as explained in the existing README). I wonder,
however, if it might be useful for the hash join to pass up a hint that
the output is known to be ordered (i.e., the join was not split into
multiple batches).

Thanks,

Gavin



Re: [HACKERS] Design notes for EquivalenceClasses

2007-01-17 Thread Gavin Sherry
On Wed, 17 Jan 2007, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  I was thinking about this, but in relation to hash joins. A hash join
  cannot be guaranteed to produce output sorted according to the pathkey of
  the outer relation (as explained in the existing README). I wonder,
  however, if it might be useful for hash join to pass a hint that the
  output is known ordered (i.e., the join was not split into multiple
  batches).

 Yeah, I've considered that, but I think it'd have to be the other way
 around: the planner tells the executor that it's assuming the output is
 sorted, hence do not split into multiple batches.  This has the usual
 assortment of problems if the planner has badly misestimated the
 rowcount :-(

Yep, I thought of that and discarded it for the reason you state.

I still think there would be some benefit to passing a hint up the
execution tree, effectively turning explicit sorts into no-ops. This,
however, breaks the major rule in the executor: do whatever the plan
tells you to do.

Thanks,

Gavin



Re: [HACKERS] ideas for auto-processing patches

2007-01-10 Thread Gavin Sherry
On Wed, 10 Jan 2007, Jim C. Nasby wrote:

 On Thu, Jan 11, 2007 at 08:04:41AM +0900, Michael Glaesemann wrote:
  Wouldn't there be some value to knowing whether the patch failed
  due to
  bitrot vs it just didn't work on some platforms out of the gate?
 
  I'm having a hard time figuring out what that value would be. How
  would that knowledge affect what's needed to fix the patch?

 I was thinking that knowing it did work at one time would be useful, but
 maybe that's not the case...

It might be useful to patch authors who submit code which remains
unreviewed for some time. Then the submitter or reviewer will be able to
know at what date the code drifted. This might be easier than looking
through the commit history and trying to locate the problem, I think.

Still, the more interesting thing to me would be the ability to provide in
the patch a set of custom tests inside the regression test framework which
aren't suitable as RTs in the long term but can tell the patch author
whether their code works correctly on a variety of platforms.

Thanks,

Gavin



Re: [HACKERS] [PATCHES] SGML index build fix

2007-01-07 Thread Gavin Sherry
On Sun, 7 Jan 2007, Tom Lane wrote:

 Peter Eisentraut [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  Perhaps even more to the point, what makes you think that someone
  will notice the warning?  If the docs build is one step in an
  automated build process, this seems unlikely.

  Taking a closer look, it's pretty much guaranteed that no one will see
  them, because the targets they are attached to are intermediate,
  normally followed by latex runs.

 If we think this is a problem, ISTM the correct answer is to just force
 a repeat jade run when doing make all.  The only objection to that
 AFAICS is that when you're doing docs work and only need a draft to
 look at, you'd rather it not run twice.  But perhaps we could address
 that by providing a separate target, make draft say, that runs jade
 but once.

That's a nice approach. Those working on the docs will know about the
draft target and those just wanting to build the docs for publication will
get the result.

Gavin



Re: [HACKERS] ideas for auto-processing patches

2007-01-04 Thread Gavin Sherry
On Thu, 4 Jan 2007 [EMAIL PROTECTED] wrote:

 1. Pull source directly from repositories (cvs, git, etc.)  PLM
 doesn't really track actually scm repositories.  It requires
 directories of source code to be traversed, which are set up by
 creating mirrors.

It seems to me that a better approach might be to mirror the CVS repo --
or at least make that an option -- and pull the sources locally. Having to
pull down 100MB of data for every build might be onerous to some build
farm members.

Thanks,

Gavin



Re: [HACKERS] ideas for auto-processing patches

2007-01-04 Thread Gavin Sherry
On Thu, 4 Jan 2007, Alvaro Herrera wrote:

 Gavin Sherry wrote:
  On Thu, 4 Jan 2007 [EMAIL PROTECTED] wrote:
 
   1. Pull source directly from repositories (cvs, git, etc.)  PLM
   doesn't really track actually scm repositories.  It requires
   directories of source code to be traversed, which are set up by
   creating mirrors.
 
  It seems to me that a better approach might be to mirror the CVS repo --
  or at least make that an option -- and pull the sources locally. Having to
  pull down 100MB of data for every build might be onerous to some build
  farm members.

 Another idea is using the git-cvs interoperability system, as described
 here (albeit with SVN, but the idea is the same):

 http://tw.apinc.org/weblog/2007/01/03#subverting-git

It seems like that will just add one more cog to the machinery for no
extra benefit. Am I missing something?


 Now, if we were to use a distributed system like Monotone this sort of
 thing would be completely a non-issue ...

Monotone is so 2006. The new new thing is mercurial!

Gavin



Re: [HACKERS] ideas for auto-processing patches

2007-01-04 Thread Gavin Sherry
On Thu, 4 Jan 2007, Andrew Dunstan wrote:

 Gavin Sherry wrote:
  On Thu, 4 Jan 2007 [EMAIL PROTECTED] wrote:
 
  1. Pull source directly from repositories (cvs, git, etc.)  PLM
  doesn't really track actually scm repositories.  It requires
  directories of source code to be traversed, which are set up by
  creating mirrors.
 
  It seems to me that a better approach might be to mirror the CVS repo --
  or at least make that an option -- and pull the sources locally. Having to
  pull down 100MB of data for every build might be onerous to some build
  farm members.
 


 I am not clear about what is being proposed. Currently buildfarm syncs
 against (or pulls a fresh copy from, depending on configuration) either
 the main anoncvs repo or a mirror (which you can get using cvsup or rsync,
 among other mechanisms). I can imagine a mechanism in which we pull
 certain patches from a patch server (maybe using an RSS feed, or a SOAP
 call?) which could be applied before the run. I wouldn't want to couple
 things much more closely than that.

With PLM, you could test patches against various code branches. I'd
guess Mark would want to provide this capability. Pulling branches from
anoncvs regularly might be burdensome bandwidth-wise. So, like you say, a
local mirror would be beneficial for patch testing.

 The patches would need to be vetted first, or no sane buildfarm owner will
 want to use them.

It would be nice if there could be a class of trusted users whose patches
would not have to be vetted.

Thanks,

Gavin



Re: [HACKERS] ideas for auto-processing patches

2007-01-04 Thread Gavin Sherry
On Thu, 4 Jan 2007, Andrew Dunstan wrote:

 Gavin Sherry wrote:
 
  With PLM, you could test patches against various code branches. I'd
  guessed Mark would want to provide this capability. Pulling branches from
  anonvcvs regularly might be burdensome bandwidth-wise. So, like you say, a
  local mirror would be beneficial for patch testing.


 I think you're missing the point. Buildfarm members already typically have
 or can get very cheaply a copy of each branch they build (HEAD and/or
 REL*_*_STABLE).  As long as the patch feed is kept to just patches which
 they can apply there should be no great bandwidth issues.

Right... my comment was more for Mark.

  It would be nice if there could be a class of trusted users whose patches
  would not have to be vetted.
 
 

 Beyond committers?

Hmmm... good question. I think so. I imagine the group would be small
though.

Thanks,

Gavin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Dead Space Map for vacuum

2006-12-28 Thread Gavin Sherry
On Thu, 28 Dec 2006, Heikki Linnakangas wrote:

 ITAGAKI Takahiro wrote:
  Hello,
 
  NTT staffs are working on TODO item:
  | Create a bitmap of pages that need vacuuming
 
  We call the bitmap Dead Space Map (DSM), that allows VACUUM to scan
  only pages that need vacuuming or freezing. We'd like to discuss the
  design on hackers and make agreements with community.

 Great!

I agree!


  We implemented the basic parts of it and measured the performance.
  As expected, VACUUM took shorter time when the fraction of updates are low.
  DSM is useful for large but not so heavily-updated databases. The overhead
  of managing DSM seemed to be negligible small in our CPU-loaded tests.
 
  There are a lot of choices in implementation. We followed the descriptions
  in TODO list and past discussions in some parts, but did not in other parts
  for some reasons. I would appreciate your comments and suggestions.

 I experimented with a different DSM design last winter. I got busy with
 other things and never posted it AFAIR, but the idea was to store a
 bitmap in the special area on every 32k heap page. That had some advantages:

 * doesn't require a new dedicated shared memory area that needs to be
 allocated and tuned.
 * doesn't introduce a (single) new global lwlock that might become hotspot.
 * WAL logging is quite simple, since the bitmaps are on normal pages
 handled by buffer manager.

I too have experimented with this idea. 4 DSM pages means 2 bits per
block. What were you using the other bit for? When I discussed this some
time ago with Jan Wieck, we recommended tracking blocks in the FSM with
this extra bit.

One problem this approach may have is contention for the DSM pages and the
granularity of the lock associated with it. Locking a DSM page to update
one bit will affect writes to 32K pages. This might be less of a problem
than I think because of the very low cost of setting the bit.

  | In the event of a system crash, the bitmap would probably be invalidated.
 
  We bought it for simplicity. Currently, all DSM are lost after crash.
  All tables will be untracked status. Full-scanned vacuum is needed
  after the lost of DSM.

 If you flush the DSM to disk at checkpoint, it should be easy to bring
 it up-to-date on WAL replay. Having to do full-scanning vacuum after
 crash can be a nasty surprise.

I agree. One interesting insight is, we have 4 bits left over (those not
used for the 4 DSM blocks). We might use those status bits for some
purpose. For example, when a page is dirtied and we update the DSM page,
we set a DSM dirtied bit. Checkpoint would then only flush DSM pages
dirtied since the last checkpoint. This is important in the presence of
lots of read-only tables.
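
To make the bookkeeping concrete, here is a rough standalone sketch of what
I mean: two status bits per heap block packed into a DSM page, plus the
per-page dirtied flag so that checkpoint can skip clean DSM pages. None of
this is proposed backend code -- the names and sizes are made up purely for
illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DSM_PAGE_BYTES   8192                       /* one 8K page of bitmap space */
#define BLOCKS_PER_DSM   (DSM_PAGE_BYTES * 8 / 2)   /* 2 status bits per heap block */

#define DSM_NEEDS_VACUUM 0x1
#define DSM_NEEDS_FREEZE 0x2

typedef struct DsmPage
{
    uint8_t bits[DSM_PAGE_BYTES];   /* 2 status bits per tracked heap block */
    bool    dirty;                  /* set whenever a bit changes; checkpoint
                                     * flushes only dirty DSM pages */
} DsmPage;

/* Set a status flag for heap block 'blkno' within the range this page covers. */
static void
dsm_set(DsmPage *page, uint32_t blkno, uint8_t flag)
{
    uint32_t slot  = blkno % BLOCKS_PER_DSM;
    uint32_t byte  = slot / 4;              /* four 2-bit entries per byte */
    uint32_t shift = (slot % 4) * 2;

    page->bits[byte] |= (uint8_t) (flag << shift);
    page->dirty = true;
}

static uint8_t
dsm_get(const DsmPage *page, uint32_t blkno)
{
    uint32_t slot = blkno % BLOCKS_PER_DSM;

    return (page->bits[slot / 4] >> ((slot % 4) * 2)) & 0x3;
}

int
main(void)
{
    DsmPage page;

    memset(&page, 0, sizeof(page));
    dsm_set(&page, 1234, DSM_NEEDS_VACUUM);
    printf("block 1234 flags: %u, DSM page dirty: %d\n",
           (unsigned) dsm_get(&page, 1234), (int) page.dirty);
    return 0;
}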

  | One complexity is that index entries still have to be vacuumed, and doing
  | this without an index scan (by using the heap values to find the index 
  entry)
  | might be slow and unreliable, especially for user-defined index functions.
 
  Indexes are still needed to be full-scanned at the present moment. We are
  also researching a retail index vacuum method, but it requires more works.

 Yeah, that's an old story :(.

 BTW: Yesterday I realized that bitmap indexes could be retail vacuumed
 safely. You'll still need to visit all bitmaps to find the dead bit, but
 you only need to check the bitmap page that corresponds the tid of the
 dead tuple.

Cool. The problem is solving it for the other AMs, particularly btree.


  | http://archives.postgresql.org/pgsql-hackers/2004-03/msg00957.php
  | Maintain 2 bits per block that tell if the block has been vaccumed of all
  | dead tuples since the last time it was dirtied, and if all its tuples are
  | completely frozen.
 
  We use 1 bit per block, so we cannot separate pages that need either
  vacuum or freeze only. The reason is that we cannot determine where to
  record before/after updated tuples. If the transaction is commited,
  before-version should be recorded into needs-vacuum bitmap and
  after-version into needs-freeze bitmap. But on rollback, it is inverted.
  We cannot judge which we should do until the end of transaction.

 Yeah, that makes the DSM less useful. Are you thinking of freezing more
 aggressively because of that? Or doing a full-scanning vacuum every now
 and then to freeze?

I don't see any problem with moving to two status bits per block in the
DSM-in-heap model, so we can track this better. What do you think?

Thanks,

Gavin

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bitmap index thoughts

2006-12-27 Thread Gavin Sherry
On Wed, 27 Dec 2006, Heikki Linnakangas wrote:

 Gavin Sherry wrote:
  On Tue, 26 Dec 2006, Heikki Linnakangas wrote:
  for typical bitmap index use cases and most of the needed pages should
  stay in memory, but could we simplify this? Why do we need the auxiliary
  heap, couldn't we just store the blk+offset of the LOV item directly in
  the b-tree index item?
 
  The problem is, the b-tree code is very much tied to the heap. I don't
  want to modify the b-tree code to make bitmap indexes work (better).
  What's really tempting is to just manage our own balanced tree within the
  bitmap index file(s) itself. It would start from the metapage and simply
  spill to other 'special' index pages when necessary. The problem is, we do
  not have b-tree code generic enough that it would allow us to do this
  trivially -- consider concurrency and WAL in particular, which we
  currently get for free. I guess this is why I've been ignoring this issue
  :-).

 Maybe we could reuse the code in ginbtree.c. Looks like Teodor & Oleg
 had the same problem :).

 Modifying the nbtree code doesn't seem that difficult either. AFAICS,
 the only places where the heap is from within nbtree code is in index
 building and uniqueness checks.

I'll take a look and think about it.


  And instead of having separate LOV pages that store a number of LOV
  items, how about storing each LOV item on a page of it's own, and using
  the rest of the page to store the last chunk of the bitmap. That would
  eliminate one page access, but more importantly, maybe we could then get
  rid of all the bm_last_* attributes in BMLOVItemData that complicate the
  patch quite a bit, while preserving the performance.
 
  That's an interesting approach. We would still need a concept of
  last_word, at the very least, and probably last_comp_word for convenience.

 Why?

The last two words of the bitmap vector, stored in the LOV item, provide a
few things: they buffer the addition of new matches in the vector and they
ease the calculation of the current position of the end of the vector.
Jie, do you want to add more?

I think we could do this differently by calculating everything based on
the other data stored in the LOV item and data page (last_tid_location and
num_hrl_words_used, from memory). But, I'm not sure. Jie?

Thanks,

Gavin

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bitmap index thoughts

2006-12-27 Thread Gavin Sherry
On Wed, 27 Dec 2006, Heikki Linnakangas wrote:

 Jie Zhang wrote:
  The bitmap data segment sounds good in terms of space. The problem is that
  one bitmap is likely to occupy more pages than before, which may hurt the
  query performance.

 We could have segments of say 1/5 of page. When a bitmap grows larger
 than that, the bitmap would be moved to a page of its own. That way we
 wouldn't get unnecessary fragmentation with large bitmaps, but small
 bitmaps would be stored efficiently.

Yes.


  I have been thinking along the lines of increasing the
  number of last bitmap words stored in each LOV item, but not to occupy one
  page. This may prevent some cases Gavin indicated here, but not all.

 That sounds like more special cases and complexity. I like the segment
 idea more.

 But actually I'm not convinced we need to worry about efficient storage
 of small bitmaps at all. The typical use case for bitmap indexes is
 large tables with small number of distinct values, and the problem
 doesn't really arise in that scenario. Let's keep it simple for now, we
 can enhance it in later releases.

The scenario I'm concerned about is where a sales database, say, has
100,000 products. However, only 500 or 1000 products are popular. They
dominate, say, 99% of the sales. The other 99,900 products consume a
little over 8K each for very little benefit :-( -- at 8K apiece that is on
the order of 800 MB of mostly empty pages.

This is pretty contrived but it seems real-world enough...

Gavin

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bitmap index thoughts

2006-12-26 Thread Gavin Sherry
Hey Heikki,

On Tue, 26 Dec 2006, Heikki Linnakangas wrote:

 I've been skimming through the bitmap index patch...

 A scan needs to access at least five pages:

 1. B-tree index (root+others, depending on depth)
 2. The auxiliary heap page
 3. bitmap index meta page
 4. LOV page
 5. bitmap page

 That seems like a lot of indirection. A high startup cost is probably ok

You're right, this is excessive and it was something I'd hoped to have
ripped out, but...

 for typical bitmap index use cases and most of the needed pages should
 stay in memory, but could we simplify this? Why do we need the auxiliary
 heap, couldn't we just store the blk+offset of the LOV item directly in
 the b-tree index item?

The problem is, the b-tree code is very much tied to the heap. I don't
want to modify the b-tree code to make bitmap indexes work (better).
What's really tempting is to just manage our own balanced tree within the
bitmap index file(s) itself. It would start from the metapage and simply
spill to other 'special' index pages when necessary. The problem is, we do
not have b-tree code generic enough that it would allow us to do this
trivially -- consider concurrency and WAL in particular, which we
currently get for free. I guess this is why I've been ignoring this issue
:-).

 And instead of having separate LOV pages that store a number of LOV
 items, how about storing each LOV item on a page of it's own, and using
 the rest of the page to store the last chunk of the bitmap. That would
 eliminate one page access, but more importantly, maybe we could then get
 rid of all the bm_last_* attributes in BMLOVItemData that complicate the
 patch quite a bit, while preserving the performance.

That's an interesting approach. We would still need a concept of
last_word, at the very least, and probably last_comp_word for convenience.
Your idea doesn't require any extra space, either, which is good.
Something I've been working through is the idea of a 'bitmap data
segment'. Currently, we store the HRL compressed bitmap data to the extent
of the page. That is, sizeof(BMBitmap) is as close to BLCKSZ as we can
make it. The problem is that we may have some bitmaps where a few values
occur only a small number of times and are well clustered at the beginning
of the heap. In that circumstance, we use a whole page to store a small
number of words and the free space cannot be used by any other vector.
Now, if a segment were some fraction of the size of BLCKSZ, we would use
less space for those vectors with few tuples in the heap. Just an idea...
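
For anyone following along who hasn't read the patch: here is a very rough
standalone sketch of the word-aligned run-length idea behind HRL as I
understand it. The real format packs the fill/literal header bits into the
bitmap words themselves; the struct layout and names below are purely
illustrative.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified word-aligned run-length encoding.  A real HRL/WAH vector packs
 * the header bits into the word itself; here they live in a small struct so
 * the mechanics are easier to see. */
typedef struct HrlWord
{
    bool     is_fill;   /* run of identical all-zero or all-one words? */
    bool     fill_bit;  /* bit value of the run, if is_fill */
    uint32_t value;     /* literal bits, or run length in words for a fill */
} HrlWord;

/* Compress 'nwords' literal bitmap words into 'out'; returns entries used. */
static int
hrl_compress(const uint32_t *in, int nwords, HrlWord *out)
{
    int n = 0;

    for (int i = 0; i < nwords; i++)
    {
        bool all0 = (in[i] == 0);
        bool all1 = (in[i] == UINT32_MAX);

        if ((all0 || all1) && n > 0 && out[n - 1].is_fill &&
            out[n - 1].fill_bit == all1)
            out[n - 1].value++;            /* extend the current run */
        else if (all0 || all1)
        {
            out[n].is_fill = true;         /* start a new run */
            out[n].fill_bit = all1;
            out[n].value = 1;
            n++;
        }
        else
        {
            out[n].is_fill = false;        /* store the word verbatim */
            out[n].fill_bit = false;
            out[n].value = in[i];
            n++;
        }
    }
    return n;
}

int
main(void)
{
    /* a sparse, well-clustered vector: a couple of set words, then zeroes */
    uint32_t raw[1024] = { 0x0000ff00, 0x80000001 };
    HrlWord  comp[1024];

    printf("1024 raw words compressed to %d HRL words\n",
           hrl_compress(raw, 1024, comp));
    return 0;
}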

Thanks,

Gavin

PS: Another version of the patch shall be forthcoming shortly. I've been
working on compressing the data in memory during CREATE INDEX instead of
just managing arrays of TIDs in memory as we did previously. The array of
TIDs works great for well-clustered data but it stinks for poorly
clustered data as we approach maintenance_work_mem and have to swap a lot.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Question about debugging bootstrapping and catalog

2006-12-18 Thread Gavin Sherry
On Mon, 18 Dec 2006, Gregory Stark wrote:


 I've been fooling with catalog entries here and I've obviously done something
 wrong. But I'm a bit frustrated trying to debug initdb. Because of the way it
 starts up the database in a separate process I'm finding it really hard to
 connect to the database and get a backtrace. And the debugging log is being
 spectacularly unhelpful in not telling me where the problem is.

 Are there any tricks people have for debugging bootstrapping processing? I
 just need to know what index it's trying to build here and that should be
 enough to point me in the right direction:

 creating template1 database in /var/tmp/db7/base/1 ... FATAL:  could not 
 create unique index
 DETAIL:  Table contains duplicated values.


Not much fun. Run src/include/catalog/duplicate_oids first.

Thanks,

Gavin

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bitmap index status

2006-10-18 Thread Gavin Sherry
On Wed, 18 Oct 2006, Heikki Linnakangas wrote:

 Hi,

 I don't want to harass you :), but what's the status with the bitmap
 index code? Is there something I can do to help?


Hi Heikki,

The streaming is implemented, as are range queries. I need to bring it up
to HEAD and back-patch to bizgres since... it's now diverged fairly
significantly from that code base.

Two outstanding items are vacuum handling and, possibly, a bitmap
selectivity function, but I haven't really looked into the latter.

Once I bring it up to HEAD I'll post.

Thanks,

Gavin


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] pg_internal.init is hazardous to your health

2006-10-17 Thread Gavin Sherry
On Tue, 17 Oct 2006, Tom Lane wrote:

 Dirk Lutzebaeck and I just spent a tense couple of hours trying to
 figure out why a large database Down Under wasn't coming up after being
 reloaded from a base backup plus PITR recovery.  The symptoms were that
 the recovery went fine, but backend processes would fail at startup or
 soon after with could not open relation XX/XX/XX: No such file type of
 errors.

 The answer that ultimately emerged was that they'd been running a
 nightly maintenance script that did REINDEX SYSTEM (among other things
 I suppose).  The PITR base backup included pg_internal.init files that
 were appropriate when it was taken, and the PITR recovery process did
 nothing whatsoever to update 'em :-(.  So incoming backends picked up
 init files with obsolete relfilenode values.

Ouch.

 We don't actually need to *update* the file, per se, we only need to
 remove it if no longer valid --- the next incoming backend will rebuild
 it.  I could see fixing this by making WAL recovery run around and zap
 all the .init files (only problem is to find 'em), or we could add a new
 kind of WAL record saying remove the .init file for database XYZ
 to be emitted whenever someone removes the active one.  Thoughts?

The latter seems the Right Way except, I guess, that the decision to
remove the file is buried deep inside inval.c.

Thanks,

Gavin

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bitmap index status

2006-09-26 Thread Gavin Sherry
On Tue, 26 Sep 2006, Heikki Linnakangas wrote:

 Looks a bit better now, though I think you need to think more about the
 encapsulation of the structs. More detailed comments below.

 Jie Zhang wrote:
  Essentially, we want to have a stream bitmap object that has an iterator,
  which will be able to iterate through the bitmaps. The BitmapIndexscan,
  BitmapAnd, BitmapOr will be executed once and return a streamp bitmap or a
  hash bitmap. The BitmapHeapscan then calls tbm_iterate() to iterate
  through
  the bitmaps.
 
  The StreamBitmap structure will look like below.
 
  struct StreamBitmap {
  NodeTag type; /* to make it a valid Node */
  PagetableEntry entry; /* a page of tids in this stream bitmap */
 
  /* the iterator function */
  void (*next)(StreamBitmap*);
  Node* state; /* store how this stream bitmap generated,
  and all necessary information to
  obtain the next stream bitmap. */
  };

 I'd suggest making state just a (void *). It's private to the producer
 of the bitmap, and I don't see a reason to expose it. I assume that the
 next-function fills in the PageTableEntry with the next set of tids.

  Two new state objects will look like below. At the same time, we introduce
  three new node types: T_StreamBitmapAND, T_StreamBitmapOR,
  And T_StreamBitmapIndex, to define different states.
 
  /*
  * Stores the necessary information for iterating through the stream
  bitmaps
  * generated by nodeBitmapAnd or nodeBitmapOr.
  */
  struct StreamBitmapOp {
  NodeTag type; /* handles T_StreamBitmapAND and T_StreamBitmapOR */
  List* bitmaps;
  };

 AFAICS, this struct is private to tidbitmap.c, where the new
 stream-enabled tbm_intersect etc. functions are defined. Also, I don't
 see a reason why it needs to by a valid Node.

Well, making it a valid node makes it easy to identify (IsA) and gives us
access to the copy/equal frameworks. I do think that it is best to bury
this in the bitmap code, however.

  * Stores some necessary information for iterating through the stream
  * bitmaps generated by nodeBitmapIndexscan.
  */
  struct StreamBitmapIndex {
  NodeTag type; /* handle T_StreamBitmapIndex */
  IndexScanDesc scan;
  BlockNumber nextBlockNo;/* next block no to be read */
  };

 Where would this struct be defined? I think different index access
 methods might want to have different kind of states. The struct above
 assumes that the position of an index scan is always represented by the
 nextBlockNo. That seems to be the right thing for the bitmap indexam, so
 this struct is fine for StreamBitmaps returned by bmgetbitmap, but not
 necessary for others.

right.


  Then we will have the iterator functions like the following:
 
  ...
 
  void StreamBitmapIndexNext(StreamBitmap* node) {
   StreamBitmapIndex* sbi = (StreamBitmapIndex*) node->state;
   amgetbitmap(sbi->scan, NULL, sbi->nextBlockNo);
  }

 This means that the amgetbitmap function would still be called many
 times in each scan.  What would amgetbitmap return? A new StreamBitmap
 on each call?

 I'd like to see just one call to amgetbitmap. It would a) fill in the
 hash bitmap passed to it, b) return a new hash bitmap, or c) return a
 new StreamBitmap, with a indexam specific next-function that returns the
 tids one page at a time. I think we'll also need some kind of a
 destructor in StreamBitmap that's called by the consumer of the bitmap
 after it's done with it.

Right, I agree. I am working on this now.
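
To show the shape of what I'm working towards, here is a tiny standalone
sketch of the callback design discussed above: the producer fills in one
page of matches per call through a function pointer and keeps its own
private state, and the consumer simply loops. This is not the patch's code
-- apart from the names borrowed from the quoted proposal, everything is
invented for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One page worth of matches handed from producer to consumer. */
typedef struct PageMatches
{
    uint32_t blockno;           /* heap block these offsets belong to */
    int      noffsets;
    uint16_t offsets[32];
} PageMatches;

typedef struct StreamBitmap StreamBitmap;
struct StreamBitmap
{
    /* fill 'entry' with the next page of matches; false when exhausted */
    bool        (*next) (StreamBitmap *sb);
    void         *state;        /* private to the producer (index AM, AND/OR node) */
    PageMatches   entry;
};

/* A toy producer: pretends every 10th heap block has offsets 1..3 matching. */
typedef struct ToyState
{
    uint32_t next_block;
    uint32_t last_block;
} ToyState;

static bool
toy_next(StreamBitmap *sb)
{
    ToyState *st = (ToyState *) sb->state;

    if (st->next_block > st->last_block)
        return false;
    sb->entry.blockno = st->next_block;
    sb->entry.noffsets = 3;
    for (int i = 0; i < 3; i++)
        sb->entry.offsets[i] = (uint16_t) (i + 1);
    st->next_block += 10;
    return true;
}

int
main(void)
{
    ToyState     st = { 0, 50 };
    StreamBitmap sb = { toy_next, &st, { 0 } };

    /* the consumer side: roughly what tbm_iterate()'s caller would look like */
    while (sb.next(&sb))
        printf("heap block %u: %d matching offsets\n",
               (unsigned) sb.entry.blockno, sb.entry.noffsets);
    return 0;
}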

Thanks,

gavin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] pg_upgrade: downgradebility

2006-09-20 Thread Gavin Sherry
On Wed, 20 Sep 2006, Andrew Sullivan wrote:

 On Wed, Sep 20, 2006 at 12:54:14PM +0200, Zdenek Kotala wrote:
  My first question is how important is downgrade for You and Your customers?
 
 
  And second is how to verify that downgrade is possible?

 Well, one way to do it is to set up a Slony replica using the older
 version of the software.  So, if you've upgraded to 8.1.x, you
 replicate to an old 8.0.x back end as well.  If 8.1 doesn't work for
 you, you just MOVE everything back to the 8.0.x back end, and you're
 golden.

Well, I think that people who really want downgrade in such a tool are
those for whom Slony replication is just not an option; that is, data in
the range of hundreds of gigabytes. Using Slony to upgrade is often not
practical either.

I wonder if pg_upgrade could be designed in such a way that upgrade is the
same as downgrade from a development point of view. That is, the tool can
change the system from one binary format to another.

Thanks,

Gavin

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bitmap index status

2006-09-19 Thread Gavin Sherry
On Tue, 19 Sep 2006, Heikki Linnakangas wrote:

 Jie Zhang wrote:
  Hi Heikki and all,
 
  Please find the latest bitmap index patch in the attachment. This patch is
  generated against the postgresql cvs head.
 

 Thanks.

 The handling of stream and hash bitmaps looks pretty complicated to me.
 All the bitmap-related nodes have logic to handle both types slightly
 differently. It all seems to come down to that if a subnode (or
 amgetbitmap in a bitmap index scan node) returns a StreamBitmap, the
 caller needs to call the subnode many times, until it returns a NULL.
 With a HashBitmap, the caller only calls the subnode once.

 I think amgetbitmap should be called just once per index scan, and it
 should return either a hash bitmap or a stream bitmap. The same applies
 to all the executor nodes that return bitmaps, they would only return a
 single HashBitmap or StreamBitmap, and the upper node would call
 tbm_iterate repeatedly on that.

 StreamBitmap would contain a callback (filled by the indexam) that
 tbm_iterate would call to fill the next TBMIterateResult.

Right, this was the approach taken by an earlier version of the patch I
had worked on. It was significantly uglified by the need to keep the index
state around and to be careful about what amrescan might do behind our
back. I will, however, introduce the idea again because it makes the code
much cleaner and more logical, as you seem to suggest.

Thanks,

Gavin

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Timezone List

2006-09-06 Thread Gavin Sherry
On Wed, 6 Sep 2006, Tom Lane wrote:

 Martijn van Oosterhout kleptog@svana.org writes:
  In the CVS version there is a table with this information:
  http://developer.postgresql.org/pgdocs/postgres/view-pg-timezonenames.html

 Actually, what that view gives you is timezone offset abbreviations, not
 the full zone names that you could use with SET TIME ZONE.  It strikes
 me that we should have a view for that as well.  We could use code
 similar to scan_available_timezones() to generate the view output.

 It's somewhat urgent to address this now, because pg_timezonenames is
 sitting on the obvious name for such a view, and once we release 8.2
 we won't be able to change it.  On reflection I think the existing view
 is wrongly named --- perhaps it should be pg_timezoneabbrevs?  Or
 more readably, perhaps pg_timezone_abbrevs, with pg_timezone_names for
 the other view.

I think 'abbrev' is a little unintuitive. How about 'short_names'?

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] Trivial patch to double vacuum speed

2006-09-04 Thread Gavin Sherry
On Mon, 4 Sep 2006, Joshua D. Drake wrote:


  I don't have a concrete proposal to make, but I do think that the
  current patch-queue process is not suited to the project as it stands
  today.  Maybe if this issue-tracking stuff gets off the ground, we
  could let developers place ACK or NAK flags on patches they've looked
  at, and have some rule about ACK-vs-NAK requirements for something to go
  in.

 How about *requiring* test cases that prove the patch?

People including regression tests is not a replacement for code review.
For a non-trivial patch, an SQL test will only exercise a few code paths.
Moreover, it won't say anything about code quality, maintainability, or
general correctness and completeness. It will still have to be reviewed.

Thanks

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Getting a move on for 8.2 beta

2006-09-01 Thread Gavin Sherry
On Fri, 1 Sep 2006, Tom Lane wrote:

 My feeling is that we ought to bounce bitmap indexes and updatable views
 as not being ready, accept all the contrib stuff, and try to get the
 other items done in time for a beta at, say, the end of next week.

For what it's worth, Jie and I hope to have finished the bitmap streaming
this weekend, which is the only outstanding issue with the code as it
currently stands. I now realise (nothing like hindsight) that we should
have posted a patch before we looked at streaming bitmaps, because they've
proved more fiddly to implement than I thought :-(. At that point, we'd
more or less addressed the issues I'd raised when I posted at feature
freeze time.

Thanks,

Gavin

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES] WIP: bitmap indexes

2006-08-14 Thread Gavin Sherry
On Mon, 14 Aug 2006, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  Attached is an update to the patch implementing bitmap indexes Jie sent
  last week.

 What's the current status of this patch ... has any work been done since
 the first of the month?

Yes. I am tidying up the executor code at the moment so that the on-disk
bitmap support is more or less hidden from the executor itself, and
Jie has been fixing multicolumn support and working on the planner. I've
also brought the code into line with PostgreSQL style, etc.


 I suppose the patch as given here no longer applies to HEAD, because of
 recent changes in rmgr for Simon's restartable-recovery patch (should be
 easy enough to fix) and consumption of OIDs by other patches (also easy
 to fix, in principle, but doubtless tedious).

Right.

 If you have to renumber OIDs I'd suggest starting at 3000 for your added
 OIDs, to leave some daylight for any other patches that go in during the
 next couple weeks.

Thanks.

I will post an updated patch in a few days time.

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] WIP: bitmap indexes

2006-08-14 Thread Gavin Sherry
On Mon, 14 Aug 2006, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  I will post an updated patch in a few days time.

 OK.  Do you want me to work on the discussed amgetmulti change, or would
 that just be joggling your elbow at the moment?

Yes, that would be joggling ;).

The approach I am taking is more or less the one you proposed a few weeks
ago. Namely, to pass the bitmap object as an argument to amgetmulti and
have the routine fill it in as it sees fit.

There is some trickiness, however.

One of the main reasons for the uglification of the executor in Jie's
original patch was that she wanted to avoid the inefficiency of
translating the on-disk bitmap representation to the TID bitmap
representation. If the plan calls for a straight on-disk bitmap read, or
AND/ORing together a few on-disk bitmaps, this is justified. If we're
AND/ORing together an on-disk bitmap read and a TID bitmap scan of
a btree, we're in trouble. In this case, the existing code implements the
current semantics of amgetmulti.

What I am doing is treating the bitmap object as opaque, storing the data
in the format the AM wants, and providing a 'translate to TID bitmap'
routine (trivial for btree).

Do you have any thoughts on this?

Thanks,

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] WIP: bitmap indexes

2006-08-14 Thread Gavin Sherry
On Mon, 14 Aug 2006, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  One of the main reasons for the uglification of the executor in Jie's
  original patch was that she wanted to avoid the inefficiency of
  translating the on disk bitmap representation to the TID bitmap
  representation.

 Offhand that seems like micro-optimization, at least as stated -- you
 want to think about number of page fetches rather than exactly how a
 bitmap is represented.  I've still not got round to reading the patch
 in detail, but after thinking about it in the shower on a few mornings,
 it seems to me that the key concept we'd like to implement is that a
 bitmap index can provide its data in a page-by-page fashion without the
 overhead of fabricating the bitmap in-memory as btree forces us to do.
 That is, the current implementation involves doing the whole index scan
 and building a bitmap in memory, which we then read out page-wise for
 the heap scan.  The weak spot of this approach is that the in-memory
 bitmap can get unreasonably large.  With a bitmap index, you can pass
 data back in a page-by-page fashion: here are the TID indexes to hit on
 this page, then here are the TID indexes to hit on the next page, etc.
 Correct me if I'm wrong, but isn't the patch's present hacking on the
 executor intended to make it happen like this?

Not really. It reads ahead on the bitmap index and passes back the bitmap
words. The other executor routines are hacked up to process the data in
this format.

If I understand your idea correctly, we could change this to read, say, a
page of the index at a time, store this internally in the state object we
pass around, and we can then read out of this the TIDs on a given heap
page which match the query. Once we process all the bitmap data, we just
pull more.


 The main architectural idea I have at the moment is that this should all
 be private between tidbitmap.c and the bitmap index AM.  I think we
 could perhaps have a flavor of tidbitmap that is serial access as
 opposed to the present random-access, ie, instead of keeping the whole
 bitmap in memory, you have a function pointer that you can call to
 demand the next bitmap page in sequence.  tidbitmap.c will need to
 manage both kinds of bitmaps and be prepared to do ANDs and ORs across
 both types, but just at the moment I see no showstopper reason why this
 can't work.

 Or is that exactly what you said already?  It's late here and I'm a
 bit tired...

This is better than what I had in mind. It seems to me that the part of
this which will be a little ugly is dealing with the 'key in (1,2,3,...)'
transformation. Let me think about it...

Thanks,

Gavin

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] 8.2 features status

2006-08-03 Thread Gavin Sherry
On Fri, 4 Aug 2006, Bruce Momjian wrote:


 My outlook is that it isn't a lot of _new_ things that you couldn't do
 before, but rather improvements of existing functionality.

It seems as though the majority of things on Tom's list are new things you
couldn't do (at all easily) before.

Gavin

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] 8.2 features status

2006-08-03 Thread Gavin Sherry
On Fri, 4 Aug 2006, Bruce Momjian wrote:

 Gavin Sherry wrote:
  On Fri, 4 Aug 2006, Bruce Momjian wrote:
 
  
   My outlook is that it isn't a lot of _new_ things that you couldn't do
   before, but rather improvements of existing functionality.
 
  It seems as though the majority of things on Tom's list are new things you
  couldn't do (at all easily) before.

 To me new things are like PITR, Win32, savepoints, two-phase commit,
 partitioned tables, tablespaces.  These are from 8.0 and 8.1.  What is
 there in 8.2 like that?

Well, GIN and some of the unreviewed stuff (bitmaps, plpgsql debugger,
updateable views) are in the same league as the stuff in 8.0 in terms of
user demand and catching up with competitors, I think.

A lot of the things on Tom's list are new bits of functionality to things
added around 8.0 and 8.1 (major enhancements to the usability of
constraint exclusion, for example). We knew then that these needed
additional functionality to fill them out and make them useful to a wide
range of people. Ideally we'd have both at each release but reality
doesn't work like that.

Thanks,

Gavin


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] UPDATE/DELETE XXX WHERE CURRENT OF cursor_name

2006-07-24 Thread Gavin Sherry
On Mon, 24 Jul 2006, Golden Liu wrote:

 Updateable cursors are used as follows:

 begin;
 declare foo cursor for select * from bar for update;
 fetch foo;
 update bar set abc='def' where current of foo;
 fetch foo;
 delete from bar where current of foo;
 commit;


 PostgreSQL doesn't support this feature now ( 8.1.4). Will PGSQL
 support it recently? Does anyone work on this?


No one has stepped up to do this for 8.2 so unfortunately you will most
likely not see this within the next year or so :-(.

Thanks,

Gavin

PS: sorry for not responding to your private email in time.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] On-disk bitmap index patch

2006-07-24 Thread Gavin Sherry
On Mon, 24 Jul 2006, Bruce Momjian wrote:

 Jie Zhang wrote:
   IIRC they quoted the cardinality of 1 as something that is still
   faster than btree for several usecases.
  
   And also for AND-s of several indexes, where indexes are BIG, your btree
   indexes may be almost as big as tables but the resulting set of pages is
   small.
 
  Yeah, Hannu points it out very well -- the bitmap index works very well when
  columns have low cardinalities and AND operations will produce small number
  of results.

 What operations on columns of low cardinality produce a small number of
 results?  That seems contradictory.

WHERE a = 1 and b = 2

a = 1 may be 5% of the table and b = 2 may be 5% of the table but their
intersection may be .001%.
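
A rough standalone illustration (invented data, not the patch's code) of
why bitmaps win here: the page-level bitmaps for the two predicates are
ANDed word by word before any heap page is touched, so only pages that
could satisfy both predicates get fetched.

#include <stdint.h>
#include <stdio.h>

#define NWORDS 1024   /* bitmap words; each bit stands for one heap page */

/* AND two page-level bitmaps in place; return how many pages survive.
 * __builtin_popcount is a GCC/Clang builtin. */
static int
bitmap_and(uint32_t *a, const uint32_t *b, int nwords)
{
    int survivors = 0;

    for (int i = 0; i < nwords; i++)
    {
        a[i] &= b[i];
        survivors += __builtin_popcount(a[i]);
    }
    return survivors;
}

int
main(void)
{
    static uint32_t a_eq_1[NWORDS];     /* pages containing a row with a = 1 */
    static uint32_t b_eq_2[NWORDS];     /* pages containing a row with b = 2 */

    /* invent some data: ~5% and ~4% of pages match, overlapping on ~1% */
    for (int i = 0; i < NWORDS * 32; i++)
    {
        if (i % 20 == 0) a_eq_1[i / 32] |= 1u << (i % 32);
        if (i % 25 == 0) b_eq_2[i / 32] |= 1u << (i % 32);
    }

    printf("heap pages left to visit after the AND: %d of %d\n",
           bitmap_and(a_eq_1, b_eq_2, NWORDS), NWORDS * 32);
    return 0;
}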

Luke: the URL you sent to the bitmap slides was internal to Greenplum.
Would you be able to put them on a public site?

Thanks,

Gavin

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] On-disk bitmap index patch

2006-07-23 Thread Gavin Sherry
On Mon, 17 Jul 2006, Jie Zhang wrote:

 Hi,

 I have posted a patch to the CVS head for on-disk bitmap index to
 pgsql-patches. If this can get in 8.2, that would be great. Any comments and
 suggestions are welcome.

 I still need to add several items:

 (1) README file in src/backend/access/bitmap.
 (2) Bitmap index documentation.
 (3) Hiding the internal btree.

 Also, I have disabled the multi-column index support because there is a
 known problem. Assume that there is a bitmap index on a and b. When a query
 predicate has only a, the current code may generate a wrong result. That's
 because the current code assumes that b is null. The ultimate problem is
 because the search code only handles one bitmap vector now. I need a fix to
 support manipulating multiple bitmap vectors.

 If you find any other problems, please let me know.

Is anyone else looking at this patch? I am reviewing it with Jie, tidying
it up and trying to solve some of the problems mentioned above. If anyone
wants to take a look at it, please ask for an updated patch.

Thanks,

Gavin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] On-disk bitmap index patch

2006-07-23 Thread Gavin Sherry
On Sun, 23 Jul 2006, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  Is anyone else looking at this patch?

 It's on my list of things to look at, but so are a lot of other patches ;-)

 A couple of comments after a *very* fast scan through the patch:

 * The xlog routines need help; they seem to not be updated for recent
 changes in the API for xlog recovery code.

Yep. The patch was actually against 8.1 and was hastily brought up to 8.2.
I think Jie's intention was to simply let everyone know that this was
going on.


 * The hacks on vacuum.c (where it tests the AM name) are utterly
 unacceptable.  If you need the main code to do something special for a
 particular index AM, define a bool flag column for it in pg_am.

Yes.


 * The interface to the existing executor bitmap scan code is mighty
 messy --- seems like there's a lot of almost duplicate code, a lot
 of rather invasive hacking, etc.  This needs to be rethought and
 refactored.

I agree.


 * Why in the world is it introducing duplicate copies of a lot
 of btree comparison functions?  Use the ones that are there.

Yes, I raised this with Jie and she has fixed it. One thought is, we may
want to rename those comparison functions prefixed with 'bm' to make their
naming less confusing. They'll be used by the btree, GIN and bitmap index
methods. Anyway, a separate patch.


 * The patch itself is a mess: it introduces .orig and .rej files,
 changes around $PostgreSQL$ lines, etc.


Right, not to mention patches to configure and a lot of style which needs
to be knocked into shape.

 However, the main problem I've got with this is that a new index AM is a
 pretty large burden, and no one's made the slightest effort to sell
 pghackers on taking this on.  What's the use-case, what's the
 performance like, where is it really an improvement over what we've got?

For low-cardinality sets, bitmaps greatly outperform btree. Jie and
others at Greenplum tested this implementation (and others) against very
large, low-cardinality sets. I wasn't involved in it. I urge Jie and/or
Luke to present those results.

Thanks,

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Adding a pgbench run to buildfarm

2006-07-23 Thread Gavin Sherry
On Mon, 24 Jul 2006, Mark Kirkwood wrote:

 Tom Lane wrote:
  Mark Kirkwood [EMAIL PROTECTED] writes:
  Scale factor 10 produces an accounts table of about 130 Mb. Given that
  most HW these days has at least 1G of ram, this probably means not much
  retrieval IO is tested (only checkpoint and wal fsync). Do we want to
  try 100 or even 200? (or recommend scale factor such that size  ram)?
 
  That gets into a different set of questions, which is what we want the
  buildfarm turnaround time to be like.  The faster members today produce
  a result within 10-15 minutes of pulling their CVS snaps, and I'd be
  seriously unhappy if that changed to an hour or three.  Maybe we need to
  divorce compile/regression tests from performance tests?
 
 

 Right - this leads to further questions like, what the performance
 testing on the buildfarms is actually for. If it is mainly to catch
 regressions introduced by any new code, then scale factor 10 (i.e
 essentially in memory testing) may in fact be the best way to show this up.

It introduces a problem though. Not all machines stay the same over time.
A machine may be upgraded, or a machine may be getting backed up or may in
some other way be utilised during a performance test. This would skew the
stats for that machine. It may confuse people more than help them...

At the very least, the performance figures would need to be accompanied by
details of what other processes were running and what resources they were
chewing during the test.

This is what was nice about the OSDL approach. Each test was preceded by
an automatic reinstall of the OS and the machines were dedicated to
testing. The tester had complete control.

We could perhaps mimic some of that using virtualisation tools which
control access to system resources, but it won't work on all platforms. The
problem is that it probably introduces a new variable, in that I'm not
sure that virtualisation software can absolutely limit the CPU resources a
particular container has. That is, you might not be able to get
reproducible runs with the same code. :(

Just some thoughts.

Thanks,

Gavin

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Units in postgresql.conf

2006-07-20 Thread Gavin Sherry
On Thu, 20 Jul 2006, Peter Eisentraut wrote:

 One frequent source of confusion are the different units that the parameters
 in postgresql.conf use.  shared_buffers is in 8 kB, work_mem is in 1 kB;
 bgwriter_delay is in milliseconds, checkpoint_warning is in seconds.
 Obviously, we can't change that without inconveniencing a lot of users.

 I think it would be useful to allow units to be added to these settings, for
 example

 shared_buffers = 1000kB
 checkpoint_warning = 30s

 This would also allow

 shared_buffers = 512MB

 which is a bit cumbersome to calculate right now (you'd need = 65536).

 I haven't thought yet how to parse or implement this, but would people find
 this useful?

Please! Yes!
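
For what it's worth, the suffix parsing itself should be simple. A rough
sketch, purely illustrative -- the real GUC code would also need to know
each setting's native unit and handle time units:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a memory setting with an optional unit suffix, returning kilobytes,
 * or -1 for an unrecognised unit. */
static long
parse_mem_kb(const char *s)
{
    char *end;
    long  val = strtol(s, &end, 10);

    while (isspace((unsigned char) *end))
        end++;

    if (*end == '\0' || strcmp(end, "kB") == 0)
        return val;
    if (strcmp(end, "MB") == 0)
        return val * 1024L;
    if (strcmp(end, "GB") == 0)
        return val * 1024L * 1024L;
    return -1;
}

int
main(void)
{
    /* shared_buffers = 512MB is 524288 kB, i.e. 65536 buffers of 8 kB each */
    printf("512MB = %ld kB = %ld buffers\n",
           parse_mem_kb("512MB"), parse_mem_kb("512MB") / 8);
    return 0;
}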

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Units in postgresql.conf

2006-07-20 Thread Gavin Sherry
On Thu, 20 Jul 2006, Josh Berkus wrote:

 Peter,

  I don't understand how that is related.  Or what a conversion utility
  would be for that matter.

 Well, the main issue with changing the units of the PostgreSQL.conf file
 from a user perspective is that the numbers from you 8.0/8.1 conf file
 would no longer work.   A little conversion utilitily to turn your 8.0
 file into an 8.2 file would help solve that.

Josh,

I would imagine that Peter intends to handle backward compatibility by
processing values without explicit units in the units assumed pre-8.2.

Thanks,

Gavin

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] plPHP and plRuby

2006-07-16 Thread Gavin Sherry
On Sun, 16 Jul 2006, Joshua D. Drake wrote:

 Hello,

 However plRuby is even a stranger beast as it uses an entirely ruby
 build system. I am also fairly confident that it does not meat the
 PostgreSQL style guidelines.

Well... JDBC used its own.


 Is there enough interest in plRuby to get it where it needs to be for
 possible inclusion into core?

I'd like to see it ship with the core distribution.

Thanks,

Gavin

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Removing AddDepends; should I bother with a project?

2006-07-10 Thread Gavin Sherry
On Mon, 10 Jul 2006, Bruce Momjian wrote:

 Josh Berkus wrote:
  Folks,
 
  For the code sprint, I'm starting off by removing the projects from
  contrib which need to be removed by still have some usefulness.  I'm not
  exactly sure what to do with adddepends, though.   It seems unlike to
  lead an independent existence on pgFoundry; I'm inclined to just nuke it.

 I vote for the nuclear option.  ;-)

There are still 7.2 systems out there which need it. The problem is,
adddepend is broken when run against 8.1. It breaks on serial, I think.

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-22 Thread Gavin Sherry
On Thu, 22 Jun 2006, Agent M wrote:


 On Jun 22, 2006, at 9:56 PM, Christopher Kings-Lynne wrote:

  The example is a very active web site, the flow is this:
  query for session information
  process HTTP request
  update session information
  This happens for EVERY http request. Chances are that you won't have
  concurrent requests for the same row, but you may have well over 100
  HTTP
  server processes/threads answering queries in your web server farm.
 
  You're crazy :)  Use memcache, not the DB :)

 Still, the database is the one central location that the apaches can
 connect too- postgres already has a lot of application platform
 features- locking synchronization, asynchronous notifications,
 arbitrary pl code.

 Personally, I think that a special non-MVCC table type could be
 created- the catalogs are similarly flat. What I envision is a table
 type that can only be accessed outside transactions (like AutoCommit
 mode)- this is already possible to implement in plperl for a single
 session. It would be more efficient to have something like a global
 temp table hanging around...

 Just some random ideas...

Unfortunately, it's not so simple.

What if a user enters a transaction block, modifies a normal table,
modifies this 'special table'... then rolls back? This is the problem
MySQL has with InnoDB and MyISAM tables.

Well... we could just document that. If only.

What happens if, as part of the update to the special table, we encounter
an error? MVCC currently guarantees that this modification will be
invisible. Without MVCC, we have no such capability.

There seems to be a bit of confusion about what MVCC is. PostgreSQL is not
the only MVCC database. InnoDB is MVCC. Oracle is MVCC. As far as I know,
PostgreSQL is the only MVCC database with a 'non-overwriting storage
manager'. The other MVCC databases maintain UNDO logs outside of the
table's data files. When an update occurs, the existing row version is
copied to the UNDO log, the new data replaces the old, and a backward
pointer from the table row to the UNDO log is created. Concurrent reads
must go into the UNDO log and find the version visible to them. This
implements the principle of MVCC and uses snapshot isolation, like we do,
to isolate read/write concurrency.
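
To illustrate, here is a grossly simplified standalone sketch of that
lookup: the table holds only the newest version, older versions sit in an
UNDO chain, and a reader walks backwards until it finds a version its
snapshot can see. Real systems track far more state (commit status and so
on); this is only the shape of it.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* A row version: the table holds the newest one; older ones live in an
 * UNDO chain reached through a backward pointer. */
typedef struct RowVersion
{
    uint64_t           creator_xid;   /* transaction that wrote this version */
    struct RowVersion *undo_prev;     /* backward pointer into the UNDO log */
    int                value;         /* the row's payload */
} RowVersion;

/* A "snapshot" here is simply "every xid below this is visible". */
static const RowVersion *
fetch_visible(const RowVersion *newest, uint64_t snapshot_xmax)
{
    const RowVersion *v = newest;

    while (v != NULL && v->creator_xid >= snapshot_xmax)
        v = v->undo_prev;             /* too new for us: follow the UNDO chain */
    return v;                         /* NULL means the row did not exist yet */
}

int
main(void)
{
    RowVersion v1 = { 100, NULL, 42 };   /* original row, written by xid 100 */
    RowVersion v2 = { 200, &v1, 43 };    /* update by xid 200; v1 pushed to UNDO */

    const RowVersion *seen = fetch_visible(&v2, 150);  /* snapshot taken before xid 200 */

    printf("snapshot 150 sees value %d\n", seen ? seen->value : -1);
    return 0;
}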

Overwriting MVCC comes with its own baggage. Ask any Oracle user about
error ORA-01555[1]. There's also the added cost of managing the UNDO logs,
the cost of jumping around between files to get row versions and so on.

Also, it leads to inefficiency with variable-size data types. The new
version of a row might be longer or shorter than the previous version, and
this has to be causing them headaches and performance penalties.

As for single-version databases -- they have a tonne of problems
themselves. They need to maintain UNDO functionality so as to deal with
the error handling I detailed above, but they also cannot implement the
MVCC rules for all cases: readers do not block writers, writers do not
block readers. Instead, they implement a large lock matrix and certain
types of queries use certain types of granular locks.

Some potentially interesting reading material on these ideas:

J. Gray & A. Reuter, Transaction Processing: Concepts and Techniques
US Patent Number 5,870,758 -- (one of?) Oracle's snapshot isolation patent
Tom Lane's MVCC talk:
http://www.postgresql.org/files/developer/transactions.pdf

Thanks,

Gavin


---
[1] Basically, the UNDO logs are a circular buffer and, remarkably, Oracle
doesn't seem to let the buffer expand if there is a long-running
transaction. (We do this for WAL.)

In this case, Oracle finds itself in a compromising position because if
any more data is written to the UNDO log, the affected query will not be
able to roll back and/or concurrent queries will not be able to find the
correct row version for their snapshot.

UNDO log size can be adjusted but this situation bites Oracle users
constantly.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-22 Thread Gavin Sherry
On Thu, 22 Jun 2006, Jonah H. Harris wrote:

 Not in all systems.  A few now perform in-memory UNDO and only write
 it to disk if and when it is required.

Interesting...


  Overwriting MVCC comes with its own baggage. Ask any Oracle user about
  error ORA-01555[1]. There's also the added cost of managing the UNDO logs,
  the cost of jumping around between files to get row versions and so on.

 This seems to be going in the direction of our common MySQL
 discussions; relying on old failures and mistakes to base our
 assumptions on the current version.  Please stay apprised of current
 developments in other systems.

Erm. Perhaps a poor example; I was not trying to put Oracle in a poor
light. Rather, I was trying to point out that each method has its
disadvantages. If the update-in-place method had no drawbacks, we
shouldn't be hanging on to our existing mechanism.

  J. Gray & A. Reuter, Transaction Processing: Concepts and Techniques

 Pretty much older than dirt, discusses locking, and barely touches on
 MVCC.  Still has some good concepts though.

The really useful section of this book is the discussion of snapshot
isolation. That's the important thing here. Conceivably we could have a
higher performance storage system but, IMHO, it must implement snapshot
isolation.

Thanks,

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] DB2-style INS/UPD/DEL RETURNING

2006-03-13 Thread Gavin Sherry
On Sun, 12 Mar 2006, Jonah H. Harris wrote:

 I was talking with Jonathan Gennick about the INS/UPD/DEL RETURNING stuff,
 and he recommended looking into the way DB2 handles similar functionality.
 After looking into it a bit, it's more inline with what Tom's suggestion was
 regarding a query from the operation rather than returning the values in the
 manner currently required.

 Here's DB2's syntax... does anyone have any familiarity with it?

 Simply put, it's sort-of like:

 SELECT * FROM (FINAL | NEW | OLD) TABLE (INSERT | UPDATE | DELETE)

 I'd like to hear from anyone that's used it to see if it really is better...
 logically it seems nicer, but I've never used it.

It works well for cases where you want to pass the result of an
insert/delete/update to another query. There was a paper on IBM
developerWorks on how they got the 7 or so queries in an order transaction
in TPC-C down to 3 queries and increased throughput impressively.

This doesn't solve the generated keys problem that the Java and probably
.NET interfaces have. Mind, RETURNING doesn't solve anything either.

I prefer this syntax to RETURNING. Then again, Oracle is a bigger target
than DB2 so... I'm not sure.

Thanks,

Gavin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] DB2-style INS/UPD/DEL RETURNING

2006-03-13 Thread Gavin Sherry
On Mon, 13 Mar 2006, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  On Sun, 12 Mar 2006, Jonah H. Harris wrote:
  SELECT * FROM (FINAL | NEW | OLD) TABLE (INSERT | UPDATE | DELETE)

  This doesn't solve the generated keys problem that the Java and probably
  .NET interfaces have. Mind, RETURNING doesn't solve anything either.

 Why not?  AFAICS, either one lets you get at generated keys.

There are a few different ways to get at generated keys from JDBC at
least. The case we cannot trivially deal with is when the code executes a
statement and then wants a result set of all generated keys. That is, it
doesn't register which generated keys it wants returned before the query
is executed.


 It's quite unclear to me what the difference is between FINAL and
 NEW ... any clarification there?

NEW returns the representation of the data which the statement creates;
FINAL is the final representation of the data, after AFTER triggers have
been applied.


 The OLD idea is cute but I'm not sure how useful it really is.  They
 seem to have missed a bet anyway: if I understand how this works, you
 can't get values from both new and old row states in the UPDATE case.
 The classification seems bogus for both INSERT and DELETE, too; neither
 of them have more than one row state to deal with.

Right, it's not as useful as our OLD.*, NEW.*.


 Also, is the front SELECT allowed to have its own WHERE, or is it
 constrained to return exactly one row per inserted/updated/deleted row?
 If it can have a WHERE then there's a syntactic ambiguity in
   SELECT ... FROM NEW TABLE UPDATE ... WHERE ...

That's definitely ambiguous. The manual doesn't clarify, and I do not have
DB2 installed locally.


 More generally, this syntax is problematic in that it's syntactically
 possible to use SELECT FROM NEW TABLE ... as a sub-query, which seems
 like a truly horrid idea from both semantics and implementation
 perspectives.

I cannot see any reference to whether this is allowed in DB2. The DB2
manual and other IBM apps use it extensively in named expressions, i.e.,

WITH
foo as (SELECT FROM NEW TABLE(...)),
bar as (SELECT FROM OLD TABLE(...))
SELECT ... FROM foo, bar

It does say that a 'data change table reference' is simply a type of table
reference, so I suppose it can occur in a subquery. The ability to have
INSERT ... RETURNING in a FROM clause would achieve most of this, I think.

Thanks,

Gavin

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] to_char and i18n

2006-03-02 Thread Gavin Sherry
 Gavin Sherry wrote:
  On Wed, 21 Dec 2005, Tom Lane wrote:
 
   Manuel Sugawara masm@fciencias.unam.mx writes:
(Some time ago I proposed an--incomplete--patch and it was rejectd by
Karel arguing that to_char functions should behave *exactly* the same
way that they do in Oracle.)
  
   That is the accepted plan for to_char ... of course, if Oracle changes
   to_char every so often, it'll get more interesting to decide what to do.
 
  There's some functionality in 10g which PostgreSQL does not have:
 
  TZD - returns the short timezone string with daylight saving information,
  eg: PDT

This is the same as TZ and it is easy to implement.

 
  TZM - timezone offset minutes part

Trivial

 
  TZH - timezone offset hours part

Trivial (see the TZH/TZM sketch after this list).

 
  TZR -  timezone region (US/Pacific, for example)

We don't currently have an offset -> region name lookup table but it
should be easy enough to implement...


 
  RR/ - accept 'rounded' years, eg 99-1-1 = 1999-1-1 (kind of pointless)
 
  FF - specify how many digits to the right of the decimal place to display,
  when looking at factions of seconds. Eg: HH:MM:SS.FF3 would produce
  15:56:22.123

Trivial

 
  X - the local radix character. Eg: HH:MM:SSXFF would produce 15:56:22.123
 

I don't know how to get this character... is it included in the locale
data somewhere (and where, specifically)?

  E - Era name (like, Japanese Imperial) (kind of pointless)
  EE - Full era name

No idea where to get this data.

 
  DS - Locale formatted short date. For example, DD/MM/YYYY for the Brits,
  MM/DD/YYYY for the Yanks

Is this desirable? It may lead to confusion with datestyle.

 
  DL - Locale formatted long date. Eg: fmDay, dd. Month YYYY in Germany

Should be straightforward - if the underlying library will honour locale.

 
  SCC - Like 'CC', but will carry a - (minus) for BC dates (I'm not sure if
  this implies that Oracle wants BC dates to be marked 'BC'. I don't have
  an Oracle system around at the moment to check though :-()

Thoughts?

 
  TS - Locale formatted short time.

Should be straightforward - if the underlying library will honour locale.

 
  YEAR - Year in words

Hmmm. This would be hard to do if we want to support local language
settings.

 
  SYEAR - Year in words, prefixed by minus sign for BC dates

As above.

 
  SYYYY - YYYY, prefixed by minus sign for BC dates

Should be straightforward.
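
As promised above, the TZH/TZM arithmetic really is trivial: split the
session's UTC offset into hours and minutes, keeping the sign. A standalone
sketch (the offset is hard-coded here; the real code would take it from the
session timezone):

#include <stddef.h>
#include <stdio.h>

/* Split a UTC offset in seconds into TZH and TZM, keeping the sign even for
 * offsets smaller than one hour. */
static void
format_tzh_tzm(int gmtoff_secs, char *buf, size_t buflen)
{
    int abs_off = gmtoff_secs < 0 ? -gmtoff_secs : gmtoff_secs;

    snprintf(buf, buflen, "%c%02d:%02d",
             gmtoff_secs < 0 ? '-' : '+',
             abs_off / 3600, (abs_off % 3600) / 60);
}

int
main(void)
{
    char buf[8];

    format_tzh_tzm(-(9 * 3600 + 30 * 60), buf, sizeof(buf));
    printf("TZH:TZM = %s\n", buf);    /* prints -09:30 */
    return 0;
}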

Any comments on the above?

Gavin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Status of INS/UPD/DEL RETURNING?

2006-03-01 Thread Gavin Sherry
On Wed, 1 Mar 2006, Bruce Momjian wrote:

 Jonah H. Harris wrote:
  Hey guys,
 
  What's the status of the current INSERT/UPDATE/DELETE RETURNING patch?  Is
  it ready to go or does it need to be cleaned up?

 Uh, I don't remember seeing any patch like that.  Where is it?

Omar Kilani sent in a patch before 8.1 FF derived from (??) some playing
around I had done. It still needs work.

Thanks,

Gavin

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] PostgreSQL unit tests

2006-02-23 Thread Gavin Sherry
On Fri, 24 Feb 2006, Michael Glaesemann wrote:

 [I neglected to cc the list in my reply earlier. Apologies to Gavin
 for the double-post.]

 On Feb 23, 2006, at 11:40 , Gavin Sherry wrote:


   I do think that unit testing of areas such as data types would be
  useful,
  particularly the date/time code and arrays as I consider that area
  of the
  code quite fragile. I wouldn't expect the unit tests to find any bugs.
  Rather, it would make it easier, I think, for people (particularly new
  comers) to hack on that part of the code with more confidence.
 

 This is the area I specifically had in mind when thinking of unit
 tests. I am looking to do more work on the date/time code in
 particular, and having a unit testing framework and tests to verify
 expected behavior would definitely give me a greater sense of
 security that I wasn't mucking stuff up.

Yes. You sound like the perfect person to implement this :-).

 Your earlier proposal was probably the one I remembered. Could you
 comment on your experience with any of the existing unit testing
 frameworks? In particular, did any stand out in applicability to
 Postgres? What's your take on possible licensing issues?

From memory, I looked at Check and CuTest. The former was more
sophisticated because it had the ability
to fork and run the code in a child process to see if it segfaulted, for
example.

The licensing issue is more of a pain. Obviously we cannot incorporate GPL
stuff into the code base. CuTest seems to have a pretty BSD-compatible
license, though.

That said, the actual implementation is very straightforward: 1) queue
test functions which assert something about the result of a function;
2) run the tests; 3) capture passes and failures. We have some special
requirements with our code because it can ereport()/elog(). As such, it is
quite attractive to just write our own, if unit testing is to proceed.
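
To make that concrete, a minimal sketch of the queue/run/report skeleton (all
names made up; handling ereport()/elog() would still need PG_TRY-style
wrapping or the fork trick mentioned above):

    #include <stdio.h>

    typedef int (*test_fn) (void);          /* returns 1 on pass, 0 on fail */

    typedef struct
    {
        const char *name;
        test_fn     fn;
    } test_case;

    /* A stand-in test; a real one would call backend functions and assert. */
    static int
    test_arithmetic(void)
    {
        return 1 + 1 == 2;
    }

    static test_case tests[] = {
        {"arithmetic", test_arithmetic},
    };

    int
    main(void)
    {
        int     passed = 0;
        int     failed = 0;
        size_t  i;

        for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
        {
            if (tests[i].fn())
                passed++;
            else
            {
                failed++;
                printf("FAIL: %s\n", tests[i].name);
            }
        }

        printf("%d passed, %d failed\n", passed, failed);
        return failed != 0;
    }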

Thanks,

Gavin



Re: [HACKERS] Attempting upgrade path; is this possible?

2006-02-22 Thread Gavin Sherry
On Wed, 22 Feb 2006, Shaun Thomas wrote:

 I'm in charge of a very large database, and we're using a highly
 decrepit version of Postgresql currently.  After searching through the
 archives, Google, and trying out several replication engines, I have a
 question.

 I had originally considered Slony-I, as it doesn't seem to require
 version compatibility between nodes like pgCluster, so upgrading from
 7.4.2 to 8.1.3 would be a possible, if slow process.  But after looking
 into the level of micro-management necessary, such as defining sets of
 every table on a per-database level, then having it add artificial
 primary-keys to applicable tables, it just doesn't seem like a good
 choice.  Not a fault of Slony-I, but several multi-gig databases hosting
 hundreds of tables would be a nightmare to use with Slony-I.

There are tools in the /tools directory. In particular, take a look at
/tools/altperl. You can use them to set up Slony and replicate all tables with
very little hassle. Slony adds the primary keys for you.


 Then I thought about the backup/recovery system and the WAL files.
 Would this scenario be possible:

 1. Do a pg_dumpall on the existing database running 7.4.2.
 2. Do a psql -f foo template1 on the new database running 8.1.3.
 3. Wait a very long time while the new database loads.
 4. Shut down old database.
 5. Start the new database in restore mode, and point it to the WAL
files from the old database.
 6. Wait for restore to finish.
 7. Restart the new database.

This is not possible. On your 7.4 system, tables have a unique object ID
to identify them. When you restore the dump on the 8.1 system they will
have different object IDs. There are several other issues of this
nature.

Also, the binary format of the log files has changed, and the whole process
would be significantly more difficult than using Slony to upgrade. The
size of your databases does not sound like an issue - lots of people have
done what you're doing with databases in the gigabyte range.


 I wondered about this, as the pg_dumpall/restore would take a very long
 time  for a 50GB database cluster, but theoretically the WAL files would
 continue to accumulate on the old db while this loading was taking
 place.
 If the WAL formats were compatible, the total upgrade time would only be
 restricted to how long it took to replay the WAL files in the new
 database.  Does the current format of the WAL files make this possible?
 If not, is such an option for the future?

It is possible that someone could write a translation tool which
translates WAL entries into SQL, but there is some messiness to deal with
(backup blocks; sequences like create table foo, insert into foo, drop
table foo; and more). I think the best solution is an in-place upgrade tool
which does all the binary conversions, additions and subtractions itself.
This could be quite cheap because the conversion will often only affect
system catalogs, not user tables.

Thanks,

Gavin



Re: [HACKERS] PostgreSQL unit tests

2006-02-22 Thread Gavin Sherry
On Wed, 22 Feb 2006, Alvaro Herrera wrote:

 Robert Treat wrote:

  You could check into what spikesource has been doing.  I believe they mostly
  just piggyback off of our regression tests for postgresql core, but there
  might still be something that could be built upon.  If you look at this url
  http://developer.spikesource.com/spikewatch/index.jsp?show=component-resultscomp-id=22074
  the actual success information isnt terribly exciting but the code 
  coverage
  url shows something of more interest. There is more stuff if you dig around 
  a
  bit.

 This can't be right.  The report for function coverage shows 100% for
 all utf8_and_*.c files, at the end of the listing.  Notice how C/D
 coverage (I don't know what it means but I assume it's somehow computed
 per lines of code or something) is 0, which is probably the correct
 result, because our regression tests do not test charset conversions at
 all.

 I think the bug may be that they use function names to see what is
 actually tested ...

 IIRC Gavin Sherry gave a URL to a test coverage result some centuries
 ago.  The only thing that I remember about the result was that it was
 surprinsingly low (IMHO at least).

Yes. From memory, coverage was about 50%. That was the coverage resulting
from the regression tests.

I previously proposed integrating a unit test framework into PostgreSQL.
Getting started wasn't much fun and I gave up. This is because unit
testing is really suited to a functional programming model, IMHO. Testing
the most complex parts of the postgres backend requires a *lot* of state
to be initialised and we'd really have to stomp all over the backend. I do
think that unit testing of areas such as data types would be useful,
particularly the date/time code and arrays, as I consider that area of the
code quite fragile. I wouldn't expect the unit tests to find any bugs.
Rather, it would make it easier, I think, for people (particularly
newcomers) to hack on that part of the code with more confidence.

The areas of the backend which do not suit unit testing are usually
associated with lots of state or lots of concurrency - like WAL, the buffer
manager and so on. The approaches we have at the moment -- regression
tests, user testing, load generators (benchmarks et al) -- do an okay job, but
they are a brute-force approach. They test the common code path, not the
uncommon one. Approaches used by other projects include sophisticated
static analysis (both of language semantics and application semantics, such
as 'function bar should never be called unless function foo has been called
first'); model checking (difficult to implement, with limited results); and
one-off patches that introduce disastrous conditions in the code
(like low-level functions randomly failing, to check that the callers of
such functions deal with such conditions correctly).
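
The last of those is cheap to prototype, by the way. A hedged sketch: a
hypothetical faulty_malloc() that the code under test is built against
(e.g. via -Dmalloc=faulty_malloc), so that out-of-memory paths actually get
executed:

    #include <stdlib.h>

    /* Fail roughly one allocation in twenty to exercise callers' error paths. */
    static const int fail_one_in = 20;

    void *
    faulty_malloc(size_t size)
    {
        if (rand() % fail_one_in == 0)
            return NULL;                /* simulated out-of-memory */
        return malloc(size);
    }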

Some basic static analysis has been applied to the backend (the Stanford
Checker and Coverity, etc.), no model checking (as far as I am aware), and
Tom had some good results with the last approach earlier in the year, but
I cannot remember what area of the code he was focusing on.

The problem is that these approaches are either of limited use, or time
consuming and niche enough that they cannot be done often or across the
whole tree.

Any other thoughts on this?

Thanks,

Gavin


