Re: [HACKERS] Table clustering idea

2006-06-26 Thread Luke Lonergan
Jim,

On 6/26/06 8:15 PM, "Jim C. Nasby" <[EMAIL PROTECTED]> wrote:

> On a somewhat related note, I think that it would be advantageous if the
> FSM had a means to prefer certain pages for a given tuple over other
> pages. This would allow for a better way to keep heap and possibly index
> data more compacted, and it would also be a means of keeping tables
> loosely clustered. It would also make it far easier to shrink heaps that
> have become bloated, because the FSM could be told to favor pages at the
> beginning of the relation.

Interesting idea - page affinity implemented using the FSM.

WRT feasibility of BTREE organized tables, I'm not sure I see the problem.
Teradata implemented a hashing filesystem for their heap storage and I've
always wondered about how they handle collision and chaining efficiently,
but it's a solved problem for sure - knowing that makes the challenge that
much easier :-)
 
- Luke 



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [COMMITTERS] pgsql: Clamp last_anl_tuples to n_live_tuples, in case we vacuum a table

2006-06-26 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> On a loosely related matter, how about changing pg_class.relpages to
> pg_class.reldensity?

IIRC, I considered this earlier, and rejected it because it doesn't cope
well with the corner case relpages == 0.  Also, it'll break existing
clients that expect to find relpages and reltuples, if there are any
--- which is not a very strong argument, certainly, but the argument in
favor of changing seems even weaker.

regards, tom lane



Re: [HACKERS] [COMMITTERS] pgsql: Clamp last_anl_tuples to n_live_tuples, in case we vacuum a table

2006-06-26 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Clamp last_anl_tuples to n_live_tuples, in case we vacuum a table without
> analyzing, so that future analyze threshold calculations don't get confused.
> Also, make sure we correctly track the decrease of live tuples cause by
> deletes.

Now that the values of n_live_tuples and n_dead_tuples are correct,
maybe we should expose them in functions like n_tuples_inserted and all
the rest?  Probably n_live_tuples is not as useful (because you can get
a very similar, if not equal, value from pg_class.reltuples), but
n_dead_tuples seems worthwhile.  I'd also add them to pg_stat_all_tables.

On a loosely related matter, how about changing pg_class.relpages to
pg_class.reldensity?  The idea being that instead of storing the number
of pages, we store the average tuple density, which is what we use
relpages for anyway.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] [COMMITTERS] pgsql: Disallow changing/dropping default expression

2006-06-26 Thread Bruce Momjian
Tom Lane wrote:
> [EMAIL PROTECTED] (Bruce Momjian) writes:
> > Disallow changing/dropping default expression of a SERIAL column
> 
> Wasn't this patch rejected?  Your summary is utterly wrong about what
> it does, which makes me wonder whether you reviewed it at all.  One
> would also think that a change in user-visible behavior would merit
> at least some documentation diffs.

Well, it seemed it controlled whether dependency allowed you to drop a
default sequence for a table.  There is something I didn't like about
the patch now that I look at it again --- it uses constants 0-2 when it
should use defines or something clearer.

I thought we had decided that we could not make SERIAL just a macro, and
therefore we have to restrict how we allow modifications.  If someone
wants total control, they should create the DEFAULT manually, but SERIAL
was going to be hard-wired.

Anyway, what is your opinion on this?

Yea, agreed on the documentation issue.  That SERIAL distinction, if
that is the direction we are going, should be documented.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Inheritance, CREATE TABLE LIKE, and partitioned tables

2006-06-26 Thread Bruce Momjian

If you want to merge those functions, please do it as a separate patch
now that the patch has been applied.  Having too much unrelated stuff in
a patch does confuse things.

---

Greg Stark wrote:
> 
> Currently analyze.c and tablecmds.c both have some very similar code to
> handle copying columns from "parent" tables. As long as they were just
> copying columns (and, in tablecmds.c's case, check constraints) that
> wasn't really a bad approach; the code is pretty simple.
> 
> However with partitioned tables it becomes important to copy more table
> attributes like foreign key constraints and hopefully one day indexes. And it
> would be awfully convenient to have CREATE TABLE LIKE have options to copy the
> same things that inherited tables get. And have exactly the same semantics.
> 
> So I'm suggesting refactoring the code from analyze.c and tablecmds.c into
> functions to copy the columns, constraints, indexes etc.
> 
> For example I see a functions like:
> 
> List *CopyTableColumns(relation source, List *target_schema)
> List *CopyTableCheckConstraints(relation source, List *target_schema)
> ...
> 
> To do this though might require some changes in the algorithm used for
> inherited tables. Currently it builds up the list of merged columns
> incrementally. I'm thinking it would be more natural to accumulate all the
> columns from parents and then remove duplicates in a single pass. I think it
> should be possible to maintain precisely the same semantics doing this though.
> 
> I may be able to make AddInherits a consumer for these functions as well,
> though it would be a bit awkward since it would have to construct a fake list
> of ColumnDefs to act as the target schema. It would have the side effect of
> making the constraint comparison use change_varattnos_of_a_node and then
> compare the binary representations rather than decompiling the constraints to
> do the comparison. I'm not sure if that's the same semantics.
> 
> To a certain degree I feel like this is just make-work. The existing code
> works fine and I can just happily keep adding functionality to both
> analyze.c and tablecmds.c. And it's possible we won't always want to have the
> two match.
> 
> Has anyone looked at applying the ADD INHERITS patch yet? Would it be more or
> less likely to be accepted if it were a bigger patch that refactored all this
> stuff like I'm talking about?
> 
> -- 
> greg
> 
> 
> 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] GIN index creation extremely slow ?

2006-06-26 Thread Christopher Kings-Lynne

> It looks like we still don't have any docs for GIN in the tree so I
> don't know if those timings are expected or not ...


Ummm my bad.  Sorry...




Re: [HACKERS] Table clustering idea

2006-06-26 Thread Jim C. Nasby
On Sun, Jun 25, 2006 at 08:04:18PM -0400, Luke Lonergan wrote:
> Other DBMS have index organized tables that can use either hash or btree
> organizations, both of which have their uses.  We are planning to
> implement btree organized tables sometime - anyone else interested in
> this idea?

I'm curious how you'll do it, as I was once told that actually trying to
store heap data in a btree structure would be a non-starter (don't
remember why).

On a somewhat related note, I think that it would be advantageous if the
FSM had a means to prefer certain pages for a given tuple over other
pages. This would allow for a better way to keep heap and possibly index
data more compacted, and it would also be a means of keeping tables
loosely clustered. It would also make it far easier to shrink heaps that
have become bloated, because the FSM could be told to favor pages at the
beginning of the relation.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Jim C. Nasby wrote:
> On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
> > 
> > It is certainly possible to do what you are suggesting, that is have two
> > index entries point to same chain head, and have the index access
> > routines figure out if the index qualifications still hold, but that
> > seems like a lot of overhead.
> > 
> > Also, once there is only one visible row in the chain, removing old
> > index entries seems quite complex because you have to have vacuum keep
> > the qualifications of each row to figure out which index tuple is the
> > valid one (seems messy).
>  
> Perhaps my point got lost... in the case where no index keys change
> during an update, SITC seems superior in every way to my proposal. My
> idea (let's call it Index Tuple Page Consolidation, ITPC) would be
> beneficial to UPDATEs that modify one or more index keys but still put
> the tuple on the same page. Where SITC would be most useful for tables
> that have a very heavy update rate and very few indexes, ITPC would
> benefit tables that have more indexes on them; where presumably it's
> much more likely for UPDATEs to change at least one index key (which
> means SITC goes out the window, if I understand it correctly). If I'm
> missing something and SITC can in fact deal with some index keys
> changing during an UPDATE, then I see no reason for ITPC.

I understood what you had said.  The question is whether we want to get
that complex with this feature, and if there are enough use cases
(UPDATE with index keys changing) to warrant it.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Jim C. Nasby
On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
> 
> It is certainly possible to do what you are suggesting, that is have two
> index entries point to same chain head, and have the index access
> routines figure out if the index qualifications still hold, but that
> seems like a lot of overhead.
> 
> Also, once there is only one visible row in the chain, removing old
> index entries seems quite complex because you have to have vacuum keep
> the qualifications of each row to figure out which index tuple is the
> valid one (seems messy).
 
Perhaps my point got lost... in the case where no index keys change
during an update, SITC seems superior in every way to my proposal. My
idea (let's call it Index Tuple Page Consolidation, ITPC) would be
beneficial to UPDATEs that modify one or more index keys but still put
the tuple on the same page. Where SITC would be most useful for tables
that have a very heavy update rate and very few indexes, ITPC would
benefit tables that have more indexes on them; where presumably it's
much more likely for UPDATEs to change at least one index key (which
means SITC goes out the window, if I understand it correctly). If I'm
missing something and SITC can in fact deal with some index keys
changing during an UPDATE, then I see no reason for ITPC.

> ---
> 
> Jim C. Nasby wrote:
> > On Sun, Jun 25, 2006 at 09:13:48PM +0300, Heikki Linnakangas wrote:
> > > >If you can't expire the old row because one of the indexed columns was
> > > >modified, I see no reason to try to reduce the additional index entries.
> > > 
> > > It won't enable early expiration, but it means less work to do on
> > > update.  If there's a lot of indexes, not having to add so many
> > > index tuples can be a significant saving.
> > 
> > While catching up on this thread, the following idea came to me that I
> > think would allow for not updating an index on an UPDATE if its key
> > doesn't change. If I understand Bruce's SITC proposal correctly, this
> > would differ in that SITC requires that no index keys change.
> > 
> > My idea is that if an UPDATE places the new tuple on the same page as
> > the old tuple, it will not create new index entries for any indexes
> > where the key doesn't change. This means that when fetching tuples from
> > the index, ctid would have to be followed until you found the version
> > you wanted OR you found the first ctid that pointed to a different page
> > (because that tuple will have its own index entry) OR you found a tuple
> > with a different value for the key of the index you're using (because
> > it'd be invalid, and there'd be a different index entry for it). I
> > believe that the behavior of the index hint bits would also have to
> > change somewhat, as each index entry would now essentially be pointing
> > at all the tuples in the ctid chain that exist on a page, not just
> > single tuple.
> > 
> > In the case of an UPDATE that needs to put the new tuple on a different
> > page, our current behavior would be used. This means that the hint bits
> > would still be useful in limiting the number of heap pages you hit. I
> > also believe this means that we wouldn't suffer any additional overhead
> > from our current code when there isn't much free space on pages.
> > 
> > Since SITC allows for in-page space reuse without vacuuming only when no
> > index keys change, it's most useful for very heavily updated tables such
> > as session handlers or queue tables, because those tables typically have
> > very few indexes, so it's pretty unlikely that an index key will change.
> > For more general-purpose tables that have more indexes but still see a
> > fair number of updates to a subset of rows, not having to update every
> > index would likely be a win. I also don't see any reason why both
> > options couldn't be used together.
> > -- 
> > Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
> > Pervasive Software  http://pervasive.comwork: 512-231-6117
> > vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
> > 
> 
> -- 
>   Bruce Momjian   [EMAIL PROTECTED]
>   EnterpriseDBhttp://www.enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
> 
> 

-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-26 Thread Simon Riggs
On Mon, 2006-06-26 at 16:54 -0400, Alvaro Herrera wrote:
> > > Another optimization: if we are sure that unfreezing works, we can even
> > > mark a table as frozen in a postmaster environment, as long as we take
> > > an ExclusiveLock on the table.  Thus we know that the vacuum is the sole
> > > transaction concurrently accessing the table; and if another transaction
> > > comes about and writes something after we're finished, it'll correctly
> > > unfreeze the table and all is well.
> > 
> > Why not just have a command to FREEZE and UNFREEZE an object? It can
> > hold an ExclusiveLock, avoiding all issues. Presumably FREEZE and
> > UNFREEZE are rare commands?
> 
> Ok, if I'm following you here, your point is that FREEZE'ing a table
> sets the relminxid to FrozenXid, and UNFREEZE removes that; and also,
> in between, no one can write to the table?
> 
> This seems to make sense.  However, I'm not very sure about the
> FREEZE'ing operation, because we need to make sure the table is really
> frozen.  So we either scan it, or we make sure something else already
> scanned it; to me what makes the most sense is having a VACUUM option
> that would do the freezing (and a separate command to do the
> unfreezing). 

Sounds like we're in step here:

VACUUM FREEZE
-- works at either table or database level
-- takes ExclusiveLock, reads all blocks of a table, freezing rows
-- once complete, all write operations are prevented until...

ALTER TABLE xxx UNFREEZE;
ALTER DATABASE xxx UNFREEZE;
-- takes AccessExclusiveLock, allows writes again

CREATE DATABASE automatically does unfreeze after template db copy

Suggest that we prevent write operations on frozen tables by revoking
all INSERT, UPDATE or DELETE rights held, then enforcing a check during
GRANT to prevent them being re-enabled. Superusers would need to check
every time. If we don't do this, then we will have two contradictory
states marked in the catalog - privileges saying Yes and freezing saying
No.

Not sure where pg_class_nt comes in here though, even though I think I
still want it.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] GIN index creation extremely slow ?

2006-06-26 Thread Oleg Bartunov

GIN uses maintenance_work_mem, so try increasing it and you should see a
dramatic improvement.

Oleg
On Mon, 26 Jun 2006, Stefan Kaltenbrunner wrote:


on IRC somebody mentioned that it took >34h to create a GIN index (on a
tsvector) on a ~3 million row table (a Wikipedia dump) with a reasonably
specced box (AMD 3400+).
After getting hold of a dump of said table (around 4.1GB in size) I
managed to get the following timings:

test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
CREATE INDEX
Time: 416122.896 ms

so about 7 minutes - sounds very reasonable

test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
CREATE INDEX
Time: 52681605.101 ms

ouch - that makes for a whopping 14.6 hours(!). During that time the box is
completely CPU bottlenecked and doing virtually no IO at all (varying
maintenance_work_mem does not seem to make any noticeable difference).

That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10
for the WAL and 12 disks for the data running a very recent -HEAD
checkout ...

It looks like we still don't have any docs for GIN in the tree so I
don't know if those timings are expected or not ...


Stefan




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [HACKERS] Non-transactional pg_class, try 2

2006-06-26 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > What I'm after is not freezing for read-only media, nor archive, nor
> > read-only tables.  What I'm after is removing the requirement that all
> > databases must be vacuumed wholly every 2 billion transactions.
> 
> Well, if that's the only goal then I hardly think we need to have a
> discussion, because your alternative #1 is *obviously* the winner:
> 
> > 1. Remove the special case, i.e., process frozen databases in VACUUM
> >like every other database.
> >This is the easiest, because no extra logic is needed.  Just make
> >sure they are vacuumed in time.  The only problem would be that we'd
> >need to uselessly vacuum tables that we know are frozen, from time to
> >time.  But then, those tables are probably small, so what's the
> >problem with that?

I'm happy to do at least this for 8.2.  We can still try to do the
non-transactional catalog later, either in this release or the next; the
code is almost there, and it'll be easier to discuss/design because
we'll have taken the relminxid stuff out of the way.

So if everyone agrees, I'll do this now.  Beware -- this may make you
vacuum databases that you previously weren't vacuuming.  (I really doubt
anyone is setting datallowconn=false just to skip vacuuming big
databases, but there are people with strange ideas out there.)


> So if you want to bring in the other goals that you're trying to pretend
> aren't there, step right up and do it.  You have not here made a case
> that would convince anyone that we shouldn't just do #1 and be done with
> it.

We can do it in a separate discussion.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-26 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> What I'm after is not freezing for read-only media, nor archive, nor
> read-only tables.  What I'm after is removing the requirement that all
> databases must be vacuumed wholly every 2 billion transactions.

Well, if that's the only goal then I hardly think we need to have a
discussion, because your alternative #1 is *obviously* the winner:

> 1. Remove the special case, i.e., process frozen databases in VACUUM
>like every other database.
>This is the easiest, because no extra logic is needed.  Just make
>sure they are vacuumed in time.  The only problem would be that we'd
>need to uselessly vacuum tables that we know are frozen, from time to
>time.  But then, those tables are probably small, so what's the
>problem with that?

> 2. Mark frozen databases specially somehow.
>To mark databases frozen, we need a way to mark tables as frozen.
>How do we do that?  As I explain below, this allows some nice
>optimizations, but it's a very tiny can full of a huge amount of
>worms.

Avoiding a vacuum pass over template0 once every 2 billion transactions
cannot be thought worthy of the amount of complexity and risk entailed
in the nontransactional-catalog thing.  Especially since in the normal
case those would be read-only passes (the tuples all being frozen already).

So if you want to bring in the other goals that you're trying to pretend
aren't there, step right up and do it.  You have not here made a case
that would convince anyone that we shouldn't just do #1 and be done with
it.

regards, tom lane



Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> On Mon, 2006-06-26 at 16:43 -0400, Tom Lane wrote:
>> analyze.c (tuples collected in-memory for stats analysis)

> Do we save enough there for us to care?

Possibly not --- it's certainly low-priority, but I listed it for
completeness.

regards, tom lane



Re: [HACKERS] Inheritance, CREATE TABLE LIKE, and partitioned tables

2006-06-26 Thread Greg Stark
elein <[EMAIL PROTECTED]> writes:

> People will turn around and immediately ask for CREATE TABLE LIKE without
> the assumption that they wanted all of the extras that come with
> inheritance.  COPY is a copy of the table, not additional functionality.
> The added flexibility of adding only what is necessary makes more sense
> than guessing what was added and removing it later if it is not needed.
> 
> This does not preclude adding a copy table like (with extras) though if
> you must.

I'm not too sure what you're saying here. But just to be clear, the spec
specifies what CREATE TABLE LIKE does and does not copy. Any behaviour
different from the spec would have to be explicitly requested with an optional
clause.

-- 
greg




Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Simon Riggs
On Mon, 2006-06-26 at 16:43 -0400, Tom Lane wrote:
> I wrote:
> > There isn't any benefit except where we collect lots of tuples, which is
> > to say tuplesort/tuplestore/tuplehashtable.  In other places in the
> > executor, there's basically only one transient tuple in existence per
> > plan node; jumping through hoops to save 16 bytes per plan node is just
> > silly.  (What's more, as of 8.1 most of those tuples will be in "virtual
> > tuple" format anyway, and so the optimization wouldn't make any
> > difference at all...)
> 
> After further study of the code, here's my hit-list of places that could
> make worthwhile use of MinimalTuples:
> 
>   tuplesort.c (in-memory, on-disk case done already)
>   tuplestore.c (in-memory and on-disk)
>   TupleHashTable (execGrouping.c --- used by nodeAgg and nodeSubplan)
>   hash joins (in-memory hash table and tuple "batch" files)

That's the list I thought you meant.

>   analyze.c (tuples collected in-memory for stats analysis)

Do we save enough there for us to care?

Will that allow us to increase the sample size for larger tables?

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-26 Thread Alvaro Herrera
Simon Riggs wrote:
> On Mon, 2006-06-26 at 13:58 -0400, Alvaro Herrera wrote:

> > The relminxid Patch
> > ===
> > 
> > What I'm after is not freezing for read-only media, nor archive, nor
> > read-only tables. 
> 
> OK, but I am... but I'm happy to not to confuse the discussion.

Ok :-)  I think I put a note about this but removed it while
restructuring the text so it would be clearer.  The note is that while I
don't care about read-only stuff in this proposal, read-only tables may
come as a "side effect" of implementing this.  But I agree we should not
make the discussion more complex than it already is.


> > 2. Mark frozen databases specially somehow.
> >To mark databases frozen, we need a way to mark tables as frozen.
> >How do we do that?  As I explain below, this allows some nice
> >optimizations, but it's a very tiny can full of a huge amount of
> >worms.
> 
> At this stage you talk about databases, yet below we switch to
> discussing tables. Not sure why we switched from one to the other.

Sorry, I forgot one step.  To mark a database frozen, we must make sure
that all tables within that database are frozen as well.  So the first
step to freezing a database is freezing all its tables.


> > Marking a Table Frozen
> > ==
> > 
> > Marking a table frozen is simple as setting relminxid = FrozenXid for a
> > table.  As explained above, this cannot be done in a regular postmaster
> > environment, because a concurrent transaction could be doing nasty stuff
> > to a table.  So we can do it only in a standalone backend.
> 
> Surely we just lock the table? No concurrent transactions?

No, because a transaction can have been started previously and yet not
hold any lock on the table, and write on the table after the vacuum
finishes.  Or write on an earlier page of the table, after the vacuuming
already processed it.  But here it comes one of the "nice points" below,
which was that if we acquire a suitable exclusive lock on the table, we
_can_ mark it frozen.  Of course, this cannot be done by plain vacuum,
because we want the table to be still accessible by other transactions.
This is where VACUUM FREEZE comes in -- it does the same processing as
lazy vacuum, except that it locks the table exclusively and marks it
with FrozenXid.

> > Nice optimization: if we detect that a table is fully frozen, then
> > VACUUM is a no-op (not VACUUM FULL), because by definition there are no
> > tuples to remove.
> 
> Yes please, but we don't need it anymore do we? Guess we need it for
> backwards compatibility? VACUUM still needs to vacuum every table.

Sorry, I don't understand what you mean here.  We don't need what
anymore?

> > Another optimization: if we are sure that unfreezing works, we can even
> > mark a table as frozen in a postmaster environment, as long as we take
> > an ExclusiveLock on the table.  Thus we know that the vacuum is the sole
> > transaction concurrently accessing the table; and if another transaction
> > comes about and writes something after we're finished, it'll correctly
> > unfreeze the table and all is well.
> 
> Why not just have a command to FREEZE and UNFREEZE an object? It can
> hold an ExclusiveLock, avoiding all issues. Presumably FREEZE and
> UNFREEZE are rare commands?

Ok, if I'm following you here, your point is that FREEZE'ing a table
sets the relminxid to FrozenXid, and UNFREEZE removes that; and also, in
between, no one can write to the table?

This seems to make sense.  However, I'm not very sure about the
FREEZE'ing operation, because we need to make sure the table is really
frozen.  So we either scan it, or we make sure something else already
scanned it; to me what makes the most sense is having a VACUUM option
that would do the freezing (and a separate command to do the
unfreezing).

> > Where are the problems in this approach?
> > 
> > 2. The current implementation puts the unfreezing in LockRelation.  This
> > is a problem, because any user can cause a LockRelation on any table,
> > even if the user does not have access to that table.
> 
> That last bit just sounds horrible to me. But thinking about it: how
> come any user can lock a relation they shouldn't even be allowed to know
> exists? Possibly OT.

Hmm, I guess there must be several commands that open the relation and
lock it, and then check permissions.  I haven't checked the code but you
shouldn't check permissions before acquiring some kind of lock, and we
shouldn't be upgrading locks either.

> I can see other reasons for having pg_class_nt, so having table info
> cached in shared memory does make sense to me (yet not being part of the
> strict definitions of the relcache).

Yeah.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [HACKERS] [COMMITTERS] pgsql: Change the row constructor syntax (ROW(...))

2006-06-26 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Does this complete this TODO item?
> o Allow UPDATE tab SET ROW (col, ...) = (...) for updating multiple
>   columns

No, that's unrelated.

regards, tom lane



Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Tom Lane
I wrote:
> There isn't any benefit except where we collect lots of tuples, which is
> to say tuplesort/tuplestore/tuplehashtable.  In other places in the
> executor, there's basically only one transient tuple in existence per
> plan node; jumping through hoops to save 16 bytes per plan node is just
> silly.  (What's more, as of 8.1 most of those tuples will be in "virtual
> tuple" format anyway, and so the optimization wouldn't make any
> difference at all...)

After further study of the code, here's my hit-list of places that could
make worthwhile use of MinimalTuples:

tuplesort.c (in-memory, on-disk case done already)
tuplestore.c (in-memory and on-disk)
TupleHashTable (execGrouping.c --- used by nodeAgg and nodeSubplan)
hash joins (in-memory hash table and tuple "batch" files)
analyze.c (tuples collected in-memory for stats analysis)

It looks like there is actually not anyplace else in the executor where
we "materialize" tuples anymore, except for execMain.c's INSERT/UPDATE
code, which of course is going to want full tuples it can stash on disk.
Everything else is dealing in TupleTableSlots that probably contain
virtual tuples.

So in one sense this *is* "all across the executor".  But the amount of
code to touch seems pretty limited.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Inheritance, CREATE TABLE LIKE, and partitioned tables

2006-06-26 Thread elein
On Mon, Jun 26, 2006 at 12:31:24PM -0400, Greg Stark wrote:
> 
> Currently analyze.c and tablecmds.c both have some very similar code to handle
> copying columns from "parent" tables. As long as they were just copying
> columns and in the case of tablecmds.c copying check constraints that wasn't
> really that bad an idea, the code is pretty simple.
> 
> However with partitioned tables it becomes important to copy more table
> attributes like foreign key constraints and hopefully one day indexes. And it
> would be awfully convenient to have CREATE TABLE LIKE have options to copy the
> same things that inherited tables get. And have exactly the same semantics.

People will turn around and immediately ask for CREATE TABLE LIKE without us
assuming it wanted all of the extras that come with inheritance.
COPY is a copy of the table, not additional functionality.  The added
flexibility of adding only what is necessary makes more sense than guessing
what was wanted and removing it later if it is not needed.

This does not preclude adding a copy table like (with extras) though, if you must.

> 
> So I'm suggesting refactoring the code from analyze.c and tablecmds.c into
> functions to copy the columns, constraints, indexes etc.
> 
> For example I see a functions like:
> 
> List *CopyTableColumns(relation source, List *target_schema)
> List *CopyTableCheckConstraints(relation source, List *target_schema)
> ...
> 
> To do this though might require some changes in the algorithm used for
> inherited tables. Currently it builds up the list of merged columns
> incrementally. I'm thinking it would be more natural to accumulate all the
> columns from parents and then remove duplicates in a single pass. I think it
> should be possible to maintain precisely the same semantics doing this though.

Be careful because the code may be taking into account multiple inheritance
here.  I'm not sure about this; it should be looked at carefully.

> 
> I may be able to make AddInherits a consumer for these functions as well,
> though it would be a bit awkward since it would have to construct a fake list
> of ColumnDefs to act as the target schema. It would have the side effect of
> making the constraint comparison use change_varattnos_of_a_node and then
> compare the binary representations rather than decompiling the constraints to
> do the comparison. I'm not sure if that's the same semantics.
Again, be sure it is the same semantics before going that way.
> 
> To a certain degree I feel like this is just make-work. The existing code
> works fine and I can just happily keep additing functionality to both
> analyze.c and tablecmds.c. And it's possible we won't always want to have the
> two match.
> 
> Has anyone looked at applying the ADD INHERITS patch yet? Would it be more or
> less likely to be accepted if it were a bigger patch that refactored all this
> stuff like I'm talking about?
> 
> -- 
> greg
> 
---elein
[EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-26 Thread Simon Riggs
On Mon, 2006-06-26 at 13:58 -0400, Alvaro Herrera wrote:
> Ok, let's step back to discuss this again.  Sorry for the length -- this
> is a description of the problem I'm trying to solve, the issues I found,
> and how I tried to solve them.

Thanks. This is good.

> The relminxid Patch
> ===
> 
> What I'm after is not freezing for read-only media, nor archive, nor
> read-only tables. 

OK, but I am... but I'm happy not to confuse the discussion.

> Now, why do we need to vacuum whole databases at a time?

> So, we have to do something to cope with frozen databases.  I see two
> ways:
> 
> 1. Remove the special case, i.e., process frozen databases in VACUUM
>like every other database.
>This is the easiest, because no extra logic is needed.  Just make
>sure they are vacuumed in time.  The only problem would be that we'd
>need to uselessly vacuum tables that we know are frozen, from time to
>time.  But then, those tables are probably small, so what's the
>problem with that?

> 2. Mark frozen databases specially somehow.
>To mark databases frozen, we need a way to mark tables as frozen.
>How do we do that?  As I explain below, this allows some nice
>optimizations, but it's a very tiny can full of a huge amount of
>worms.

At this stage you talk about databases, yet below we switch to
discussing tables. Not sure why we switched from one to the other.

> Marking a Table Frozen
> ==
> 
> Marking a table frozen is simple as setting relminxid = FrozenXid for a
> table.  As explained above, this cannot be done in a regular postmaster
> environment, because a concurrent transaction could be doing nasty stuff
> to a table.  So we can do it only in a standalone backend.

Surely we just lock the table? No concurrent transactions?

> On the other hand, a "frozen" table must be marked with relminxid =
> a-regular-Xid as soon as a transaction writes some tuples on it.  Note
> that this "unfreezing" must take place even if the offending transaction
> is aborted, because the Xid is written in the table nevertheless and
> thus it would be incorrect to lose the unfreezing.
> 
> This is how pg_class_nt came into existence -- it would be a place where
> information about a table would be stored and not subject to the rolling
> back of the transaction that wrote it.  So if you find that a table is
> frozen, you write an unfreezing into its pg_class_nt tuple, and that's
> it.
> 
> Nice optimization: if we detect that a table is fully frozen, then
> VACUUM is a no-op (not VACUUM FULL), because by definition there are no
> tuples to remove.

Yes please, but we don't need it anymore do we? Guess we need it for
backwards compatibility? VACUUM still needs to vacuum every table.

> Another optimization: if we are sure that unfreezing works, we can even
> mark a table as frozen in a postmaster environment, as long as we take
> an ExclusiveLock on the table.  Thus we know that the vacuum is the sole
> transaction concurrently accessing the table; and if another transaction
> comes about and writes something after we're finished, it'll correctly
> unfreeze the table and all is well.

Why not just have a command to FREEZE and UNFREEZE an object? It can
hold an ExclusiveLock, avoiding all issues. Presumably FREEZE and
UNFREEZE are rare commands?


> Where are the problems in this approach?
> 
> 1. Performance.  We'll need to keep a cache of pg_class_nt tuples.  This
> cache must be independent of the current relcache, because the relcache
> is properly transactional while the pg_class_nt cache must not be.
> 
> 2. The current implementation puts the unfreezing in LockRelation.  This
> is a problem, because any user can cause a LockRelation on any table,
> even if the user does not have access to that table.

That last bit just sounds horrible to me. But thinking about it: how
come any user can lock a relation they shouldn't even be allowed to know
exists? Possibly OT.

I can see other reasons for having pg_class_nt, so having table info
cached in shared memory does make sense to me (yet not being part of the
strict definitions of the relcache).

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> On Mon, 2006-06-26 at 14:36 -0400, Tom Lane wrote:
>> I thought for awhile about MemoryTuple (as contrasted to HeapTuple) but
>> that seems too generic.  Any other thoughts?

> I like MemoryTuple but since we only use it when we go to disk...

> ExecutorTuple, MinimalTuple, DataOnlyTuple, MultTuple, TempFileTuple

MinimalTuple seems good to me ...

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [COMMITTERS] pgsql: Change the row constructor syntax (ROW(...))

2006-06-26 Thread Bruce Momjian

Does this complete this TODO item?

o Allow UPDATE tab SET ROW (col, ...) = (...) for updating multiple
  columns


---

Tom Lane wrote:
> Log Message:
> ---
> Change the row constructor syntax (ROW(...)) so that list elements foo.*
> will be expanded to a list of their member fields, rather than creating
> a nested rowtype field as formerly.  (The old behavior is still available
> by omitting '.*'.)  This syntax is not allowed by the SQL spec AFAICS,
> so changing its behavior doesn't violate the spec.  The new behavior is
> substantially more useful since it allows, for example, triggers to check
> for data changes with 'if row(new.*) is distinct from row(old.*)'.  Per
> my recent proposal.
> 
> Modified Files:
> --
> pgsql/doc/src/sgml:
> syntax.sgml (r1.106 -> r1.107)
> 
> (http://developer.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/syntax.sgml.diff?r1=1.106&r2=1.107)
> pgsql/src/backend/parser:
> parse_expr.c (r1.192 -> r1.193)
> 
> (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/parser/parse_expr.c.diff?r1=1.192&r2=1.193)
> parse_target.c (r1.143 -> r1.144)
> 
> (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/parser/parse_target.c.diff?r1=1.143&r2=1.144)
> pgsql/src/include/parser:
> parse_target.h (r1.39 -> r1.40)
> 
> (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/include/parser/parse_target.h.diff?r1=1.39&r2=1.40)
> pgsql/src/test/regress/expected:
> triggers.out (r1.22 -> r1.23)
> 
> (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/test/regress/expected/triggers.out.diff?r1=1.22&r2=1.23)
> pgsql/src/test/regress/sql:
> triggers.sql (r1.12 -> r1.13)
> 
> (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/test/regress/sql/triggers.sql.diff?r1=1.12&r2=1.13)
> 
> ---(end of broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>choose an index scan if your joining column's datatypes do not
>match
> 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-26 Thread Alvaro Herrera
[Resending: apparently the previous one to the list was eaten by spam
filters or something.  Changing SMTP relay again ... ]


Ok, let's step back to discuss this again.  Sorry for the length -- this
is a description of the problem I'm trying to solve, the issues I found,
and how I tried to solve them.

The relminxid Patch
===

What I'm after is not freezing for read-only media, nor archive, nor
read-only tables.  What I'm after is removing the requirement that all
databases must be vacuumed wholly every 2 billion transactions.

Now, why do we need to vacuum whole databases at a time?

The Transaction Id Counter
==

We know that the Xid counter is weird; it cycles, for starters, and also
there are special values at the "start" of the cycle that are lesser
than all other values (BootstrapXid, FrozenXid).  The idea here is to
allow the counter to wrap around and old tuples not be affected, i.e.,
appear like they were committed in some distant past.

So we use the special Xid values to mark special stuff, like tuples
created by the bootstrap processing (which are always known to be good)
or tuples in template databases that are not connectable ("frozen"
databases).  We also use FrozenXid to mark tuples that are very old,
i.e. were committed a long time ago and never deleted.  Any such tuple
is unaffected by the status of the Xid counter.

It should be clear that we must ensure that after a suitable amount of
"time" (measured in advancement of the Xid counter) has passed, we
should change the old Xids in tuples to the special FrozenXid value.
The requirement for whole-database vacuuming is there because we
need to ensure that this is done in all the tables in the database.

We keep track of a "minimum Xid", call it minxid.  The Xid generator
refuses to assign a new Xid counter if this minxid is too far in the
past, because we'd risk causing Xid-wraparound data loss if we did; the
Xid comparison semantics would start behaving funny, and some tuples
that appeared to be alive not many transactions ago now suddenly appear
dead.  Clearly, it's important that before we advance this minxid we
ensure that all tables in the database have been under the process of
changing all regular Xids into FrozenXid.

Currently the only way to ensure that all tables have gone through this
process is processing them in a single VACUUM pass.  Skip even one
table, and you can forget about advancing the minxid.  Even if the
skipped table was vacuumed in the transaction just before this one.
Even if the table is fully frozen, i.e., all tables on it are marked
with FrozenXid.  Even if the table is empty.

Tracking minxid Per Table
=

So, my idea is to track this minxid per table.  To do this, I added a
column to pg_class called relminxid.  The minimum of it across a
database is used to determine each database's minimum, datminxid.  The
minimum of all databases is used to advance the global minimum Xid
counter.

So, if a table has 3 tuples whose Xmins are 42, 512 and FrozenXid, the
relminxid is 42.  If we keep track of all these religiously during
vacuum, we know exactly what is the minxid we should apply to this
particular table.

It is obvious that vacuuming one table can set the minimum for that
table.  So when the vacuuming is done, we can recalculate the database
minimum; and using the minima of all databases, we can advance the
global minimum Xid counter and truncate pg_clog.  We can do this on each
single-table vacuum -- so, no more need for database-wide vacuuming.

If a table is empty, or all tuples on it are frozen, then we must mark
the table with relminxid = RecentXmin.  This is because there could be
an open transaction that writes a new tuple to the table after the
vacuum is finished.  A newly created table must also be created with
relminxid = RecentXmin.  Because of this, we never mark a table with
relminxid = FrozenXid.


Template Databases
==

Up to this point everything is relatively simple.  Here is where the
strange problems appear.  The main issue is template databases.

Why are template databases special?  Because they are never vacuumed.
More generally, we assume that every database that is marked as
"datallowconn = false" is fully frozen, i.e. all tables on it are
frozen.  Autovacuum skips them.  VACUUM ignores them.  The minxid
calculations ignore them.  They are fully frozen so they don't matter
and they don't harm anybody.

That's fine and dandy until you realize what happens when you freeze a
database, let a couple billion transactions pass, and then create a
database using that as a template (or just "reallow connections" to a
database).  Because all the tables were frozen 2 billion transaction
ago, they are marked with an old relminxid, so as soon as you vacuum any
table, the minxid computations went to hell, and we have a DoS
condition.

So, we have to do something to cope with frozen databases.  I see two
ways:

1. Remove the special case, i.e., process frozen databases in VACUUM
   like every other database.
   This is the easiest, because no extra logic is needed.  Just make
   sure they are vacuumed in time.  The only problem would be that we'd
   need to uselessly vacuum tables that we know are frozen, from time to
   time.  But then, those tables are probably small, so what's the
   problem with that?

2. Mark frozen databases specially somehow.
   To mark databases frozen, we need a way to mark tables as frozen.
   How do we do that?  As I explain below, this allows some nice
   optimizations, but it's a very tiny can full of a huge amount of
   worms.

Re: [HACKERS] Overhead for stats_command_string et al, take 2

2006-06-26 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Oh, good point.  Do we remove stats_command_string?
> 
> You mean, remove the option to turn it off?  I don't think so.  Aside
> from whatever remaining overhead there is, there's a possible security
> argument to be made that one might not want one's commands exposed,
> even to other sessions with the same userid.

OK.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Simon Riggs
On Mon, 2006-06-26 at 14:36 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > On Mon, 2006-06-26 at 16:48 +0200, Martijn van Oosterhout wrote:
> >> Anyway, I think it's a good idea. Most places in the backend after the
> >> SeqScan/IndexScan node really don't care about most of the header
> >> fields and being able to drop them would be nice.
> 
> > I understood Tom meant to do this only for HashAgg and Tuplestore. Tom,
> > is it possible to extend this further across the executor as Martijn
> > suggests? That would be useful, even if it is slightly harder to measure
> > the benefit than it is with the can-spill-to-disk cases.
> 
> There isn't any benefit 

OK, see that... 

> I thought for awhile about MemoryTuple (as contrasted to HeapTuple) but
> that seems too generic.  Any other thoughts?

I like MemoryTuple but since we only use it when we go to disk...

ExecutorTuple, MinimalTuple, DataOnlyTuple, MultTuple, TempFileTuple

Pick one: I'm sorry I opined.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Zeugswetter Andreas DCP SD wrote:
> 
> > > head of the chain yet.  With an index scan, finding the head is
> easy, 
> > > but for a sequential scan, it seems more difficult, and we don't
> have 
> > > any free space in the tail of the chain to maintain a pointer to the
> head.
> > 
> > Thinking some more, there will need to be a bit to uniquely 
> > identify the head of a CITC.
> 
> I don't think so. It would probably be sufficient to impose an order on
> the CITC.
> e.g. the oldest tuple version in the CITC is the head. 
> (An idea just in case we can't spare a bit :-) 

Well, if we need to scan the page quickly, having the bit, or a bit
combination that can only be the head, is helpful.  What we usually do
is combine a SITC bit with another bit that would never be set for the
head, so the combination uniquely identifies the head, and you use
macros to properly do the tests.  We do this already in a number of cases.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Overhead for stats_command_string et al, take 2

2006-06-26 Thread Bruce Momjian

I ran your test with all defaults in 8.1 and CVS HEAD on a BSD/OS dual
Xeon and got:

8.1.X   1.946
HEAD    1.986

I ran the test ten times on three runs and took the middle value.

It is a slowdown of 2%.  I used these configure options:

configure \
--with-tcl \
--with-perl \
--with-tclconfig=/u/lib \
--with-includes="/usr/local/include/readline 
/usr/contrib/include" \
--with-libraries="/usr/local/lib /usr/contrib/lib" \
--with-openssl \
--enable-thread-safety \
--enable-nls

---

Tom Lane wrote:
> I redid my previous measurements after finishing up the weekend's
> hacking.  The numbers shown below are elapsed time in seconds for
> 
>   time psql -f testfile.sql postgres >/dev/null
> 
> using CVS HEAD and REL8_1_STABLE branch tip, compiled --enable-debug
> --disable-cassert, no nondefault options except for turning fsync off
> (which doesn't particularly affect read-only tests like these anyway).
> The machines are both running current Fedora Core 5.  The older x86
> machine is known to have the slow-gettimeofday problem from previous
> experimentation with EXPLAIN ANALYZE.  Each number is the median of 3
> or more tests, rounded off to 0.1 second (I wouldn't put a lot of faith
> in differences of 0.1 sec or so, because of the variance I saw in the
> tests).
> 
>                                         x86             x86_64
>
>                                     8.1     HEAD    8.1     HEAD
>
> 10 "SELECT 1;"                      25.9    27.0    9.2     9.1
> with stats_command_string=1         63.5    27.6    18.7    9.2
> with log_min_duration_statement=100 26.9    27.8    9.6     9.2
> with statement_timeout=100          27.5    28.6    9.6     9.8
> with all 3 features                 66.1    29.3    19.5    9.7
>
> BEGIN, 10 "SELECT 1;", COMMIT       21.2    23.1    8.3     8.4
> with stats_command_string=1         52.3    23.5    15.4    8.5
> with log_min_duration_statement=100 22.1    23.4    8.4     8.4
> with statement_timeout=100          23.7    24.3    8.6     8.8
> with all 3 features                 55.2    25.5    16.0    8.8
> 
> I chose the log_min_duration_statement and statement_timeout settings
> high enough so that no actual logging or timeout would happen --- the
> point is to measure the measurement overhead.
> 
> The good news is that we've pretty much licked the problem of
> stats_command_string costing an unreasonable amount.
> 
> The bad news is that except in the stats_command_string cases, HEAD
> is noticeably slower than 8.1 on the machine with slow gettimeofday.
> In the single-transaction test this might be blamed on the addition
> of statement_timestamp support (which requires a gettimeofday per
> statement that wasn't there in 8.1) ... but in the one-transaction-
> per-statement tests that doesn't hold water, because each branch is
> doing a gettimeofday per statement, just in different places.
> 
> Can anyone else reproduce this slowdown?  It might be only an artifact
> of these particular builds, but it's a bit too consistent in my x86 data
> to just ignore.
> 
> BTW, according to "top" the CPU usage percentages in these tests are
> on the order of 55% backend, 45% psql.  Methinks psql needs a round
> of performance tuning ...
> 
>   regards, tom lane
> 
> ---(end of broadcast)---
> TIP 6: explain analyze is your friend
> 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] GIN index creation extremely slow ?

2006-06-26 Thread Stefan Kaltenbrunner
on IRC somebody mentioned that it took >34h to create a GIN index (on a
tsvector) on a ~3 million row table (wikipedia dump) with a reasonably
specced box (AMD 3400+).
After getting hold of a dump of said table (around 4,1GB in size) I
managed to get the following timings:

test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
CREATE INDEX
Time: 416122.896 ms

so about 7 minutes - sounds very reasonable

test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
CREATE INDEX
Time: 52681605.101 ms

ouch - that makes for a whopping 14.6 hours(!). During that time the box is
completely CPU bottlenecked and doing virtually no IO at all (varying
maintenance_work_mem does not seem to make any noticeable difference).

That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10
for the WAL and 12 disks for the data running a very recent -HEAD
checkout ...

It looks like we still don't have any docs for GIN in the tree so I
don't know if those timings are expected or not ...


Stefan

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Jim C. Nasby
On Fri, Jun 23, 2006 at 06:37:01AM -0400, Mark Woodward wrote:
> While we all know session data is, at best, ephemeral, people still want
> some sort of persistence, thus, you need a database. For mcache I have a
> couple plugins that have a wide range of opitions, from read/write at
> startup and shut down, to full write through cache to a database.
> 
> In general, my clients don't want this, they want the database to store
> their data. When you try to explain to them that a database may not be the
> right place to store this data, they ask why, sadly they have little hope
> of understanding the nuances and remain unconvinced.

Have you done any benchmarking between a site using mcache and one not?
I'll bet there's a huge difference, which translates into hardware $$.
That's something managers can understand.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] Inheritance, CREATE TABLE LIKE, and partitioned tables

2006-06-26 Thread Greg Stark

Currently analyze.c and tablecmds.c both have some very similar code to handle
copying columns from "parent" tables. As long as they were just copying
columns and in the case of tablecmds.c copying check constraints that wasn't
really that bad an idea, the code is pretty simple.

However with partitioned tables it becomes important to copy more table
attributes like foreign key constraints and hopefully one day indexes. And it
would be awfully convenient to have CREATE TABLE LIKE have options to copy the
same things that inherited tables get. And have exactly the same semantics.

So I'm suggesting refactoring the code from analyze.c and tablecmds.c into
functions to copy the columns, constraints, indexes etc.

For example I see a functions like:

List *CopyTableColumns(relation source, List *target_schema)
List *CopyTableCheckConstraints(relation source, List *target_schema)
...

To do this though might require some changes in the algorithm used for
inherited tables. Currently it builds up the list of merged columns
incrementally. I'm thinking it would be more natural to accumulate all the
columns from parents and then remove duplicates in a single pass. I think it
should be possible to maintain precisely the same semantics doing this though.

I may be able to make AddInherits a consumer for these functions as well,
though it would be a bit awkward since it would have to construct a fake list
of ColumnDefs to act as the target schema. It would have the side effect of
making the constraint comparison use change_varattnos_of_a_node and then
compare the binary representations rather than decompiling the constraints to
do the comparison. I'm not sure if that's the same semantics.

To a certain degree I feel like this is just make-work. The existing code
works fine and I can just happily keep adding functionality to both
analyze.c and tablecmds.c. And it's possible we won't always want to have the
two match.

Has anyone looked at applying the ADD INHERITS patch yet? Would it be more or
less likely to be accepted if it were a bigger patch that refactored all this
stuff like I'm talking about?

-- 
greg


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: Anyone still care about Cygwin? (was Re: [HACKERS] [CORE] GPL

2006-06-26 Thread Josh Berkus
Dave, all,

> Ahh, you miss the point though - they vanish back into the woodwork when
> they realise that they can't brag that they host the entire project.
> It's not that they want to help, they just want maximum publicity off
> our name for as little hardware as possible.

I seem to have missed a cycle.  Who are we talking about?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

It is certainly possible to do what you are suggesting, that is have two
index entries point to same chain head, and have the index access
routines figure out if the index qualifications still hold, but that
seems like a lot of overhead.

Also, once there is only one visible row in the chain, removing old
index entries seems quite complex because you have to have vacuum keep
the qualifications of each row to figure out which index tuple is the
valid one (seems messy).

---

Jim C. Nasby wrote:
> On Sun, Jun 25, 2006 at 09:13:48PM +0300, Heikki Linnakangas wrote:
> > >If you can't expire the old row because one of the indexed columns was
> > >modified, I see no reason to try to reduce the additional index entries.
> > 
> > It won't enable early expiration, but it means less work to do on update. 
> > If there's a lot of indexes, not having to add so many index tuples can be 
> > a significant saving.
> 
> While catching up on this thread, the following idea came to me that I
> think would allow for not updating an index on an UPDATE if it's key
> doesn't change. If I understand Bruce's SITC proposal correctly, this
> would differ in that SITC requires that no index keys change.
> 
> My idea is that if an UPDATE places the new tuple on the same page as
> the old tuple, it will not create new index entries for any indexes
> where the key doesn't change. This means that when fetching tuples from
> the index, ctid would have to be followed until you found the version
> you wanted OR you found the first ctid that pointed to a different page
> (because that tuple will have its own index entry) OR you found a tuple
> with a different value for the key of the index you're using (because
> it'd be invalid, and there'd be a different index entry for it). I
> believe that the behavior of the index hint bits would also have to
> change somewhat, as each index entry would now essentially be pointing
> at all the tuples in the ctid chain that exist on a page, not just
> single tuple.
> 
> In the case of an UPDATE that needs to put the new tuple on a different
> page, our current behavior would be used. This means that the hint bits
> would still be useful in limiting the number of heap pages you hit. I
> also believe this means that we wouldn't suffer any additional overhead
> from our current code when there isn't much free space on pages.
> 
> Since SITC allows for in-page space reuse without vacuuming only when no
> index keys change, it's most useful for very heavily updated tables such
> as session handlers or queue tables, because those tables typically have
> very few indexes, so it's pretty unlikely that an index key will change.
> For more general-purpose tables that have more indexes but still see a
> fair number of updates to a subset of rows, not having to update every
> index would likely be a win. I also don't see any reason why both
> options couldn't be used together.
> -- 
> Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
> Pervasive Software  http://pervasive.com    work: 512-231-6117
> vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
> 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> On Mon, 2006-06-26 at 16:48 +0200, Martijn van Oosterhout wrote:
>> Anyway, I think it's a good idea. Most places in the backend after the
>> SeqScan/IndexScan node really don't care about most of the header
>> fields and being able to drop them would be nice.

> I understood Tom meant to do this only for HashAgg and Tuplestore. Tom,
> is it possible to extend this further across the executor as Martijn
> suggests? That would be useful, even if it is slightly harder to measure
> the benefit than it is with the can-spill-to-disk cases.

There isn't any benefit except where we collect lots of tuples, which is
to say tuplesort/tuplestore/tuplehashtable.  In other places in the
executor, there's basically only one transient tuple in existence per
plan node; jumping through hoops to save 16 bytes per plan node is just
silly.  (What's more, as of 8.1 most of those tuples will be in "virtual
tuple" format anyway, and so the optimization wouldn't make any
difference at all...)

> IMHO it would be better to call the format TupleWithoutVisibilityData so
> its very clear that accessing the visibility fields aren't present and
> any attempt to access them would be a mistake. TruncatedTuple isn't
> clear about what's missing or its consequences.

I'm not wedded to "TruncatedTuple", but I don't like your suggestion
better; it presumes too much about what the difference might be between
truncated and full tuples.  Even today, I don't think
"TupleWithoutVisibilityData" makes it clear that t_ctid is missing;
and down the road there might be other header fields that we don't need
to have in in-memory tuples.  Another small problem is that given our
naming conventions for structs vs pointers to structs, using "Data" as
the last word of a struct name is a bad idea --- for consistency,
pointers to it would be typedef TupleWithoutVisibility, which seems a
bit weird.  For consistency with the existing struct names, I think we
need to choose a name of the form "SomethingTuple".

I thought for awhile about MemoryTuple (as contrasted to HeapTuple) but
that seems too generic.  Any other thoughts?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Simon Riggs
On Mon, 2006-06-26 at 16:48 +0200, Martijn van Oosterhout wrote:
> On Mon, Jun 26, 2006 at 10:36:00AM -0400, Tom Lane wrote:
> > While looking at the recently-noticed problem that HashAggregate nodes
> > store more columns of the input than they need to, I couldn't help
> > noticing how much of the hashtable space goes into HeapTuple header
> > overhead.  A couple months ago we were able to get a useful improvement
> > in sorting by not storing unnecessary header fields in sort files, and
> > I'm strongly tempted to do the same in tuple hash tables.

> Anyway, I think it's a good idea. Most places in the backend after the
> SeqScan/IndexScan node really don't care about most of the header
> fields and being able to drop them would be nice.

I understood Tom meant to do this only for HashAgg and Tuplestore. Tom,
is it possible to extend this further across the executor as Martijn
suggests? That would be useful, even if it is slightly harder to measure
the benefit than it is with the can-spill-to-disk cases.

IMHO it would be better to call the format TupleWithoutVisibilityData so
its very clear that accessing the visibility fields aren't present and
any attempt to access them would be a mistake. TruncatedTuple isn't
clear about what's missing or its consequences.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Jim C. Nasby
On Sun, Jun 25, 2006 at 09:13:48PM +0300, Heikki Linnakangas wrote:
> >If you can't expire the old row because one of the indexed columns was
> >modified, I see no reason to try to reduce the additional index entries.
> 
> It won't enable early expiration, but it means less work to do on update. 
> If there's a lot of indexes, not having to add so many index tuples can be 
> a significant saving.

While catching up on this thread, the following idea came to me that I
think would allow for not updating an index on an UPDATE if its key
doesn't change. If I understand Bruce's SITC proposal correctly, this
would differ in that SITC requires that no index keys change.

My idea is that if an UPDATE places the new tuple on the same page as
the old tuple, it will not create new index entries for any indexes
where the key doesn't change. This means that when fetching tuples from
the index, ctid would have to be followed until you found the version
you wanted OR you found the first ctid that pointed to a different page
(because that tuple will have its own index entry) OR you found a tuple
with a different value for the key of the index you're using (because
it'd be invalid, and there'd be a different index entry for it). I
believe that the behavior of the index hint bits would also have to
change somewhat, as each index entry would now essentially be pointing
at all the tuples in the ctid chain that exist on a page, not just a
single tuple.
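The chain-following rule described above can be sketched in C. This is a toy in-memory model, not PostgreSQL's actual tuple layout: `TupleVersion`, its fields, and `fetch_via_index` are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: each tuple version records the heap page it lives
 * on, one indexed key value, a link to the next version in its ctid
 * chain (-1 = end of chain), and whether it is visible to our snapshot. */
typedef struct {
    int  page;     /* heap page the version lives on */
    int  key;      /* indexed column value */
    int  next;     /* array index of next version in chain, or -1 */
    bool visible;  /* visible to this backend's snapshot? */
} TupleVersion;

/* Follow the chain from the version an index entry points at, returning
 * the first visible version -- but stop if the chain hops to another
 * page or the key changes, since (per the proposal) those versions
 * carry their own index entries. */
static int
fetch_via_index(const TupleVersion *heap, int start)
{
    int cur = start;
    while (cur >= 0) {
        if (heap[cur].visible)
            return cur;
        int next = heap[cur].next;
        if (next < 0)
            break;
        if (heap[next].page != heap[start].page ||
            heap[next].key != heap[start].key)
            break;   /* that version has its own index entry */
        cur = next;
    }
    return -1;       /* no visible version reachable via this entry */
}
```

The point of the two stopping conditions is that any version the scan gives up on here is guaranteed to be reachable through a different index entry.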

In the case of an UPDATE that needs to put the new tuple on a different
page, our current behavior would be used. This means that the hint bits
would still be useful in limiting the number of heap pages you hit. I
also believe this means that we wouldn't suffer any additional overhead
from our current code when there isn't much free space on pages.

Since SITC allows for in-page space reuse without vacuuming only when no
index keys change, it's most useful for very heavily updated tables such
as session handlers or queue tables, because those tables typically have
very few indexes, so it's pretty unlikely that an index key will change.
For more general-purpose tables that have more indexes but still see a
fair number of updates to a subset of rows, not having to update every
index would likely be a win. I also don't see any reason why both
options couldn't be used together.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-26 Thread Alvaro Herrera

Ok, let's step back to discuss this again.  Sorry for the length -- this
is a description of the problem I'm trying to solve, the issues I found,
and how I tried to solve them.

The relminxid Patch
===================

What I'm after is not freezing for read-only media, nor archive, nor
read-only tables.  What I'm after is removing the requirement that all
databases must be vacuumed wholly every 2 billion transactions.

Now, why do we need to vacuum whole databases at a time?

The Transaction Id Counter
==========================

We know that the Xid counter is weird; it cycles, for starters, and also
there are special values at the "start" of the cycle that are lesser
than all other values (BootstrapXid, FrozenXid).  The idea here is to
allow the counter to wrap around and old tuples not be affected, i.e.,
appear like they were committed in some distant past.

So we use the special Xid values to mark special stuff, like tuples
created by the bootstrap processing (which are always known to be good)
or tuples in template databases that are not connectable ("frozen"
databases).  We also use FrozenXid to mark tuples that are very old,
i.e. were committed a long time ago and never deleted.  Any such tuple
is unaffected by the status of the Xid counter.

It should be clear that we must ensure that after a suitable amount of
"time" (measured in advancement of the Xid counter) has passed, we
should change the old Xids in tuples to the special FrozenXid value.
The requirement for whole-database vacuuming is there because we
need to ensure that this is done in all the tables in the database.

We keep track of a "minimum Xid", call it minxid.  The Xid generator
refuses to assign a new Xid counter if this minxid is too far in the
past, because we'd risk causing Xid-wraparound data loss if we did; the
Xid comparison semantics would start behaving funny, and some tuples
that appeared to be alive not many transactions ago now suddenly appear
dead.  Clearly, it's important that before we advance this minxid we
ensure that all tables in the database have been under the process of
changing all regular Xids into FrozenXid.
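For concreteness, the wraparound-aware comparison this relies on can be sketched as below. It is modeled on PostgreSQL's TransactionIdPrecedes, but the typedef and function name here are simplified stand-ins, not the real API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Wraparound-aware ordering: id1 "precedes" id2 if the signed 32-bit
 * difference is negative.  This only gives sensible answers while the
 * two ids are less than 2^31 transactions apart -- which is exactly
 * the invariant the minxid machinery is meant to preserve. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);
    return diff < 0;
}
```

(The real code also special-cases the non-normal xids such as BootstrapXid and FrozenXid before doing this arithmetic.)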

Currently the only way to ensure that all tables have gone through this
process is processing them in a single VACUUM pass.  Skip even one
table, and you can forget about advancing the minxid.  Even if the
skipped table was vacuumed in the transaction just before this one.
Even if the table is fully frozen, i.e., all tuples on it are marked
with FrozenXid.  Even if the table is empty.

Tracking minxid Per Table
=========================

So, my idea is to track this minxid per table.  To do this, I added a
column to pg_class called relminxid.  The minimum of it across a
database is used to determine each database's minimum, datminxid.  The
minimum of all databases is used to advance the global minimum Xid
counter.

So, if a table has 3 tuples whose Xmins are 42, 512 and FrozenXid, the
relminxid is 42.  If we keep track of all these religiously during
vacuum, we know exactly what is the minxid we should apply to this
particular table.
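A minimal sketch of that bookkeeping follows, mirroring the 42/512/FrozenXid example. All names are hypothetical, and it uses a plain `<` comparison (ignoring wraparound) for clarity.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* xids below this are special (Bootstrap = 1, Frozen = 2 in this sketch) */
#define FirstNormalTransactionId ((TransactionId) 3)

/* Per-table bookkeeping: the smallest normal xmin seen while vacuuming
 * one table.  Special xids are skipped because they are unaffected by
 * the counter.  `recent_xmin` is the starting value, which is what an
 * empty or fully frozen table ends up with. */
static TransactionId
compute_relminxid(const TransactionId *xmins, int n,
                  TransactionId recent_xmin)
{
    TransactionId minxid = recent_xmin;
    for (int i = 0; i < n; i++) {
        if (xmins[i] < FirstNormalTransactionId)
            continue;          /* FrozenXid etc.: unaffected */
        if (xmins[i] < minxid)
            minxid = xmins[i];
    }
    return minxid;
}
```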

It is obvious that vacuuming one table can set the minimum for that
table.  So when the vacuuming is done, we can recalculate the database
minimum; and using the minima of all databases, we can advance the
global minimum Xid counter and truncate pg_clog.  We can do this on each
single-table vacuum -- so, no more need for database-wide vacuuming.

If a table is empty, or all tuples on it are frozen, then we must mark
the table with relminxid = RecentXmin.  This is because there could be
an open transaction that writes a new tuple to the table after the
vacuum is finished.  A newly created table must also be created with
relminxid = RecentXmin.  Because of this, we never mark a table with
relminxid = FrozenXid.


Template Databases
==================

Up to this point everything is relatively simple.  Here is where the
strange problems appear.  The main issue is template databases.

Why are template databases special?  Because they are never vacuumed.
More generally, we assume that every database that is marked as
"datallowconn = false" is fully frozen, i.e. all tables on it are
frozen.  Autovacuum skips them.  VACUUM ignores them.  The minxid
calculations ignore them.  They are fully frozen so they don't matter
and they don't harm anybody.

That's fine and dandy until you realize what happens when you freeze a
database, let a couple billion transactions pass, and then create a
database using that as a template (or just "reallow connections" to a
database).  Because all the tables were frozen 2 billion transaction
ago, they are marked with an old relminxid, so as soon as you vacuum any
table, the minxid computations go to hell, and we have a DoS
condition.

So, we have to do something to cope with frozen databases.  I see two
ways:

1. Remove the special case, i.e., process frozen databases in VACUUM
   like every other database.
   This is the easiest, because no extra lo

Re: [HACKERS] ERROR: invalid page header in block

2006-06-26 Thread Simon Riggs
On Mon, 2006-06-26 at 16:28 +0300, Alexandru Coseru wrote:

> billing=> select sum(duration)/60 from cdr__2006 where 
> callstarttime<'2006-06-19 00:00:00';
> ERROR:  invalid page header in block 3527 of relation "cdr__2006_04_22";
> 
> Table cdr__2006_04_22 is a child of cdr__2006_04 , which is a child to 
> cdr__2006.
> 
> Can somebody explain this error ?  What should I do ?

Backup your database immediately. You may have corrupted data blocks.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Overhead for stats_command_string et al, take 2

2006-06-26 Thread Bruce Momjian
Tom Lane wrote:
> I wrote:
> > IIRC, newer BSDen use a kernel call for this, so you should be able to
> > measure it on your own machine.  Just tweak ps_status.c to force it to
> > select PS_USE_NONE instead of PS_USE_SETPROCTITLE to generate a
> > comparison case.  I'll try it on my old HPUX box too.
> 
> On HPUX, I get a median time of 5.59 sec for CVS HEAD vs 5.36 sec with
> ps_status diked out, for the test case of 1 "SELECT 1;" as separate
> transactions, assert-disabled build.  So, almost 10% overhead.  Given
> that the transactions can't get any more trivial than this, that's about
> a worst-case number.  Not sure if it's worth worrying about or not.
> However Kris Kennaway's report a couple weeks ago suggested things might
> be worse on BSD.

Yep, I see 8% here.  I will add a patch to allow the ps display to be
turned off.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Overhead for stats_command_string et al, take 2

2006-06-26 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Yep, I see 8% here.  I will add a patch to allow the ps display to be
> turned off.

I think we'd still want a backend to set the PS display once with its
identification data (user/DB name and client address).  It's just the
transient activity updates that should stop.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Overhead for stats_command_string et al, take 2

2006-06-26 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Oh, good point.  Do we remove stats_command_string?

You mean, remove the option to turn it off?  I don't think so.  Aside
from whatever remaining overhead there is, there's a possible security
argument to be made that one might not want one's commands exposed,
even to other sessions with the same userid.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Overhead for stats_command_string et al, take 2

2006-06-26 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Yep, I see 8% here.  I will add a patch to allow the ps display to be
> > turned off.
> 
> I think we'd still want a backend to set the PS display once with its
> identification data (user/DB name and client address).  It's just the
> transient activity updates that should stop.

Oh, good point.  Do we remove stats_command_string?  Does it have any
measurable overhead?  I see a little here, like 1%.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Zeugswetter Andreas DCP SD

> > head of the chain yet.  With an index scan, finding the head is easy,
> > but for a sequential scan, it seems more difficult, and we don't have
> > any free space in the tail of the chain to maintain a pointer to the head.
> 
> Thinking some more, there will need to be a bit to uniquely 
> identify the head of a CITC.

I don't think so. It would probably be sufficient to impose an order on
the CITC.
e.g. the oldest tuple version in the CITC is the head. 
(An idea just in case we can't spare a bit :-) 

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] vacuum, performance, and MVCC, and compression

2006-06-26 Thread PFC



> What about increasing the size of an existing index entry?  Can that be
> done easily when a new row is added?


I'd say it looks pretty much like inserting a new index tuple...
Say "value" is the indexed column.

Find the first page in the index featuring "value".
1:  If there is space on the page,
        add the tuple id to the list of the corresponding index entry
        (just like creating a new index tuple, but uses less space).
    else
        look at the next page.
        If the next page has an index tuple with the same indexed value,
            goto 1
        else
            insert a new page and create an index tuple on it.
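The steps above might look roughly like this in C. It is a toy in-memory model: `PostingEntry`, `IndexPage`, `PAGE_CAPACITY`, and the simplification of starting a new entry on the first page with space are all illustrative assumptions, not an actual index implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_CAPACITY 4   /* tiny, to keep the sketch readable */

typedef struct {
    int value;                 /* indexed value this posting entry covers */
    int ntids;
    int tids[PAGE_CAPACITY];   /* tuple ids sharing that value */
} PostingEntry;

typedef struct {
    int nentries;
    PostingEntry entries[PAGE_CAPACITY];
} IndexPage;

/* Walk the pages: append the tid to an existing posting entry for
 * `value` if there is room, otherwise start a new entry on the first
 * page with free space (a stand-in for "insert new page" above). */
static bool
posting_insert(IndexPage *pages, int npages, int value, int tid)
{
    for (int p = 0; p < npages; p++) {
        IndexPage *pg = &pages[p];
        for (int e = 0; e < pg->nentries; e++) {
            PostingEntry *pe = &pg->entries[e];
            if (pe->value == value && pe->ntids < PAGE_CAPACITY) {
                pe->tids[pe->ntids++] = tid;   /* cheap: no new entry */
                return true;
            }
        }
        if (pg->nentries < PAGE_CAPACITY) {
            PostingEntry *pe = &pg->entries[pg->nentries++];
            pe->value = value;
            pe->ntids = 1;
            pe->tids[0] = tid;
            return true;
        }
    }
    return false;   /* all pages full: caller would allocate a new page */
}
```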


> I would be worried about the overhead of doing that on compression and
> decompression.


	The compression methods mentioned in the article that was passed around
on the list seemed pretty fast: the test database went from IO-limited to
CPU-limited (and got a lot faster).



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: Anyone still care about Cygwin? (was Re: [HACKERS] [CORE] GPL

2006-06-26 Thread Dave Page
 

> -Original Message-
> From: Devrim GUNDUZ [mailto:[EMAIL PROTECTED] 
> Sent: 26 June 2006 15:03
> To: Dave Page
> Cc: Robert Treat; pgsql-hackers@postgresql.org; Andrew 
> Dunstan; Tom Lane
> Subject: Re: Anyone still care about Cygwin? (was Re: 
> [HACKERS] [CORE] GPL
> 
>
> Ok, I talked with them now. They need the following information:
> 
> - What type of hardware do we need? Is an Opteron enough? 
> Memory? Disk?

Depends what it was used for.

> - Do we need a private box, or is a VM enough?

As above.

> - What is the average traffic of our web servers?

Dunno, but not a huge figure - the network is heavily specc'ed towards
high availability and coping with a good /.ing. Individual servers tend
to be very lightly loaded.

> Bandwidth is not cheap in Turkey, so they are inclined to put 
> that server
> somewhere outside Turkey. 
> 
> They are ready to proceed, they just want to learn the details.

Well I think Robert is suggesting we setup a box with a bunch of VM's
running different OS's as buildfarm clients. I don't know how useful
that would be, but at the least, with VMWare (or Bochs if it's mature
enough now) and a Windows licence it could run nightly builds of Mingw
and Cygwin, and with just VMWare it could do Solaris x86, and an
assortment of Linuxes and *BSDs.

Regards, Dave.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] vacuum, performance, and MVCC, and compression

2006-06-26 Thread Bruce Momjian
PFC wrote:
> 
>   There were some talks lately about compression.
>   With a bit of lateral thinking I guess this can be used to contain the  
> bloat induced by updates.
>   Of course this is just my hypothesis.
> 
>   Compression in indexes :
> 
>   Instead of storing (value, tuple identifier) keys in the indexes, store
> (value, [tuple identifier list]) ; ie. all tuples which have the same
> indexed value are referenced by the same index tuple, instead of having
> one index tuple per actual tuple.
>   The length of the list would of course be limited to the space actually
> available on an index page ; if many rows have the same indexed value,
> several index tuples would be generated so that index tuples fit on index
> pages.
>   This would make the index smaller (more likely to fit in RAM) at the
> cost of a little CPU overhead for index modifications, but would make the
> index scans actually use less CPU (no need to compare the indexed value on
> each table tuple).

What about increasing the size of an existing index entry?  Can that be
done easily when a new row is added?

>   Compression in data pages :
> 
>   The article that circulated on the list suggested several types of  
> compression, offset, dictionary, etc. The point is that several row  
> versions on the same page can be compressed well because these versions  
> probably have similar column values.
> 
>   Just a thought...

I would be worried about the overhead of doing that on compression and
decompression.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Tom Lane
"Bort, Paul" <[EMAIL PROTECTED]> writes:
> Tom Lane said:
>> As long as the code using it never tries to access any
>> of the missing fields (t_xmin through t_ctid), this would work exactly
>> like a normal HeapTuple.

> This sounds like a security risk.

No more than any other wild-pointer problem.  At the level of C code
it's always trivial to break anything ;-).  The reason we don't need
to worry is that in the upper levels of the executor there is no such
thing as access to system columns --- any attempt to fetch a system
column is converted to a Var reference that appears in the initial
table-scan node, and thereafter it's an ordinary user column.  This
must be so because trying to keep the system columns in their normal
privileged positions breaks down as soon as you consider a join; there
would only be room for one, and the query might well be trying to fetch
(say) ctid from more than one table.  So any code that was trying to
fetch these fields would be buggy anyway.

In the case of the TupleHashTable code, the only access that need be
provided is through a TupleTableSlot.  We could get a little bit of
protection against programming mistakes by using the "virtual tuple"
feature of slots to disallow attempts to fetch any system columns from
a truncated tuple.  I'm not sure if this would be feasible for tuplesort
though; haven't looked at how it's used yet.
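The overlay trick under discussion can be illustrated with toy struct layouts. These are not the real HeapTupleHeaderData definitions; only the 16-byte figure is taken from the thread, and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins: the "full" header leads with 16 bytes of visibility
 * data (t_xmin .. t_ctid collapsed into one blob here), followed by
 * the fields the upper executor actually uses. */
typedef struct {
    char visibility[16];   /* absent in the truncated form */
    int  t_len;
    int  t_hoff;
} FullHeader;

typedef struct {
    int t_len;
    int t_hoff;
} TruncatedHeader;

/* Write through the truncated view, then read back through a
 * FullHeader pointer that starts 16 bytes earlier -- the same idea as
 * pointing t_data before the TruncatedTuple's actual start. */
static int
read_len_via_overlay(void)
{
    static union { FullHeader full; char raw[sizeof(FullHeader)]; } buf;
    TruncatedHeader *t =
        (TruncatedHeader *) (buf.raw + offsetof(FullHeader, t_len));
    t->t_len = 42;
    return ((FullHeader *) buf.raw)->t_len;   /* shared fields line up */
}
```

The overlay only stays safe as long as nothing ever dereferences the leading 16 bytes, which is why the discussion centers on guaranteeing that no code path touches the visibility fields.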

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] vacuum, performance, and MVCC, and compression

2006-06-26 Thread PFC


There were some talks lately about compression.
	With a bit of lateral thinking I guess this can be used to contain the  
bloat induced by updates.

Of course this is just my hypothesis.

Compression in indexes :

	Instead of storing (value, tuple identifier) keys in the indexes, store  
(value, [tuple identifier list]) ; ie. all tuples which have the same  
indexed value are referenced by the same index tuple, instead of having  
one index tuple per actual tuple.
	The length of the list would of course be limited to the space actually  
available on an index page ; if many rows have the same indexed value,  
several index tuples would be generated so that index tuples fit on index  
pages.
	This would make the index smaller (more likely to fit in RAM) at the cost  
of a little CPU overhead for index modifications, but would make the index  
scans actually use less CPU (no need to compare the indexed value on each  
table tuple).


Compression in data pages :

	The article that circulated on the list suggested several types of  
compression, offset, dictionary, etc. The point is that several row  
versions on the same page can be compressed well because these versions  
probably have similar column values.


Just a thought...

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: Anyone still care about Cygwin? (was Re: [HACKERS] [CORE] GPL

2006-06-26 Thread Dave Page
 

> -Original Message-
> From: Robert Treat [mailto:[EMAIL PROTECTED] 
> Sent: 26 June 2006 16:08
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org; Andrew Dunstan; Tom Lane; 
> Devrim GUNDUZ
> Subject: Re: Anyone still care about Cygwin? (was Re: 
> [HACKERS] [CORE] GPL
> 
> Right, but I am thinking that maybe we should ask these guys 
> if they can offer 
> boxes for the buildfarm, which could run from a celeron box.  

Ahh, you miss the point though - they vanish back into the woodwork when
they realise that they can't brag that they host the entire project.
It's not that they want to help, they just want maximum publicity off
our name for as little hardware as possible.

I should mention at this point that not all companies are like this, and
we do have some very generous contributors to whom we are very grateful.

Regards, Dave.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Hannu Krosing wrote:
> > > pass 3: clean heap based on ctid from pass 1
> > > 
> > > If you do it this way, you don't need to invent new data structures to
> > > pass extra info about CITC internals to passes 2 and 3
> > > 
> > > One more thing - when should the free space map be notified about free space
> > > in pages with CITC chains ?
> > 
> > Uh, well, I am thinking we only free CITC space when we are going to use
> > it for an UPDATE, rather than free things while doing an operation.  It
> > is good to keep the cleanup overhead out of the main path as much as
> > possible.
> 
> So vacuum should only remove dead CITC chains and leave the ones with
> live tuples to CITC internal use ?

Yes, it has to.  What else would it do?  Add index entries?

> That would also suggest that pages having live CITC chains and less than
> N% of free space should not be reported to FSM.

Parts of the CITC that are not visible can be used for free space by
vacuum, but the visible part is left alone.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Martijn van Oosterhout wrote:
-- Start of PGP signed section.
> On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
> > > > I suppose we would also change the index_getmulti() function to return
> > > > a set of ctids plus flags so the caller knows to follow the chains,
> > > > right? 
> > > 
> > > It is probably better to always return the pointer to the head of CITC
> > > chain (the one an index points to) and do extra visibility checks and
> > > chain-following on each access. This would keep the change internal to
> > > tuple fetching functions.
> > 
> > So index_getnext() traverses the chain and returns one member per call. 
> > Makes sense.  Just realize you are in a single index entry returning
> > multiple tuples.  We will need some record keeping to track that.
> 
> Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
> have to change all the users of that (which aren't many, I suppose).
> It's probably worth making a utility function to expand them.
> 
> I'm still confused where bitmap index scan fit into all of this. Is
> preserving the sequential scan aspect of these a goal with this new
> setup?

No.  I was just pointing out that if you get to the tuple via an index,
you get handed the head of the SITC via the index tuple, but if you are
doing a sequential scan, you don't get it, so you have to find it, or
any other non-visible SITC header.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Bort, Paul
Tom Lane said:
> 
> To make use of a TruncatedTuple, we'd set up a temporary HeapTupleData
> struct with its t_data field pointing 16 bytes before the start of the
> TruncatedTuple.  As long as the code using it never tries to 
> access any
> of the missing fields (t_xmin through t_ctid), this would work exactly
> like a normal HeapTuple.
> 

This sounds like a security risk. What's the worst thing that could be in
those 16 bytes, and could that be used to bite (sorry) us?

If those 16 bytes could be user data in another tuple, there might be an
attack there.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Hannu Krosing
On one fine day, Mon, 2006-06-26 at 16:58, Martijn van Oosterhout wrote:
> On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
> > > > I suppose we would also change the index_getmulti() function to return
> > > > a set of ctids plus flags so the caller knows to follow the chains,
> > > > right? 
> > > 
> > > It is probably better to always return the pointer to the head of CITC
> > > chain (the one an index points to) and do extra visibility checks and
> > > chain-following on each access. This would keep the change internal to
> > > tuple fetching functions.
> > 
> > So index_getnext() traverses the chain and returns one member per call. 
> > Makes sense.  Just realize you are in a single index entry returning
> > multiple tuples.  We will need some record keeping to track that.
> 
> Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
> have to change all the users of that (which aren't many, I suppose).
> It's probably worth making a utility function to expand them.
> 
> I'm still confused where bitmap index scan fit into all of this. Is
> preserving the sequential scan aspect of these a goal with this new
> setup?

Bitmap index scan does not have to change much - only the function that
gets tuple by its ctid must be able to trace forward chains within the
page.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: Anyone still care about Cygwin? (was Re: [HACKERS] [CORE] GPL

2006-06-26 Thread Robert Treat
On Monday 26 June 2006 03:28, Dave Page wrote:
> > -Original Message-
> > From: Robert Treat [mailto:[EMAIL PROTECTED]
> > Sent: 24 June 2006 20:50
> > To: pgsql-hackers@postgresql.org
> > Cc: Andrew Dunstan; Tom Lane; Dave Page
> > Subject: Re: Anyone still care about Cygwin? (was Re:
> > [HACKERS] [CORE] GPL
> >
> >
> > Dave,
> >
> > wasn't someone just trying to donate a machine to us for the
> > website but we
> > weren't sure what to do with it?  One that could do VM's?
> > Seems we could use
> > that for some buildfarm members maybe.
>
> As with most of these, the two I was discussing recently fell through
> (usual problem, company making the offer seems to think we run the
> entire project off one ancient server, and therefore think that the
> celeron box they offer will entitle them to be listed as hosts of the
> entire project).
>

Right, but I am thinking that maybe we should ask these guys if they can offer 
boxes for the buildfarm, which could run from a celeron box.  

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Hannu Krosing
On one fine day, Mon, 2006-06-26 at 10:50, Bruce Momjian wrote:
> Hannu Krosing wrote:
> > On one fine day, Mon, 2006-06-26 at 14:56, Martijn van Oosterhout
> > wrote:
> > > On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > > > Correct!  We use the same pointers used by normal UPDATEs, except we set
> > > > a bit on the old tuple indicating it is a single-index tuple, and we
> > > > don't create index entries for the new tuple.  Index scan routines will
> > > > need to be taught about the new chains, but because only one tuple in
> > > > the chain is visible to a single backend, the callers should not need to
> > > > be modified.
> > > 
> > > I suppose we would also change the index_getmulti() function to return
> > > a set of ctids plus flags so the caller knows to follow the chains,
> > > right? 
> > 
> > It is probably better to always return the pointer to the head of CITC
> > chain (the one an index points to) and do extra visibility checks and
> > chain-following on each access. This would keep the change internal to
> > tuple fetching functions.
> 
> So index_getnext() traverses the chain and returns one member per call. 
> Makes sense.  Just realize you are in a single index entry returning
> multiple tuples.  We will need some record keeping to track that.

Maybe we need to push visibility checks further down, so that
index_getnext() returns only the one heap row that is visible.
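The chain-walking idea under discussion can be put as a toy model. Everything below is invented for illustration — a drastically simplified visibility test and version chain, not the actual PostgreSQL structures or snapshot rules:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a same-page update chain: each version records the
 * transaction that created it (xmin) and the one that replaced it
 * (xmax, 0 = still live).  Real snapshot visibility is far more
 * involved; this only illustrates "walk from the chain head to the
 * one version visible to a given snapshot". */
typedef struct TupleVersion
{
    uint32_t xmin;              /* creating transaction */
    uint32_t xmax;              /* replacing transaction, 0 if none */
    struct TupleVersion *next;  /* newer version in the chain */
} TupleVersion;

/* Visible to snapshot s: created at or before s and not yet replaced. */
static int is_visible(const TupleVersion *v, uint32_t s)
{
    return v->xmin <= s && (v->xmax == 0 || v->xmax > s);
}

/* The index entry hands us the chain head; return the single member
 * visible to snapshot s, or NULL if none is. */
const TupleVersion *fetch_visible(const TupleVersion *head, uint32_t s)
{
    for (const TupleVersion *v = head; v != NULL; v = v->next)
        if (is_visible(v, s))
            return v;
    return NULL;
}
```

Under this sketch, index scan callers never see the chain at all: they hand over the head and get back exactly one visible version, which is the behavior being suggested for index_getnext().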

> > > And for bitmap index scans you would only remember the page in
> > > the case of such a tuple, since you can't be sure the exact ctid you've
> > > got is the one you want.
> > 
> > No, you should only use the pointer to the CITC head outside tuple access
> > functions. And this pointer to the CITC head is what is always passed to
> > those access functions/macros.
> > 
> > The VACUUM would run its passes thus:
> > 
> > pass 1: run over the heap, collect pointers to single dead tuples, and fully
> > dead CITC chains (fully dead = no live tuples on this page). Clean up
> > old tuples from CITC chains and move live tuples around so that the CITC
> > points to the oldest possibly visible (not vacuumed) tuple. Doing it there
> > frees us from the need to collect a separate set of pointers for those. Or
> > have you planned that old tuples from CITC chains are collected on the
> > go/as needed? Of course we could do both.
> 
> Non-visible CITC members should be freed during an UPDATE on the same
> page, so vacuum doesn't have to be involved.

Ok.

> > pass 2: clean indexes based on ctid from pass 1
> > 
> > pass 3: clean heap based on ctid from pass 1
> > 
> > If you do it this way, you don't need to invent new data structures to
> > pass extra info about CITC internals to passes 2 and 3
> > 
> > One more thing - when should the free space map be notified about free space
> > in pages with CITC chains?
> 
> Uh, well, I am thinking we only free CITC space when we are going to use
> it for an UPDATE, rather than free things while doing an operation.  It
> is good to keep the cleanup overhead out of the main path as much as
> possible.

So vacuum should only remove dead CITC chains and leave the ones with
live tuples for CITC internal use?

That would also suggest that pages having live CITC chains and less than
N% of free space should not be reported to the FSM.
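That reporting rule can be stated as a tiny predicate. The function name, signature, and threshold parameter are all invented for illustration; the real FSM interface works in terms of pages and byte counts, not fractions:

```c
#include <assert.h>

/* Sketch of the rule above: a page that still carries live CITC
 * chains keeps its free space reserved for in-page UPDATEs unless a
 * large fraction of the page is free. */
static int should_report_to_fsm(int has_live_citc_chain,
                                double free_fraction,  /* 0.0 .. 1.0 */
                                double threshold)      /* the "N%" */
{
    if (has_live_citc_chain && free_fraction < threshold)
        return 0;               /* reserve it for same-page updates */
    return free_fraction > 0.0; /* otherwise report any free space */
}
```

The design intent is that pages hosting active update chains stay out of the FSM so their free space remains available for the cheap same-page UPDATE path.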

> Also, seems I can't spell algorithms very well:
> 
> Definition:  Single-Index-Tuple Chain (SITC)
>  -
> Thinking of vacuum, right now it does these cleanups:
> 
>   o  non-visible UPDATEs on the same page with no key changes
>   o  non-visible UPDATEs on the same page with key changes
>   o  non-visible UPDATEs on different pages
>   o  DELETEs
>   o  aborted transactions
> 
> The big question is what percentage of dead space is the first one?  My
> guess is 65%.

Can be from 0% to 99.9%, very much dependent on application.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Martijn van Oosterhout
On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
> > > I suppose we would also change the index_getmulti() function to return
> > > a set of ctids plus flags so the caller knows to follow the chains,
> > > right? 
> > 
> > It is probably better to always return the pointer to the head of CITC
> > chain (the one an index points to) and do extra visibility checks and
> > chain-following on each access. This would keep the change internal to
> > tuple fetching functions.
> 
> So index_getnext() traverses the chain and returns one member per call. 
> Makes sense.  Just realize you are in a single index entry returning
> multiple tuples.  We will need some record keeping to track that.

Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
have to change all the users of that (which aren't many, I suppose).
It's probably worth making a utility function to expand them.

I'm still confused about where bitmap index scans fit into all of this. Is
preserving the sequential scan aspect of these a goal with this new
setup?

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
From each according to his ability. To each according to his ability to
litigate.




Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Tom Lane
Martijn van Oosterhout  writes:
> On Mon, Jun 26, 2006 at 10:36:00AM -0400, Tom Lane wrote:
>> Unlike the case with sort temp files, it's important to be able to
>> access the stored data without moving/copying it.  So, not wishing to
>> duplicate all the tuple access machinery we have already, I'm
>> envisioning a compromise design that leaves a couple bytes on the table
>> but looks enough like a standard tuple to be directly usable.

> I considered this, but ran into the problem that heap_getattr fell back
> to fastgetattr, which wouldn't know what kind of tuple it was given.
> Now, if you're going to add a special heap_getattr for these tuples,
> then of course there's no problem.

No, I'm specifically *not* going to do that.  AFAICS the proposed
representation requires no changes in heap_getattr; if it did, it'd
be too invasive to contemplate :-(.  It should look exactly like any
other HeapTuple structure, except that the "transaction status" fields
will contain garbage.  Do you see something I missed?

regards, tom lane



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Hannu Krosing
On Mon, 2006-06-26 at 09:10, Mark Woodward wrote:
> > On Fri, 2006-06-23 at 17:27, Bruce Momjian wrote:
> >> Jonah H. Harris wrote:
> >> > On 6/23/06, Tom Lane <[EMAIL PROTECTED]> wrote:
> >> > > What I see in this discussion is a huge amount of "the grass must be
> >> > > greener on the other side" syndrome, and hardly any recognition that
> >> > > every technique has its downsides and complications.
> >> >
> >> > I'm being totally objective.  I don't think we should abandon
> >> > PostgreSQL's overall design at all, because we do perform INSERTs and
> >> > DELETEs much better than most systems.  However, I've looked at many
> >> > systems and how they implement UPDATE so that it is a scalable
> >> > operation.  Sure, there are costs and benefits to each implementation,
> >> > but I think we have some pretty brilliant people in this community and
> >> > can come up with an elegant design for scalable UPDATEs.
> >>
> >> I think the UPDATE case is similar to the bitmap index scan or perhaps
> >> bitmap indexes on disk --- there are cases we know can not be handled
> >> well by our existing code, so we have added (or might add) these
> >> features to try to address those difficult cases.
> >
> > Not really. Bitmap index scan and bitmap index are both new additions
> > working well with existing framework.
> >
> > While the problem of slowdown on frequent updates is real, the suggested
> > fix is just plain wrong, as it is based on someones faulty assumption on
> > how index lookup works, and very much simplified view of how different
> > parts of the system work to implement MVCC.
> 
> Yes, the suggestion was based on MVCC concepts, not a particular
> implementation.

On the contrary - afaik, it was loosely based on how Oracle does it with
its rollback segments, only assuming that rollback segments are kept in
heap and that indexes point only to the oldest row version :p

> > The original fix he "suggests" was to that imagined behaviour and thus
> > ignored all the real problems of such change.
> 
> The original suggestion, was nothing more than a hypothetical for the
> purpose of discussion.
> 
> The problem was the steady degradation of performance on frequent updates.
> That was the point of discussion.  I brought up "one possible way" to
> start a "brainstorm." The discussion then morphed into criticizing the
> example and not addressing the problem.

The problem is heatedly discussed every 3-4 months.

> Anyway, I think some decent discussion about the problem did happen, and
> that is good.

Agreed. 

Maybe this _was_ the best way to bring up the discussion again.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Hannu Krosing wrote:
> On Mon, 2006-06-26 at 14:56, Martijn van Oosterhout wrote:
> > On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > > Correct!  We use the same pointers used by normal UPDATEs, except we set
> > > a bit on the old tuple indicating it is a single-index tuple, and we
> > > don't create index entries for the new tuple.  Index scan routines will
> > > need to be taught about the new chains, but because only one tuple in
> > > the chain is visible to a single backend, the callers should not need to
> > > be modified.
> > 
> > I suppose we would also change the index_getmulti() function to return
> > a set of ctids plus flags so the caller knows to follow the chains,
> > right? 
> 
> It is probably better to always return the pointer to the head of CITC
> chain (the one an index points to) and do extra visibility checks and
> chain-following on each access. This would keep the change internal to
> tuple fetching functions.

So index_getnext() traverses the chain and returns one member per call. 
Makes sense.  Just realize you are in a single index entry returning
multiple tuples.  We will need some record keeping to track that.

> > And for bitmap index scans you would only remember the page in
> > the case of such a tuple, since you can't be sure the exact ctid you've
> > got is the one you want.
> 
> No, you should only use the pointer to the CITC head outside tuple access
> functions. And this pointer to the CITC head is what is always passed to
> those access functions/macros.
> 
> The VACUUM would run its passes thus:
> 
> pass 1: run over the heap, collect pointers to single dead tuples, and fully
> dead CITC chains (fully dead = no live tuples on this page). Clean up
> old tuples from CITC chains and move live tuples around so that the CITC
> points to the oldest possibly visible (not vacuumed) tuple. Doing it there
> frees us from the need to collect a separate set of pointers for those. Or
> have you planned that old tuples from CITC chains are collected on the
> go/as needed? Of course we could do both.

Non-visible CITC members should be freed during an UPDATE on the same
page, so vacuum doesn't have to be involved.

> pass 2: clean indexes based on ctid from pass 1
> 
> pass 3: clean heap based on ctid from pass 1
> 
> If you do it this way, you don't need to invent new data structures to
> pass extra info about CITC internals to passes 2 and 3
> 
> One more thing - when should the free space map be notified about free space
> in pages with CITC chains?

Uh, well, I am thinking we only free CITC space when we are going to use
it for an UPDATE, rather than free things while doing an operation.  It
is good to keep the cleanup overhead out of the main path as much as
possible.

Also, seems I can't spell algorithms very well:

  Definition:  Single-Index-Tuple Chain (SITC)
 -
Thinking of vacuum, right now it does these cleanups:

o  non-visible UPDATEs on the same page with no key changes
o  non-visible UPDATEs on the same page with key changes
o  non-visible UPDATEs on different pages
o  DELETEs
o  aborted transactions

The big question is what percentage of dead space is the first one?  My
guess is 65%.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Martijn van Oosterhout
On Mon, Jun 26, 2006 at 10:36:00AM -0400, Tom Lane wrote:
> While looking at the recently-noticed problem that HashAggregate nodes
> store more columns of the input than they need to, I couldn't help
> noticing how much of the hashtable space goes into HeapTuple header
> overhead.  A couple months ago we were able to get a useful improvement
> in sorting by not storing unnecessary header fields in sort files, and
> I'm strongly tempted to do the same in tuple hash tables.
> 
> Unlike the case with sort temp files, it's important to be able to
> access the stored data without moving/copying it.  So, not wishing to
> duplicate all the tuple access machinery we have already, I'm
> envisioning a compromise design that leaves a couple bytes on the table
> but looks enough like a standard tuple to be directly usable.

I considered this, but ran into the problem that heap_getattr fell back
to fastgetattr, which wouldn't know what kind of tuple it was given.
Now, if you're going to add a special heap_getattr for these tuples,
then of course there's no problem.

Maybe create a version of heap_getattr that takes the fallback function
as a parameter?

Anyway, I think it's a good idea. Most places in the backend after the
SeqScan/IndexScan node really don't care about most of the header
fields and being able to drop them would be nice.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
From each according to his ability. To each according to his ability to
litigate.




Re: [HACKERS] UTF8 server-side on Win32?

2006-06-26 Thread Oswaldo Hernández

>> > The installer for 8.1 will let you pick UNICODE encoding
>> during setup.
>>
>> What if he uses initdb directly?
>
>Then it should be a simple "-E UNICODE", just as on unix.
>
>//Magnus

Hello, I'm the guy that posted this question on psql-es-ayuda.
Thanks to Alvaro for forwarding it here.

I would like to explain the question:

I can do initdb with '-E UTF8' or '-E UNICODE' on Windows without problem.

But the --locale switch does not permit setting it to 'Spanish_Spain.UNICODE',
'Spanish_Spain.UTF8' or 'Spanish_Spain.65001'.

I set the locale to Spanish_Spain.1252 (ANSI encoding) and it seems to work
well with database encoding UTF8.

I'm not an encoding expert, but I think the database encoding and the right
side of the locale specification must match for things to work correctly.

Am I wrong? Can -E UTF8 coexist with the Windows ANSI 1252 locale without
problems?


Thanks


--
*
Oswaldo Hernández
[EMAIL PROTECTED]
*



[HACKERS] "Truncated" tuples for tuple hash tables

2006-06-26 Thread Tom Lane
While looking at the recently-noticed problem that HashAggregate nodes
store more columns of the input than they need to, I couldn't help
noticing how much of the hashtable space goes into HeapTuple header
overhead.  A couple months ago we were able to get a useful improvement
in sorting by not storing unnecessary header fields in sort files, and
I'm strongly tempted to do the same in tuple hash tables.

Unlike the case with sort temp files, it's important to be able to
access the stored data without moving/copying it.  So, not wishing to
duplicate all the tuple access machinery we have already, I'm
envisioning a compromise design that leaves a couple bytes on the table
but looks enough like a standard tuple to be directly usable.
Specifically, something like this:

typedef struct TruncatedTupleData
{
    uint32      t_len;      /* length of tuple */

    char        pad[...];   /* see below */

    int16       t_natts;    /* number of attributes */
    ... the rest matching HeapTupleHeaderData ...
}

The padding would be chosen such that the offset of t_natts would have
the same value modulo MAXIMUM_ALIGNOF as it has in HeapTupleHeaderData.
This ensures that if a TruncatedTuple is stored starting on a MAXALIGN
boundary, data within it is correctly aligned the same as it would be
in a normal tuple.  With the current struct definitions, 2 bytes of
padding would be needed on all supported platforms, and a
TruncatedTuple would be 16 bytes shorter than a regular tuple.
However, because we are also eliminating a HeapTupleData struct, the
total savings in tuple hash tables would be 36 to 40 bytes per tuple.

To make use of a TruncatedTuple, we'd set up a temporary HeapTupleData
struct with its t_data field pointing 16 bytes before the start of the
TruncatedTuple.  As long as the code using it never tries to access any
of the missing fields (t_xmin through t_ctid), this would work exactly
like a normal HeapTuple.
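The alignment invariant can be checked mechanically. The sketch below assumes the 2006-era header layout (t_xmin/t_cmin/t_xmax/t_cmax/t_ctid ahead of t_natts), stand-in field types, and a MAXIMUM_ALIGNOF of 8 — it illustrates the offset arithmetic, and is not the real htup.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAXIMUM_ALIGNOF 8       /* assumption: typical MAXALIGN */

typedef struct
{
    uint16_t bi_hi, bi_lo, posid;   /* 6-byte stand-in for ItemPointerData */
} ItemPointerData;

/* Stand-in for the full header: transaction status fields first,
 * data content fields (t_natts onward) after them. */
typedef struct HeapTupleHeaderData
{
    uint32_t        t_xmin;
    uint32_t        t_cmin;
    uint32_t        t_xmax;
    uint32_t        t_cmax;
    ItemPointerData t_ctid;
    int16_t         t_natts;    /* data content fields begin here */
    /* ... t_infomask, t_hoff, null bitmap ... */
} HeapTupleHeaderData;

typedef struct TruncatedTupleData
{
    uint32_t t_len;     /* length of tuple */
    char     pad[2];    /* keeps t_natts at the same offset mod MAXALIGN */
    int16_t  t_natts;
} TruncatedTupleData;

/* The invariant described above: equal offsets modulo MAXIMUM_ALIGNOF,
 * so a TruncatedTuple placed on a MAXALIGN boundary aligns its data
 * exactly as a full header would. */
int truncated_header_aligns(void)
{
    return offsetof(HeapTupleHeaderData, t_natts) % MAXIMUM_ALIGNOF
        == offsetof(TruncatedTupleData, t_natts) % MAXIMUM_ALIGNOF;
}
```

With these stand-ins, t_natts sits at offset 22 in the full header and offset 6 in the truncated one — both 6 mod 8 — and the truncated form is 16 bytes shorter, matching the savings described above.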

Going forward, we'd have to be careful to preserve the existing
field-ordering separation between transaction status fields and data
content fields in tuple headers, but that's just a matter of adding some
comments to htup.h.

It's tempting to think about using this same representation for tuples
stored in memory by tuplesort.c and tuplestore.c.  That'd probably
require some changes in their APIs, but I think it's doable.

Comments?  Anyone think this is too ugly for words?

regards, tom lane



[HACKERS] ERROR: invalid page header in block

2006-06-26 Thread Alexandru Coseru

Hello..
During a regular query, I got a weird error:


billing=> select sum(duration)/60 from cdr__2006 where 
callstarttime<'2006-06-19 00:00:00';

ERROR:  invalid page header in block 3527 of relation "cdr__2006_04_22";

Table cdr__2006_04_22 is a child of cdr__2006_04, which is a child of
cdr__2006.


Can somebody explain this error ?  What should I do ?

Thanks
   Alex




Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Hannu Krosing
On Mon, 2006-06-26 at 14:56, Martijn van Oosterhout wrote:
> On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > Correct!  We use the same pointers used by normal UPDATEs, except we set
> > a bit on the old tuple indicating it is a single-index tuple, and we
> > don't create index entries for the new tuple.  Index scan routines will
> > need to be taught about the new chains, but because only one tuple in
> > the chain is visible to a single backend, the callers should not need to
> > be modified.
> 
> I suppose we would also change the index_getmulti() function to return
> a set of ctids plus flags so the caller knows to follow the chains,
> right? 

It is probably better to always return the pointer to the head of CITC
chain (the one an index points to) and do extra visibility checks and
chain-following on each access. This would keep the change internal to
tuple fetching functions.

> And for bitmap index scans you would only remember the page in
> the case of such a tuple, since you can't be sure the exact ctid you've
> got is the one you want.

No, you should only use the pointer to the CITC head outside tuple access
functions. And this pointer to the CITC head is what is always passed to
those access functions/macros.

The VACUUM would run its passes thus:

pass 1: run over the heap, collect pointers to single dead tuples, and fully
dead CITC chains (fully dead = no live tuples on this page). Clean up
old tuples from CITC chains and move live tuples around so that the CITC
points to the oldest possibly visible (not vacuumed) tuple. Doing it there
frees us from the need to collect a separate set of pointers for those. Or
have you planned that old tuples from CITC chains are collected on the
go/as needed? Of course we could do both.

pass 2: clean indexes based on ctid from pass 1

pass 3: clean heap based on ctid from pass 1

If you do it this way, you don't need to invent new data structures to
pass extra info about CITC internals to passes 2 and 3
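The three passes above can be modeled as a toy — arrays instead of pages, slot indexes instead of ctids; every structure here is invented for illustration:

```c
#include <assert.h>

enum { MAX_DEAD = 64 };

typedef struct
{
    int dead;       /* no longer visible to anyone */
    int in_index;   /* still has an index entry */
} Tuple;

typedef struct
{
    int n_dead;
    int dead_ids[MAX_DEAD];     /* stand-in for the collected ctid list */
} DeadList;

/* Pass 1: scan the heap and collect the dead tuples. */
void vacuum_pass1(const Tuple *heap, int n, DeadList *dl)
{
    dl->n_dead = 0;
    for (int i = 0; i < n; i++)
        if (heap[i].dead && dl->n_dead < MAX_DEAD)
            dl->dead_ids[dl->n_dead++] = i;
}

/* Pass 2: clean index entries for the collected "ctids". */
int vacuum_pass2(Tuple *heap, const DeadList *dl)
{
    int removed = 0;
    for (int i = 0; i < dl->n_dead; i++)
        if (heap[dl->dead_ids[i]].in_index)
        {
            heap[dl->dead_ids[i]].in_index = 0;
            removed++;
        }
    return removed;
}

/* Pass 3: reclaim the heap slots themselves. */
void vacuum_pass3(Tuple *heap, const DeadList *dl)
{
    for (int i = 0; i < dl->n_dead; i++)
        heap[dl->dead_ids[i]].dead = 0;     /* slot now reusable */
}
```

The point of the ordering is the same as in the real vacuum: index entries must be gone (pass 2) before the heap slots they point at are reused (pass 3), which is why pass 1's collected list is shared by both later passes.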

One more thing - when should the free space map be notified about free space
in pages with CITC chains?

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: [HACKERS] vacuum row?

2006-06-26 Thread Jonah H. Harris

On 6/26/06, Mark Woodward <[EMAIL PROTECTED]> wrote:

Is it really necessary to make personal comments like this? Let's discuss
"ideas", not personalities or people.


Chill dude :)  All I was saying is that if you look at past threads,
this is a heavily discussed and near-religious topic.  Doing some
research before firing another shot across the bow wouldn't hurt.  I
do like the discussion though, it's good... I was just making an
observation that it could've been brought up in a little more
constructive way as there's a good amount of detail on Firebird's MVCC
(not that it's perfect either), but it's a mature implementation sort
of similar to the method you're discussing.

Please don't take it personally... it surely wasn't meant that way.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 2nd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/



Re: Anyone still care about Cygwin? (was Re: [HACKERS] [CORE] GPL

2006-06-26 Thread Devrim GUNDUZ
Hi,

On Mon, 2006-06-26 at 08:28 +0100, Dave Page wrote:
> 
> Devrim was working with another potential contributor though, dunno
> how that's going. 

Ok, I talked with them now. They need the following information:

- What type of hardware do we need? Is an Opteron enough? Memory? Disk?
- Do we need a private box, or is a VM enough?
- What is the average traffic of our web servers?

Bandwidth is not cheap in Turkey, so they are inclined to put that server
somewhere outside Turkey. 

They are ready to proceed, they just want to learn the details.

Regards,
-- 
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/





Re: [HACKERS] vacuum row?

2006-06-26 Thread A.M.
On Mon, June 26, 2006 9:37 am, Mark Woodward wrote:
>> On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:
>>
>>> I originally suggested a methodology for preserving MVCC and everyone
>>> is confusing it as update "in place," this is not what I intended.
>>
>> Actually, you should've presented your idea as performing MVCC the way
>> Firebird does... the idea is basically the same.  Doing some research
>> never hurts... especially with this crowd.
>
> Is it really necessary to make personal comments like this? Let's discuss
> "ideas", not personalities or people.
>
>
> The whole issue was how to address updates steadily degrading
> performance. I wanted to brainstorm the issue and find a solution. I
> tossed out a first guess at an algorithm to start the ball rolling.
>
> Was it perfect? No. Was it intended to be? No. It was intended to spark a
> discussion: get people first to recognize the problem, and then to
> think about possible solutions.
>
> I find that this, while chaotic, usually finds the best solutions. There
> are a lot of good and smart people here who understand this process and see
> it for what it is. Unfortunately, some don't.
>
> It isn't about "research," per se, because it is assumed that we all know
>  the various issues involved to some degree. It is about using the
> collective knowledge of the group and coming up with an answer.

Actually, it is. There are plenty of databases that don't need an
expensive separate process to clean out dead space, so it would be wise to
see how those alternatives handle MVCC and what this project can glean
from others' work - that is the point of open source.

Firebird, for example, has a half-dozen articles on how it handles MVCC
and dead tuples. In particular, it puts the "vacuum" burden on the
sessions and fires off a separate cleanup ("sweep") thread. After all,
what would know better than the transaction itself what needs to be
cleaned up?

Linking goodness:
http://www.firebirdsql.org/doc/whitepapers/fb_vs_ibm_vs_oracle.htm
http://www.ibphoenix.com/main.nfs?a=ibphoenix&page=ibp_expert4

-M





Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Bruce Momjian wrote:
> Bruce Momjian wrote:
> > Martijn van Oosterhout wrote:
> > -- Start of PGP signed section.
> > > On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > > > Correct!  We use the same pointers used by normal UPDATEs, except we set
> > > > a bit on the old tuple indicating it is a single-index tuple, and we
> > > > don't create index entries for the new tuple.  Index scan routines will
> > > > need to be taught about the new chains, but because only one tuple in
> > > > the chain is visible to a single backend, the callers should not need to
> > > > be modified.
> > > 
> > > I suppose we would also change the index_getmulti() function to return
> > > a set of ctids plus flags so the caller knows to follow the chains,
> > > right? And for bitmap index scans you would only remember the page in
> > > the case of such a tuple, since you can't be sure the exact ctid you've
> > > got is the one you want.
> > > 
> > > Seems doable.
> > 
> > Yes, it just is an issue of where you want to add the complexity ---
> > scan entire page when no free space, or only an UPDATE.
> 
> Oh, and because you want to do this when doing an update via sequential
> scan as well as an index scan, I am thinking you might need to do the
> per-page method because you might not have even seen the head of the
> chain yet.  With an index scan, finding the head is easy, but for a
> sequential scan, it seems more difficult, and we don't have any free
> space in the tail of the chain to maintain a pointer to the head.

Thinking some more, there will need to be a bit to uniquely identify the
head of a CITC.  With that, you could just scan the page tuples looking
for CITC heads, check those to see if they are no longer visible, and
re-use them, rather than doing a full page reorganization where all
free space is collected in the middle of the page.  That should limit
the overhead of reuse.
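The page scan described above can be sketched as a toy: each line-pointer slot carries the proposed "chain head" bit, and an UPDATE looks for a head whose chain is no longer visible to anyone, reusing that slot instead of reorganizing the whole page. All structures here are invented for illustration:

```c
#include <assert.h>

typedef struct
{
    int is_chain_head;      /* the bit proposed above */
    int visible_to_anyone;  /* does the chain still have a visible member? */
} PageSlot;

/* Return the index of a reusable CITC/SITC head, or -1 if none.
 * This is the cheap alternative to a full page reorganization. */
int find_reusable_head(const PageSlot *slots, int nslots)
{
    for (int i = 0; i < nslots; i++)
        if (slots[i].is_chain_head && !slots[i].visible_to_anyone)
            return i;
    return -1;
}
```

The scan is O(slots on one page), so the reuse cost stays local to the page being updated rather than turning into a page-wide compaction.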

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum row?

2006-06-26 Thread Mark Woodward
> On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:
>> I originally suggested a methodology for preserving MVCC and everyone is
>> confusing it as update "in place," this is not what I intended.
>
> Actually, you should've presented your idea as performing MVCC the way
> Firebird does... the idea is basically the same.  Doing some research
> never hurts... especially with this crowd.

Is it really necessary to make personal comments like this? Let's discuss
"ideas", not personalities or people.

The whole issue was how to address updates steadily degrading performance.
I wanted to brainstorm the issue and find a solution. I tossed out a first
guess at an algorithm to start the ball rolling.

Was it perfect? No. Was it intended to be? No. It was intended to spark a
discussion: get people first to recognize the problem, and then to think
about possible solutions.

I find that this, while chaotic, usually finds the best solutions. There
are a lot of good and smart people here who understand this process and
see it for what it is. Unfortunately, some don't.

It isn't about "research," per se, because it is assumed that we all know
the various issues involved to some degree. It is about using the
collective knowledge of the group and coming up with an answer.

Over email, this can sometimes come off badly, and for that I apologize.
But imagine we were sitting at a table in the Cambridge Brewing Company
with laptops and pitchers of beer, discussing the problem.

I'm at a stark disadvantage as I use PostgreSQL a lot, but don't have the
luxury of being able to work on it in any real depth. I'd really love to.



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Bruce Momjian wrote:
> Martijn van Oosterhout wrote:
> -- Start of PGP signed section.
> > On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > > Correct!  We use the same pointers used by normal UPDATEs, except we set
> > > a bit on the old tuple indicating it is a single-index tuple, and we
> > > don't create index entries for the new tuple.  Index scan routines will
> > > need to be taught about the new chains, but because only one tuple in
> > > the chain is visible to a single backend, the callers should not need to
> > > be modified.
> > 
> > I suppose we would also change the index_getmulti() function to return
> > a set of ctids plus flags so the caller knows to follow the chains,
> > right? And for bitmap index scans you would only remember the page in
> > the case of such a tuple, since you can't be sure the exact ctid you've
> > got is the one you want.
> > 
> > Seems doable.
> 
> Yes, it just is an issue of where you want to add the complexity ---
> scan entire page when no free space, or only an UPDATE.

Oh, and because you want to do this when doing an update via sequential
scan as well as an index scan, I am thinking you might need to do the
per-page method because you might not have even seen the head of the
chain yet.  With an index scan, finding the head is easy, but for a
sequential scan, it seems more difficult, and we don't have any free
space in the tail of the chain to maintain a pointer to the head.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Martijn van Oosterhout wrote:
-- Start of PGP signed section.
> On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > Correct!  We use the same pointers used by normal UPDATEs, except we set
> > a bit on the old tuple indicating it is a single-index tuple, and we
> > don't create index entries for the new tuple.  Index scan routines will
> > need to be taught about the new chains, but because only one tuple in
> > the chain is visible to a single backend, the callers should not need to
> > be modified.
> 
> I suppose we would also change the index_getmulti() function to return
> a set of ctids plus flags so the caller knows to follow the chains,
> right? And for bitmap index scans you would only remember the page in
> the case of such a tuple, since you can't be sure the exact ctid you've
> got is the one you want.
> 
> Seems doable.

Yes, it just is an issue of where you want to add the complexity ---
scan entire page when no free space, or only an UPDATE.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] [CORE] GPL Source and Copyright Questions

2006-06-26 Thread Bruce Momjian

I did some research on this item.  I downloaded the source code to WN from:

http://hopf.math.northwestern.edu/index.html

I could only find the most recent version, wn-2.4.7.  I then looked at
its image.c file:

http://momjian.us/expire/image.c

I looked at the last two functions in the file and compared it to what
we have in CVS, particularly the version of the code when it was first
added to CVS:


http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/geo_ops.c?rev=1.13;content-type=text%2Fplain

Again, look at the last two functions in the file.

You will see similarities and differences.  Many of the variable names
are the same, and there is an identical comment block.  The algorithm
used is very similar as well, but the style and formatting is different.

I have updated the code comment in CVS to mention the web site, GPL
license, and article describing the algorithm.

Not sure where to go from here.

---

bruce wrote:
> Tom Lane wrote:
> > [ redirecting to -hackers, as I see no need for this to be a core issue ]
> > 
> > Charles Comiskey <[EMAIL PROTECTED]> writes:
> > > Hello,
> > > I've recently looked through the PostgreSQL code and a couple of 
> > > questions 
> > > surfaced.  I was hoping someone here may be able to answer them.  Two 
> > > have 
> > > links to possible GPL sources and the third is just a contribution 
> > > question. 
> > 
> > > item #1: Does the geo_ops.c file contain GPL code?
> > > Embedded within the geo_ops.c file is a John Franks copyright statement 
> > > referring to wn/image.c file from WN Server 1.15.1.  WN Server appears to 
> > > have been under the GPL license since 0.94 and continues to be offered 
> > > under the GPL license today.  John's letter to Linux Journal seems to 
> > > only 
> > > point the user to his WN Server distribution vs granting any specific 
> > > license.
> 
> The comment is:
> 
>   /* poly_contain_pt()
>* Test to see if the point is inside the polygon.
>* Code adapted from integer-based routines in
>*  Wn: A Server for the HTTP
>*  File: wn/image.c
>*  Version 1.15.1
>*  Copyright (C) 1995  
>* (code offered for use by J. Franks in Linux Journal letter.)
>*/
> 
> That term "adapted from" isn't something Thomas would idly type, I
> think.  I bet it means he looked at John Franks' code and used it as a
> base for our code.  I am not concerned.
> 
> > > Questions:
> > > 1) Is any John Franks code really in this file?
> > > 2) Did John provide a separate license for PostgreSQL to license it under 
> > > the BSD license?
> > 
> > This code seems to have been inserted by Tom Lockhart on 1997-07-29
> > (geo_ops.c rev 1.13).  Tom, any info on the copyright status?
> > 
> > > References:
> > > - 1994 e-mail with GPL reference to WN Server v0.94: 
> > > http://1997.webhistory.org/www.lists/www-talk.1994q4/1080.html
> > > - 1995 e-mail from John with GPL license text reference: 
> > > http://1997.webhistory.org/www.lists/www-talk.1995q1/0482.html
> > > - WN Server url today: http://hopf.math.northwestern.edu/
> > > - Link to Linux Journal article: http://www.linuxjournal.com/article/2197
> > 
> > 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: Anyone still care about Cygwin? (was Re: [HACKERS] [CORE] GPL

2006-06-26 Thread Dave Page
 

> -Original Message-
> From: Robert Treat [mailto:[EMAIL PROTECTED] 
> Sent: 24 June 2006 20:50
> To: pgsql-hackers@postgresql.org
> Cc: Andrew Dunstan; Tom Lane; Dave Page
> Subject: Re: Anyone still care about Cygwin? (was Re: 
> [HACKERS] [CORE] GPL
> 
> 
> Dave, 
> 
> wasn't someone just trying to donate a machine to us for the 
> website but we 
> weren't sure what to do with it?  One that could do VM's?  
> Seems we could use 
> that for some buildfarm members maybe. 

As with most of these, the two I was discussing recently fell through
(usual problem: the company making the offer seems to think we run the
entire project off one ancient server, and therefore thinks that the
celeron box they offer will entitle them to be listed as host of the
entire project).

Devrim was working with another potential contributor though, dunno how
that's going.

Regards, Dave



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Mark Woodward
> Heikki Linnakangas wrote:
>> On Mon, 26 Jun 2006, Jan Wieck wrote:
>>
>> > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
>> > > When you are using the update chaining, you can't mark that index row as
>> > > dead because it actually points to more than one row on the page, some
>> > > are non-visible, some are visible.
>> >
>> > Back up the truck ... you mean in the current code base we have heap tuples
>> > that are visible in index scans because of heap tuple chaining but without
>> > index tuples pointing directly at them?
>>
>> In current code, no. Every heap tuple has corresponding index tuples.
>>
>> In Bruce's proposal, yes. You would have heap tuples without index tuples
>> pointing directly at them. An index scan could only find them by following
>> t_ctid chains.
>>
>> Correct me if I understood you incorrectly, Bruce.
>
> Correct!  We use the same pointers used by normal UPDATEs, except we set
> a bit on the old tuple indicating it is a single-index tuple, and we
> don't create index entries for the new tuple.  Index scan routines will
> need to be taught about the new chains, but because only one tuple in
> the chain is visible to a single backend, the callers should not need to
> be modified.
>
> (All tuples in the chain have page item ids.  It is just that when they
> are freed, the pointers are adjusted so the index points to the chain
> head.)
>
> One problem is that once you find the row you want to update, it is
> difficult to see if it is part of a single-index chain because there are
> only forward pointers, so I think we have to scan the entire page to
> find the chains.  To reduce that overhead, I am thinking we free the
> non-visible tuples only when the page has no more free space.  This
> allows us to free not just our own non-visible tuples, but perhaps
> others as well.

This sort of incorporates the "vacuum row" I suggested.

>
> We have never been able to free non-visible tuples before because of
> index cleanup overhead, but with single-index chains, we can, and reduce
> the requirements of vacuum for many workloads.
>

This is great!



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Martijn van Oosterhout
On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> Correct!  We use the same pointers used by normal UPDATEs, except we set
> a bit on the old tuple indicating it is a single-index tuple, and we
> don't create index entries for the new tuple.  Index scan routines will
> need to be taught about the new chains, but because only one tuple in
> the chain is visible to a single backend, the callers should not need to
> be modified.

I suppose we would also change the index_getmulti() function to return
a set of ctids plus flags so the caller knows to follow the chains,
right? And for bitmap index scans you would only remember the page in
the case of such a tuple, since you can't be sure the exact ctid you've
got is the one you want.

Seems doable.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
From each according to his ability. To each according to his ability to
litigate.




Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Mark Woodward
> On Friday, 2006-06-23 at 17:27, Bruce Momjian wrote:
>> Jonah H. Harris wrote:
>> > On 6/23/06, Tom Lane <[EMAIL PROTECTED]> wrote:
>> > > What I see in this discussion is a huge amount of "the grass must be
>> > > greener on the other side" syndrome, and hardly any recognition that
>> > > every technique has its downsides and complications.
>> >
>> > I'm being totally objective.  I don't think we should abandon
>> > PostgreSQL's overall design at all, because we do perform INSERTs and
>> > DELETEs much better than most systems.  However, I've looked at many
>> > systems and how they implement UPDATE so that it is a scalable
>> > operation.  Sure, there are costs and benefits to each implementation,
>> > but I think we have some pretty brilliant people in this community and
>> > can come up with an elegant design for scalable UPDATEs.
>>
>> I think the UPDATE case is similar to the bitmap index scan or perhaps
>> bitmap indexes on disk --- there are cases we know can not be handled
>> well by our existing code, so we have added (or might add) these
>> features to try to address those difficult cases.
>
> Not really. Bitmap index scan and bitmap index are both new additions
> working well with existing framework.
>
> While the problem of slowdown on frequent updates is real, the suggested
> fix is just plain wrong, as it is based on someones faulty assumption on
> how index lookup works, and very much simplified view of how different
> parts of the system work to implement MVCC.

Yes, the suggestion was based on MVCC concepts, not a particular
implementation.
>
> The original fix he "suggests" was to that imagined behaviour and thus
> ignored all the real problems of such change.

The original suggestion was nothing more than a hypothetical for the
purpose of discussion.

The problem was the steady degradation of performance on frequent updates.
That was the point of discussion.  I brought up "one possible way" to
start a "brain storm."  The discussion then morphed into criticizing the
example rather than addressing the problem.

Anyway, I think some decent discussion about the problem did happen, and
that is good.





Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Heikki Linnakangas wrote:
> On Mon, 26 Jun 2006, Jan Wieck wrote:
> 
> > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> >> When you are using the update chaining, you can't mark that index row as
> >> dead because it actually points to more than one row on the page, some
> >> are non-visible, some are visible.
> >
> > Back up the truck ... you mean in the current code base we have heap tuples 
> > that are visible in index scans because of heap tuple chaining but without 
> > index tuples pointing directly at them?
> 
> In current code, no. Every heap tuple has corresponding index tuples.
> 
> In Bruce's proposal, yes. You would have heap tuples without index tuples 
> pointing directly at them. An index scan could only find them by following 
> t_ctid chains.
> 
> Correct me if I understood you incorrectly, Bruce.

Correct!  We use the same pointers used by normal UPDATEs, except we set
a bit on the old tuple indicating it is a single-index tuple, and we
don't create index entries for the new tuple.  Index scan routines will
need to be taught about the new chains, but because only one tuple in
the chain is visible to a single backend, the callers should not need to
be modified.

(All tuples in the chain have page item ids.  It is just that when they
are freed, the pointers are adjusted so the index points to the chain
head.)

One problem is that once you find the row you want to update, it is
difficult to see if it is part of a single-index chain because there are
only forward pointers, so I think we have to scan the entire page to
find the chains.  To reduce that overhead, I am thinking we free the
non-visible tuples only when the page has no more free space.  This
allows us to free not just our own non-visible tuples, but perhaps
others as well.

We have never been able to free non-visible tuples before because of
index cleanup overhead, but with single-index chains, we can, and reduce
the requirements of vacuum for many workloads.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Zeugswetter Andreas DCP SD wrote:
> 
> > > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > > > When you are using the update chaining, you can't mark that index row
> > > > as dead because it actually points to more than one row on the page,
> > > > some are non-visible, some are visible.
> > > 
> > > Back up the truck ... you mean in the current code base we have heap
> > > tuples that are visible in index scans because of heap tuple chaining
> > > but without index tuples pointing directly at them?
> > 
> > I don't know where this idea came from, but it's not true.  
> > All heap tuples, dead or otherwise, have index entries.  
> 
> When using CITC you would be "reusing" the index tuples from the current
> heap tuple, so you can only reuse free space or a dead member of a CITC
> chain.  You cannot reuse a dead tuple that is not a member of a CITC
> chain, because that has separate (invalid) index tuples pointing at it.
> 
> Part of the trick was moving slots (==ctid) around, so I still do not
> really see how you can represent the CITC chain as part of the update
> chain.  Unless you intend to break dead parts of the update chain?
> Maybe that is ok?

Yes, you have to remove dead (non-visible) parts of the update chain.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] libpq Describe Extension [WAS: Bytea and perl]

2006-06-26 Thread Martijn van Oosterhout
On Sat, Jun 24, 2006 at 02:45:33PM +0300, Volkan YAZICI wrote:
> I totally agree about the ugly style that is followed. But IMHO the
> recursive parsing (that is followed in pqParseInputN()) of received data
> is the main problem behind this. I think it will even get harder every
> time somebody tries to add another type of message parsing capability to
> that loop.  For instance, isn't the pollution of PGQueryClass with state
> variables (like PGQUERY_PREPARE or PGQUERY_DESCRIBE) one proof of this?

What's the alternative? The pqParseInputN() functions work using state
machines, but they're not recursive. You're trying to parse messages where
you don't know beforehand if you have enough data. Moreover, each message
could be quite large, and you don't want to have to store all of it
without parsing what you can. You're also not allowed to wait for more
data to appear.

However, it seems to me you could simplify quite a bit of coding by
adding a pqHaveNBytes function that returns true if there are that many
bytes available. Then right after you know the number of attributes,
you can do a pqHaveNBytes(4*nattr) and skip the checking within the
loop.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
From each according to his ability. To each according to his ability to
litigate.




Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Zeugswetter Andreas DCP SD

> > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > > When you are using the update chaining, you can't mark that index row
> > > as dead because it actually points to more than one row on the page,
> > > some are non-visible, some are visible.
> > 
> > Back up the truck ... you mean in the current code base we have heap
> > tuples that are visible in index scans because of heap tuple chaining
> > but without index tuples pointing directly at them?
> 
> I don't know where this idea came from, but it's not true.
> All heap tuples, dead or otherwise, have index entries.

When using CITC you would be "reusing" the index tuples from the current
heap tuple, so you can only reuse free space or a dead member of a CITC
chain.  You cannot reuse a dead tuple that is not a member of a CITC
chain, because that has separate (invalid) index tuples pointing at it.

Part of the trick was moving slots (==ctid) around, so I still do not
really see how you can represent the CITC chain as part of the update
chain.  Unless you intend to break dead parts of the update chain?
Maybe that is ok?

Andreas



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Alvaro Herrera wrote:
> Jan Wieck wrote:
> > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > >When you are using the update chaining, you can't mark that index row as
> > >dead because it actually points to more than one row on the page, some
> > >are non-visible, some are visible.
> > 
> > Back up the truck ... you mean in the current code base we have heap 
> > tuples that are visible in index scans because of heap tuple chaining 
> > but without index tuples pointing directly at them?
> 
> I don't know where this idea came from, but it's not true.  All heap
> tuples, dead or otherwise, have index entries.  Unless the idea is to
> extend update chaining to mean something different from the current
> meaning.

It does mean something different.  Single-Index-Tuple Chain (CITC) is a
special type of update chaining where the updates are all on the same
row, and a single index entry points to the entire chain.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian
Jan Wieck wrote:
> On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > When you are using the update chaining, you can't mark that index row as
> > dead because it actually points to more than one row on the page, some
> > are non-visible, some are visible.
> 
> Back up the truck ... you mean in the current code base we have heap 
> tuples that are visible in index scans because of heap tuple chaining 
> but without index tuples pointing directly at them?

No, this would be new code added.  The basic idea is that with the new
same-page update chaining, a single index entry points to the head of a
chain, not to a single tuple, so you can't mark that entry as dead if any
of the tuples in the chain are visible.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum row?

2006-06-26 Thread Csaba Nagy
On Sun, 2006-06-25 at 05:29, Alvaro Herrera wrote:
> Mark Woodward wrote:
> > I originally suggested a methodology for preserving MVCC, and everyone is
> > confusing it with update "in place"; this is not what I intended.
> 
> It doesn't make sense, but maybe vacuuming a page would.  Naturally, it
> would need to wholly scan all the indexes to clean'em up, so it's
> probably not a good idea in general.

But a version of vacuum which does normal index scans when vacuuming
just a small percentage of a huge table would make sense, wouldn't it? So
you don't need to make full scans of the vacuumed indexes, but look up
the entries based on the vacuumed key.

There were discussions about this, I think, and the objection was that an
index scan might miss index entries, in particular when badly behaved
user-defined functions are involved.

Cheers,
Csaba.





Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Heikki Linnakangas

On Mon, 26 Jun 2006, Jan Wieck wrote:

> On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > When you are using the update chaining, you can't mark that index row as
> > dead because it actually points to more than one row on the page, some
> > are non-visible, some are visible.
> 
> Back up the truck ... you mean in the current code base we have heap tuples
> that are visible in index scans because of heap tuple chaining but without
> index tuples pointing directly at them?

In current code, no. Every heap tuple has corresponding index tuples.

In Bruce's proposal, yes. You would have heap tuples without index tuples
pointing directly at them. An index scan could only find them by following
t_ctid chains.

Correct me if I understood you incorrectly, Bruce.

- Heikki
