Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-25 Thread Zeugswetter Andreas ADI SD

> > 1) To deal with partial/inconsisitent write to the data file at
crash 
> > recovery, we need full page writes at the first modification to
pages
> > after each checkpoint.   It consumes much of WAL space.
> 
> We need to find a way around this someday.  Other DBs don't 
> do this; it may be becuase they're less durable, or because 
> they fixed the problem.

They eighter can only detect a failure later (this may be a very long
time depending on access and verify runs) or they also write page
images. Those that write page images usually write "before images" to a
different area that is cleared periodically (e.g. during checkpoint).

Writing to a different area was considered in pg, but there were more
negative issues than positive.
So imho pg_compresslog is the correct path forward. The current
discussion is only about whether we want a more complex pg_compresslog
and no change to current WAL, or an increased WAL size for a less
complex implementation.
Both would be able to compress the WAL to the same "archive log" size.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-24 Thread Zeugswetter Andreas ADI SD

> 3) To maintain crash recovery chance and reduce the amount of 
> archive log, removal of  unnecessary full page writes from 
> archive logs is a good choice.

Definitely, yes. pg_compresslog could even move the full pages written
during backup out of WAL and put them in a different file that needs to
be applied before replay of the corresponding WAL after a physical
restore. This would further help reduce log shipping volume.

> To do this, we need both logical log and full page writes in WAL.

This is only true in the sense, that it allows a less complex
implementation of pg_compresslog.

Basically a WAL record consists of info about what happened and
currently eighter per tuple new data or a full page image. The info of
"what happened" together with the full page image is sufficient to
reconstruct the "per tuple new data". There might be a few WAL record
types (e.g. in btree split ?) where this is not so, but we could eighter
fix those or not compress those.

This is why I don't like Josh's suggested name of wal_compressable
eighter.
WAL is compressable eighter way, only pg_compresslog would need to be
more complex if you don't turn off the full page optimization. I think a
good name would tell that you are turning off an optimization.
(thus my wal_fullpage_optimization on/off)

Andreas


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-23 Thread Zeugswetter Andreas ADI SD

> I don't insist the name and the default of the GUC parameter. 
>  I'm afraid wal_fullpage_optimization = on (default) makes 
> some confusion because the default behavior becomes a bit 
> different on WAL itself.

Seems my wal_fullpage_optimization is not a good name if it caused
misinterpretation already :-(

> >> Amount of WAL after 60min. run of DBT-2 benchmark 
> >> wal_add_optimization_info = off (default) 3.13GB
> > 
> > how about wal_fullpage_optimization = on (default)

The meaning of wal_fullpage_optimization = on (default)
would be the same as your wal_add_optimization_info = off (default).
(Reversed name, reversed meaning of the boolean value)

It would be there to *turn off* the (default) WAL full_page
optimization.
For your pg_compresslog it would need to be set to off. 
"add_optimization_info" sounded like added info about/for some
optimization
which it is not. We turn off an optimization with the flag for the
benefit
of an easier pg_compresslog implementation.

As already said I would decouple this setting from the part that sets
the "removeable full page" flag in WAL, and making the recovery able to
skip dummy records. This I would do unconditionally.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-20 Thread Zeugswetter Andreas ADI SD

> With DBT-2 benchmark, I've already compared the amount of WAL.   The 
> result was as follows:
> 
> Amount of WAL after 60min. run of DBT-2 benchmark 
> wal_add_optimization_info = off (default) 3.13GB

how about wal_fullpage_optimization = on (default)
 
> wal_add_optimization_info = on (new case) 3.17GB -> can be 
> optimized to 0.31GB by pg_compresslog.
> 
> So the difference will be around a couple of percents.   I think this
is 
> very good figure.
> 
> For information,
> DB Size: 12.35GB (120WH)
> Checkpoint timeout: 60min.  Checkpoint occured only once in the run.

Unfortunately I think DBT-2 is not a good benchmark to test the disabled
wal optimization.
The test should contain some larger rows (maybe some updates on large
toasted values), and maybe more frequent checkpoints. Actually the poor
ratio between full pages and normal WAL content in this benchmark is
strange to begin with.
Tom fixed a bug recently, and it would be nice to see the new ratio. 

Have you read Tom's comment on not really having to be able to
reconstruct all record types from the full page image ? I think that
sounded very promising (e.g. start out with only heap insert/update). 

Then:
- we would not need the wal optimization switch (the full page flag
would always be added depending only on backup)
- pg_compresslog would only remove such "full page" images where it
knows how to reconstruct a "normal" WAL record from
- with time and effort pg_compresslog would be able to compress [nearly]
all record types's full images (no change in backend)

> I don't think replacing LSN works fine.  For full recovery to 
> the current time, we need both archive log and WAL.  
> Replacing LSN will make archive log LSN inconsistent with 
> WAL's LSN and the recovery will not work.

WAL recovery would have had to be modified (decouple LSN from WAL
position during recovery).
An "archive log" would have been a valid WAL (with appropriate LSN
advance records). 
 
> Reconstruction to regular WAL is proposed as 
> pg_decompresslog.  We should be careful enough not to make 
> redo routines confused with the dummy full page writes, as 
> Simon suggested.  So far, it works fine.

Yes, Tom didn't like "LSN replacing" eighter. I withdraw my concern
regarding pg_decompresslog.

Your work in this area is extremely valuable and I hope my comments are
not discouraging.

Thank you
Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-13 Thread Zeugswetter Andreas ADI SD

> > Yup, this is a good summary.
> > 
> > You say you need to remove the optimization that avoids the logging
of 
> > a new tuple because the full page image exists.
> > I think we must already have the info in WAL which tuple inside the 
> > full page image is new (the one for which we avoided the WAL entry 
> > for).
> > 
> > How about this:
> > Leave current WAL as it is and only add the not removeable flag to 
> > full pages.
> > pg_compresslog then replaces the full page image with a record for
the 
> > one tuple that is changed.
> > I tend to think it is not worth the increased complexity only to
save 
> > bytes in the uncompressed WAL though.
> 
> It is essentially what my patch proposes.  My patch includes 
> flag to full page writes which "can be" removed.

Ok, a flag that marks full page images that can be removed is perfect.

But you also turn off the optimization that avoids writing regular
WAL records when the info is already contained in a full-page image
(increasing the
uncompressed size of WAL).
It was that part I questioned. As already stated, maybe I should not
have because
it would be too complex to reconstruct a regular WAL record from the
full-page image.  
But that code would also be needed for WAL based partial replication, so
if it where too
complicated we would eventually want a switch to turn off the
optimization anyway
(at least for heap page changes).

> > Another point about pg_decompresslog:
> > 
> > Why do you need a pg_decompresslog ? Imho pg_compresslog should 
> > already do the replacing of the full_page with the dummy entry. Then

> > pg_decompresslog could be a simple gunzip, or whatever compression
was 
> > used, but no logic.
> 
> Just removing full page writes does not work.   If we shift the rest
of 
> the WAL, then LSN becomes inconsistent in compressed archive logs
which 
> pg_compresslog produces.   For recovery, we have to restore LSN as the

> original WAL.   Pg_decompresslog restores removed full page writes as
a 
> dumm records so that recovery redo functions won't be confused.

Ah sorry, I needed some pgsql/src/backend/access/transam/README reading.

LSN is the physical position of records in WAL. Thus your dummy record
size is equal to what you cut out of the original record.
What about disconnecting WAL LSN from physical WAL record position
during replay ?
Add simple short WAL records in pg_compresslog like: advance LSN by 8192
bytes.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-12 Thread Zeugswetter Andreas ADI SD

> I don't fully understand what "transaction log" means.   If it means 
> "archived WAL", the current (8.2) code handle WAL as follows:

Probably we can define "transaction log" to be the part of WAL that is
not
full pages.

> 1) If full_page_writes=off, then no full page writes will be 
> written to WAL, except for those during onlie backup (between 
> pg_start_backup and 
> pg_stop_backup).   The WAL size will be considerably small 
> but it cannot 
> recover from partial/inconsistent write to the database 
> files.  We have to go back to the online backup and apply all 
> the archive log.
> 
> 2) If full_page_writes=on, then full page writes will be 
> written at the first update of a page after each checkpoint, 
> plus full page writes at
> 1).   Because we have no means (in 8.2) to optimize the WAL 
> so far, what 
> we can do is to copy WAL or gzip it at archive time.
> 
> If we'd like to keep good chance of recovery after the crash, 
> 8.2 provides only the method 2), leaving archive log size 
> considerably large.  My proposal maintains the chance of 
> crash recovery the same as in the case of full_page_writes=on 
> and reduces the size of archived log as in the case of 
> full_page_writes=off.

Yup, this is a good summary.

You say you need to remove the optimization that avoids 
the logging of a new tuple because the full page image exists.
I think we must already have the info in WAL which tuple inside the full
page image
is new (the one for which we avoided the WAL entry for).

How about this:
Leave current WAL as it is and only add the not removeable flag to full
pages.
pg_compresslog then replaces the full page image with a record for the
one tuple that is changed.
I tend to think it is not worth the increased complexity only to save
bytes in the uncompressed WAL though.

Another point about pg_decompresslog:

Why do you need a pg_decompresslog ? Imho pg_compresslog should already
do the replacing of the
full_page with the dummy entry. Then pg_decompresslog could be a simple
gunzip, or whatever compression was used,
but no logic.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-03-30 Thread Zeugswetter Andreas ADI SD

> Without a switch, because both full page writes and 
> corresponding logical log is included in WAL, this will 
> increase WAL size slightly 
> (maybe about five percent or so).   If everybody is happy 
> with this, we 
> don't need a switch.

Sorry, I still don't understand that. What is the "corresponding logical
log" ?
It seems to me, that a full page WAL record has enough info to produce a

dummy LSN WAL entry. So insead of just cutting the full page wal record
you 
could replace it with a LSN WAL entry when archiving the log.

Then all that is needed is the one flag, no extra space ?

Andreas


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PATCHES] wal_checksum = on (default) | off

2007-01-05 Thread Zeugswetter Andreas ADI SD

> > Ok, so when you need CRC's on a replicate (but not on the master)
you
> > turn it
> > off during standby replay, but turn it on when you start the
replicate
> > for normal operation.
> 
> Thought: even when it's off, the CRC had better be computed for
> shutdown-checkpoint records.  Else there's no way to turn it on even
> with a postmaster restart --- unless we accept the idea of poking a
hole
> in the normal mode.  (Which I still dislike, and even more so if the
> special value is zero.  Almost any other value would be safer than
zero.)
> 
> On the whole, though, I still don't want to put this in.  I don't
think
> Simon has thought it through sufficiently, 

Well, the part that we do not really want a special value (at least not
0)
is new, and makes things a bit more complicated.

> and we haven't even seen any demonstration of a big speedup.

Yes, iirc the demonstration was with the 64 bit crc instead of the
sufficient 
32-bit (or a bad crc compiler optimization?).
But I do think it can be shown to provide significant speedup
(at least peak burst performance).

Especially on target hardware WAL write IO is extremely fast 
(since it is write cached), so the CPU should show.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES] wal_checksum = on (default) | off

2007-01-05 Thread Zeugswetter Andreas ADI SD

> > > > What's the use-case for changing the variable on the fly anyway?
Seems a
> > better
> > > > solution is just to lock down the setting at postmaster start.
> > 
> > I guess that the use case is more for a WAL based replicate, that 
> > has/wants a different setting. Maybe we want a WAL entry for the
change,
> > or force a log switch (so you can interrupt the replicate, change
it's
> > setting
> > and proceed with the next log) ?
> > 
> > Maybe a 3rd mode for replicates that ignores 0 CRC's ?
> 
> Well, wal_checksum allows you to have this turned ON for the main
server
> and OFF on a Warm Standby.

Ok, so when you need CRC's on a replicate (but not on the master) you
turn it
off during standby replay, but turn it on when you start the replicate
for normal operation.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES] wal_checksum = on (default) | off

2007-01-05 Thread Zeugswetter Andreas ADI SD

> > >>> Recovery can occur with/without same setting of wal_checksum, to
avoid
> > >>> complications from crashes immediately after turning GUC on.
> > >> 
> > >> Surely not.  Otherwise even the "on" setting is not really a
defense.
> > 
> > > Only when the CRC is exactly zero, which happens very very rarely.
> > 
> > "It works most of the time" doesn't exactly satisfy me.  What's the

Agreed

> > use-case for changing the variable on the fly anyway?  Seems a
better
> > solution is just to lock down the setting at postmaster start.

I guess that the use case is more for a WAL based replicate, that 
has/wants a different setting. Maybe we want a WAL entry for the change,
or force a log switch (so you can interrupt the replicate, change it's
setting
and proceed with the next log) ?

Maybe a 3rd mode for replicates that ignores 0 CRC's ?

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate