Re: [PATCHES] [HACKERS] Load distributed checkpoint

2007-02-27 Thread ITAGAKI Takahiro

"Inaam Rana" <[EMAIL PROTECTED]> wrote:

> One of the issues we had during testing with the original patch was that db stop
> did not work properly. I think you coded something to do a stop checkpoint
> immediately, but if a checkpoint is already in progress at that time, it
> would take its own time to complete.
> Does this patch resolve that issue?

Yes, I fixed the problem. If a checkpoint requested by user SQL or shutdown is
waiting during an automatic checkpoint, the running checkpoint completes
without any delays. In the worst case, you have to wait for two checkpoints
(the running automatic checkpoint and your explicit request), but no more
than that.
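
A minimal sketch of that flag check (all names invented; not the actual
patch code):

    /* inside the checkpoint's buffer-write loop */
    for (i = 0; i < num_to_write; i++)
    {
        SyncOneDirtyBuffer(i);

        /*
         * Nap between writes only while nobody is waiting; a pending
         * shutdown or explicit CHECKPOINT cancels the remaining delays.
         */
        if (!CheckpointRequestPending())
            pg_usleep(checkpoint_write_delay_ms * 1000L);
    }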

> Also, is it based on pg82stable or HEAD?

It's based on HEAD.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Dead Space Map version 2

2007-02-27 Thread Simon Riggs
On Tue, 2007-02-27 at 12:05 +0900, ITAGAKI Takahiro wrote:

> If we combine this with the HOT patch, pages with HOT tuples are probably
> marked as UNFROZEN because we don't bother vacuuming HOT tuples. They can
> be removed incrementally and don't require explicit vacuums.

Perhaps avoid DSM entries for HOT updates completely?

> VACUUM commands
> ---
> 
> VACUUM now only scans the pages that possibly have dead tuples.
> VACUUM ALL, a new syntax, behaves the same as before.
> 
> - VACUUM FULL : Not changed. Scans all pages and compresses them.
> - VACUUM ALL  : Scans all pages; behaves the same as the previous VACUUM.
> - VACUUM      : Usually scans only HIGH pages, but also LOW and UNFROZEN
> pages when vacuuming to prevent XID wraparound.

Sounds good.

> Performance issues
> --
> 
> * Enable/Disable DSM tracking per table
> DSM requires some additional work. If we know specific tables
> where DSM does not work well, e.g. heavily updated small tables, we can
> disable DSM for them. The syntax is:
>   ALTER TABLE name SET (dsm=true/false);

How about a dsm_tracking_limit GUC? (Better name please)
The number of pages in a table before we start tracking DSM entries for
it. DSM only gives worthwhile benefits for larger tables anyway, so let
the user define what large means for them.
dsm_tracking_limit = 1000 by default.
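
A sketch of how such a limit might gate the bookkeeping (the GUC and the
helper name are invented):

    /* skip DSM tracking entirely for small tables */
    if (RelationGetNumberOfBlocks(rel) < dsm_tracking_limit)
        return;

    dsm_track_page(rel->rd_node, blkno);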

> * Dead Space State Cache
> The DSM management module is guarded by one LWLock, DeadSpaceLock.
> Almost all accesses to DSM require only a shared lock, but the frequency
> of shared locking was very high (tied with BufMappingLock) in my research.
> To avoid the lock contention, I added a cache of dead space state in the
> BufferDesc flags. Backends check the flags first, and avoid locking when
> it is not needed.

ISTM there should be a point at which the DSM is so full that we don't bother
to keep track any longer, so we can drop that information. For example, if a
user runs UPDATE without a WHERE clause, there's no point in tracking the
whole relation.

> Memory management
> -
> 
> In the current implementation, DSM allocates a bunch of memory at startup
> and we cannot resize it while running. That is probably enough, because DSM
> consumes very little memory -- 32MB per 1TB of database.

That sounds fine.
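
For reference, the quoted figure works out if the DSM stores two bits per
heap page, assuming the default 8KB block size:

    1TB / 8KB per page          = 128M pages
    128M pages * 2 bits/page    = 256Mbit = 32MB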

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Dead Space Map version 2

2007-02-27 Thread Simon Riggs
On Tue, 2007-02-27 at 00:55 -0500, Tom Lane wrote:
> "Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> > Yes, DSM would make FSM recovery more important, but I thought it was
> > recoverable now? Or is that only on a clean shutdown?
> 
> Currently we throw away FSM during any non-clean restart.  This is
> probably overkill but I'm quite unclear what would be a safe
> alternative.
> 
> > I suspect we don't need perfect recoverability...
> 
> The main problem with the levels proposed by Takahiro-san is that any
> transition from FROZEN to not-FROZEN *must* be exactly recovered,
> because vacuum will never visit an allegedly frozen page at all.  This
> appears to require WAL-logging DSM state changes, which is a pretty
> serious performance hit.  I'd be happier if the DSM content could be
> treated as just a hint.  I think that means not trusting it for whether
> a page is frozen to the extent of not needing vacuum even for
> wraparound.  

Agreed.

> So I'm inclined to propose that there be only two states
> (hence only one DSM bit per page): page needs vacuum for space recovery,
> or not.  Vacuum for XID wraparound would have to hit every page
> regardless.

I'm inclined to think that this close to the deadline it would be more robust
to go with the simpler option. So, agreed on one bit per page.

We can revisit the 2 bits/page idea easily for later releases. If the
DSM is non-transactional, upgrading to a new format in the future should
be very easy.
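
A minimal sketch of the one-bit-per-page hint map (identifiers invented):
because it is only a hint, it needs no WAL logging, and losing it in a
crash merely makes the next vacuum revisit some clean pages.

    /* true if the page may contain dead tuples and is worth visiting */
    static inline bool
    dsm_needs_vacuum(const uint8 *map, BlockNumber blkno)
    {
        return (map[blkno / 8] & (1 << (blkno % 8))) != 0;
    }

    /* set when a tuple on the page is deleted or updated */
    static inline void
    dsm_mark_dirty(uint8 *map, BlockNumber blkno)
    {
        map[blkno / 8] |= (uint8) (1 << (blkno % 8));
    }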

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Dead Space Map version 2

2007-02-27 Thread ITAGAKI Takahiro
"Jim C. Nasby" <[EMAIL PROTECTED]> wrote:

> > If we do UPDATE a tuple, the original page containing the tuple is marked
> > as HIGH and the new page where the updated tuple is placed is marked as LOW.
> 
> Don't you mean UNFROZEN?

No, pages with new tuples are marked as LOW. I intend to use UNFROZEN and
FROZEN pages to mean "all tuples in the page are visible to all transactions"
for index-only scans in the future.
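
For illustration, the four page states under discussion might be declared
like this (identifiers invented): an UPDATE would mark the old tuple's page
HIGH and the page receiving the new version LOW.

    typedef enum DSMPageState
    {
        DSM_FROZEN,     /* all tuples visible to everyone and frozen  */
        DSM_UNFROZEN,   /* all tuples visible, but not yet frozen     */
        DSM_LOW,        /* few dead tuples                            */
        DSM_HIGH        /* many dead tuples; vacuum should visit      */
    } DSMPageState;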


> What makes it more important to mark the original page as HIGH instead
> of LOW, like the page with the new tuple? The description of the states
> indicates that there would likely be a lot more dead tuples in a HIGH
> page than in a LOW page.
> 
> Perhaps it would be better to have the bgwriter take a look at how many
> dead tuples (or how much space the dead tuples account for) when it
> writes a page out and adjust the DSM at that time.

Yeah, I feel it is worth optimizing, too. One question is: how do we treat
dirty pages written by backends rather than by the bgwriter? If we want to
add some work to the bgwriter, do we also need to make the bgwriter write
almost all of the dirty pages?


> > * Agressive freezing
> > We will freeze tuples in dirty pages using OldestXmin but FreezeLimit.
> 
> Do you mean using OldestXmin instead of FreezeLimit?

Yes, we will use OldestXmin as the threshold to freeze tuples in
dirty pages or pages that have some dead tuples. Otherwise, many UNFROZEN
pages will still remain after vacuum, and they will cost us in the next
vacuum to prevent XID wraparound.


> > I'm thinking of changing them into 2 new parameters. We will allocate memory
> > for DSM that can hold all of estimated_database_size, and for FSM 50% or
> > so of that size. Is this reasonable?
>  
> I don't think so, at least not until we get data from the field about
> what's typical. If the DSM is tracking every page in the cluster then
> I'd expect the FSM to be closer to 10% or 20% of that, anyway.

I'd like to give some kind of logical flavor to max_fsm_pages
and max_dsm_pages. For DSM, max_dsm_pages should represent the
whole database size. On the other hand, what meaning does
max_fsm_pages have? (estimated_updatable_size?)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Dead Space Map version 2

2007-02-27 Thread ITAGAKI Takahiro

Tom Lane <[EMAIL PROTECTED]> wrote:

> Vacuum for XID wraparound would have to hit every page regardless.

There is one problem at this point. If we want to guarantee that there
are no tuples whose XIDs are older than pg_class.relfrozenxid, we must scan
all pages for XID wraparound on every vacuum. So I used two thresholds
for treating XIDs, commented as follows. Do you have better ideas
on this point?

/*
 * We use vacuum_freeze_min_age to determine whether a freeze scan is
 * needed, but half of vacuum_freeze_min_age for the actual freeze limit,
 * so that scans to prevent XID wraparound don't occur too frequently.
 */
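
A sketch of how the two thresholds might be derived (variable names
invented, and XID wraparound arithmetic is glossed over for brevity):

    TransactionId curXid = ReadNewTransactionId();

    /* the user-visible setting decides whether a freeze scan is needed... */
    bool need_freeze_scan =
        TransactionIdPrecedes(relfrozenxid,
                              curXid - vacuum_freeze_min_age);

    /* ...but the actual cutoff uses half the age, freezing more
     * aggressively so the next full scan is pushed further out */
    TransactionId freezeLimit = curXid - vacuum_freeze_min_age / 2;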


Also, normal vacuums use the DSM and freeze-vacuums do not, so vacuums
sometimes take longer than usual. Won't that surprise bother users?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Dave Page
Tom Lane wrote:
> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> Well, here's a question. Given the recent discussion re full 
>> disjunction, I'd like to know what sort of commitment we are going to 
>> give people who work on proposed projects.
> 
> Um, if you mean are we going to promise to accept a patch in advance of
> seeing it, the answer is certainly not.  Still, a SoC author can improve
> his chances in all the usual ways, primarily by getting discussion and
> rough consensus on a spec and then on an implementation sketch before
> he starts to do much code.  Lots of showstopper problems can be caught
> at that stage.

We cannot necessarily expect the students to work this way without
guidance if they are not familiar with our processes before they start.
The mentors should be there to guide not just with the technical aspects
of the project, but the procedural as well imho.

Regards, Dave



Re: [HACKERS] Dead Space Map version 2

2007-02-27 Thread ITAGAKI Takahiro

"Simon Riggs" <[EMAIL PROTECTED]> wrote:

> > If we combine this with the HOT patch, pages with HOT tuples are probably
> > marked as UNFROZEN because we don't bother vacuuming HOT tuples. They can
> > be removed incrementally and doesn't require explicit vacuums.
> 
> Perhaps avoid DSM entries for HOT updates completely?

Yes, if we employ 1 bit/page (worth vacuuming or not).
But no if we employ 2 bits/page, because HOT updates change page states to UNFROZEN.


> > * Enable/Disable DSM tracking per tables
> 
> How about a dsm_tracking_limit GUC? (Better name please)
> The number of pages in a table before we start tracking DSM entries for
> it. DSM only gives worthwhile benefits for larger tables anyway, so let
> the user define what large means for them.
> dsm_tracking_limit = 1000 by default.

Sounds good. How about small_table_size = 8MB for the variable?
I found that we already have a value used as the truncating
threshold for vacuum (REL_TRUNCATE_MINIMUM = 1000 in vacuumlazy.c).
I think they have the same purpose in the treatment of small tables,
and we can use the same variable in both places.


> > * Dead Space State Cache
> 
> ISTM there should be a point at which DSM is so full we don't bother to
> keep track any longer, so we can drop that information. For example if
> user runs UPDATE without a WHERE clause, there's no point in tracking
> whole relation.

It's a bit difficult. We have to lock the DSM *before* we see whether
the table is tracked or not. So we need to cache the tracked state
in the relcache entry, but it requires some work to keep the cached
states and the shared states coherent.
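
A sketch of that caching idea (the rd_dsm_tracked field and the helper are
invented); the hard part is exactly the coherency issue mentioned above:

    /* consult a flag cached in the relcache entry before locking */
    if (rel->rd_dsm_tracked)
    {
        LWLockAcquire(DeadSpaceLock, LW_SHARED);
        dsm_record_dead_tuple(rel->rd_node, blkno);
        LWLockRelease(DeadSpaceLock);
    }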

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Magnus Hagander
On Tue, Feb 27, 2007 at 09:21:42AM +0000, Dave Page wrote:
> Tom Lane wrote:
> > Andrew Dunstan <[EMAIL PROTECTED]> writes:
> >> Well, here's a question. Given the recent discussion re full 
> >> disjunction, I'd like to know what sort of commitment we are going to 
> >> give people who work on proposed projects.
> > 
> > Um, if you mean are we going to promise to accept a patch in advance of
> > seeing it, the answer is certainly not.  Still, a SoC author can improve
> > his chances in all the usual ways, primarily by getting discussion and
> > rough consensus on a spec and then on an implementation sketch before
> > he starts to do much code.  Lots of showstopper problems can be caught
> > at that stage.
> 
> We cannot necessarily expect the students to work this way without
> guidance if they are not familiar with our processes before they start.
> The mentors should be there to guide not just with the technical aspects
> of the project, but the procedural as well imho.

IIRC, last time we had a pgsql-students (or similar) mailinglist for the
SoC people. That was closed. Perhaps that's a bit counterproductive - it's
better to get introduced to the "normal way of doing things" right away? 
With the help of the mentor, of course.

//Magnus



Re: [HACKERS] Dead Space Map version 2

2007-02-27 Thread Heikki Linnakangas

Tom Lane wrote:

> The main problem with the levels proposed by Takahiro-san is that any
> transition from FROZEN to not-FROZEN *must* be exactly recovered,
> because vacuum will never visit an allegedly frozen page at all.  This
> appears to require WAL-logging DSM state changes, which is a pretty
> serious performance hit.


I doubt it would be a big performance hit. AFAICS, all the information 
needed to recover the DSM is already written to WAL, so it wouldn't need 
any new WAL records.



> I'd be happier if the DSM content could be
> treated as just a hint.  I think that means not trusting it for whether
> a page is frozen to the extent of not needing vacuum even for
> wraparound.  So I'm inclined to propose that there be only two states
> (hence only one DSM bit per page): page needs vacuum for space recovery,
> or not.  Vacuum for XID wraparound would have to hit every page
> regardless.


If we don't have a frozen state, we can't use the DSM to implement 
index-only scans. Index-only scans will obviously require a lot more 
work than just the DSM, but I'd like to have a solution that enables it 
in the future.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Dave Page
Magnus Hagander wrote:
> On Tue, Feb 27, 2007 at 09:21:42AM +0000, Dave Page wrote:
>> Tom Lane wrote:
>>> Andrew Dunstan <[EMAIL PROTECTED]> writes:
 Well, here's a question. Given the recent discussion re full 
 disjunction, I'd like to know what sort of commitment we are going to 
 give people who work on proposed projects.
>>> Um, if you mean are we going to promise to accept a patch in advance of
>>> seeing it, the answer is certainly not.  Still, a SoC author can improve
>>> his chances in all the usual ways, primarily by getting discussion and
>>> rough consensus on a spec and then on an implementation sketch before
>>> he starts to do much code.  Lots of showstopper problems can be caught
>>> at that stage.
>> We cannot necessarily expect the students to work this way without
>> guidance if they are not familiar with our processes before they start.
>> The mentors should be there to guide not just with the technical aspects
>> of the project, but the procedural as well imho.
> 
> IIRC, last time we had a pgsql-students (or similar) mailinglist for the
> SoC people. That was closed. Perhaps that's a bit counterproductive - it's
> better to get introduced to the "normal way of doing things" right away? 
> With the help of the mentor, of course.

Yes. The other issue though is that initial project proposal scoring and
discussion is done on a private Google site by the mentors. I don't know
if we're allowed to make the proposals public before they get accepted
by Google, in case the students copy or improve each other's proposals.
From their (and Google's) point of view, their proposals are essentially
job applications.

Once they've been ranked and Google have approved the top-N projects, I
guess it's open season!

Regards, Dave



Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Magnus Hagander
On Tue, Feb 27, 2007 at 09:53:41AM +0000, Dave Page wrote:
> Magnus Hagander wrote:
> > On Tue, Feb 27, 2007 at 09:21:42AM +0000, Dave Page wrote:
> >> Tom Lane wrote:
> >>> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>  Well, here's a question. Given the recent discussion re full 
>  disjunction, I'd like to know what sort of commitment we are going to 
>  give people who work on proposed projects.
> >>> Um, if you mean are we going to promise to accept a patch in advance of
> >>> seeing it, the answer is certainly not.  Still, a SoC author can improve
> >>> his chances in all the usual ways, primarily by getting discussion and
> >>> rough consensus on a spec and then on an implementation sketch before
> >>> he starts to do much code.  Lots of showstopper problems can be caught
> >>> at that stage.
> >> We cannot necessarily expect the students to work this way without
> >> guidance if they are not familiar with our processes before they start.
> >> The mentors should be there to guide not just with the technical aspects
> >> of the project, but the procedural as well imho.
> > 
> > IIRC, last time we had a pgsql-students (or similar) mailinglist for the
> > SoC people. That was closed. Perhaps that's a bit counterproductive - it's
> > better to get introduced to the "normal way of doing things" right away? 
> > With the help of the mentor, of course.
> 
> Yes. The other issue though is that initial project proposal scoring and
> discussion is done on a private Google site by the mentors. I don't know
> if we're allowed to make the proposals public before they get accepted
> by Google, in case the students copy or improve each other's proposals.
> From their (and Google's) point of view their proposals are essentially
> job applications.

Right. We'll just have to live by Google's rules for that part; I'm
talking about the discussions later. Once things are approved, they
should all be handled on the standard mailing lists, IMHO.

Being able to make "possibly controversial" suggestions public
beforehand would certainly help, but may not be possible. But aren't we
supposed to pick mentors who will know enough to be able to make that
call themselves?

//Magnus



Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Dave Page
Magnus Hagander wrote:
> Right. We'll just have to live by Google's rules for that part; I'm
> talking about the discussions later. Once things are approved, they
> should all be handled on the standard mailing lists, IMHO.

Oh, 100% agreed.

> Being able to make "possibly controversial" suggestions public
> beforehand would certainly help, but may not be possible. But aren't we
> supposed to pick mentors who will know enough to be able to make that
> call themselves?

Yeah, in theory, but you know how wildly opinions can vary even amongst
the oldest and most familiar of community members.

Regards, Dave




Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Simon Riggs
On Mon, 2007-02-26 at 23:04 -0500, Tom Lane wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> writes:
> > COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> > threaten the consistency or robustness of other COMMIT modes. Read that
> > again and think about it, before we go further, please.
> 
> I read that, and thought about it, and don't think I believe it.  

I don't mind saying it's taken a while to believe it myself. And I very
much want to have the concept tested so we all believe it. Doubt=>risk.

> The
> problem is that there are more interconnections between different
> transactions than you're allowing for.  In particular you need to
> justify that the behavior is safe for non-transactional operations like
> btree page splits and pg_clog buffer page writes.  The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk.  "Database corrupt"
> is a situation that threatens all your transactions...

OK, I can see it's possible for the following WAL history to occur:

t1  btree split part1
t2  COMMIT;
<-- risk of inconsistent WAL
t1  btree split part2

The COMMIT by t2 flushes WAL, so if a crash happens at the point shown
we have an incomplete multi-part WAL operation. At this stage, let's
assume that no data writes by either t1 or t2 have made it to disk,
which is normal and likely.

On recovery, we apply all the changes in WAL up to the end of WAL. At
the last minute t2 sneaks in and gets committed. t1 was never committed,
so we never found out whether it would have been a COMMIT NOWAIT or a
COMMIT. *Whichever* it would have been, the btree split is still
incomplete, and recovery understands this and acts accordingly.

So the situation that sounds nasty is actually a normal situation
today, and presents no threat because we already handle it correctly.
Multi-part operations seem good to me.

> The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk.  "Database corrupt"
> is a situation that threatens all your transactions...

Well, I've spent 2 hours thinking this through, gone pale, thought hard
and gone down a few blind alleys along the way.

The bottom line is that COMMIT NOWAIT isn't fully safe, and that's part
of the explicit non-full guarantee, written in big letters on the tin.
If a transaction commits and then we crash before we flush WAL, then the
transaction will be incomplete. As we define COMMIT now, this is broken
and I use that word accurately: If you use COMMIT NOWAIT, you risk data
loss *but* you have the choice of which transactions this applies to.
However, WAL *is* consistent and there is no danger of database
corruption. Simply put, this is not group commit.

So if you issue COMMIT NOWAIT and then crash, the transactions that were
marked committed need to be marked aborted in clog. So during recovery
we will need to keep track of which transactions are in progress, so we
can mark them explicitly aborted, rather than relying on the current
implicit mechanism. This would need to be done carefully, since a clog
page that was created at transaction start may never have made it to
disk by the time of the crash. We must extend clog in that case, even if
the extension record never made it to WAL either, so that we can record
the aborted state of the Xids that exist somewhere on disk.
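
One way that recovery-time bookkeeping might look (the xid list is
invented; TransactionIdDidCommit, ExtendCLOG and TransactionIdAbort are
existing routines, though their use here is speculative):

    /* at the end of WAL replay */
    for (i = 0; i < num_replayed_xids; i++)
    {
        TransactionId xid = replayed_xids[i];

        if (!TransactionIdDidCommit(xid))
        {
            ExtendCLOG(xid);            /* the clog page may never have hit disk */
            TransactionIdAbort(xid);    /* mark it explicitly aborted */
        }
    }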

Keeping track of transactions in progress won't take long. It isn't
required at all during archive recovery, but that might form the basis
for later use as a Snapshot creation mechanism for read-only access
during startup.

It isn't possible for a normal non-readonly transaction to change data
that has been changed by a commit-nowait transaction, and then have the
normal transaction commit, yet without also implicitly committing the
commit-nowait transaction.

Hopefully, I've got that right?

> > New commit mode is available by explicit command, or as a default
> > setting that will be applied to all COMMITs, or both.
> 
> I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> standard).  If we did this I'd vote for control via a GUC setting only;
> I think that is more useful anyway, as an application can be made to run
> with such a setting without invasive source code changes.

OK.

Having read through all of the above things again, ISTM that we should
make this functionality available via a new GUC, commit_fsync_delay, which
must be set explicitly > 0 before this feature can be used at all. If I
confused Tom by using commit_delay, then I'll confuse others also; group
commit and deferred fsync are different techniques with different
robustness guarantees. When enabled, it should write a clear message in
the log to show that some commits might be using commit_nowait.


Re: [HACKERS] [PATCHES]

2007-02-27 Thread Heikki Linnakangas

John Bartlett wrote:

> The community may wish to comment on the following issue:
>
> 1) At present the file that will contain the list of ctids is going into
> a new directory called pg_ctids, analogous to pg_twophase, and also stored
> in the pg_data directory.


I don't understand this. What's stored in the file and why? If they're 
only needed within the transaction, surely a temp file would be more 
appropriate?


The new ctidListStore.c file in the patch is not in valid diff format.
I also noticed that you've moved the line beginning with "CREATE_ROLE"
in gram.y so that it's not in alphabetical order anymore.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Developer TODO List as a PostgreSQL DB

2007-02-27 Thread Chad Wagner

On 2/26/07, Josh Berkus wrote:
> > Just wondering after reading so many mails from Hackers List. (its 2.15AM
> > now!!) Is there anybody working on something to create a DB from
> > a) The TODO list http://www.postgresql.org/docs/faqs.TODO.html
> > b) The sourcecode of PostgreSQL
> > c) The relevant Mailings from the developers lists
>
> Of those, only the TODO makes sense as a DB.  You're welcome to go for it.



The TODO list is a bit outdated, and it would make sense to turn it into a
web page that assigns names to the tasks and notes which release each TODO
item will make it into.  Perhaps it would even be useful for folks to post
their WIP patches, status updates, etc. to such a page?

Any comments?  I am willing to chip in.  I can only imagine the current
workload for maintaining this is a bit overwhelming.


Re: [HACKERS] Compile libpq for pg8.2.3 with vc7

2007-02-27 Thread Magnus Hagander
On Mon, Feb 26, 2007 at 02:23:58PM -0500, Jeff McKenna wrote:
> Trying to compile 8.2.3 with VC 7.10.3077 (2003) on Win32, I get the 
> following error:
> 
> mypath\postgresql-8.2.3\src\include\c.h(88) : fatal
> error C1083: Cannot open include file: 'pg_config_os.h': No such file or 
> directory
> 
> 
> Is this a known issue?  (is there a patch available?)

How exactly did you try to build it? What command did you use?

//Magnus



[HACKERS] Packed short varlenas, what next?

2007-02-27 Thread Gregory Stark

I'm really curious to know how people feel about the varlena patch. In
particular I know these issues may elicit comment:

1) Do we really need a special case for little-endian machines? I think it
   would be trivial to add but having two code paths may be annoying to
   maintain. The flip side is it would make it easier to read varlena headers
   in gdb, which I found kind of annoying with them in network byte order
   (see the sketch after this list).

2) How do people feel about the way I inlined most of the VARATT_IS_SHORT
   cases in heaptuple.c. I tried at first to hide that all in the att_align
   and att_addlength macros but a) it would never be possible to hide most of
   it and b) it would require a few more redundant tests.

3) How do people feel about not allowing an escape hatch for new types and
   explicitly exempting int2vector and oidvector. The alternatives are either
   a) adding a new column to pg_type and pg_attribute and setting that on
   catalog attributes that are accessed through GETSTRUCT (as the first
   varlena in the table) and also setting it on oidvector and int2vector
   because they don't call pg_detoast_datum(). Or b) fixing int2vector and
   oidvector to pass through pg_detoast_datum and fix all the accesses to the
   first int2vector/oidvector in every catalog table to use fastgetattr
   instead. or c) keep things as they are now.

4) Should I start hitting the more heavily trod codepaths in text.c and
   numeric.c to avoid detoasting short varlenas? The macro api is not quite
   complete enough for this yet so it may make sense to tackle at least one
   code site before merging it to be sure we have a workable api for data
   types that want to avoid unnecessary detoasting.
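
As a purely hypothetical illustration of the header encoding in point 1 --
this is NOT the patch's actual bit layout -- a size reader for a one-byte
short header versus a big-endian 4-byte header might look like:

    static inline size_t
    varlena_total_size(const unsigned char *p)
    {
        if (p[0] & 0x80)                /* high bit set: 1-byte header */
            return p[0] & 0x7f;         /* total size fits in 7 bits   */

        /* otherwise a 4-byte header in network (big-endian) order,
         * decoded the same way on any machine */
        return ((size_t) p[0] << 24) | ((size_t) p[1] << 16) |
               ((size_t) p[2] << 8) | (size_t) p[3];
    }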

The latest patch is at 

 http://community.enterprisedb.com/varlena/patch-varvarlena-12.patch.gz

I've been doing some benchmarking: I see a 9.7% space saving on the
Benchmark-SQL 5.2 schema, which translates into about an 8% performance gain.
The DBT2 benchmarks show a smaller 5.3% space saving because we had already
done a lot more optimizing of that schema.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Developer TODO List as a PostgreSQL DB

2007-02-27 Thread Devrim GÜNDÜZ

On Tue, 2007-02-27 at 08:41 -0500, Chad Wagner wrote:

> The TODO list is a bit outdated, 

Really?

> and it would make sense to make it into a web page and assign names to
> the tasks and which releases the TODO task will make it in.  Perhaps
> it would even be useful for folks to post their WIP patches, status
> updates, etc to such a page? 

Please read the archives.
-- 
Devrim GÜNDÜZ
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, ODBCng - http://www.commandprompt.com/






Re: [HACKERS] Developer TODO List as a PostgreSQL DB

2007-02-27 Thread Andrew Dunstan

Chad Wagner wrote:
> On 2/26/07, Josh Berkus wrote:
> > > Just wondering after reading so many mails from Hackers List. (its 2.15 AM
> > > now!!) Is there anybody working on something to create a DB from
> > > a) The TODO list http://www.postgresql.org/docs/faqs.TODO.html
> > > b) The sourcecode of PostgreSQL
> > > c) The relevant Mailings from the developers lists
> >
> > Of those, only the TODO makes sense as a DB.  You're welcome to go
> > for it.



> The TODO list is a bit outdated, and it would make sense to make it
> into a web page and assign names to the tasks and which releases the
> TODO task will make it in.  Perhaps it would even be useful for folks
> to post their WIP patches, status updates, etc to such a page?
>
> Any comments?  I am willing to chip in.  I could only imagine the
> current workload for maintaining this is a bit overwhelming.


Before we rehash recent debates, please, everybody, review them. Going
over and over and over the same ground laboriously is really getting
tiresome, and unfortunately it's also getting more frequent. It's *déjà
vu* all over again.


And, for the record, I do not see the slightest point in putting the 
TODO list on its own into a database. None, zilch, nada. As database 
professionals we should be adept at recognising when the use of a 
database is appropriate and when it isn't. If we put it into a tracking 
system (presumably database backed) that would be another matter, but 
that has gone nowhere as usual.


There is some point in putting it in a wiki where we can gather relevant 
documents, links to discussions etc. That's why the developers' wiki was 
established, AIUI.


cheers

andrew



Re: [HACKERS] Expanding DELETE/UPDATE returning

2007-02-27 Thread Florian G. Pflug

David Fetter wrote:
> On Mon, Feb 26, 2007 at 11:14:01PM -0500, Tom Lane wrote:
> > Rusty Conover <[EMAIL PROTECTED]> writes:
> > > Or allow delete and update to be used in sub-queries?
> >
> > That's been discussed but the implementation effort seems far from
> > trivial.  One big problem is that a sub-query can normally be
> > re-executed multiple times, eg on the inner side of a join; whereas
> > that's clearly not acceptable for an insert/update/delete.
>
> What kinds of machinery would be needed in order for certain kinds of
> subqueries to get executed only once and have the results cached?
>
> INSERT/UPDATE/DELETE ... RETURNING wouldn't be the only possible uses
> of such machinery.  A data-changing function in a subquery could be
> another.  Maybe there could be some way to mark functions as "execute
> once per subquery."


Is "execute only once" even well-defined for subqueries? Take for example

select * from t1, (delete from t2 returning t2.t1_id) where t1.id = 
t2.t1_id ;


Will this delete all record from t2, or just those records for which
a matching record in t1 exists? In case you vote for "all records" 
above, now take


select * from t1, (delete from t2 returning t2.t1_id) where t1.id = 
t2.t1_id limit 1 ;


I for my part couldn't even say what I'd expect that query to do.
Do other databases support this?

greetings, Florian Pflug



Re: [HACKERS] Proposal for Implementing read-only queries during wal replay (SoC 2007)

2007-02-27 Thread Florian G. Pflug

Merlin Moncure wrote:

> getting back on topic (ahem), florian: are you comfortable going ahead
> with this?  is there anything you need help with?


I'm currently updating my proposal, trying to incorporate the points
people brought up in this thread.

I realized that trying to use the same kind of "read only mode" for
both a standalone postgres (with the datadir on read-only storage)
and a PITR slave was misguided -
http://archives.postgresql.org/pgsql-hackers/2005-11/msg01043.php
was really enlightening ;-)

Tom's objections regarding performance are the hardest ones to deal with -
I'm currently thinking about ways that the locking requirements could
be relaxed - though I still plan to do a basic implementation first, and
then try to improve it.


I'll post an updated proposal during the next few days.

greetings, Florian Pflug



Re: [HACKERS] Developer TODO List as a PostgreSQL DB

2007-02-27 Thread Chad Wagner

On 2/27/07, Andrew Dunstan <[EMAIL PROTECTED]> wrote:


> Before we rehash recent debates, please, everybody, review them. Going
> over and over and over the same ground laboriously is really getting
> tiresome, and unfortunately it's also getting more frequent. It's *déjà
> vu* all over again.



History does tend to repeat itself, doesn't it?  I just assumed it was never
discussed based on Josh's response.


> database is appropriate and when it isn't. If we put it into a tracking
> system (presumably database backed) that would be another matter, but
> that has gone nowhere as usual.



A tracking system is what I was referring to; if it didn't have a front-end
then it would be quite pointless to have it in a database.  Honestly, I
think Trac even does this, but I am far from familiar with that tool -- I
have just seen other projects with "roadmap" items which categorized TODO
tasks into future releases.


> There is some point in putting it in a wiki where we can gather relevant
> documents, links to discussions etc. That's why the developers' wiki was
> established, AIUI.



To be honest, it may be adequate to maintain this solely through the Wiki.
The only thing I see lacking is specifically who is handling tasks, and what
release each is planned for.  Perhaps I am just looking in the wrong place?


Re: [HACKERS] Packed short varlenas, what next?

2007-02-27 Thread Peter Eisentraut
Gregory Stark wrote:
> I'm really curious to know how people feel about the varlena patch.

As I have mentioned earlier, I'm missing a plan to allow 8-byte varlena
sizes.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] Developer TODO List as a PostgreSQL DB

2007-02-27 Thread Andrew Dunstan

Chad Wagner wrote:


> > There is some point in putting it in a wiki where we can gather relevant
> > documents, links to discussions etc. That's why the developers' wiki was
> > established, AIUI.
>
> To be honest, it may be adequate to maintain this solely through the
> Wiki.  The only thing I see lacking is specifically who is handling
> tasks, and what release it is planned for.  Perhaps I am just looking
> in the wrong place?




There's a good reason this information is missing: it doesn't exist. 
Please, like I said, have a look at recent discussions.


Maybe we need some extra FAQs, like:

. Why do you still use CVS instead of <insert favorite SCM here>?
. Why don't you use bug tracking software?
. Where do I find out when <feature X> will be in a release?


cheers

andrew




Re: [HACKERS] Developer TODO List as a PostgreSQL DB

2007-02-27 Thread Lukas Kahwe Smith

Andrew Dunstan wrote:

> Chad Wagner wrote:
> > > There is some point in putting it in a wiki where we can gather relevant
> > > documents, links to discussions etc. That's why the developers' wiki was
> > > established, AIUI.
> >
> > To be honest, it may be adequate to maintain this solely through the
> > Wiki.  The only thing I see lacking is specifically who is handling
> > tasks, and what release it is planned for.  Perhaps I am just looking
> > in the wrong place?
>
> There's a good reason this information is missing: it doesn't exist.
> Please, like I said, have a look at recent discussions.
>
> Maybe we need some extra FAQs, like:
>
> . Why do you still use CVS instead of <insert favorite SCM here>?
> . Why don't you use bug tracking software?
> . Where do I find out when <feature X> will be in a release?


Yeah .. would be a good idea .. with some links to key posts in the
archive .. would be much less work than replying all the time .. and
would also make it easier for newcomers to not burn their fingers
rehashing things needlessly ..


regards,
Lukas



Re: [HACKERS] Developer TODO List as a PostgreSQL DB

2007-02-27 Thread Michael Glaesemann


On Feb 27, 2007, at 23:40, Andrew Dunstan wrote:

> Maybe we need some extra FAQs, like:
>
> . Why do you still use CVS instead of <insert favorite SCM here>?

I just saw a patch from Robert Treat on just this topic. Doesn't look
like it's been applied yet.


Michael Glaesemann
grzm seespotcode net





Re: [HACKERS] Packed short varlenas, what next?

2007-02-27 Thread Tom Lane
Peter Eisentraut <[EMAIL PROTECTED]> writes:
> As I have mentioned earlier, I'm missing a plan to allow 8-byte varlena
> sizes.

I don't think it's entirely fair to expect this patch to solve that
problem.  In the first place, that is not what the patch's goal is,
but merely tangentially related to the same code.  In the second place,
I don't see any way we could possibly do that without wide-ranging code
changes; to take just one point, much of the code that works with
varlenas uses "int" or "int32" variables to compute sizes.  So it would
certainly expand the scope of the patch quite a lot to try to put that
in place, and it's mighty late in the devel cycle to be thinking about
that sort of thing.

For the moment I think it should be enough to expect that the patch
allow for more than one format of TOAST pointer, so that if we ever did
try to support 8-byte varlenas, there'd be a way to represent them
on-disk.  Some of the alternatives that we discussed last year used up
all of the "prefix space" and wouldn't have allowed expansion in this
particular direction.

regards, tom lane



Re: [HACKERS] Packed short varlenas, what next?

2007-02-27 Thread Gregory Stark
Tom Lane <[EMAIL PROTECTED]> writes:
> Peter Eisentraut <[EMAIL PROTECTED]> writes:
> > As I have mentioned earlier, I'm missing a plan to allow 8-byte varlena
> > sizes.

Hm, change VARHDRSZ to 8 and change all the varlena data types to have an
int64 leading field? I suppose it could be done, and it would give us more
bits to play with in the codespace since then we could limit 4-byte headers to
128M or something. But yes, there are tons of places in the code that
currently do arithmetic on sizes using integers -- and often signed integers
at that.

But that's a change to what a *detoasted* datum looks like. My patch mainly
changes what a *toasted* datum looks like. (Admittedly after making more data
fall in that category than previously.) The only change to a detoasted datum
is that the size is stored in network byte order.

> For the moment I think it should be enough to expect that the patch
> allow for more than one format of TOAST pointer, so that if we ever did
> try to support 8-byte varlenas, there'd be a way to represent them
> on-disk.  Some of the alternatives that we discussed last year used up
> all of the "prefix space" and wouldn't have allowed expansion in this
> particular direction.

Ah yes, I had intended to include the bit-pattern choice in the list as well.

There are two issues there:

1) The lack of 2-byte patterns which is quite annoying as really *any* on-disk
   datum would fit in a 2-byte header varlena. However it became quite tricky
   to convert things to 2-byte headers, especially for compressed data, it
   would have made for a much bigger patch to tuptoaster.c and pg_lzcompress.
   And I became convinced that it was best to get the most important gain
   first, saving 2 bytes on wider tuples is less important than 3-6 bytes on
   narrow tuples.

2) The choice of encoding for toast pointers. Note that currently they don't
   actually save *any* space due to the alignment requirements of the OIDs,
   which seems kind of silly, but I didn't see any reasonable way around that.
   The flip side is that it gives us 24 bits to play with if we want to have
   different types of external pointers or more meta-information about the
   toasted data.

   One of the details here is that I didn't store the compressed bit anywhere
   for external toast pointers. I just made the macro compare the rawsize and
   extsize. If that strikes anyone as evil we could take a byte out of those 3
   padding bytes for flags and store a compressed flag there.
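
A sketch of the comparison described in point 2 (struct and macro names
invented):

    typedef struct varatt_external_sketch
    {
        int32   rawsize;        /* original uncompressed data size       */
        int32   extsize;        /* size of the data as stored externally */
        Oid     valueid;        /* value ID within the toast table       */
        Oid     toastrelid;     /* OID of the toast table itself         */
    } varatt_external_sketch;

    /* no stored flag: compression is inferred from the sizes */
    #define TOAST_SKETCH_IS_COMPRESSED(ptr) ((ptr)->extsize < (ptr)->rawsize)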

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Packed short varlenas, what next?

2007-02-27 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes:
> 2) The choice of encoding for toast pointers. Note that currently they don't
>    actually save *any* space due to the alignment requirements of the OIDs,
>    which seems kind of silly, but I didn't see any reasonable way around that.

I was expecting that we'd store them as unaligned and memcpy a toast
pointer into a suitably-aligned local variable any time we wanted to
look at its contents.  Detoasting is expensive enough that that's not
going to add any noticeable percentage time-overhead, and not having to
align toast pointers should be a pretty good percentage space-saving,
seeing that they're only 20-some bytes anyway.
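
A sketch of that suggestion, reusing the hypothetical struct from the
previous message (the datum pointer name is invented):

    varatt_external_sketch toast_ptr;

    /* the on-disk pointer is unaligned, so copy it into an aligned
     * local before reading any multi-byte fields */
    memcpy(&toast_ptr, unaligned_toast_data, sizeof(toast_ptr));

    /* toast_ptr.valueid, toast_ptr.toastrelid, etc. are now safe to
     * read even on strict-alignment platforms */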

>    One of the details here is that I didn't store the compressed bit anywhere
>    for external toast pointers. I just made the macro compare the rawsize and
>    extsize. If that strikes anyone as evil we could take a byte out of those 3
>    padding bytes for flags and store a compressed flag there.

I believe this is OK since the toast code doesn't compress unless space
is actually saved.  You should put a note in the code that that behavior
is now necessary for correctness, not just a performance tweak.

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 01:26:00AM -0500, Matthew T. O'Connor wrote:
> Tom Lane wrote:
> >"Matthew T. O'Connor"  writes:
> >>I'm not sure what you are saying here, are you now saying that partial 
> >>vacuum won't work for autovac?  Or are you saying that saving state as 
> >>Jim is describing above won't work?
> >
> >I'm saying that I don't like the idea of trying to "stop on a dime" by
> >saving the current contents of vacuum's dead-TID array to disk with the
> >idea that we can trust those values 100% later.  Saving the array is
> >expensive both in runtime and code complexity, and I don't believe we
> >can trust it later --- at least not without even more expensive-and-
> >complex measures, such as WAL-logging every such save :-(
> >
> >I'm for stopping only after completing an index-cleaning pass, at the
> >point where we empty the dead-TID array anyway.  If you really have to
> >have "stop on a dime", just kill -INT the process, accepting that you
> >will have to redo your heap scan since the last restart point.
> 
> OK, so if I understand correct, a vacuum of a table with 10 indexes on 
> it can be interrupted 10 times, once after each index-cleaning pass? 
> That might have some value, especially breaking up the work required to 
> vacuum a large table. Or am I still not getting it?

It'd stop after scanning the heap, scanning all the indexes, and then
cleaning the heap. After that's done it no longer needs any of the
dead-TID info; anytime before that it does need that info, and Tom's
objection is that trying to store that info is a bad idea.

The problem with this is that typically it takes a long time to go
through a complete vacuum cycle; minutes at least, and preferably
longer. Decreasing that cycle time will greatly increase the amount of
IO required for vacuuming a table with any indexes, because every time
you cycle through you have to read the entire index. That's why I don't
see it as being useful at all for getting autovac to work on hot tables
- if you actually got that cycle time low enough you'd kill the system
with all the extra index scanning.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:54:28AM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
> >On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
> >>Jim C. Nasby wrote:
> >>Here is a worst case example: A DB with 6 tables all of which are highly 
> >>active and will need to be vacuumed constantly.  While this is totally 
> >>hypothetical, it is how I envision things working (without the threshold).
> >
> >I fail to see how a simple 6 table case is 'worst case'. It's common to
> >see hundreds of tables, and I've run across more than one database with
> >thousands of tables (think partitioning). In cases like those it's
> >certainly possible, perhaps even likely that you would get many daemons
> >running in the database at one time just from different tables suddenly
> >needing vacuuming and appearing at a higher point in the list than other
> >tables. With 100 ~1G tables getting updates it certainly wouldn't be
> >hard to end up with 10 of those being vacuumed all at the same time.
> 
> Yes 6 tables is small, the worst-case part of the example was that all 
> the tables would need to be vacuumed constantly.  Most databases only 
> have a few hot tables.  Most tables only need to vacuumed every once in 
> a while.
 
It's not the hot tables that are the issue; it's how many large tables
(hot or not) can come up for vacuuming in order. For example, if
A-Z are all large tables (ie: a few GB), with A being the largest and Z
the smallest, think about what happens here:

Round 1: A needs vacuuming. Daemon gets to it and starts working.
Round 2: B now needs vacuuming. It's slightly smaller than A, so daemon
2 gets to it.
Round 3: C now needs vacuuming. Daemon 3.
...
Round 26: Z now needs vacuuming. Daemon 26 picks it up.

You now have 26 daemons running in the database.

Now, we can argue about how likely that scenario is, but I don't think
it's relevant. What matters is that it *is* possible, and as long as
that's the case you'd have to have some kind of limit. (While this
simple 26 table example is definitely worst-case, if you've got hundreds
of tables that are all multiple GB in size I think it wouldn't be hard
at all for you to end up with a dozen or more daemons all hammering
away).

> >I do like the idea since it should be easier to tune, but I think we
> >still need some limit on it. Perhaps as a first-pass we could just have
> >a hard limit and log a message and/or set a flag any time we hit it.
> >That would hopefully allow us to get information about how big a problem
> >it really is. We could go one step further and say that the last daemon
> >that can start in a database will only vacuum tables that can be done
> >quickly; that's essentially what we've been talking about, except the
> >limit we've been discussing would be hard-coded at 2.
> 
> I'm confused, what limit would be set at 2?  The number of concurrent 
> workers?  I've never said that.

The point I was making is that the proposal about limiting the 2nd
daemon to only processing tables it can do in a short period of time is
akin to setting a limit of only 2 daemons in a database at a time.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
> Jim C. Nasby wrote:
> 
> > The advantage to keying this to autovac_naptime is that it means we
> > don't need another GUC, but after I suggested that before I realized
> > that's probably not the best idea. For example, I've seen clusters that
> > are running dozens-hundreds of databases; in that environment you really
> > need to turn naptime way down (to like a second). In that case you
> > wouldn't want to key to naptime.
> 
> Actually, I've been thinking that it would be a good idea to change the
> semantics of autovacuum_naptime so that it means the average time to
> start a worker in any given database.  That way, the time between
> autovac runs is not dependent on the number of databases you have.

BTW, another issue that I don't think we can ignore: we actually need to
do this on a per-tablespace level, or at least have the ability to
disable or somehow limit it. While it's not common, there are users that
run a hundred or more databases in a single cluster; it would be ugly if
we suddenly had 100 vacuums trying to run on the same set of drives
concurrently.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Developer TODO List as a PostgreSQL DB

2007-02-27 Thread Joshua D. Drake

> And, for the record, I do not see the slightest point in putting the
> TODO list on its own into a database. None, zilch, nada. As database
> professionals we should be adept at recognising when the use of a
> database is appropriate and when it isn't. If we put it into a tracking
> system (presumably database backed) that would be another matter, but
> that has gone nowhere as usual.

As someone who has been involved in these discussions over and over, I
would suggest that everyone just drop the whole thing. It won't go
anywhere, and it just isn't worth the 3 cents of bandwidth it would cost
to send the emails.

Joshua D. Drake



-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] [PATCHES] HOT WIP Patch - version 3.2

2007-02-27 Thread Heikki Linnakangas

Pavan Deolasee wrote:

> - What do we do with the LP_DELETEd tuples at the VACUUM time ?
> In this patch, we are collecting them and vacuuming like
> any other dead tuples. But is that the best thing to do ?


Since they don't need index cleanups, it's a waste of 
maintenance_work_mem to keep track of them in the dead tuples array. 
Let's remove them in the 1st phase. That means trading the shared lock 
for a vacuum-level lock on pages with LP_DELETEd tuples. Or if we want 
to get fancy, we could skip LP_DELETEd tuples in the 1st phase for pages 
that had dead tuples on them, and scan and remove them in the 2nd phase 
when we have to acquire the vacuum-level lock anyway.



> - While searching for a LP_DELETEd tuple, we start from the
> first offset and return the first slot which is big enough
> to store the tuple. Is there a better search algorithm
> (sorting/randomizing) ? Should we go for best-fit instead
> of first-fit ?


Best-fit seems better to me. It's pretty cheap to scan for LP_DELETEd 
line pointers, but wasting space can lead to cold updates and get much 
more expensive.
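
A best-fit search might look roughly like this (a sketch only;
ItemIdDeleted stands in for however the LP_DELETE flag is actually
tested):

    OffsetNumber best = InvalidOffsetNumber;
    Size         best_size = (Size) -1;     /* effectively "infinity" */
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId  itemid = PageGetItemId(page, off);

        /* remember the smallest LP_DELETEd slot that still fits */
        if (ItemIdDeleted(itemid) &&
            ItemIdGetLength(itemid) >= newsize &&
            ItemIdGetLength(itemid) < best_size)
        {
            best = off;
            best_size = ItemIdGetLength(itemid);
        }
    }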


You could also prune the chains on the page to make room for the update, 
and if you can get a vacuum lock you can also defrag the page.



> - Should we have metadata on the heap page to track the
> number of LP_DELETEd tuples, number of HOT-update chains in the
> page and any other information that can help us optimize
> search/prune operations ?


I don't think the CPU overhead is that significant; we only need to do 
the search/prune when a page gets full. We can add flags later if we 
feel like it, but let's keep it simple for now.



> - There are some interesting issues in the VACUUMing area. How
> do we count the dead tuples ? Should we count HOT-updated
> tuples in the dead count ? If we do so, I noticed that
> VACUUM gets triggered on very small tables like "tellers"
> in pgbench and takes several minutes to finish because
> it waits very very long for VACUUM-strength lock on the
> index pages. Index is just a page or two in this case.


Yeah, that's not good. HOT updates shouldn't increase the n_dead_tuples 
pgstat counter.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Resumable vacuum proposal and design overview

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 11:44:28AM +0900, Galy Lee wrote:
> For example, there is one table:
>    - The table is hundreds of GBs in size.
>    - It takes 4-8 hours to vacuum such a large table.
>    - Enabling cost-based delay may make it last for 24 hours.
>    - It can be vacuumed during night time for 2-4 hours.
> 
> It is true that there is no strict requirement that vacuum
> needs to be interrupted immediately, but it should be stopped in a
> *predictable way*. In the above example, if we have to wait for the end
> of one full cycle of cleaning, it may take up to 8 hours for vacuum to
> stop after it has received the stop request. This seems quite unacceptable.

Even with very large tables, you could likely still fit things into a
specific time frame by adjusting how much time is spent scanning for
dead tuples. The idea would be to give vacuum a target run time, and it
would monitor how much time it had remaining, taking into account how
long it should take to scan the indexes based on how long it's been
taking to scan the heap. When the amount of time left becomes less than
the estimate of the amount of time required to scan the indexes (and
clean the heap), you stop the heap scan and start scanning indexes. As
long as the IO workload on the machine doesn't vary wildly between the
heap scan and the rest of the vacuum process, I would expect this to
work out fairly well.
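
In outline, that time-budgeted scan might look like (all names invented):

    /* scan the heap only while enough budget remains for the index
     * scans and the final heap-cleaning pass */
    while (blkno < nblocks)
    {
        scan_heap_page_for_dead_tuples(blkno++);

        elapsed       = current_time() - start_time;
        est_remaining = index_pages * observed_time_per_page;

        if (target_run_time - elapsed <= est_remaining)
            break;          /* switch to the index phase now */
    }
    clean_indexes_and_second_heap_pass();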

While not as nice as the ability to 'stop on a dime' as Tom puts it,
this would be much easier and safer to implement. If there's still a
need for something better after that we could revisit it at that time.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] 7.x horology regression test on Solaris buildfarm machines

2007-02-27 Thread Tom Lane
I see that kudu and dragonfly are now failing regression in the 7.3 and
7.4 branches, as a consequence of this patch:
http://archives.postgresql.org/pgsql-committers/2007-02/msg00491.php

Is it reasonable to assume that that machine will soon be patched to
know about the new US DST rules?  If not, I suppose we could install
variant horology-solaris-1947_1.out files to allow either result to
succeed, but it seems a bit silly ...

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:47:14AM -0500, Tom Lane wrote:
> "Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> > Yes, but the list being discussed is SoC projects that the community
> > would like to see done, which means most people would assume that #1
> > isn't an issue.
> 
> > We need to make sure that every project on the list of SoC ideas is
> > supported by the community.
> 
> Agreed, except that in most cases a one-liner description of an idea
> isn't enough to get a meaningful reading on whether people think it's
> sane or not.  To take our current example: do you think a one-liner
> description of full disjunctions would have gotten any feedback, except
> for requests for more detail?
> 
> I'm not sure how we fix that --- laying out every acceptable project
> in great detail in advance won't happen for lack of manpower, and wouldn't
> be a good idea even if we could do it, because that sounds like a great
> way to stifle creativity.  At the same time we can hardly promise to
> accept every wild-west idea that someone manages to turn into some kind
> of code.  What can we tell the students other than "get as much feedback
> as you can, as early as you can"?

I agree we certainly don't want to go designing these projects in
advance, but I think we could at least ensure that the community buys
into the concept of each project. ISTM one of the big issues with FD is
that most people didn't even really understand what exactly it was or
how it might be useful, which made getting broad acceptance even harder.

For example, these TODOs seem like they have good acceptance by the
community (though they might not be good SoC projects for other
reasons):
Simplify ability to create partitioned tables
Allow auto-selection of partitioned tables for min/max() operations
Allow commenting of variables in postgresql.conf to restore them to
defaults

Examples of ideas that might not be good because it's unclear that the
community supports them:
Stop-on-a-dime partial vacuum
Adding a replication solution to the backend
Putting time travel support back in

Granted, the 'not good idea' list is pretty exaggerated simply because
it's not as easy to find examples of that on the TODO list, since stuff
on the TODO list is generally supported. Some of the 'temporal database'
items that had been suggested probably fall into the category of 'might
be a good idea, but the community hasn't decided that yet'. So maybe we
should be limiting SoC projects to stuff that's already on the TODO
list..
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Resumable vacuum proposal and design overview

2007-02-27 Thread Simon Riggs
On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 11:44:28AM +0900, Galy Lee wrote:
> > For example, there is one table:
> >- The table is a hundreds GBs table.
> >- It takes 4-8 hours to vacuum such a large table.
> >- Enabling cost-based delay may make it last for 24 hours.
> >- It can be vacuumed during night time for 2-4 hours.
> > 
> > It is true there is no strict requirement that vacuum
> > needs to be interrupted immediately, but it should be stopped in a
> > *predictable way*. In the above example, if we have to wait for the end
> > of one full cycle of cleaning, it may take up to 8 hours for vacuum to
> > stop after it has received the stop request. This seems quite unacceptable.
> 
> Even with very large tables, you could likely still fit things into a
> specific time frame by adjusting how much time is spent scanning for
> dead tuples. The idea would be to give vacuum a target run time, and it
> would monitor how much time it had remaining, taking into account how
> long it should take to scan the indexes based on how long it's been
> taking to scan the heap. When the amount of time left becomes less than
> the estimate of the amount of time required to scan the indexes (and
> clean the heap), you stop the heap scan and start scanning indexes. As
> long as the IO workload on the machine doesn't vary wildly between the
> heap scan and the rest of the vacuum process, I would expect this to
> work out fairly well.
> 
> While not as nice as the ability to 'stop on a dime' as Tom puts it,
> this would be much easier and safer to implement. If there's still a
> need for something better after that we could revisit it at that time.

I do like this idea, but it also seems easy to calculate that bit
yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how
long it takes to finish. Adjust X until you have it right.

If we did want to automate it, a vacuum_target_duration userset GUC would
make sense, following Jim's thought. A setting of 0 would mean
run-to-completion.

Getting it to work well for VACUUM FULL would be more than a little
interesting though.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] 7.x horology regression test on Solaris buildfarm machines

2007-02-27 Thread Andrew Dunstan

Tom Lane wrote:

I see that kudu and dragonfly are now failing regression in the 7.3 and
7.4 branches, as a consequence of this patch:
http://archives.postgresql.org/pgsql-committers/2007-02/msg00491.php

Is it reasonable to assume that that machine will soon be patched to
know about the new US DST rules?  If not, I suppose we could install
variant horology-solaris-1947_1.out files to allow either result to
succeed, but it seems a bit silly ...

  


Yeah. I have stopped building 7.4 on hyena.

cheers

andrew


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Dead Space Map version 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:55:21AM -0500, Tom Lane wrote:
> "Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> > Yes, DSM would make FSM recovery more important, but I thought it was
> > recoverable now? Or is that only on a clean shutdown?
> 
> Currently we throw away FSM during any non-clean restart.  This is
> probably overkill but I'm quite unclear what would be a safe
> alternative.

My thought would be to revert to a FSM that has pages marked as free
that no longer are. Could be done by writing the FSM out every time we
add pages to it. After an unclean restart backends would be getting
pages from the FSM that didn't have free space, in which case they'd
need to yank that page out of the FSM and request a new one. Granted,
this means extra IO until the FSM gets back to a realistic state, but I
suspect that's better than bloating tables out until the next vacuum.
And it's ultimately less IO than re-vacuuming every table to rebuild the
FSM.
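
A minimal sketch of that "FSM entries are only hints" lookup, with
illustrative names (the real FSM API differs in detail):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/freespace.h"

/* Sketch only: after an unclean restart the FSM may advertise pages
 * that are actually full, so verify each hint before using it. */
static BlockNumber
get_verified_free_page(Relation rel, Size needed)
{
    BlockNumber blk;

    while ((blk = GetPageWithFreeSpace(rel, needed)) != InvalidBlockNumber)
    {
        Buffer  buf = ReadBuffer(rel, blk);
        bool    usable;

        LockBuffer(buf, BUFFER_LOCK_SHARE);
        usable = PageGetFreeSpace(BufferGetPage(buf)) >= needed;
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buf);

        if (usable)
            return blk;                 /* the pre-crash hint was valid */

        /* stale entry: yank the page out of the FSM and keep looking */
        RecordPageWithFreeSpace(rel, blk, 0);
    }
    return P_NEW;                       /* fall back to extending the rel */
}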
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] 7.x horology regression test on Solaris buildfarm machines

2007-02-27 Thread Kris Jurka



On Tue, 27 Feb 2007, Tom Lane wrote:


I see that kudu and dragonfly are now failing regression in the 7.3 and
7.4 branches, as a consequence of this patch:
http://archives.postgresql.org/pgsql-committers/2007-02/msg00491.php

Is it reasonable to assume that that machine will soon be patched to
know about the new US DST rules?  If not, I suppose we could install
variant horology-solaris-1947_1.out files to allow either result to
succeed, but it seems a bit silly ...



I'll look at an upgrade.  Eel is failing as well, but surprisingly canary 
is not.  Canary hasn't had any updates applied, so why isn't it failing as 
well:


[EMAIL PROTECTED]:~$ zdump -v /etc/localtime  | grep 2007
/etc/localtime  Sun Apr  1 08:59:59 2007 UTC = Sun Apr  1 01:59:59 2007 MST isdst=0
/etc/localtime  Sun Apr  1 09:00:00 2007 UTC = Sun Apr  1 03:00:00 2007 MDT isdst=1
/etc/localtime  Sun Oct 28 07:59:59 2007 UTC = Sun Oct 28 01:59:59 2007 MDT isdst=1
/etc/localtime  Sun Oct 28 08:00:00 2007 UTC = Sun Oct 28 01:00:00 2007 MST isdst=0


Kris Jurka

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Dead Space Map version 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 05:38:39PM +0900, ITAGAKI Takahiro wrote:
> "Jim C. Nasby" <[EMAIL PROTECTED]> wrote:
> 
> > > If we do UPDATE a tuple, the original page containing the tuple is marked
> > > as HIGH and the new page where the updated tuple is placed is marked as 
> > > LOW.
> > 
> > Don't you mean UNFROZEN?
> 
> No, the new tuples are marked as LOW. I intend to use UNFROZEN and FROZEN
> pages as "all tuples in the pages are visible to all transactions" for
> index-only-scan in the future.
 
Ahh, ok. Makes sense, though I tend to agree with others that it's
better to leave that off for now, or at least do the initial patch
without it.
 
> > What makes it more important to mark the original page as HIGH instead
> > of LOW, like the page with the new tuple? The description of the states
> > indicates that there would likely be a lot more dead tuples in a HIGH
> > page than in a LOW page.
> > 
> > Perhaps it would be better to have the bgwriter take a look at how many
> > dead tuples (or how much space the dead tuples account for) when it
> > writes a page out and adjust the DSM at that time.
> 
> Yeah, I feel it is worth optimizing, too. One question is, how do we treat
> dirty pages written by backends rather than by the bgwriter? If we want to
> add such work to the bgwriter, do we also need to make the bgwriter write
> almost all of the dirty pages?

IMO yes, we want the bgwriter to be the only process that's normally
writing pages out. How close we are to that, I don't know...
 
> > > * Agressive freezing
> > > We will freeze tuples in dirty pages using OldestXmin but FreezeLimit.
> > 
> > Do you mean using OldestXmin instead of FreezeLimit?
> 
> Yes, we will use OldestXmin as the threshold to freeze tuples in
> dirty pages or pages that have some dead tuples. Otherwise, many UNFROZEN
> pages would still remain after vacuum, and they would cost us in the next
> vacuum for preventing XID wraparound.

Another good idea. If it's not too invasive I'd love to see that as a
stand-alone patch so that we know it can get in.
 
> > > I'm thinking to change them into 2 new parameters. We will allocate
> > > memory for DSM that can hold all of estimated_database_size, and for
> > > FSM 50% or something of the size. Is this reasonable?
> >  
> > I don't think so, at least not until we get data from the field about
> > what's typical. If the DSM is tracking every page in the cluster then
> > I'd expect the FSM to be closer to 10% or 20% of that, anyway.
> 
> I'd like to add some kind of logical flavors to max_fsm_pages
> and max_dsm_pages. For DSM, max_dsm_pages should represent the
> whole database size. On the other hand, what meaning does
> max_fsm_pages have? (estimated_updatable_size ?)

At some point it might make sense to convert the FSM into a bitmap; that
way everything just scales with database size.

In the meantime, I'm not sure if it makes sense to tie the FSM size to
the DSM size, since each FSM page requires 48x the storage of a DSM
page. I think there's also a lot of cases where FSM size will not scale
the same way DSM size will, such as when there's historical data in the
database.

That raises another question... what happens when we run out of DSM
space?
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Resumable vacuum proposal and design overview

2007-02-27 Thread Tom Lane
"Simon Riggs" <[EMAIL PROTECTED]> writes:
> On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote:
>> ... The idea would be to give vacuum a target run time, and it
>> would monitor how much time it had remaining, taking into account how
>> long it should take to scan the indexes based on how long it's been
>> taking to scan the heap. When the amount of time left becomes less than
>> the estimate of the amount of time required to scan the indexes (and
>> clean the heap), you stop the heap scan and start scanning indexes.

> I do like this idea, but it also seems easy to calculate that bit
> yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how
> long it takes to finish. Adjust X until you have it right.

One problem with it is that a too-small target would result in vacuum
proceeding to scan indexes after having accumulated only a few dead
tuples, resulting in increases (potentially enormous ones) in the total
work needed to vacuum the table completely.

I think it's sufficient to have two cases: abort now, and restart from
the last cycle-completion point next time (this would basically just be
SIGINT); or set a flag to stop at the next cycle-completion point.


It occurs to me that we may be thinking about this the wrong way
entirely.  Perhaps a more useful answer to the problem of using a
defined maintenance window is to allow VACUUM to respond to changes in
the vacuum cost delay settings on-the-fly.  So when your window closes,
you don't abandon your work so far, you just throttle your I/O rate back
to whatever's considered acceptable for daytime vacuuming.
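
Concretely, that could be as small as re-reading the config file inside
the vacuum delay check, along these lines (a hedged sketch, not actual
backend code; got_SIGHUP is assumed to be set by a signal handler):

#include "postgres.h"
#include "miscadmin.h"
#include "utils/guc.h"

extern volatile bool got_SIGHUP;

/* Sketch: a delay point that picks up new vacuum_cost_* settings
 * mid-run, so a closing maintenance window throttles the VACUUM
 * instead of killing it. */
static void
vacuum_delay_point_with_reload(void)
{
    if (got_SIGHUP)
    {
        got_SIGHUP = false;
        ProcessConfigFile(PGC_SIGHUP);  /* re-read postgresql.conf */
    }

    if (VacuumCostActive && VacuumCostBalance >= VacuumCostLimit)
    {
        pg_usleep(VacuumCostDelay * 1000L);
        VacuumCostBalance = 0;
    }
}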

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:

Jim C. Nasby wrote:


The advantage to keying this to autovac_naptime is that it means we
don't need another GUC, but after I suggested that before I realized
that's probably not the best idea. For example, I've seen clusters that
are running dozens-hundreds of databases; in that environment you really
need to turn naptime way down (to like a second). In that case you
wouldn't want to key to naptime.

Actually, I've been thinking that it would be a good idea to change the
semantics of autovacuum_naptime so that it means the average time to
start a worker in any given database.  That way, the time between
autovac runs is not dependent on the number of databases you have.


BTW, another issue that I don't think we can ignore: we actually need to
do this on a per-tablespace level, or at least have the ability to
disable or somehow limit it. While it's not common, there are users that
run a hundred or more databases in a single cluster; it would be ugly if
we suddenly had 100 vacuums trying to run on the same set of drives
concurrently.


I think we all agree that autovacuum needs to become tablespace aware at 
some point, but I think that is further down the line, we're having 
enough trouble figuring things out without that additional complication.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] 7.x horology regression test on Solaris buildfarm machines

2007-02-27 Thread Kris Jurka



On Tue, 27 Feb 2007, Kris Jurka wrote:

I'll look at an upgrade.  Eel is failing as well, but surprisingly canary is 
not.  Canary hasn't had any updates applied, so why isn't it failing as well:




Shouldn't all of the buildfarm members be failing either before or after 
the patch?  That doesn't seem to be the case for any of them.


Kris Jurka

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] 7.x horology regression test on Solaris buildfarm machines

2007-02-27 Thread Tom Lane
Kris Jurka <[EMAIL PROTECTED]> writes:
> Shouldn't all of the buildfarm members be failing either before or after 
> the patch?  That doesn't seem to be the case for any of them.

No, because for the "standard" regression behavior we have variant
result files both with and without the DST law change: see horology and
horology_1.  The issue only comes up for machines that were matching to
horology-no-DST-before-1970.out (which may be only old HPUX) or
horology-solaris-1947.out (which I suppose is only Solaris).  The reason
I got around to making the patch now is I recently updated my old HPUX
machine for the law change, and so 7.x started to fail for me ...

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Resumable vacuum proposal and design overview

2007-02-27 Thread Matthew T. O'Connor

Tom Lane wrote:

"Simon Riggs" <[EMAIL PROTECTED]> writes:

On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote:

... The idea would be to give vacuum a target run time, and it
would monitor how much time it had remaining, taking into account how
long it should take to scan the indexes based on how long it's been
taking to scan the heap. When the amount of time left becomes less than
the estimate of the amount of time required to scan the indexes (and
clean the heap), you stop the heap scan and start scanning indexes.



I do like this idea, but it also seems easy to calculate that bit
yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how
long it takes to finish. Adjust X until you have it right.


One problem with it is that a too-small target would result in vacuum
proceeding to scan indexes after having accumulated only a few dead
tuples, resulting in increases (potentially enormous ones) in the total
work needed to vacuum the table completely.

I think it's sufficient to have two cases: abort now, and restart from
the last cycle-completion point next time (this would basically just be
SIGINT); or set a flag to stop at the next cycle-completion point.


It occurs to me that we may be thinking about this the wrong way
entirely.  Perhaps a more useful answer to the problem of using a
defined maintenance window is to allow VACUUM to respond to changes in
the vacuum cost delay settings on-the-fly.  So when your window closes,
you don't abandon your work so far, you just throttle your I/O rate back
to whatever's considered acceptable for daytime vacuuming.


I thought we already did that?  Which BTW was part of my plan on how to
deal with a vacuum that is still running after its maintenance window
has expired.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Josh Berkus

> Well, the HOT discussion hasn't yet led to an accepted patch ... and I'd
> say its authors still did way too much work before getting the community
> involved.  But certainly it's a better model to look at than what the
> FD author did.

That's pretty much the mentor's job.  I don't remember who mentored FD, but 
obviously there wasn't enough guidance.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] bug in gist hstore?

2007-02-27 Thread Gregory Stark

In the following code from hstore_io.c, is HStore a varlena? If so, is the
code buggy because it fails to subtract VARHDRSZ from in->size, and therefore
mishandles the empty hstore and starts the loop from the varlena header
instead of the first data byte?

Or, if HStore is not a varlena, then PG_GETARG_HS is buggy, since it calls
PG_DETOAST_DATUM() on the argument.



PG_FUNCTION_INFO_V1(hstore_out);
Datum   hstore_out(PG_FUNCTION_ARGS);
Datum
hstore_out(PG_FUNCTION_ARGS)
{
HStore *in = PG_GETARG_HS(0);
int buflen,
i;
char   *out,
   *ptr;
char   *base = STRPTR(in);
HEntry *entries = ARRPTR(in);

if (in->size == 0)
{
out = palloc(1);
*out = '\0';
PG_FREE_IF_COPY(in, 0);
PG_RETURN_CSTRING(out);
}

buflen = (4 /* " */ + 2 /* => */ + 2 /* , */ ) * in->size +
2 /* esc */ * (in->len - CALCDATASIZE(in->size, 0));

out = ptr = palloc(buflen);
for (i = 0; i < in->size; i++)
{
*ptr++ = '"';
ptr = cpw(ptr, base + entries[i].pos, entries[i].keylen);
*ptr++ = '"';
*ptr++ = '=';
*ptr++ = '>';
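/* ... rest of function omitted ... */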


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Robert Treat
On Monday 26 February 2007 18:46, Josh Berkus wrote:
> Demian,
>
> > Could you also please share your thoughts on what would be a good
> > student profile- for instance, how much theoretical background and
> > practical experience, for working on a SoC project?
>
> Well, it shouldn't be the student's first year writing code.  Basically,
> they're committing to developing a feature *to completion* in 3 months.
> We're going to be evaluating  proposals based on whether we think students
> can do that.
>
> References of some kind will be important, given that we'll only be
> accepting about 12% of proposals.  So, some demonstration of the student's
> ability to code from either the open source world or previous coursework.
>
> I don't know, honestly, that we care very much about theoretical
> background, except where it relates directly to and cutting-edge concepts
> in you proposal.  PostgreSQL-space is littered with half-completed
> academic projects; we're not seeking more of those.

Additionally it doesn't hurt to bring up an idea to the community before you 
even submit the application.  If you get buy in / agreement from multiple 
community members on a project now, it can only help your chances in the SoC 
process. (Picking items from the TODO list is a good way to start if you need 
ideas)

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Andrew Dunstan

Jim C. Nasby wrote:


I agree we certainly don't want to go designing these projects in
advance, but I think we could at least ensure that the community buys
into the concept of each project. 


Yes, at least for those that go on a suggestion list. And that was my 
worry about Warren's suggestion - I haven't seen much enthusiasm for it 
from anyone else, so the degree of in-principle buy-in is fairly dubious.


cheers

andrew

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Resumable vacuum proposal and design overview

2007-02-27 Thread Tom Lane
"Matthew T. O'Connor" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> It occurs to me that we may be thinking about this the wrong way
>> entirely.  Perhaps a more useful answer to the problem of using a
>> defined maintenance window is to allow VACUUM to respond to changes in
>> the vacuum cost delay settings on-the-fly.  So when your window closes,
>> you don't abandon your work so far, you just throttle your I/O rate back
>> to whatever's considered acceptable for daytime vacuuming.

> I thought we already did that?

No, we only react to SIGHUP when idle.  I think that's a good policy for
standard backends, but for autovacuum it might be appropriate to check
more often.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:12:22PM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
> >On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
> >>Jim C. Nasby wrote:
> >>
> >>>The advantage to keying this to autovac_naptime is that it means we
> >>>don't need another GUC, but after I suggested that before I realized
> >>>that's probably not the best idea. For example, I've seen clusters that
> >>>are running dozens-hundreds of databases; in that environment you really
> >>>need to turn naptime way down (to like a second). In that case you
> >>>wouldn't want to key to naptime.
> >>Actually, I've been thinking that it would be a good idea to change the
> >>semantics of autovacuum_naptime so that it means the average time to
> >>start a worker in any given database.  That way, the time between
> >>autovac runs is not dependent on the number of databases you have.
> >
> >BTW, another issue that I don't think we can ignore: we actually need to
> >do this on a per-tablespace level, or at least have the ability to
> >disable or somehow limit it. While it's not common, there are users that
> >run a hundred or more databases in a single cluster; it would be ugly if
> >we suddenly had 100 vacuums trying to run on the same set of drives
> >concurrently.
> 
> I think we all agree that autovacuum needs to become tablespace aware at 
> some point, but I think that is further down the line, we're having 
> enough trouble figuring things out without that additional complication.

Sure, we just need a way to disable the multiple autovac daemon stuff
then.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 10:49:32AM +, Simon Riggs wrote:
> > I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> > standard).  If we did this I'd vote for control via a GUC setting only;
> > I think that is more useful anyway, as an application can be made to run
> > with such a setting without invasive source code changes.
> 
> OK.
> 
> Having read through all of the above things again, ISTM that we should
> make this functionality available by a new GUC commit_fsync_delay, which
> must be set explicitly > 0 before this feature can be used at all. If I
> confused Tom by using commit_delay, then I'll confuse others also and
> group commit and deferred fsync are different techniques with different
> robustness guarantees. When enabled it should have a clear message in
> the log to show that some commits might be using commit_nowait. 
> 
> I'd even welcome a more descriptive term that summed up the relaxed
> transaction guarantee implied by the use of the deferred fsync
> technique. Perhaps even a very explicit USERSET GUC:
> 
>   transaction_guarantee = on (default) | off

So would you set commit_fsync_delay on a per-transaction basis? That
doesn't make much sense to me... I guess I'm not seeing how you would
explicitly mark transactions that you didn't want to fsync immediately.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Josh Berkus
Simon,

One of the things I love about doing informal online user support in the 
PostgreSQL community, and formal user support for Sun's customers, is the 
almost-ironclad guarantee that if a user has a corrupt database or data loss, 
one of three things is true:
a) they didn't apply some recommended PG update;
b) they have a bad disk controller or disk config;
c) they have bad ram.

It seriously narrows down the problem space to know that PostgreSQL does *not* 
allow data loss if it's physically possible to prevent it.

Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for 
PostgreSQL, I'd like to see the answers to two questions:

a) Please give some examples of performance gain on applications using COMMIT 
NOWAIT.  The performance gain needs to be substantial (like, 50% to 100%) to 
justify a compromise like this.

b) Why this and not global temporary tables or queuing?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Packed short varlenas, what next?

2007-02-27 Thread Josh Berkus
Greg,

> I'm really curious to know how people feel about the varlena patch. In
> particular I know these issues may elicit comment:

Haven't tested yet.  Will let you know when I do.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> So would you set commit_fsync_delay on a per-transaction basis? That
> doesn't make much sense to me... I guess I'm not seeing how you would
> explicitly mark transactions that you didn't want to fsync immediately.

My assumption was that most of the time you'd want this behavior
per-session and so the existing mechanisms for setting a GUC variable
would work perfectly well.  If you really want it per-transaction then
changing the variable on the fly is possible (perhaps using SET LOCAL).
You'll be issuing nonstandard commands either way...

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] Implicit casts with generic arrays

2007-02-27 Thread Peter Eisentraut
I've looked into cutting back on the implicit casts to text, which 
exposed the following little gem.

The expressions

'abc' || 34
34 || 'abc'

would no longer work, with the following error message:

ERROR:  22P02: array value must start with "{" or dimension information

That's because the best matches are now respectively

anyarray || anyelement
anyelement || anyarray

Now either this is just too bad and users of a system with reduced casts 
to text will have to live with this odd error message, or coercing any 
old unknown constant to anyarray isn't such a good idea.

Comments?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Martijn van Oosterhout
On Tue, Feb 27, 2007 at 10:49:18AM -0600, Jim C. Nasby wrote:
> I agree we certainly don't want to go designing these projects in
> advance, but I think we could at least ensure that the community buys
> into the concept of each project. ISTM one of the big issues with FD is
> that most people didn't even really understand what exactly it was or
> how it might be useful, which made getting broad acceptance even harder.

I think it might be useful to at least encourage people wanting to do an
SoC project to create a page on the developer wiki selling their idea.
You know, questions like: why do we want it? Where do you expect it to
be included? etc.

That might help catch misunderstandings much earlier.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.




Re: [HACKERS] Packed short varlenas, what next?

2007-02-27 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes:
> I'm really curious to know how people feel about the varlena patch.

One thing I think we could do immediately is apply the change to replace
"VARATT_SIZEP(x) = len" with "SET_VARSIZE(x, len)" --- that would
considerably reduce the size of the patch and allow people to focus on
the important changes instead of underbrush.  Barring objection I'll go
ahead and do that today.
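
For anyone skimming, the mechanical substitution looks like this
(illustrative fragment; x and len stand in for whatever each call site
uses):

/* old style: assign through the length-word lvalue macro */
VARATT_SIZEP(x) = len;

/* new style: a macro call, which can later hide representation changes */
SET_VARSIZE(x, len);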

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Resumable vacuum proposal and design overview

2007-02-27 Thread Simon Riggs
On Tue, 2007-02-27 at 12:23 -0500, Tom Lane wrote:
> "Matthew T. O'Connor" <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> >> It occurs to me that we may be thinking about this the wrong way
> >> entirely.  Perhaps a more useful answer to the problem of using a
> >> defined maintenance window is to allow VACUUM to respond to changes in
> >> the vacuum cost delay settings on-the-fly.  So when your window closes,
> >> you don't abandon your work so far, you just throttle your I/O rate back
> >> to whatever's considered acceptable for daytime vacuuming.
> 
> > I thought we already did that?
> 
> No, we only react to SIGHUP when idle.  I think that's a good policy for
> standard backends, but for autovacuum it might be appropriate to check
> more often.

You mean react to changes while in the middle of a VACUUM? Sounds like a
great idea to me.

Not sure why you'd do that just for autovacuum though. Sounds like it
would be good in all cases. Vacuum and autovacuum have different vacuum
delays, after all.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] bug in gist hstore?

2007-02-27 Thread Gregory Stark

"Gregory Stark" <[EMAIL PROTECTED]> writes:

> In the following code from hstore_io.c, is HStore a varlena? 

Sorry, thinko, it is a varlena but "size" isn't the first struct field,
there's a "len" field first.
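
For reference, the struct in question looks roughly like this (paraphrased,
so treat the details as approximate; int4 is the backend's 4-byte int
typedef):

typedef struct
{
    int4    len;        /* varlena length word -- first field */
    int4    size;       /* number of key/value pairs */
    char    data[1];    /* HEntry array followed by the string data */
} HStore;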

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PATCHES] HOT WIP Patch - version 3.2

2007-02-27 Thread Pavan Deolasee

On 2/27/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:


Pavan Deolasee wrote:
> - What do we do with the LP_DELETEd tuples at the VACUUM time ?
> In this patch, we are collecting them and vacuuming like
> any other dead tuples. But is that the best thing to do ?

Since they don't need index cleanups, it's a waste of
maintenance_work_mem to keep track of them in the dead tuples array.
Let's remove them in the 1st phase. That means trading the shared lock
for a vacuum-level lock on pages with LP_DELETEd tuples. Or if we want
to get fancy, we could skip LP_DELETEd tuples in the 1st phase for pages
that had dead tuples on them, and scan and remove them in the 2nd phase
when we have to acquire the vacuum-level lock anyway.



I liked the idea of not collecting the LP_DELETEd tuples in the first
pass. We also prune the HOT-update chains in the page in the first
pass; maybe that can also be moved to the second pass. We need to
carefully work on the race conditions involved in the VACUUM, pruning
and tuple reuse though.



> - While searching for a LP_DELETEd tuple, we start from the
> first offset and return the first slot which is big enough
> to store the tuple. Is there a better search algorithm
> (sorting/randomizing) ? Should we go for best-fit instead
> of first-fit ?

Best-fit seems better to me. It's pretty cheap to scan for LP_DELETEd
line pointers, but wasting space can lead to cold updates and get much
more expensive.



Ok. I will give it a shot once the basic things are ready.


You could also prune the chains on the page to make room for the update,
and if you can get a vacuum lock you can also defrag the page.



Yes, that's a good suggestion as well. I am already doing that in the
patch I am working on right now.



> - Should we have metadata on the heap page to track the
> number of LP_DELETEd tuples, number of HOT-update chains in the
> page and any other information that can help us optimize
> search/prune operations ?

I don't think the CPU overhead is that significant; we only need to do
the search/prune when a page gets full. We can add flags later if we
feel like it, but let's keep it simple for now.




I am making good progress with the line-pointer redirection stuff.
It's showing tremendous value in keeping the table and index size
under control. But we need to check for the CPU overhead as well
and, if required, optimize there.



Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Seeking Google SoC Mentors

2007-02-27 Thread Peter Eisentraut
Martijn van Oosterhout wrote:
> I think it might be useful to at least encourage people wanting to an
> SoC project to create page on the developer wiki selling their idea.
> You know, questions like: why do we want it? Where do you expect it
> to be included? etc.

They are expected to do as much when they apply for a project at the 
Google site.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Simon Riggs
On Tue, 2007-02-27 at 11:32 -0600, Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 10:49:32AM +, Simon Riggs wrote:
> > > I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> > > standard).  If we did this I'd vote for control via a GUC setting only;
> > > I think that is more useful anyway, as an application can be made to run
> > > with such a setting without invasive source code changes.
> > 
> > OK.
> > 
> > Having read through all of the above things again, ISTM that we should
> > make this functionality available by a new GUC commit_fsync_delay, which
> > must be set explicitly > 0 before this feature can be used at all. If I
> > confused Tom by using commit_delay, then I'll confuse others also and
> > group commit and deferred fsync are different techniques with different
> > robustness guarantees. When enabled it should have a clear message in
> > the log to show that some commits might be using commit_nowait. 
> > 
> > I'd even welcome a more descriptive term that summed up the relaxed
> > transaction guarantee implied by the use of the deferred fsync
> > technique. Perhaps even a very explicit USERSET GUC:
> > 
> > transaction_guarantee = on (default) | off
> 
> So would you set commit_fsync_delay on a per-transaction basis? That
> doesn't make much sense to me... I guess I'm not seeing how you would
> explicitly mark transactions that you didn't want to fsync immediately.

There are 2 GUCs that would control the behaviour here:

transaction_guarantee = on | off
Specifies whether following transaction commits will guarantee
WAL has been flushed prior to reporting commit. If no guarantee
is requested (=off), then data loss may result even after the
transaction has reported its COMMIT message. 
USERSET, but listed in postgresql.conf where default = on
Set this at role, individual session or transaction level to
improve performance of non-critical user data. Use of this
setting does not interfere with the transaction_guarantee
that other transactions may choose. i.e. if somebody else
chooses to take risks with their data it will not affect
the transaction guarantees the server offers to you.
Can only be set off by a transaction if commit_fsync_delay
has been enabled. Use this parameter with care; if you find
yourself wanting to use this parameter all of the time you
should consult a psychiatrist or change open source databases.

commit_fsync_delay = 0...1 microseconds (0 = off, default)
Controls how often the WALWriter issues an XLogFlush()
SIGHUP, so set once for each server, in postgresql.conf
This provides a maximum time window of potential data loss 
in the event of a server crash for transactions that choose
transaction_guarantee = off. This parameter has no effect
on transactions that choose transaction_guarantee = on.
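
To make the division of labour concrete, a minimal sketch of the walwriter
side only (names are illustrative and not taken from any actual patch):

#include <stdbool.h>

extern bool shutdown_requested;
extern void pg_usleep(long microsec);
extern void flush_wal_to_current_insert_point(void);   /* e.g. XLogFlush() */

void
wal_writer_main(long commit_fsync_delay_us)
{
    while (!shutdown_requested)
    {
        /* this delay bounds the loss window for transaction_guarantee=off */
        pg_usleep(commit_fsync_delay_us);
        flush_wal_to_current_insert_point();
    }
}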

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Implicit casts with generic arrays

2007-02-27 Thread Tom Lane
Peter Eisentraut <[EMAIL PROTECTED]> writes:
> I've looked into cutting back on the implicit casts to text, which 
> exposed the following little gem.

> The expressions
> 'abc' || 34
> 34 || 'abc'
> would no longer work, with the following error message:
> ERROR:  22P02: array value must start with "{" or dimension information

Hm, that's annoying.  Not that the expressions fail --- we want them to
--- but that the error message is so unhelpful.

Since ANYARRAY is already special to the type system, I don't have a
problem with inserting some special case to prevent this, but I'm not
sure what the special case should be.  Is it too klugy to say "don't
implicitly cast unknown to anyarray unless the literal's value starts
with { or ["?  We've never made casting decisions depend on the contents
of strings before, and I'm really loath to make 'em do so now.

Seems basically we'd want to not cast unknown to anyarray unless there
is some additional bit of context suggesting that that's the right thing.
But what should that extra requirement be?  Can we go as far as not
doing this cast implicitly at all?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] 7.x horology regression test on Solaris buildfarm machines

2007-02-27 Thread Kris Jurka



On Tue, 27 Feb 2007, Tom Lane wrote:


No, because for the "standard" regression behavior we have variant
result files both with and without the DST law change: see horology and
horology_1.  The issue only comes up for machines that were matching to
horology-no-DST-before-1970.out (which may be only old HPUX) or
horology-solaris-1947.out (which I suppose is only Solaris).  The reason
I got around to making the patch now is I recently updated my old HPUX
machine for the law change, and so 7.x started to fail for me ...



OK, kudu + dragonfly updated.  I probably won't get around to fixing eel 
(cygwin) for some time though.


Kris Jurka


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Casey Duncan


On Feb 26, 2007, at 12:49 PM, Alvaro Herrera wrote:


Jim C. Nasby wrote:


That's why I'm thinking it would be best to keep the maximum size of
stuff for the second worker small. It probably also makes sense to tie
it to time and not size, since the key factor is that you want it to
hit the high-update tables every X number of seconds.

If we wanted to get fancy, we could factor in how far over the vacuum
threshold a table is, so even if the table is on the larger side, if
it's way over the threshold the second vacuum will hit it.


Ok, I think we may be actually getting somewhere.

I propose to have two different algorithms for choosing the tables to
work on.  The worker would behave differently, depending on whether
there is one or more workers on the database already or not.

The first algorithm is the plain threshold equation stuff we use
today.

If a worker connects and determines that no other worker is in the
database, it uses the "plain worker" mode.  A worker in this mode
would examine pgstats, determine what tables to vacuum/analyze, sort
them by size (smaller to larger), and go about its work.  This kind of
worker can take a long time to vacuum the whole database -- we don't
impose any time limit or table size limit on what it can do.

The second mode is the "hot table worker" mode, enabled when the
worker detects that there's already a worker in the database.  In this
mode, the worker is limited to those tables that can be vacuumed in
less than autovacuum_naptime, so large tables are not considered.
Because of this, it'll generally not compete with the first mode above
-- the tables in plain worker were sorted by size, so the small tables
were among the first vacuumed by the plain worker.  The estimated time
to vacuum may be calculated according to autovacuum_vacuum_delay
settings, assuming that all pages constitute cache misses.


Perhaps this has already been proposed, but maybe some combination of
the following inputs could be used to determine which table most needs
vacuuming:

- The proportion of tuples in a table that are dead (updated rows
since last vacuum / estimated row count). This would favor "hot"
tables naturally regardless of size.

- The time since the last vacuum, so that larger tables are eventually
vacuumed even if hot tables totally dominate.

Of course tables that did not pass the minimum parameters specified in
postgresql.conf would not even get considered.

I'm being intentionally vague here on the exact algorithm, since you
all have thought about this more than I have. One thing I like about
the above is that it is independent of table size, and doesn't require
anyone to determine which tables are hot manually.
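
One way to combine those two inputs into a single score, purely as a
sketch (names and weighting are made up, not autovacuum code):

/* Higher score = vacuum sooner.  The dead-tuple fraction favors hot
 * tables regardless of size; the age term guarantees that big, quiet
 * tables eventually get picked too. */
double
vacuum_urgency(double dead_tuples, double est_live_tuples,
               double secs_since_last_vacuum, double target_cycle_secs)
{
    double dead_fraction = dead_tuples /
        (est_live_tuples > 1.0 ? est_live_tuples : 1.0);
    double age_fraction = secs_since_last_vacuum / target_cycle_secs;

    return dead_fraction + age_fraction;
}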


-Casey

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Zeugswetter Andreas ADI SD
> There are 2 GUCs that would control the behaviour here:
> 
> transaction_guarantee = on | off

>   has been enabled. Use this parameter with care; if you find
>   yourself wanting to use this parameter all of the time you
>   should consult a psychiatrist or change open source databases.

If you guarantee your customers that you won't lose a transaction that
has already been committed, you need at least synchronous
replication to a remote site. Since not many installations have that, I
do find it funny that people imply so much safety merely by syncing the
wal.

Without sync replication, a "buffered wal" as proposed only increases the
chance that you lose something. It certainly is no change from safe
haven to abysmal hell.

So I think the part after the semicolon can safely be dropped.
Many will be able to use it always, without changing to another db :-)

> commit_fsync_delay = 0...1 microseconds (0 = off, default)
>   Controls how often the WALWriter issues an XLogFlush()
>   SIGHUP, so set once for each server, in postgresql.conf
>   This provides a maximum time window of potential data loss 
>   in the event of a server crash for transactions that choose
>   transaction_guarantee = off. This parameter has no effect
>   on transactions that choose transaction_guarantee = on.

The wal sync method probably needs to be considered ?
If the wal is opened with open_datasync, how does that affect the
performance, or do you omit the write and leave that to the WALWriter
also ? You probably also want more wal_buffers in such a setup. It may
be better to trigger the WALWriter with wal_buffer fill-level instead of
an extra parameter ? 

It is imho great that you are working on this. I always thought it
impossible, because WAL (write ahead) implied to me that you are not
allowed to make data/index page changes before the wal is on disk.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: conversion efforts (Re: [HACKERS] SCMS question)

2007-02-27 Thread Bruce Momjian
Warren Turkal wrote:
> On Monday 26 February 2007 10:13, Bruce Momjian wrote:
> > Warren Turkal wrote:
> > > On Saturday 24 February 2007 16:47, Chad Wagner wrote:
> > > > head pgsql/src/interfaces/perl5/Attic/test.pl.oldstyle,v
> > > > head     1.3;
> > > > access;
> > > > symbols
> > > >         Release-1-6-0:1.1.1.1
> > > >         creation:1.1.1.1
> > > >         creation:1.1.1;    <<< What the heck happened here?
> > > > locks; strict;
> > > > comment @# @;
> > >
> > > Can a cvs maintainer please correct this file by adding a ";" after the
> > > first "creation" line and removing the second "creation" line?
> >
> > Done. Any other problems?
> 
> This has not been fixed in the rsync repository. Here's one proof for one of 
> the files.
> 
> [EMAIL PROTECTED]:~/projects/postgresql/pgsql-cvs/cvsroot/pgsql/src/interfaces/perl5/Attic$
>  head -n 7 test.pl.newstyle,v
> head     1.3;
> access;
> symbols
>         Release-1-6-0:1.1.1.1
>         creation:1.1.1.1
>         creation:1.1.1;
> locks; strict;
> [EMAIL PROTECTED]:~/projects/postgresql/pgsql-cvs/cvsroot/pgsql/src/interfaces/perl5/Attic$
> 
> I have attached a patch for the files that needs to be applied within
> the cvsroot/pgsql/src/interfaces/perl5/Attic directory in the
> repository.

OK, I definitely had added the semicolons, so I am confused why you
don't see them.  Anyway, I have removed the duplicate 'creation:' lines,
so now there is only one line in each file.  Let me know how that works.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Jonah H. Harris

On 2/27/07, Josh Berkus  wrote:

It seriously narrows down the problem space to know that PostgreSQL does *not*
allow data loss if it's physically possible to prevent it.


Seems like we're trying to protect users from themselves again.  This
is not a PostgreSQL database issue; it's a feature designed for
application developers and cases where there is the possibility of
acceptable data loss for certain types of operations or transactions.
We already have the capacity for loss when fsync is disabled; this is
just a more controlled feature.

Under Oracle, NOWAIT is an asynchronous commit... anyone that uses it
should understand that it's still not on-disk and that they can lose
it in the event of a failure.  That's what Oracle's docs even say.
It's just a risk vs. reward trade off.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 3rd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Josh Berkus
Jonah,

> Under Oracle, NOWAIT is an asynchronous commit... anyone that uses it
> should understand that it's still not on-disk and that they can lose
> it in the event of a failure.  That's what Oracle's docs even say.
> It's just a risk vs. reward trade off.

You're missing my point, which is that nobody has demonstrated that there 
is any real performance gain from this.  I see no reason to implement it 
if there is no performance gain.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Joshua D. Drake
Josh Berkus wrote:
> Simon,
> 
> One of the things I love about doing informal online user support in the 
> PostgreSQL community, and formal user support for Sun's customers, is the 
> almost-ironclad guarantee that if a user has a corrupt database or data loss, 
> one of three things is true:
> a) they didn't apply some recommended PG update;
> b) they have a bad disk controller or disk config;
> c) they have bad ram.

That is pretty spot on.

> 
> It seriously narrows down the problem space to know that PostgreSQL does 
> *not* 
> allow data loss if it's physically possible to prevent it.

But we do, don't we? fsync = off, full_page_writes = off?

> 
> Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for 
> PostgreSQL, I'd like to see the answers to two questions:

I agree with this.

> 
> a) Please give some examples of performance gain on applications using COMMIT 
> NOWAIT.  The performance gain needs to be substantial (like, 50% to 100%) to 
> justify a compromise like this.

WOAH... that seems excessive. There are a couple of things going on here.

1. We have a potential increase in performance for certain workloads.
This is good, but must be proven. Is that proof 50%? Bah.. let's talk
15-25%.

2. We have to accept that not everyone wants IRON clad data integrity.
We have many, many options for dealing with that now, including PITR and
REPLICATION.

> 
> b) Why this and not global temporary tables or queuing?

/me would love global temp tables.

Many of the PostgreSQL users out there today will happily lose 15
minutes of data if it means their data is served 25% faster.

Sincerely,

Joshua D. Drake


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: conversion efforts (Re: [HACKERS] SCMS question)

2007-02-27 Thread Warren Turkal
On Tuesday 27 February 2007 12:26, Bruce Momjian wrote:
> OK, I definately had added the semicolons, so I am confused why you
> don't see them.  Anyway, I have remove the duplicate 'creation:' lines,
> so now there is only one line in each file.  Let me know how that works.

Everything looks good now. Cvs2svn makes it through where the error
originally occurred.

Joshua, can you see if your conversion script can do a conversion without 
deleting the perl5 directory?

wt
-- 
Warren Turkal (w00t)

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Dave Page
Joshua D. Drake wrote:

> 2. We have to accept that not everyone wants IRON clad data integrity.
> We have many, many options for dealing with that now, including PITR and
> REPLICATION.

100% agreed - our own stats collector is extremely similar (in that it
may drop data under high load) to a system I designed a few years back
which collected stats from snmp devices and windows boxes via WMI on a
regular basis. We didn't care if we lost a few values here and there for
that data and would happily have taken the riskier but higher
performance option had we needed to (and had we ever got round to
actually building the system!).

I do agree that some benchmarks are appropriate though.

Regards, Dave

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Jordan Henderson
I am not sure about some of this.  The Oracle option does not change the 
engine fsync behavior I believe.  All that is changed is whether the client 
side waits for the completion of the fsync or not.  If this is true, the data 
store, logs, etc, are all protected.  The user may still experience a data 
loss if a network, or system failure occurred just after the client issued 
the commit.  This would be something like I send the message and exit.  
However prior to the engine receiving the message, a network component fails 
and the message is never delivered.  This will turn into an aborted 
transaction as far as the engine is concerned.  Of course, the exact details 
are in the protocol between the client and the server.

The commit nowait is async with respect to the response to the user, not the 
underlying engine I think.  Therefore performance gains are purely a user 
perspective, not an engine perspective.  Perhaps some network traffic could 
be pruned, not sending the response.

Jordan Henderson

On Tuesday 27 February 2007, Joshua D. Drake wrote:
> Josh Berkus wrote:
> > Simon,
> >
> > One of the things I love about doing informal online user support in the
> > PostgreSQL community, and formal user support for Sun's customers, is the
> > almost-ironclad guarentee that if a user has a corrupt database or data
> > loss, one of three things is true:
> > a) they didn't apply some recommended PG update;
> > b) they have a bad disk controller or disk config;
> > c) they have bad ram.
>
> That is pretty spot on.
>
> > It seriously narrows down the problem space to know that PostgreSQL does
> > *not* allow data loss if it's physically possible to prevent it.
>
> But we do, don't we? fsync = off, full_page_writes = off?
>
> > Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for
> > PostgreSQL, I'd like to see the answers to two questions:
>
> I agree with this.
>
> > a) Please give some examples of performance gain on applications using
> > COMMIT NOWAIT.  The performance gain needs to be substantial (like, 50%
> > to 100%) to justify a compromise like this.
>
> WOAH... that seems excessive. There are a couple of things going on here.
>
> 1. We have a potential increase in performance for certain workloads.
> This is good, but must be proven. Is that proof 50%? Bah... let's talk
> 15-25%.
>
> 2. We have to accept that not everyone wants IRON clad data integrity.
> We have many, many options for dealing with that now, including PITR and
> REPLICATION.
>
> > b) Why this and not global temporary tables or queuing?
>
> /me would love global temp tables.
>
> Many of the PostgreSQL users out there today will happily lose 15
> minutes of data if it means their data is served 25% faster.
>
> Sincerely,
>
> Joshua D. Drake



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: conversion efforts (Re: [HACKERS] SCMS question)

2007-02-27 Thread Andrew Dunstan

Warren Turkal wrote:
> On Tuesday 27 February 2007 12:26, Bruce Momjian wrote:
>> OK, I definitely had added the semicolons, so I am confused why you
>> don't see them.  Anyway, I have removed the duplicate 'creation:' lines,
>> so now there is only one line in each file.  Let me know how that works.
>
> Everything looks good now. Cvs2svn makes it through where the error
> originally occurred.
>
> Joshua, can you see if your conversion script can do a conversion without
> deleting the perl5 directory?

You know, you can prune what is rsynced.

my rsync line looks like this:

 rsync -avzCH --delete --exclude-from=/home/cvsmirror/pg-exclude 
anoncvs.postgresql.org::pgsql-cvs /home/cvsmirror/pg


and the exclude file has these four lines:

 /sup/
 /CVSROOT/loginfo*
 /CVSROOT/commitinfo*
 /CVSROOT/config*

cheers

andrew




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Tom Lane
"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> Josh Berkus wrote:
>> It seriously narrows down the problem space to know that PostgreSQL does 
>> *not* 
>> allow data loss if it's physically possible to prevent it.

> But we do, don't we? fsync = off, full_page_writes = off?

One of the things that's really attractive about the proposed mode is
that it does *not* create a risk of data corruption (assuming that
Simon's analyzed it correctly --- I think the clog code in particular
needs a look).  What you risk is that when the database comes back up,
its state may reflect an instant up to X seconds before the time of the
crash, rather than exactly the crash time.  It seems to me that that's
way better than fsync = off, which allows unlimited corruption.

I agree that we ought to look at some performance numbers before
accepting the patch, but I think Josh's argument that this opens us
up to major corruption problems is probably wrong.  The question is
whether your application can tolerate loss of "very recent" transactions,
and I think there are plenty where it can.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Jonah H. Harris

On 2/27/07, Josh Berkus wrote:
> You're missing my point, which is that nobody has demonstrated that there
> is any real performance gain from this.  I see no reason to implement it
> if there is no performance gain.

While I'll back your request for results, it seems nearly impossible
to expect no performance gain using this.

Results never hurt though.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 3rd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


[HACKERS] high CPU usage for stats collector in 8.2

2007-02-27 Thread Darcy Buskermolen
I'm observing high CPU usage (95%) of a 2.6GHz Opteron by the stats collector 
on an 8.2.3 box.  Investigation has led me to believe that the stats file is 
written a lot more often than once every 500ms; the following shows this 
behavior.

PostgreSQL 8.2.3 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 
20060404 (Red Hat 3.4.6-3)

I ran a 
time for i in `psql -qtc "select generate_series(1,1)"`; do psql  -qtc 
"select 1 from test where msg_id=$i" ; done

which took
real1m23.288s
user0m24.142s
sys 0m21.536s


to execute, during which time I ran a strace on the stats collector which 
produces the following output.  From this it looks like the stats file is 
getting rewritten for each connection teardown, not just every 500ms.

Process 10061 attached - interrupt to quit
Process 10061 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.14   28.811963          17   1663827           write
 18.22    7.701885         123     62808     12793 poll
 11.31    4.783082         365     13101           rename
  0.58    0.246169           5     50006           recvfrom
  0.57    0.241073          18     13101           open
  0.43    0.182816          14     13101           munmap
  0.18    0.076176           6     13101           mmap
  0.17    0.072746           6     13101           close
  0.14    0.060483           5     13101           setitimer
  0.10    0.041344           3     13101     12793 rt_sigreturn
  0.09    0.039240           3     13101           fstat
  0.06    0.024041           2     13110           getppid
------ ----------- ----------- --------- --------- ----------------
100.00   42.281018               1894559     25586 total

As you can see, rename was called far more than the theoretical 167 times (one
per 500ms slice) that elapsed during the test.
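
(To repeat the measurement, here is a minimal sketch; the pgrep pattern is an
assumption and may need adjusting for your installation:)

   STATS_PID=$(pgrep -f 'stats collector')
   strace -c -p "$STATS_PID" &    # -c prints the per-syscall summary table
   STRACE_PID=$!
   sleep 60                       # sample a fixed window
   kill -INT $STRACE_PID          # strace dumps its summary on exit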

Compared to PostgreSQL 8.1.8 on x86_64-pc-linux-gnu, compiled by GCC 
gcc-4.0.gcc-opt (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5) 
[albeit on slower hardware.]

time for i in `psql -qtc "select generate_series(1,1)"`; do psql  -qtc 
"select 1 from test where msg_id=$i" ; done

which took
real9m25.380s
user6m51.254s
sys 1m47.687s
(and therefore should be about 1130 stats write cycles)

and yielded the following strace

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.64   20.422006         334     61212           select
  3.49    0.760963           7    110192           read
  1.82    0.396654          19     21128           write
  0.64    0.139679         126      1112           rename
  0.27    0.057970          52      1112           open
  0.06    0.012177          11      1112           munmap
  0.04    0.008901           8      1112           mmap
  0.03    0.006402           6      1112           close
  0.02    0.004282           4      1112           fstat
------ ----------- ----------- --------- --------- ----------------
100.00   21.809034              199204           total



During this run the stats collector does not even show any CPU usage according 
to top.


both 8.1 and 8.2 have the following postgresql.conf parameters

stats_command_string =  off
stats_start_collector = on
stats_block_level = on
stats_row_level = on



-- 
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Bruce Momjian
Tom Lane wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> writes:
> > COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> > threaten the consistency or robustness of other COMMIT modes. Read that
> > again and think about it, before we go further, please.
> 
> I read that, and thought about it, and don't think I believe it.  The
> problem is that there are more interconnections between different
> transactions than you're allowing for.  In particular you need to
> justify that the behavior is safe for non-transactional operations like
> btree page splits and pg_clog buffer page writes.  The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk.  "Database corrupt"
> is a situation that threatens all your transactions...
> 
> > New commit mode is available by explicit command, or as a default
> > setting that will be applied to all COMMITs, or both.
> 
> I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> standard).  If we did this I'd vote for control via a GUC setting only;
> I think that is more useful anyway, as an application can be made to run
> with such a setting without invasive source code changes.

Agreed on the GUC-only.  I don't see many people using the per-COMMIT
setting without wanting it to be for many transactions in the session.

(And, YES, I am VERY excited we are getting this feature.)
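
For concreteness, a sketch of the two styles under discussion (the GUC name
"commit_nowait" and the audit_log table are illustrative only; nothing is
committed yet):

   # explicit per-transaction form, as proposed:
   psql -c "BEGIN; INSERT INTO audit_log VALUES (1); COMMIT NOWAIT;"

   # hypothetical GUC form: every COMMIT in the session behaves as NOWAIT
   psql -c "SET commit_nowait = on; INSERT INTO audit_log VALUES (2);"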

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Bruce Momjian
Jonah H. Harris wrote:
> On 2/27/07, Josh Berkus  wrote:
> > You're missing my point, which is that nobody has demonstrated that there
> > is any real performance gain from this.  I see no reason to implement it
> > if there is no performance gain.
> 
> While I'll back your request for results, it seems nearly impossible
> to expect no performance gain using this.
> 
> Results never hurt though.

The results are going to be very close to fsync off for sufficiently
high values of the delay, and we _know_ fsync off is a performance win.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Bruce Momjian
Jim C. Nasby wrote:
> On Mon, Feb 26, 2007 at 10:56:58PM +, Simon Riggs wrote:
> > 2. remove fsync parameter
> 
> Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
> want this for things like database restores.

I think we will remove fsync in favor of the new delay, and allow -1 to
be the same behavior as fsync off.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Joshua D. Drake
Bruce Momjian wrote:
> Jonah H. Harris wrote:
>> On 2/27/07, Josh Berkus  wrote:
>>> You're missing my point, which is that nobody has demonstrated that there
>>> is any real performance gain from this.  I see no reason to implement it
>>> if there is no performance gain.
>> While I'll back your request for results, it seems nearly impossible
>> to expect no performance gain using this.
>>
>> Results never hurt though.
> 
> The results are going to be very close to fsync off for sufficiently
> high values delay, and we _know_ fsync off is a performance win.

This is an assumption. Yes, we know that fsync off is a performance win.
We do not know that COMMIT NOWAIT is a performance win. Yes, we can all
sit here and stare at what we *think* will be the result, and yes, I
actually concur that it will be a performance win.

However, I strongly concur that we need at least some evidence. It could
easily be that a misstep in the code causes a loop over the wrong set,
and all the performance we thought we would get is invalid, not because
of theory or what should happen, but because of actual implementation.

Joshua D. Drake


> 


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: conversion efforts (Re: [HACKERS] SCMS question)

2007-02-27 Thread Warren Turkal
On Tuesday 27 February 2007 13:50, Andrew Dunstan wrote:
> You know, you can prune what is rsynced.

I am not sure why you brought this up, but yes I did know this.

> my rsync line looks like this:
>
>   rsync -avzCH --delete --exclude-from=/home/cvsmirror/pg-exclude
> anoncvs.postgresql.org::pgsql-cvs /home/cvsmirror/pg
>
> and the exclude file has these four lines:
>
>   /sup/
>   /CVSROOT/loginfo*
>   /CVSROOT/commitinfo*
>   /CVSROOT/config*

This setup prunes 1.25MB.

Are you suggesting that I prune this from the conversion?

wt
-- 
Warren Turkal (w00t)

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: conversion efforts (Re: [HACKERS] SCMS question)

2007-02-27 Thread Joshua D. Drake
Warren Turkal wrote:
> On Tuesday 27 February 2007 12:26, Bruce Momjian wrote:
>> OK, I definitely had added the semicolons, so I am confused why you
>> don't see them.  Anyway, I have removed the duplicate 'creation:' lines,
>> so now there is only one line in each file.  Let me know how that works.
> 
> Everything looks good now. Cvs2svn makes it through where the error 
> originally occurred.
> 
> Joshua, can you see if your conversion script can do a conversion without 
> deleting the perl5 directory?

Trying now.

> 
> wt


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] [PATCHES]

2007-02-27 Thread Bruce Momjian

FYI, I am not going to be comfortable accepting a final patch that
contains this email signature:

This is an email from Fujitsu Australia Software Technology Pty Ltd, ABN
27 003 693 481. It is confidential to the ordinary user of the email
address to which it was addressed and may contain copyright and/or
  -
legally privileged information. No one else may read, print, store, copy
or forward all or any of it or its attachments. If you receive this
email in error, please return to sender. Thank you.

unless you provide additional details on your contribution of this code
under a BSD license.

---

John Bartlett wrote:
> Hi,
> 
>  
> 
> This is the first posting to the community of the WIP patch for the
> Updatable Cursor implementation.
> 
>  
> 
> I want to confirm that the community is satisfied that the effort to date is
> in a suitable direction and to get comments on the development to date.
> 
>  
> 
> The patch is in the following state:
> 
>  
> 
> The grammar definition is complete and has been yacc'ed to produce gram.c.
> 
>  
> 
> The functions transformUpdateStmt and transformDeleteStmt have been updated
> to process the cursor name and obtain the related portal.
> 
>  
> 
> The change to save the current tuple id (ctid) into the portal, related to
> the Fetch command has been done.
> 
>  
> 
> The ctids relating to the Update/Delete statements' TidScan are being
> extracted to be saved in the executor.
> 
>  
> 
> The parts in progress are to complete the saving of the ctids from the
> TidScan into a list stored in a file, plus the related searching of the list
> for an individual ctid obtained from the Update/Delete statements.
> 
>  
> 
> Unstarted as yet:
> 
>  
> 
> 1) Correctly process, in the database, the Delete / Update of the
> tuple from the cursor.
> 
> 2) To enable the cursor name to be defined as a parameter in a
> PREPARE statement and provided as part of an EXECUTE statement.
> 
>  
> 
> The community may wish to comment on the following issue:
> 
>  
> 
> 1) At present the file that will contain the list of ctids is going into
> a new directory called pg_ctids, analogous to pg_twophase, and also stored
> in the pg_data directory.
> 
>  
> 
> Regards,
> John Bartlett
> 
> This is an email from Fujitsu Australia Software Technology Pty Ltd, ABN 27 
> 003 693 481. It is confidential to the ordinary user of the email address to 
> which it was addressed and may contain copyright and/or legally privileged 
> information. No one else may read, print, store, copy or forward all or any 
> of it or its attachments. If you receive this email in error, please return 
> to sender. Thank you.
> 
> If you do not wish to receive commercial email messages from Fujitsu 
> Australia Software Technology Pty Ltd, please email [EMAIL PROTECTED]

[ Attachment, skipping... ]

> 
> ---(end of broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>choose an index scan if your joining column's datatypes do not
>match

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PATCHES]

2007-02-27 Thread Joshua D. Drake
Bruce Momjian wrote:
> FYI, I am not going to be comfortable accepting a final patch that
> contains this email signature:
> 
>   This is an email from Fujitsu Australia Software Technology Pty Ltd, ABN
>   27 003 693 481. It is confidential to the ordinary user of the email
>   address to which it was addressed and may contain copyright and/or
> -
>   legally privileged information. No one else may read, print, store, copy
>   or forward all or any of it or its attachments. If you receive this
>   email in error, please return to s ender. Thank you.
> 
> unless you provide additional details on your contribution of this code
> under a BSD license.

Gonna have to concur with that. Not that the sig is legally binding
anyway, but we do need to have a disclaimer in the email stating that you
are assigning it to PGDG.

Joshua D. Drake


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] Synchronized Scan update

2007-02-27 Thread Jeff Davis

I have found some interesting results from my tests with the
Synchronized Scan patch I'm working on.

The two benefits that I hope to achieve with the patch are:
(1) Better caching behavior with multiple sequential scans running in
parallel
(2) Faster sequential reads from disk and less seeking

I have consistently seen #1 to be true. There is still more testing to
be done (hopefully soon), but I haven't found a problem yet. And the
benefits I've seen are very substantial, which isn't hard, since in the
typical case, a large sequential scan will have 0% cache hit rate. These
numbers were retrieved using log_executor_stats=on.

#2 however, is a little trickier. IIRC, Tom was the first to point out
that the I/O system might not recognize that reads coming from different
processes are indeed one sequential read.

At first I never saw the problem actually happen, and I assumed that the
OS was being smart enough. However, recently I noticed this problem on
my home machine, which experienced great caching behavior but poor I/O
throughput (as measured by iostat). My home machine was using the Linux
CFQ io scheduler, and when I swapped the CFQ io scheduler for the
anticipatory scheduler (AS), it worked great. When I sent Josh my patch
(per his request) I mentioned the problem I experienced.

Then I started investigating, and found some mixed results. My test was
basically to use iostat (or zpool iostat) to measure disk throughput,
and N processes of "dd if=bigfile of=/dev/null" (started simultaneously)
to run the test. I consider the test to be "passed" if the additional
processes did not interfere (i.e. each process finished as though it
were the only one running). Of course, all tests were I/O bound.
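
(Concretely, a minimal sketch of that test; "bigfile" and the reader count
are placeholders:)

   # start N sequential readers simultaneously; the test "passes" if each
   # finishes as though it were the only one running
   for i in 1 2 3 4; do
     dd if=bigfile of=/dev/null bs=8k &
   done
   wait
   # meanwhile, in another terminal:  iostat -x 1   (or: zpool iostat 1)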

My home machine (core 2 duo, single SATA disk, intel controller):
Linux/ext3/AS: passed
Linux/ext3/CFQ: failed
Linux/ext3/noop: passed
Linux/ext3/deadline: passed

Machine 2 (old thinkpad, IDE disk):
Solaris/UFS: failed
Solaris/ZFS: passed

Machine 3 (dell 2950, LSI PERC/5i controller, 6 SAS disks, RAID-10,
adaptive read ahead):
FreeBSD/UFS: failed

(I suspect the last test would be fine with read ahead always on, and it
may just be a problem with the adaptive read ahead feature)

There are a lot of factors involved, because several components of the
I/O system have the ability to reorder requests or read ahead, such as
the block layer and the controller. 

The block request ordering isn't the only factor because Solaris/UFS
only orders the requests by cylinder and moves only in one direction
(i.e. looks like a simple elevator algorithm that isn't affected by
process id). At least, that's how I understand it.

Readahead can't be the only factor either because replacing the io
scheduler in Linux solved the problem, even when that replacement was
the noop scheduler.
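
(For reference, on a 2.6 Linux kernel the scheduler can be swapped per device
at runtime; /dev/sda is an assumption:)

   cat /sys/block/sda/queue/scheduler    # e.g. noop anticipatory deadline [cfq]
   echo anticipatory > /sys/block/sda/queue/scheduler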

Anyway, back to the patch, it looks like there are some complications if
you try to use it with the wrong combination of fs, io scheduler, and
controller. 

The patch is designed for certain query patterns anyway, so I don't
think that this is a show-stopper. Given the better cache behavior, it
seems like it's really the job of the I/O system to get a single,
sequential stream of blocks efficiently.

The alternative would be to have a single block-reader process, which I
don't think we want to do. However, I/O systems don't really seem to
know how to handle multiple processes reading from the same file very
well.

Comments?

Regards,
Jeff Davis


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] No ~ operator for box, point

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 20, 2007 at 04:22:27PM -0500, Bruce Momjian wrote:
> Added to TODO:
> 
>   * Add missing operators for geometric data types
>   
> Some geometric types do not have the full suite of geometric 
> operators,
> e.g. box @> point
> 

I've started looking at this, and ISTM that at least part of this could
be solved by allowing some implicit casts. Given that the geometry data
types are point, line, lseg, box, path, polygon, circle, I think the
following should be safe:

box -> polygon
lseg -> open path
polygon -> closed path

I would argue that this is similar to int2 -> int4 -> int8: a box is a
type of polygon, a polygon is a closed path (that doesn't intersect,
which needs to be added to the docs, btw), and a line segment is an open
path.

Is there any reason not to make these casts implicit? If there is,
what's the best way to go about adding operators for cases where
equivalent operators exist? (IE: @>(box,point) doesn't exist, but
@>(polygon,point) does, and should suffice for @>(box,point) with
appropriate casting)
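
For illustration, a sketch of one such cast; polygon(box) is an existing cast
function, and since an explicit box-to-polygon cast is already built in, the
real change would be flipping its context to implicit rather than creating it
from scratch:

   psql -c "CREATE CAST (box AS polygon) WITH FUNCTION polygon(box) AS IMPLICIT;"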

Actually, looking at one example (@(point,box) vs @(point,poly)), part
of the reason is that it's far simpler to deal with a box than a generic
polygon.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES]

2007-02-27 Thread Neil Conway
On Tue, 2007-02-27 at 14:52 -0800, Joshua D. Drake wrote:
> Gonna have to concur with that. Not that the sig is legally binding
> anyway, but we do need to have a disclaimer in the email stating that you
> are assigning it to PGDG.

I think it's pretty silly to start caring about this now. Do you think
that in the absence of any signature/disclaimer attached to a patch,
then the copyright for the change is "implicitly" assigned to PGDG? (I'm
not a lawyer, but I believe that's not the case.)

-Neil



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES]

2007-02-27 Thread Bruce Momjian
Neil Conway wrote:
> On Tue, 2007-02-27 at 14:52 -0800, Joshua D. Drake wrote:
> > Gonna have to concur with that. Not that the sig is legally binding
> > anyway, but we do need to have a disclaimer in the email stating that you
> > are assigning it to PGDG.
> 
> I think it's pretty silly to start caring about this now. Do you think
> that in the absence of any signature/disclaimer attached to a patch,
> then the copyright for the change is "implicitly" assigned to PGDG? (I'm
> not a lawyer, but I believe that's not the case.)

I think the issue is _explicit_ vs _implicit_.  I think the email
signature was too explicit.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 05:18:28PM -0500, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > On Mon, Feb 26, 2007 at 10:56:58PM +, Simon Riggs wrote:
> > > 2. remove fsync parameter
> > 
> > Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
> > want this for things like database restores.
> 
> I think we will remove fsync in favor of the new delay, and allow -1 to
> be the same behavior as fsync off.

Well, presumably we'd still allow fsync for some number of versions...

Actually, I don't know that combining both settings is a wise move. The
delay should still provide crash protection, whereas with fsync=off
you've got absolutely no protection from anything. That's a huge
difference, and one that IMHO warrants a separate setting (and a big,
fat WARNING in the comment for that setting).
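
For illustration, a postgresql.conf sketch of keeping the two settings
separate (the delay parameter's name is hypothetical, not a settled GUC):

   fsync = on                   # off skips *all* fsyncs - corruption risk
   #commit_fsync_delay = 0      # hypothetical: 0 = wait at COMMIT (current
                                # behavior); N = lose at most ~N ms of recent
                                # commits after a crash; -1 = never wait
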
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] No ~ operator for box, point

2007-02-27 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> Is there any reason not to make these casts implicit?

To the extent that you're trying to provide operators that should be
indexable, that won't solve the problem.

I'm unconvinced that these casts should be implicit anyway, as the types
are really considerably different than, say, int2 and int4 --- there are
no operations on int4 that "don't make sense" for an int2, the way there
are for polygon vs. box.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Bruce Momjian
Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 05:18:28PM -0500, Bruce Momjian wrote:
> > Jim C. Nasby wrote:
> > > On Mon, Feb 26, 2007 at 10:56:58PM +, Simon Riggs wrote:
> > > > 2. remove fsync parameter
> > > 
> > > Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
> > > want this for things like database restores.
> > 
> > I think we will remove fsync in favor of the new delay, and allow -1 to
> > be the same behavior as fsync off.
> 
> Well, presumably we'd still allow fsync for some number of versions...

No.

> Actually, I don't know that combining both settings is a wise move. The
> delay should still provide crash protection, whereas with fsync=off
> you've got absolutely no protection from anything. That's a huge
> difference, and one that IMHO warrants a separate setting (and a big,
> fat WARNING in the comment for that setting).

Yes, it needs a warning, or perhaps we just tell people to set it to
something high and that is all they can do.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES]

2007-02-27 Thread Joshua D. Drake
Neil Conway wrote:
> On Tue, 2007-02-27 at 14:52 -0800, Joshua D. Drake wrote:
>> Gonna have to concur with that. Not that the sig is legally binding
>> anyway, but we do need to have a disclaimer in the email stating that you
>> are assigning it to PGDG.
> 
> I think it's pretty silly to start caring about this now. Do you think
> that in the absence of any signature/disclaimer attached to a patch,
> then the copyright for the change is "implicitly" assigned to PGDG? (I'm
> not a lawyer, but I believe that's not the case.)

I can tell you that it depends on the individual's relationship with
their employer. The employer may have an agreement (most do) stating that
whatever the employee does is owned by the employer.

Thus we may literally not have rights to the code. Do you really want to
go down the path where, in 2 years, Fujitsu (no offense, Fujitsu, but you
are the topic) decides that the code they provided is owned by them and
they didn't give us permission?

Joshua D. Drake


> 
> -Neil
> 
> 


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 07:17:37PM -0500, Bruce Momjian wrote:
> > Actually, I don't know that combining both settings is a wise move. The
> > delay should still provide crash protection, whereas with fsync=off
> > you've got absolutely no protection from anything. That's a huge
> > difference, and one that IMHO warrants a separate setting (and a big,
> > fat WARNING in the comment for that setting).
> 
> Yes, it needs a warning, or perhaps we just tell people to set it to
> something high and that is all they can do.

Before doing that I'd want to see how the performance compares to
fsync=off. My guess is that fsync=off is a big gain during checkpoints.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: conversion efforts (Re: [HACKERS] SCMS question)

2007-02-27 Thread Joshua D. Drake
Warren Turkal wrote:
> On Tuesday 27 February 2007 12:26, Bruce Momjian wrote:
>> OK, I definitely had added the semicolons, so I am confused why you
>> don't see them.  Anyway, I have removed the duplicate 'creation:' lines,
>> so now there is only one line in each file.  Let me know how that works.
> 
> Everything looks good now. Cvs2svn makes it through where the error 
> originally occurred.
> 
> Joshua, can you see if your conversion script can do a conversion without 
> deleting the perl5 directory?

My conversion was successful. You can see here:

http://projects.commandprompt.com/public/pgsql

> 
> wt


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: conversion efforts (Re: [HACKERS] SCMS question)

2007-02-27 Thread Andrew Dunstan

Warren Turkal wrote:
> On Tuesday 27 February 2007 13:50, Andrew Dunstan wrote:
>> You know, you can prune what is rsynced.
>
> I am not sure why you brought this up, but yes I did know this.

Well, I thought it might be useful to prune that directory you were 
having trouble with. But we have apparently fixed the problem.

>> my rsync line looks like this:
>>
>>   rsync -avzCH --delete --exclude-from=/home/cvsmirror/pg-exclude
>> anoncvs.postgresql.org::pgsql-cvs /home/cvsmirror/pg
>>
>> and the exclude file has these four lines:
>>
>>   /sup/
>>   /CVSROOT/loginfo*
>>   /CVSROOT/commitinfo*
>>   /CVSROOT/config*
>
> This setup prunes 1.25MB.
>
> Are you suggesting that I prune this from the conversion?


You don't need /sup/ for the conversion, I believe - isn't that just 
there for cvsup support? Of course, 1.25MB is fairly small in a 341MB 
repo, but every bit helps I guess.


cheers

andrew


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] COMMIT NOWAIT Performance Option

2007-02-27 Thread Gregory Stark

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:

>>> On 2/27/07, Josh Berkus  wrote:

 I see no reason to implement it if there is no performance gain.

> However, I strongly concur that we need at least some evidence. It could
> easily be that a misstep in the code, causes a loop over the wrong set
> and all the performance we thought we would get is invalid, not because
> of theory or what should happen, but because of actual implementation.

It rather sounds like you're asking for a proof that Simon can write bug-free
code before you allow him to write any code.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

