Re: [HACKERS] pg_restore stuck in a loop?

2005-04-28 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 On Tue, 2005-04-26 at 23:22 -0400, Tom Lane wrote:
  You tell us ;-).  You've got the test case, attach to it with a debugger
  and find out what it's doing.

 I wasn't entirely sure how to catch it in action so I just used CTRL+C
 to interrupt the pg_restore process, and got these as backtraces:

Hm, I wonder if there is something broken about the SQL parsing logic
in _sendSQLLine, such that it could go into an infinite loop given the
right input?
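
(Illustration, not the actual pg_restore source: scan_sql and its states are
invented here. The classic shape of such a bug is a scanner state machine
with one branch that neither advances the input pointer nor changes state,
so the right input makes it spin forever on a single byte.)

    #include <ctype.h>
    #include <stdio.h>

    enum scan_state { NORMAL, IN_QUOTE };

    /* Toy statement scanner with a deliberate non-progress bug. */
    static void
    scan_sql(const char *p)
    {
        enum scan_state state = NORMAL;

        while (*p)
        {
            if (state == IN_QUOTE)
            {
                if (*p == '\'')
                    state = NORMAL;
                p++;
            }
            else if (*p == '\'')
            {
                state = IN_QUOTE;
                p++;
            }
            else if (*p == ';')
            {
                printf("statement boundary\n");
                p++;
            }
            else if (isspace((unsigned char) *p))
                continue;       /* BUG: no p++, no state change -> hangs */
            else
                p++;
        }
    }

    int
    main(void)
    {
        scan_sql("SELECT 1;");  /* spins forever on the first space */
        return 0;
    }

A debugger backtrace taken from a process stuck in such a loop would show it
parked in the scanning function on every sample.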

regards, tom lane



Re: [HACKERS] Increased company involvement

2005-04-28 Thread Thomas Hallgren
Joshua D. Drake wrote:
  However, there was a lot of coordination that happened with Fujitsu that
  I don't see happening with the current companies involved.  Companies
  are already duplicating work that is also done by community members or
  by other companies.

 That is bound to happen no matter what. Look at plJava and plJ. Some
 people just feel that their way is better. Some people just don't get
 along etc...

Actually, I think that PL/Java versus PL/J is a good example of where 
some coordination would have helped a lot.

The short story:
I was between jobs from December 2003 through February the following
year, and a lot of work on PL/Java was done during that time. I had no
clue that there was another active project with similar objectives until
after my first fully functional submission to gborg. Had I known, the
outcome would have been different. Today there are ongoing and very
active efforts to collaborate.

The longer story (if anyone is interested):
Before I started PL/Java I informed the community of my intentions (see
the hackers thread "pljava revisited",
http://archives.postgresql.org/pgsql-hackers/2003-12/msg00310.php ). I
got a lot of feedback and good advice, such as using C instead of C++,
hosting the project at gborg, etc., but nobody told me back then that
there was an active PL/J project. I found traces of that project on
sourceforge but it seemed to have been dead for over a year. At that
time there was no redirect from sourceforge.

Jan Wieck started the thread "PL/Java issues"
(http://archives.postgresql.org/pgsql-hackers/2003-12/msg00819.php) in
which I made two major posts. Nothing in that thread indicated that
there was an ongoing project, and I got no reply to my posts. On January
7th, I made my first submission to gborg.

When I realized, in mid-February, that the PL/J project was indeed alive
and active, I wrote "PL/Java - next step?"
(http://archives.postgresql.org/pgsql-hackers/2004-02/msg00713.php)
where I outlined possible futures for PL/Java and PL/J. The outcome of
that was that PL/J (Dave Cramer and Laszlo Hornyak) and I had an IRC
meeting where we agreed on some limited collaboration (see "Minutes from
PL/Java - next step IRC",
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00171.php ).

From that point until a month or so ago, the active collaboration
between the projects could have been better. Some things did happen,
though. Laszlo asked me to publish some PL/Java interfaces in a public
maven repository, which I did, and we had some discussions. I made an
attempt to have a major sponsor step in and take the lead in a project
aiming to provide a flexible solution where a choice of approach could
be made, but the sponsor understandably wanted to wait and see.

Today, we (the PL/Java and PL/J project members) are making a common
effort to factor out client tools that can indeed be common into a
separate project. We are also discussing how to make the PostgreSQL user
experience as similar as possible, thus allowing use of PL/Java or PL/J
without changing anything but configuration.

Kind regards,
Thomas Hallgren


Re: [HACKERS] [pgsql-advocacy] Increased company involvement

2005-04-28 Thread Hannu Krosing
On Wed, 2005-04-27 at 22:21 -0700, Joshua D. Drake wrote:
  However, there was a lot of coordination that happened with Fujitsu that
  I don't see happening with the current companies involved.  Companies
  are already duplicating work that is also done by community members or
  by other companies.

 That is why we have 80 Linux distributions and a dozen FreeBSD 
 distributions (can I include MacOSX?).

I guess a more appropriate comparison would be to distribution-specific
linux kernels, i.e. the RedHat linux kernel vs. the SuSE linux kernel
vs. the vanilla or "real" :) linux kernel.

The big issue is communication.  Because the
  PostgreSQL code base is common for most of the companies involved, there
  has to be coordination in what they are working on and their approaches.
 
 I can see this as an issue but sometimes the community is a hampering
 force as well. I recognize the community goals and respect them but in
 some things the community can move really slow. From what I can tell
 some of this is caused by the "no new features" rules etc...
 
 In business moving slow can mean death to a project.
 
 Which is why (hate to beat a dead horse) many OSS projects have moved
 to 6 month release cycles.

Well, it is a two-sided thing. On one hand, businesses usually need new
features yesterday, but on the other hand, a business would lose most
of the benefit of getting the feature fast if it is not included in the
main branch along the road, preferably in the next official release,
because said business would be dependent on the implementor of its
specific feature for integrating _all_ other new and desirable features
of the next release into its specific version of postgres.

  is happening.  I realize this is hard for companies because their
  efforts are in some ways part of their profitability.
 
 That is true, there are sometimes strategic reasons to not announce a
 project.

I can see no strategic advantage for sponsors in not announcing the
project. There may be some small tactical wins for the actual
implementors, but not even tactical wins for the sponsors.

There may be some strategic/tactical reasons for sponsors to announce
who they (the sponsors) are, but I can't see any reason not to announce
the project or not to encourage/ask/demand the implementor to do so.

  profitability require duplication of effort and code collisions?  I am
  not sure, but if it does, we are in trouble.  I am not sure the
  community has the resources to resolve that many collisions.
 
 Which is why you are starting to see forks such as Bizgres but it is 
 also why you are seeing forks go away (Mammoth PostgreSQL).

At first glance at least, BizGres looks like a community-oriented
project, and I hope that BizGres will be a testbed/early implementation
of some data warehousing features, which will be integrated back into
postgres at the first opportunity.

But I too expected the discussion to take place on pgsql-hackers, not
some half-hidden mailing list on pgfoundry. Or at least an announcement
of that mailing list to be made on pgsql-hackers.

OTOH, there are some things (2PC, recursive queries, ...) which are
discussed on pgsql-hackers and are still lingering somewhere, and for
various reasons (one of them may be lack of time/money/sponsor pressure
for the developer) are not in postgres proper yet.

OTOOH, I see that a growing number of companies who might think of
sponsoring PG development are interested in BI/DW and OLAP features in
addition to plain OLTP enhancements.

-- 
Hannu Krosing [EMAIL PROTECTED]




Re: [HACKERS] [pgsql-advocacy] Increased company involvement

2005-04-28 Thread Bruce Momjian
Hannu Krosing wrote:
  Which is why (hate to beat a dead horse) many OSS projects have moved
  to 6 month release cycles.
 
 Well, it is a two-sided thing. On one hand, businesses usually need new
 features yesterday, but on the other hand, a business would lose most
 of the benefit of getting the feature fast if it is not included in the
 main branch along the road, preferably in the next official release,
 because said business would be dependent on the implementor of its
 specific feature for integrating _all_ other new and desirable features
 of the next release into its specific version of postgres.

Yes, you can bet that will happen.  You can also bet that the community
release will have 2X more feature additions than any commercial release
that is a version behind unless _huge_ sums of money are thrown at the
commercial release.

-- 
  Bruce Momjian                      |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us             |  (610) 359-1001
  +  If your life is a hard drive,   |  13 Roberts Road
  +  Christ can be your backup.      |  Newtown Square, Pennsylvania 19073



[HACKERS] Feature freeze date for 8.1

2005-04-28 Thread Bruce Momjian
You might remember that when we released 8.0, the plan was to have a
12-month development cycle for 8.1, unless there were Win32 problems
that required complex fixes, in which case we would have a shorter 8.1
cycle.

Well, the good news is that there have been almost no Win32 problems, but
the other good news is that we are getting a lot of powerful features
for 8.1 already:

o two-phase commit (Heikki Linnakangas, almost done)
o multiple OUT function parameters (Tom, done)
o bitmapped indexes (Tom, almost done)
o shared row locks (Alvaro, almost done)
o integrated auto-vacuum (Bruce)
o buffer cache fixes for SMP (Tom, done)

It is possible all these items will be done by sometime in June.  Now,
if that happens, do we want to continue with the 12-month plan or
shorten the 8.1 release cycle, perhaps targeting a release in the
September/October timeframe?

The current core proposal is to do feature freeze on July 1, with the
understanding that we will have most of the items above done by then
and the outstanding patches applied.

-- 
  Bruce Momjian                      |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us             |  (610) 359-1001
  +  If your life is a hard drive,   |  13 Roberts Road
  +  Christ can be your backup.      |  Newtown Square, Pennsylvania 19073



Re: [HACKERS] [proposal] protocol extension to support loadable stream filters

2005-04-28 Thread Tom Lane
Brent Verner [EMAIL PROTECTED] writes:
   Would it be sane to recognize a specific PG_PROTOCOL_MAJOR
 to enter the filter-negotiation process?  PG_PROTOCOL_MINOR
  would then be used to look up and call a pointer to the filter's
 create() in CreateStreamFilter...

Looks reasonable enough to me ...

regards, tom lane



Re: [HACKERS] Feature freeze date for 8.1

2005-04-28 Thread Bruno Wolff III
On Thu, Apr 28, 2005 at 09:02:40 -0400,
  Bruce Momjian pgman@candle.pha.pa.us wrote:
 Well, the good news is that there have been almost no Win32 problems, but
 the other good news is that we are getting a lot of powerful features
 for 8.1 already:

You forgot to list the indexed aggregate feature for max and min. While
this isn't that important for experienced postgres users, it is a gotcha
for new users. Between this, integrated autovacuum, and the cross-type
index changes in 8.0, we have covered almost all of the newbie gotchas.
This should make Postgres effectively as easy for new users to get
started with as MySQL. That could significantly boost usage at the low
end once word gets out.



Re: [HACKERS] Feature freeze date for 8.1

2005-04-28 Thread Rob Butler
As a user, I would definitely prefer to see 8.1 released sooner with the
feature set listed below than wait another 6+ months for a few other
features.  Additionally, the beta may go smoother/faster if you don't
have too many huge features going in at once.

Just my opinion.
Later
Rob
--- Bruce Momjian pgman@candle.pha.pa.us wrote:
 You might remember that when we released 8.0, the plan was to have a
 12-month development cycle for 8.1, unless there were Win32 problems
 that required complex fixes, in which case we would have a shorter 8.1
 cycle.

 Well, the good news is that there have been almost no Win32 problems,
 but the other good news is that we are getting a lot of powerful
 features for 8.1 already:

   o two-phase commit (Heikki Linnakangas, almost done)
   o multiple OUT function parameters (Tom, done)
   o bitmapped indexes (Tom, almost done)
   o shared row locks (Alvaro, almost done)
   o integrated auto-vacuum (Bruce)
   o buffer cache fixes for SMP (Tom, done)

 It is possible all these items will be done by sometime in June.  Now,
 if that happens, do we want to continue with the 12-month plan or
 shorten the 8.1 release cycle, perhaps targeting a release in the
 September/October timeframe?

 The current core proposal is to do feature freeze on July 1, with the
 understanding that we will have most of the items above done by then
 and the outstanding patches applied.



Re: [HACKERS] [pgsql-advocacy] Increased company involvement

2005-04-28 Thread Robert Treat
On Thursday 28 April 2005 01:48, Joshua D. Drake wrote:
  Do companies want to write for Blue Hat PostgreSQL and Suza PostgreSQL
  because that might be what happens if we don't stay organized?  In
  fact, it might be happening already.

 Well that depends... If the companies are writing for Pervasive
 PostgreSQL I don't think they would have a problem with that ;).

 And I do agree with you Bruce, it is happening already -- I don't think
 there is any question in that.


ISTM the allure of differentiation and branding is going to be too strong
for us to prevent such things.  An easy way to differentiate is to add
some proprietary/unique extension to the main code and then package that
up; if you have to put all your extensions into the community version,
then you lose this advantage over your competitors.  (Mammoth
PostgreSQL/Replicator is an example of this.)  The same holds true for
branding: if you're Pervasive, you want to sell Pervasive Postgres rather
than PostgreSQL, because you get to push your name out there and get
people thinking about your company whenever they talk about the database.

I think our goal is to encourage companies to push these changes into the
core as much as possible, pointing out the advantages community support
brings, like ongoing maintenance, support in add-on tools like pgadmin or
phppgadmin, and eliminating the chance that someone else will submit a
similar solution that gets accepted into the community code, thereby
deprecating the work they have already done.

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL



Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Mischa Sandberg
Quoting Josh Berkus josh@agliodbs.com:

  Perhaps I can save you some time (yes, I have a degree in Math). If I
  understand correctly, you're trying to extrapolate from the correlation
  between a tiny sample and a larger sample. Introducing the tiny sample
  into any decision can only produce a less accurate result than just
  taking the larger sample on its own; GIGO. Whether they are consistent
  with one another has no relationship to whether the larger sample
  correlates with the whole population. You can think of the tiny sample
  like anecdotal evidence for wonderdrugs.

 Actually, it's more to characterize how large of a sample we need.  For
 example, if we sample 0.005 of disk pages, and get an estimate, and then
 sample another 0.005 of disk pages and get an estimate which is not even
 close to the first estimate, then we have an idea that this is a table
 which defies analysis based on small samples.  Whereas if the two
 estimates are < 1.0 stdev apart, we can have good confidence that the
 table is easily estimated.  Note that this doesn't require progressively
 larger samples; any two samples would work.

We're sort of wandering away from the area where words are a good way
to describe the problem. Lacking a common scratchpad to work with, could
I suggest you talk to someone you feel has a background in stats, and
have them draw for you why this doesn't work?

About all you can get out of it is, if the two samples are
disjunct by a stddev, yes, you've demonstrated that the union
of the two populations has a larger stddev than either of them;
but your two stddevs are less info than the stddev of the whole.
Breaking your sample into two (or three, or four, ...) arbitrary pieces
and looking at their stddevs just doesn't tell you any more than what
you start with.
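
(For reference, the textbook identity behind that point -- standard
pooled-moment algebra, not specific to this thread.  For pieces of sizes
n_1 and n_2 with means mu_1, mu_2 and variances sigma_1^2, sigma_2^2,
the mean and variance of the union are

    \mu = \frac{n_1 \mu_1 + n_2 \mu_2}{n_1 + n_2}, \qquad
    \sigma^2 = \frac{n_1\,(\sigma_1^2 + (\mu_1 - \mu)^2)
                   + n_2\,(\sigma_2^2 + (\mu_2 - \mu)^2)}{n_1 + n_2}

so the whole-sample stddev is recoverable from the pieces plus their
means; splitting a sample adds no information, and all a "disjunct" pair
buys you is a large between-means term.)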

-- 
Dreams come true, not free. -- S.Sondheim, ITW 




Re: [HACKERS] [pgsql-advocacy] Increased company involvement

2005-04-28 Thread Magnus Hagander
   However, there was a lot of coordination that happened with Fujitsu
   that I don't see happening with the current companies involved.
   Companies are already duplicating work that is also done by
   community members or by other companies.

  That is why we have 80 Linux distributions and a dozen FreeBSD
  distributions (can I include MacOSX?).

 I guess a more appropriate comparison would be to distribution-specific
 linux kernels, i.e. the RedHat linux kernel vs. the SuSE linux kernel
 vs. the vanilla or "real" :) linux kernel.

As someone who has been bitten by the RH vs SuSE vs kernel.org kernel
several times over the past couple of weeks, we should *really* try to
avoid this as much as possible. It's a *major* hassle for the end user
if "postgresql" is no longer the same as "postgresql".

Not exactly sure what to do to avoid it, though ;-) Other than encourage
the companies that develop extensions to submit these to the community
distribution and work to get them accepted there...


  I can see this as an issue but sometimes the community is a hampering
  force as well. I recognize the community goals and respect them but in
  some things the community can move really slow. From what I can tell
  some of this is caused by the "no new features" rules etc...

  In business moving slow can mean death to a project.

  Which is why (hate to beat a dead horse) many OSS projects have moved
  to 6 month release cycles.

 Well, it is a two-sided thing. On one hand, businesses usually need new
 features yesterday, but on the other hand, a business would lose most
 of the benefit of getting the feature fast if it is not included in the
 main branch along the road, preferably in the next official release,
 because said business would be dependent on the implementor of its
 specific feature for integrating _all_ other new and desirable features
 of the next release into its specific version of postgres.

An example of this might be Powergres - not sure if it was ever planned
to go into the community version, but it must be a pain to maintain the
threaded version alongside the main one. I guess that's why it's still
based on 7.3...


//Magnus



Re: [HACKERS] [JDBC] Statement Timeout and Locking

2005-04-28 Thread Tom Lane
Kris Jurka [EMAIL PROTECTED] writes:
 This actually is the problem.  It works as three separate statements, but 
 fails as one.   The server doesn't seem to recognize the SET when other 
 commands come in before Sync.

[ reads some code... ]  The problem is that postgres.c only inspects
StatementTimeout when start_xact_command starts a transaction command,
and the placement of finish_xact_command calls is such that that's
not going to happen until after Sync.  So the upshot is that the
SET statement_timeout isn't going to take effect until after Sync
(or after a transaction-control command, but there are none in your
example).

This suggests that the statement_timeout stuff is being done at the wrong
place.  I'm not sure exactly what the more-right places would be for
V3 protocol though.  What exactly would you expect statement_timeout to
cover in a Parse/Bind/Execute world --- especially if those aren't
issued in a purely sequential fashion?

A very simple definition would be that each Parse, Bind, or Execute
action is independently constrained by statement_timeout, but that would
act significantly differently from the simple-query case if planning
takes long enough to be a factor.  (Bear in mind that planning can
include constant-folding of user-defined functions, so at least in some
cases you can imagine people would want statement_timeout to constrain
planning.)  Also that would imply three times as many timer
enable/disable kernel calls, which might be an annoying amount of
overhead.
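
(To make the overhead point concrete, an illustrative sketch using plain
setitimer() -- not the actual backend timeout code:)

    #include <sys/time.h>

    /* Arm a one-shot SIGALRM after timeout_ms milliseconds. */
    static void
    enable_statement_timer(int timeout_ms)
    {
        struct itimerval t = {{0, 0}, {timeout_ms / 1000,
                                       (timeout_ms % 1000) * 1000}};

        setitimer(ITIMER_REAL, &t, NULL);
    }

    /* Cancel the pending timer. */
    static void
    disable_statement_timer(void)
    {
        struct itimerval off = {{0, 0}, {0, 0}};

        setitimer(ITIMER_REAL, &off, NULL);
    }

    /*
     * Under the "each message independently constrained" definition,
     * the extended protocol costs three enable/disable pairs per
     * statement:
     *
     *     Parse:   enable ... disable
     *     Bind:    enable ... disable
     *     Execute: enable ... disable
     *
     * versus one pair around a simple Query message.
     */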

Anyway the short-term answer for Markus is "don't do it that way."
We ought to think about making the backend's behavior more consistent,
though.

regards, tom lane



Re: [HACKERS] Feature freeze date for 8.1

2005-04-28 Thread Andreas Pflug
Bruce Momjian wrote:
 You might remember that when we released 8.0, the plan was to have a
 12-month development cycle for 8.1, unless there were Win32 problems
 that required complex fixes, in which case we would have a shorter 8.1
 cycle.

 Well, the good news is that there have been almost no Win32 problems,
 but the other good news is that we are getting a lot of powerful
 features for 8.1 already:

   o two-phase commit (Heikki Linnakangas, almost done)
   o multiple OUT function parameters (Tom, done)
   o bitmapped indexes (Tom, almost done)
   o shared row locks (Alvaro, almost done)
   o integrated auto-vacuum (Bruce)
   o buffer cache fixes for SMP (Tom, done)

 It is possible all these items will be done by sometime in June.  Now,
 if that happens, do we want to continue with the 12-month plan or
 shorten the 8.1 release cycle, perhaps targeting a release in the
 September/October timeframe?

 The current core proposal is to do feature freeze on July 1, with the
 understanding that we will have most of the items above done by then
 and the outstanding patches applied.

It seems a good idea to take the chance now to make a release. Delaying
the release would mean keeping features out of wide use even though they
appear production grade. OTOH, if feature freeze is delayed, some
seemingly essential features for 8.1 that arise in the meantime might
delay the release further, or force late feature exclusion when
last-minute issues are discovered. Integrated autovacuum alone seems a
good enough reason to release early.

Regards,
Andreas


Re: [HACKERS] [pgsql-advocacy] Increased company involvement

2005-04-28 Thread Andy Astor
  However, there was a lot of coordination that happened with Fujitsu
  that I don't see happening with the current companies involved.
  Companies are already duplicating work that is also done by community
  members or by other companies.

 That is bound to happen no matter what. Look at plJava and plJ. Some
 people just feel that their way is better. Some people just don't get
 along etc...

 That is why we have 80 Linux distributions and a dozen FreeBSD
 distributions (can I include MacOSX?).

True enough. And coordination, just like other outward-facing
activities, is often inconvenient and easy to forget. But it's
important. I've just left the Board of Directors of the Web Services
Interoperability organization (WS-I). Coordinating the standards
activities of IBM, MS, Sun, Oracle, BEA, Fujitsu, SAP, and 130 others
takes enormous time and care, but it's the only way co-opetitors can
function.

And having said that, aggressive businesspeople will move too quickly at
times, or will hide their activities for business reasons. It's a mostly
forgivable sin, IMO.

  Second, some developers are being hired from the community to work on
  closed-source additions to PostgreSQL.  That is fine and great, but
  one way to kill PostgreSQL is to hire away its developers.  If a
  commercial company wanted to hurt us, that is certainly one way they
  might do it.  Anyway, it is a concern I have.  I am hoping community
  members hired to do closed-source additions can at least spend some of
  their time on community work.

 I would think that most of the developers would stipulate that in order
 to take the position??? I know Command Prompt would always make sure
 that the developer could work on the community stuff.

The same is true for EnterpriseDB.

  And finally, we have a few companies working on features that they
  eventually want merged back into the PostgreSQL codebase.  That is a
  very tricky process and usually goes badly unless the company seeks
  community involvement from the start, including user interface,
  implementation, and coding standards.
 
 I concur with this. We ran into this with plPerl.

The only way to successfully extend PostgreSQL commercially is to
coordinate with the community.

-- Andy






Re: [HACKERS] [JDBC] Statement Timeout and Locking

2005-04-28 Thread Markus Schaber
Hi, Tom,

Tom Lane wrote:

 Anyway the short-term answer for Markus is don't do it that way.
 We ought to think about making the backend's behavior more consistent,
 though.

I'll split the query into three. Having it in one query just was a
convenience here.

Thanks,
Markus



Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Marko Ristola
First I will comment on my original idea. Second I will give another,
improved suggestion (an idea). I hope that they will be useful for you.
(I don't know whether the first one was useful at all, because it showed
that I and some others of us are not very good with statistics :( )

I haven't looked at the PostgreSQL code, so I don't know what is
possible now and what is not. I do know, without looking at the code,
that a full table scan followed by incremental statistics changes would
be a very big change.

I meant the following idea:
- Compare two equal-sized samples. Then redo the same thing with
double-sized samples. So do lots of unnecessary work. Check the
correlation of the two samples to try to guess the distribution.

So I tried to give you an idea, not a full answer to the whole problem.

I did read some parts of the attached PDFs. They convinced me that the
heuristics for the hard cases would actually read almost the whole table
in many cases. I covered the too-small-sample problem by stating that
the user should be able to give the minimum size of samples. This way
you would avoid the too-small-sampling problem. My purpose was not to
achieve at most 5% wrong estimates, but to decrease the 2000% wrong
estimates that are sometimes seen now.

Conclusions:
- No heuristics or similar scheme based on small samples will give
excellent results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, like primary keys, unique indexes, and
special-case column types, do give easy ways to make estimates: for
example, whether a boolean column has zero, one or two distinct values
does not matter so much. Hashing seems the right choice for all of them.

If I have understood correctly, full table scans are out of the question
for large tables at this time. The percentage idea of taking 10% samples
seems good. So here is another suggestion:

1. Do a full percentage scan, starting at an arbitrary position. If the
user's data is not homogeneous, this hurts accuracy, but this way it is
faster. During that scan, try to figure out all those columns that have
at most 100 distinct values.

Of course, with this you can't reach 100% accuracy, but if a full table
scan is out of the question now, it is better if the accuracy is, for
example, at most ten times wrong.

You could also improve accuracy by doing, instead of one 10% partial
table scan, 20 pieces of 0.5-percent partial table scans: this would
improve accuracy a bit but keep the speed almost the same as the single
partial table scan.

Here are questions for the statisticians for distinct-value calculation:
If we want at most 1000% tolerance, how big a percentage of a table's
column must be processed?
If we want at most 500% tolerance, how big a percentage of a table's
column must be processed?
If we want at most 250% tolerance, how big a percentage of a table's
column must be processed?
It is fine to assume that there are at most 100 distinct values in a
table, if that helps the calculations.

If we try to get as much as possible from one discontinuous partial
table scan (a 0.1-10% sample), here is the information we can gather:
1. We could gather a histogram of at most 100 distinct values for every
column.
2. We could measure variance and average, and the number of rows, for
these 100 distinct values.
3. We could count the number of rows that didn't match these 100
distinct values: they were left out of the histogram.
4. We could get a minimum and a maximum value for each column.

=> We could get exact information about the sample with one 0.1-10% pass
for many columns.

What can you statisticians gather from these values?

My idea is programmatical, combined with statistics:
+ Performance: scan for example 100 blocks, each of size 100MB, because
disc I/O is much faster this way.
+ Enables a larger table percentage. I hope it helps with the statistics
formula. Required for more robust statistics: take those blocks at
random (non-overlapping) places to decrease the effect of hitting
statistically bad parts of the table.
+ Fewer table-scan passes: scan all columns with limited hashing in the
first pass (a minimal sketch of this follows below).
+ All easy columns are found here with one pass.
+- Harder columns each need their own pass, but after all we have some
preliminary knowledge of them from the given sample (minimum and maximum
values and the histogram of the 100 distinct values).
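
A minimal sketch of that limited-hashing pass (my invented names, a
single int column, and a linear scan instead of a real hash table --
just the shape of the bookkeeping):

    #define MAX_TRACKED 100

    typedef struct
    {
        int     nvalues;                /* distinct values seen, capped */
        int     values[MAX_TRACKED];    /* the tracked distinct values */
        int     counts[MAX_TRACKED];    /* rows seen per tracked value */
        long    overflow_rows;          /* rows matching no tracked value */
    } ColumnStats;

    /* Feed one sampled row's value into the per-column tracker. */
    static void
    column_stats_add(ColumnStats *cs, int value)
    {
        int     i;

        for (i = 0; i < cs->nvalues; i++)
        {
            if (cs->values[i] == value)
            {
                cs->counts[i]++;
                return;
            }
        }
        if (cs->nvalues < MAX_TRACKED)
        {
            cs->values[cs->nvalues] = value;
            cs->counts[cs->nvalues] = 1;
            cs->nvalues++;
        }
        else
            cs->overflow_rows++;    /* over the 100-distinct-value cap */
    }

After the scan, a column whose overflow_rows is zero is an easy column
(its sample histogram is exact), while a large overflow count marks a
column that needs its own pass.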

Marko Ristola
Greg Stark wrote:
 Dave Held [EMAIL PROTECTED] writes:
  Actually, it's more to characterize how large of a sample we need.
  For example, if we sample 0.005 of disk pages, and get an estimate,
  and then sample another 0.005 of disk pages and get an estimate which
  is not even close to the first estimate, then we have an idea that
  this is a table which defies analysis based on small samples.

 I buy that.

 Better yet is to use the entire sample you've gathered of .01 and

Re: [HACKERS] [COMMITTERS] pgsql: Implement sharable row-level locks, and use them for foreign key

2005-04-28 Thread Alvaro Herrera
On Thu, Apr 28, 2005 at 06:47:18PM -0300, Tom Lane wrote:

 Implement sharable row-level locks, and use them for foreign key references
 to eliminate unnecessary deadlocks.  This commit adds SELECT ... FOR SHARE
 paralleling SELECT ... FOR UPDATE.  The implementation uses a new SLRU
 data structure (managed much like pg_subtrans) to represent multiple-
 transaction-ID sets.

One point I didn't quite understand was the business about XLogging
heap_lock_tuple.  I had to reread your mail to -hackers on this issue
several times to get it (as you can see I don't fully grok the WAL
rules).  Now, I believe that heap_mark4update was wrong on this, no?
Only it didn't matter because after a crash nobody cared about the
stored Xmax.

One nice side effect of this is that the 2PC patch now has this problem
solved.  The bad part is that locking a tuple emits a (non-XLogFlushed)
WAL record, and that may have a performance impact.  (We should have
better performance overall, I think, because transactions are no longer
blocked on foreign key checking.)
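
(For readers following along: a toy in-memory model of the
multiple-transaction-ID sets the commit message describes. Illustration
only -- the real structure lives in SLRU pages on disk, and all bounds
and wraparound handling is omitted.)

    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint32_t MultiXactId;

    #define MAX_MULTIXACTS  1024
    #define MAX_MEMBERS     8192

    /* members[offsets[m] .. offsets[m + 1] - 1] hold multixact m's XIDs. */
    static uint32_t         offsets[MAX_MULTIXACTS + 1];
    static TransactionId    members[MAX_MEMBERS];
    static MultiXactId      next_mxid = 1;
    static uint32_t         next_member = 0;

    /* Record the set of XIDs sharing a row lock; returns its MultiXactId. */
    static MultiXactId
    multixact_create(const TransactionId *xids, int n)
    {
        MultiXactId m = next_mxid++;
        int         i;

        offsets[m] = next_member;
        for (i = 0; i < n; i++)
            members[next_member++] = xids[i];
        offsets[next_mxid] = next_member;   /* end sentinel for last entry */
        return m;
    }

A tuple's Xmax can then store m instead of a single XID, and a lock
conflict check walks the member list.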


Anyway: many thanks for updating the patch to a usable state.  I'm
sorry to have inflicted all those bugs upon you.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
La soledad es compañía ("Solitude is company")



Re: [HACKERS] [proposal] protocol extension to support loadable stream filters

2005-04-28 Thread Brent Verner
[2005-04-28 10:00] Tom Lane said:
| Brent Verner [EMAIL PROTECTED] writes:
|Would it be sane to recognize a specific PG_PROTOCOL_MAJOR
|  to enter the filter-negotiation process?  PG_PROTOCOL_MINOR
|  would then be used to look up and call a pointer to the filter's
|  create() in CreateStreamFilter...
| 
| Looks reasonable enough to me ...

  Now, the hard part...where should this code live?  I'm thinking a 
src/transport directory seems sensible.

  StreamFilter.[ch] will contain the base StreamFilter along with 
various utility functions.  StreamFilter implementations will reside
in their own subdir.

  src/include/transport/StreamFilter.h
  src/transport/StreamFilter.c
  src/transport/zlib/...
  src/transport/ssl/...
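
A sketch of how that lookup could go, with hypothetical names throughout
(the table, create functions and reserved major value below are invented
for illustration, not existing code):

    #include <stddef.h>

    typedef struct StreamFilter StreamFilter;
    typedef StreamFilter *(*FilterCreateFn) (void);

    /* Implemented under src/transport/zlib and src/transport/ssl. */
    extern StreamFilter *zlib_filter_create(void);
    extern StreamFilter *ssl_filter_create(void);

    #define FILTER_PROTOCOL_MAJOR  0x4646   /* hypothetical reserved value */

    static const FilterCreateFn filter_table[] = {
        NULL,                   /* minor 0: reserved */
        zlib_filter_create,     /* minor 1 */
        ssl_filter_create,      /* minor 2 */
    };

    StreamFilter *
    CreateStreamFilter(int proto_major, int proto_minor)
    {
        if (proto_major != FILTER_PROTOCOL_MAJOR)
            return NULL;        /* not a filter-negotiation request */
        if (proto_minor <= 0 ||
            proto_minor >= (int) (sizeof(filter_table) /
                                  sizeof(filter_table[0])))
            return NULL;        /* unknown filter minor */
        return filter_table[proto_minor]();
    }

The startup packet's major selects filter negotiation; the minor is just
an index into the registered create() callbacks.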

Comments/ideas appreciated.

cheers.
b




Re: [HACKERS] [pgsql-advocacy] Increased company involvement

2005-04-28 Thread Josh Berkus
Hannu,

 But I too expected the discussion to take place on pgsql-hackers, not
 some half-hidden mailing list on pgfoundry. Or at least an announcement
 of that mailing list to be made on pgsql-hackers.

Yeah, we should announce the mailing list.  Actually, I did direct e-mail a 
bunch of people (including you) about it and invite them to the mailing list.
So: http://pgfoundry.org/mail/?group_id=1000107

For discussing potential features, though, I'm personally reluctant at
this point to discuss major features until I and my collaborators (for
example, on newsysviews) have a concrete proposal together.  It's far
too easy to get sidetracked into a discussion of minutiae and politics
on -hackers; by generating a complete draft spec (at least) on a small
mailing list, it's a lot easier to focus on specific goals and
schedules, and discussions on -hackers around detailed proposals tend to
be a lot more focused.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] [COMMITTERS] pgsql: Implement sharable row-level locks, and use them for foreign key

2005-04-28 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 One point I didn't quite understand was the business about XLogging
 heap_lock_tuple.  I had to reread your mail to -hackers on this issue
 several times to get it (as you can see I don't fully grok the WAL
 rules).  Now, I believe that heap_mark4update was wrong on this, no?
 Only it didn't matter because after a crash nobody cared about the
 stored Xmax.

Well, actually the reason I decided to put in xlogging there was that
I realized it was already broken before.  In the existing code it was
possible to have this scenario:
* transaction N selects-for-update some tuple, so N goes into
  the tuple's XMAX.
* transaction N ends without doing anything else.  Since it's
  not produced any XLOG entries, xact.c thinks it doesn't need
  to emit either a COMMIT or ABORT xlog record.
* therefore, there is no record whatsoever of XID N in XLOG.
* bgwriter pushes the dirty data page to disk.
* database crashes.
* on restart, WAL replay sets the XID counter to N or less,
  because there is no evidence in the XLOG for N.
* now there will be a new transaction N that is mistakenly
  considered to own an update lock on the tuple.

While the negative impact of this situation is probably not high,
it's clearly The Wrong Thing.

The MultiXactId patch introduces a second way to have the same
problem, ie a MultiXactId on disk for which there is no evidence
in XLOG, so the MXID might get re-used after restart.

In view of the fact that we want to do 2PC sometime soon, and that
absolutely requires xlogging every lock, I thought that continuing to
try to avoid emitting an xlog record for heap_lock_tuple was just silly.

regards, tom lane



Re: [HACKERS] [pgsql-advocacy] Increased company involvement

2005-04-28 Thread Bruce Momjian
Robert Treat wrote:
 ISTM the allure of differentiation and branding is going to be too
 strong for us to prevent such things.  An easy way to differentiate is
 to add some proprietary/unique extension to the main code and then
 package that up; if you have to put all your extensions into the
 community version, then you lose this advantage over your competitors.
 (Mammoth PostgreSQL/Replicator is an example of this.)  The same holds
 true for branding: if you're Pervasive, you want to sell Pervasive
 Postgres rather than PostgreSQL, because you get to push your name out
 there and get people thinking about your company whenever they talk
 about the database.

 I think our goal is to encourage companies to push these changes into
 the core as much as possible, pointing out the advantages community
 support brings, like ongoing maintenance, support in add-on tools like
 pgadmin or phppgadmin, and eliminating the chance that someone else
 will submit a similar solution that gets accepted into the community
 code, thereby deprecating the work they have already done.

I remember something the president of Great Bridge told me.  He said,
"Great Bridge needs PostgreSQL.  If Great Bridge dies, PostgreSQL goes
on (as it did), but if PostgreSQL dies, Great Bridge is dead too."

-- 
  Bruce Momjian                      |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us             |  (610) 359-1001
  +  If your life is a hard drive,   |  13 Roberts Road
  +  Christ can be your backup.      |  Newtown Square, Pennsylvania 19073



Re: [HACKERS] [PERFORM] Distinct-Sampling (Gibbons paper) for Postgres

2005-04-28 Thread Josh Berkus

 Now, if we can come up with something better than the ARC algorithm ...

Tom already did.  His clock-sweep patch is already in the 8.1 source.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
