Sounds a bit like multi-dimensional clustering ...
http://www.research.ibm.com/mdc/
After the ARC experience though ...
--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
> Jim C. Nasby wrote:
>> On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
>>
>> > I still suspect that the correct way to do it would not be
>> > to use the single "correlation", but 2 stats - one for estimating
>> > how sequential/random accesses would be; and one for estimating
>> the number of pages that would be hit.
Bruce Momjian wrote:
Jim C. Nasby wrote:
On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
I still suspect that the correct way to do it would not be
to use the single "correlation", but 2 stats - one for estimating
how sequential/random accesses would be; and one for estimating
the number of pages that would be hit.
Jim C. Nasby wrote:
> On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
>
> > I still suspect that the correct way to do it would not be
> > to use the single "correlation", but 2 stats - one for estimating
> > how sequential/random accesses would be; and one for estimating
> > the number of pages that would be hit.
On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
> I still suspect that the correct way to do it would not be
> to use the single "correlation", but 2 stats - one for estimating
> how sequential/random accesses would be; and one for estimating
> the number of pages that would be hit. I
[EMAIL PROTECTED] wrote:
You know, I don't think a lot of people "get" the issues I was describing,
or maybe they don't believe it, I don't know, but, I think that it would
be a useful contrib project to create an 'analyze_special('table',
'column', 'method')' function that does a better job at cal
> Probably off-topic, but I think it's worth seeing what astronomers are
> doing with their very big spatial databases. For example, we are working
> with a catalog of more than 500,000,000 rows and we use a special
> transformation of coordinates to integer numbers that preserves object
> closeness.
[EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> [EMAIL PROTECTED]
> Sent: Wednesday, February 09, 2005 3:46 PM
> To: Ron Mayer
> Cc: Mark Kirkwood; Tom Lane; Ron Mayer; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)
What's the purpose of doing this transformation? Is it just a means to
sub-divide the dataset? It's very possible that PostGIS would do just as
good a job, without using HTM. Granted, GIS is designed more for working
in LAT/LONG, but I suspect it should work just as well in whatever
coordinate system.
Probably off-topic, but I think it's worth seeing what astronomers are
doing with their very big spatial databases. For example, we are working
with a catalog of more than 500,000,000 rows and we use a special
transformation of coordinates to integer numbers that preserves object closeness.
I hope
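The exact scheme isn't named here, but "integer codes that preserve object closeness" sounds like a space-filling-curve technique such as HTM or a Z-order (Morton) encoding. As a rough illustration of the Z-order idea only (the coordinate names and bit widths below are assumptions, not the astronomers' actual method):

```python
def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a coordinate in [lo, hi) onto a bits-wide integer grid."""
    cell = int((value - lo) / (hi - lo) * (1 << bits))
    return min(cell, (1 << bits) - 1)

def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two grid coordinates (Z-order curve).
    Nearby (x, y) cells tend to get nearby codes, so a plain B-tree
    index on the code keeps spatially close objects together."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def sky_code(ra: float, dec: float) -> int:
    """Integer code for a sky position (right ascension, declination)."""
    return morton_encode(quantize(ra, 0.0, 360.0), quantize(dec, -90.0, 90.0))

# Two neighbouring objects get near-identical codes; a distant one does not.
near_a = sky_code(180.001, 45.001)
near_b = sky_code(180.002, 45.002)
far_c = sky_code(10.0, -60.0)
```

A range scan over such codes then retrieves a sky neighbourhood with mostly sequential page reads, which is the point of the transformation.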
On Wed, Feb 09, 2005 at 07:30:16PM -0500, [EMAIL PROTECTED] wrote:
> > Mark,
> >
> >> Hey, I can give you a copy of RT1 which is fine, but it is 1.1G
> >> compressed. I'd have to mail you a DVD.
> >
> > Sure, cool.
> >
> [address info snipped]
>
> I would be willing to send a couple DVDs (on a regular basis) to anyone
> who is able to post this on a good mirror that anyone could get at
> On Wed, Feb 09, 2005 at 07:30:16PM -0500, [EMAIL PROTECTED] wrote:
>> I would love to keep these things current for PG development, but my
>> company's server is on a plan that gets 1G free, and is billed after
>> that. Also, I am on a broadband line at my office, and uploading the
>> data would take
On Wed, Feb 09, 2005 at 07:30:16PM -0500, [EMAIL PROTECTED] wrote:
> I would love to keep these things current for PG development, but my
> company's server is on a plan that gets 1G free, and is billed after
> that. Also, I am on a broadband line at my office, and uploading the data
> would take
> Mark,
>
>> Hey, I can give you a copy of RT1 which is fine, but it is 1.1G
>> compressed. I'd have to mail you a DVD.
>
> Sure, cool.
>
[address info snipped]
I would be willing to send a couple DVDs (on a regular basis) to anyone
who is able to post this on a good mirror that anyone could get at
Josh Berkus wrote:
> Mark, Stephen, etc:
>
> > > I can see your point, however I wonder if the issue is that the default
> > > stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and
> > > maybe we should consider making a higher value (say '100') the default.
> >
> > Personally, I think that'd be reasonable.
I wrote a message called "One Big trend vs multiple smaller trends in table
statistics" that, I think, explains what we've been seeing.
> [EMAIL PROTECTED] wrote:
>>
>> In this case, the behavior observed could be changed by altering the
>> sample size for a table. I submit that an arbitrary fixed sample size
>> is not a good base for the analyzer, but that the sample size should be
>> based on the size of the table or some calculation of its deviation.
[EMAIL PROTECTED] wrote:
In this case, the behavior observed could be changed by altering the
sample size for a table. I submit that an arbitrary fixed sample size is
not a good base for the analyzer, but that the sample size should be based
on the size of the table or some calculation of its deviation.
> Mark, Stephen, etc:
>
>> > I can see your point, however I wonder if the issue is that the
>> default
>> > stats settings of '10' (3000 rows, 10 histogram buckets) is too low,
>> and
>> > maybe we should consider making a higher value (say '100') the
>> default.
>>
>> Personally, I think that'd be reasonable.
Mark, Stephen, etc:
> > I can see your point, however I wonder if the issue is that the default
> > stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and
> > maybe we should consider making a higher value (say '100') the default.
>
> Personally, I think that'd be reasonable.
I
On Mon, Feb 07, 2005 at 17:45:23 -0500,
Greg Stark <[EMAIL PROTECTED]> wrote:
>
> However for discrete values like the "top ten most common values" and the
> "total number of distinct values" it's not so clear at all that you can
> extrapolate from a sample at all. And it's certainly not clear t
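A toy simulation makes Greg's point concrete: a fixed-size sample pins down a proportion well, but the number of distinct values it sees cannot be extrapolated without knowing the unseen truth. (This is an illustration only, not the estimator ANALYZE actually uses.)

```python
import random

def distinct_seen(n_distinct: int, sample_size: int, seed: int) -> int:
    """Draw sample_size rows from a column holding n_distinct equally
    common values and count how many distinct values the sample sees."""
    rng = random.Random(seed)
    return len({rng.randrange(n_distinct) for _ in range(sample_size)})

# A 3000-row sample sees essentially all of 500 distinct values, but only
# a small slice of 50,000 distinct values -- so the same observed count
# must be scaled completely differently depending on what it cannot see.
few = distinct_seen(500, 3000, seed=7)
many = distinct_seen(50_000, 3000, seed=7)
```

In the first case the sample is nearly a census of the distinct values; in the second it has barely begun to enumerate them, and no fixed multiplier distinguishes the two situations from the sample alone.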
> [EMAIL PROTECTED] writes:
>
>> The basic problem with a fixed sample is that it assumes a normal
>> distribution.
>
> That's sort of true, but not in the way you think it is.
>
[snip]
Greg, I think you have an excellent ability to articulate stats, but I
think that the view that this is like ele
* Mark Kirkwood ([EMAIL PROTECTED]) wrote:
> I can see your point, however I wonder if the issue is that the default
> stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and
> maybe we should consider making a higher value (say '100') the default.
Personally, I think that'd be reasonable.
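For context, the "10 (3000 rows)" figure quoted above follows from ANALYZE's sampling rule: the number of rows sampled is 300 per unit of the statistics target, a constant the source comments attribute to the histogram-accuracy literature. Sketched out:

```python
def analyze_sample_rows(statistics_target: int) -> int:
    """Rows ANALYZE samples for a column: 300 per unit of the target."""
    return 300 * statistics_target

# default target of 10 -> 3000 sampled rows (and 10 histogram buckets);
# raising the target to 100, as suggested, would sample 30,000 rows.
default_rows = analyze_sample_rows(10)
proposed_rows = analyze_sample_rows(100)
```

So raising the default target trades a 10x larger (and slower) sample for finer histograms on every column that doesn't override it.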
> [EMAIL PROTECTED] wrote:
>>
>> In this case, the behavior observed could be changed by altering the
>> sample size for a table. I submit that an arbitrary fixed sample size is
>> not a good base for the analyzer, but that the sample size should be
>> based
>> on the size of the table or some calculation of its deviation.
[EMAIL PROTECTED] writes:
> The basic problem with a fixed sample is that it assumes a normal
> distribution.
That's sort of true, but not in the way you think it is.
What's under discussion here, w.r.t. the histograms and the validity of
sampling is really basic statistics. You don't have to b
[EMAIL PROTECTED] wrote:
In this case, the behavior observed could be changed by altering the
sample size for a table. I submit that an arbitrary fixed sample size is
not a good base for the analyzer, but that the sample size should be based
on the size of the table or some calculation of its deviation.
> On Mon, Feb 07, 2005 at 05:16:56PM -0500, [EMAIL PROTECTED] wrote:
>> > On Mon, Feb 07, 2005 at 13:28:04 -0500,
>> >
>> > What you are saying here is that if you want more accurate statistics,
>> you
>> > need to sample more rows. That is true. However, the size of the
>> sample
>> > is essentially only dependent on the accuracy you need and not the size
>> > of the population.
Tom Lane <[EMAIL PROTECTED]> writes:
> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.
Earlier I was discussing the issue of how to measure cross-column
"correlation" with someone who works on precisely the
> Maybe I am missing something - ISTM that you can increase your
> statistics target for those larger tables to obtain a larger (i.e.
> better) sample.
No one is arguing that you can't manually do things, but I am not the
first to notice this. I saw the query planner doing something completely
stupid
On Mon, Feb 07, 2005 at 05:16:56PM -0500, [EMAIL PROTECTED] wrote:
> > On Mon, Feb 07, 2005 at 13:28:04 -0500,
> >
> > What you are saying here is that if you want more accurate statistics, you
> > need to sample more rows. That is true. However, the size of the sample
> > is essentially only dependent on the accuracy you need and not the size
> > of the population.
> On Mon, Feb 07, 2005 at 13:28:04 -0500,
>
> What you are saying here is that if you want more accurate statistics, you
> need to sample more rows. That is true. However, the size of the sample
> is essentially only dependent on the accuracy you need and not the size
> of the population, for large
Maybe I am missing something - ISTM that you can increase your
statistics target for those larger tables to obtain a larger (i.e.
better) sample.
regards
Mark
[EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] writes:
Any and all random sampling assumes a degree of uniform distribution. This
is the basis
On Mon, Feb 07, 2005 at 13:28:04 -0500,
> >
> > For large populations the accuracy of estimates of statistics based on
> > random samples from that population is not very sensitive to population
> > size and depends primarily on the sample size, so you would not expect
> > to need to use larger samples for larger tables.
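Bruno's claim is easy to check with a quick simulation (illustrative only; the 10% selectivity and the table sizes are made-up numbers, and this is not ANALYZE's actual machinery):

```python
import random

def estimate_fraction(population_size: int, sample_size: int, seed: int) -> float:
    """Estimate, from a uniform random sample, the fraction of rows
    matching a predicate whose true selectivity is exactly 10%."""
    rng = random.Random(seed)
    population = [i % 10 == 0 for i in range(population_size)]
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

# Same 3000-row sample, tables two orders of magnitude apart in size:
small_table = estimate_fraction(30_000, 3000, seed=42)
big_table = estimate_fraction(3_000_000, 3000, seed=42)
# Both estimates land close to the true 0.10.
```

The standard error of such an estimate, roughly sqrt(p(1-p)/n), contains the sample size n but not the population size, which is exactly the property under discussion. As Greg Stark notes later in the thread, this holds for proportions and histogram bucket boundaries, not for distinct-value counts.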
> On Mon, Feb 07, 2005 at 11:27:59 -0500,
> [EMAIL PROTECTED] wrote:
>>
>> It is inarguable that increasing the sample size increases the accuracy
>> of
>> a study, especially when diversity of the subject is unknown. It is
>> known
>> that reducing a sample size increases probability of error in
On Mon, Feb 07, 2005 at 11:27:59 -0500,
[EMAIL PROTECTED] wrote:
>
> It is inarguable that increasing the sample size increases the accuracy of
> a study, especially when diversity of the subject is unknown. It is known
> that reducing a sample size increases probability of error in any poll or
[EMAIL PROTECTED] writes:
> On a very basic level, why bother sampling the whole table at all? Why not
> check one block and infer all information from that? Because we know that
> isn't enough data. In a table of 4.6 million rows, can you say with any
> mathematical certainty that a sample of 100 pages
> [EMAIL PROTECTED] writes:
>> On a very basic level, why bother sampling the whole table at all? Why
>> not
>> check one block and infer all information from that? Because we know
>> that
>> isn't enough data. In a table of 4.6 million rows, can you say with any
>> mathematical certainty that a sample
> [EMAIL PROTECTED] writes:
>> One of the things that is disturbing to me about the analyze settings is
>> that it wants to sample the same number of records from a table
>> regardless
>> of the size of that table.
>
> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.
[EMAIL PROTECTED] writes:
> One of the things that is disturbing to me about the analyze settings is
> that it wants to sample the same number of records from a table regardless
> of the size of that table.
The papers that I looked at say that this rule has a good solid
statistical foundation, at least for the case of estimating histograms.
Ron Mayer <[EMAIL PROTECTED]> writes:
> What I think I'd like to see is for there to be
> another statistic similar to "correlation" but rather
> than looking at the total-ordering of the table, to
> look how correlated values within any single page are.
Our current approach to correlation is sure
Short summary:
I had the same problem - since the sort order of zip-codes,
counties, city names, and states don't match, the optimizer
grossly overestimated the number of pages that would be read.
I bet doing a CLUSTER by ZIP would solve that particular
query, but would break similar queries b
> [EMAIL PROTECTED] writes:
>>-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
>> rows=30835 width=302)
>> Index Cond: ((zipr = 2186) OR (zipl = 2186))
>
>> zipl |925 |
>> zipr |899 |
>
> Those n_distinct values for zipl and zipr seem aberrant
On Fri, Feb 04, 2005 at 05:23:27PM -0500, Stephen Frost wrote:
> As I recall, the default is 100 for statistics, I'd say increase that
> number to like 200 or 300 or more and see if it helps.
No, it's 10.
--
Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
Give your computer som
[EMAIL PROTECTED] writes:
>-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
> rows=30835 width=302)
> Index Cond: ((zipr = 2186) OR (zipl = 2186))
> zipl |925 |
> zipr |899 |
Those n_distinct values for zipl and zipr seem aberrant ---
> [EMAIL PROTECTED] writes:
>> I suspect that analyze only samples a very small amount of the database
>> and gets the wrong idea about it. Is there a way to force analyze to
>> sample more rows?
>
> default_statistics_target. But let's see the pg_stats rows for these
> columns before assuming that analyze
* [EMAIL PROTECTED] ([EMAIL PROTECTED]) wrote:
> I have the USA census TIGER database loaded, the WHOLE THING, the whole
> country. It isn't the biggest database, but it is about 40G before
> indexes. Every table is over a few million rows. I can quite honestly say,
> a sequential scan is almost never the right thing to do.
[EMAIL PROTECTED] writes:
> I suspect that analyze only samples a very small amount of the database
> and gets the wrong idea about it. Is there a way to force analyze to
> sample more rows?
default_statistics_target. But let's see the pg_stats rows for these
columns before assuming that analyze
Here's one:
I have the USA census TIGER database loaded, the WHOLE THING, the whole
country. It isn't the biggest database, but it is about 40G before
indexes. Every table is over a few million rows. I can quite honestly say,
a sequential scan is almost never the right thing to do.
Info below.
He