On Tue, Jan 5, 2010 at 7:49 AM, Chetan Suttraway wrote:
> if we can treat this case as similar to merging the histograms of the two
> tables involved in a join and generating a histogram for the cartesian
> (result) node,
...which you can't, because it's totally different, so I think the
Hi,
My suggestion is to keep two sets of histograms: one generated by running
ANALYZE, and another generated dynamically from the entries logged during
insert/update/delete operations.
I am not sure how difficult it is to read such record details from
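For reference, the static histogram in question is the one ANALYZE stores and exposes through pg_stats; a minimal sketch of how far it can lag behind the live data, assuming a hypothetical append-only table "events" with a default-now() column "created_at":

-- Upper bound of the static histogram, frozen at the last ANALYZE:
SELECT (histogram_bounds::text::timestamptz[])[
           array_upper(histogram_bounds::text::timestamptz[], 1)
       ] AS analyzed_upper_bound
FROM   pg_stats
WHERE  schemaname = 'public'
  AND  tablename  = 'events'
  AND  attname    = 'created_at';

-- Actual upper bound of the data right now:
SELECT max(created_at) AS live_upper_bound FROM events;

The gap between the two values is exactly what a dynamically maintained histogram would have to keep covering between ANALYZE runs.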
Josh Berkus writes:
>> Great. When you have it set up, you might want to play with enabling
>> the mergejoinscansel change as well, and see if that is a net plus or
>> minus for you.
> How would I *disable* it?
It's #ifdef NOT_USED in CVS at the moment.
regards, tom lane
> Great. When you have it set up, you might want to play with enabling
> the mergejoinscansel change as well, and see if that is a net plus or
> minus for you.
How would I *disable* it?
--Josh
> I've applied a patch to HEAD that does the above. Can you test it to
> see how well it fixes your problem?
Sure. It'll take us a while to set up a test environment; the database
is 1TB in size so converting it to 8.5 isn't quick.
Will report back.
--Josh
Josh Berkus writes:
>> I've applied a patch to HEAD that does the above. Can you test it to
>> see how well it fixes your problem?
> Sure. It'll take us a while to set up a test environment; the database
> is 1TB in size so converting it to 8.5 isn't quick.
Great. When you have it set up, you might want to play with enabling
the mergejoinscansel change as well, and see if that is a net plus or
minus for you.
On Wed, 2009-12-30 at 17:16 +0100, Tom Lane wrote:
> I think the cleanest solution to this would be to make ANALYZE
> cheaper, perhaps by finding some way for it to work incrementally.
What if, when inserting or deleting tuples, a random sample of them were
passed into an auto-analyze buffer?
I wrote:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value ---
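For what it's worth, on a btree-indexed column that probe costs about the same as a plain min()/max() aggregate, which the planner already turns into a one-row descent of the index rather than a table scan. A rough sketch, with hypothetical names:

-- Hypothetical: index the timestamp column so its current endpoints can
-- be fetched without scanning the heap.
CREATE INDEX events_created_at_idx ON events (created_at);

-- Both of these are planned as a LIMIT over an index scan from one end
-- of the btree, so they stay cheap no matter how large the table grows.
EXPLAIN SELECT min(created_at) FROM events;
EXPLAIN SELECT max(created_at) FROM events;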
On Thu, 2009-12-31 at 21:29 +0000, Simon Riggs wrote:
> On Thu, 2009-12-31 at 15:18 -0500, Tom Lane wrote:
> > Simon Riggs writes:
> > > Why not get both max() and min(), then rebase the histogram according to
> > > those values. That way the histogram can still move significantly and
> > > the technique will still work.
On Thu, 2009-12-31 at 15:18 -0500, Tom Lane wrote:
> Simon Riggs writes:
> > Why not get both max() and min(), then rebase the histogram according to
> > those values. That way the histogram can still move significantly and
> > the technique will still work.
>
> Define "rebase", keeping in mind t
Simon Riggs writes:
> Why not get both max() and min(), then rebase the histogram according to
> those values. That way the histogram can still move significantly and
> the technique will still work.
Define "rebase", keeping in mind that this has to work on datatypes that
we don't have a distance metric for.
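For a type that does have a distance metric, such as a timestamp, one reading of "rebase" is to slide every stored bound by the gap between the live max() and the analyzed upper bound. A hypothetical sketch of that reading (names assumed); the addition and subtraction in it are precisely what has no equivalent for types without a distance metric:

-- Shift the stored histogram bounds of events.created_at so the top
-- bucket ends at the live max(). Relies on timestamptz subtraction
-- yielding an interval that can be added back to each bound.
WITH stored AS (
    SELECT histogram_bounds::text::timestamptz[] AS b
    FROM   pg_stats
    WHERE  tablename = 'events' AND attname = 'created_at'
), live AS (
    SELECT max(created_at) AS live_max FROM events
)
SELECT array_agg(bound + shift ORDER BY bound) AS rebased_bounds
FROM (
    SELECT unnest(stored.b) AS bound,
           live.live_max - stored.b[array_upper(stored.b, 1)] AS shift
    FROM   stored, live
) AS s;

Even for timestamps this only moves the bucket boundaries; the per-bucket frequencies still describe the old sample, and the growth is assumed to be a pure shift.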
On Wed, 2009-12-30 at 14:55 -0500, Tom Lane wrote:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value ---
Tom Lane writes:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value ---
While regular ANALYZE seems to be pretty good ... is it insane to
suggest determining the min/max bounds of problem columns by looking at
a btree index on the column in ANALYZE, instead of relying on random
data sampling? An ANALYZE that didn't even have to scan the indexes but
just look at the
On 31/12/2009 12:33 AM, Kevin Grittner wrote:
> Tom Lane wrote:
>> Well, the problem Josh has got is exactly that a constant high
>> bound doesn't work.
> I thought the problem was that the high bound in the statistics fell
> too far below the actual high end in the data. This tends (in my
> experience)
Joshua D. Drake wrote:
postgres=# analyze verbose test_ten_million;
INFO: analyzing "public.test_ten_million"
INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000
live rows and 0 dead rows; 3000 rows in sample, 1048 estimated total
rows
ANALYZE
Time: 20145.148 ms
At
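A test of roughly that shape is easy to recreate with a throwaway table; the sketch below is an assumed reconstruction rather than the script that produced the output above, and the row layout and statistics targets are guesses:

-- Build a ten-million-row table and compare how much of it ANALYZE
-- reads at two different statistics targets.
CREATE TABLE test_ten_million AS
    SELECT i AS id,
           now() + (i || ' seconds')::interval AS created_at
    FROM   generate_series(1, 10000000) AS g(i);

SET default_statistics_target = 10;    -- sample of about 3000 rows
ANALYZE VERBOSE test_ten_million;

SET default_statistics_target = 100;   -- sample of about 30000 rows
ANALYZE VERBOSE test_ten_million;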
Chris Browne writes:
> I find it curious that ANALYZE *would* take a long time to run.
> After all, its sampling strategy means that, barring having SET
> STATISTICS to some ghastly high number, it shouldn't need to do
> materially more work to analyze a 1TB table than is required to analyze
> a
Well, that's interesting, because they claim to be doing exactly the same amount
of I/O in terms of pages.
In the first case it's reading 3/4 of the table so it's effectively doing a
sequential scan. In the second case it's only scanning 7.5% so you would expect
it to be slower but not that much
j...@commandprompt.com ("Joshua D. Drake") writes:
> On the other hand ANALYZE also:
>
> 1. Uses lots of memory
> 2. Lots of processor
> 3. Can take a long time
>
> We normally don't notice because most sets won't incur a penalty. We got a
> customer who
> has a single table that is over 1TB in size...
Peter Eisentraut writes:
> On tis, 2009-12-29 at 22:08 -0500, Tom Lane wrote:
>> This seems like a fundamentally broken approach, first because "time
>> between analyzes" is not even approximately a constant, and second
>> because it assumes that we have a distance metric for all datatypes.
> Maybe you could compute a correlation
On tis, 2009-12-29 at 22:08 -0500, Tom Lane wrote:
> This seems like a fundamentally broken approach, first because "time
> between analyzes" is not even approximately a constant, and second
> because it assumes that we have a distance metric for all datatypes.
Maybe you could compute a correlation
On Wed, 30 Dec 2009 18:42:38 +0000, Greg Stark wrote:
> I'm a bit puzzled by people's repeated suggestion here that large
> tables take a long time to analyze. The sample analyze takes to
> generate statistics is not heavily influenced by the size of the
> table. Your 1TB table should take basically
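The sample ANALYZE draws is tied to the statistics target (roughly 300 rows per point of target), not to the table size, so the knob for a too-coarse histogram is per column rather than anything proportional to the table. With the same hypothetical names as before:

-- Raise the statistics target for just the problem column; the next
-- ANALYZE samples about 300 x 1000 rows and builds a finer histogram,
-- at a cost that does not grow with the size of the table.
ALTER TABLE events ALTER COLUMN created_at SET STATISTICS 1000;
ANALYZE VERBOSE events;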
On Wed, Dec 30, 2009 at 4:31 PM, Joshua D. Drake wrote:
> On the other hand ANALYZE also:
>
> 1. Uses lots of memory
> 2. Lots of processor
> 3. Can take a long time
>
> We normally don't notice because most sets won't incur a penalty. We got a
> customer who
> has a single table that is over 1TB
Alvaro Herrera writes:
> Tom Lane escribió:
>> Normally, yeah. I think Josh's problem is that he's got
>> performance-critical queries that are touching the "moving edge" of the
>> data set, and so the part of the stats that are relevant to them is
>> changing fast, even though in an overall sense
Tom Lane escribió:
> Greg Smith writes:
> > Right, and the only thing that makes this case less painful is that you
> > don't really need the stats to be updated quite as often in situations
> > with that much data. If, say, your stats say there's 2B rows in the
> > table but there's actually 2.5B, that's a big error, but unlikely to
> > change the types of plans you get.
Greg Smith wrote:
> If, say, your stats say there's 2B rows in the table but there's
> actually 2.5B, that's a big error, but unlikely to change the
> types of plans you get. Once there's millions of distinct values
it takes a big change for plans to shift, etc.
Well, the exception to tha
Greg Smith writes:
> Right, and the only thing that makes this case less painful is that you
> don't really need the stats to be updated quite as often in situations
> with that much data. If, say, your stats say there's 2B rows in the
> table but there's actually 2.5B, that's a big error, but unlikely to change
> the types of plans you get.
Joshua D. Drake wrote:
We normally don't notice because most sets won't incur a penalty. We got a
customer who
has a single table that is over 1TB in size... We notice. Granted that is the
extreme
but it would only take a quarter of that size (which is common) to start seeing
issues.
Right, and the only thing that makes this case less painful is that you
don't really need the stats to be updated quite as often in situations
with that much data.
Tom Lane wrote:
> Well, the problem Josh has got is exactly that a constant high
> bound doesn't work.
I thought the problem was that the high bound in the statistics fell
too far below the actual high end in the data. This tends (in my
experience) to be much more painful than an artificially
On Wed, 30 Dec 2009 11:16:45 -0500, Tom Lane wrote:
> "Kevin Grittner" writes:
>> Tom Lane wrote:
>>> I don't have a better idea at the moment :-(
>
>> It's been a while since I've been bitten by this issue -- the last
>> time was under Sybase. The Sybase suggestion was to either add
>> "dumm
"Kevin Grittner" writes:
> Tom Lane wrote:
>> I don't have a better idea at the moment :-(
> It's been a while since I've been bitten by this issue -- the last
> time was under Sybase. The Sybase suggestion was to either add
> "dummy rows" [YUCK!] to set the extreme bounds or to "lie to the
>
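Translated to PostgreSQL, the "dummy rows" hack means planting a sentinel far beyond the real data so the histogram's upper bound stays ahead of the moving edge. A hedged sketch with hypothetical names, which also shows why it earns the "YUCK":

-- Plant a far-future sentinel so ANALYZE records an upper histogram
-- bound that the live data will not outrun for a long time.
INSERT INTO events (id, created_at)
VALUES (-1, TIMESTAMPTZ '2100-01-01 00:00:00+00');
ANALYZE events;

-- The downside: every real query now has to exclude the sentinel.
SELECT count(*)
FROM   events
WHERE  created_at > now() - interval '1 hour'
  AND  id <> -1;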
Tom Lane wrote:
> Josh Berkus writes:
>> My thoughts on dealing with this intelligently without a major
>> change to statistics gathering went along these lines:
>
>> 1. add columns to pg_statistic to hold estimates of upper and
>> lower bounds growth between analyzes.
>
> This seems like a fundamentally broken approach, first because "time
> between analyzes" is not even approximately a constant, and second
> because it assumes that we have a distance metric for all datatypes.
Josh Berkus writes:
> My thoughts on dealing with this intelligently without a major change to
> statistics gathering went along these lines:
> 1. add columns to pg_statistic to hold estimates of upper and lower
> bounds growth between analyzes.
This seems like a fundamentally broken approach, first because "time
between analyzes" is not even approximately a constant, and second
because it assumes that we have a distance metric for all datatypes.
All,
One of our clients is having query plan issues with a table with a
continuously advancing timestamp column (i.e. one with default now()).
The newest rows, which are the most in demand, are always estimated to
be fewer than they are or even non-existent. As a result, the user has
to analyze t
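The misestimate is straightforward to reproduce on any table of this shape: once now() has moved past the histogram's upper bound recorded at the last ANALYZE, a predicate on the newest rows is costed as if it matched almost nothing. A sketch with hypothetical names (plan output deliberately omitted rather than invented):

-- Shortly after ANALYZE the estimate for the newest hour is sane; as
-- rows keep arriving, the estimated row count for the same query
-- collapses toward 1, because the predicate now falls above the stored
-- histogram's upper bound, even though the newest hour is exactly
-- where the rows are being added.
EXPLAIN
SELECT *
FROM   events
WHERE  created_at > now() - interval '1 hour';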