Re: [ODBC] [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon

2005-04-24 Thread Marko Ristola
Here is how you can fetch all one billion rows in pieces of 2048 rows.
This changes PostgreSQL and ODBC behaviour. Change the ODBC data source
configuration in the following way:
Fetch = 2048
UseDeclareFetch = 1
It does not produce core dumps on 32-bit computers, even with billions of rows!
This is a bit slower than fetching all rows at once; scalability sometimes
means a bit less speed :(
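As a rough illustration only, here is how those two settings might be used
from a small Python/pyodbc client (the driver name, connection details, and
table are placeholders, not taken from this thread):

    import pyodbc

    # UseDeclareFetch=1 makes the psqlODBC driver read through a cursor;
    # Fetch=2048 is the block size it pulls per server round trip.
    conn = pyodbc.connect(
        "DRIVER={PostgreSQL};SERVER=localhost;PORT=5432;"
        "DATABASE=mydb;UID=me;PWD=secret;"
        "UseDeclareFetch=1;Fetch=2048;"
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM bigtable")
    while True:
        rows = cur.fetchmany(2048)   # consume the result in 2048-row blocks
        if not rows:
            break
        for row in rows:
            pass                     # process each row here
    conn.close()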

With UseDeclareFetch=1 you might get as much as 150 thousand rows per second.
With UseDeclareFetch=0 the backend might be able to send about 200 thousand
rows per second.

These high numbers are reached only if all the results are already in memory
and no disc accesses are needed. They are roughly the peak speeds with VARCHAR,
without Unicode, on an Athlon64 home computer.

With a sequential disc scan, a more typical fetching
speed is about 50-100 thousand rows per second.
PostgreSQL ODBC row fetching speed is very good.
Perhaps with better discs, with RAID10, the current upper limit of about
200 thousand rows per second could be reached?

So the in-memory examples show that the hard disc is normally
the bottleneck, and that bottleneck is on the server side.
My experiments were done on Linux. On Windows, the speed might differ a bit,
but only by a constant factor.

These speeds depend on very many factors, even with a sequential scan.
ODBC speed is affected by the number of columns fetched and by the types of
the columns: integers are processed faster than textual or date columns.

With UseDeclareFetch=1, network latency is amortised by increasing the
Fetch=2048 parameter: with Fetch=1 you get bad performance with lots of rows,
but if you fetch data from the server once per 2048 rows, the network latency
is paid only once per 2048-row block.
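A back-of-the-envelope example of that amortisation (the 0.2 ms round-trip
time below is an assumed figure, not a measurement from this thread):

    # Round trips needed to pull one billion rows, and the latency they add.
    rows = 1_000_000_000
    rtt = 0.0002                             # assumed round trip: 0.2 ms
    for fetch in (1, 100, 2048):
        round_trips = -(-rows // fetch)      # ceiling division
        print(f"Fetch={fetch}: {round_trips} round trips, "
              f"~{round_trips * rtt:,.0f} s of pure latency")

With Fetch=1 the latency alone dominates; with Fetch=2048 it nearly
disappears.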

Regards,
Marko Ristola
Joel Fradkin wrote:
Hate to be dumb, but unfortunately I am.
Could you give me an idea what I should be using, or is there a good
resource for me to check out?
I have been spending so much time with config and moving data, converting,
etc., that I never looked at the ODBC settings (didn't even think about it
until Josh brought it up). I did ask him for his advice, but would love a
second opinion.
Our data is a bit of a mixture: some records have text items, most are
varchars and integers, with a few booleans mixed in.
I am running 8.0.2, so I am not sure whether the protocol should be ODBC or
Postgres.
Thanks for responding; I appreciate any help.

Joel Fradkin
-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Mohan, Ross
Sent: Thursday, April 21, 2005 10:01 AM
To: [EMAIL PROTECTED]
Subject: Re: [ODBC] [PERFORM] Joel's Performance Issues WAS : Opteron vs
Xeon
Joel, thanks. A couple of things jump out at me there;
not a problem for a routine ODBC connection, but
perhaps in the lotsa-stuff context of your current
explorations, they might be relevant?
I am completely shooting from the hip, here, but...if
it were my goose to cook, I'd be investigating
Session(StringConn) =
DRIVER={PostgreSQL};DATABASE=wazagua;SERVER=192.168.123.252;PORT=5432;UID=;
PWD=;ReadOnly=0;Protocol=6.4;
|| Protocol? Is this related to version? Is the driver way old?
FakeOidIndex=0;ShowOidColumn=0;RowVersioning=0;
ShowSystemTables=0;ConnSettings=;Fetch=100;
||  Fetch great for OLTP, lousy for batch?
Socket=4096;UnknownSizes=0;MaxVarcharSize=254;MaxLongVarcharSize=8190;
||  what ARE the datatypes and sizes in your particular case? 

Debug=0;
||  a run with debug=1 probably would spit up something interesting
CommLog=0;Optimizer=1;
||  Optimizer? that's a new one on me
Ksqo=1;UseDeclareFetch=0;TextAsLongVarchar=1;UnknownsAsLongVarchar=0;BoolsAs
Char=1;Parse=0;CancelAsFreeStmt=;ExtraSysTablePrefixes=dd_;LFConversion=1;Up
datableCursors=1;DisallowPremature=0;TrueIsMinus1=0;BI=0;ByteaAsLongVarBinar
y=0;UseServerSidePrepare=0
||  that's about all I can see, prima facie.  I'll be very curious to know
if ODBC is
  any part of your performance equation. 

HTH, 

Ross
-Original Message-
From: Joel Fradkin [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 21, 2005 10:54 AM
To: Mohan, Ross
Cc: [EMAIL PROTECTED]; PostgreSQL Perform
Subject: RE: [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon

Here is the connect string I am using.
It could be horrid, as I cut it from an ODBC program.
Session(StringConn) =
DRIVER={PostgreSQL};DATABASE=wazagua;SERVER=192.168.123.252;PORT=5432;UID=;
PWD=;ReadOnly=0;Protocol=6.4;FakeOidIndex=0;ShowOidColumn=0;RowVersioning=0;
ShowSystemTables=0;ConnSettings=;Fetch=100;Socket=4096;UnknownSizes=0;MaxVar
charSize=254;MaxLongVarcharSize=8190;Debug=0;CommLog=0;Optimizer=1;Ksqo=1;Us
eDeclareFetch=0;TextAsLongVarchar=1;UnknownsAsLongVarchar=0;BoolsAsChar=1;Pa
rse=0;CancelAsFreeStmt=0;ExtraSysTablePrefixes=dd_;LFConversion=1;UpdatableC
ursors=1;DisallowPremature=0;TrueIsMinus1=0;BI=0;ByteaAsLongVarBinary=0;UseS
erverSidePrepare=0
Joel Fradkin
-Original Message-
From: Mohan, Ross [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 21, 2005 9:42 AM

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Marko Ristola
Here is my opinion.
I hope this helps.
Maybe there is no single good formula:
For a boolean type there are at most three distinct values.
There is an upper bound for forenames in one country.
There is an upper bound for last names in one country.
There is a fixed number of states and postal codes in one country.
On the other hand, with a timestamp, every value could be distinct.
A primary key with only one column has only distinct values.
If an integer column refers with a foreign key to another table's
only primary key, we could take advantage of that knowledge.
A column with a unique index has only distinct values.
The first kind of column classifies; the second kind measures continuous
or discrete time, or something like time.
The upper bound for a classifying column might be 3 (boolean), or it might be
one million. The properties of the distribution might be hard to guess.
Here is one way:
1. Find the number of distinct values for 500 rows.
2. Guess how many distinct values there are for 1000 rows.
   Then find the real number of distinct values for 1000 rows.
3. If the guess and the reality differ by more than 50%, repeat the
   iteration for 2x1000 rows.
Iterate, doubling the sample by powers of two, until you trust the
estimate enough.
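A minimal sketch of that iteration in Python (the guess_distinct()
extrapolation below is only a placeholder, not a proposal for the real
formula):

    import random

    def count_distinct(sample):
        return len(set(sample))

    def guess_distinct(d_small, n_small, n_big):
        # Placeholder extrapolation: assume distincts grow with the square
        # root of the sample-size increase.  Any formula can be plugged in.
        return d_small * (n_big / n_small) ** 0.5

    def estimate_distinct(column, start=500, tolerance=0.5):
        n = min(start, len(column))
        d = count_distinct(random.sample(column, n))
        while n < len(column):
            n_next = min(2 * n, len(column))
            guess = guess_distinct(d, n, n_next)
            actual = count_distinct(random.sample(column, n_next))
            # Stop once the guess agrees with the doubled sample within 50%.
            if abs(guess - actual) <= tolerance * actual:
                return actual
            n, d = n_next, actual
        return d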

So, in phase two, you could try to guess with two distinct formulas:
one for classifying targets (boolean columns fall there), and
another for timestamps and numerical values.
If there are one million classifications in one column, how can you
find that out by any means other than checking at least two million
rows?
This means that the user should have a way to specify a lower
bound on the number of rows used for sampling.
Regards,
Marko Ristola
Tom Lane wrote:
Josh Berkus josh@agliodbs.com writes:
 

Overall, our formula is inherently conservative of n_distinct.   That is, I 
believe that it is actually computing the *smallest* number of distinct 
values which would reasonably produce the given sample, rather than the 
*median* one.  This is contrary to the notes in analyze.c, which seem to 
think that we're *overestimating* n_distinct.  
   

Well, the notes are there because the early tests I ran on that formula
did show it overestimating n_distinct more often than not.  Greg is
correct that this is inherently a hard problem :-(
I have nothing against adopting a different formula, if you can find
something with a comparable amount of math behind it ... but I fear
it'd only shift the failure cases around.
regards, tom lane


Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Josh Berkus
Andrew,

 The math in the paper does not seem to look at very low levels of q (=
 sample to pop ratio).

Yes, I think that's the failing.   Mind you, I did more testing and found out 
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy 
(which I would consider acceptable) with a sample size of 25% or more (which 
is infeasible for any large table).  The formula does work for populations 
where D/N is much lower, say 0.01.  So overall it seems to work for only 1/4 
of cases: those where n/N is large and D/N is low.   And, annoyingly, that's 
probably the population where accurate estimation is least crucial, as it 
consists mostly of small tables.

I've just developed (not original, probably, but original to *me*) a formula 
that works on populations where n/N is very small and D/N is moderate (i.e. 
0.1 to 0.4):

N * (d/n)^(sqrt(N/n))

However, I've tested it only on (n/N < 0.005 and D/N > 0.1 and D/N < 0.4) 
populations, and only 3 of them to boot.   I'd appreciate other people trying 
it on their own data populations, particularly very different ones, like 
D/N > 0.7 or D/N < 0.01.
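If anyone wants to try it, here is the formula as a throwaway Python
function (the example numbers at the end are invented, not from my test
populations):

    import math

    def estimate_ndistinct(N, n, d):
        # N = total rows, n = sampled rows, d = distinct values in the sample
        return N * (d / n) ** math.sqrt(N / n)

    # Invented example: 10 million rows, 10,000 sampled, 9,571 distinct seen.
    # Prints roughly 2.5 million estimated distinct values.
    print(estimate_ndistinct(10_000_000, 10_000, 9_571))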

Further, as Andrew points out, we presumably do page sampling rather than 
purely random sampling, so I should probably read the paper he referenced.  
Working on it now ...

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Josh Berkus
Folks,

 I wonder if this paper has anything that might help:
 http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
 were more of a statistician I might be able to answer :-)

Actually, that paper looks *really* promising.   Does anyone here have enough 
math to solve for D(sub)Md on page 6?   I'd like to test it on samples of 
< 0.01%.

Tom, how does our heuristic sampling work?   Is it pure random sampling, or 
page sampling?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Sort and index

2005-04-24 Thread Jim C. Nasby
On Sat, Apr 23, 2005 at 01:00:40AM -0400, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  Feel free to propose better cost equations.
 
  Where would I look in code to see what's used now?
 
 All the gold is hidden in src/backend/optimizer/path/costsize.c.
 
   regards, tom lane

After setting up a second test that orders the table by a highly
non-correlated column, I think I've found part of the problem. The
estimated index scan cost for (project_id, id, date) is
0.00..100117429.34 while the estimate for work_units is
0.00..103168408.62; almost no difference, even though project_id
correlation is .657 while work_units correlation is .116. This is with
random_page_cost set to 1.1; if I set it much higher I can't force the
index scan (BTW, would it make more sense to set the cost of a disabled
seqscan to either pages or tuples * disable_cost?), but even with only a
10% overhead on random page fetches it seems logical that the two
estimates should be much farther apart. If you look at the results of
the initial run (http://stats.distributed.net/~decibel/timing.log),
you'll see that the cost of the index scan is way overestimated. Looking
at the code, run_cost is calculated as

run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

where csquared is indexCorrelation^2. Why is indexCorrelation squared?
The comments say a linear interpolation between min_IO and max_IO is
used, but ISTM that if it were truly linear, indexCorrelation would be
used instead of csquared.
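To make the question concrete, here is a small sketch with invented min/max
I/O costs, using the two correlations from my test; only the shape of the
interpolation matters here:

    min_IO_cost, max_IO_cost = 10_000.0, 100_000_000.0   # invented costs

    for corr in (0.657, 0.116):
        csquared = corr ** 2
        squared_cost = max_IO_cost + csquared * (min_IO_cost - max_IO_cost)
        linear_cost = max_IO_cost + abs(corr) * (min_IO_cost - max_IO_cost)
        print(f"correlation={corr}: squared={squared_cost:,.0f}"
              f"  linear={linear_cost:,.0f}")

    # Squaring shrinks the weight given to correlation (0.657 -> 0.43,
    # 0.116 -> 0.013), so even a fairly well-correlated index keeps a cost
    # much closer to max_IO_cost than a straight linear weight would give,
    # which may be part of why the two estimates above end up so similar.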

By the way, I'm running a test for ordering by work_units right now, and
I included code to allocate and zero 3.3G of memory (out of 4G) between
steps to clear the kernel buffers. This brought the seqscan times up to
~6800 seconds, so it seems there was in fact buffering going on in the
first test. The second test has been running an index scan for over 14
hours now, so clearly a seqscan+sort is the way to go for a highly
uncorrelated index (at least one that won't fit in
effective_cache_size).
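For reference, the buffer-clearing trick between steps can be as simple as
the following sketch (the allocation size and page stride are assumptions;
whether 3.3G is enough depends on how much RAM the box has):

    # Allocate ~3.3 GB and touch every page so the kernel has to evict
    # cached table blocks before the next timing run.
    ballast = bytearray(3_300_000_000)
    for i in range(0, len(ballast), 4096):
        ballast[i] = 1
    del ballast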
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 Tom, how does our heuristic sampling work?   Is it pure random sampling, or 
 page sampling?

Manfred probably remembers better than I do, but I think the idea is
to approximate pure random sampling as best we can without actually
examining every page of the table.

regards, tom lane
