Re: [PERFORM] TPC-R benchmarks

2003-09-29 Thread Mary Edie Meredith
On Mon, 2003-09-29 at 07:35, Oleg Lebedev wrote:
 I left my TPC-R query #17 running over the weekend and it took 3988 minutes
 (~66 hours) to complete. This is with a TPC-R database created at scale
 factor 1, which corresponds to ~1 GB of data. I am running Red Hat 8.0 on a
 dual 1 GHz processor machine with 512 MB RAM.

Was this run with or without the l_partkey index that Jenny suggested? 
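
In case it helps, the index we have in mind is simply the following (a
sketch, using the lineitem column name from the DBT-3 schema; re-analyzing
afterwards just keeps the planner statistics current):

    CREATE INDEX i_l_partkey ON lineitem (l_partkey);
    ANALYZE lineitem;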

 
 Here is an excerpt from my postgresql.conf file (the rest of the
 settings are commented out):
 
 #
 # Shared Memory Size
 #
 shared_buffers = 16384   # 2*max_connections, min 16, typically 8KB each
 
 #
 # Non-shared Memory Sizes
 #
 sort_mem = 32768
 
 #
 # Optimizer Parameters
 #
 effective_cache_size = 32000  # typically 8KB each
 
 Any suggestions on how to optimize these settings?
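
Not tuning advice, just a rough reading of those numbers so we are talking
about the same sizes (assuming the usual 8 KB page size and sort_mem in
kilobytes):

    shared_buffers = 16384        # 16384 * 8 KB  = 128 MB of shared buffers
    sort_mem = 32768              # 32 MB per sort/hash step (a query can use several)
    effective_cache_size = 32000  # planner assumes ~250 MB of OS cache

With 512 MB of RAM and multiple concurrent streams in the throughput test,
sort_mem is the setting to watch, since it applies per sort operation rather
than per backend.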
 
 I agree with Jenny that declaring additional indexes on the TPC-R tables
 may alter the validity of the benchmarks. Are there any official TPC
 benchmarks submitted by PostgreSQL? 

Actually, for TPC-R you _are_ allowed to declare additional indexes.  With
TPC-H you are restricted to the specific set listed in the spec (an index on
l_partkey is allowed under both).

What you cannot do for either TPC-R or TPC-H is rewrite the SQL of the
query for the purposes of making the query run faster.

Sorry if I was unclear.
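
For reference, Q17 as generated is roughly the following (the substitution
parameters vary per the spec; this is only to show the shape of the query,
in particular the correlated subquery the optimizer has to handle):

    SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
    FROM lineitem, part
    WHERE p_partkey = l_partkey
      AND p_brand = 'Brand#23'
      AND p_container = 'MED BOX'
      AND l_quantity < (SELECT 0.2 * avg(l_quantity)
                        FROM lineitem
                        WHERE l_partkey = p_partkey);

The inner subquery recomputes 0.2 * avg(l_quantity) for each qualifying
part, which is why the l_partkey index (and how the planner uses it) matters
so much for this query.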

Valid TPC-R benchmark results are on the TPC web site:
http://www.tpc.org/tpcr/default.asp  

I do not see one for PostgreSQL.


Regards,

Mary 

-- 
Mary Edie Meredith [EMAIL PROTECTED]
Open Source Development Lab

 
 Thanks.
 
 Oleg
 
 -Original Message-
 From: Mary Edie Meredith [mailto:[EMAIL PROTECTED] 
 Sent: Friday, September 26, 2003 10:12 AM
 To: Tom Lane
 Cc: Oleg Lebedev; Jenny Zhang; pgsql-performance
 Subject: Re: [PERFORM] TPC-R benchmarks
 
 
 The TPC-H/R rules allow only minor changes to the SQL that are necessary
 due to SQL implementation differences. They do not allow changes made to
 improve performance.  It is their way of testing the optimizer's ability to
 recognize an inefficient SQL statement and rewrite it internally.
 
 The rule makes sense for TPC-H, which is supposed to represent ad hoc
 queries.  One might argue that for TPC-R, which is supposed to represent
 reporting with prior knowledge of the queries, rewrites should be allowed.
 However, that is currently not the case. Since the RDBMSs represented on
 the TPC council already compete on TPC-H, their optimizers already do this
 rewrite, so (IMHO) there is no motivation to relax the rules for TPC-R.
 
 
 On Thu, 2003-09-25 at 21:28, Tom Lane wrote:
  Oleg Lebedev [EMAIL PROTECTED] writes:
   Seems like in your case postgres uses an i_l_partkey index on the
   lineitem table. I have a foreign key constraint defined between the
   lineitem and part tables, but didn't create any special indexes. Here
   is my query plan:
  
  The planner is obviously unhappy with this plan (note the large cost 
  numbers), but it can't find a way to do better.  An index on 
  lineitem.l_partkey would help, I think.
  
  The whole query seems like it's written in a very inefficient fashion;
  couldn't the estimation of '0.2 * avg(l_quantity)' be amortized across
  multiple join rows?  But I dunno whether the TPC rules allow for
  significant manual rewriting of the given query.
  
  regards, tom lane
  




Re: [PERFORM] TPC-R benchmarks

2003-09-26 Thread Mary Edie Meredith
The TPC-H/R rules allow only minor changes to the SQL that are necessary
due to SQL implementation differences. They do not allow changes made to
improve performance.  It is their way of testing the optimizer's ability to
recognize an inefficient SQL statement and rewrite it internally.

The rule makes sense for TPC-H, which is supposed to represent ad hoc
queries.  One might argue that for TPC-R, which is supposed to represent
reporting with prior knowledge of the queries, rewrites should be allowed.
However, that is currently not the case. Since the RDBMSs represented on the
TPC council already compete on TPC-H, their optimizers already do this
rewrite, so (IMHO) there is no motivation to relax the rules for TPC-R.


On Thu, 2003-09-25 at 21:28, Tom Lane wrote:
 Oleg Lebedev [EMAIL PROTECTED] writes:
  Seems like in your case postgres uses an i_l_partkey index on the lineitem
  table. I have a foreign key constraint defined between the lineitem and
  part tables, but didn't create any special indexes. Here is my query plan:
 
 The planner is obviously unhappy with this plan (note the large cost
 numbers), but it can't find a way to do better.  An index on
 lineitem.l_partkey would help, I think.
 
 The whole query seems like it's written in a very inefficient fashion;
 couldn't the estimation of '0.2 * avg(l_quantity)' be amortized across
 multiple join rows?  But I dunno whether the TPC rules allow for
 significant manual rewriting of the given query.
 
   regards, tom lane
 
-- 
Mary Edie Meredith [EMAIL PROTECTED]
Open Source Development Lab




Re: [osdldbt-general] Re: [PERFORM] [GENERAL] how to get accurate values in pg_statistic

2003-09-10 Thread Mary Edie Meredith
We tried 1000 as the default and found that the plans were good and
consistent, but the pg_statistic contents were not exactly the same from run
to run.

We took Tom's advice and tried setting the seed (actually SELECT setseed(0)).

We did runs last night on our project machine which produced consistent
pg_statistic data and (of course) the same plans.
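
(The comparison itself is nothing fancy; roughly, we dump something like the
following from each run and diff the output -- the exact column list is just
what we happen to look at:

    SELECT tablename, attname, null_frac, n_distinct,
           most_common_vals, histogram_bounds
    FROM pg_stats
    ORDER BY tablename, attname;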

We will next try runs where we vary the number of default buckets.  Other
than 10 and 1000, what numbers would you like us to try?  Previously 100 was
mentioned.  Are there others?
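
Concretely, the per-run sequence is roughly the following sketch (the target
value is what we will vary; setseed keeps the sample, and therefore
pg_statistic, repeatable within the session):

    SELECT setseed(0);                    -- fix the sampling seed, per Tom's suggestion
    SET default_statistics_target = 100;  -- 10 / 100 / 1000 across runs
    ANALYZE;                              -- run in the same session so the seed applies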

On Wed, 2003-09-10 at 12:44, Bruce Momjian wrote:
 Tom Lane wrote:
  Mary Edie Meredith [EMAIL PROTECTED] writes:
   Stephan Szabo kindly responded to our earlier queries suggesting we look
   at default_statistics_target and ALTER TABLE ALTER COLUMN SET
   STATISTICS.  
  
   These determine the number of bins in the histogram for a given column. 
   But for a large number of rows (for example 6 million) the maximum value
   (1000) does not guarantee that ANALYZE will do a full scan of the table.
   We do not see a way to guarantee the same statistics run to run without
   forcing ANALYZE to examine every row of every table.  
  
  Do you actually still have a problem with the plans changing when the
  stats target is above 100 or so?  I think the notion of "forcing ANALYZE
  to do a full scan" is inherently wrongheaded ... it certainly would not
  produce numbers that have anything to do with ordinary practice.
  
  If you have data statistics that are so bizarre that the planner still
  gets things wrong with a target of 1000, then I'd like to know more
  about why.
 
 Has there been any progress in determining if the number of default
 buckets (10) is the best value?
-- 
Mary Edie Meredith [EMAIL PROTECTED]
Open Source Development Lab




[PERFORM] [GENERAL] how to get accurate values in pg_statistic (continued)

2003-09-04 Thread Mary Edie Meredith
Our port of the OSDL DBT-3 test suite to PostgreSQL (see Background
information below) is nearing completion.  We would also like to confirm our
understanding of an outstanding consistency issue.

We have not been able to do meaningful kernel testing because the runs (all
parameters and kernels being equal) are wildly varying - sometimes 20-25%
differences in the metrics from run to run.

We found plans were changing from test run to test run. In one case a plan
ran 20 minutes in the throughput test of one run, and 2 seconds in another
run!  By forcing the contents of pg_statistic to be the same before the
queries run, we now get consistent results. So we know the problem is due to
the random nature of the statistics sampling: the optimizer was seeing
different statistics each time, which resulted in different plans.

Stephan Szabo kindly responded to our earlier queries suggesting we look
at default_statistics_target and ALTER TABLE ALTER COLUMN SET
STATISTICS.  

These determine the number of bins in the histogram for a given column. 
But for a large number of rows (for example 6 million) the maximum value
(1000) does not guarantee that ANALYZE will do a full scan of the table.
We do not see a way to guarantee the same statistics run to run without
forcing ANALYZE to examine every row of every table.  
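
For concreteness, the per-column knob we are referring to looks like this
(the column is just an example; 1000 is the current maximum target):

    ALTER TABLE lineitem ALTER COLUMN l_partkey SET STATISTICS 1000;
    ANALYZE lineitem;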

Are we wrong in our analysis?

Are there mainstream alternatives we have missed?
 
How do you do testing on large tables and make the execution plans
consistent?

Is there a change to ANALYZE in 7.4 that solves our problem?

TIA.



Background information:

Database Test 3 (DBT-3) is a decision support workload.

The test kit itself has been running on PostgreSQL for some time, is
available on SourceForge, and is implemented on our Scalable Test Platform
(STP).


A bit of background:  The test
(1) builds a database from load files and gathers statistics,
(2) runs a single stream of 22 queries plus a set of inserts and deletes
(the power test), then
(3) runs multiple streams of the queries with one added stream of
inserts/deletes (the throughput test).


-- 
Mary Edie Meredith [EMAIL PROTECTED]
Open Source Development Lab

