Re: [PERFORM] TPC-R benchmarks
On Mon, 2003-09-29 at 07:35, Oleg Lebedev wrote:
> I left my TPC-R query #17 running over the weekend and it took 3988
> minutes (roughly 66 hours) to complete. And this is considering that
> I am using a TPC-R database created with a scale factor of 1, which
> corresponds to ~1 GB of data. I am running Red Hat 8.0 on a dual
> 1 GHz machine with 512 MB RAM.

Was this run with or without the l_partkey index that Jenny suggested?

> Here is an excerpt from my postgresql.conf file (the rest of the
> settings are commented out):
>
> # Shared Memory Size
> shared_buffers = 16384        # 2*max_connections, min 16, typically 8KB each
>
> # Non-shared Memory Sizes
> sort_mem = 32768
>
> # Optimizer Parameters
> effective_cache_size = 32000  # typically 8KB each
>
> Any suggestions on how to optimize these settings?
>
> I agree with Jenny that declaring additional indexes on the TPC-R
> tables may alter the validity of the benchmarks. Are there any
> official TPC benchmarks submitted by PostgreSQL?

Actually, for TPC-R you _are_ allowed to declare additional indexes.
With TPC-H you are restricted to a specific set listed in the spec (an
index on l_partkey is allowed for both). What you cannot do, for either
TPC-R or TPC-H, is rewrite the SQL of the query for the purpose of
making it run faster. Sorry if I was unclear.

Valid TPC-R benchmark results are on the TPC web site:
http://www.tpc.org/tpcr/default.asp
I do not see one for PostgreSQL.

Regards,

Mary

--
Mary Edie Meredith [EMAIL PROTECTED]
Open Source Development Lab

> Thanks.
> Oleg

-----Original Message-----
From: Mary Edie Meredith [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 26, 2003 10:12 AM
To: Tom Lane
Cc: Oleg Lebedev; Jenny Zhang; pgsql-performance
Subject: Re: [PERFORM] TPC-R benchmarks

The TPC-H/R rules allow only minor changes to the SQL, namely those
necessary due to SQL implementation differences. They do not allow
changes made to improve performance. It is their way of testing the
optimizer's ability to recognize an inefficient SQL statement and do
the rewrite.

The rule makes sense for TPC-H, which is supposed to represent ad hoc
queries. One might argue that for TPC-R, which is supposed to represent
reporting with prior knowledge of the queries, rewrites should be
allowed. However, that is currently not the case. Since the RDBMSs
represented on the TPC council are competing with TPC-H, their
optimizers already do the rewrite, so (IMHO) there is no motivation to
relax the rules for TPC-R.

On Thu, 2003-09-25 at 21:28, Tom Lane wrote:
Oleg Lebedev [EMAIL PROTECTED] writes:
> Seems like in your case postgres uses an i_l_partkey index on the
> lineitem table. I have a foreign key constraint defined between the
> lineitem and part tables, but didn't create any special indexes.
> Here is my query plan:

The planner is obviously unhappy with this plan (note the large cost
numbers), but it can't find a way to do better. An index on
lineitem.l_partkey would help, I think.

The whole query seems to be written in a very inefficient fashion;
couldn't the estimation of '0.2 * avg(l_quantity)' be amortized across
multiple join rows? But I dunno whether the TPC rules allow significant
manual rewriting of the given query.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can
get through to the mailing list cleanly
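The amortization Tom Lane alludes to can be sketched in a few lines.
TPC-H/R Q17 keeps lineitem rows whose l_quantity falls below 0.2 times
the average l_quantity for the same part; the naive form re-evaluates
that correlated subquery per outer row, while the rewrite computes each
part's threshold once. This is a minimal in-memory sketch, with made-up
sample data, not the benchmark query itself:

```python
# Hypothetical rows: (l_partkey, l_quantity, l_extendedprice).
lineitem = [
    (1, 5, 100.0), (1, 30, 600.0), (1, 40, 800.0),
    (2, 2, 50.0),  (2, 50, 900.0),
]

def q17_naive(rows):
    """Re-evaluates the correlated subquery for every outer row."""
    total = 0.0
    for pk, qty, price in rows:
        same_part = [q for p, q, _ in rows if p == pk]  # inner scan per row
        if qty < 0.2 * sum(same_part) / len(same_part):
            total += price
    return total / 7.0

def q17_rewritten(rows):
    """Computes each part's 0.2*avg(l_quantity) threshold exactly once."""
    sums, counts = {}, {}
    for pk, qty, _ in rows:
        sums[pk] = sums.get(pk, 0.0) + qty
        counts[pk] = counts.get(pk, 0) + 1
    threshold = {pk: 0.2 * sums[pk] / counts[pk] for pk in sums}
    return sum(price for pk, qty, price in rows if qty < threshold[pk]) / 7.0

assert q17_naive(lineitem) == q17_rewritten(lineitem)
```

Both forms return the same answer; the rewritten one does a single pass
over lineitem instead of one inner scan per row, which is exactly the
transformation the benchmark rules expect the optimizer to discover on
its own.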
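As a rough sanity check on the postgresql.conf excerpt quoted earlier
in this thread (7.3-era parameter names, 8 KB pages assumed), the
settings translate to the following memory footprint on Oleg's 512 MB
machine:

```python
KB, MB = 1024, 1024 * 1024
PAGE = 8 * KB  # default PostgreSQL block size

shared_buffers_bytes = 16384 * PAGE        # shared buffer cache
sort_mem_bytes = 32768 * KB                # per sort/hash operation!
effective_cache_size_bytes = 32000 * PAGE  # planner hint, not an allocation

print(f"shared_buffers       = {shared_buffers_bytes // MB} MB")
print(f"sort_mem             = {sort_mem_bytes // MB} MB per sort")
print(f"effective_cache_size = {effective_cache_size_bytes // MB} MB")
```

That is 128 MB of shared buffers plus 32 MB per concurrent sort, with
the planner told to expect a ~250 MB OS cache - worth keeping in mind
on a 512 MB box, since a multi-sort query plan can easily claim several
sort_mem allocations at once.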
Re: [osdldbt-general] Re: [PERFORM] [GENERAL] how to get accurate
We tried 1000 as the default and found that the plans were good and
consistent, but the pg_statistic contents were not exactly the same. We
took Tom's advice and tried SET SEED=0 (actually select setseed(0)). We
did runs last night on our project machine which produced consistent
pg_statistic data and (of course) the same plans.

We will next try runs where we vary the default number of buckets.
Other than 10 and 1000, what numbers would you like us to try?
Previously the number 100 was mentioned. Are there others?

On Wed, 2003-09-10 at 12:44, Bruce Momjian wrote:
Tom Lane wrote:
Mary Edie Meredith [EMAIL PROTECTED] writes:
> Stephan Szabo kindly responded to our earlier queries, suggesting we
> look at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> STATISTICS. These determine the number of bins in the histogram for a
> given column. But for a large number of rows (for example 6 million)
> the maximum value (1000) does not guarantee that ANALYZE will do a
> full scan of the table. We do not see a way to guarantee the same
> statistics run to run without forcing ANALYZE to examine every row of
> every table.

Do you actually still have a problem with the plans changing when the
stats target is above 100 or so? I think the notion of forcing ANALYZE
to do a full scan is inherently wrongheaded ... it certainly would not
produce numbers that have anything to do with ordinary practice. If you
have data statistics that are so bizarre that the planner still gets
things wrong with a target of 1000, then I'd like to know more about
why.

Has there been any progress in determining if the number of default
buckets (10) is the best value?

--
Mary Edie Meredith [EMAIL PROTECTED]
Open Source Development Lab
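Why pinning the seed works: ANALYZE draws a random sample of rows using
the backend's random generator, so two runs see different samples and
can produce different pg_statistic rows unless the seed is fixed first.
The effect can be modeled with Python's random module (the table and
sample size here are stand-ins, not PostgreSQL's actual sampling code):

```python
import random

table = list(range(1_000_000))  # stand-in for a large table

def analyze_sample(seed=None, sample_size=3000):
    """Toy model of ANALYZE: draw a row sample, optionally seeded."""
    rng = random.Random(seed)
    return sorted(rng.sample(table, sample_size))

# Unseeded runs draw different samples, hence different statistics
# and potentially different plans (equality is astronomically unlikely).
assert analyze_sample() != analyze_sample()

# Seeded runs draw identical samples: identical statistics, same plans.
assert analyze_sample(seed=0) == analyze_sample(seed=0)
```

This mirrors the observed behavior: select setseed(0) before ANALYZE
makes the sample, and therefore pg_statistic and the plans, repeatable
run to run.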
[PERFORM] [GENERAL] how to get accurate values in pg_statistic (continued)
Our port of the OSDL DBT-3 test suite to PostgreSQL (see "Background
information" below) is nearing completion. We would also like to
confirm our understanding of an outstanding consistency issue.

We have not been able to do meaningful kernel testing, since the runs
(all parameters/kernels being equal) are wildly varying - sometimes
20-25% differences in the metrics run to run. We found plans were
changing from test run to test run. In one case a plan ran 20 minutes
in the throughput test of one run, and 2 seconds in another run! By
forcing the contents of pg_statistic to be the same before the queries
run, we now get consistent results. So we know for sure the problem is
due to the random nature of the stats sampling: the optimizer saw
different statistics on each run, resulting in different plans.

Stephan Szabo kindly responded to our earlier queries, suggesting we
look at default_statistics_target and ALTER TABLE ALTER COLUMN SET
STATISTICS. These determine the number of bins in the histogram for a
given column. But for a large number of rows (for example 6 million)
the maximum value (1000) does not guarantee that ANALYZE will do a full
scan of the table. We do not see a way to guarantee the same statistics
run to run without forcing ANALYZE to examine every row of every table.

Are we wrong in our analysis? Are there mainstream alternatives we have
missed? How do you do testing on large tables and keep the execution
plans consistent? Is there a change to ANALYZE in 7.4 that solves our
problem? TIA.

Background information: Database Test 3 (DBT-3) is a decision support
workload. The test kit itself has been executing on PostgreSQL for some
time, is available on SourceForge, and is implemented on our Scalable
Test Platform (STP).
A bit of background: the test (1) builds a database from load files and
gathers statistics, (2) runs a single stream of the 22 queries plus a
set of inserts and deletes (the power test), then (3) runs multiple
streams of the queries with one added stream of inserts/deletes (the
throughput test).

--
Mary Edie Meredith [EMAIL PROTECTED]
Open Source Development Lab
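The bins-versus-sampling distinction above can be illustrated directly:
the statistics target controls how many histogram buckets are stored,
but ANALYZE builds those buckets from a row sample, so two samples of
the same table yield slightly different bucket boundaries. A minimal
sketch with synthetic data (not PostgreSQL's actual histogram code):

```python
import random

def equi_depth_boundaries(values, buckets):
    """Boundaries splitting the sorted values into equal-count buckets."""
    vs = sorted(values)
    return [vs[i * len(vs) // buckets] for i in range(1, buckets)]

rng = random.Random(42)
table = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # synthetic column

full = equi_depth_boundaries(table, 10)  # what a full scan would store
sample_a = equi_depth_boundaries(random.Random(1).sample(table, 3000), 10)
sample_b = equi_depth_boundaries(random.Random(2).sample(table, 3000), 10)

# Different samples give slightly different boundaries, and neither
# matches the full-scan histogram exactly - which is why repeated
# ANALYZE runs can tip the planner into different plans.
assert sample_a != sample_b
assert sample_a != full
```

Raising the bucket count sharpens the histogram's resolution, but as
the thread notes, it does not make ANALYZE scan every row, so the
run-to-run jitter remains unless the sample itself is made repeatable.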