Marc,

You should expect that for the kind of OLAP workload you describe in steps 2 and 3 you will have exactly one CPU working for you in Postgres.
If you want to accelerate this processing by a factor of 100 or more on this machine, you should try Greenplum DB, which is Postgres 8.2 compatible. Based on the overall setup you describe, you could run a hybrid installation, with GPDB doing the reporting / OLAP workload and the other Postgres databases handling the customer workloads.

- Luke

On 7/24/07 7:38 AM, "Marc Mamin" <[EMAIL PROTECTED]> wrote:

> Hello,
>
> Thank you for all your comments and recommendations.
>
> I'm aware that the conditions for this benchmark are not ideal, mostly
> due to the lack of time to prepare it. We will also need an additional
> benchmark on a less powerful - more realistic - server to better
> understand the scalability of our application.
>
> Our application is based on Java and generates dynamic reports from
> log file content. Dynamic means here that a report is calculated from
> the Postgres data the first time it is requested (it is then cached).
> Java is used to drive the data preparation and to handle/generate the
> report requests.
>
> For our performance concerns this is much more an OLAP system than an
> OLTP one.
>
> Data preparation:
>
> 1) Parsing the log files, with heavy use of Perl (regular expressions),
> to generate CSV files. Prepared statements also maintain reference
> tables in the DB. Postgres performance is not an issue for this first
> step.
>
> 2) Loading the CSV files with COPY. As around 70% of the data to load
> goes into a single daily table, we don't allow concurrent jobs for this
> step. We have between a few and a few hundred files to load into a
> single table; they are processed one after the other. A primary key is
> always defined; when the required indexes are already built and the new
> data are above a given size, we use a "shadow" table instead (without
> the indexes), build the indexes after the import and then replace the
> live table with the shadow one (see the sketch appended below).
> For example, we have a table of 13 GB plus 11 GB of indexes (5 pieces).
>
> Performance:
>
> a) Is there an "ideal" size to consider for our CSV files (100 x 10 MB,
> or rather 1 x 1 GB)?
> b) maintenance_work_mem: I'll use around 1 GB as recommended by Stefan.
>
> 3) Data aggregation. This is the heaviest part for Postgres. On our
> current system some queries take more than one hour, with phases of
> around 100% CPU use alternating with phases of heavy I/O load when
> temporary results are written to and read from disk (pgsql_tmp). During
> the aggregation, other Postgres activity is low (at least it should
> be), as this takes place at night. Currently we have a locking
> mechanism to avoid having more than one such query running
> concurrently. This may be too strict for the benchmark server but
> better reflects our current hardware capabilities.
>
> Performance: here we should favor a single huge transaction and assume
> a low probability of another transaction requiring large sort space at
> the same time. Considering this, is it reasonable to set work_mem to
> 3 GB? (I guess I should raise this parameter dynamically before running
> the aggregation queries; see the work_mem sketch appended below.)
>
> 4) Queries (report generation)
>
> Only a few requests are not satisfying, namely those requiring large
> sort operations. The data are structured in different aggregation
> levels (minutes, hours, days) with logical time-based partitions in
> order to limit the data size to compute for a given report (see the
> partitioning sketch appended below).
> Moreover we can scale our infrastructure by using different or
> dedicated Postgres servers for different customers. Smaller customers
> may share the same instance, each of them having its own schema (the
> lock mechanism for large aggregations applies to a whole Postgres
> instance, not to a single customer). The benchmark will help us to plan
> such a distribution.
>
> During the benchmark, we will probably not have more than 50 non-idle
> connections simultaneously. It is a bit too early for us to fine-tune
> this part. The benchmark will mainly focus on steps 1 to 3.
>
> During the benchmark, the DB will reach a size of about 400 GB,
> simulating 3 different customers, with the data split roughly equally
> across 3 schemas.
>
> I will post our configuration(s) later on.
>
> Thanks again for all your valuable input.
>
> Marc Mamin
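
Appended below are rough SQL sketches of the techniques described above. All
table, column and file names are invented for illustration; they are not from
the original post.

For step 2, a minimal sketch of one reading of the "shadow table" load: copy
the existing rows into an index-less shadow table, bulk-load the new CSV files
with COPY, build the indexes once at the end, and swap the tables. Whether the
existing rows really need to be copied over first depends on how the swap is
done in practice.

-- Hypothetical table names; the live table is daily_stats.
CREATE TABLE daily_stats_shadow (LIKE daily_stats);

-- Optionally carry over the existing rows (no indexes yet, so this is a
-- plain sequential write).
INSERT INTO daily_stats_shadow SELECT * FROM daily_stats;

-- Bulk-load the new CSV files one after the other (server-side COPY shown;
-- \copy from psql does the same from the client side).
COPY daily_stats_shadow FROM '/data/csv/part_001.csv' WITH CSV;
-- ... repeat for the remaining files ...

-- Build the indexes only once, after all data are in.
ALTER TABLE daily_stats_shadow ADD PRIMARY KEY (log_time, source_id);
CREATE INDEX daily_stats_shadow_source_idx ON daily_stats_shadow (source_id);
ANALYZE daily_stats_shadow;

-- Swap the tables; the renames are quick, so the live table is only briefly
-- unavailable.
BEGIN;
ALTER TABLE daily_stats RENAME TO daily_stats_old;
ALTER TABLE daily_stats_shadow RENAME TO daily_stats;
COMMIT;
DROP TABLE daily_stats_old;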
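
For step 3, a sketch of raising work_mem only for the session or transaction
that runs the nightly aggregation, with an invented aggregation query. Keep in
mind that work_mem is allotted per sort or hash operation, so a complex plan
can use several multiples of it, and the maximum value the server accepts for
this setting depends on platform and version.

BEGIN;

-- Affects only this transaction; other connections keep the modest default
-- from postgresql.conf. Plain SET (without LOCAL) would affect the whole
-- session instead. If the server rejects '3GB', the platform's upper limit
-- for work_mem has been reached; use the largest accepted value instead.
SET LOCAL work_mem = '3GB';

-- Invented example of an aggregation step: roll minute-level data up to hours.
INSERT INTO stats_hourly (hour, source_id, hits, bytes)
SELECT date_trunc('hour', log_time), source_id, sum(hits), sum(bytes)
FROM   stats_minutely
WHERE  log_time >= '2007-07-24' AND log_time < '2007-07-25'
GROUP  BY 1, 2;

COMMIT;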
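
For step 4, one way to implement the "logical time based partitions" on 8.x is
table inheritance plus constraint exclusion; the sketch below uses invented
names and one child table per day, and is only one possible layout.

-- Parent table holds no rows itself; the children carry the data.
CREATE TABLE stats_minutely (
    log_time   timestamp NOT NULL,
    source_id  integer   NOT NULL,
    hits       bigint,
    bytes      bigint
);

-- One child per day; the CHECK constraint is what lets the planner skip
-- partitions outside a report's time range.
CREATE TABLE stats_minutely_20070724 (
    CHECK (log_time >= '2007-07-24' AND log_time < '2007-07-25')
) INHERITS (stats_minutely);

CREATE TABLE stats_minutely_20070725 (
    CHECK (log_time >= '2007-07-25' AND log_time < '2007-07-26')
) INHERITS (stats_minutely);

-- constraint_exclusion is off by default in 8.2 and must be enabled for the
-- planner to prune partitions.
SET constraint_exclusion = on;

-- Only the 2007-07-24 child is scanned for this report query.
SELECT source_id, sum(hits)
FROM   stats_minutely
WHERE  log_time >= '2007-07-24' AND log_time < '2007-07-25'
GROUP  BY source_id;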