Re: [PERFORM] need to speed up query

2008-05-06 Thread PFC


i've had to write queries to get trial balance values out of the GL  
transaction table and i'm not happy with its performance. The table has  
76K rows, growing about 1000 rows per working day, so the performance is  
not that great: it takes about 20 to 30 seconds to get all the records  
for the table, and when we limit it to a single accounting period it drops  
down to 2 seconds.


	What is a period ? Is it a month, or something more custom ? Can  
periods overlap ?



COALESCE(( SELECT sum(gltrans.gltrans_amount) AS sum
FROM gltrans
WHERE gltrans.gltrans_date < period.period_start
AND gltrans.gltrans_accnt_id = accnt.accnt_id
AND gltrans.gltrans_posted = true), 0.00)::text::money AS  
beginbalance,


	Note that here you are scanning the entire table multiple times, the  
complexity of this is basically (rows in gltrans)^2 which is something  
you'd like to avoid.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] need to speed up query

2008-05-06 Thread Justin



PFC wrote:


i've had to write queries to get trial balance values out of the GL 
transaction table and i'm not happy with its performance. The table 
has 76K rows, growing about 1000 rows per working day, so the 
performance is not that great: it takes about 20 to 30 seconds to get 
all the records for the table, and when we limit it to a single 
accounting period it drops down to 2 seconds.


What is a period ? Is it a month, or something more custom ? 
Can periods overlap ?

No, periods can never overlap.  If they did you would be in
violation of many tax laws around the world. Plus you would not know
how much money you are making or losing.
Generally, yes, an accounting period is a normal calendar month, but you
can have 13 periods in a normal calendar year: 52 weeks in a year / 4
weeks in a month = 13 periods, or 13 months, in a fiscal calendar year.
This means if someone is using a 13-period fiscal accounting year the
start and end dates are offset from a normal calendar.
To make this really funky, you can have a fiscal calendar year start on
June 15 2008 and end on June 14 2009.

http://en.wikipedia.org/wiki/Fiscal_year



COALESCE(( SELECT sum(gltrans.gltrans_amount) AS sum
FROM gltrans
WHERE gltrans.gltrans_date < period.period_start
AND gltrans.gltrans_accnt_id = accnt.accnt_id
AND gltrans.gltrans_posted = true), 0.00)::text::money AS 
beginbalance,


Note that here you are scanning the entire table multiple times, 
the complexity of this is basically (rows in gltrans)^2 which is 
something you'd like to avoid.



For accounting purposes you need to know the Beginning Balance, Debits,
Credits, the Difference between Debits and Credits, and the Ending
Balance for each account.  We have 133 accounts with presently 12
periods defined, so we end up with 1596 rows returned for this query.

So period 1 should for the most part have zero for Beginning Balances
for most types of accounts.  Period 2's Beginning Balance is Period 1's
Ending Balance, Period 3's is Period 2's ending balance, and so on
forever.






--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] need to speed up query

2008-05-06 Thread PFC


What is a period ? Is it a month, or something more custom ?  
Can periods overlap ?


No, periods can never overlap.  If they did you would be in   
violation of many tax laws around the world. Plus you would not know  
how much money you are making or losing.


	I was wondering if you'd be using the same query to compute how much was  
gained every month and every week, which would have complicated things.

But now it's clear.

To make this really funky you can have a Fiscal  Calendar year start  
June 15 2008 and end on June 14 2009


	Don't you just love those guys ? Always trying new tricks to make your  
life more interesting.


Note that here you are scanning the entire table multiple times,  
the complexity of this is basically (rows in gltrans)^2 which is  
something you'd like to avoid.


For accounting purposes you need to know the Beginning Balance,  
Debits, Credits, the Difference between Debits and Credits, and the Ending  
Balance for each account.  We have 133 accounts with presently 12  
periods defined, so we end up with 1596 rows returned for this query.


Alright, I propose a solution which only works when periods don't 
overlap.
	It will scan the entire table, but only once, not many times as your  
current query does.


So period 1 should for the most part have zero for Beginning  
Balances for most types of accounts.  Period 2's Beginning Balance is  
Period 1's Ending Balance, Period 3's is Period 2's ending balance, and  
so on forever.


	Precisely. So, it is not necessary to recompute everything for each  
period.
	Use the previous period's ending balance as the current period's starting  
balance...


There are several ways to do this.
	First, you could use your current query, but only compute the sum of what  
happened during a period, for each period, and store that in a temporary  
table.
	Then, you use a plpgsql function, or you do that in your client, you take  
the rows in chronological order, you sum them as they come, and you get  
your balances. Use a NUMERIC type, not a FLOAT, to avoid rounding errors.


	The other solution does the same thing but optimizes the first step like  
this :

INSERT INTO temp_table SELECT period, sum(...) GROUP BY period

	To do this you must be able to compute the period from the date and not  
the other way around. You could store a period_id in your table, or use a  
function.
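
	A minimal sketch of that two-step approach, using the gltrans, period  
and accnt names that appear earlier in this thread (they are assumptions,  
not a verified schema), and avoiding features 8.3 does not have:

-- Step 1: one scan of gltrans, summing posted activity per account and period.
CREATE TEMP TABLE period_activity AS
SELECT p.period_id,
       g.gltrans_accnt_id AS accnt_id,
       sum(g.gltrans_amount)::numeric AS period_amount
  FROM gltrans g
  JOIN period p ON g.gltrans_date >= p.period_start
               AND g.gltrans_date <= p.period_end
 WHERE g.gltrans_posted = true
 GROUP BY p.period_id, g.gltrans_accnt_id;

-- Step 2: walk the rows in chronological order per account and accumulate
-- the running (ending) balance; 8.3 has no window functions, so plpgsql.
CREATE OR REPLACE FUNCTION running_balances(
    OUT accnt_id integer, OUT period_id integer, OUT endbalance numeric)
RETURNS SETOF record AS $$
DECLARE
    r          RECORD;
    last_accnt integer := NULL;
BEGIN
    FOR r IN
        SELECT a.accnt_id, a.period_id, a.period_amount
          FROM period_activity a
          JOIN period p ON a.period_id = p.period_id
         ORDER BY a.accnt_id, p.period_start
    LOOP
        IF last_accnt IS DISTINCT FROM r.accnt_id THEN
            endbalance := 0;               -- new account: restart the sum
            last_accnt := r.accnt_id;
        END IF;
        endbalance := endbalance + r.period_amount;
        accnt_id   := r.accnt_id;
        period_id  := r.period_id;
        RETURN NEXT;                       -- one row per account and period
    END LOOP;
END;
$$ LANGUAGE plpgsql;

	SELECT * FROM running_balances() then yields one ending balance per  
account and period; a period's beginning balance is simply the previous  
period's endbalance for the same account.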


	Another much more efficient solution would be to have a summary table  
which keeps the summary data for each period, with beginning balance and  
end balance. This table will only need to be updated when someone finds an  
old receipt in their pocket or something.


This falls under the stupid question category, but i'm just curious: what  
do other people think makes a query complex?


	I have some rather complex queries which postgres burns in a few  
milliseconds.
	You could define complexity as the amount of brain sweat that went into  
writing that query.
	You could also define complexity as O(n) or O(n^2) etc.; for instance your  
query (as written) is O(n^2), which is something you don't want. I've seen  
stuff that was O(2^n) or worse, O(n!), in software written by drunk  
students; in this case getting rid of it is an emergency...


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] need to speed up query

2008-05-06 Thread Shaun Thomas
On Tue, 2008-05-06 at 03:01 +0100, Justin wrote:

 i've had to write queries to get trial balance values out of the GL
 transaction table and i'm not happy with its performance

Go ahead and give this a try:

SELECT p.period_id, p.period_start, p.period_end, a.accnt_id,
   a.accnt_number, a.accnt_descrip, p.period_yearperiod_id,
   a.accnt_type,
   SUM(CASE WHEN g.gltrans_date < p.period_start
THEN g.gltrans_amount ELSE 0.0
   END)::text::money AS beginbalance,
   SUM(CASE WHEN g.gltrans_date < p.period_end
 AND g.gltrans_date >= p.period_start
 AND g.gltrans_amount <= 0::numeric
THEN g.gltrans_amount ELSE 0.0
   END)::text::money AS negative,
   SUM(CASE WHEN g.gltrans_date <= p.period_end
 AND g.gltrans_date >= p.period_start
 AND g.gltrans_amount >= 0::numeric
THEN g.gltrans_amount ELSE 0.0
   END)::text::money AS positive,
   SUM(CASE WHEN g.gltrans_date <= p.period_end
 AND g.gltrans_date >= p.period_start
THEN g.gltrans_amount ELSE 0.0
   END)::text::money AS difference,
   SUM(CASE WHEN g.gltrans_date <= p.period_end
THEN g.gltrans_amount ELSE 0.0
   END)::text::money AS endbalance
  FROM period p
 CROSS JOIN accnt a
  LEFT JOIN gltrans g ON (g.gltrans_accnt_id = a.accnt_id
  AND g.gltrans_posted = true)
 GROUP BY p.period_id, p.period_start, p.period_end, a.accnt_id,
   a.accnt_number, a.accnt_descrip, p.period_yearperiod_id, a.accnt_type
 ORDER BY p.period_id, a.accnt_number;

Depending on how the planner saw your old query, it may have forced
several different sequence or index scans to get the information from
gltrans.  One thing all of your subqueries had in common was a join on
the account id and listing only posted transactions.  It's still a big
gulp, but it's only one gulp.

The other thing I did was that I guessed you added the coalesce clause
because the subqueries individually could return null rowsets for
various groupings, and you wouldn't want that.  This left-join solution
only lets it add to your various sums if it matches all the conditions,
otherwise it falls through the list of cases until nothing matches.  If
some of your transactions can have null amounts, you might consider
turning g.gltrans_amount into COALESCE(g.gltrans_amount, 0.0) instead.

Otherwise, this *might* work; without knowing more about your schema,
it's only a guess.  I'm a little skeptical about the conditionless
cross-join, but whatever.

Either way, by looking at this query, it looks like some year-end
summary piece, or an at-a-glance idea of your account standings.  The
problem you're going to have with this is that there's no way to truly
optimize this.  One way or another, you're going to incur some
combination of three sequence scans or three index scans; if those
tables get huge, you're in trouble.  You might want to consider a
denormalized summary table that contains this information (and maybe
more) maintained by a trigger or regularly invoked stored-procedure and
then you can select from *that* with much less agony.
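
A hedged sketch of such a trigger-maintained summary table, in case it
helps: the layout and names below are invented for the example (the
gltrans and period column names come from this thread, not a verified
schema), it only covers INSERTs, and it is not concurrency-safe; a real
version would also need UPDATE/DELETE triggers and a periodic rebuild to
catch back-dated corrections.

CREATE TABLE gl_period_summary (
    period_id integer NOT NULL,
    accnt_id  integer NOT NULL,
    activity  numeric NOT NULL DEFAULT 0,
    PRIMARY KEY (period_id, accnt_id)
);

CREATE OR REPLACE FUNCTION gl_summary_trig() RETURNS trigger AS $$
BEGIN
    IF NEW.gltrans_posted THEN
        -- Add the new transaction to its period/account bucket...
        UPDATE gl_period_summary s
           SET activity = activity + NEW.gltrans_amount
          FROM period p
         WHERE p.period_start <= NEW.gltrans_date
           AND p.period_end   >= NEW.gltrans_date
           AND s.period_id = p.period_id
           AND s.accnt_id  = NEW.gltrans_accnt_id;
        -- ...or create the bucket if it does not exist yet.
        IF NOT FOUND THEN
            INSERT INTO gl_period_summary (period_id, accnt_id, activity)
            SELECT p.period_id, NEW.gltrans_accnt_id, NEW.gltrans_amount
              FROM period p
             WHERE p.period_start <= NEW.gltrans_date
               AND p.period_end   >= NEW.gltrans_date;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gltrans_summary_ins
AFTER INSERT ON gltrans
FOR EACH ROW EXECUTE PROCEDURE gl_summary_trig();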

Then there's fact-tables, but that's beyond the scope of this email. ;)

Good luck!

-- 

Shaun Thomas
Database Administrator

Leapfrog Online 
807 Greenwood Street 
Evanston, IL 60201 
Tel. 847-440-8253
Fax. 847-570-5750
www.leapfrogonline.com



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] need to speed up query

2008-05-06 Thread Justin
It worked. It had a couple of missing parts, but it worked and ran in 3.3 
seconds.  *Thanks for this.*
I need to review the result and balance it against my results, as the 
accountant already went through and balanced some accounts by hand to 
verify my results.


begin quote

You might want to consider a
denormalized summary table that contains this information (and maybe
more) maintained by a trigger or regularly invoked stored-procedure and
then you can select from *that* with much less agony.

end quote

I just dumped the summary table because it kept getting out of balance 
all the time and was missing accounts that did not have transactions in 
them for a given period.  Again, I did not lay out the table nor the old 
code, which was terrible and did not work correctly.   I tried several 
times to fix the summary table, but too many things allowed it to get 
out of sync.  Keeping the Ending and Beginning Balance correct was too 
much trouble and I needed to get numbers we can trust to the accountant. 

The developers of the code got credits and debits backwards, so instead 
of fixing the code they just added code to flip the values on the front 
end.  It's really annoying.  At this point, if I could go back 7 months 
I would not have purchased this software, had I known what I know now.


I've had to make all kinds of changes I never intended to make in order 
to get the stuff to balance and agree. I've spent the last 3 months in 
code review, fixing things that allow accounts to get out of balance and 
stopping stupid things from happening, like posting GL transactions into 
non-existent accounting periods.  The list of things I have to fix is 
getting damn long.





Re: [PERFORM] need to speed up query

2008-05-06 Thread Justin



PFC wrote:


What is a period ? Is it a month, or something more custom ? 
Can periods overlap ?


No, periods can never overlap.  If they did you would be in  
violation of many tax laws around the world. Plus you would not 
know how much money you are making or losing.


I was wondering if you'd be using the same query to compute how 
much was gained every month and every week, which would have 
complicated things.

But now it's clear.

To make this really funky you can have a Fiscal  Calendar year start 
June 15 2008 and end on June 14 2009


Don't you just love those guys ? Always trying new tricks to make 
your life more interesting.


That's been around a long time.  You can go back a few 
hundred years.



Note that here you are scanning the entire table multiple times, 
the complexity of this is basically (rows in gltrans)^2 which is 
something you'd like to avoid.


For accounting purposes you need to know the Beginning Balance, 
Debits, Credits, the Difference between Debits and Credits, and the 
Ending Balance for each account.  We have 133 accounts with 
presently 12 periods defined, so we end up with 1596 rows returned for this 
query.


Alright, I propose a solution which only works when periods don't 
overlap.
It will scan the entire table, but only once, not many times as 
your current query does.


So period 1 should for the most part have zero for Beginning 
Balances for most types of accounts.  Period 2's Beginning Balance 
is Period 1's Ending Balance, Period 3's is Period 2's ending balance, 
and so on forever.


Precisely. So, it is not necessary to recompute everything for 
each period.
Use the previous period's ending balance as the current period's 
starting balance...


There are several ways to do this.
First, you could use your current query, but only compute the sum 
of what happened during a period, for each period, and store that in a 
temporary table.
Then, you use a plpgsql function, or you do that in your client, 
you take the rows in chronological order, you sum them as they come, 
and you get your balances. Use a NUMERIC type, not a FLOAT, to avoid 
rounding errors.


The other solution does the same thing but optimizes the first 
step like this :

INSERT INTO temp_table SELECT period, sum(...) GROUP BY period

To do this you must be able to compute the period from the date 
and not the other way around. You could store a period_id in your 
table, or use a function.


Another much more efficient solution would be to have a summary 
table which keeps the summary data for each period, with beginning 
balance and end balance. This table will only need to be updated when 
someone finds an old receipt in their pocket or something.




As I posted earlier, the software did do this, but it has so many bugs 
elsewhere in the code that it lets the summary get out of balance with what 
really is happening.   I spent several weeks trying to get this working and 
find all the places it went wrong.  I gave up and wrote this query, which 
took a day to write and balance to the point that I turned it over to the 
accountant.   I redid the front end and I'm off to the races and fixing 
other critical problems.


All I need to do is take Shaun Thomas's code and replace the view this 
select statement creates.



This falls under the stupid question category, but i'm just curious: what 
do other people think makes a query complex?


I have some rather complex queries which postgres burns in a few 
milliseconds.
You could define complexity as the amount of brain sweat that went 
into writing that query.
You could also define complexity as O(n) or O(n^2) etc.; for 
instance your query (as written) is O(n^2), which is something you 
don't want. I've seen stuff that was O(2^n) or worse, O(n!), in 
software written by drunk students; in this case getting rid of it is 
an emergency...




Thanks for your help and ideas; I really appreciate it.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] need to speed up query

2008-05-05 Thread Gregory Williamson
Justin --

You wrote:
  
 i've had to write queries to get trial balance values out of the GL 
 transaction table and i'm not happy with its performance 
 
 
 The table has 76K rows growing about 1000 rows per working day so the 
 performance is not that great: it takes about 20 to 30 seconds to get all 
 the records for the table and when we limit it to a single accounting 
 period it drops down to 2 seconds

So 30 seconds for 76 days (roughly) worth of numbers ? Not terrible but not 
great.

 Here is the query and explain .  PostgreSql  is 8.3.1 on new server with 
 raid 10 Serial SCSI.
... snipped 'cause I have a lame reader ...

   Sort Method:  quicksort  Memory: 292kB
...snip...
 Total runtime: 24682.580 ms


I don't have any immediate thoughts but maybe you could post the table schemas 
and indexes. It looks to my untutored eye as if most of the estimates are fair 
so I am guessing that you have run analyze recently.

What is your sort memory set to ? If work_mem is too low then you'll go to disk 
(if you see tmp files under the postgres $PGDATA/base directory you might be 
seeing the result of this) ...

HTH

Greg Williamson
Senior DBA
DigitalGlobe

Confidentiality Notice: This e-mail message, including any attachments, is for 
the sole use of the intended recipient(s) and may contain confidential and 
privileged information and must be protected in accordance with those 
provisions. Any unauthorized review, use, disclosure or distribution is 
prohibited. If you are not the intended recipient, please contact the sender by 
reply e-mail and destroy all copies of the original message.

(My corporate masters made me say this.)



Re: [PERFORM] need to speed up query

2008-05-05 Thread Justin

Yes, the cross join is intentional.

Thanks; creating the two-column index dropped processing time to 15 to 17 
seconds,

and put the per-period query down to 1 second.



Scott Marlowe wrote:

You're joining these two tables: period, accnt, but I'm not seeing an
on () clause or a where clause joining them.  Is the cross product
intentional?

But what I'm seeing that seems like the lowest hanging fruit would be
two column indexes on the bits that are showing up in those bit map
scans.  Like this part:

  Recheck Cond: ((gltrans_date <= $3) AND
(gltrans_date >= $0) AND (gltrans_accnt_id = $1))
  Filter: gltrans_posted
  ->  BitmapAnd  (cost=38.90..38.90 rows=10
width=0) (actual time=0.839..0.839 rows=0 loops=1729)
->  Bitmap Index Scan on
gltrans_gltrans_date_idx  (cost=0.00..8.08 rows=382 width=0) (actual
time=0.782..0.782 rows=5872 loops=1729)
  Index Cond: ((gltrans_date <= $3)
AND (gltrans_date >= $0))
->  Bitmap Index Scan on
gltrans_gltrans_accnt_id_idx  (cost=0.00..30.57 rows=1908 width=0)
(actual time=0.076..0.076 rows=574 loops=798)
  Index Cond: (gltrans_accnt_id = $1)

You are looking through 574 rows in one column and 5872 in another.
But when they're anded together, you get 0 rows.  A two column index
there should really help.
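
Something along these lines, using the column names visible in the plan
above (the index name and the optional posted-only predicate are guesses):

CREATE INDEX gltrans_accnt_date_idx
    ON gltrans (gltrans_accnt_id, gltrans_date)
 WHERE gltrans_posted;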

  


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] need to speed up query

2008-05-05 Thread Justin

Gregory Williamson wrote:


Justin --

You wrote:
 
 i've had to write queries to get trial balance values out of the GL

 transaction table and i'm not happy with its performance


 The table has 76K rows growing about 1000 rows per working day so the
 performance is not that great it takes about 20 to 30 seconds to get all
 the records for the table and when we limit it to single accounting
 period it drops down to 2 seconds

So 30 seconds for 76 days (roughly) worth of numbers ? Not terrible 
but not great.


 Here is the query and explain .  PostgreSql  is 8.3.1 on new server with
 raid 10 Serial SCSI.
... snipped 'cause I have a lame reader ...

Not according to the benchmarks I have done, which were posted a 
couple of months ago.



   Sort Method:  quicksort  Memory: 292kB
...snip...
 Total runtime: 24682.580 ms


I don't have any immediate thoughts but maybe you could post the table 
schemas and indexes. It looks to my untutored eye as if most of the 
estimates are fair so I am guessing that you have run analyze recently.


What is your sort memory set to ? If work_mem is too low then you'll 
go to disk (if you see tmp files under the postgres $PGDATA/base 
directory you might be seeing the result of this) ...


I need to look into work_mem. It's set at 25 MB, which is fine for most 
work unless we get into the accounting queries, which have to be more 
complicated than they need to be because of how some of the tables are 
laid out (which I did not lay out).



HTH

Greg Williamson
Senior DBA
DigitalGlobe

Confidentiality Notice: This e-mail message, including any 
attachments, is for the sole use of the intended recipient(s) and may 
contain confidential and privileged information and must be protected 
in accordance with those provisions. Any unauthorized review, use, 
disclosure or distribution is prohibited. If you are not the intended 
recipient, please contact the sender by reply e-mail and destroy all 
copies of the original message.


(My corporate masters made me say this.)



Re: [PERFORM] Need for speed 2

2005-09-20 Thread Alex Turner
I have found that while the OS may flush to the controller fast with
fsync=true, the controller does as it pleases (it has BBU, so I'm not
too worried), so you get great performance because your controller is
determining the read/write sequence outside of what is being demanded by
an fsync.

Alex Turner
NetEconomist

On 8/25/05, Kelly Burkhart [EMAIL PROTECTED] wrote:
 On Thu, 2005-08-25 at 11:16 -0400, Ron wrote:
  # - Settings -
  fsync = false   # turns forced synchronization on or off
  #wal_sync_method = fsync# the default varies across platforms:
   # fsync, fdatasync, open_sync, or
 
  I hope you have a battery backed write buffer!
 
 Battery backed write buffer will do nothing here, because the OS is
 taking its sweet time flushing to the controller's battery backed write
 buffer!
 
 Isn't the reason for battery backed controller cache to make fsync()s
 fast?
 
 -K


Re: [PERFORM] Need for speed 3

2005-09-05 Thread Nicholas E. Wakefield
Ulrich,

Luke cc'd me on his reply and you definitely should have a look at
Bizgres Clickstream. Even if the whole stack doesn't match your needs
(though it sounds like it would), the clickstream-focused ETL and BizGres
enhancements could make your life a little easier.

Basically the stack components that you might want to look at first are:

BizGres flavor of PostGreSQL - Enhanced for business intelligence and
data warehousing - The www.bizgres.com website can speak to this in more
detail.
Clickstream Data Model - Pageview fact table surrounded by various
dimensions and 2 core staging tables for the cleansed weblog data.
ETL Platform - Contains a weblog sessionizer, cleanser and ETL
transformations, which can handle 2-3 million hits without any trouble.
With native support for the COPY command, for even greater performance.
JasperReports - For pixel perfect reporting.

Sorry for sounding like I'm in marketing or sales, however I'm not.

Couple of key features that might interest you, considering your email.
The weblog parsing component allows for relatively complex cleansing,
allowing for less data to be written to the DB and therefore increasing
throughput. In addition, if you run every 5 minutes there would be no
need to truncate the day's data and reload; the ETL knows how to connect
the data from before. The COPY enhancement to postgresql found in
bizgres makes a noticeable improvement when loading data.
The schema is basically

Dimension tables Session, Known Party (If cookies are logged), Page, IP
Address, Date, Time, Referrer, Referrer Page.
Fact tables: Pageview, Hit Subset (Not everyone wants all hits).

Staging Tables: Hits (Cleansed hits or just pageviews without surrogate
keys), Session (Session data gathered while parsing the log).

Regards

Nick


-Original Message-
From: Luke Lonergan [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 01, 2005 9:38 AM
To: Ulrich Wisser; pgsql-performance@postgresql.org
Cc: Nicholas E. Wakefield; Barry Klawans; Daria Hutchinson
Subject: Re: [PERFORM] Need for speed 3

Ulrich,

On 9/1/05 6:25 AM, Ulrich Wisser [EMAIL PROTECTED]
wrote:

 My application basically imports Apache log files into a Postgres 
 database. Every row in the log file gets imported in one of three (raw
 data) tables. My columns are exactly as in the log file. The import is

 run approx. every five minutes. We import about two million rows a
month.

Bizgres Clickstream does this job using an ETL (extract transform and
load) process to transform the weblogs into an optimized schema for
reporting.
 
 After every import the data from the current day is deleted from the 
 reporting table and recalculated from the raw data table.

This is something the optimized ETL in Bizgres Clickstream also does
well.
 
 What do you think of this approach? Are there better ways to do it? Is

 there some literature you recommend reading?

I recommend the Bizgres Clickstream docs, you can get it from Bizgres
CVS, and there will shortly be a live html link on the website.

Bizgres is free - it also improves COPY performance by almost 2x, among
other enhancements.

- Luke 




---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Need for speed 3

2005-09-01 Thread Merlin Moncure

Ulrich wrote:
 Hi again,
 
 first I want to say ***THANK YOU*** for everyone who kindly shared
their
 thoughts on my hardware problems. I really appreciate it. I started to
 look for a new server and I am quite sure we'll get a serious hardware
 update. As suggested by some people I would like now to look closer
at
 possible algorithmic improvements.
 
 My application basically imports Apache log files into a Postgres
 database. Every row in the log file gets imported in one of three (raw
 data) tables. My columns are exactly as in the log file. The import is
 run approx. every five minutes. We import about two million rows a
month.
 
 Between 30 and 50 users are using the reporting at the same time.
 
 Because reporting became so slow, I did create a reporting table. In
 that table data is aggregated by dropping time (date is preserved),
ip,
 referer, user-agent. And although it breaks normalization some data
from
 a master table is copied, so no joins are needed anymore.
 
 After every import the data from the current day is deleted from the
 reporting table and recalculated from the raw data table.
 

schemas would be helpful.  You may be able to tweak the import table a
bit and how it moves over to the data tables.

Just a thought: have you considered having apache logs write to a
process that immediately makes insert query(s) to postgresql? 

You could write a small C program which executes advanced query interface
calls to the server.

Merlin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Need for speed 3

2005-09-01 Thread Ulrich Wisser

Hi Merlin,

schemas would be helpful.  


right now I would like to know if my approach to the problem makes 
sense. Or if I should rework the whole procedure of import and aggregate.



Just a thought: have you considered having apache logs write to a
process that immediately makes insert query(s) to postgresql? 


Yes, we have considered that, but dismissed the idea very soon. We need 
Apache to be as responsive as possible. It's a two-server setup with 
load balancer and failover, serving about one thousand domains and 
counting. It needs to be as failsafe as possible and under no 
circumstances can any request be lost. (The click counting is core 
business and relates directly to our income.)
That said, it seemed quite safe to let Apache write logfiles and import 
them later. That way a database downtime wouldn't be mission critical.




You could write small C program which executes advanced query interface
call to the server.


How would that improve performance?

Ulrich

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Need for speed 3

2005-09-01 Thread Merlin Moncure
 Hi Merlin,
  Just a thought: have you considered having apache logs write to a
  process that immediately makes insert query(s) to postgresql?
 
 Yes we have considered that, but dismissed the idea very soon. We need
 Apache to be as responsive as possible. It's a two server setup with
 load balancer and failover. Serving about ones thousand domains and
 counting. It needs to be as failsafe as possible and under no
 circumstances can any request be lost. (The click counting is core
 business and relates directly to our income.)
 That said it seemed quite save to let Apache write logfiles. And
import
 them later. By that a database downtime wouldn't be mission critical.

Hm.  Well, it may be possible to do this in a fast and safe way, and I
understand your reservations here, but I'm going to spout off my opinion
anyway :).

If you are not doing this the following point is moot.  But take into
consideration that you could set a very low transaction timeout (like .25
seconds) and siphon log entries off to a text file if your database
server gets in trouble.  2 million hits a month is not very high even if
your traffic is bursty (there are approx 2.5 million seconds in a
month).
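
A note on the mechanism: there is no single "transaction timeout" setting
in PostgreSQL of that era; the closest built-in knob is statement_timeout
(in milliseconds), which the logging client could set for its own session:

-- Cap how long any one insert may block before the client gives up and
-- spools the log line to a text file instead (250 ms, per the .25 s above).
SET statement_timeout = 250;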

With a direct linked log file you get up to date stats always and spare
yourself the dump/load song and dance which is always a headache :(.
Also, however you are doing your billing, it will be easier to manage it
if everything is extracted from pg and not some conglomeration of log
files, *if* you can put 100% faith in your database.  When it comes to
pg now, I'm a believer.

  You could write small C program which executes advanced query
interface
  call to the server.
 
 How would that improve performance?

The functions I'm talking about are PQexecParams and PQexecPrepared.
The query string does not need to be encoded or decoded and is very
light on server resources and is very low latency.  Using them you could
get prob. 5000 inserts/sec on a cheap server if you have some type of
write caching in place with low cpu load.  
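
At the SQL level the same idea looks like the sketch below (PQexecPrepared
binds parameters to a statement prepared like this; the table and column
names are invented for illustration):

-- Prepare the insert once per connection...
PREPARE log_insert (timestamptz, inet, text) AS
    INSERT INTO raw_hits (hit_time, client_ip, url) VALUES ($1, $2, $3);

-- ...then execute it once per log line, shipping only the parameter values.
EXECUTE log_insert (now(), '10.0.0.1', '/index.html');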

Merlin



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Need for speed 3

2005-09-01 Thread Luke Lonergan
Ulrich,

On 9/1/05 6:25 AM, Ulrich Wisser [EMAIL PROTECTED] wrote:

 My application basically imports Apache log files into a Postgres
 database. Every row in the log file gets imported in one of three (raw
 data) tables. My columns are exactly as in the log file. The import is
 run approx. every five minutes. We import about two million rows a month.

Bizgres Clickstream does this job using an ETL (extract transform and load)
process to transform the weblogs into an optimized schema for reporting.
 
 After every import the data from the current day is deleted from the
 reporting table and recalculated from the raw data table.

This is something the optimized ETL in Bizgres Clickstream also does well.
 
 What do you think of this approach? Are there better ways to do it? Is
 there some literature you recommend reading?

I recommend the Bizgres Clickstream docs, you can get it from Bizgres CVS,
and there will shortly be a live html link on the website.

Bizgres is free - it also improves COPY performance by almost 2x, among
other enhancements.

- Luke 



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Need for speed 2

2005-08-25 Thread Frank Wiles
On Thu, 25 Aug 2005 09:10:37 +0200
Ulrich Wisser [EMAIL PROTECTED] wrote:

 Pentium 4 2.4GHz
 Memory 4x DIMM DDR 1GB PC3200 400MHZ CAS3, KVR
 Motherboard chipset 'I865G', two IDE channels on board
 2x SEAGATE BARRACUDA 7200.7 80GB 7200RPM ATA/100
 (software raid 1, system, swap, pg_xlog)
 ADAPTEC SCSI RAID 2100S ULTRA160 32MB 1-CHANNEL
 2x SEAGATE CHEETAH 15K.3 73GB ULTRA320 68-PIN WIDE
 (raid 1, /var/lib/pgsql)
 
 Database size on disc is 22GB. (without pg_xlog)
 
 Please find my postgresql.conf below.
 
 Putting pg_xlog on the IDE drives gave about 10% performance
 improvement. Would faster disks give more performance?

  Faster as in RPM on your pg_xlog partition probably won't make
  much of a difference.  However, if you can get a drive with better
  overall write performance then it would be a benefit. 

  Another thing to consider on this setup is whether or not you're
  hitting swap often and/or logging to that same IDE RAID set.  For
  optimal insertion benefit you want the heads of your disks to 
  essentially be only used for pg_xlog.  If you're having to jump
  around the disk in the following manner: 

write to pg_xlog
read from swap
write syslog data
write to pg_xlog 
...
...

  You probably aren't getting anywhere near the benefit you could.  One
  thing you could easily try is to break your IDE RAID set and put 
  OS/swap on one disk and pg_xlog on the other. 

 If one query contains so much data, that a full table scan is needed,
 I  do not care if it takes two minutes to answer. But all other
 queries  with less data (at the same time) still have to be fast.
 
 I can not stop users doing that kind of reporting. :(
 
 I need more speed in orders of magnitude. Will more disks / more
 memory do that trick?

  More disk and more memory always helps out.  Since you say these
  queries are mostly on not-often-used data I would lean toward more
  disks in your SCSI RAID-1 setup than maxing out available RAM based
  on the size of your database. 

 -
   Frank Wiles [EMAIL PROTECTED]
   http://www.wiles.org
 -


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Need for speed 2

2005-08-25 Thread Ron

At 03:10 AM 8/25/2005, Ulrich Wisser wrote:


I realize I need to be much more specific. Here is a more detailed
description of my hardware and system design.


Pentium 4 2.4GHz
Memory 4x DIMM DDR 1GB PC3200 400MHZ CAS3, KVR
Motherboard chipset 'I865G', two IDE channels on board


First suggestion: Get better server HW.  AMD Opteron based dual 
processor board is the current best in terms of price/performance 
ratio, _particularly_ for DB applications like the one you have 
described.  Such mainboards cost ~$400-$500.  RAM will cost about 
$75-$150/GB.  Opteron 2xx are ~$200-$700 apiece.   So a 2P AMD system 
can be had for as little as ~$850 + the cost of the RAM you need.  In 
the worst case where you need 24GB of RAM (~$3600), the total comes 
in at ~$4450.  As you can see from the numbers, buying only what RAM 
you actually need can save you a great deal of money.


Given what little you said about how much of your DB is frequently 
accessed, I'd suggest buying a server based around the 2P 16 DIMM 
slot IWill DK88 mainboard (Tyan has announced a 16 DIMM slot 
mainboard, but I do not think it is actually being sold yet.).  Then 
fill it with the minimum amount of RAM that will allow the working 
set of the DB to be cached in RAM.  In the worst case where DB 
access is essentially uniform and essentially random, you will need 
24GB of RAM to hold the 22GB DB + OS + etc.  That worst case is 
_rare_.  Usually DB's have a working set that is smaller than the 
entire DB.  You want to keep that working set in RAM.  If you can't 
identify the working set, buy enough RAM to hold the entire DB.


In particular, you want to make sure that any frequently accessed 
read only tables or indexes are kept in RAM.  The read only part is 
very important.  Tables (and their indexes) that are frequently 
written to _have_ to access HD.  Therefore you get much less out of 
having them in RAM.  Read only tables and their indexes can be loaded 
into tmpfs at boot time thereby keeping out of the way of the file 
system buffer cache.  tmpfs does not save data if the host goes down 
so it is very important that you ONLY use this trick with read only 
tables.  The other half of the trick is to make sure that the file 
system buffer cache does _not_ cache whatever you have loaded into tmpfs.
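
A related sketch (not exactly the boot-time file copy described above): on 
PostgreSQL 8.0 or later you can put a tablespace on the tmpfs mount and 
move read only tables into it.  The mount point and names below are 
assumptions, and anything stored there is lost on reboot, so it is only 
suitable for data you can rebuild.

-- At the OS level (not SQL):  mount -t tmpfs -o size=4g tmpfs /mnt/pg_ram
-- (the directory must be owned by the postgres user)
CREATE TABLESPACE ram_ts LOCATION '/mnt/pg_ram';

-- Move a hypothetical read-only lookup table into RAM:
ALTER TABLE ro_lookup SET TABLESPACE ram_ts;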




2x SEAGATE BARRACUDA 7200.7 80GB 7200RPM ATA/100
(software raid 1, system, swap, pg_xlog)
ADAPTEC SCSI RAID 2100S ULTRA160 32MB 1-CHANNEL
2x SEAGATE CHEETAH 15K.3 73GB ULTRA320 68-PIN WIDE
(raid 1, /var/lib/pgsql)


Second suggestion: you need a MUCH better IO subsystem.  In fact, 
given that you have described this system as being primarily OLTP 
like, this is more important that the above server HW.  Best would be 
to upgrade everything, but if you are strapped for cash, upgrade the 
IO subsystem first.


You need many more spindles and a decent RAID card or cards.  You 
want 15Krpm (best) or 10Krpm HDs.  As long as all of the HD's are at 
least 10Krpm, more spindles is more important than faster 
spindles.  If it's a choice between more 10Krpm discs or fewer 15Krpm 
discs, buy the 10Krpm discs.  Get the spindle count as high as you 
RAID cards can handle.


Whatever RAID cards you get should have as much battery backed write 
buffer as possible.  In the commodity market, presently the highest 
performance RAID cards I know of, and the ones that support the 
largest battery backed write buffer, are made by Areca.




Database size on disc is 22GB. (without pg_xlog)


Find out what the working set, ie the most frequently accessed 
portion, of this 22GB is and you will know how much RAM is worth 
having.  4GB is definitely too little!




Please find my postgresql.conf below.


Third suggestion:  make sure you are running a 2.6 based kernel and 
at least PG 8.0.3.  Helping beta test PG 8.1 might be an option for 
you as well.



Putting pg_xlog on the IDE drives gave about 10% performance 
improvement. Would faster disks give more performance?


What my application does:

Every five minutes a new logfile will be imported. Depending on the 
source of the request it will be imported in one of three raw click
tables. (data from two months back, to be able to verify customer 
complaints)  For reporting I have a set of tables. These contain data 
from the last two years. My app deletes all entries from today and 
reinserts updated data calculated from the raw data tables.


The raw data tables seem to be read only?  If so, you should buy 
enough RAM to load them into tmpfs at boot time and have them be 
completely RAM resident in addition to having enough RAM for the OS 
to cache an appropriate amount of the rest of the DB.



The queries contain no joins only aggregates. I have several indexes 
to speed different kinds of queries.


My problems occur when one user does a report that contains too 
much old data. In that case all cache mechanisms will fail and disc 
io is the limiting factor.


If one query contains so much data, that a full 

Re: [PERFORM] Need for speed 2

2005-08-25 Thread Merlin Moncure
 Putting pg_xlog on the IDE drives gave about 10% performance
 improvement. Would faster disks give more performance?
 
 What my application does:
 
 Every five minutes a new logfile will be imported. Depending on the
 source of the request it will be imported in one of three raw click
 tables. (data from two months back, to be able to verify customer
 complaints)
 For reporting I have a set of tables. These contain data from the last
 two years. My app deletes all entries from today and reinserts updated
 data calculated from the raw data tables.
 
 The queries contain no joins only aggregates. I have several indexes
to
 speed different kinds of queries.
 
 My problems occur when one user does a report that contains too much
 old data. In that case all cache mechanisms will fail and disc io is
 the limiting factor.

It seems like you are pushing the limit of what your server can handle.
This means: 1. an expensive server upgrade, or 
2. making the software more efficient.

Since you sound I/O bound, you can tackle 1. by a. adding more memory or
b. increasing i/o throughput.  

Unfortunately, you already have a pretty decent server (for x86), so a.
means a 64-bit platform and b. means more expensive hard drives.  The
archives are full of information about this...

Is your data well normalized?  You can do tricks like:
if a table has fields a,b,c,d,e,f with a as the primary key, and d,e,f
are not frequently queried or often missing, move d,e,f to a separate table.

Well-normalized structures are always more cache efficient.  Do you have
lots of repeating and/or empty data values in your tables?

Make your indexes and data as small as possible to reduce pressure on
the cache; here are just a few tricks:
1. use int2/int4 instead of numeric
2. know when to use char and varchar 
3. use functional indexes to reduce index expression complexity.  This
can give extreme benefits if you can, for example, reduce a double-field
index to a Boolean (see the sketch after this list).
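
A hypothetical illustration of point 3 (table and column names invented):
instead of indexing the raw double-precision column, index the boolean
test the queries actually make, or index only the interesting rows.

-- Expression (functional) index on the boolean test itself.
CREATE INDEX orders_has_discount_idx ON orders ((discount > 0));

-- A partial index is another way to shrink the index to the rows of interest.
CREATE INDEX orders_discounted_idx ON orders (order_date) WHERE discount > 0;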

Merlin

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Need for speed 2

2005-08-25 Thread Kelly Burkhart
On Thu, 2005-08-25 at 11:16 -0400, Ron wrote:
 # - Settings -
 
 fsync = false   # turns forced synchronization on or off
 #wal_sync_method = fsync# the default varies across platforms:
  # fsync, fdatasync, open_sync, or
 
 I hope you have a battery backed write buffer!

Battery backed write buffer will do nothing here, because the OS is
taking its sweet time flushing to the controller's battery backed write
buffer!

Isn't the reason for battery backed controller cache to make fsync()s
fast?

-K

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Need for speed

2005-08-22 Thread Jim C. Nasby
RRS (http://rrs.decibel.org) might be of use in this case.

On Tue, Aug 16, 2005 at 01:59:53PM -0400, Alex Turner wrote:
 Are you calculating aggregates, and if so, how are you doing it (I ask
 the question from experience of a similar application where I found
 that my aggregating PGPLSQL triggers were bogging the system down, and
 changed them so scheduled jobs instead).
 
 Alex Turner
 NetEconomist
 
 On 8/16/05, Ulrich Wisser [EMAIL PROTECTED] wrote:
  Hello,
  
  one of our services is click counting for on line advertising. We do
  this by importing Apache log files every five minutes. This results in a
  lot of insert and delete statements. At the same time our customers
  shall be able to do on line reporting.
  
  We have a box with
  Linux Fedora Core 3, Postgres 7.4.2
  Intel(R) Pentium(R) 4 CPU 2.40GHz
  2 scsi 76GB disks (15.000RPM, 2ms)
  
  I did put pg_xlog on another file system on other discs.
  
  Still when several users are on line the reporting gets very slow.
  Queries can take more then 2 min.
  
  I need some ideas how to improve performance in some orders of
  magnitude. I already thought of a box with the whole database on a ram
  disc. So really any idea is welcome.
  
  Ulrich
  
  
  
  --
  Ulrich Wisser  / System Developer
  
  RELEVANT TRAFFIC SWEDEN AB, Riddarg 17A, SE-114 57 Sthlm, Sweden
  Direct (+46)86789755 || Cell (+46)704467893 || Fax (+46)86789769
  
  http://www.relevanttraffic.com
  
  ---(end of broadcast)---
  TIP 1: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly
 
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings
 

-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Softwarehttp://pervasive.com512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Need for speed

2005-08-18 Thread Roger Hand
 Ulrich Wisser wrote:
 
  one of our services is click counting for on line advertising. We do
  this by importing Apache log files every five minutes. This results in a
  lot of insert and delete statements. 
...
 If you are doing mostly inserting, make sure you are in a transaction,

Well, yes, but you may need to make sure that a single transaction doesn't have 
too many inserts in it.
I was having a performance problem when doing transactions with a huge number 
of inserts
(tens of thousands), and I solved the problem by putting a simple counter in 
the loop (in the Java import code, 
that is) and doing a commit every 100 or so inserts.

-Roger

 John

  Ulrich

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Need for speed

2005-08-17 Thread Ulrich Wisser

Hello,

thanks for all your suggestions.

I can see that the Linux system is 90% waiting for disc io. At that time 
all my queries are *very* slow. My scsi raid controller and disc are 
already the fastest available. The query plan uses indexes and vacuum 
analyze is run once a day.


To avoid aggregating too many rows, I already made some aggregation 
tables which will be updated after the import from the Apache logfiles.

That did help, but only to a certain level.

I believe the biggest problem is disc io. Reports for very recent data 
are quite fast; these are used very often and are therefore already in the 
cache. But reports can contain (and regularly do) very old data. In that 
case the whole system slows down. To me this sounds like the recent data 
is flushed out of the cache and now all data for all queries has to be 
fetched from disc.


My machine has 2GB memory, please find postgresql.conf below.

Ulrich


#---
# RESOURCE USAGE (except WAL)
#---

# - Memory -

shared_buffers = 2  # min 16, at least max_connections*2, 
sort_mem = 4096 # min 64, size in KB

vacuum_mem = 8192   # min 1024, size in KB

# - Free Space Map -

max_fsm_pages = 5   # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 3000# min 100, ~50 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000   # min 25
#preload_libraries = ''


#---
# WRITE AHEAD LOG
#---

# - Settings -

fsync = false   # turns forced synchronization on or off
#wal_sync_method = fsync# the default varies across platforms:
wal_buffers = 128   # min 4, 8KB each

# - Checkpoints -

checkpoint_segments = 16# in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300   # range 30-3600, in seconds
#checkpoint_warning = 30# 0 is off, in seconds
#commit_delay = 0   # range 0-10, in microseconds
#commit_siblings = 5# range 1-1000


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Need for speed

2005-08-17 Thread Tom Lane
Ulrich Wisser [EMAIL PROTECTED] writes:
 My machine has 2GB memory, please find postgresql.conf below.

 max_fsm_pages = 5   # min max_fsm_relations*16, 6 bytes each

FWIW, that index I've been groveling through in connection with your
other problem contains an astonishingly large amount of dead space ---
almost 50%.  I suspect that you need a much larger max_fsm_pages
setting, and possibly more-frequent vacuuming, in order to keep a lid
on the amount of wasted space.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Need for speed

2005-08-17 Thread Jeffrey W. Baker
On Wed, 2005-08-17 at 11:15 +0200, Ulrich Wisser wrote:
 Hello,
 
 thanks for all your suggestions.
 
 I can see that the Linux system is 90% waiting for disc io. At that time 
 all my queries are *very* slow. My scsi raid controller and disc are 
 already the fastest available.

What RAID controller?  Initially you said you have only 2 disks, and
since you have your xlog on a separate spindle, I assume you have 1 disk
for the xlog and 1 for the data.  Even so, if you have a RAID, I'm going
to further assume you are using RAID 1, since no sane person would use
RAID 0.  In those cases you are getting the performance of a single
disk, which is never going to be very impressive.  You need a RAID.

Please be more precise when describing your system to this list.

-jwb


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Need for speed

2005-08-17 Thread Josh Berkus
Ulrich,

 I believe the biggest problem is disc io. Reports for very recent data
 are quite fast; these are used very often and are therefore already in the
 cache. But reports can contain (and regularly do) very old data. In that
 case the whole system slows down. To me this sounds like the recent data
 is flushed out of the cache and now all data for all queries has to be
 fetched from disc.

How large is the database on disk?

 My machine has 2GB memory, please find postgresql.conf below.

Hmmm ...
effective_cache_size?
random_page_cost?
cpu_tuple_cost?
etc.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Need for speed

2005-08-17 Thread Ron

At 05:15 AM 8/17/2005, Ulrich Wisser wrote:

Hello,

thanks for all your suggestions.

I can see that the Linux system is 90% waiting for disc io.


A clear indication that you need to improve your HD IO subsystem.


At that time all my queries are *very* slow.


To be more precise, your server performance at that point is 
essentially equal to your HD IO subsystem performance.




 My scsi raid controller and disc are already the fastest available.


Oh, REALLY?  This is the description of the system you gave us:

We have a box with
Linux Fedora Core 3, Postgres 7.4.2
Intel(R) Pentium(R) 4 CPU 2.40GHz
2 scsi 76GB disks (15.000RPM, 2ms)

That is far, Far, FAR from the fastest available in terms of SW, 
OS, CPU host, _or_ HD subsystem.


The fastest available means
1= you should be running 8.0.3
2= you should be running the latest stable 2.6 based kernel
3= you should be running an Opteron based server
4= Fibre Channel HDs are higher performance than SCSI ones.
5= (and this is the big one) YOU NEED MORE SPINDLES AND A HIGHER END 
RAID CONTROLLER.


The absolute top of the line for RAID controllers is something 
based on Fibre Channel from Xyratex (who make the RAID engines for 
EMC and NetApps), Engino (the enterprise division of LSI Logic who 
sell mostly to IBM.  Apple has a server based on an Engino card), 
dot-hill (who bought Chaparral among others).  I suspect you can't 
afford them even if they would do business with you.  The ante for a 
FC-based RAID subsystem in this class is in the ~$32K to ~$128K 
range, even if you buy direct from the actual RAID HW manufacturer 
rather than an OEM like


In the retail commodity market, the current best RAID controllers are 
probably the 16 and 24 port versions of the Areca cards ( 
www.areca.us ).  They come darn close to saturating the Real 
World Peak Bandwidth of a 64b 133MHz PCI-X bus.


I did put pg_xlog on another file system on other discs.


 The query plan uses indexes and vacuum analyze is run once a day.


That


To avoid aggregating too many rows, I already made some aggregation 
tables which will be updated after the import from the Apache 
logfiles.  That did help, but only to a certain level.


I believe the biggest problem is disc io. Reports for very recent 
data are quite fast; these are used very often and are therefore already 
in the cache. But reports can contain (and regularly do) very old 
data. In that case the whole system slows down. To me this sounds 
like the recent data is flushed out of the cache and now all data 
for all queries has to be fetched from disc.


My machine has 2GB memory,





---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Need for speed

2005-08-17 Thread Ron

At 05:15 AM 8/17/2005, Ulrich Wisser wrote:

Hello,

thanks for all your suggestions.

I can see that the Linux system is 90% waiting for disc io.


A clear indication that you need to improve your HD IO subsystem if possible.



At that time all my queries are *very* slow.


To be more precise, your server performance at that point is 
essentially equal to your HD IO subsystem performance.




 My scsi raid controller and disc are already the fastest available.


Oh, REALLY?  This is the description of the system you gave us:

We have a box with
Linux Fedora Core 3, Postgres 7.4.2
Intel(R) Pentium(R) 4 CPU 2.40GHz
2 scsi 76GB disks (15.000RPM, 2ms)


That is far, Far, FAR from the fastest available in terms of SW, 
OS, CPU host, _or_ HD subsystem.


The fastest available means
1= you should be running PostgreSQL 8.0.3
2= you should be running the latest stable 2.6 based kernel
3= you should be running an Opteron based server
4= Fibre Channel HDs are slightly higher performance than SCSI ones.
5= (and this is the big one) YOU NEED MORE SPINDLES AND A HIGHER END 
RAID CONTROLLER.


Your description of your workload was:
one of our services is click counting for on line advertising. We do 
this by importing Apache log files every five minutes. This results 
in a lot of insert and delete statements. At the same time our 
customers shall be able to do on line reporting.


There are two issues here:
1= your primary usage is OLTP-like, but you are also expecting to do 
reports against the same schema that is supporting your OLTP-like 
usage.  Bad Idea.  Schemas that are optimized for reporting and other 
data-mining-like operations are pessimal for OLTP-like applications 
and vice versa.  You need two schemas: one optimized for lots of 
inserts and deletes (OLTP-like), and one optimized for reporting 
(data-mining like).


2= 2 spindles, even 15K rpm spindles, is minuscule.  Real enterprise 
class RAID subsystems have at least 10-20x that many spindles, 
usually split into 6-12 sets dedicated to different groups of tables 
in the DB.  Putting xlog on its own dedicated spindles is just the 
first step.


The absolute top of the line for RAID controllers is something 
based on Fibre Channel from Xyratex (who make the RAID engines for 
EMC and NetApps), Engino (the enterprise division of LSI Logic who 
sell mostly to IBM.  Apple has a server based on an Engino card), or 
dot-hill (who bought Chaparral among others).  I suspect you can't 
afford them even if they would do business with you.  The ante for a 
FC-based RAID subsystem in this class is in the ~$32K to ~$128K 
range, even if you buy direct from the actual RAID HW manufacturer 
rather than an OEM like EMC, IBM, or NetApp who will 2x or 4x the 
price.  OTOH, these subsystems will provide OLTP or OLTP-like DB apps 
with performance that is head-and-shoulders better than anything else 
to be found.  Numbers like 50K-200K IOPS.  You get what you pay for.


In the retail commodity market where you are more realistically going 
to be buying, the current best RAID controllers are probably the 
Areca cards ( www.areca.us ).  They come darn close to saturating the 
Real World Peak Bandwidth of a 64b 133MHz PCI-X bus and have better 
IOPS numbers than their commodity brethren.  However, _none_ of the 
commodity RAID cards have IOPS numbers anywhere near as high as those 
mentioned above.



To avoid aggregating too many rows, I already made some aggregation 
tables which will be updated after the import from the Apache 
logfiles.  That did help, but only to a certain level.


I believe the biggest problem is disc io. Reports for very recent 
data are quite fast; these are used very often and are therefore already 
in the cache. But reports can contain (and regularly do) very old 
data. In that case the whole system slows down. To me this sounds 
like the recent data is flushed out of the cache and now all data 
for all queries has to be fetched from disc.


I completely agree.  Hopefully my above suggestions make sense and 
are of use to you.




My machine has 2GB memory,


...and while we are at it, OLTP like apps benefit less from RAM than 
data mining ones, but still 2GB of RAM is just not that much for a 
real DB server...



Ron Peacetree



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Need for speed

2005-08-16 Thread Richard Huxton

Ulrich Wisser wrote:

Hello,

one of our services is click counting for on line advertising. We do 
this by importing Apache log files every five minutes. This results in a 
lot of insert and delete statements. At the same time our customers 
shall be able to do on line reporting.


I need some ideas on how to improve performance by some orders of 
magnitude. I have already thought of a box with the whole database on 
a RAM disc. So really, any idea is welcome.


So what's the problem - poor query plans? CPU saturated? I/O saturated? 
Too much context-switching?


What makes it worse - adding another reporting user, or importing 
another logfile?


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re. : [PERFORM] Need for speed

2005-08-16 Thread bsimon
Hi,

How much RAM do you have?
Could you give us your postgresql.conf? (the shared_buffers parameter)
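
For the latter, this reports the current setting (in 7.4 the value is 
a count of 8 kB buffers, so e.g. 25600 would mean 200 MB):

SHOW shared_buffers;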

If you do lots of delete/insert operations you HAVE to vacuum analyze 
your table (especially if you have indexes). 

I'm not sure whether vacuuming locks your table with pg 7.4.2 (it 
doesn't with 8.0); you might consider upgrading your pg version. 
Either way, your SELECT performance while vacuuming is going to suffer. 
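
For example, assuming a hypothetical clicks_raw table that receives 
the imports, something like this run right after each five-minute 
load (from the import script, or from cron via psql) keeps the table 
from bloating:

-- reclaim the dead rows left behind by the deletes and refresh the
-- planner statistics; a plain (non-FULL) VACUUM has not needed an
-- exclusive lock since 7.2, so reads and writes can continue
VACUUM ANALYZE clicks_raw;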


I don't know your application, but I would certainly try to split your 
table. It would result in one table for inserts/vacuum and one for 
selects. You would have to switch from one to the other every five 
minutes.
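
A hedged sketch of one way to do the switch (table names are 
hypothetical; DDL is transactional in PostgreSQL, so the renames 
become visible together):

BEGIN;
ALTER TABLE clicks_active  RENAME TO clicks_swap;
ALTER TABLE clicks_standby RENAME TO clicks_active;
ALTER TABLE clicks_swap    RENAME TO clicks_standby;
COMMIT;

-- the table that is now standby can be vacuumed, or truncated and
-- reloaded, without slowing down reports hitting clicks_active
VACUUM ANALYZE clicks_standby;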

Benjamin.





Ulrich Wisser [EMAIL PROTECTED]
Sent by: [EMAIL PROTECTED]
16/08/2005 17:39

To:      pgsql-performance@postgresql.org
cc: 
Subject: [PERFORM] Need for speed


Hello,

one of our services is click counting for on line advertising. We do 
this by importing Apache log files every five minutes. This results in a 
lot of insert and delete statements. At the same time our customers 
shall be able to do on line reporting.

We have a box with
Linux Fedora Core 3, Postgres 7.4.2
Intel(R) Pentium(R) 4 CPU 2.40GHz
2 scsi 76GB disks (15.000RPM, 2ms)

I did put pg_xlog on another file system on other discs.

Still, when several users are online, the reporting gets very slow. 
Queries can take more than 2 min.

I need some ideas on how to improve performance by some orders of 
magnitude. I have already thought of a box with the whole database on 
a RAM disc. So really, any idea is welcome.

Ulrich



-- 
Ulrich Wisser  / System Developer

RELEVANT TRAFFIC SWEDEN AB, Riddarg 17A, SE-114 57 Sthlm, Sweden
Direct (+46)86789755 || Cell (+46)704467893 || Fax (+46)86789769

http://www.relevanttraffic.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Need for speed

2005-08-16 Thread John A Meinel
Ulrich Wisser wrote:
 Hello,

 one of our services is click counting for on line advertising. We do
 this by importing Apache log files every five minutes. This results in a
 lot of insert and delete statements. At the same time our customers
 shall be able to do on line reporting.

What are you deleting? I can see having a lot of updates and inserts,
but I'm trying to figure out what the deletes would be.

Is it just that you completely refill the table based on the Apache log,
rather than only appending?
Or are you deleting old rows?


 We have a box with
 Linux Fedora Core 3, Postgres 7.4.2
 Intel(R) Pentium(R) 4 CPU 2.40GHz
 2 scsi 76GB disks (15.000RPM, 2ms)

 I did put pg_xlog on another file system on other discs.

 Still, when several users are online, the reporting gets very slow.
 Queries can take more than 2 min.

If it only gets slow when you have multiple clients it sounds like your
select speed is the issue, more than conflicting with your insert/deletes.


 I need some ideas on how to improve performance by some orders of
 magnitude. I have already thought of a box with the whole database on
 a RAM disc. So really, any idea is welcome.

How much ram do you have in the system? It sounds like you only have 1
CPU, so there is a lot you can do to make the box scale.

A dual Opteron (possibly a dual-socket motherboard with dual-core CPUs,
filling only one socket for now), with 16GB of RAM and an 8-drive
RAID 10 array, would perform quite a bit faster.

How big is your database on disk? Obviously it isn't very large if you
are thinking to hold everything in RAM (and only have 76GB of disk
storage to put it in anyway).

If your machine only has 512M, an easy solution would be to put in a
bunch more memory.

In general, your hardware is pretty low in overall specs. So if you are
willing to throw money at the problem, there is a lot you can do.

Alternatively, turn on statement logging, and then post the queries that
are slow. This mailing list is pretty good at fixing poor queries.
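
One way to find them on 7.4 or later is the log_min_duration_statement 
setting (value in milliseconds), normally set in postgresql.conf:

-- postgresql.conf:  log_min_duration_statement = 1000  (log anything over 1 s)
-- check what the running server currently uses:
SHOW log_min_duration_statement;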

One thing you are probably hitting is a lot of sequential scans on the
main table.

If you are doing mostly inserting, make sure you are in a transaction,
and think about doing a COPY.
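
A hedged sketch of what the five-minute import could look like (table 
name and file path are hypothetical; COPY ... FROM a server-side file 
needs superuser rights, otherwise psql's \copy does the same thing 
from the client):

BEGIN;
-- one COPY per batch is far cheaper than thousands of individual INSERTs
COPY clicks_raw (click_time, campaign_id, url)
    FROM '/var/tmp/apache_batch.tsv';
COMMIT;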

There is a lot more that can be said, we just need to have more
information about what you want.

John
=:-


 Ulrich








Re: [PERFORM] Need for speed

2005-08-16 Thread Jeffrey W. Baker
On Tue, 2005-08-16 at 17:39 +0200, Ulrich Wisser wrote:
 Hello,
 
 one of our services is click counting for on line advertising. We do 
 this by importing Apache log files every five minutes. This results in a 
 lot of insert and delete statements. At the same time our customers 
 shall be able to do on line reporting.
 
 We have a box with
 Linux Fedora Core 3, Postgres 7.4.2
 Intel(R) Pentium(R) 4 CPU 2.40GHz

This is not a good CPU for this workload.  Try an Opteron or Xeon.  Also
of major importance is the amount of memory.  If possible, you would
like to have memory larger than the size of your database.

 2 scsi 76GB disks (15.000RPM, 2ms)

If you decide your application is I/O bound, here's an obvious place for
improvement.  More disks == faster.

 I did put pg_xlog on another file system on other discs.

Did that have a beneficial effect?

 Still, when several users are online, the reporting gets very slow.
 Queries can take more than 2 min.

Is this all the time or only during the insert?

 I need some ideas on how to improve performance by some orders of
 magnitude. I have already thought of a box with the whole database on
 a RAM disc. So really, any idea is welcome.

You don't need a RAM disk, just a lot of RAM.  Your operating system
will cache disk contents in memory if possible.  You have a very small
configuration, so more CPU, more memory, and especially more disks will
probably all yield improvements.
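
Once the RAM is there, it can also help to tell the planner roughly 
how much file-system cache to expect (a hedged example; in 7.4 the 
value is counted in 8 kB disk pages and is normally set in 
postgresql.conf):

-- about 1.5 GB of expected OS cache: 196608 pages * 8 kB
SET effective_cache_size = 196608;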

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Need for speed

2005-08-16 Thread Alex Turner
Are you calculating aggregates, and if so, how are you doing it? (I ask
from experience with a similar application, where I found that my
aggregating PL/pgSQL triggers were bogging the system down, and
changed them to scheduled jobs instead.)
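
A minimal sketch of such a scheduled job, reusing the hypothetical
clicks_raw / clicks_summary tables from earlier in the thread and run
once per import batch rather than once per row:

BEGIN;
-- rebuild today's summary rows in one pass over the day's raw clicks
DELETE FROM clicks_summary WHERE day = current_date;
INSERT INTO clicks_summary (day, campaign_id, click_count)
SELECT click_time::date, campaign_id, count(*)
FROM clicks_raw
WHERE click_time >= current_date
GROUP BY click_time::date, campaign_id;
COMMIT;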

Alex Turner
NetEconomist

On 8/16/05, Ulrich Wisser [EMAIL PROTECTED] wrote:
 Hello,
 
 one of our services is click counting for on line advertising. We do
 this by importing Apache log files every five minutes. This results in a
 lot of insert and delete statements. At the same time our customers
 shall be able to do on line reporting.
 
 We have a box with
 Linux Fedora Core 3, Postgres 7.4.2
 Intel(R) Pentium(R) 4 CPU 2.40GHz
 2 scsi 76GB disks (15.000RPM, 2ms)
 
 I did put pg_xlog on another file system on other discs.
 
 Still, when several users are online, the reporting gets very slow.
 Queries can take more than 2 min.
 
 I need some ideas on how to improve performance by some orders of
 magnitude. I have already thought of a box with the whole database on
 a RAM disc. So really, any idea is welcome.
 
 Ulrich
 
 
 
 --
 Ulrich Wisser  / System Developer
 
 RELEVANT TRAFFIC SWEDEN AB, Riddarg 17A, SE-114 57 Sthlm, Sweden
 Direct (+46)86789755 || Cell (+46)704467893 || Fax (+46)86789769
 
 http://www.relevanttraffic.com
 

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Need for speed

2005-08-16 Thread Dennis Bjorklund
On Tue, 16 Aug 2005, Ulrich Wisser wrote:

 Still, when several users are online, the reporting gets very slow.
 Queries can take more than 2 min.

Could you show an example of such a query and the output of EXPLAIN ANALYZE
on that query (preferably done when the database is slow)?

It's hard to say what is wrong without more information.
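
For instance, with hypothetical table and column names, run while the
server is under its normal load:

EXPLAIN ANALYZE
SELECT campaign_id, count(*)
FROM clicks_raw
WHERE click_time >= current_date - 30
GROUP BY campaign_id;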

-- 
/Dennis Björklund


---(end of broadcast)---
TIP 6: explain analyze is your friend