Re: [PERFORM] how to handle a big table for data log

2010-07-19 Thread Jorge Montero
Large tables, by themselves, are not necessarily a problem. The problem is what 
you might be trying to do with them. Depending on the operations you are trying 
to do, partitioning the table might help performance or make it worse.
 
What kind of queries are you running? How many days of history are you keeping? 
Could you post an explain analyze output of a query that is being problematic?
Given the amount of data you hint at, your server configuration and any custom 
statistics targets for the big tables in question would also be useful.
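
As a point of reference only, here is a minimal sketch of the inheritance-based 
partitioning available in 8.x, with entirely hypothetical table and column names; 
it is not a recommendation for your particular schema:

-- Parent table plus one child per day, each child constrained so the
-- planner can skip irrelevant partitions.
CREATE TABLE app_log (
    log_time  timestamp NOT NULL,
    log_type  text      NOT NULL,
    message   text
);

CREATE TABLE app_log_2010_07_19 (
    CHECK (log_time >= DATE '2010-07-19' AND log_time < DATE '2010-07-20')
) INHERITS (app_log);

CREATE INDEX app_log_2010_07_19_time_idx ON app_log_2010_07_19 (log_time);

-- With constraint_exclusion enabled, a date-bounded query against the
-- parent only scans the matching children.
SELECT count(*)
  FROM app_log
 WHERE log_time >= DATE '2010-07-19' AND log_time < DATE '2010-07-20';

Whether something like that actually helps depends entirely on the answers to 
the questions above.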

 kuopo sp...@cs.nctu.edu.tw 7/19/2010 1:27 AM 
Hi,

I have to handle a log table that accumulates a large amount of
log records. This table only involves insert and query operations.
To limit the table size, I tried splitting the table by date.
However, the number of logs is still large (46 million records per
day). To further limit its size, I tried splitting the log table by
log type as well, but this did not improve performance; it is much
slower than the single big table solution. I guess this is because
of the extra auto-vacuum/analyze cost across all the split tables.

Can anyone comment on this situation? Thanks in advance.


kuopo.



Re: [PERFORM] Need help in performance tuning.

2010-07-09 Thread Jorge Montero
If your app is running under Tomcat, connection pooling is extremely easy to 
set up from there: it has connection pooling mechanisms built in. Request your 
db connections through those mechanisms instead of opening them manually, make a 
couple of changes to server.xml, and the problem goes away. Hundreds, if not 
thousands, of concurrent users can end up being served by fewer than 10 
connections.
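
As a side note (not something Tomcat itself requires), a rough way to confirm 
the pool is actually capping connections is to look at pg_stat_activity on the 
server while the app is under load:

-- How many connections each role currently holds open. With a working
-- pool this should hover near the pool's configured maximum, not the
-- number of concurrent users.
SELECT usename, count(*) AS open_connections
  FROM pg_stat_activity
 GROUP BY usename
 ORDER BY open_connections DESC;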
 


 Harpreet singh Wadhwa harpr...@openbravo.com 7/9/2010 3:55 AM 
Thanks to you all for the replies.
I got a gist of where I should head: rely a bit on Postgres for
performance and the rest on my Tomcat and application.
I will also try connection pooling on the Postgres side.

And if I come back with any question (related to this topic), it
will be more precise (with real data from my testing). ;-)

Regards
haps

On Fri, Jul 9, 2010 at 1:22 PM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 On 09/07/10 12:42, Tom Lane wrote:
 Samuel Gendler sgend...@ideasculptor.com writes:
 On Thu, Jul 8, 2010 at 8:11 PM, Craig Ringer
 cr...@postnewspapers.com.au wrote:
 If you're not using a connection pool, start using one.

 I see this issue and subsequent advice cross this list awfully
 frequently.  Is there an architectural reason why postgres itself
 cannot pool incoming connections in order to eliminate the requirement
 for an external pool?

 Perhaps not, but there's no obvious benefit either.  Since there's
 More Than One Way To Do It, it seems more practical to keep that as a
 separate problem that can be solved by a choice of add-on packages.

 Admittedly I'm relatively ignorant of the details, but I increasingly
 think PostgreSQL will need a different big architectural change in the
 coming years, as the typical performance characteristics of machines
 change:

 It'll need to separate running queries from running processes, or
 start threading backends, so that one way or the other a single query
 can benefit from the capabilities of multiple CPUs. The same separation,
 or a move to async I/O, might be needed to get one query to concurrently
 read multiple partitions of a table, or otherwise get maximum benefit
 from high-capacity I/O subsystems when running just a few big, expensive
 queries.

 Otherwise I'm wondering if PostgreSQL will begin really suffering in
 performance on workloads where queries are big and expensive but there
 are relatively few of them running at a time.

 My point? *if* I'm not full of hot air and there's some truth to my
 blather above, any change like that might be accompanied by a move to
 separate query execution state from connection state, so that idle
 connections have a much lower resource cost.

 OK, that's my hand-waving for the day done.

 --
 Craig Ringer





Re: [PERFORM] Need help in performance tuning.

2010-07-09 Thread Jorge Montero
If anything were built into the database to handle such connections, I'd recommend 
a big, bold warning recommending the use of client-side pooling where available. 
For something like, say, a web server, pooling connections to the database 
provides a massive performance advantage regardless of how well the database copes 
with far more active queries than the hardware can handle: the assignment of a 
connection to a thread tends to be at least an order of magnitude cheaper than 
establishing a new connection for each new thread and destroying it when it dies. 
This is especially true if the client architecture relies on relatively 
short-lived threads.
 
While there are a few cases where pooling is counterproductive, they are 
relatively rare. This is why every Java application server out there will 
strongly recommend using its own facilities to connect to a database: the 
performance is almost always better, and it causes fewer headaches for the DBAs.
 
Now, if remote clients are accessing your database directly, setting up a pool 
in between might not be as straightforward or give you the same gains across the 
board, and that might be the only case where letting the db do its own pooling 
makes sense.

 Kevin Grittner kevin.gritt...@wicourts.gov 7/9/2010 12:52 PM 
Matthew Wakeling matt...@flymine.org wrote:
 On Fri, 9 Jul 2010, Kevin Grittner wrote:
 Interesting idea. As far as I can see, you are suggesting
 solving the too many connections problem by allowing lots of
 connections, but only allowing a certain number to do anything
 at a time?

 Right.
 
 I think in some situations, this arrangement would be an
 advantage.  However, I do not think it will suit the majority of
 situations, and could reduce the performance when the user doesn't
 need the functionality, either because they have a pool already,
 or they don't have many connections.

Oh, totally agreed, except that I think we can have essentially nil
impact if they don't exceed a configured limit.  In my experience,
pooling is more effective the closer you put it to the client.  I
suppose the strongest argument that could be made against building
in some sort of pooling is that it doesn't encourage people to look
for client-side solutions.  However, we seem to get a lot of posts
from people who don't do this, are not able to easily manage it, and
who would benefit from even a simple solution like this.

-Kevin



Re: [PERFORM] Query timing increased from 3s to 55s when used as a function instead of select

2010-05-25 Thread Jorge Montero
Have you read this?
 
http://blog.endpoint.com/2008/12/why-is-my-function-slow.html 
 
99% of the 'function is slow' problems are caused by this.

Have you checked the difference between explain and prepare + explain execute?
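
A minimal sketch of that comparison, using a made-up table and column in place 
of the real ones:

-- Plan the planner builds when it can see the literal value:
EXPLAIN SELECT * FROM staff_leave WHERE rota_id = 6;

-- Plan the way a plpgsql function sees the query: a prepared statement
-- with a parameter, which on 8.x yields a generic plan.
PREPARE p(int) AS SELECT * FROM staff_leave WHERE rota_id = $1;
EXPLAIN EXECUTE p(6);
DEALLOCATE p;

If the two plans differ wildly, that is usually the whole story.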

 Tyler Hildebrandt ty...@campbell-lange.net 05/25/10 4:59 AM 
We're using a function whose query, when run as a standalone select statement
outside of the function, takes roughly 1.5s to complete, whereas running the
identical query within the function takes around 55s to complete.

We are lost as to why placing this query within a function as opposed to
substituting the variables in a select statement is so drastically different.

The timings posted here are from a 512MB memory virtual machine and are not of
major concern on their own but we are finding the same issue in our production
environment with far superior hardware.

The function can be found here:
http://campbell-lange.net/media/files/fn_medirota_get_staff_leave_summary.sql 

---

Timings for the individual components on their own is as follows:

select * from fn_medirota_validate_rota_master(6);
Time: 0.670 ms

select to_date(EXTRACT (YEAR FROM current_date)::text, 'YYYY');
Time: 0.749 ms

select * from fn_medirota_people_template_generator(2, 6, date'2009-01-01',
date'2009-12-31', TRUE) AS templates;
Time: 68.004 ms

select * from fn_medirota_people_template_generator(2, 6, date'2010-01-01',
date'2010-12-31', TRUE) AS templates;
Time: 1797.323 ms


Copying the exact same for loop select statement from the query above into
the psql query buffer and running them with variable substitution yields the
following:

Running the FOR loop SELECT with variable substitution:
Time: 3150.585 ms


Whereas invoking the function yields:

select * from fn_medirota_get_staff_leave_summary(6);
Time: 57375.477 ms


We have tried using explain analyse to update the query optimiser, dropped and
recreated the function and have restarted both the machine and the postgres
server multiple times.

Any help or advice would be greatly appreciated.


Kindest regards,
Tyler Hildebrandt

---

EXPLAIN ANALYSE VERBOSE SELECT * FROM fn_medirota_get_staff_leave_summary(6);

QUERY PLAN
-
{FUNCTIONSCAN
:startup_cost 0.00
:total_cost 260.00
:plan_rows 1000
:plan_width 85
:targetlist (
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 1
  :vartype 23
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 1
  }
   :resno 1
   :resname id
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 2
  :vartype 1043
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 2
  }
   :resno 2
   :resname t_full_name
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 3
  :vartype 16
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 3
  }
   :resno 3
   :resname b_enabled
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 4
  :vartype 1043
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 4
  }
   :resno 4
   :resname t_anniversary
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 5
  :vartype 23
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 5
  }
   :resno 5
   :resname n_last_year_annual
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 6
  :vartype 23
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 6
  }
   :resno 6
   :resname n_last_year_other
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 7
  :vartype 23
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 7
  }
   :resno 7
   :resname n_this_year_annual
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  

Re: [PERFORM] bgwriter, checkpoints, curious (seeing delays)

2010-02-26 Thread Jorge Montero


 Tory M Blue tmb...@gmail.com 02/26/10 12:52 PM 

 This is too much. Since you have 300 connections, you will probably swap
 because of this setting, since each connection may use this much
 work_mem. The rule of thumb is to set this to a lower general value
 (say, 1-2 MB), and set it per-query when needed.

 I'm slightly confused. Most things I've read, including running
 pg_tune for grins, put this around 100MB (98MB for pgtune).  1-2MB just
 seems really low to me. And ignore the 300 connections; that's an upper
 limit. I usually run a max of 40-45, but usually around 20 connections
per sec.

It has been said on this list before that pg_tune is extremely aggressive when 
it comes to work_mem.

100MB is just a whole lot of memory for something that is dedicated mostly to 
sorting. Some of my relatively heavy duty queries, which end up manipulating 
hundreds of thousands of rows in subqueries, do just fine with quite a bit less.

1-2MB is good enough for many families of queries, but it's hard to say what 
the right default should be for you. The right number can be estimated by 
running explain analyze on your most common queries, with parameters that are 
representative of regular use, and seeing how much memory they actually claim to 
use. In my case, for example, most of my queries do just fine with 10 MB, while 
the reporting queries that accumulate quite a bit of data request up to 60MB.
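
For example, one way to experiment (the values and query here are arbitrary 
stand-ins) is to keep the global setting modest and raise work_mem only for the 
session or transaction that needs it, then check whether EXPLAIN ANALYZE stops 
reporting sorts that spill to disk:

SET work_mem = '2MB';            -- modest session default

BEGIN;
SET LOCAL work_mem = '64MB';     -- applies only within this transaction
EXPLAIN ANALYZE
SELECT customer_id, sum(amount)  -- hypothetical reporting query
  FROM big_report_table
 GROUP BY customer_id
 ORDER BY 2 DESC;
COMMIT;

On 8.3 the EXPLAIN ANALYZE output shows the sort method ('quicksort' in memory 
vs. 'external merge' on disk), which makes it fairly easy to see when work_mem 
was too small.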

If your average query needs 100 MB, it'd still mean that 40 connections take 4 
gigs worth of work memory, which might be better spent caching the database.

Now, if your system is so over-specced that wasting a few gigs of RAM doesn't 
impact your performance one bit, then you might not have to worry about this at 
all.





Re: [PERFORM] Re: 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-10 Thread Jorge Montero
That sure looks like the source of the problem to me too. I've seen similar 
behavior in queries not very different from that one. It's hard to guess what the 
problem is exactly without more knowledge of the data distribution in 
article_words, though.

Given the results of analyze, I'd run the deepest subquery and see whether I 
could get the estimate to match reality, either by altering statistics targets 
or by tweaking the query to give more information to the planner.

For example, I'd check whether the estimated number of rows for

SELECT context_key FROM article_words JOIN words using (word_key) WHERE word = 
'insider'

is much less accurate than the estimate for

SELECT context_key FROM article_words WHERE word_key = (whatever the actual 
word_key for insider is)
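
A sketch of that check (the literal word_key is obviously made up):

EXPLAIN ANALYZE
SELECT context_key
  FROM article_words
  JOIN words USING (word_key)
 WHERE word = 'insider';

EXPLAIN ANALYZE
SELECT context_key
  FROM article_words
 WHERE word_key = 12345;

If the second estimate is much closer to the real row count, the planner's 
trouble is in guessing how many article_words rows match a word it only 
discovers through the join.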


 Robert Haas robertmh...@gmail.com 02/10/10 2:31 PM 
On Wed, Feb 10, 2010 at 3:29 AM, Bryce Nesbitt bry...@obviously.com wrote:
 Or, if you want to actually read that query plan, try:
 http://explain.depesz.com/s/qYq 

Much better, though I prefer a text attachment...  anyhow, I think the
root of the problem may be that both of the subquery scans under the
append node are seeing hundreds of times more rows than they're
expecting, which is causing the planner to choose nested loops higher
up that it otherwise might have preferred to implement in some other
way.  I'm not quite sure why, though.

...Robert





[PERFORM] Hashaggregate estimates

2010-01-13 Thread Jorge Montero
I'm having some performance problems in a few sales reports running on Postgres 
8.3 on Red Hat 4.1.2. The hardware is a bit old, but it performs well enough. 
The reports are typical sales reporting fare: gather the sales of a time period 
based on some criteria, aggregate them by product, and then join with a bunch of 
other tables to display different kinds of product information.
 
I'll spare you the pain of looking at the entire queries, since the ultimate 
issue appears to be the same in each: the innermost table of the queries is an 
inline view, which aggregates the data by product. It runs rather quickly, but 
Postgres underestimates the number of rows that come out of it, making the rest 
of the query plan rather suboptimal. The inline view looks like this:
 
select sku_id, sum(rs.price) as dollarsSold, sum(rs.quantity) as units  
  from reporting.sales rs 
  where rs.sale_date between ? AND ?  group by sku_id
 
In some cases we see extra conditions besides the dates, but they have the same 
shape. Barring a massive date range, the rest of the filters are less selective 
than the date, so Postgres uses an index on (sale_date, sku_id). I have increased 
the statistics target on sale_date quite a bit to make sure Postgres makes decent 
row estimates. The problem is in the aggregation:
 
HashAggregate  (cost=54545.20..54554.83 rows=642 width=24) (actual 
time=87.945..98.219 rows=11462 loops=1)
  ->  Index Scan using reporting_sales_sale_date_idx on sales rs  
(cost=0.00..54288.63 rows=34209 width=24) (actual time=0.042..34.194 rows=23744 
loops=1)
Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without 
time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time 
zone))
Total runtime: 10.110 ms
 
As you can see, the index scan's estimate is pretty close when I use a single 
condition, but the aggregate estimate is off by a factor of 20. When I add 
further conditions, the estimate just gets worse and worse.
 
HashAggregate  (cost=8894.83..8894.85 rows=1 width=24) (actual 
time=6.444..6.501 rows=92 loops=1)
  ->  Index Scan using reporting_sales_sale_date_sku_id_idx on sales rs  
(cost=0.00..8894.76 rows=9 width=24) (actual time=0.103..6.278 rows=94 loops=1)
Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without 
time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time 
zone) AND ((sale_channel)::text = 'RETAIL'::text))
Filter: ((activity_type)::text = 'RETURN'::text)
Total runtime: 6.583 ms
I think I've done what I could when it comes to altering statistics: For 
example, activity_type and sale_channel have full statistics, and they are 
rather independent as filtering mechanisms: If all Postgres did when trying to 
estimate their total filtering capacity was just multiply the frequency of each 
value, the estimates would not be far off.
 
The killer seems to be the row aggregation. There are about 95K different values 
of sku_id in the sales table, and even the best-selling items are a very small 
percentage of all rows, so expecting the aggregation to consolidate the rows 50:1 
like it does in one of the explains above is a pipe dream. I've increased the 
statistics target on sku_id into the three digits, but the results are not any 
better:
 
schemaname;tablename;attname;null_frac;avg_width;n_distinct;most_common_freqs
reporting;sales;sku_id;0;11;58337;{0.00364167,0.0027125,0.00230417,0.00217083,0.00178333,0.001675,0.00136667,0.00135,0.0012875,0.0011875,
 
Is there any way I can coax Postgres into making a more realistic aggregation 
estimate? I could just delay aggregation until the rest of the data is joined, 
making the estimate's failure moot, but the price would be quite hefty in some of 
the reports, which can return 20K products with row widths of over 150. So it's 
not optimal, especially when the same query that requests 100 rows today could 
end up requesting 80K.
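
For what it's worth, the delayed-aggregation rewrite mentioned above would look 
roughly like this; the products join is hypothetical, standing in for whatever 
the real reports join against:

-- Join the raw sales rows first and aggregate last, so the bad estimate
-- of distinct sku_ids no longer drives the join strategy. Everything
-- except reporting.sales is a made-up name.
SELECT p.sku_id,
       p.description,
       sum(rs.price)    AS dollarsSold,
       sum(rs.quantity) AS units
  FROM reporting.sales rs
  JOIN products p USING (sku_id)
 WHERE rs.sale_date BETWEEN ? AND ?
 GROUP BY p.sku_id, p.description;

As noted, with reports returning 20K products and wide rows that can get 
expensive, so it's a trade-off rather than a fix.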