Re: [PERFORM] how to handle a big table for data log

2010-07-19 Thread Jorge Montero
Large tables, by themselves, are not necessarily a problem. The problem is what 
you might be trying to do with them. Depending on the operations you are trying 
to do, partitioning the table might help performance or make it worse.
 
What kind of queries are you running? How many days of history are you keeping? 
Could you post an explain analyze output of a query that is being problematic?
Given the amount of data you hint at, your server configuration and any custom 
statistics targets for the big tables in question would also be useful.
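
As a point of reference only, here is a minimal sketch of the inheritance-based 
partitioning available in 8.x, with entirely hypothetical table and column names; 
it is not a recommendation for your particular schema:

-- Parent table plus one child per day, each child constrained so the
-- planner can skip irrelevant partitions.
CREATE TABLE app_log (
    log_time  timestamp NOT NULL,
    log_type  text      NOT NULL,
    message   text
);

CREATE TABLE app_log_2010_07_19 (
    CHECK (log_time >= DATE '2010-07-19' AND log_time < DATE '2010-07-20')
) INHERITS (app_log);

CREATE INDEX app_log_2010_07_19_time_idx ON app_log_2010_07_19 (log_time);

-- With constraint_exclusion enabled, a date-bounded query against the
-- parent only scans the matching children.
SELECT count(*)
  FROM app_log
 WHERE log_time >= DATE '2010-07-19' AND log_time < DATE '2010-07-20';

Whether something like that actually helps depends entirely on the answers to 
the questions above.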

 kuopo sp...@cs.nctu.edu.tw 7/19/2010 1:27 AM 
Hi,

I have to handle a log table that accumulates a large amount of
log records. This table only involves insert and query operations.
To limit the table size, I tried splitting the table by date.
However, the number of logs is still large (46 million records per
day). To further limit its size, I tried splitting the log table by
log type as well, but this did not improve performance; it is much
slower than the single big table solution. I guess this is because
of the extra auto-vacuum/analyze cost across all the split tables.

Can anyone comment on this situation? Thanks in advance.


kuopo.



Re: [PERFORM] Need help in performance tuning.

2010-07-09 Thread Jorge Montero
If your app is running under Tomcat, connection pooling is extremely easy to 
set up from there: it has connection pooling mechanisms built in. Request your 
db connections through those mechanisms instead of opening them manually, make a 
couple of changes to server.xml, and the problem goes away. Hundreds, if not 
thousands, of concurrent users can end up being served by fewer than 10 
connections.
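
As a side note (not something Tomcat itself requires), a rough way to confirm 
the pool is actually capping connections is to look at pg_stat_activity on the 
server while the app is under load:

-- How many connections each role currently holds open. With a working
-- pool this should hover near the pool's configured maximum, not the
-- number of concurrent users.
SELECT usename, count(*) AS open_connections
  FROM pg_stat_activity
 GROUP BY usename
 ORDER BY open_connections DESC;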
 


 Harpreet singh Wadhwa harpr...@openbravo.com 7/9/2010 3:55 AM 
Thanks to you all for the replies.
I got a gist of where I should head: rely a bit on Postgres for
performance and the rest on my Tomcat and application.
I will also try connection pooling on the Postgres side.

And if I come back with any question (related to this topic), it
will be more precise (with real data from my testing). ;-)

Regards
haps

On Fri, Jul 9, 2010 at 1:22 PM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 On 09/07/10 12:42, Tom Lane wrote:
 Samuel Gendler sgend...@ideasculptor.com writes:
 On Thu, Jul 8, 2010 at 8:11 PM, Craig Ringer
 cr...@postnewspapers.com.au wrote:
 If you're not using a connection pool, start using one.

 I see this issue and subsequent advice cross this list awfully
 frequently.  Is there an architectural reason why postgres itself
 cannot pool incoming connections in order to eliminate the requirement
 for an external pool?

 Perhaps not, but there's no obvious benefit either.  Since there's
 More Than One Way To Do It, it seems more practical to keep that as a
 separate problem that can be solved by a choice of add-on packages.

 Admittedly I'm relatively ignorant of the details, but I increasingly
 think PostgreSQL will need a different big architectural change in the
 coming years, as the typical performance characteristics of machines
 change:

 It'll need to separate running queries from running processes, or
 start threading backends, so that one way or the other a single query
 can benefit from the capabilities of multiple CPUs. The same separation,
 or a move to async I/O, might be needed to get one query to concurrently
 read multiple partitions of a table, or otherwise get maximum benefit
 from high-capacity I/O subsystems when running just a few big, expensive
 queries.

 Otherwise I'm wondering if PostgreSQL will begin really suffering in
 performance on workloads where queries are big and expensive but there
 are relatively few of them running at a time.

 My point? *if* I'm not full of hot air and there's some truth to my
 blather above, any change like that might be accompanied by a move to
 separate query execution state from connection state, so that idle
 connections have a much lower resource cost.

 OK, that's my hand-waving for the day done.

 --
 Craig Ringer





Re: [PERFORM] Need help in performance tuning.

2010-07-09 Thread Jorge Montero
If anything were built into the database to handle such connections, I'd recommend 
a big, bold warning recommending the use of client-side pooling where available. 
For something like, say, a web server, pooling connections to the database 
provides a massive performance advantage regardless of how well the database copes 
with far more active queries than the hardware can handle: the assignment of a 
connection to a thread tends to be at least an order of magnitude cheaper than 
establishing a new connection for each new thread and destroying it when it dies. 
This is especially true if the client architecture relies on relatively 
short-lived threads.
 
While there are a few cases where pooling is counterproductive, they are 
relatively rare. This is why every Java application server out there will 
strongly recommend using its own facilities to connect to a database: the 
performance is almost always better, and it causes fewer headaches for the DBAs.
 
Now, if remote clients are accessing your database directly, setting up a pool 
in between might not be as straightforward or give you the same gains across the 
board, and that might be the only case where letting the db do its own pooling 
makes sense.

 Kevin Grittner kevin.gritt...@wicourts.gov 7/9/2010 12:52 PM 
Matthew Wakeling matt...@flymine.org wrote:
 On Fri, 9 Jul 2010, Kevin Grittner wrote:
 Interesting idea. As far as I can see, you are suggesting
 solving the too many connections problem by allowing lots of
 connections, but only allowing a certain number to do anything
 at a time?

 Right.
 
 I think in some situations, this arrangement would be an
 advantage.  However, I do not think it will suit the majority of
 situations, and could reduce the performance when the user doesn't
 need the functionality, either because they have a pool already,
 or they don't have many connections.

Oh, totally agreed, except that I think we can have essentially nil
impact if they don't exceed a configured limit.  In my experience,
pooling is more effective the closer you put it to the client.  I
suppose the strongest argument that could be made against building
in some sort of pooling is that it doesn't encourage people to look
for client-side solutions.  However, we seem to get a lot of posts
from people who don't do this, are not able to easily manage it, and
who would benefit from even a simple solution like this.

-Kevin



Re: [PERFORM] Query timing increased from 3s to 55s when used as a function instead of select

2010-05-25 Thread Jorge Montero
Have you read this?
 
http://blog.endpoint.com/2008/12/why-is-my-function-slow.html 
 
99% of the 'function is slow' problems are caused by this.

Have you checked the difference between explain and prepare + explain execute?
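
A minimal sketch of that comparison, using a made-up table and column in place 
of the real ones:

-- Plan the planner builds when it can see the literal value:
EXPLAIN SELECT * FROM staff_leave WHERE rota_id = 6;

-- Plan the way a plpgsql function sees the query: a prepared statement
-- with a parameter, which on 8.x yields a generic plan.
PREPARE p(int) AS SELECT * FROM staff_leave WHERE rota_id = $1;
EXPLAIN EXECUTE p(6);
DEALLOCATE p;

If the two plans differ wildly, that is usually the whole story.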

 Tyler Hildebrandt ty...@campbell-lange.net 05/25/10 4:59 AM 
We're using a function whose query, when run as a standalone select statement
outside of the function, takes roughly 1.5s to complete, whereas running the
identical query within the function takes around 55s to complete.

We are lost as to why placing this query within a function as opposed to
substituting the variables in a select statement is so drastically different.

The timings posted here are from a 512MB memory virtual machine and are not of
major concern on their own but we are finding the same issue in our production
environment with far superior hardware.

The function can be found here:
http://campbell-lange.net/media/files/fn_medirota_get_staff_leave_summary.sql 

---

Timings for the individual components on their own is as follows:

select * from fn_medirota_validate_rota_master(6);
Time: 0.670 ms

select to_date(EXTRACT (YEAR FROM current_date)::text, 'YYYY');
Time: 0.749 ms

select * from fn_medirota_people_template_generator(2, 6, date'2009-01-01',
date'2009-12-31', TRUE) AS templates;
Time: 68.004 ms

select * from fn_medirota_people_template_generator(2, 6, date'2010-01-01',
date'2010-12-31', TRUE) AS templates;
Time: 1797.323 ms


Copying the exact same for loop select statement from the query above into
the psql query buffer and running them with variable substitution yields the
following:

Running the FOR loop SELECT with variable substitution:
Time: 3150.585 ms


Whereas invoking the function yields:

select * from fn_medirota_get_staff_leave_summary(6);
Time: 57375.477 ms


We have tried using explain analyse to update the query optimiser, dropped and
recreated the function and have restarted both the machine and the postgres
server multiple times.

Any help or advice would be greatly appreciated.


Kindest regards,
Tyler Hildebrandt

---

EXPLAIN ANALYSE VERBOSE SELECT * FROM fn_medirota_get_staff_leave_summary(6);

QUERY PLAN
-
{FUNCTIONSCAN
:startup_cost 0.00
:total_cost 260.00
:plan_rows 1000
:plan_width 85
:targetlist (
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 1
  :vartype 23
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 1
  }
   :resno 1
   :resname id
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 2
  :vartype 1043
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 2
  }
   :resno 2
   :resname t_full_name
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 3
  :vartype 16
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 3
  }
   :resno 3
   :resname b_enabled
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 4
  :vartype 1043
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 4
  }
   :resno 4
   :resname t_anniversary
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 5
  :vartype 23
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 5
  }
   :resno 5
   :resname n_last_year_annual
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 6
  :vartype 23
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 6
  }
   :resno 6
   :resname n_last_year_other
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  :varno 1
  :varattno 7
  :vartype 23
  :vartypmod -1
  :varlevelsup 0
  :varnoold 1
  :varoattno 7
  }
   :resno 7
   :resname n_this_year_annual
   :ressortgroupref 0
   :resorigtbl 0
   :resorigcol 0
   :resjunk false
   }
   {TARGETENTRY
   :expr
  {VAR
  

Re: [PERFORM] bgwriter, checkpoints, curious (seeing delays)

2010-02-26 Thread Jorge Montero


 Tory M Blue tmb...@gmail.com 02/26/10 12:52 PM 

 This is too much. Since you have 300 connections, you will probably swap
 because of this setting, since each connection may use this much
 work_mem. The rule of thumb is to set this to a lower general value
 (say, 1-2 MB), and set it per-query when needed.

 I'm slightly confused. Most things I've read, including running
 pg_tune for grins, put this around 100MB (98MB for pgtune).  1-2MB just
 seems really low to me. And ignore the 300 connections; that's an upper
 limit. I usually run a max of 40-45, but usually around 20 connections
per sec.

It has been said on this list before that pg_tune is extremely aggressive when 
it comes to work_mem.

100MB is just a whole lot of memory for something that is dedicated mostly to 
sorting. Some of my relatively heavy duty queries, which end up manipulating 
hundreds of thousands of rows in subqueries, do just fine with quite a bit less.

1-2MB is good enough for many families of queries, but it's hard to say what 
the right default should be for you. The right number can be estimated by 
running explain analyze on your most common queries, with parameters that are 
representative of regular use, and seeing how much memory they actually claim to 
use. In my case, for example, most of my queries do just fine with 10 MB, while 
the reporting queries that accumulate quite a bit of data request up to 60MB.
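
For example, one way to experiment (the values and query here are arbitrary 
stand-ins) is to keep the global setting modest and raise work_mem only for the 
session or transaction that needs it, then check whether EXPLAIN ANALYZE stops 
reporting sorts that spill to disk:

SET work_mem = '2MB';            -- modest session default

BEGIN;
SET LOCAL work_mem = '64MB';     -- applies only within this transaction
EXPLAIN ANALYZE
SELECT customer_id, sum(amount)  -- hypothetical reporting query
  FROM big_report_table
 GROUP BY customer_id
 ORDER BY 2 DESC;
COMMIT;

On 8.3 the EXPLAIN ANALYZE output shows the sort method ('quicksort' in memory 
vs. 'external merge' on disk), which makes it fairly easy to see when work_mem 
was too small.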

If your average query needs 100 MB, it'd still mean that 40 connections take 4 
gigs worth of work memory, which might be better spent caching the database.

Now, if your system is so over-specced that wasting a few gigs of RAM doesn't 
impact your performance one bit, then you might not have to worry about this at 
all.





Re: [PERFORM] Re: 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-10 Thread Jorge Montero
That sure looks like the source of the problem to me too. I've seen similar 
behavior in queries not very different from that one. It's hard to guess what the 
problem is exactly without more knowledge of the data distribution in 
article_words, though.

Given the results of analyze, I'd run the deepest subquery and see whether I 
could get the estimate to match reality, either by altering statistics targets 
or by tweaking the query to give more information to the planner.

For example, I'd check whether the estimated number of rows for

SELECT context_key FROM article_words JOIN words using (word_key) WHERE word = 
'insider'

is much less accurate than the estimate for

SELECT context_key FROM article_words WHERE word_key = (whatever the actual 
word_key for insider is)
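
A sketch of that check (the literal word_key is obviously made up):

EXPLAIN ANALYZE
SELECT context_key
  FROM article_words
  JOIN words USING (word_key)
 WHERE word = 'insider';

EXPLAIN ANALYZE
SELECT context_key
  FROM article_words
 WHERE word_key = 12345;

If the second estimate is much closer to the real row count, the planner's 
trouble is in guessing how many article_words rows match a word it only 
discovers through the join.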


 Robert Haas robertmh...@gmail.com 02/10/10 2:31 PM 
On Wed, Feb 10, 2010 at 3:29 AM, Bryce Nesbitt bry...@obviously.com wrote:
 Or, if you want to actually read that query plan, try:
 http://explain.depesz.com/s/qYq 

Much better, though I prefer a text attachment...  anyhow, I think the
root of the problem may be that both of the subquery scans under the
append node are seeing hundreds of times more rows than they're
expecting, which is causing the planner to choose nested loops higher
up that it otherwise might have preferred to implement in some other
way.  I'm not quite sure why, though.

...Robert





[PERFORM] Hashaggregate estimates

2010-01-13 Thread Jorge Montero
I'm having some performance problems in a few sales reports running on Postgres 
8.3 on Red Hat 4.1.2. The hardware is a bit old, but it performs well enough. 
The reports are typical sales reporting fare: gather the sales of a time period 
based on some criteria, aggregate them by product, and then join with a bunch of 
other tables to display different kinds of product information.
 
I'll spare you the pain of looking at the entire queries, since the ultimate 
issue appears to be the same in each: the innermost table of the queries is an 
inline view, which aggregates the data by product. It runs rather quickly, but 
Postgres underestimates the number of rows that come out of it, making the rest 
of the query plan rather suboptimal. The inline view looks like this:
 
select sku_id, sum(rs.price) as dollarsSold, sum(rs.quantity) as units  
  from reporting.sales rs 
  where rs.sale_date between ? AND ?  group by sku_id
 
In some cases we see extra conditions besides the dates, but they have the same 
shape. Barring a massive date range, the rest of the filters are less selective 
than the date, so Postgres uses an index on (sale_date, sku_id). I have increased 
the statistics target on sale_date quite a bit to make sure Postgres makes decent 
row estimates. The problem is in the aggregation:
 
HashAggregate  (cost=54545.20..54554.83 rows=642 width=24) (actual 
time=87.945..98.219 rows=11462 loops=1)
  ->  Index Scan using reporting_sales_sale_date_idx on sales rs  
(cost=0.00..54288.63 rows=34209 width=24) (actual time=0.042..34.194 rows=23744 
loops=1)
Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without 
time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time 
zone))
Total runtime: 10.110 ms
 
As you can see, the index scan's estimate is pretty close when I use a single 
condition, but the aggregate estimate is off by a factor of 20. When I add 
further conditions, the estimate just gets worse and worse.
 
HashAggregate  (cost=8894.83..8894.85 rows=1 width=24) (actual 
time=6.444..6.501 rows=92 loops=1)
  ->  Index Scan using reporting_sales_sale_date_sku_id_idx on sales rs  
(cost=0.00..8894.76 rows=9 width=24) (actual time=0.103..6.278 rows=94 loops=1)
Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without 
time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time 
zone) AND ((sale_channel)::text = 'RETAIL'::text))
Filter: ((activity_type)::text = 'RETURN'::text)
Total runtime: 6.583 ms
I think I've done what I could when it comes to altering statistics: For 
example, activity_type and sale_channel have full statistics, and they are 
rather independent as filtering mechanisms: If all Postgres did when trying to 
estimate their total filtering capacity was just multiply the frequency of each 
value, the estimates would not be far off.
 
The killer seems to be the row aggregation. There are about 95K different values 
of sku_id in the sales table, and even the best-selling items are a very small 
percentage of all rows, so expecting the aggregation to consolidate the rows 50:1 
like it does in one of the explains above is a pipe dream. I've increased the 
statistics target on sku_id into the three digits, but the results are not any 
better:
 
schemaname;tablename;attname;null_frac;avg_width;n_distinct;most_common_freqs
reporting;sales;sku_id;0;11;58337;{0.00364167,0.0027125,0.00230417,0.00217083,0.00178333,0.001675,0.00136667,0.00135,0.0012875,0.0011875,
 
Is there any way I can coax Postgres into making a more realistic aggregation 
estimate? I could just delay aggregation until the rest of the data is joined, 
making the estimate's failure moot, but the price would be quite hefty in some of 
the reports, which can return 20K products with row widths of over 150. So it's 
not optimal, especially when the same query that requests 100 rows today could 
end up requesting 80K.
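
For what it's worth, the delayed-aggregation rewrite mentioned above would look 
roughly like this; the products join is hypothetical, standing in for whatever 
the real reports join against:

-- Join the raw sales rows first and aggregate last, so the bad estimate
-- of distinct sku_ids no longer drives the join strategy. Everything
-- except reporting.sales is a made-up name.
SELECT p.sku_id,
       p.description,
       sum(rs.price)    AS dollarsSold,
       sum(rs.quantity) AS units
  FROM reporting.sales rs
  JOIN products p USING (sku_id)
 WHERE rs.sale_date BETWEEN ? AND ?
 GROUP BY p.sku_id, p.description;

As noted, with reports returning 20K products and wide rows that can get 
expensive, so it's a trade-off rather than a fix.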