Re: FW: [PERFORM] Performance 8.4.0

2009-08-02 Thread Robert Haas
On Sun, Aug 2, 2009 at 10:04 PM, Chris Dunn wrote:
> The database is 8 GB currently. It used to be a lot bigger, but we moved
> all the large objects out and developed file-server storage for them. I am
> using the default page costs for 8.4; I did have them changed in 8.1.4.

You might want to play with lowering them.  The default page costs
make page accesses expensive relative to per-tuple operations, which
is appropriate if you are I/O-bound but not so much if you are CPU
bound, and especially if the whole database is memory resident.  I'd
try something like random_page_cost = seq_page_cost = 0.1 for
starters, or whatever values were working for you in 8.1, but the
sweet spot may be higher or lower.
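For example, as a starting sketch (the exact figures need testing against
your own workload; they can be tried per session before persisting anything):

SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
-- once a good value is found, persist it in postgresql.conf:
--   seq_page_cost = 0.1
--   random_page_cost = 0.1
-- and reload: SELECT pg_reload_conf();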

...Robert



Re: [PERFORM] Greenplum MapReduce

2009-08-02 Thread Chris

Suvankar Roy wrote:


Hi all,

Has anybody worked on Greenplum MapReduce programming ?


It's a commercial product; you need to contact Greenplum.

--
Postgresql & php tutorials
http://www.designmagick.com/




FW: [PERFORM] Performance 8.4.0

2009-08-02 Thread Chris Dunn
The database is 8 GB currently. It used to be a lot bigger, but we moved all
the large objects out and developed file-server storage for them. I am using
the default page costs for 8.4; I did have them changed in 8.1.4.

-Original Message-
From: Robert Haas [mailto:robertmh...@gmail.com]
Sent: Sunday, 2 August 2009 11:26 PM
To: Chris Dunn
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance 8.4.0

On Fri, Jul 31, 2009 at 12:22 AM, Chris Dunn wrote:
> constraint_exclusion = on

This is critical if you need it, but a waste of CPU time if you don't.
 Other than that your parameters look good.  Are you using the default
page cost settings?  I see you have 12 GB RAM; how big is your
database?

...Robert



Re: [PERFORM] Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version

2009-08-02 Thread Euler Taveira de Oliveira
parimala wrote:

[Don't repeat your answer. It's a PITA to receive multiple identical copies]

> We are using Postgres 8.3.7 in our java application. We are doing
> performance tuning and load testing in our setup. We have noticed that
> some of our queries to the database take a long time to return results.
> Please find our setup details below.
> 
> We observed that postgres running on Windows is slower than on Linux.
> 
That is true, and it will be for quite some time. The Windows port is very
recent compared with the long road of Unix support.

> Postgresql configuration:
> 
> shared_buffers: 1 GB
I don't use Windows, but I have read some Windows users saying that it isn't
appropriate to set shared_buffers too high there. Take a look at the archives.
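Purely as an illustration (an assumed figure, not a tested recommendation;
the right value depends on the machine), a more modest Windows setting in
postgresql.conf might look like:

shared_buffers = 512MB         # Windows reportedly prefers smaller values here
effective_cache_size = 2GB     # planner hint only; allocates no memory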

> Effective_cache_size: 2GB
> fsync: off (even when we tested with this parameter on, we observed the
> same slowness)
> 
What about the other parameters that differ from the defaults (the
uncommented parameters)? Also, don't turn off fsync unless you're pretty sure
about the consequences.

> We have 300k rows in the PolledData table. In each STATSDATA table we have
> almost 12 to 13 million rows. Every one-minute interval we insert data
> into the STATSDATA table. In our application we run insert and select
> queries against the STATSDATA table at regular intervals. Please let us
> know why the query below takes so long to return results. Is there
> anything we need to do to tune the postgres database?
> 
It seems very strange that your queries are not using the indexes. Do you have
autovacuum turned on? Did you recently analyze your tables?
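For instance, a quick check along those lines (a sketch; the table names are
taken from the plan quoted below):

SHOW autovacuum;            -- should be 'on'
ANALYZE polleddata;         -- refresh the planner's statistics
ANALYZE statsdata8_21_2009;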

>  Merge Cond: (statsdata8_21_2009.pollid =
> ((polleddata.id)::numeric))
Out of curiosity, why does the foreign key have a different datatype from its
primary key?


-- 
  Euler Taveira de Oliveira
  http://www.timbira.com/



Re: [PERFORM] Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version

2009-08-02 Thread PFC


> some of our queries to the database take a long time to return results.
> 
> fsync: off (even when we tested with this parameter on, we observed the
> same slowness)


If your queries take a long time to return results, I suppose you are
talking about SELECTs.


fsync = off will not make SELECTs faster (only inserts, updates, and
deletes), and it is not worth it, as you risk data loss.


synchronous_commit = off has about the same advantage (faster writes) as
fsync = off, but with no risk of data corruption (at worst the last few
transactions are lost after a crash), so it is much better!
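A sketch of how that could be applied (it can also be set per session or per
transaction, which limits the relaxed durability to the insert-heavy work):

-- cluster-wide, in postgresql.conf:
--   synchronous_commit = off
-- or only in the loading session:
SET synchronous_commit TO off;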



> We have 300k rows in the PolledData table. In each STATSDATA table we have
> almost 12 to 13 million rows.


OK. So you insert 13 million rows per day?
That is about 150 rows per second (13,000,000 rows / 86,400 seconds per day
≈ 150).


> Every one-minute interval we insert data into the STATSDATA table.


I assume you are making a single INSERT INTO statsdata VALUES (... 150
values ...) and not 150 separate inserts, yes?
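For reference, a single multi-row insert would look like this (a sketch with
made-up values; one statement per minute instead of 150 statements):

INSERT INTO statsdata8_21_2009 (pollid, instance, ttime, val) VALUES
(1, 'if0', 1250838027454, 10),
(2, 'if1', 1250838027454, 20),
(3, 'if2', 1250838027454, 30);  -- ... and so on, up to 150 rows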


First Query:

SELECT COUNT(*) FROM (
SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL
FROM PolledData, STATSDATA8_21_2009 WHERE
( ( PolledData.ID=STATSDATA8_21_2009.POLLID)
AND ( ( TTIME >= 1250838027454)
AND ( TTIME <= 1250838079654) ) ) ) t1;

* You could rewrite it as:

SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL
FROM PolledData
JOIN STATSDATA8_21_2009 ON ( PolledData.ID = STATSDATA8_21_2009.POLLID)
WHERE TTIME BETWEEN ... AND ...

- It is exactly the same query, but much easier to read.

* Some ANALYZE-ing of your tables would be useful, since the estimates
from the planner look suspiciously different from reality.

- ANALYZE is fast; you can run it often if you INSERT rows all the time.

* You are joining on POLLID which is a NUMERIC in one table and a BIGINT  
in the other table.

- Is there any reason for this type difference ?
- Could you use BIGINT in both tables ?
- BIGINT is faster than NUMERIC and uses less space.
- Type conversions use CPU cycles too.
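If you can unify the types, the conversion might look like this (a sketch; it
assumes pollid holds only integral values, and it rewrites the whole table,
so run it during a quiet period):

ALTER TABLE statsdata8_21_2009 ALTER COLUMN pollid TYPE bigint;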

* Should StatsData.ID have a foreign key REFERENCES PolledData.ID ?
- This won't make the query faster, but if you know all rows in StatsData  
reference rows in PolledData (because of the FK constraint) and you want a  
count(*) like above, you don't need to JOIN.
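A sketch of that constraint (note: PolledData's primary key as posted is
composite ("ID","NAME","AGENT","OID"), so a foreign key on ID alone would
also need a unique constraint on "ID"; illustrative, not a drop-in):

ALTER TABLE PolledData ADD CONSTRAINT polleddata_id_uniq UNIQUE ("ID");
ALTER TABLE statsdata8_21_2009 ADD CONSTRAINT statsdata_pollid_fk
FOREIGN KEY (pollid) REFERENCES PolledData ("ID");  -- ideally after pollid is bigint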


* TTIME >= 1250838027454 AND TTIME <= 1250838079654
- TTIME should be TIMESTAMP (with or without TIME ZONE) or BIGINT, but
certainly not NUMERIC.
- An index on StatsData.TTIME would be useful; it would avoid the Seq Scan,
replacing it with a Bitmap Scan, which is much faster.
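A minimal sketch of that index (assumed name; you would create one on each
daily table):

CREATE INDEX statsdata8_21_2009_ttime_ndx ON statsdata8_21_2009 (ttime);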


* work_mem
- since you have few connections you could increase work_mem
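For example (a sketch; size it against your RAM divided across the 40
connections, since each sort or hash can use this much per backend):

SET work_mem = '32MB';
-- or in postgresql.conf: work_mem = 32MB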


Second Query :


Same as first query


Third Query


SELECT COUNT(*) FROM (
SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL
FROM PolledData, STATSDATA8_21_2009
WHERE ( ( PolledData.ID=STATSDATA8_21_2009.POLLID)
AND ( ( TTIME >= 1250838027454) AND ( TTIME <= 1250838027454) ) )
UNION ALL
SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL
FROM PolledData, STATSDATA8_20_2009
WHERE ( ( PolledData.ID=STATSDATA8_20_2009.POLLID)
AND ( ( TTIME >= 1250767134601) AND ( TTIME <= 1250767134601) ) ) ) t1;


Basically this is, again, exactly the same query as above, but run twice
and UNION ALL'ed.


* You could rewrite it like this (PolledData joined once against the
UNION ALL of the two stats tables):

SELECT COUNT(*)
FROM PolledData
JOIN ( SELECT POLLID FROM STATSDATA8_21_2009 WHERE TTIME BETWEEN ... AND ...
       UNION ALL
       SELECT POLLID FROM STATSDATA8_20_2009 WHERE TTIME BETWEEN ... AND ...
     ) s
ON ( PolledData.ID = s.POLLID );

* If TTIME is the current time, and you insert data as it comes in, the data
in the StatsData tables is probably already ordered on TTIME.
- If that is not the case, once a table is filled and becomes read-only,
consider CLUSTER on the index you created on TTIME.

- It will make range queries on TTIME much faster.
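A sketch, reusing the index naming assumed above:

CLUSTER statsdata8_20_2009 USING statsdata8_20_2009_ttime_ndx;
ANALYZE statsdata8_20_2009;  -- refresh statistics after the rewrite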

* Query plan
Seq Scan on statsdata8_21_2009  (cost=0.00..70574.88 rows=1 width=32) (actual time=0.047..29066.227 rows=227 loops=1)
Seq Scan on statsdata8_20_2009  (cost=0.00..382519.60 rows=1 width=32) (actual time=3136.008..93985.540 rows=1 loops=1)

Postgres thinks there is 1 row in those tables... that's probably not the
case!

The first one returns 227 rows, so the plan chosen is a catastrophe.

I was a bit intrigued by your query, so I made a little test...

BEGIN;
CREATE TABLE test( x INT, y INT );
INSERT INTO test (SELECT n,n FROM generate_series( 1,1000000 ) AS n );
CREATE INDEX test_x ON test( x );
CREATE INDEX test_y ON test( y );
COMMIT;

ANALYZE test;

test=> EXPLAIN ANALYZE SELECT * FROM test a JOIN test b ON (b.x=a.x) WHERE  
a.x BETWEEN 0 AND 1;

 QUERY PLAN
-
 Hash Join  (cost=480.53..23759.14 rows=10406 width=16) (actual  
time=15.614..1085.085 rows=1 loops=1)

   Hash Cond: (b.x = a.x)
   ->  Seq Scan on test b  (cost=0.00..14424.76 ro

Re: [PERFORM] Performance Monitoring tool

2009-08-02 Thread Frits Hoogland
It depends on what you mean by 'monitor'. For up/down monitoring, use Nagios
(http://www.nagios.org). For performance monitoring (which I guess is the
reason you ask this on the postgresql-performance list), use pgstatspack
(http://pgfoundry.org/projects/pgstatspack/).

frits

On Fri, Jul 31, 2009 at 7:49 AM, mukeshp  wrote:

>
> Hi
>
> Can anyone tell me what tools are available to monitor a Postgres server?
> I am running my Postgres server on an RHEL 5 machine.


Re: [PERFORM] Performance 8.4.0

2009-08-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Jul 31, 2009 at 12:22 AM, Chris Dunn wrote:
> > constraint_exclusion = on
> 
> This is critical if you need it, but a waste of CPU time if you don't.
>  Other than that your parameters look good.  Are you using the default
> page cost settings?  I see you have 12 GB RAM; how big is your
> database?

With 8.4, you can set 'constraint_exclusion = partition', where it'll
handle inherited tables and UNION ALL queries but not other possible
cases.  It's set that way by default, and is pretty inexpensive to leave
in place (since it only gets tried when it's likely you want it).

I'd recommend setting it to partition under 8.4 rather than disabling it
entirely.  Under older versions, set it to 'off' if you don't need it.
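A sketch of the 8.4 setting (per session for testing, or in postgresql.conf):

SET constraint_exclusion = partition;  -- the 8.4 default
-- on 8.3 and earlier the only choices are on/off:
--   constraint_exclusion = off        # if you don't use partitioning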

Thanks,

Stephen




[PERFORM] Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version

2009-08-02 Thread pari krishnan
Dear All,


We are using Postgres 8.3.7 in our java application. We are doing performance
tuning and load testing in our setup. We have noticed that some of our
queries to the database take a long time to return results. Please find our
setup details below.

We observed that postgres running on Windows is slower than on Linux.

Machine & Database Details :

Windows configuration:
4 GB RAM
4*1.6 GHZ
windows 2008 server standard edition

Postgresql configuration:

shared_buffers: 1 GB
Effective_cache_size: 2GB
fsync: off (even when we tested with this parameter on, we observed the same
slowness)


Database Details : 

Postgres  Database   : PostgreSQL 8.3.7.1
Driver Version: PostgreSQL 8.3 JDBC4 with SSL (build 604)
We are using 40 database connections.


We have a few tables which hold a large amount of data. While running our
application, a STATSDATA table is created daily with the date in its name,
e.g. STATSDATA8_21_2009.

Schema for STATSDATA table

create table STATSDATA8_21_2009(
POLLID Numeric(19),
INSTANCE varchar(100),
TTIME Numeric(19),
VAL Numeric(13)) ;CREATE INDEX POLLID%_ndx on STATSDATA%(POLLID)

Schema for PolledData

create table PolledData(
"NAME" varchar(50) NOT NULL ,
"ID" BIGINT NOT NULL ,
"AGENT" varchar(50) NOT NULL ,
"COMMUNITY" varchar(100) NOT NULL ,
"PERIOD" INTEGER NOT NULL,
"ACTIVE" varchar(10),
"OID" varchar(200) NOT NULL,
"LOGDIRECTLY" varchar(10),
"LOGFILE" varchar(100),
"SSAVE" varchar(10),
"THRESHOLD" varchar(10),
"ISMULTIPLEPOLLEDDATA" varchar(10),
"PREVIOUSSEVERITY" INTEGER,
"NUMERICTYPE" INTEGER,
"SAVEABSOLUTES" varchar(10),
"TIMEAVG" varchar(10),
"PORT" INTEGER,
"WEBNMS" varchar(100),
"GROUPNAME" varchar(100),
"LASTCOUNTERVALUE" BIGINT ,
"LASTTIMEVALUE" BIGINT ,
"TIMEVAL" BIGINT NOT NULL ,
"POLICYNAME" varchar(100),
"THRESHOLDLIST" varchar(200),
"DNSNAME" varchar(100),
"SUFFIX" varchar(20),
"STATSDATATABLENAME" varchar(100),
"POLLERNAME" varchar(200),
"FAILURECOUNT" INTEGER,
"FAILURETHRESHOLD" INTEGER,
"PARENTOBJ" varchar(100),
"PROTOCOL" varchar(50),
"SAVEPOLLCOUNT" INTEGER,
"CURRENTSAVECOUNT" INTEGER,
"SAVEONTHRESHOLD" varchar(10),
"SNMPVERSION" varchar(10),
"USERNAME" varchar(30),
"CONTEXTNAME" varchar(30),
PRIMARY KEY ("ID","NAME","AGENT","OID"),
index PolledData0_ndx ( "NAME"),
index PolledData1_ndx ( "AGENT"),
index PolledData2_ndx ( "OID"),
index PolledData3_ndx ( "ID"),
index PolledData4_ndx ( "PARENTOBJ"),
)


We have 300k rows in the PolledData table. In each STATSDATA table we have
almost 12 to 13 million rows. Every one-minute interval we insert data into
the STATSDATA table. In our application we run insert and select queries
against the STATSDATA table at regular intervals. Please let us know why the
queries below take so long to return results. Is there anything we need to
do to tune the postgres database?




Please find explain analyze output.


First Query :

postgres=# explain analyze SELECT COUNT(*) FROM ( SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL FROM PolledData, STATSDATA8_21_2009 WHERE ( ( PolledData.ID=STATSDATA8_21_2009.POLLID) AND ( ( TTIME >= 1250838027454) AND ( TTIME <= 1250838079654) ) ) ) t1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=773897.12..773897.13 rows=1 width=0) (actual time=17818.410..17818.412 rows=1 loops=1)
   ->  Merge Join  (cost=717526.23..767505.06 rows=2556821 width=0) (actual time=17560.469..17801.790 rows=13721 loops=1)
         Merge Cond: (statsdata8_21_2009.pollid = ((polleddata.id)::numeric))
         ->  Sort  (cost=69708.44..69742.49 rows=13619 width=8) (actual time=2392.659..2416.093 rows=13721 loops=1)
               Sort Key: statsdata8_21_2009.pollid
               Sort Method:  quicksort  Memory: 792kB
               ->  Seq Scan on statsdata8_21_2009  (cost=0.00..68773.27 rows=13619 width=8) (actual time=0.077..2333.132 rows=13721 loops=1)
                     Filter: ((ttime >= 1250838027454::numeric) AND (ttime <= 1250838079654::numeric))
         ->  Materialize  (cost=647817.78..688331.92 rows=3241131 width=8) (actual time=15167.767..15282.232 rows=21582 loops=1)
               ->  Sort  (cost=647817.78..655920.61 rows=3241131 width=8) (actual time=15167.756..15218.645 rows=21574 loops=1)
                     Sort Key: ((polleddata.id)::numeric)
                     Sort Method:  external merge  Disk: 736kB
                     ->  Seq Scan on polleddata  (cost=0.00..164380.31 rows=3241131 width=8) (actual time=1197.278..14985.665 rows=23474 loops=1)
 Total runtime: 17826.511 ms
(14 rows)

Second Query :

postgres=# explain analyze SELECT COUNT(*) FROM ( SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL FROM PolledData, STATSDATA8_20_2009 WHERE ( ( PolledData.ID=STATSDATA8_20_2009.POLLID) AND ( ( TTIME >= 1250767134601) AND ( TTIME <= 1250767384601) ) ) ) t1;
 

[PERFORM] Performance Monitoring tool

2009-08-02 Thread mukeshp

Hi

Can anyone tell me what tools are available to monitor a Postgres server? I
am running my Postgres server on an RHEL 5 machine.


[PERFORM] SQL select query becomes slow when using limit (with no offset)

2009-08-02 Thread Kees van Dieren
Hi folks,

We are having performance problems with a simple SQL statement.

If we add a LIMIT 50, the query is about 6 times slower than without a limit
(the query returns 2 rows).

I have read this discussion:
http://archives.postgresql.org/pgsql-performance/2008-09/msg5.php but
there seems to be no solution in it.

I tried the things on
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server but changing
settings didn't have a significant effect.

The DDL statements (create tables, indices) are attached.

The events_events table contains 375K rows; the events_event_types table
contains 71 rows.

The query:
select events_events.id FROM events_events
left join events_event_types on events_events.eventType_id=
events_event_types.id
where events_event_types.severity=70
and events_events.cleared='f'
order by events_events.dateTime DESC

It takes 155 ms to run this query (returning 2 rows).

After adding LIMIT 10, it takes 950 ms to run.

Query plan: without limit:
"Sort  (cost=20169.62..20409.50 rows=95952 width=16)"
"  Sort Key: events_events.datetime"
"  ->  Hash Join  (cost=2.09..12229.58 rows=95952 width=16)"
"Hash Cond: (events_events.eventtype_id = events_event_types.id)"
"->  Seq Scan on events_events  (cost=0.00..9918.65 rows=359820
width=24)"
"  Filter: (NOT cleared)"
"->  Hash  (cost=1.89..1.89 rows=16 width=8)"
"  ->  Seq Scan on events_event_types  (cost=0.00..1.89 rows=16
width=8)"
"Filter: (severity = 70)"

Query plan: with limit:
"Limit  (cost=0.00..12.50 rows=10 width=16)"
"  ->  Nested Loop  (cost=0.00..119932.21 rows=95952 width=16)"
"->  Index Scan Backward using events_events_datetime_ind on
events_events  (cost=0.00..18242.28 rows=359820 width=24)"
"  Filter: (NOT cleared)"
"->  Index Scan using events_event_types_pkey on events_event_types
(cost=0.00..0.27 rows=1 width=8)"
"  Index Cond: (events_event_types.id =
events_events.eventtype_id)"
"  Filter: (events_event_types.severity = 70)"

So postgres seems to handle a query with a LIMIT differently internally. I
tried setting default_statistics_target to 10, 100, and 200, but saw no
significant differences.
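A per-column variant of that experiment might look like this (a sketch; the
column and table names are taken from the query above):

ALTER TABLE events_events ALTER COLUMN eventtype_id SET STATISTICS 200;
ALTER TABLE events_event_types ALTER COLUMN severity SET STATISTICS 200;
ANALYZE events_events;
ANALYZE events_event_types;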

This problem appears on both Postgres 8.3 and 8.4.

Any suggestions?

Thanks in advance!

Best regards,

Kees van Dieren
-- 
Squins | IT, Honestly
Oranjestraat 23
2983 HL Ridderkerk
The Netherlands
Phone: +31 (0)180 414520
Mobile: +31 (0)6 30413841
www.squins.com
Chamber of commerce Rotterdam: 22048547


events_schema.sql
Description: Binary data



[PERFORM] Greenplum MapReduce

2009-08-02 Thread Suvankar Roy
Hi all,

Has anybody worked on Greenplum MapReduce programming ?

I am facing a problem while trying to execute the Greenplum MapReduce
program below, written in YAML.

The error is thrown at line 7:
Error: YAML syntax error - found character that cannot start any token
while scanning for the next token, at line 7

Can somebody explain this and suggest a potential solution?

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in data.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        for term in terms:
          yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
        FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        if state:
          kw_split = state.split(':')
        else:
          kw_split = []
          for i in range(0,nterms):
            kw_split.append('')
        for kw_p in kw_positions.split(','):
          kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
          outstate = outstate + ':' + s
        return outstate
  - FINALIZE:
      NAME: term_finalizer
      LANGUAGE: python
      RETURNS:
        - count integer
      MODE: MULTI
      FUNCTION: |
        if not state:
          return 0
        kw_split = state.split(':')
        previous = None
        for i in range(0,len(kw_split)):
          isplit = kw_split[i].split(',')
          if any(map(lambda(x): x == '', isplit)):
            return 0
          adjusted = set(map(lambda(x): int(x)-i, isplit))
          if (previous):

[PERFORM] select count(idn) is slow (Seq Scan) instead of Bitmap Heap.. why?

2009-08-02 Thread Adam PAPAI

Hello,


I have a problem with an inner join + count().

My query is:

explain analyze select 
k.idn,k.kerdes_subject,k.kerdes_text,u.vezeteknev,u.keresztnev,u.idn as 
user_id, kg.kategoria_neve, count(v.idn)


FROM kategoriak as kg

INNER JOIN kerdesek as k on kg.idn = k.kategoria_id
INNER JOIN users as u ON k.user_id = u.idn
INNER JOIN valaszok as v ON k.idn = v.kerdes_id

where kg.idn=15 group by k.idn, k.kerdes_subject,k.kerdes_text, 
u.idn,u.vezeteknev,u.keresztnev,kg.kategoria_neve


The problem is with the count(v.idn).

This column is related via: v.kerdes_id = k.idn => k.kategoria_id = kg.idn

and the WHERE says: kg.idn = 15.

Why does it run through all the rows in v?

the explain says:

 GroupAggregate  (cost=103238.59..103602.66 rows=10402 width=1382) (actual time=8531.405..8536.633 rows=73 loops=1)
   ->  Sort  (cost=103238.59..103264.59 rows=10402 width=1382) (actual time=8531.339..8533.199 rows=1203 loops=1)
         Sort Key: k.idn, k.kerdes_subject, k.kerdes_text, u.idn, u.vezeteknev, u.keresztnev, kg.kategoria_neve
         ->  Hash Join  (cost=3827.79..89951.54 rows=10402 width=1382) (actual time=1778.590..8523.015 rows=1203 loops=1)
               Hash Cond: (v.kerdes_id = k.idn)
               ->  Seq Scan on valaszok v  (cost=0.00..78215.98 rows=2080998 width=8) (actual time=59.714..5009.171 rows=2080998 loops=1)
               ->  Hash  (cost=3823.42..3823.42 rows=350 width=1378) (actual time=12.553..12.553 rows=74 loops=1)
                     ->  Nested Loop  (cost=14.98..3823.42 rows=350 width=1378) (actual time=0.714..12.253 rows=74 loops=1)
                           ->  Nested Loop  (cost=14.98..1056.38 rows=350 width=830) (actual time=0.498..5.952 rows=117 loops=1)
                                 ->  Seq Scan on kategoriak kg  (cost=0.00..1.30 rows=1 width=278) (actual time=0.066..0.076 rows=1 loops=1)
                                       Filter: (idn = 15)
                                 ->  Bitmap Heap Scan on kerdesek k  (cost=14.98..1051.58 rows=350 width=560) (actual time=0.374..5.430 rows=117 loops=1)
                                       Recheck Cond: (15 = kategoria_id)
                                       ->  Bitmap Index Scan on kategoria_id_id_idx  (cost=0.00..14.89 rows=350 width=0) (actual time=0.212..0.212 rows=117 loops=1)
                                             Index Cond: (15 = kategoria_id)
                           ->  Index Scan using users_pkey on users u  (cost=0.00..7.89 rows=1 width=552) (actual time=0.047..0.048 rows=1 loops=117)
                                 Index Cond: (k.user_id = u.idn)
 Total runtime: 8536.936 ms



So it runs through more than 2 million rows... but why? It should only
count the rows which have kategoria_id = 15...

What am I doing wrong?



--
Adam PAPAI



Re: [PERFORM] Performance 8.4.0

2009-08-02 Thread Robert Haas
On Fri, Jul 31, 2009 at 12:22 AM, Chris Dunn wrote:
> constraint_exclusion = on

This is critical if you need it, but a waste of CPU time if you don't.
 Other than that your parameters look good.  Are you using the default
page cost settings?  I see you have 12 GB RAM; how big is your
database?

...Robert
