Re: [PERFORM] Greenplum MapReduce
Hi Robert,

Thanks much for your valuable inputs. This spaces-and-tabs problem is killing me, in a way; it is pretty cumbersome to say the least.

Regards,
Suvankar Roy

"Robert Mah"
Sent by: Robert Mah
08/02/2009 10:52 PM
To: "'Suvankar Roy'"
cc:
Subject: RE: [PERFORM] Greenplum MapReduce

Suvankar:

Check your file for spaces vs tabs (one of them is bad and yes, it matters).

And as a personal aside, this is yet another reason I hate YAML.

Cheers,
Rob

From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Suvankar Roy
Sent: Thursday, July 30, 2009 8:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Greenplum MapReduce

Hi all,

Has anybody worked on Greenplum MapReduce programming? I am facing a problem while trying to execute the Greenplum MapReduce program below, written in YAML. The error is thrown at the 7th line:

Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7

If somebody can explain this and the potential solution:

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in data.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        for term in terms:
          yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
        FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        if state:
          kw_split = state.split(':')
        else:
          kw_split = []
          for i in range(0,nterms):
            kw_split.append('')
        for kw_p in kw_positions.split(','):
          kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
          outstate = outstate + ':' + s
        return outstate
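To act on Rob's advice mechanically, here is a minimal Python sketch that flags any line containing a tab — the YAML file name is an assumption:

# list every line of the YAML file that contains a tab character
with open("mapreduce.yml") as f:  # hypothetical file name
    for lineno, line in enumerate(f, 1):
        if "\t" in line:
            print("line %d has a tab: %r" % (lineno, line.rstrip()))

Run against the file above, this should point straight at the character the YAML parser is choking on.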
Re: [PERFORM] Greenplum MapReduce
Hi Richard,

I sincerely regret the inconvenience caused.

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:    #** This is the line which is causing the error **#
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in data.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        for term in terms:
          yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
        FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        if state:
          kw_split = state.split(':')
        else:
          kw_split = []
          for i in range(0,nterms):
            kw_split.append('')
        for kw_p in kw_positions.split(','):
          kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
          outstate = outstate + ':' + s
        return outstate
  - FINALIZE:
      NAME: term_finalizer
      LANGUAGE: python
      RETURNS:
        - count integer
      MODE: MULTI
      FUNCTION: |
        if not state:
          return 0
        kw_split = state.split(':')
        previous = None
        for i in range(0,len(kw_split)):
          isplit = kw_split[i].split(',')
          if any(map(lambda(x): x == '', isplit)):
            return 0
          adjusted = set(map(lambda(x): int(x)-i, isplit))
          if (previous):
            previous = adjusted.intersection(previous)
          else:
            previous = adjusted
        if previous:
          return len(previous)
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
> lzo is much, much, (much) faster than zlib.

decompression speed is even more awesome...

> Note, I've tried several times to contact the author to get
> clarification on licensing terms and have been unable to get a
> response.

lzop and the LZO library are distributed under the terms of the GNU General Public License (GPL).

source : http://www.lzop.org/lzop_man.php
Re: [PERFORM] Query help
Server has 32G memory and it's dedicated to running PG; no other application is sharing this database. I have checked checkpoints and they don't occur during those slow query runtimes. checkpoint_segments is set to 128.

Here is a quick snap from vmstat:

# vmstat 5 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr 1m 1m 1m m1   in   sy   cs us sy id
 0 0 0 56466032 25908072 59 94 516 13 13 0 0 10 3 59 1 480 443 500 1 1 98
 0 0 0 51377520 20294328 6 8 0 32 32 0 0 0 4 1 0 368 185 361 0 1 99
 0 0 0 56466032 25908064 59 94 516 13 13 0 0 1 10 3 59 480 443 500 1 1 98
 0 0 0 51376984 20294168 57 427 0 16 16 0 0 0 0 1 0 380 781 396 1 1 98
 0 0 0 51376792 20294208 112 1131 2 50 50 0 0 0 0 5 2 398 2210 541 4 3 92

\d output:

             Table "public.objects"
    Column    |            Type             | Modifiers
--------------+-----------------------------+-----------
 id           | character varying(28)       | not null
 name         | character varying(50)       | not null
 altname      | character varying(50)       |
 type         | character varying(3)        |
 domainid     | character varying(28)       | not null
 status       | smallint                    |
 dbver        | integer                     |
 created      | timestamp without time zone |
 lastmodified | timestamp without time zone |
 assignedto   | character varying(28)       |
 status2      | smallint                    |
 key1         | character varying(25)       |
 key2         | character varying(25)       |
 key3         | character varying(64)       |
 oui          | character varying(6)        |
 prodclass    | character varying(64)       |
 user1        | character varying(50)       |
 user2        | character varying(50)       |
 data0        | character varying(2000)     |
 data1        | character varying(2000)     |
 longdata     | character varying(1)        |
Indexes:
    "ct_objects_id_u1" PRIMARY KEY, btree (id), tablespace "nbbs_index_data"
    "ix_objects_altname" btree (altname), tablespace "nbbs_index_data"
    "ix_objects_domainid_name" btree (domainid, upper(name::text)), tablespace "nbbs_index_data"
    "ix_objects_key3" btree (upper(key3::text)), tablespace "nbbs_index_data"
    "ix_objects_name" btree (upper(name::text) varchar_pattern_ops), tablespace "nbbs_index_data"
    "ix_objects_type_lastmodified" btree ("type", lastmodified), tablespace "nbbs_index_data"
    "ix_objects_user1" btree (upper(user1::text)), tablespace "nbbs_index_data"
    "ix_objects_user2" btree (upper(user2::text)), tablespace "nbbs_index_data"

work_mem=64MB and random_page_cost=2 on the session gave a similar execution plan, except the costs differ due to the changed random_page_cost:

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..5456.11 rows=501 width=912)
   ->  Index Scan Backward using ix_objects_type_lastmodified on objects  (cost=0.00..253083.03 rows=23239 width=912)
         Index Cond: (("type")::text = 'cpe'::text)
         Filter: ((domainid)::text = ANY (('{HY3XGEzC0E9JxRwoXLOLbjNsghEA,3330}'::character varying[])::text[]))
(4 rows)

Given the nature of the ix_objects_type_lastmodified index, I am wondering if the index requires a rebuild. I tested rebuilding it in another db, and it came to 2500 pages as opposed to 38640 pages. The puzzle is why the same query with the same filters runs fast most of the time, but at times runs 5+ minutes and then switches back to fast mode. If it used a different execution plan than the above, how do I list all execution plans executed for a given SQL?

Thanks,
Stalin

-----Original Message-----
From: Kevin Grittner [mailto:kevin.gritt...@wicourts.gov]
Sent: Monday, August 03, 2009 1:45 PM
To: Subbiah Stalin-XCGF84; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Query help

"Subbiah Stalin-XCGF84" wrote:
> Shared buffer=8G, effective cache size=4G.

That is odd; if your shared buffers are at 8G, you must have more than 4G of cache. How much RAM is used for cache at the OS level? Normally you would add that to the shared buffers to get your effective cache size, or at least take the larger of the two. How much RAM is on this machine in total? Do you have any other processes which use a lot of RAM or might access a lot of disk from time to time?

> Let me know if you need any other information.

The \d output for the object table, or the CREATE for it and its indexes, would be good.

Since it's getting through the random reads by the current plan at the rate of about one every 5ms, I'd say your drive array is OK. If you want to make this query faster you've either got to have the data in cache or it has to have reason to believe that a different plan is faster. One thing which might help is to boost your work_mem setting to somewhere in the 32MB to 64MB range, provided that won't drive you into swapping.
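On the rebuild question above, here is a minimal sketch of rebuilding the suspect index and re-reading its size from pg_class — the connection string is hypothetical, and note that REINDEX blocks writes to the table while it runs:

# hedged sketch: rebuild the possibly-bloated index, then check relpages
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("REINDEX INDEX ix_objects_type_lastmodified")  # blocks writes
cur.execute("SELECT relpages FROM pg_class "
            "WHERE relname = 'ix_objects_type_lastmodified'")
print("pages after rebuild: %d" % cur.fetchone()[0])
conn.commit()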
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
On Mon, Aug 3, 2009 at 2:56 PM, Tom Lane wrote:
> I don't see anything very contradictory here.  What you're demonstrating
> is that it's nice to be able to throw a third CPU at the compression
> part of the problem.  That's likely to remain true if we shift to a
> different compression algorithm.  I suspect if you substituted lzo for
> gzip in the third case, the picture wouldn't change very much.

lzo is much, much, (much) faster than zlib.  Note, I've tried several times to contact the author to get clarification on licensing terms and have been unable to get a response.

[r...@devdb merlin]# time lzop -c dump.sql > /dev/null

real    0m16.683s
user    0m15.573s
sys     0m0.939s

[r...@devdb merlin]# time gzip -c dump.sql > /dev/null

real    3m43.090s
user    3m41.471s
sys     0m1.036s

merlin
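For anyone wanting to reproduce the comparison in-process rather than through the lzop/gzip binaries, a rough sketch — the third-party python-lzo package and the dump file name are assumptions, and the levels differ from the CLI defaults, so expect different absolute numbers:

# rough timing sketch: zlib vs LZO on the same input
import time, zlib
import lzo  # assumption: the third-party python-lzo package is installed

data = open("dump.sql", "rb").read()  # file name taken from the post above

t0 = time.time()
z = zlib.compress(data, 6)  # roughly the gzip default level
print("zlib-6: %.1fs, %d -> %d bytes" % (time.time() - t0, len(data), len(z)))

t0 = time.time()
l = lzo.compress(data)
print("lzo:    %.1fs, %d -> %d bytes" % (time.time() - t0, len(data), len(l)))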
Re: [PERFORM] Query help
"Subbiah Stalin-XCGF84" wrote: > Shared buffer=8G, effective cache size=4G. That is odd; if your shared buffers are at 8G, you must have more than 4G of cache. How much RAM is used for cache at the OS level? Normally you would add that to the shared buffers to get your effective cache size, or at least take the larger of the two. How much RAM is on this machine in total? Do you have any other processes which use a lot of RAM or might access a lot of disk from time to time? > Let me know if you need any other information. The \d output for the object table, or the CREATE for it and its indexes, would be good. Since it's getting through the random reads by the current plan at the rate of about one every 5ms, I'd say your drive array is OK. If you want to make this query faster you've either got to have the data in cache or it has to have reason to believe that a different plan is faster. One thing which might help is to boost your work_mem setting to somewhere in the 32MB to 64MB range, provided that won't drive you into swapping. You could also try dropping the random_page_cost to maybe 2 to see if that gets you a different plan. You can do a quick check on what plans these generate by changing them on a given connection and then requesting just an EXPLAIN of the plan, to see if it's different. (This doesn't actually run the query, so it's fast.) -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query help
Sure, I can provide those details. I have seen this query running 5+ minutes for different values of domainID too. It's just that it happens at random and gets fixed within a few minutes.

Shared buffer=8G, effective cache size=4G. Optimizer/autovacuum settings are defaults.

           relname            | relpages | reltuples
------------------------------+----------+-----------
 ct_objects_id_u1             |    11906 |    671919
 ix_objects_altname           |    13327 |    671919
 ix_objects_domainid_name     |    24714 |    671919
 ix_objects_key3              |     9891 |    671919
 ix_objects_name              |    11807 |    671919
 ix_objects_type_lastmodified |    38640 |    671919
 ix_objects_user1             |    20796 |    671919
 ix_objects_user2             |    20842 |    671919
 objects                      |   111873 |    671919

This database resides on a RAID 1+0 storage with 10 disks (5+5). Let me know if you need any other information.

Thanks Kevin.

Stalin

-----Original Message-----
From: Kevin Grittner [mailto:kevin.gritt...@wicourts.gov]
Sent: Monday, August 03, 2009 12:48 PM
To: Subbiah Stalin-XCGF84; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query help

"Subbiah Stalin-XCGF84" wrote:
> Not sure what's wrong in below execution plan but at times the query
> runs for 5 minutes to complete and after a while it runs within a
> second or two.

The plan doesn't look entirely unreasonable for the given query, although it's hard to be sure of that without seeing the table definitions. Given the plan, the times look to be about what I'd expect for uncached and cached timings. (That is, on subsequent runs, the data is sitting in RAM, so you don't need to access the hard drives.)

If the initial run time is unacceptable for your environment, and there's no way to have the cache "primed" when it matters, please give more details on your table layouts, and perhaps someone can make a useful suggestion.

> Pg 8.2.7, Sol10.

One quick suggestion -- upgrade your PostgreSQL version if at all possible. The latest bug-fix version of 8.2 is currently 8.2.13, and there are significant performance improvements in 8.3 and the newly released 8.4.

-Kevin
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
> I get very different (contradictory) behavior.  Server with fast RAID,
> 32GB RAM, 2 x 4 core 3.16Ghz Xeon 54xx CPUs.  CentOS 5.2
> 8.3.6

That's a very different setup from my (much less powerful) box, so that would explain it...

> No disk wait time during any test. One test beforehand was used to
> prime the disk cache.
> 100% CPU in the below means one core fully used. 800% means the system
> is fully loaded.
>
> pg_dump > file (on a subset of the DB with lots of tables with small
> tuples)
> 6m 27s, 4.9GB; 12.9MB/sec
> 50% CPU in postgres, 50% CPU in pg_dump

If there is no disk wait time, then why do you get 50/50 and not 100/100, or at least one core maxed out? That's interesting...

COPY annonces TO '/dev/null';
COPY 413526
Temps : 13871,093 ms

\copy annonces to '/dev/null'
Temps : 14037,946 ms

time pg_dump -Fc -t annonces -U annonces --compress=0 annonces > /dev/null

real    0m14.596s
user    0m0.700s
sys     0m0.372s

In all 3 cases postgres maxes out one core (I've repeated the test until all data was cached, so there is no disk access at all in vmstat). Size of dump is 312MB.
Re: [PERFORM] Query help
"Subbiah Stalin-XCGF84" wrote: > Not sure what's wrong in below execution plan but at times the query > runs for 5 minutes to complete and after a while it runs within a > second or two. The plan doesn't look entirely unreasonable for the given query, although it's hard to be sure of that without seeing the table definitions. Given the plan, the times look to be about what I'd expect for uncached and cached timings. (That is, on subsequent runs, the data is sitting in RAM, so you don't need to access the hard drives.) If the initial run time is unacceptable for your environment, and there's no way to have the cached "primed" when it matters, please give more details on your table layouts, and perhaps someone can make a useful suggestion. > Pg 8.2.7, Sol10. One quick suggestion -- upgrade your PostgreSQL version if at all possible. The latest bug-fix version of 8.2 is currently 8.2.13, and there are significant performance improvements in 8.3 and the newly released 8.4. -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
On 8/3/09 11:56 AM, "Tom Lane" wrote:

> Scott Carey writes:
>> I get very different (contradictory) behavior.  Server with fast RAID,
>> 32GB RAM, 2 x 4 core 3.16Ghz Xeon 54xx CPUs.  CentOS 5.2
>> 8.3.6
>> No disk wait time during any test.  One test beforehand was used to
>> prime the disk cache.
>> 100% CPU in the below means one core fully used.  800% means the
>> system is fully loaded.
>
>> pg_dump > file (on a subset of the DB with lots of tables with small
>> tuples)
>> 6m 27s, 4.9GB; 12.9MB/sec
>> 50% CPU in postgres, 50% CPU in pg_dump
>
>> pg_dump -Fc > file.gz
>> 9m6s, output is 768M (6.53x compression); 9.18MB/sec
>> 30% CPU in postgres, 70% CPU in pg_dump
>
>> pg_dump | gzip > file.2.gz
>> 6m22s, 13MB/sec.
>> 50% CPU in postgres, 50% CPU in pg_dump, 50% CPU in gzip
>
> I don't see anything very contradictory here.

The other poster got nearly 2 CPUs of work from just pg_dump + postgres. That contradicts my results (but could be due to data differences or postgres version differences). In the other use case, compression was not slower, it just used more CPU (also contradicting my results).

> What you're demonstrating is that it's nice to be able to throw a third
> CPU at the compression part of the problem.

No, 1.5 CPUs. Full use of a second one would even be great.

I'm also demonstrating that there is some artificial bottleneck somewhere preventing postgres and pg_dump from operating concurrently. Instead, one waits while the other does work. Your claim earlier in this thread was that there was already pipelined work being done due to pg_dump + postgresql -- which seems to be true for the other test case but not mine.

As a consequence, adding compression throttles the postgres process even though the compression hasn't caused 100% CPU (or close) on any task involved.

> That's likely to remain true if we shift to a different compression
> algorithm.  I suspect if you substituted lzo for gzip in the third
> case, the picture wouldn't change very much.

That is exactly the point. LZO would be nice (and would help mitigate this problem), but it doesn't solve the real problem here. pg_dump is slow and artificially throttled without even getting 100% CPU from itself or postgres.

The problem still remains: dumping with -Fc can be significantly slower than raw piped to a compression utility, even if no task is CPU- or I/O-bound. Dumping and piping to gzip is faster. But parallel restore won't work without custom or raw format.

> regards, tom lane
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
Scott Carey writes:
> I get very different (contradictory) behavior.  Server with fast RAID,
> 32GB RAM, 2 x 4 core 3.16Ghz Xeon 54xx CPUs.  CentOS 5.2
> 8.3.6
> No disk wait time during any test.  One test beforehand was used to
> prime the disk cache.
> 100% CPU in the below means one core fully used.  800% means the system
> is fully loaded.

> pg_dump > file (on a subset of the DB with lots of tables with small
> tuples)
> 6m 27s, 4.9GB; 12.9MB/sec
> 50% CPU in postgres, 50% CPU in pg_dump

> pg_dump -Fc > file.gz
> 9m6s, output is 768M (6.53x compression); 9.18MB/sec
> 30% CPU in postgres, 70% CPU in pg_dump

> pg_dump | gzip > file.2.gz
> 6m22s, 13MB/sec.
> 50% CPU in postgres, 50% CPU in pg_dump, 50% CPU in gzip

I don't see anything very contradictory here.  What you're demonstrating is that it's nice to be able to throw a third CPU at the compression part of the problem.  That's likely to remain true if we shift to a different compression algorithm.  I suspect if you substituted lzo for gzip in the third case, the picture wouldn't change very much.

			regards, tom lane
[PERFORM] Query help
All,

Not sure what's wrong in the below execution plan, but at times the query runs for 5 minutes to complete, and after a while it runs within a second or two. Here is the explain analyze output of the query.

SELECT OBJECTS.ID, OBJECTS.NAME, OBJECTS.TYPE, OBJECTS.STATUS,
OBJECTS.ALTNAME, OBJECTS.DOMAINID, OBJECTS.ASSIGNEDTO, OBJECTS.USER1,
OBJECTS.USER2, OBJECTS.KEY1, OBJECTS.KEY2, OBJECTS.KEY3, OBJECTS.OUI,
OBJECTS.PRODCLASS, OBJECTS.STATUS2, OBJECTS.LASTMODIFIED,
OBJECTS.LONGDATA, OBJECTS.DATA0, OBJECTS.DATA1
FROM OBJECTS
WHERE OBJECTS.DOMAINID IN ('HY3XGEzC0E9JxRwoXLOLbjNsghEA','3330')
AND OBJECTS.TYPE IN ('cpe')
ORDER BY OBJECTS.LASTMODIFIED DESC
LIMIT 501

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..9235.11 rows=501 width=912) (actual time=0.396..2741.803 rows=501 loops=1)
   ->  Index Scan Backward using ix_objects_type_lastmodified on objects  (cost=0.00..428372.71 rows=23239 width=912) (actual time=0.394..2741.608 rows=501 loops=1)
         Index Cond: (("type")::text = 'cpe'::text)
         Filter: ((domainid)::text = ANY (('{HY3XGEzC0E9JxRwoXLOLbjNsghEA,3330}'::character varying[])::text[]))
 Total runtime: 2742.126 ms

The table is auto-vacuumed regularly. I have enabled log_min_messages at debug2, but nothing stands out during the times when the query took 5+ minutes. Is a rebuild of the index necessary here?

Thanks in Advance,
Stalin

Pg 8.2.7, Sol10.
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
On 7/31/09 4:01 PM, "PFC" wrote:

> On Fri, 31 Jul 2009 19:04:52 +0200, Tom Lane wrote:
>
>> Greg Stark writes:
>>> On Thu, Jul 30, 2009 at 11:30 PM, Tom Lane wrote:
>>>> I did some tracing and verified that pg_dump passes data to deflate()
>>>> one table row at a time.  I'm not sure about the performance
>>>> implications of that, but it does seem like it might be something to
>>>> look into.
>>
>>> I suspect if this was a problem the zlib people would have added
>>> internal buffering ages ago. I find it hard to believe we're not the
>>> first application to use it this way.
>>
>> I dug into this a bit more.  zlib *does* have internal buffering --- it
>> has to, because it needs a minimum lookahead of several hundred bytes
>> to ensure that compression works properly.  The per-call overhead of
>> deflate() looks a bit higher than one could wish when submitting short
>> chunks, but oprofile shows that "pg_dump -Fc" breaks down about like
>> this:
>
> During dump (size of dump is 2.6 GB),
>
> No Compression :
> - postgres at 70-100% CPU and pg_dump at something like 10-20%
> - dual core is useful (a bit...)
> - dump size 2.6G
> - dump time 2m25.288s
>
> Compression Level 1 :
> - postgres at 70-100% CPU and pg_dump at 20%-100%
> - dual core is definitely useful
> - dump size 544MB
> - dump time 2m33.337s
>
> Since this box is mostly idle right now, eating CPU for compression is
> no problem...

I get very different (contradictory) behavior.  Server with fast RAID, 32GB RAM, 2 x 4 core 3.16Ghz Xeon 54xx CPUs.  CentOS 5.2
8.3.6

No disk wait time during any test.  One test beforehand was used to prime the disk cache.  100% CPU in the below means one core fully used.  800% means the system is fully loaded.

pg_dump > file (on a subset of the DB with lots of tables with small tuples)
6m 27s, 4.9GB; 12.9MB/sec
50% CPU in postgres, 50% CPU in pg_dump

pg_dump -Fc > file.gz
9m6s, output is 768M (6.53x compression); 9.18MB/sec
30% CPU in postgres, 70% CPU in pg_dump

pg_dump | gzip > file.2.gz
6m22s, 13MB/sec.
50% CPU in postgres, 50% CPU in pg_dump, 50% CPU in gzip

The default (5) compression level was used.

So, when using pg_dump alone, I could not get significantly more than one core of CPU (all on the same box). No matter how I tried, pg_dump plus the postgres process dumping data always totaled about 102% -- it would fluctuate in top, give or take 15% at times, but the two always were very close (within 3%) of this total.

Piping the whole thing to gzip gets some speedup. This indicates that perhaps the implementation or use of gzip is inappropriate on pg_dump's side, or the library version is older or slower. Alternatively, the use of gzip inside pg_dump fails to pipeline CPU usage as well as piping it does, as the above shows 50% more CPU utilization when piping.

I can do the same test with a single table that is 10GB later (which does dump much faster than 13MB/sec and has rows that average about 500 bytes in size). But overall I have found pg_dump's performance sorely lacking, and this is a data risk in the big picture. Postgres is very good about not losing data, but that only goes up to the limits of the hardware and OS, which is not good enough. Because of long disaster recovery times and poor replication/contingency features, it is a fairly unsafe place for data once it gets beyond a certain size and a BC plan requires minimal downtime.

> Adding an option to use LZO instead of gzip could be useful...
>
> Compressing the uncompressed 2.6GB dump :
>
> - gzip -1 :
>   - compressed size : 565 MB
>   - compression throughput : 28.5 MB/s
>   - decompression throughput : 74 MB/s
>
> - LZO -1 :
>   - compressed size : 696M
>   - compression throughput : 86 MB/s
>   - decompression throughput : 247 MB/s
>
> Conclusion : LZO could help for fast disks (RAID) or slow disks on a
> CPU-starved server...

LZO would be a great option, it is very fast, especially decompression. With gzip, one rarely gains by going below gzip -3 or above gzip -6.
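Given that observation, one workaround is to keep the compressor in its own process so its CPU time overlaps with pg_dump's. A sketch, assuming lzop is on the PATH (note a plain-format dump like this gives up parallel restore):

# hedged sketch: overlap dump and compression in separate processes
import subprocess

dump = subprocess.Popen(["pg_dump", "mydb"], stdout=subprocess.PIPE)
with open("mydb.sql.lzo", "wb") as out:
    comp = subprocess.Popen(["lzop", "-c"], stdin=dump.stdout, stdout=out)
    dump.stdout.close()  # let lzop see EOF when pg_dump exits
    comp.wait()
dump.wait()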
Re: [PERFORM] Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version
The few 'obvious' things I see :

- ID and POLLID aren't of the same type (numeric vs bigint)
- TTIME isn't indexed.

And as a general matter, you should stick to native datatypes if you don't need numeric. But as said in the other answer, maybe you should redo this schema and use more consistent datatypes.

Anyway, from what I remember, it's not advised to set up shared buffers that high for windows (I don't do so much windows myself, so maybe someone will be better informed). Anyway, you can start by correcting the schema; a sketch follows below the quoted message.

On Friday 31 July 2009 07:45:55 pari krishnan wrote:
> Dear All,
>
> We are using Postgres 8.3.7 in our java application. We are doing
> performance tuning and load testing in our setup. We have noticed that
> some of our queries to the database are taking a long time to return
> results. Please find our setup details below.
>
> We observed that postgres running on windows is slower than on linux.
>
> Machine & Database Details :
>
> Windows configuration:
> 4 GB RAM
> 4 * 1.6 GHZ
> windows 2008 server standard edition
>
> Postgresql configuration:
>
> shared_buffers: 1 GB
> effective_cache_size: 2GB
> fsync: off (even when we tested with this parameter on, we observed the
> same slowness)
>
> Database Details :
>
> Postgres Database : PostgreSQL 8.3.7.1
> Driver Version : PostgreSQL 8.3 JDBC4 with SSL (build 604)
> We are using 40 database connections.
>
> We have a few tables which will be holding larger amounts of data.
> While running our application, a STATSDATA table is created daily with
> the date in the table name, like STATSDATA8_21_2009.
>
> Schema for STATSDATA table:
>
> create table STATSDATA8_21_2009(
> POLLID Numeric(19),
> INSTANCE varchar(100),
> TTIME Numeric(19),
> VAL Numeric(13));
> CREATE INDEX POLLID%_ndx on STATSDATA%(POLLID)
>
> Schema for PolledData:
>
> create table PolledData(
> "NAME" varchar(50) NOT NULL,
> "ID" BIGINT NOT NULL,
> "AGENT" varchar(50) NOT NULL,
> "COMMUNITY" varchar(100) NOT NULL,
> "PERIOD" INTEGER NOT NULL,
> "ACTIVE" varchar(10),
> "OID" varchar(200) NOT NULL,
> "LOGDIRECTLY" varchar(10),
> "LOGFILE" varchar(100),
> "SSAVE" varchar(10),
> "THRESHOLD" varchar(10),
> "ISMULTIPLEPOLLEDDATA" varchar(10),
> "PREVIOUSSEVERITY" INTEGER,
> "NUMERICTYPE" INTEGER,
> "SAVEABSOLUTES" varchar(10),
> "TIMEAVG" varchar(10),
> "PORT" INTEGER,
> "WEBNMS" varchar(100),
> "GROUPNAME" varchar(100),
> "LASTCOUNTERVALUE" BIGINT,
> "LASTTIMEVALUE" BIGINT,
> "TIMEVAL" BIGINT NOT NULL,
> "POLICYNAME" varchar(100),
> "THRESHOLDLIST" varchar(200),
> "DNSNAME" varchar(100),
> "SUFFIX" varchar(20),
> "STATSDATATABLENAME" varchar(100),
> "POLLERNAME" varchar(200),
> "FAILURECOUNT" INTEGER,
> "FAILURETHRESHOLD" INTEGER,
> "PARENTOBJ" varchar(100),
> "PROTOCOL" varchar(50),
> "SAVEPOLLCOUNT" INTEGER,
> "CURRENTSAVECOUNT" INTEGER,
> "SAVEONTHRESHOLD" varchar(10),
> "SNMPVERSION" varchar(10),
> "USERNAME" varchar(30),
> "CONTEXTNAME" varchar(30),
> PRIMARY KEY ("ID","NAME","AGENT","OID"),
> index PolledData0_ndx ("NAME"),
> index PolledData1_ndx ("AGENT"),
> index PolledData2_ndx ("OID"),
> index PolledData3_ndx ("ID"),
> index PolledData4_ndx ("PARENTOBJ"),
> )
>
> We have 300k rows in the PolledData table. Each STATSDATA table has
> almost 12 to 13 million rows. At every one-minute interval, we insert
> data into the STATSDATA table. In our application, we issue insert and
> select queries against the STATSDATA table at regular intervals. Please
> let us know why the below query takes more time to return the results.
> Is there anything we need to do to tune the postgres database?
>
> Please find the explain analyze output.
>
> First Query :
>
> postgres=# explain analyze SELECT COUNT(*) FROM ( SELECT ID,
> PolledData.AGENT, NAME, INSTANCE, TTIME, VAL FROM PolledData,
> STATSDATA8_21_2009 WHERE ( ( PolledData.ID=STATSDATA8_21_2009.POLLID)
> AND ( ( TTIME >= 1250838027454) AND ( TTIME <= 1250838079654) ) ) ) t1;
>
>                               QUERY PLAN
> ------------------------------------------------------------------------
>  Aggregate  (cost=773897.12..773897.13 rows=1 width=0) (actual
>  time=17818.410..17818.412 rows=1 loops=1)
>    ->  Merge Join  (cost=717526.23..767505.06 rows=2556821 width=0)
>        (actual time=17560.469..17801.790 rows=13721 loops=1)
>          Merge Cond: (statsdata8_21_2009.pollid =
>          ((polleddata.id)::numeric))
>          ->  Sort  (cost=69708.44..69742.49 rows=13619 width=8) (actual
>              time=2392.659..2416.093 rows=13721 loops=1)
>                Sort Key: statsdata8_21_2009.pollid
>                Sort Method: quicksort  Memory: 792kB
>                ->  Seq Scan on statsdata8_21_2009
>                    (cost=0.00..68773.27 rows=13619 width=8) (actual
>                    time=0.077..2333.132 rows=13721 loops=1)
>                      Filter: ((ttime >= 1250838027454::numeric) AND
>                      (ttime <= 1250838079654::numeric))
>          ->  Materialize  (cost=64781
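Here is a sketch of acting on the two concrete points at the top of this reply — matching POLLID's type to PolledData.ID and indexing TTIME. Table and column names come from the post; the connection string is hypothetical, and the bigint conversion assumes all stored values fit, so try it on a copy first:

# hedged sketch: align join-key types and index the range column
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("ALTER TABLE statsdata8_21_2009 ALTER COLUMN pollid TYPE bigint")
cur.execute("ALTER TABLE statsdata8_21_2009 ALTER COLUMN ttime TYPE bigint")
cur.execute("CREATE INDEX statsdata8_21_2009_ttime_ndx "
            "ON statsdata8_21_2009 (ttime)")
conn.commit()

With matching bigint types on both sides, the pollid = id merge condition no longer needs the ::numeric cast visible in the plan above.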
Re: [PERFORM] Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version
How about normalizing the schema, for a start? By the looks of it, you have a huge table with plenty of varchars; that smells like bad design of the db.
Re: [PERFORM] Greenplum MapReduce
Suvankar Roy wrote:
> Hi Richard,
> I sincerely regret the inconvenience caused.

No big inconvenience, but the lists can be very busy sometimes, and the easier you make it for people to answer your questions, the better the answers you will get.

> %YAML 1.1
> ---
> VERSION: 1.0.0.1
> DATABASE: test_db1
> USER: gpadmin
> DEFINE:
> - INPUT:    #** This is the line which is causing the error **#
>     NAME: doc
>     TABLE: documents

If it looks fine, always check for tabs. Oh, and you could have cut out all the rest of the file, really.

> I have learnt that unnecessary TABs can be the cause of this, so I am
> trying to overcome that; hopefully the problem will subside then.

I'm always getting this. It's easy to accidentally introduce a tab character when reformatting YAML. It might be worth checking if your text editor has an option to always replace tabs with spaces.

--
Richard Huxton
Archonet Ltd
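If the editor in question can't do the substitution, a one-off untabify along the lines Richard suggests — the file name and the 8-column tab stops are assumptions, and the YAML indentation should be eyeballed after conversion:

# hedged sketch: write a tab-free copy of the YAML file
src = open("mapreduce.yml").read()  # hypothetical file name
open("mapreduce_notabs.yml", "w").write(src.expandtabs(8))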
Re: [PERFORM] Greenplum MapReduce
Suvankar Roy wrote:
> Hi all,
> Has anybody worked on Greenplum MapReduce programming? I am facing a
> problem while trying to execute the below Greenplum MapReduce program
> written in YAML (in blue).

The other poster suggested contacting Greenplum and I can only agree.

> The error is thrown in the 7th line as:
> Error: YAML syntax error - found character that cannot start any token
> while scanning for the next token, at line 7 (in red)

There is no red, particularly if viewing messages as plain text (which most people do on mailing lists). Consider indicating a line some other way next time (commonly, below the line you put something like "this is line 7 ^").

The most common problem I get with YAML files, though, is when a tab is accidentally inserted instead of spaces at the start of a line.

--
Richard Huxton
Archonet Ltd