Re: [PERFORM] Greenplum MapReduce
Hi Robert,

Thanks much for your valuable inputs. This spaces-and-tabs problem is killing me, in a way; it is pretty cumbersome to say the least.

Regards,
Suvankar Roy

"Robert Mah"
Sent by: Robert Mah
08/02/2009 10:52 PM
To: "'Suvankar Roy'"
cc:
Subject: RE: [PERFORM] Greenplum MapReduce

Suvankar:

Check your file for spaces vs tabs (one of them is bad and yes, it matters).

And as a personal aside, this is yet another reason I hate YAML.

Cheers,
Rob

From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Suvankar Roy
Sent: Thursday, July 30, 2009 8:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Greenplum MapReduce

Hi all,

Has anybody worked on Greenplum MapReduce programming? I am facing a problem while trying to execute the Greenplum MapReduce program below, written in YAML. The error is thrown at the 7th line:

Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7

If somebody can explain this and the potential solution:

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in data.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        for term in terms:
          yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
        FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        if state:
          kw_split = state.split(':')
        else:
          kw_split = []
          for i in range(0,nterms):
            kw_split.append('')
        for kw_p in kw_positions.split(','):
          kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
          outstate = outstate + ':' + s
        return outstate
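To act on Rob's advice mechanically, here is a minimal Python sketch that flags any line containing a tab — the YAML file name is an assumption:

# list every line of the YAML file that contains a tab character
with open("mapreduce.yml") as f:  # hypothetical file name
    for lineno, line in enumerate(f, 1):
        if "\t" in line:
            print("line %d has a tab: %r" % (lineno, line.rstrip()))

Run against the file above, this should point straight at the character the YAML parser is choking on.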
Re: [PERFORM] Greenplum MapReduce
Hi Richard,

I sincerely regret the inconvenience caused.

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:    #** This is the line which is causing the error **#
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in data.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        for term in terms:
          yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
        FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        if state:
          kw_split = state.split(':')
        else:
          kw_split = []
          for i in range(0,nterms):
            kw_split.append('')
        for kw_p in kw_positions.split(','):
          kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
          outstate = outstate + ':' + s
        return outstate
  - FINALIZE:
      NAME: term_finalizer
      LANGUAGE: python
      RETURNS:
        - count integer
      MODE: MULTI
      FUNCTION: |
        if not state:
          return 0
        kw_split = state.split(':')
        previous = None
        for i in range(0,len(kw_split)):
          isplit = kw_split[i].split(',')
          if any(map(lambda(x): x == '', isplit)):
            return 0
          adjusted = set(map(lambda(x): int(x)-i, isplit))
          if (previous):
            previous = adjusted.intersection(previous)
          else:
            previous = adjusted
        if previous:
          return len(previous)
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
> lzo is much, much, (much) faster than zlib.

decompression speed is even more awesome...

> Note, I've tried several times to contact the author to get
> clarification on licensing terms and have been unable to get a
> response.

lzop and the LZO library are distributed under the terms of the GNU General Public License (GPL).

source : http://www.lzop.org/lzop_man.php
Re: [PERFORM] Query help
Server has 32G memory and it's dedicated to running PG; no other application is sharing this database. I have checked checkpoints and they don't occur during those slow query runtimes. checkpoint_segments is set to 128.

Here is a quick snap from vmstat:

# vmstat 5 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr 1m 1m 1m m1   in   sy   cs us sy id
 0 0 0 56466032 25908072 59 94 516 13 13 0 0 10 3 59 1 480 443 500 1 1 98
 0 0 0 51377520 20294328 6 8 0 32 32 0 0 0 4 1 0 368 185 361 0 1 99
 0 0 0 56466032 25908064 59 94 516 13 13 0 0 1 10 3 59 480 443 500 1 1 98
 0 0 0 51376984 20294168 57 427 0 16 16 0 0 0 0 1 0 380 781 396 1 1 98
 0 0 0 51376792 20294208 112 1131 2 50 50 0 0 0 0 5 2 398 2210 541 4 3 92

\d output:

             Table "public.objects"
    Column    |            Type             | Modifiers
--------------+-----------------------------+-----------
 id           | character varying(28)       | not null
 name         | character varying(50)       | not null
 altname      | character varying(50)       |
 type         | character varying(3)        |
 domainid     | character varying(28)       | not null
 status       | smallint                    |
 dbver        | integer                     |
 created      | timestamp without time zone |
 lastmodified | timestamp without time zone |
 assignedto   | character varying(28)       |
 status2      | smallint                    |
 key1         | character varying(25)       |
 key2         | character varying(25)       |
 key3         | character varying(64)       |
 oui          | character varying(6)        |
 prodclass    | character varying(64)       |
 user1        | character varying(50)       |
 user2        | character varying(50)       |
 data0        | character varying(2000)     |
 data1        | character varying(2000)     |
 longdata     | character varying(1)        |
Indexes:
    "ct_objects_id_u1" PRIMARY KEY, btree (id), tablespace "nbbs_index_data"
    "ix_objects_altname" btree (altname), tablespace "nbbs_index_data"
    "ix_objects_domainid_name" btree (domainid, upper(name::text)), tablespace "nbbs_index_data"
    "ix_objects_key3" btree (upper(key3::text)), tablespace "nbbs_index_data"
    "ix_objects_name" btree (upper(name::text) varchar_pattern_ops), tablespace "nbbs_index_data"
    "ix_objects_type_lastmodified" btree ("type", lastmodified), tablespace "nbbs_index_data"
    "ix_objects_user1" btree (upper(user1::text)), tablespace "nbbs_index_data"
    "ix_objects_user2" btree (upper(user2::text)), tablespace "nbbs_index_data"

work_mem=64MB and random_page_cost=2 on the session gave a similar execution plan, except the costs differ due to the changed random_page_cost:

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..5456.11 rows=501 width=912)
   ->  Index Scan Backward using ix_objects_type_lastmodified on objects  (cost=0.00..253083.03 rows=23239 width=912)
         Index Cond: (("type")::text = 'cpe'::text)
         Filter: ((domainid)::text = ANY (('{HY3XGEzC0E9JxRwoXLOLbjNsghEA,3330}'::character varying[])::text[]))
(4 rows)

Given the nature of the ix_objects_type_lastmodified index, I am wondering if the index requires a rebuild. I tested rebuilding it in another db, and it came to 2500 pages as opposed to 38640 pages. The puzzle is why the same query with the same filters runs fast most of the time, but at times runs 5+ minutes and then switches back to fast mode. If it used a different execution plan than the above, how do I list all execution plans executed for a given SQL?

Thanks,
Stalin

-----Original Message-----
From: Kevin Grittner [mailto:kevin.gritt...@wicourts.gov]
Sent: Monday, August 03, 2009 1:45 PM
To: Subbiah Stalin-XCGF84; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Query help

"Subbiah Stalin-XCGF84" wrote:
> Shared buffer=8G, effective cache size=4G.

That is odd; if your shared buffers are at 8G, you must have more than 4G of cache. How much RAM is used for cache at the OS level? Normally you would add that to the shared buffers to get your effective cache size, or at least take the larger of the two. How much RAM is on this machine in total? Do you have any other processes which use a lot of RAM or might access a lot of disk from time to time?

> Let me know if you need any other information.

The \d output for the object table, or the CREATE for it and its indexes, would be good.

Since it's getting through the random reads by the current plan at the rate of about one every 5ms, I'd say your drive array is OK. If you want to make this query faster you've either got to have the data in cache or it has to have reason to believe that a different plan is faster. One thing which might help is to boost your work_mem setting to somewhere in the 32MB to 64MB range, provided that won't drive you into swapping.
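On the rebuild question above, here is a minimal sketch of rebuilding the suspect index and re-reading its size from pg_class — the connection string is hypothetical, and note that REINDEX blocks writes to the table while it runs:

# hedged sketch: rebuild the possibly-bloated index, then check relpages
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("REINDEX INDEX ix_objects_type_lastmodified")  # blocks writes
cur.execute("SELECT relpages FROM pg_class "
            "WHERE relname = 'ix_objects_type_lastmodified'")
print("pages after rebuild: %d" % cur.fetchone()[0])
conn.commit()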
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
On Mon, Aug 3, 2009 at 2:56 PM, Tom Lane wrote:
> I don't see anything very contradictory here.  What you're demonstrating
> is that it's nice to be able to throw a third CPU at the compression
> part of the problem.  That's likely to remain true if we shift to a
> different compression algorithm.  I suspect if you substituted lzo for
> gzip in the third case, the picture wouldn't change very much.

lzo is much, much, (much) faster than zlib.  Note, I've tried several times to contact the author to get clarification on licensing terms and have been unable to get a response.

[r...@devdb merlin]# time lzop -c dump.sql > /dev/null

real    0m16.683s
user    0m15.573s
sys     0m0.939s

[r...@devdb merlin]# time gzip -c dump.sql > /dev/null

real    3m43.090s
user    3m41.471s
sys     0m1.036s

merlin
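For anyone wanting to reproduce the comparison in-process rather than through the lzop/gzip binaries, a rough sketch — the third-party python-lzo package and the dump file name are assumptions, and the levels differ from the CLI defaults, so expect different absolute numbers:

# rough timing sketch: zlib vs LZO on the same input
import time, zlib
import lzo  # assumption: the third-party python-lzo package is installed

data = open("dump.sql", "rb").read()  # file name taken from the post above

t0 = time.time()
z = zlib.compress(data, 6)  # roughly the gzip default level
print("zlib-6: %.1fs, %d -> %d bytes" % (time.time() - t0, len(data), len(z)))

t0 = time.time()
l = lzo.compress(data)
print("lzo:    %.1fs, %d -> %d bytes" % (time.time() - t0, len(data), len(l)))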
Re: [PERFORM] Query help
"Subbiah Stalin-XCGF84" wrote: > Shared buffer=8G, effective cache size=4G. That is odd; if your shared buffers are at 8G, you must have more than 4G of cache. How much RAM is used for cache at the OS level? Normally you would add that to the shared buffers to get your effective cache size, or at least take the larger of the two. How much RAM is on this machine in total? Do you have any other processes which use a lot of RAM or might access a lot of disk from time to time? > Let me know if you need any other information. The \d output for the object table, or the CREATE for it and its indexes, would be good. Since it's getting through the random reads by the current plan at the rate of about one every 5ms, I'd say your drive array is OK. If you want to make this query faster you've either got to have the data in cache or it has to have reason to believe that a different plan is faster. One thing which might help is to boost your work_mem setting to somewhere in the 32MB to 64MB range, provided that won't drive you into swapping. You could also try dropping the random_page_cost to maybe 2 to see if that gets you a different plan. You can do a quick check on what plans these generate by changing them on a given connection and then requesting just an EXPLAIN of the plan, to see if it's different. (This doesn't actually run the query, so it's fast.) -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query help
Sure, I can provide those details. I have seen this query running 5+ minutes for different values of domainID too. It's just that it happens at random and gets fixed within a few minutes.

Shared buffer=8G, effective cache size=4G. Optimizer/autovacuum settings are defaults.

           relname            | relpages | reltuples
------------------------------+----------+-----------
 ct_objects_id_u1             |    11906 |    671919
 ix_objects_altname           |    13327 |    671919
 ix_objects_domainid_name     |    24714 |    671919
 ix_objects_key3              |     9891 |    671919
 ix_objects_name              |    11807 |    671919
 ix_objects_type_lastmodified |    38640 |    671919
 ix_objects_user1             |    20796 |    671919
 ix_objects_user2             |    20842 |    671919
 objects                      |   111873 |    671919

This database resides on a RAID 1+0 storage with 10 disks (5+5). Let me know if you need any other information.

Thanks Kevin.

Stalin

-----Original Message-----
From: Kevin Grittner [mailto:kevin.gritt...@wicourts.gov]
Sent: Monday, August 03, 2009 12:48 PM
To: Subbiah Stalin-XCGF84; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query help

"Subbiah Stalin-XCGF84" wrote:
> Not sure what's wrong in below execution plan but at times the query
> runs for 5 minutes to complete and after a while it runs within a
> second or two.

The plan doesn't look entirely unreasonable for the given query, although it's hard to be sure of that without seeing the table definitions. Given the plan, the times look to be about what I'd expect for uncached and cached timings. (That is, on subsequent runs, the data is sitting in RAM, so you don't need to access the hard drives.)

If the initial run time is unacceptable for your environment, and there's no way to have the cache "primed" when it matters, please give more details on your table layouts, and perhaps someone can make a useful suggestion.

> Pg 8.2.7, Sol10.

One quick suggestion -- upgrade your PostgreSQL version if at all possible. The latest bug-fix version of 8.2 is currently 8.2.13, and there are significant performance improvements in 8.3 and the newly released 8.4.

-Kevin
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
> I get very different (contradictory) behavior.  Server with fast RAID,
> 32GB RAM, 2 x 4 core 3.16Ghz Xeon 54xx CPUs.  CentOS 5.2
> 8.3.6

That's a very different setup from my (much less powerful) box, so that would explain it...

> No disk wait time during any test. One test beforehand was used to
> prime the disk cache.
> 100% CPU in the below means one core fully used. 800% means the system
> is fully loaded.
>
> pg_dump > file (on a subset of the DB with lots of tables with small
> tuples)
> 6m 27s, 4.9GB; 12.9MB/sec
> 50% CPU in postgres, 50% CPU in pg_dump

If there is no disk wait time, then why do you get 50/50 and not 100/100, or at least one core maxed out? That's interesting...

COPY annonces TO '/dev/null';
COPY 413526
Temps : 13871,093 ms

\copy annonces to '/dev/null'
Temps : 14037,946 ms

time pg_dump -Fc -t annonces -U annonces --compress=0 annonces > /dev/null

real    0m14.596s
user    0m0.700s
sys     0m0.372s

In all 3 cases postgres maxes out one core (I've repeated the test until all data was cached, so there is no disk access at all in vmstat). Size of dump is 312MB.
Re: [PERFORM] Query help
"Subbiah Stalin-XCGF84" wrote: > Not sure what's wrong in below execution plan but at times the query > runs for 5 minutes to complete and after a while it runs within a > second or two. The plan doesn't look entirely unreasonable for the given query, although it's hard to be sure of that without seeing the table definitions. Given the plan, the times look to be about what I'd expect for uncached and cached timings. (That is, on subsequent runs, the data is sitting in RAM, so you don't need to access the hard drives.) If the initial run time is unacceptable for your environment, and there's no way to have the cached "primed" when it matters, please give more details on your table layouts, and perhaps someone can make a useful suggestion. > Pg 8.2.7, Sol10. One quick suggestion -- upgrade your PostgreSQL version if at all possible. The latest bug-fix version of 8.2 is currently 8.2.13, and there are significant performance improvements in 8.3 and the newly released 8.4. -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
On 8/3/09 11:56 AM, "Tom Lane" wrote:

> Scott Carey writes:
>> I get very different (contradictory) behavior.  Server with fast RAID,
>> 32GB RAM, 2 x 4 core 3.16Ghz Xeon 54xx CPUs.  CentOS 5.2
>> 8.3.6
>> No disk wait time during any test.  One test beforehand was used to
>> prime the disk cache.
>> 100% CPU in the below means one core fully used.  800% means the
>> system is fully loaded.
>
>> pg_dump > file (on a subset of the DB with lots of tables with small
>> tuples)
>> 6m 27s, 4.9GB; 12.9MB/sec
>> 50% CPU in postgres, 50% CPU in pg_dump
>
>> pg_dump -Fc > file.gz
>> 9m6s, output is 768M (6.53x compression); 9.18MB/sec
>> 30% CPU in postgres, 70% CPU in pg_dump
>
>> pg_dump | gzip > file.2.gz
>> 6m22s, 13MB/sec.
>> 50% CPU in postgres, 50% CPU in pg_dump, 50% CPU in gzip
>
> I don't see anything very contradictory here.

The other poster got nearly 2 CPUs of work from just pg_dump + postgres. That contradicts my results (but could be due to data differences or postgres version differences). In the other use case, compression was not slower, it just used more CPU (also contradicting my results).

> What you're demonstrating is that it's nice to be able to throw a third
> CPU at the compression part of the problem.

No, 1.5 CPUs. Full use of a second one would even be great.

I'm also demonstrating that there is some artificial bottleneck somewhere preventing postgres and pg_dump from operating concurrently. Instead, one waits while the other does work. Your claim earlier in this thread was that there was already pipelined work being done due to pg_dump + postgresql -- which seems to be true for the other test case but not mine.

As a consequence, adding compression throttles the postgres process even though the compression hasn't caused 100% CPU (or close) on any task involved.

> That's likely to remain true if we shift to a different compression
> algorithm.  I suspect if you substituted lzo for gzip in the third
> case, the picture wouldn't change very much.

That is exactly the point. LZO would be nice (and would help mitigate this problem), but it doesn't solve the real problem here. pg_dump is slow and artificially throttled without even getting 100% CPU from itself or postgres.

The problem still remains: dumping with -Fc can be significantly slower than raw piped to a compression utility, even if no task is CPU- or I/O-bound. Dumping and piping to gzip is faster. But parallel restore won't work without custom or raw format.

> regards, tom lane
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
Scott Carey writes:
> I get very different (contradictory) behavior.  Server with fast RAID,
> 32GB RAM, 2 x 4 core 3.16Ghz Xeon 54xx CPUs.  CentOS 5.2
> 8.3.6
> No disk wait time during any test.  One test beforehand was used to
> prime the disk cache.
> 100% CPU in the below means one core fully used.  800% means the system
> is fully loaded.

> pg_dump > file (on a subset of the DB with lots of tables with small
> tuples)
> 6m 27s, 4.9GB; 12.9MB/sec
> 50% CPU in postgres, 50% CPU in pg_dump

> pg_dump -Fc > file.gz
> 9m6s, output is 768M (6.53x compression); 9.18MB/sec
> 30% CPU in postgres, 70% CPU in pg_dump

> pg_dump | gzip > file.2.gz
> 6m22s, 13MB/sec.
> 50% CPU in postgres, 50% CPU in pg_dump, 50% CPU in gzip

I don't see anything very contradictory here.  What you're demonstrating is that it's nice to be able to throw a third CPU at the compression part of the problem.  That's likely to remain true if we shift to a different compression algorithm.  I suspect if you substituted lzo for gzip in the third case, the picture wouldn't change very much.

			regards, tom lane
[PERFORM] Query help
All,

Not sure what's wrong in the below execution plan, but at times the query runs for 5 minutes to complete, and after a while it runs within a second or two. Here is the explain analyze output of the query.

SELECT OBJECTS.ID, OBJECTS.NAME, OBJECTS.TYPE, OBJECTS.STATUS,
OBJECTS.ALTNAME, OBJECTS.DOMAINID, OBJECTS.ASSIGNEDTO, OBJECTS.USER1,
OBJECTS.USER2, OBJECTS.KEY1, OBJECTS.KEY2, OBJECTS.KEY3, OBJECTS.OUI,
OBJECTS.PRODCLASS, OBJECTS.STATUS2, OBJECTS.LASTMODIFIED,
OBJECTS.LONGDATA, OBJECTS.DATA0, OBJECTS.DATA1
FROM OBJECTS
WHERE OBJECTS.DOMAINID IN ('HY3XGEzC0E9JxRwoXLOLbjNsghEA','3330')
AND OBJECTS.TYPE IN ('cpe')
ORDER BY OBJECTS.LASTMODIFIED DESC
LIMIT 501

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..9235.11 rows=501 width=912) (actual time=0.396..2741.803 rows=501 loops=1)
   ->  Index Scan Backward using ix_objects_type_lastmodified on objects  (cost=0.00..428372.71 rows=23239 width=912) (actual time=0.394..2741.608 rows=501 loops=1)
         Index Cond: (("type")::text = 'cpe'::text)
         Filter: ((domainid)::text = ANY (('{HY3XGEzC0E9JxRwoXLOLbjNsghEA,3330}'::character varying[])::text[]))
 Total runtime: 2742.126 ms

The table is auto-vacuumed regularly. I have enabled log_min_messages at debug2, but nothing stands out during the times when the query took 5+ minutes. Is a rebuild of the index necessary here?

Thanks in Advance,
Stalin

Pg 8.2.7, Sol10.
Re: [PERFORM] PostgreSQL 8.4 performance tuning questions
On 7/31/09 4:01 PM, "PFC" wrote:

> On Fri, 31 Jul 2009 19:04:52 +0200, Tom Lane wrote:
>
>> Greg Stark writes:
>>> On Thu, Jul 30, 2009 at 11:30 PM, Tom Lane wrote:
>>>> I did some tracing and verified that pg_dump passes data to deflate()
>>>> one table row at a time.  I'm not sure about the performance
>>>> implications of that, but it does seem like it might be something to
>>>> look into.
>>
>>> I suspect if this was a problem the zlib people would have added
>>> internal buffering ages ago. I find it hard to believe we're not the
>>> first application to use it this way.
>>
>> I dug into this a bit more.  zlib *does* have internal buffering --- it
>> has to, because it needs a minimum lookahead of several hundred bytes
>> to ensure that compression works properly.  The per-call overhead of
>> deflate() looks a bit higher than one could wish when submitting short
>> chunks, but oprofile shows that "pg_dump -Fc" breaks down about like
>> this:
>
> During dump (size of dump is 2.6 GB),
>
> No Compression :
> - postgres at 70-100% CPU and pg_dump at something like 10-20%
> - dual core is useful (a bit...)
> - dump size 2.6G
> - dump time 2m25.288s
>
> Compression Level 1 :
> - postgres at 70-100% CPU and pg_dump at 20%-100%
> - dual core is definitely useful
> - dump size 544MB
> - dump time 2m33.337s
>
> Since this box is mostly idle right now, eating CPU for compression is
> no problem...

I get very different (contradictory) behavior.  Server with fast RAID, 32GB RAM, 2 x 4 core 3.16Ghz Xeon 54xx CPUs.  CentOS 5.2
8.3.6

No disk wait time during any test.  One test beforehand was used to prime the disk cache.  100% CPU in the below means one core fully used.  800% means the system is fully loaded.

pg_dump > file (on a subset of the DB with lots of tables with small tuples)
6m 27s, 4.9GB; 12.9MB/sec
50% CPU in postgres, 50% CPU in pg_dump

pg_dump -Fc > file.gz
9m6s, output is 768M (6.53x compression); 9.18MB/sec
30% CPU in postgres, 70% CPU in pg_dump

pg_dump | gzip > file.2.gz
6m22s, 13MB/sec.
50% CPU in postgres, 50% CPU in pg_dump, 50% CPU in gzip

The default (5) compression level was used.

So, when using pg_dump alone, I could not get significantly more than one core of CPU (all on the same box). No matter how I tried, pg_dump plus the postgres process dumping data always totaled about 102% -- it would fluctuate in top, give or take 15% at times, but the two always were very close (within 3%) of this total.

Piping the whole thing to gzip gets some speedup. This indicates that perhaps the implementation or use of gzip is inappropriate on pg_dump's side, or the library version is older or slower. Alternatively, the use of gzip inside pg_dump fails to pipeline CPU usage as well as piping it does, as the above shows 50% more CPU utilization when piping.

I can do the same test with a single table that is 10GB later (which does dump much faster than 13MB/sec and has rows that average about 500 bytes in size). But overall I have found pg_dump's performance sorely lacking, and this is a data risk in the big picture. Postgres is very good about not losing data, but that only goes up to the limits of the hardware and OS, which is not good enough. Because of long disaster recovery times and poor replication/contingency features, it is a fairly unsafe place for data once it gets beyond a certain size and a BC plan requires minimal downtime.

> Adding an option to use LZO instead of gzip could be useful...
>
> Compressing the uncompressed 2.6GB dump :
>
> - gzip -1 :
>   - compressed size : 565 MB
>   - compression throughput : 28.5 MB/s
>   - decompression throughput : 74 MB/s
>
> - LZO -1 :
>   - compressed size : 696M
>   - compression throughput : 86 MB/s
>   - decompression throughput : 247 MB/s
>
> Conclusion : LZO could help for fast disks (RAID) or slow disks on a
> CPU-starved server...

LZO would be a great option, it is very fast, especially decompression. With gzip, one rarely gains by going below gzip -3 or above gzip -6.
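Given that observation, one workaround is to keep the compressor in its own process so its CPU time overlaps with pg_dump's. A sketch, assuming lzop is on the PATH (note a plain-format dump like this gives up parallel restore):

# hedged sketch: overlap dump and compression in separate processes
import subprocess

dump = subprocess.Popen(["pg_dump", "mydb"], stdout=subprocess.PIPE)
with open("mydb.sql.lzo", "wb") as out:
    comp = subprocess.Popen(["lzop", "-c"], stdin=dump.stdout, stdout=out)
    dump.stdout.close()  # let lzop see EOF when pg_dump exits
    comp.wait()
dump.wait()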
Re: [PERFORM] Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version
The few 'obvious' things I see :

- ID and POLLID aren't of the same type (numeric vs bigint)
- TTIME isn't indexed.

And as a general matter, you should stick to native datatypes if you don't need numeric. But as said in the other answer, maybe you should redo this schema and use more consistent datatypes.

Anyway, from what I remember, it's not advised to set up shared buffers that high for windows (I don't do so much windows myself, so maybe someone will be better informed). Anyway, you can start by correcting the schema; a sketch follows below the quoted message.

On Friday 31 July 2009 07:45:55 pari krishnan wrote:
> Dear All,
>
> We are using Postgres 8.3.7 in our java application. We are doing
> performance tuning and load testing in our setup. We have noticed that
> some of our queries to the database are taking a long time to return
> results. Please find our setup details below.
>
> We observed that postgres running on windows is slower than on linux.
>
> Machine & Database Details :
>
> Windows configuration:
> 4 GB RAM
> 4 * 1.6 GHZ
> windows 2008 server standard edition
>
> Postgresql configuration:
>
> shared_buffers: 1 GB
> effective_cache_size: 2GB
> fsync: off (even when we tested with this parameter on, we observed the
> same slowness)
>
> Database Details :
>
> Postgres Database : PostgreSQL 8.3.7.1
> Driver Version : PostgreSQL 8.3 JDBC4 with SSL (build 604)
> We are using 40 database connections.
>
> We have a few tables which will be holding larger amounts of data.
> While running our application, a STATSDATA table is created daily with
> the date in the table name, like STATSDATA8_21_2009.
>
> Schema for STATSDATA table:
>
> create table STATSDATA8_21_2009(
> POLLID Numeric(19),
> INSTANCE varchar(100),
> TTIME Numeric(19),
> VAL Numeric(13));
> CREATE INDEX POLLID%_ndx on STATSDATA%(POLLID)
>
> Schema for PolledData:
>
> create table PolledData(
> "NAME" varchar(50) NOT NULL,
> "ID" BIGINT NOT NULL,
> "AGENT" varchar(50) NOT NULL,
> "COMMUNITY" varchar(100) NOT NULL,
> "PERIOD" INTEGER NOT NULL,
> "ACTIVE" varchar(10),
> "OID" varchar(200) NOT NULL,
> "LOGDIRECTLY" varchar(10),
> "LOGFILE" varchar(100),
> "SSAVE" varchar(10),
> "THRESHOLD" varchar(10),
> "ISMULTIPLEPOLLEDDATA" varchar(10),
> "PREVIOUSSEVERITY" INTEGER,
> "NUMERICTYPE" INTEGER,
> "SAVEABSOLUTES" varchar(10),
> "TIMEAVG" varchar(10),
> "PORT" INTEGER,
> "WEBNMS" varchar(100),
> "GROUPNAME" varchar(100),
> "LASTCOUNTERVALUE" BIGINT,
> "LASTTIMEVALUE" BIGINT,
> "TIMEVAL" BIGINT NOT NULL,
> "POLICYNAME" varchar(100),
> "THRESHOLDLIST" varchar(200),
> "DNSNAME" varchar(100),
> "SUFFIX" varchar(20),
> "STATSDATATABLENAME" varchar(100),
> "POLLERNAME" varchar(200),
> "FAILURECOUNT" INTEGER,
> "FAILURETHRESHOLD" INTEGER,
> "PARENTOBJ" varchar(100),
> "PROTOCOL" varchar(50),
> "SAVEPOLLCOUNT" INTEGER,
> "CURRENTSAVECOUNT" INTEGER,
> "SAVEONTHRESHOLD" varchar(10),
> "SNMPVERSION" varchar(10),
> "USERNAME" varchar(30),
> "CONTEXTNAME" varchar(30),
> PRIMARY KEY ("ID","NAME","AGENT","OID"),
> index PolledData0_ndx ("NAME"),
> index PolledData1_ndx ("AGENT"),
> index PolledData2_ndx ("OID"),
> index PolledData3_ndx ("ID"),
> index PolledData4_ndx ("PARENTOBJ"),
> )
>
> We have 300k rows in the PolledData table. Each STATSDATA table has
> almost 12 to 13 million rows. At every one-minute interval, we insert
> data into the STATSDATA table. In our application, we issue insert and
> select queries against the STATSDATA table at regular intervals. Please
> let us know why the below query takes more time to return the results.
> Is there anything we need to do to tune the postgres database?
>
> Please find the explain analyze output.
>
> First Query :
>
> postgres=# explain analyze SELECT COUNT(*) FROM ( SELECT ID,
> PolledData.AGENT, NAME, INSTANCE, TTIME, VAL FROM PolledData,
> STATSDATA8_21_2009 WHERE ( ( PolledData.ID=STATSDATA8_21_2009.POLLID)
> AND ( ( TTIME >= 1250838027454) AND ( TTIME <= 1250838079654) ) ) ) t1;
>
>                               QUERY PLAN
> ------------------------------------------------------------------------
>  Aggregate  (cost=773897.12..773897.13 rows=1 width=0) (actual
>  time=17818.410..17818.412 rows=1 loops=1)
>    ->  Merge Join  (cost=717526.23..767505.06 rows=2556821 width=0)
>        (actual time=17560.469..17801.790 rows=13721 loops=1)
>          Merge Cond: (statsdata8_21_2009.pollid =
>          ((polleddata.id)::numeric))
>          ->  Sort  (cost=69708.44..69742.49 rows=13619 width=8) (actual
>              time=2392.659..2416.093 rows=13721 loops=1)
>                Sort Key: statsdata8_21_2009.pollid
>                Sort Method: quicksort  Memory: 792kB
>                ->  Seq Scan on statsdata8_21_2009
>                    (cost=0.00..68773.27 rows=13619 width=8) (actual
>                    time=0.077..2333.132 rows=13721 loops=1)
>                      Filter: ((ttime >= 1250838027454::numeric) AND
>                      (ttime <= 1250838079654::numeric))
>          ->  Materialize  (cost=64781
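Here is a sketch of acting on the two concrete points at the top of this reply — matching POLLID's type to PolledData.ID and indexing TTIME. Table and column names come from the post; the connection string is hypothetical, and the bigint conversion assumes all stored values fit, so try it on a copy first:

# hedged sketch: align join-key types and index the range column
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("ALTER TABLE statsdata8_21_2009 ALTER COLUMN pollid TYPE bigint")
cur.execute("ALTER TABLE statsdata8_21_2009 ALTER COLUMN ttime TYPE bigint")
cur.execute("CREATE INDEX statsdata8_21_2009_ttime_ndx "
            "ON statsdata8_21_2009 (ttime)")
conn.commit()

With matching bigint types on both sides, the pollid = id merge condition no longer needs the ::numeric cast visible in the plan above.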
Re: [PERFORM] Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version
How about normalizing the schema, for a start? By the looks of it, you have a huge table with plenty of varchars; that smells like bad design of the db.
Re: [PERFORM] Greenplum MapReduce
Suvankar Roy wrote:
> Hi Richard,
> I sincerely regret the inconvenience caused.

No big inconvenience, but the lists can be very busy sometimes, and the easier you make it for people to answer your questions, the better the answers you will get.

> %YAML 1.1
> ---
> VERSION: 1.0.0.1
> DATABASE: test_db1
> USER: gpadmin
> DEFINE:
> - INPUT:    #** This is the line which is causing the error **#
>     NAME: doc
>     TABLE: documents

If it looks fine, always check for tabs. Oh, and you could have cut out all the rest of the file, really.

> I have learnt that unnecessary TABs can be the cause of this, so I am
> trying to overcome that; hopefully the problem will subside then.

I'm always getting this. It's easy to accidentally introduce a tab character when reformatting YAML. It might be worth checking if your text editor has an option to always replace tabs with spaces.

--
Richard Huxton
Archonet Ltd
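If the editor in question can't do the substitution, a one-off untabify along the lines Richard suggests — the file name and the 8-column tab stops are assumptions, and the YAML indentation should be eyeballed after conversion:

# hedged sketch: write a tab-free copy of the YAML file
src = open("mapreduce.yml").read()  # hypothetical file name
open("mapreduce_notabs.yml", "w").write(src.expandtabs(8))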
Re: [PERFORM] Greenplum MapReduce
Suvankar Roy wrote:
> Hi all,
> Has anybody worked on Greenplum MapReduce programming? I am facing a
> problem while trying to execute the below Greenplum MapReduce program
> written in YAML (in blue).

The other poster suggested contacting Greenplum and I can only agree.

> The error is thrown in the 7th line as:
> Error: YAML syntax error - found character that cannot start any token
> while scanning for the next token, at line 7 (in red)

There is no red, particularly if viewing messages as plain text (which most people do on mailing lists). Consider indicating a line some other way next time (commonly, below the line you put something like "this is line 7 ^").

The most common problem I get with YAML files, though, is when a tab is accidentally inserted instead of spaces at the start of a line.

--
Richard Huxton
Archonet Ltd