Re: [GENERAL] When adding millions of rows at once, getting out of disk space errors
On Wed, 2009-02-18 at 12:55 -0800, Mike Christensen wrote:
> I have well over 50 gigs free on that drive..

I doubt it. Are you sure the pg data directory is on the drive you think it is? Are you doing a lot of deletes, or are you merely inserting? Are you doing any sorting and therefore running out of temp space in your tmp partition? [Supposition: I've run into something like this before, but not specifically with Postgres.]

-Mark
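P.S. A quick way to settle the first question from psql (you may need to be a superuser to see this setting):

    show data_directory;

Compare the path it returns against your mount points (df, for instance) to confirm which filesystem is actually filling up.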
Re: [GENERAL] Good Delimiter for copy command
On Tue, 2009-02-17 at 17:17 +0100, Marco Colombo wrote:
> Which makes me wonder, does copy accept UTF-8 input? Is it possible to
> use some unicode character which is unlikely to appear in the data set
> as delimiter? Something like U+FFFC.

I'm also not able to get unicode characters to copy in like that, but I also haven't had any problems with data containing the *tab* character. It seems to be properly escaped (see my other email in this thread), so it doesn't seem to matter if it appears in the data stream. The *only* reason I see to switch from tab is if the receiving application requires the data in a different format.

-Mark
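P.S. A quick sketch of what I mean (table name invented). Tab is the default delimiter for text-mode copy anyway, and in my experience a multibyte character is rejected as a delimiter, which rules out the exotic unicode choices:

    copy foo to stdout with delimiter E'\t';  -- fine: tab is a single byte
    copy foo to stdout with delimiter '§';    -- rejected: the delimiter must be a single one-byte character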
Re: [GENERAL] Good Delimiter for copy command
On Thu, 2009-02-12 at 12:51 -0800, SHARMILA JOTHIRAJAH wrote:
> My data fields contain commas, tabs, '|' etc. So I can't use them as
> delimiters... so I need a unique, maybe non-character, to use as a
> delimiter...
> -Sharmila

Is this a theoretical problem or an actual one? I haven't had any problems with the default (tab). In fact, copying from one database to another is an exceedingly common task for me, all done with tab.

dev=> create table foo (a text);
CREATE TABLE
Time: 385.967 ms
dev=> insert into foo values (' ');   -- literal tab
INSERT 0 1
Time: 0.536 ms
dev=> insert into foo values ('\t');  -- special character, parsed
INSERT 0 1
Time: 0.224 ms
dev=> insert into foo values ('\\t'); -- backslash, t
INSERT 0 1
Time: 0.183 ms
dev=> copy foo to stdout;
\t
\t
\\t
Time: 0.188 ms
dev=> select * from foo;
  a
------
 \x09
 \x09
 \t
(3 rows)
Time: 0.239 ms

-Mark
Re: [GENERAL] Pet Peeves?
On Thu, 2009-01-29 at 13:16 +0000, Gregory Stark wrote:
> I'm putting together a talk on PostgreSQL Pet Peeves for discussion at
> FOSDEM 2009 this year. I have a pretty good idea what some of them are
> of course, but I would be interested to hear if people have any
> complaints from personal experience. What would be most interesting is
> if you can explain an example of when the problem caused real
> inconvenience to you, since sometimes it's hard to see from a
> theoretical description where the real harm lies.
>
> So, what do people say? Is Postgres perfect in your world or does it do
> some things which rub you the wrong way? Feel free to respond on-list
> or if you prefer in personal emails. I do intend to use the ideas you
> give in my presentation so mark anything you wouldn't be happy to see
> in a slide at a conference some day.
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
> Ask me about EnterpriseDB's Slony Replication support!

A couple of the things weighing on my mind right now (I could be utterly wrong about all of them):

- In-place upgrade. Seriously, 3.5TB DBs make me cry...

- Lack of good documentation for the binary copy input format for 'copy foo from stdin with binary'. Also, I don't *seem* to be able to use the same file that I copied out for copy in, which was a total surprise.

- The fetch time with lots of byteas is really bad - selecting them out effectively *forces* scattered IO, even in what would normally be sequential IO. It would be really nice if it did all the grabbing of rows that was required and then *at the end* fetched the appropriate bytea fields and re-aggregated them appropriately. This is a *HUGE* performance killer.

- The bytea copy input format is *exceedingly* large - and the dual-parser thing requiring two backslashes doesn't help! W T F, I have got to be missing something. Consider the case where I want to write an int16_t. What should be sprintf(s, "\\%o", i); becomes sprintf(s, "\\\\%03o\\\\%03o", (i & 0x00FF), (i & 0xFF00) >> 8);

- Query planning with heavy partitioning takes a huge hit, and this isn't helped by multiple cores on the same box. It would be very nice if subpartitions could simply be ignored when their parent partition isn't required, but the planner still takes locks on them.

- The ability to add a table to the inheritance structure without obtaining an access exclusive lock would be really nice.

- The ability to rebuild a table or index concurrently would be nice, especially if it automatically picked up interim changes and applied them before switching out and dropping the old table.

- Slony is really too slow to use for large quantities of data shipping. IIRC we had to move off of it when the DB was still sub-1TB.

- Lots of temp table creation/dropping plays havoc with the catalog tables and eventually requires a full maintenance window to resolve.

- Creating an empty table with foreign keys requires an access exclusive lock on all the referenced tables. Blargh.

- It'd be nice if the query planner were more stable - sometimes queries run fast, and then sometimes they randomly take 2 hours for a delete that normally runs in a couple of minutes.

There's (a lot) more, but I can't recall it all because I'm overall pretty happy with Postgres.

-Mark
Re: [GENERAL] Favorite Tom Lane quotes
On Tue, 2008-12-02 at 10:40 +0000, Grzegorz Jaśkiewicz wrote:
> but then you have to add the cost of backing up and restoring labour
> time, and that's going to be a massive latency - if you ask me.

Of course it is, but latency probably isn't the key issue - more that the data itself isn't lost.

-Mark
Re: [GENERAL] general table stats, ideas ?
> My problem is, I need to benchmark a set of tables, where - we can
> assume - the schema of each table is unknown, and we have no assumption
> of any fields being present there. (altho, if there is no other way to
> do it, we could assume id bigint not null default nextval('someseq'))
>
> Basically, I need to know when a certain row was selected (read), and
> when it was updated/created (insert). For that I need two fields. And
> although the former could be done, and I more or less know how to do it
> (I think it can be done with a rule, and default = now()) - I have yet
> to find a way to do the same thing for select.

So basically you've got a Slony-replicated database that you want to check (on a row-by-row level) to see when something gets read/inserted/deleted?

It seems like you would want to add three fields to each table: last_read_time, last_update_time, and original_insert_time. Then you restrict all access to the table and use security definer functions to allow access. These functions would also update said metadata on the table.

If you need an ongoing log of access to the tables, you could always add an accessor log table that looked like:

User (postgres/MYUSER) / Action (Select/Update/Insert) / Column (some_column_name) / New Value (blah-value)

Erm, that's if I understand your question right. :-/

-Mark
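P.S. A minimal sketch of the security definer idea for the read side (all names invented for illustration). The function runs with its owner's rights, so callers can only reach the rows through it, and every read stamps the row:

    create table foo (
        id             bigint primary key,
        payload        text,
        last_read_time timestamptz
    );

    create or replace function read_foo(p_id bigint) returns text as $$
        update foo set last_read_time = now() where id = p_id;
        select payload from foo where id = p_id;
    $$ language sql security definer;

    revoke all on foo from public;  -- force all access through read_foo()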
Re: [GENERAL] SERIAL datatype
On Thu, 2008-08-21 at 13:53 -0600, Scott Marlowe wrote:
> Regular SERIAL type is limited to a 32 bit int. BIGSERIAL uses a 64 bit
> int.

I think one of the things that would be off-putting is the size difference between the two types (32 vs 64 bits; 5 foreign keys and a billion rows or so makes for a lot of pain).

-Mark
Re: [GENERAL] SERIAL datatype
On Mon, 2008-08-25 at 11:02 -0600, Scott Marlowe wrote:
> Well, of course a 64 bit int is gonna be bigger than a 32 bit, but with
> alignment issues and on 64 bit hardware, I'm guessing the difference
> isn't exactly twice as slow / twice as much storage. And it's way
> faster than a GUID, which was what I think started this thread.

I took a slice of data from our dev box and generated the same table using integers and bigints. For reference, the column types are:

bigint table:
 bigint, date, bigint, bigint, bigint, bigint, bigint, date, date, bytea,
 integer, integer, numeric, numeric, numeric, integer, integer, integer,
 integer, integer, integer, integer, bytea

int table (the five bigint key columns narrowed to integer):
 bigint, date, integer, integer, integer, integer, integer, date, date,
 bytea, integer, integer, numeric, numeric, numeric, integer, integer,
 integer, integer, integer, integer, integer, bytea

The integer version is 599,752,704 bytes and the bigint version is 673,120,256 bytes (a ~12% size increase). When joining each table to itself (keys = 1 date, 5 (big)ints, no indexes), the bigint version averages 44.1 sec and the integer version 29.6 sec (a ~48% performance hit).

Granted, it's not twice as big and twice as slow, but I think it's a fairly valid reason to want to stay within (small)int ranges. Sometimes the initial performance hit on insert would really be worth the continuing space/performance savings down the road. Of course, this wasn't very scientific and the benchmarks aren't very thorough (for instance, I assumed that bigserial is implemented as a bigint), but the point should stand.

Of course, it probably has no bearing on the OP's problem. So my advice to the OP: have you considered not keying such a volatile table on a serial value?

-Mark
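P.S. For anyone who wants to run the same comparison on their own data, relation sizes can be pulled straight from the server (table names invented; the join timings just come from running the query with \timing on in psql):

    select pg_size_pretty(pg_relation_size('facts_int'));
    select pg_size_pretty(pg_relation_size('facts_bigint'));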
Re: [GENERAL] What's size of your PostgreSQL Database?
On Thu, 2008-08-21 at 22:17 +0800, Amber wrote:
> Another question: how many people are there maintaining this huge
> database? We have about 2T of compressed SAS datasets and are now
> considering loading them into an RDBMS database. According to your
> experience, it seems a single PostgreSQL instance can't manage such
> size databases well, is that right?

Yahoo has a 2PB single-instance Postgres database (with a modified engine), but the biggest pure Pg single instance I've heard of is 4TB. The 4TB database has the additional interesting property that they've done none of the standard scalable architecture changes (partitioning, etc). To me, this is really a shining example that even naive Postgres databases can scale to as much hardware as you're willing to throw at them. Of course, clever solutions will get you much more bang for your hardware buck.

As for my personal experience, I'd say the only reason we're currently running a dual Pg instance (master/replica/hot standby) configuration is report times. It's really important to us to have snappy access to our data warehouse. During maintenance, our site and processes can easily be powered by the master database alone, with some noticeable performance degradation for the users.

The grid that we (I) am looking to build comes out of changing (yet ever static!) business needs: we're looking at an immediate 2x increase in data volume and will soon need to scale to 10x. Couple this with increased user load and the desire to make reports run even faster than they currently do, and we're really going to run up against a hardware boundary. Besides, writing grid/distributed databases is *fun*!

Uh, for a one-sentence answer: a single Pg instance can absolutely handle 2+ TB without flinching.

> How many cores and memory does your server have :)

My boss asked me not to answer the questions I missed... sorry. I will say that the hardware is pretty modest, but has good RAM and disk space.

-Mark
Re: [GENERAL] What's size of your PostgreSQL Database?
> Just out of curiosity, how do you replicate that amount of data?

When I started working here, we used Slony-I to replicate our aggregate fact tables. A little over a year ago our data volume had grown to the point that Slony was regularly unable to keep up, and around this time Slony hit us with some crazy rollback-based data loss bug. We elected to move our aggregate fact tables off of Slony, but left the metadata tables on it. So I wrote a custom in-house replication engine into our aggregation process. Replication is accomplished in parallel via piped copy statements, and so far it's been fast enough to keep up with the data volume.

Without getting into too much detail, an import process might look like this:
- Obtain and transform data into a usable state
- Import data, and earmark it to be reaggregated
- Reaggregate (summarize) the changed data
- Copy the new data to all databases in the replication set
- Integrate the new data into the aggregate fact tables

> I'd be very interested in a developer's view of running and maintaining
> a database this size. Mostly, what choices are made during development
> that might have been different on a smaller database. I'm also curious
> about the maintenance needed to keep a database this size healthy over
> time.

Ok, so all of this comes from what might be termed an ETL developer's point of view. I pretty much only work on the data warehouse's import/summarization process and look into performance issues.

Design:
- Natural keys are pretty well strictly forbidden. Don't make a VARCHAR your primary key, please. ;-)
- The data determines partition criteria, and the application layer is partition aware.
- It's important to have aggregate tables to support common queries. Joins are ok, but repeatedly aggregating thousands of rows together on the fly really takes too much time.
- Aggregation processes have to be designed with care.
- Parallel processing is important, especially if you ever have to reprocess large amounts of data (for example, due to incorrect initial data).

Maintenance:
- Autovacuum might be more trouble than it's worth. We frequently have mysteriously hung queries that are eventually traced back to being blocked by autovacuum. The application layer knows exactly what is changing and when... it could *probably* take over this duty.
- Pg upgrades are a major PITA and require absurd amounts of downtime for the data processing part of the warehouse.
- Queries that have been working for long periods of time and suddenly stop working or hang are usually broken by statistics issues.
- Partitioning is important because it allows the aggregate tables to be clustered, backed up, and archived individually.

If anyone wants to chat with me or direct me to resources about running Postgres on distributed file systems (or other grid computing solutions), please let me know. Yes, I'm aware of the proprietary solutions, but they've been ruled out on account of performance problems with aggregates and absurd licensing costs. Also, any grid computing solution that I write can easily be applied across the company and thus save us (quite literally) millions per year.

Again, if you'd like more information about any particular topic, just ask. :)

-Mark
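P.S. For the curious, each "piped copy" is conceptually nothing fancier than streaming a table between two psql sessions, with several such pipes running in parallel (host and table names invented):

    psql -h master -c "copy fact_2008_w34 to stdout" \
      | psql -h replica -c "copy fact_2008_w34 from stdin"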
Re: [GENERAL] SELECT query experts, anyone?
If you put this in the application, you could do something such as:

my @manufacturers = fetch("select manufacturer_no, name from manufacturers");
my @select_fields = ('product_no');

foreach my $manufacturer (@manufacturers) {
    my $manuf_no = $manufacturer->{manufacturer_no};
    my $name     = $manufacturer->{name};
    push(@select_fields,
        "case when x.manufacturer_no = $manuf_no then '$name' else null end"
        . " as manuf_${manuf_no}_products");
}

my @outer_select_fields = ('product_no',
    map {
        my $manuf = "manuf_" . $_->{manufacturer_no} . "_products";
        "sum($manuf) as $manuf"
    } @manufacturers);

my $outer_fields = join(",\n", @outer_select_fields);
my $inner_fields = join(",\n", @select_fields);

my @dataset = fetch("
    select $outer_fields
    from (
        select $inner_fields
        from products_by_manufacturer x
    ) x
    group by product_no
");

Uh, or something like that. Perl in Evolution is really.. painful.

-Mark

On Wed, 2008-08-20 at 15:50 +0300, Teemu Juntunen wrote:
> Hi experts, is there any way to SELECT values in columns instead of
> rows? For example, select products and their manufacturers in the
> following way:
>
> product1; manufacturer1; manufacturer2; ... manufacturerN
> product2; manufacturer3;
> product3; manufacturer1; ... manufacturerN-1
>
> With a function you could put the product manufacturers in one string,
> but I would like to have them in columns. How about arrays - is there a
> way to SELECT values in an array to columns?
>
> Best regards and thanks,
> Teemu Juntunen
Re: [GENERAL] What's size of your PostgreSQL Database?
On Tue, 2008-08-19 at 07:34 -0400, Bill Moran wrote:
> In theory, you can have so many disks that the bottleneck moves to some
> other location, such as the IO bus or memory or the CPU, but I've never
> heard of that happening to anyone. Also, you want to get fast, high-
> quality disks, as 10 15,000 RPM disks are going to perform better than
> 10 7,200 RPM disks.

I've personally experienced this happening.

-Mark
Re: [GENERAL] What's size of your PostgreSQL Database?
On Sat, 2008-08-16 at 11:42 +0800, Amber wrote:
> Dear all: We are currently considering using PostgreSQL to host a read
> only warehouse, and we would like to get some experiences, best
> practices and performance metrics from the user community. Following is
> the question list:
>
> 1. What's the size of your database?
> 2. What Operating System are you using?
> 3. What level is your RAID array?
> 4. How many cores and how much memory does your server have?
> 5. What about your performance of join operations?
> 6. What about your performance of load operations?
> 7. How many concurrent readers of your database, and what's the average
>    transfer rate, supposing all readers are doing one table scan?
> 8. Single instance or a cluster? What cluster software are you using if
>    you have a cluster?
>
> Thank you in advance!

1. 2.5-3TB, plus several others that are a fraction of that size.
...
5. They do pretty well, actually. Our aggregate fact tables regularly join to metadata tables, and we have an average query return time of 10-30s. We make some use of denormalized mviews for chained/hierarchical metadata tables.
6. Load/copy operations are extremely performant. We pretty much constantly have 10+ concurrent load operations going, with 2-3 aggregation processes.
7. About 50, but I'm not sure what the transfer rate is.
8. We have a master and a replica, with plans to move to a cluster/grid Soon(TM). It's not an emergency, and Postgres can easily handle and scale to a 3TB database on reasonable hardware ($30k).

A few notes: ours is really a very typical ETL database: medium/high input (write) volume with low-latency access required. I can provide a developer's view of what is necessary to keep a database of this size running, but I'm under no illusion that it's actually a large database.

I'd go into more detail, but I'd hate to ramble. If anyone's actually interested in any specific parts, feel free to ask. :) Also, if you feel that we're doing something wrong, feel free to comment on that too. :)

-Mark
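P.S. If you want the equivalent of answer 1 for your own instance, Postgres will report it directly (database name invented; the size functions are available on 8.1 and later):

    select pg_size_pretty(pg_database_size('warehouse'));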
Re: [GENERAL] Cause of error message?
On Mon, 2008-07-14 at 19:44 -0500, Bayless Kirtley wrote:
> An old app is giving a new error message. I query a table and make one
> update. Later I come back to query the table again and get this error:
>
> ERROR: current transaction is aborted, commands ignored until end of
> transaction block
>
> The error only occurs after I have made an update. As long as there are
> no updates, I can query multiple times. Can anyone give me an idea of
> what sort of actions or errors usually cause this error?
>
> Thanks,
> Bayless

I believe this means an earlier statement in the same transaction failed (a syntax error, for instance); once that happens, every subsequent command is refused until the transaction is rolled back. Do you have any error handling code that might be hiding the original error from you?

-Mark
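P.S. It's easy to reproduce from psql, if that helps - any error inside an open transaction puts the whole transaction into the aborted state:

    begin;
    select 1/0;  -- fails (division by zero) and aborts the transaction
    select 1;    -- now refused: "current transaction is aborted..."
    rollback;    -- required before anything else will run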
Re: [GENERAL] limits?
On Mon, 2008-06-23 at 11:21 -0700, Steve Atkins wrote:
> http://www.postgresql.org/about/ has some of the theoretical limits. In
> real use you're unlikely to hit any limits, theoretical or practical,
> but if you start to use a silly number of tables and so on you're
> likely to hit performance issues eventually. I'm not sure where that
> threshold would be, but it's higher than thousands.
>
> Cheers, Steve

I'm just a developer (my DBA would have more details on the settings that needed adjusting), but I will attest to it being reasonably easy to hit database limits. For us, it was related to having a reasonably small number of (large) aggregate tables that are partitioned (inherited with check constraints) by week. The real problem wasn't the absolute limit of tables (IIRC) so much as accessing the data from the parent tables in bulk update transactions in a parallel manner (the parallelism is probably what pushed it over the top).

The limit is absurdly high, and by the time you hit it, you'll probably have a really good idea of how to overcome it. Really, there are lots of challenges you'll overcome before that time (IMO). Of course, all this is anecdotal, and you should take it with a grain of salt. :)

-Mark
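P.S. For anyone unfamiliar with it, "partitioned (inherited with check constraints) by week" looks roughly like this (a minimal sketch; table and column names invented):

    create table facts (day date not null, value integer);

    create table facts_2008_w25 (
        check (day >= date '2008-06-16' and day < date '2008-06-23')
    ) inherits (facts);

    -- with constraint_exclusion on, a query against the parent can skip
    -- child tables whose check constraints rule them out
    set constraint_exclusion = on;
    select sum(value) from facts where day = date '2008-06-20';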
Re: [GENERAL] Update Join Query
update foo
   set foo_id = g.goo_id
  from goo g
 where foo.foo_id = g.goo_id
   and foo.foo_id is not null;

I think. :)

-Mark

On Mon, 2008-06-23 at 21:43 +0200, Daniel Futerman wrote:
> Hi,
>
> Looking for the correct syntax for an UPDATE LEFT JOIN query in
> PostgreSQL. The equivalent MySQL query is:
>
> UPDATE Foo f
> LEFT JOIN Goo g ON f.Foo_ID = g.Goo_ID
> SET f.Foo_ID = g.Goo_ID
> WHERE f.Foo_ID IS NOT NULL;
>
> When I try to run this in Postgres, I get the following error:
>
> ERROR: syntax error at or near "LEFT"
>
> Is it possible to have UPDATE JOIN queries in PostgreSQL?
>
> Thanks.