Re: [GENERAL] When adding millions of rows at once, getting out of disk space errors

2009-02-18 Thread Mark Roberts

On Wed, 2009-02-18 at 12:55 -0800, Mike Christensen wrote:
 I have well over 50 gigs free on that drive..  I doubt it.

Are you sure the pg data directory is on the drive you think it is?  Are
you doing a lot of deletes, or are you merely inserting?  Are you doing
any sorting and therefore running out of temp space in your tmp
partition?  [Supposition -- I've run into something like this before, but
not specifically with Postgres.]
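
A couple of quick sanity checks from within Postgres can help narrow that
down; just a sketch (SHOW data_directory may require superuser):

show data_directory;                                          -- where the cluster actually lives
select pg_size_pretty(pg_database_size(current_database()));  -- how big this database really is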

-Mark




Re: [GENERAL] Good Delimiter for copy command

2009-02-17 Thread Mark Roberts

On Tue, 2009-02-17 at 17:17 +0100, Marco Colombo wrote:
 
 Which makes me wonder, does copy accept UTF-8 input? Is it possible
 to use some unicode character which is unlikely to appear in the data
 set as the delimiter? Something like U+FFFC.

I haven't been able to get unicode characters to copy in like that either,
but I also haven't had any problems with data containing the *tab*
character.  It gets properly escaped (see my other email in this thread),
so it doesn't matter if it appears in the data stream.  The *only* reason
I see to switch away from tab is if the receiving application requires the
data in a different format.

-Mark




Re: [GENERAL] Good Delimiter for copy command

2009-02-12 Thread Mark Roberts

On Thu, 2009-02-12 at 12:51 -0800, SHARMILA JOTHIRAJAH wrote:
 
 My data fields contain commas, tabs, '|', etc., so I can't use them as
 delimiters.  I need a unique, maybe non-printing, character to use as
 the delimiter...
 -Sharmila

Is this a theoretical problem or an actual one?  I haven't had any
problems with the default (tab).  In fact, copying from one database to
another is an exceedingly common task that I do, all done with tab.

dev=> create table foo (a text);
CREATE TABLE
Time: 385.967 ms
dev=> insert into foo values (' '); -- literal tab
INSERT 0 1
Time: 0.536 ms
dev=> insert into foo values ('\t'); -- special character, parsed.
INSERT 0 1
Time: 0.224 ms
dev=> insert into foo values ('\\t'); -- backslash, t
INSERT 0 1
Time: 0.183 ms
dev=> copy foo to stdout;
\t
\t
\\t
Time: 0.188 ms
dev=> select * from foo;
  a   
------
 \x09
 \x09
 \t
(3 rows)

Time: 0.239 ms
dev=> 
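
If the receiving application really does need something other than tab, the
delimiter can be set explicitly; a quick sketch (the pipe is an arbitrary
choice):

copy foo to stdout with delimiter '|';
copy foo from stdin with delimiter '|';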


-Mark




Re: [GENERAL] Pet Peeves?

2009-02-04 Thread Mark Roberts
On Thu, 2009-01-29 at 13:16 +, Gregory Stark wrote:
 I'm putting together a talk on PostgreSQL Pet Peeves for discussion at
 FOSDEM 2009 this year. I have a pretty good idea what some of them are of course,
 but I would be interested to hear if people have any complaints from personal
 experience. What would be most interesting is if you can explain an example of
 when the problem caused real inconvenience to you, since sometimes it's hard
 to see from a theoretical description where the real harm lies.
 
 So, what do people say? Is Postgres perfect in your world or does it do some
 things which rub you the wrong way?
 
 Feel free to respond on-list or if you prefer in personal emails. I do intend
 to use the ideas you give in my presentation so mark anything you wouldn't be
 happy to see in a slide at a conference some day.
 
 -- 
   Gregory Stark
   EnterpriseDB  http://www.enterprisedb.com
   Ask me about EnterpriseDB's Slony Replication support!
 

A couple of things weighing on my mind right now (I could be
utterly wrong about all of them):

- In-place upgrade.  Seriously, 3.5TB DBs make me cry...
- Lack of good documentation for the binary copy input format for 'copy
foo from stdin with binary'.  Also, I don't *seem* to be able to use the
same file that I copied out for copy in, which was a total surprise.
- The fetch time with lots of byteas is really bad - selecting them out
effectively *forces* scattered IO, even in what would normally be seq
IO.  It would be really nice if you did all the grabbing of rows that
was required and then *at the end* fetched the appropriate bytea fields
and re-aggregated them appropriately.  This is a *HUGE* performance
killer.
- Bytea copy input format is *exceedingly* large - and the dual-parser
thing requiring two backslashes doesn't help!  WTF, I have got to be
missing something.
Consider the case where I want to write an int16_t.  What should be
sprintf(s, "\\%o", i);

becomes

sprintf(s, "\\\\%03o\\\\%03o", (i & 0x00FF), ((i & 0xFF00) >> 8));
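
For scale, a rough SQL illustration of the blow-up being described
(escape-format input; the column aliases are just for this example):

-- one raw byte of bytea costs five characters of escape-format copy input,
-- because copy and the bytea input routine each consume a layer of backslashes
select octet_length(E'\\001'::bytea) as stored_bytes,     -- 1
       length(E'\\\\001')            as copy_input_chars; -- 5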

- Query planning with heavy partitioning takes a huge hit, and this
isn't helped by multiple cores on the same box.  It would be very nice
if subpartitions could simply be ignored when their parent partition
isn't required, but the planner still takes locks on them.
- The ability to add a table to the inheritance structure without
obtaining an access exclusive lock would be really nice.
- The ability to rebuild a table or index concurrently would be nice,
especially if it automatically picked up interim changes and applied
them before switching out and dropping the table.
- Slony is really too slow to use for large quantities of data shipping.
IIRC we had to move off of it when the DB was still sub 1 TB.
- Lots of temp table creation/dropping plays havoc with the catalog
tables and eventually requires a full maintenance window to resolve.
- Creating an empty table with foreign keys requires an access exclusive
lock on all tables.  Blargh.
- It'd be nice if the query planner were more stable - sometimes queries
run fast, and then sometimes they randomly take 2 hours for a delete that
normally runs in a couple of minutes.

There's (a lot) more, but I can't recall it all because overall I'm
pretty happy with Postgres.

-Mark




Re: [GENERAL] Favorite Tom Lane quotes

2008-12-02 Thread Mark Roberts

On Tue, 2008-12-02 at 10:40 +, Grzegorz Jaśkiewicz wrote:
 but then you have to add the cost of backup and restore labour time,
 and that's going to be a massive latency - if you ask me.

Of course it is, but latency probably isn't really the key issue - the
point is more that the data itself isn't lost.

-Mark




Re: [GENERAL] general table stats, ideas ?

2008-10-07 Thread Mark Roberts


 My problem is, I need to benchmark a set of tables where - we can
 assume - the schema of each table is unknown, and we have no assumptions
 about any particular fields being present there (although, if there is
 no other way to do it, we could assume id bigint not null default
 nextval('someseq')).  Basically, I need to know when a certain row was
 selected (read), and when it was updated/created (insert).  For that I
 need two fields.  And although the latter could be done, and I more or
 less know how to do it (I think it can be done with a rule, and default
 = now()), I have yet to find a way to do the same thing for select. 

So basically you've got a Slony-replicated database, and you want to
check (on a row-by-row level) when something gets read/inserted/deleted?

It seems like you would want to add three fields to each table:
last_read_time, last_update_time, and original_insert_time

Then you restrict all access to the table and use security definer
functions to allow access.  These functions would also update said
metadata on the table.  If you need an ongoing log of access to the
tables, you could always add an access log table that looked like:

User (postgres/MYUSER) / Action (Select/Update/Insert) / Column
(some_column_name) / New Value (blah-value)

Erm, that's if I understand your question right. :-/
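
A minimal sketch of that approach (all table, column, and role names here
are hypothetical), using a plpgsql security definer function as the only
read path:

create or replace function get_foo(p_id bigint)
returns setof foo as $$
declare
    r foo%rowtype;
begin
    -- record the read as a side effect
    update foo set last_read_time = now() where id = p_id;

    insert into foo_access_log (username, action, row_id, logged_at)
    values (current_user, 'select', p_id, now());

    for r in select * from foo where id = p_id loop
        return next r;
    end loop;
    return;
end;
$$ language plpgsql security definer;

revoke all on foo, foo_access_log from public;   -- force access through the function
grant execute on function get_foo(bigint) to app_role;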

-Mark





Re: [GENERAL] SERIAL datatype

2008-08-25 Thread Mark Roberts

On Thu, 2008-08-21 at 13:53 -0600, Scott Marlowe wrote:
 Regular SERIAL type is limited to a 32 bit int.  BIGSERIAL uses a 64
 bit int.

I think one of the things that would be off-putting is the size
difference between the two types (32 vs 64 bits; 5 foreign keys and a
billion rows or so makes for a lot of pain).

-Mark




Re: [GENERAL] SERIAL datatype

2008-08-25 Thread Mark Roberts

On Mon, 2008-08-25 at 11:02 -0600, Scott Marlowe wrote:
 Well, of course a 64 bit int is gonna be bigger than a 32 bit, but
 with alignment issues and on 64 bit hardware, I'm guessing the
 difference isn't exactly twice as slow / twice as much storage.  And
 it's way faster than a GUID which was what I think started this
 thread.

I took a slice of data from our dev box and generated the same table
twice, once using integers and once using bigints.  For reference, the
schema is:
bigint table:
 Type   | Modifiers 
---------+-----------
bigint  | 
date| 
bigint  | 
bigint  | 
bigint  | 
bigint  | 
bigint  | 
date| 
date| 
bytea   | 
integer | 
integer | 
numeric | 
numeric | 
numeric | 
integer | 
integer | 
integer | 
integer | 
integer | 
integer | 
integer | 
bytea   | 

int table:
 Type   | Modifiers 
---------+-----------
bigint  | 
date| 
integer | 
integer | 
integer | 
integer | 
integer | 
date| 
date| 
bytea   | 
integer | 
integer | 
numeric | 
numeric | 
numeric | 
integer | 
integer | 
integer | 
integer | 
integer | 
integer | 
integer | 
bytea   | 

The integer version is 599752704 bytes and the bigint version is
673120256 bytes (a ~12% size increase).  When joining each table to
itself (keys = 1 date, 5 (big)ints, no indexes), the bigint version
averages 44.1 sec per join and the integer version 29.6 sec (a 48%
performance hit).
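
If anyone wants to reproduce this sort of comparison, something along
these lines is all it takes (the table names are hypothetical; \timing in
psql captures the join times):

select pg_size_pretty(pg_relation_size('fact_int'))    as int_version,
       pg_size_pretty(pg_relation_size('fact_bigint')) as bigint_version;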

While granted that it's not twice as big and twice as slow, I think it's
a fairly valid reason to want to stay within (small)int ranges.
Sometimes the initial performance hit on insert would really be worth
the continuing space/performance savings down the road.

Of course, this wasn't very scientific and the benchmarks aren't very
thorough (for instance I assumed that bigserial is implemented as a
bigint), but it should remain a valid point.

Of course, it probably has no bearing on the OP's problem.  So my advice
to the OP: have you considered not keying such a volatile table on a
serial value?

-Mark




Re: [GENERAL] What's size of your PostgreSQL Database?

2008-08-21 Thread Mark Roberts

On Thu, 2008-08-21 at 22:17 +0800, Amber wrote:
 Another question: how many people are there maintaining this huge database?
 We have about 2T of compressed SAS datasets, and are now considering loading
 them into an RDBMS database.
 According to your experience, it seems a single PostgreSQL instance can't
 manage databases of that size well - is that right?

Yahoo has a 2PB single-instance Postgres database (modified engine), but
the biggest pure Pg single instance I've heard of is 4TB.  The 4TB
database has the additional interesting property that they've done none
of the standard scalable architecture changes (such as partitioning,
etc).  To me, this is really a shining example that even naive Postgres
databases can scale to as much hardware as you're willing to throw at
them.  Of course, clever solutions will get you much more bang for your
hardware buck.

As for my personal experience, I'd say that the only reason that we're
currently running a dual Pg instance (Master/Replica/Hot Standby)
configuration is for report times.  It's really important to us to have
snappy access to our data warehouse.  During maintenance our site and
processes can easily be powered by the master database with some
noticeable performance degradation for the users.

The grid that we (I) am looking to build is coming out of changing
(yet ever static!) business needs: we're looking to immediately get 2x
the data volume and soon need to scale to 10x.  Couple this with
increased user load and the desire to make reports run even faster than
they currently do and we're really going to run up against a hardware
boundary.  Besides, writing grid/distributed databases is *fun*!

Uh, for a one sentence answer: A single Pg instance can absolutely
handle 2+ TB without flinching.

 How many CPU cores and memory does your server have :) 

My boss asked me not to answer the questions I missed... sorry.  I will
say that the hardware is pretty modest, but has good RAM and disk space.

-Mark




Re: [GENERAL] What's size of your PostgreSQL Database?

2008-08-20 Thread Mark Roberts

 Just out of curiosity, how do you replicate that amount of data?

When I started working here, we used Slony-I to replicate our aggregate
fact tables.  A little over a year ago our data volume had grown to the
point that Slony was regularly unable to keep up, and around that time
Slony hit us with some crazy rollback-based data-loss bug.  We elected to
move our aggregate fact tables off of Slony, but left the metadata tables
on it.

So I wrote a custom in-house replication engine into our aggregation
process.  Replication is accomplished in parallel via piped copy
statements (a rough sketch follows the list below), and so far it's been
fast enough to keep up with the data volume.  Without getting into too
much detail, an import process might look like this:

- Obtain and transform data into a usable state
- Import Data, and earmark it to be reaggregated
- Reaggregate (summarize) the changed data
- Copy the new data to all databases in the replication set
- Integrate new data into the aggregate fact tables
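
The per-table copy step mentioned above might look roughly like this
(hypothetical table name; the two statements run against different
servers, with the source's output piped straight into the replica's input):

copy agg_fact_2009_w07 to stdout;    -- on the source database
copy agg_fact_2009_w07 from stdin;   -- on each replica, fed from that pipe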

 I'd be very interested in a developer's view of running and maintaining a
 database this size -
 mostly, what choices made during development might have been
 different on a smaller database.
 I'm also curious about the maintenance needed to keep a database this
 size healthy over time.

Ok, so all of this comes from what might be termed an ETL developer's
point of view.  I pretty much only work on the data warehouse's
import/summarization process and look into performance issues.

Design:
- Natural keys are pretty well strictly forbidden.  Don't make a VARCHAR
your primary key, please. ;-)
- The data determines partition criteria, and the application layer is
partition aware (a sketch of our partitioning scheme follows this list).
- It's important to have aggregate tables to support common queries.
Joins are ok, but repeatedly aggregating thousands of rows together on
the fly really takes too much time.
- Aggregation processes have to be designed with care.
- Parallel processing is important, especially if you ever have to
reprocess large amounts of data (for example due to incorrect initial
data)
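
Here is the minimal sketch of the week-based partitioning promised above
(all names are hypothetical): plain inheritance plus CHECK constraints, so
constraint_exclusion can skip irrelevant children and each week can be
clustered, backed up, or archived on its own.

create table agg_fact (
    fact_date date    not null,
    key_id    bigint  not null,
    value     numeric
);

create table agg_fact_2008_w34 (
    check (fact_date >= date '2008-08-18'
       and fact_date <  date '2008-08-25')
) inherits (agg_fact);

-- the partition-aware application layer loads the children directly
copy agg_fact_2008_w34 from stdin;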



Maintenance:
- Autovacuum might be more trouble than it's worth.  We frequently have
mysteriously hung queries that are eventually traced back to being
blocked by autovacuum.  The application layer knows exactly what is
changing and when... it could *probably* take over this duty.
- Pg upgrades are a major PITA and require absurd amounts of downtime
for the data processing part of the warehouse.
- Queries that have been working for long periods of time and suddenly
stop working or hang are usually broken by statistics issues.
- Partitioning is important because it allows the aggregate tables to be
clustered, backed up, and archived individually



If anyone wants to chat with me or direct me to resources about running
Postgres on distributed file systems (or other grid computing solutions)
please let me know.  Yes, I'm aware of the proprietary solutions, but
they've been ruled out on account of performance problems with
aggregates and absurd licensing costs.  Also, any grid computing
solution that I write can easily be applied across the company and thus
save us (quite literally) millions per year.

Again, if you'd like more information about any particular topic, just
ask. :)


-Mark




Re: [GENERAL] SELECT query experts, anyone?

2008-08-20 Thread Mark Roberts
If you put this in the application, you could do something such as:

my @manufacturers = fetch("select manufacturer_no, name from manufacturers");
my @select_fields  = ('product_no');
foreach my $manufacturer (@manufacturers) {
    my $manuf_no = $manufacturer->{manufacturer_no};
    my $name     = $manufacturer->{name};
    push(@select_fields, "case when x.manufacturer_no = $manuf_no then "
        . "'$name' else null end as manuf_${manuf_no}_products");
}

# max() rather than sum(), since the pivoted values are text
my @outer_select_fields = ('product_no', map {
    my $manuf = "manuf_" . $_->{manufacturer_no} . "_products";
    "max($manuf) as $manuf";
} @manufacturers);

my @dataset = fetch(qq{
    select @{[ join(",\n", @outer_select_fields) ]}
    from (
        select @{[ join(",\n", @select_fields) ]}
        from products_by_manufacturer x
    ) x
    group by product_no
});

Uh, or something like that.  Perl in Evolution is really... painful.
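
For illustration, with two hypothetical manufacturers (1 = 'Acme',
3 = 'Globex') the generated query would come out roughly like:

select product_no,
       max(manuf_1_products) as manuf_1_products,
       max(manuf_3_products) as manuf_3_products
from (
    select product_no,
           case when x.manufacturer_no = 1 then 'Acme'   else null end as manuf_1_products,
           case when x.manufacturer_no = 3 then 'Globex' else null end as manuf_3_products
    from products_by_manufacturer x
) x
group by product_no;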

-Mark

On Wed, 2008-08-20 at 15:50 +0300, Teemu Juntunen wrote:
 Hi Experts,
  
 is there any way to SELECT values in columns instead of rows? For
 example, select products and their manufacturers in the following way:
  
 product1; manufacturer1; manufacturer2; ... manufacturerN
 product2; manufacturer3;
 product3; manufacturer1; ... manufacturerN-1
  
 With a function you could put the product manufacturers in one string,
 but I would like to have them in columns.  
  
 How about arrays?  Is there a way to SELECT values from an array into
 columns?
  
 Best regards and thanks,
 Teemu Juntunen




Re: [GENERAL] What's size of your PostgreSQL Database?

2008-08-19 Thread Mark Roberts

On Tue, 2008-08-19 at 07:34 -0400, Bill Moran wrote:
 
 In theory, you can have so many disks that the bottleneck moves to some
 other location, such as the IO bus or memory or the CPU, but I've never
 heard of that happening to anyone.  Also, you want to get fast,
 high-quality disks, as ten 15,000 RPM disks are going to perform better
 than ten 7,200 RPM disks.

I've personally experienced this happening.

-Mark




Re: [GENERAL] What's size of your PostgreSQL Database?

2008-08-19 Thread Mark Roberts

On Sat, 2008-08-16 at 11:42 +0800, Amber wrote: 
 Dear all:
 We are currently considering using PostgreSQL to host a read-only 
 warehouse; we would like to get some experiences, best practices and 
 performance metrics from the user community.  Following is the question list:
 1. What's size of your database?
 2. What Operating System are you using?
 3. What level is your RAID array?
 4. How many cores and memory does your server have?
 5. What about your performance of join operations?
 6. What about your performance of load operations?
 7. How many concurrent readers of your database, and what's the average 
 transfer rate, supposing all readers are doing a table scan?
 8. Single instance or a cluster, what cluster software are you using if you 
 have a cluster?
 
 Thank you in advance!

1. 2.5-3TB, plus several others that are of fractional size.


...


5. They do pretty well, actually.  Our aggregate fact tables regularly
join to metadata tables, and we have an average query return time of
10-30s.  We do make some use of denormalized mviews for
chained/hierarchical metadata tables.

6. Load/copy operations are extremely performant.  We pretty well
constantly have 10+ concurrent load operations going with 2-3
aggregation processes.

7. About 50, but I'm not sure what the transfer rate is.

8. We have a master and a replica.  We have plans to move to a
cluster/grid Soon(TM).  It's not an emergency and Postgres can easily
handle and scale to a 3TB database on reasonable hardware ($30k).

A few notes: our database really can be broken into a very typical ETL
database: medium/high input (write) volume with low latency access
required.  I can provide a developer's view of what is necessary to keep
a database of this size running, but I'm under no illusion that it's
actually a large database.

I'd go into more details, but I'd hate to be rambling.  If anyone's
actually interested about any specific parts, feel free to ask. :)

Also, if you feel that we're doing something wrong, feel free to
comment there too. :)

-Mark




Re: [GENERAL] Cause of error message?

2008-07-14 Thread Mark Roberts

On Mon, 2008-07-14 at 19:44 -0500, Bayless Kirtley wrote:
 An old app is giving a new error message. I query a table and make one
 update. Later I come back to query the table again and get this error:
  
 ERROR: current transaction is aborted, commands ignored until end of
 transaction block
  
 The error only occurs after I have made an update. As long as no
 updates, I can query multiple times. Can anyone give me an idea of
 what sort of actions or errors usually cause this error?
  
 Thanks,
 Bayless
  

I believe this is caused by an earlier error (a syntax error, for
example) in the same transaction.  Do you have any error handling code
that might be hiding the original error from you?

-Mark




Re: [GENERAL] limits?

2008-06-23 Thread Mark Roberts

On Mon, 2008-06-23 at 11:21 -0700, Steve Atkins wrote:
 
 
 http://www.postgresql.org/about/ has some of the theoretical limits.
 
 In real use you're unlikely to hit any limits, theoretical or practical,
 but if you start to use a silly number of tables and so on you're likely
 to hit performance issues eventually.  I'm not sure where that threshold
 would be, but it's higher than thousands.
 
 Cheers,
Steve

I'm just a developer (my DBA would have more details on the settings
that needed adjusting), but I will attest to it being reasonably easy to
hit database limits.

For us, it was related to having a reasonably small number of (large)
aggregate tables that are partitioned (inherited with check constraints)
by week.

The real problem wasn't the absolute limit on tables (IIRC) so much as
accessing the data in a parallel manner from the parent tables in bulk
update transactions (the parallelism was probably what pushed it over the
top).

The limit is absurdly high, and by the time you hit it, you'll probably
have a really good idea of how to overcome it.  Really, there are lots of
challenges you'll overcome before that point (IMO).

Of course all this is anecdotal, and you should take it with a grain of
salt. :)

-Mark




Re: [GENERAL] Update Join Query

2008-06-23 Thread Mark Roberts
update foo
set foo_id = g.goo_id
from goo g
where foo.foo_id = g.goo_id and foo.foo_id is not null

I think. :)

-Mark

On Mon, 2008-06-23 at 21:43 +0200, Daniel Futerman wrote:
 Hi,
 
 Looking for the correct syntax for an UPDATE LEFT JOIN query in
 PostgreSQL.
 
 The equivalent MySQL query is :
 
 UPDATE
 Foo f LEFT JOIN Goo g on f.Foo_ID = g.Goo_ID
 SET
 f.Foo_ID = g.Goo_ID
 WHERE
 f.Foo_ID IS NOT NULL;
 
 
  When I try to run this in Postgres, I get the following error:
 
 ERROR:  syntax error at or near LEFT
 
 Is it possible to have UPDATE JOIN queries in PostgreSQL?
 
 Thanks.

