[GENERAL] PG_Bulkloader (or other tools) and partitions

2009-05-31 Thread Wojtek

Hi,

I'm looking for a solution to speed up my data load process. Currently I 
load up to 100 inserts/second into a partitioned table, and the application 
inserts the data directly into the partitions (not into the master table).
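
For reference, this is roughly what happens today, and the kind of 
per-partition bulk load I would like to end up with, whether via COPY or 
pg_bulkload (table and column names below are made up):

-- current approach: individual INSERTs aimed straight at the child
-- table that holds the current day's data
INSERT INTO measurements_2009_05_31 (ts, device_id, value)
VALUES (now(), 'dev_42', 17);

-- what I am hoping a bulk loader can give me: one bulk load per
-- partition per batch, i.e. the equivalent of
COPY measurements_2009_05_31 (ts, device_id, value)
FROM '/var/load/batch_2009_05_31.csv' WITH CSV;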


As a solution, I wanted to explore pg_bulkload, but as I understand it 
the destination table has to be specified in the config file, so:
- I can't make it write directly to the partitions (I need to write 
data to several partitions every day)
- I doubt pg_bulkload will apply the rules/constraints that route data 
to the right partitions automatically


Can anyone share good practices for loading data into a partitioned table, 
please? Not necessarily using pg_bulkload...


Regards,
foo



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Pl/java in 8.4 beta1 sources compilation failed

2009-05-31 Thread Craig Ringer
Kris Jurka wrote:
> Craig Ringer wrote:
> 
>> Perhaps a stupid question, but isn't the `-source' parameter to javac
>> intended to mask new features and such for just the purpose of compiling
>> older sources on a new JDK?
>>
> 
> The -source argument only controls language features, not
> interface/class definitions. [snip]

Ah. Thanks for that very enlightening and helpful explanation - I really
appreciate your taking the time.

It's interesting that the JDK presents these issues to driver
developers, but I see that there's no easy way around it: Java offers
no way to declare default implementations of methods in interfaces
(in other words, no multiple inheritance), so the JDBC interfaces can't
provide stubs.

Argh, if only Java had scope-based object lifetime with prompt
finalization (think C++ "stack-based" objects) and multiple inheritance...

--
Craig Ringer



[GENERAL] newbie table design question

2009-05-31 Thread Andrew Smith
Hi all,

I'm a beginner when it comes to Postgresql, and have a table design question
about a project I'm currently working on.  I have 1500 data items that need
to be copied every minute from an external system into my database.  The
items have a timestamp, an identifier and a value.  For example:

12/10/2008 05:00   ID_ABC  14
12/10/2008 05:01   ID_ABC  17
12/10/2008 05:02   ID_ABC  13

Pretty simple stuff.  The current solution (implemented using SQL Server a
few years ago) looks like this (an approximation using Postgresql syntax):

CREATE TABLE "DataImport"
(
  "DataImportID" serial NOT NULL PRIMARY KEY,
  "Time" timestamp without time zone NOT NULL,
  "ID_ABC" integer NOT NULL,
  "ID_DEF" integer NOT NULL,
  "ID_HIJ" integer NOT NULL,
  etc
);

While this design results in only 14400 records being present in the table
per day, I don't like it.  One problem is that every time a data item is
added or removed from the import process, the structure of the table needs
to be altered.  I don't know what sort of overhead that involves in the
Postgresql world, but I'm thinking that once I define the structure of the
table, I don't want to be messing with it very often, if at all.

My initial thought for the design of the new solution looks like this:

CREATE TABLE "DataImport"
(
  "DataImportID" serial NOT NULL PRIMARY KEY,
  "Time" timestamp without time zone NOT NULL,
  "Identifier" text NOT NULL,
  "Value" integer NOT NULL
);

Users will then be doing regular queries on this data (say, a few hundred
times per day), such as:

SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC' AND
"Time" between '2008-11-07' and '2008-11-11';

My concern is that 1500 values * 14400 minutes per day = 21,600,000
records.  Add this up over the course of a month (the length of time I need
to keep the data in this table) and I'll have more than half a billion
records being stored in there.

I guess my question is: is my approach reasonable?  I haven't dealt with
tables of this size before (using any DBMS) - should I expect really slow
queries due to the sheer number of records present?  Is there some better
way I should be structuring my imported data?  All suggestions welcome.

Cheers,

Andrew


Re: [GENERAL] newbie table design question

2009-05-31 Thread Tom Lane
Andrew Smith  writes:
> I'm a beginner when it comes to Postgresql, and have a table design question
> about a project I'm currently working on.  I have 1500 data items that need
> to be copied every minute from an external system into my database.  The
> items have a timestamp, an identifier and a value.  For example:

> 12/10/2008 05:00   ID_ABC  14
> 12/10/2008 05:01   ID_ABC  17
> 12/10/2008 05:02   ID_ABC  13

> Pretty simple stuff.  The current solution (implemented using SQL Server a
> few years ago) looks like this (an approximation using Postgresql syntax):

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "ID_ABC" integer NOT NULL,
>   "ID_DEF" integer NOT NULL,
>   "ID_HIJ" integer NOT NULL,
>   etc
> );

So the table would have ~1500 columns?  You definitely don't want to do
it that way in Postgres --- you'd be way too close to the maximum column
count limitation.

> My initial thought for the design of the new solution looks like this:

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "Identifier" text NOT NULL,
>   "Value" integer NOT NULL
> );

DataImportID couldn't be a primary key here, could it?  Or is it just
meant as an artificial primary key?  If so, consider not bothering with
it at all --- (Time, Identifier) seems like a perfectly good natural
key, and with such short records the extra space for a serial column
is not negligible.
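
Something like this, just as a sketch of the same table without the
serial column:

CREATE TABLE "DataImport"
(
  "Time" timestamp without time zone NOT NULL,
  "Identifier" text NOT NULL,
  "Value" integer NOT NULL,
  PRIMARY KEY ("Time", "Identifier")
);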

Anyway, my answer would definitely lean towards using this normalized
representation, if all the data values are integers.  (If they're not,
it gets messier...)

> Users will then be doing regular queries on this data (say, a few hundred
> times per day), such as:

> SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC' AND
> "Time" between '2008-11-07' and '2008-11-11';

An index on (Identifier, Time) (not the other way around) should work
well for that type of query.
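
For instance, a sketch against the table definition quoted above:

CREATE INDEX "DataImport_Identifier_Time_idx"
  ON "DataImport" ("Identifier", "Time");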

> My concern is that 1500 values * 14400 minutes per day = 21,600,000
> records.  Add this up over the course of a month (the length of time I need
> to keep the data in this table) and I'll have more than half a billion
> records being stored in there.

That's not an especially big table.  However it might be worth your
trouble to use partitioning.  Not to speed searches, particularly, but
to make it easier to drop 1/30th of the data each day.
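
A sketch of the usual inheritance-based setup (child table names here
are hypothetical; each child needs its own indexes, and inserts have to
be aimed at the right child):

CREATE TABLE "DataImport_2008_11_07"
(
  CHECK ("Time" >= '2008-11-07' AND "Time" < '2008-11-08')
) INHERITS ("DataImport");

CREATE INDEX "DataImport_2008_11_07_idx"
  ON "DataImport_2008_11_07" ("Identifier", "Time");

-- retiring the oldest day is then just:
DROP TABLE "DataImport_2008_10_08";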

regards, tom lane



Re: [GENERAL] INTERVAL SECOND limited to 59 seconds?

2009-05-31 Thread Tom Lane
Sebastien FLAESCH  writes:
> I would expect that an INTERVAL SECOND can store more that 59 seconds.

I took a look into the SQL spec and I think that we do indeed have a
spec compliance issue here.  SQL99 section 4.7 saith

 Within a value of type interval, the first field is constrained
 only by the <interval leading field precision> of the associated
 <interval qualifier>. Table 8, "Valid values for fields in INTERVAL
 values", specifies the constraints on subsequent field values.
 [ Table 8 says about what you'd expect, eg 0..23 for HOUR ]
 Values in interval fields other than SECOND are integers and have
 precision 2 when not the first field. SECOND, however, can be
 defined to have an <interval fractional seconds precision> that
 indicates the number of decimal digits maintained following the
 decimal point in the seconds value. When not the first field,
 SECOND has a precision of 2 places before the decimal point.

So in other words, "999 seconds" is a valid value for a field of type
INTERVAL SECOND, *and should come out the same way*, not as "00:16:39",
and certainly not as "00:00:39".
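
To make that concrete, a sketch of the case at hand (the comment shows
what the spec calls for, not necessarily what the current code prints):

SELECT INTERVAL '999' SECOND;
-- per SQL99 this should come back as 999 seconds, rather than being
-- truncated to the 0..59 range or normalized into minutes and seconds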

It might be a relatively easy fix to not truncate the input value
incorrectly.  I haven't looked, but I think we should look now, because
8.4 has already changed the behavior in this area and it would be good
not to change it twice.  The focus of the 8.4 work was to make sure that
we would correctly interpret the values of spec-compliant interval
literals, but this example shows we are not there yet.

We are fairly far away from being able to make it print out as the spec
would suggest, because interval_out simply doesn't have access to the
information that the field is constrained to be INTERVAL SECOND rather
than some other kind of interval.  We also have got no concept at all of
<interval leading field precision>, only of <interval fractional seconds
precision>, so constraining the leading field to only a certain number
of integral digits isn't possible either.  I don't foresee anything
getting done about either of those points for 8.4.

regards, tom lane



[GENERAL] Foreign key verification trigger conditions

2009-05-31 Thread j-lists
I have an update statement that affects every row in a given table.
For that table it changes the value in a single column, which itself
has a foreign key constraint. The table has an additional 9 foreign
keys, some of which reference large tables.
My expectation would be that only the changed column would be checked
against the foreign key of interest; instead I find that all the
foreign keys are checked when this statement is executed.
I decided to create a simple test case to demonstrate this behaviour,
but what I found was strange. The first time I created the test case,
the behaviour matched my experience, but the second time I created it,
the behaviour was as I would have expected. This is the result I
am experiencing with the unnecessary foreign key verification:
testdb=# explain analyze update t1 set B = 1;
   QUERY PLAN
-
 Seq Scan on t1  (cost=0.00..25.10 rows=1510 width=22) (actual
time=0.026..0.029 rows=1 loops=1)
 Trigger for constraint fk1: time=0.111 calls=1
 Trigger for constraint fk2: time=0.014 calls=1
 Total runtime: 0.259 ms
(4 rows)

Only fk1's column is being updated, not fk2's.

Both sessions are below. Any feedback on this and how to avoid it is
appreciated, as well as whether the developers would consider this a
bug; I am inclined to believe so, as it hurts performance.

Thanks,
-J


*


testdb=# create table t1 (A BIGINT, B BIGINT);
CREATE TABLE
testdb=# create table t2 (B BIGINT PRIMARY KEY);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"t2_pkey" for table "t2"
CREATE TABLE
testdb=# alter table t1 add constraint fk1 foreign key  (B) references t2 (B);
ALTER TABLE
testdb=# explain analyze insert into t2 values (1);
 QUERY PLAN

 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.008..0.010
rows=1 loops=1)
 Total runtime: 45.508 ms
(2 rows)

testdb=# explain analyze insert into t1 values (1, 1);
 QUERY PLAN

 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.008..0.010
rows=1 loops=1)
 Trigger for constraint fk1: time=0.256 calls=1
 Total runtime: 0.345 ms
(3 rows)

testdb=# explain analyze update t1 set A = 2;
   QUERY PLAN
-
 Seq Scan on t1  (cost=0.00..27.70 rows=1770 width=14) (actual
time=0.019..0.022 rows=1 loops=1)
 Total runtime: 0.125 ms
(2 rows)

testdb=# explain analyze update t1 set B = 1;
   QUERY PLAN
-
 Seq Scan on t1  (cost=0.00..27.70 rows=1770 width=14) (actual
time=0.021..0.024 rows=1 loops=1)
 Total runtime: 0.115 ms
(2 rows)

testdb=# explain analyze update t1 set B = 1;
   QUERY PLAN
-
 Seq Scan on t1  (cost=0.00..27.70 rows=1770 width=14) (actual
time=0.021..0.024 rows=1 loops=1)
 Total runtime: 0.113 ms
(2 rows)

testdb=# explain analyze insert into t2 values (2);
 QUERY PLAN

 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.008..0.010
rows=1 loops=1)
 Total runtime: 0.120 ms
(2 rows)

testdb=# explain analyze update t1 set B = 2;
   QUERY PLAN
-
 Seq Scan on t1  (cost=0.00..27.70 rows=1770 width=14) (actual
time=0.020..0.024 rows=1 loops=1)
 Trigger for constraint fk1: time=0.112 calls=1
 Total runtime: 0.233 ms
(3 rows)

testdb=# explain analyze update t1 set A = 99;
   QUERY PLAN
-
 Seq Scan on t1  (cost=0.00..27.70 rows=1770 width=14) (actual
time=0.021..0.025 rows=1 loops=1)
 Total runtime: 0.117 ms
(2 rows)

testdb=# alter table t1 add b2 bigint;
ALTER TABLE
testdb=# alter table t1 add constraint fk2 foreign key  (B2) references t2 (B);
ALTER TABLE
testdb=# explain analyze update t1 set B = 1;
   QUERY PLAN
-
 Seq Scan on t1  (cost=0.00..25.10 rows=1510 width=22) (actual
time=0.026..0.029 rows=1 loops=1)
 Trigger for constraint fk1: time=0.111 calls=1
 Trigger for constraint fk2: time=0.014 calls=1
 Total runtime: 0.259 ms
(4 rows)

[GENERAL] Query to find Foreign Key column data type mismatch

2009-05-31 Thread Gurjeet Singh
Hi All,

I wanted to find out whether both ends of a foreign key reference use
the same data type. In older versions, mismatched data types can lead to
Postgres not picking the appropriate index, and in newer versions matching
types are still beneficial, since they avoid going through conversion
functions for comparisons and so on.

So I devised this query and the helper function for the purpose. Just
create the function and then run the query. This query checks all the
foreign keys in the database.
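
The real query and helper function are in the attachment; purely as a
sketch of the idea (this is not the attached file), the check boils down
to comparing the types of the paired columns in pg_constraint's
conkey/confkey arrays:

SELECT c.conname             AS constraint_name,
       c.conrelid::regclass  AS fk_table,
       fa.attname            AS fk_column,
       format_type(fa.atttypid, fa.atttypmod) AS fk_type,
       c.confrelid::regclass AS pk_table,
       pa.attname            AS pk_column,
       format_type(pa.atttypid, pa.atttypmod) AS pk_type
FROM pg_constraint c
     CROSS JOIN generate_series(1, 32) AS pos  -- 32 >= max key columns
     JOIN pg_attribute fa ON fa.attrelid = c.conrelid
                         AND fa.attnum = c.conkey[pos]
     JOIN pg_attribute pa ON pa.attrelid = c.confrelid
                         AND pa.attnum = c.confkey[pos]
WHERE c.contype = 'f'
  AND fa.atttypid <> pa.atttypid;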

Feedback/comments welcome.

Best regards,
-- 
Lets call it Postgres

EnterpriseDB  http://www.enterprisedb.com

gurjeet[.sin...@enterprisedb.com
singh.gurj...@{ gmail | hotmail | indiatimes | yahoo }.com
Mail sent from my BlackLaptop device


Postgres_find_FKey_data_type_mismatch.sql
Description: Binary data



Re: [GENERAL] newbie table design question

2009-05-31 Thread Andrew Smith
On Mon, Jun 1, 2009 at 1:25 AM, Tom Lane  wrote:

> Andrew Smith  writes:
> > I'm a beginner when it comes to Postgresql, and have a table design
> question
> > about a project I'm currently working on.  I have 1500 data items that
> need
> > to be copied every minute from an external system into my database.  The
> > items have a timestamp, an identifier and a value.  For example:
>
> > 12/10/2008 05:00   ID_ABC  14
> > 12/10/2008 05:01   ID_ABC  17
> > 12/10/2008 05:02   ID_ABC  13
>
> > Pretty simple stuff.  The current solution (implemented using SQL Server
> a
> > few years ago) looks like this (an approximation using Postgresql
> syntax):
>
> > CREATE TABLE "DataImport"
> > (
> >   "DataImportID" serial NOT NULL PRIMARY KEY,
> >   "Time" timestamp without time zone NOT NULL,
> >   "ID_ABC" integer NOT NULL,
> >   "ID_DEF" integer NOT NULL,
> >   "ID_HIJ" integer NOT NULL,
> >   etc
> > );
>
> So the table would have ~1500 columns?  You definitely don't want to do
> it that way in Postgres --- you'd be way too close to the maximum column
> count limitation.
>

After my original post, I found out that the current solution in SQL Server
actually had to be split into two different tables due to a similar
limitation.


>
>
> > My initial thought for the design of the new solution looks like this:
>
> > CREATE TABLE "DataImport"
> > (
> >   "DataImportID" serial NOT NULL PRIMARY KEY,
> >   "Time" timestamp without time zone NOT NULL,
> >   "Identifier" text NOT NULL,
> >   "Value" integer NOT NULL
> > );
>
> DataImportID couldn't be a primary key here, could it?  Or is it just
> meant as an artificial primary key?  If so, consider not bothering with
> it at all --- (Time, Identifier) seems like a perfectly good natural
> key, and with such short records the extra space for a serial column
> is not negligible.
>

You're right, I guess there doesn't seem to be much point having that
surrogate key in place...and it does take up space.  I did a quick test and
got the following figures:

1 million records with DataImportID = ~80 MB
1 million records without DataImportID = ~50 MB.

That'll certainly add up over time.


> Anyway, my answer would definitely lean towards using this normalized
> representation, if all the data values are integers.  (If they're not,
> it gets messier...)
>
> > Users will then be doing regular queries on this data (say, a few hundred
> > times per day), such as:
>
> > SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC'
> AND
> > "Time" between '2008-11-07' and '2008-11-11';
>
> An index on (Identifier, Time) (not the other way around) should work
> well for that type of query.


I'll give it a try.


>
>
> > My concern is that 1500 values * 14400 minutes per day = 21,600,000
> > records.  Add this up over the course of a month (the length of time I
> need
> > to keep the data in this table) and I'll have more than half a billion
> > records being stored in there.
>
> That's not an especially big table.  However it might be worth your
> trouble to use partitioning.  Not to speed searches, particularly, but
> to make it easier to drop 1/30th of the data each day.


Re-reading this I noticed that I had an extra 0 in one of my figures -
there are only 1440 minutes in a day, so my table gets down to the much more
manageable 2.16 million records per day instead of 21.6 million.  I'll have
a look into partitioning too, thanks for your help!

Cheers,

Andrew


[GENERAL] pg_dump & table space

2009-05-31 Thread Luca Ferrari
Hi,
is there a way to dump an entire database that uses a specific tablespace 
without having any reference to the tablespace in the SQL file? This can be 
useful when moving the database from one machine to another (one that does 
not have the tablespace). Is there any way to achieve that with pg_dump?
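
(For what it's worth, pg_dump has a --no-tablespaces switch, added in
8.4 if I recall correctly, that appears to do exactly this; a sketch,
with "mydb" standing in for the real database name:

pg_dump --no-tablespaces mydb > mydb_no_tablespaces.sql

pg_restore accepts the same switch, so a custom-format dump can also be
relocated at restore time; objects then end up in the target cluster's
default tablespace.)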

Thanks,
Luca

