Re: [GENERAL] Populating large DB from Perl script

2007-11-06 Thread Andy

Kynn Jones wrote:

I have a large database that needs to be built from
scratch roughly once every month.  I use a Perl script to do this.

The tables are very large, so I avoid as much as possible using
in-memory data structures, and instead I rely heavily on temporary
flat files.

I have solved this general problem in various ways, all of them
unwieldy (in the latest version, the script generates the serial ids
and uses Perl's so-called tied hashes to retrieve them when needed).

TIA!

kj


I have done this exact same thing.  I started with tied hashes, and even 
tried BerkeleyDB.  They only helped up to a point, where they got so big 
(a couple gig if I recall correctly) that they actually slowed things down. 
In the end I used a stored proc to do the lookup and insert.  In the 
beginning it's not as fast, but by the time the db hits 20 gig it's still 
going strong, whereas my BerkeleyDB was becoming painfully slow.  (I 
recently thought of trying an SQLite table, I've had good luck with them, 
they can get pretty big and still be very fast... but never got around 
to trying it.)


So... not really an answer (other than I used a stored proc), but I'd be 
interested in alternatives too.


-Andy

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] Populating large DB from Perl script

2007-11-05 Thread Rodrigo De León
On 11/3/07, Mikko Partio [EMAIL PROTECTED] wrote:
 On Nov 2, 2007 8:45 PM, Kynn Jones [EMAIL PROTECTED] wrote:
 It would be great if there were a stored-proc archive somewhere on the
 web where people could post their procedures. I know there are some
 code examples in the official documentation but they are few in
 number.

On a somewhat related note, what happened to the old PostgreSQL cookbook site?

Does anyone care to revive it?



Re: [GENERAL] Populating large DB from Perl script

2007-11-03 Thread Mikko Partio
On Nov 2, 2007 8:45 PM, Kynn Jones [EMAIL PROTECTED] wrote:

 PS: As an aside to the list: as a programmer, when I'm starting out in a
 language, I learn more than I can say from reading source code written
 by the experts, but for some reason I have had a hard time coming
 across expertly written PostgreSQL stored procedures, other than the
 occasional didactic snippet in the docs.  All these expertly written
 procedures seem to be very STORED away indeed!  If, on the contrary,
 it's just the case that I haven't looked in the right places, please
 hurl me a cluebrick!

It would be great if there were a stored-proc archive somewhere on the
web where people could post their procedures. I know there are some
code examples in the official documentation but they are few in
number.

Regards

M



Re: [GENERAL] Populating large DB from Perl script

2007-11-03 Thread Shane Ambler

Mikko Partio wrote:

On Nov 2, 2007 8:45 PM, Kynn Jones [EMAIL PROTECTED] wrote:


PS: As an aside to the list: as a programmer, when I'm starting out in a
language, I learn more than I can say from reading source code written
by the experts, but for some reason I have had a hard time coming
across expertly written PostgreSQL stored procedures, other than the
occasional didactic snippet in the docs.  All these expertly written
procedures seem to be very STORED away indeed!  If, on the contrary,
it's just the case that I haven't looked in the right places, please
hurl me a cluebrick!


It would be great if there were a stored-proc archive somewhere on the
web where people could post their procedures. I know there are some
code examples in the official documentation but they are few in
number.

Regards

M



It's not specific to stored procedures, but http://pgfoundry.org is a 
suitable location. SourceForge could also be a source of such projects, 
though finding the projects that have what you are after may be more difficult.


Projects aren't grouped by type, so you can't really search for which 
projects are SPs and which are external apps or plug-in modules.


Maybe that could be a feature request for pgfoundry - allow a type for the 
project: client app / stored proc / SQL script / plug-in.



I would point you to PostBooks:
http://sourceforge.net/projects/postbooks

Install this (or look through the example setup script) and have a look 
through the stored procs and functions that it uses.



--

Shane Ambler
[EMAIL PROTECTED]

Get Sheeky @ http://Sheeky.Biz



Re: [GENERAL] Populating large DB from Perl script

2007-11-02 Thread andy

Kynn Jones wrote:

Hi.  This is a recurrent problem that I have not been able to find a
good solution for.  I have a large database that needs to be built from
scratch roughly once every month.  I use a Perl script to do this.

The tables are very large, so I avoid as much as possible using
in-memory data structures, and instead I rely heavily on temporary
flat files.

The problem is the population of tables that refer to "internal" IDs
on other tables.  By "internal" I mean IDs that have no meaning
external to the database; they exist only to enable relational
referencing.  They are always defined as serial integers.  So the
script either must create and keep track of them, or it must populate
the database in stages, letting Pg assign the serial IDs, and query
the database for these IDs during subsequent stages.

I have solved this general problem in various ways, all of them
unwieldy (in the latest version, the script generates the serial ids
and uses Perl's so-called tied hashes to retrieve them when needed).

But it occurred to me that this is a generic enough problem, and that
I'm probably re-inventing a thoroughly invented wheel.  Are there
standard techniques or resources or Pg capabilities to deal with this
sort of situation?

TIA!

kj



(Sorry if this double posts, I wasn't subscribed the first time)

I have done this exact same thing.  I started with tied hashes, and even 
tried BerkeleyDB.  They only helped up to a point, where they got so big 
(a couple gig if I recall correctly) that they actually slowed things down. 
In the end I used a stored proc to do the lookup and insert.  In the 
beginning it's not as fast, but by the time the db hits 20 gig it's still 
going strong, whereas my BerkeleyDB was becoming painfully slow.  (I 
recently thought of trying an SQLite table, I've had good luck with them, 
they can get pretty big and still be very fast... but never got around 
to trying it.)


So... not really an answer (other than I used a stored proc), but I'd be 
interested in alternatives too.


-Andy





Re: [GENERAL] Populating large DB from Perl script

2007-11-02 Thread Kynn Jones
 So... not really an answer (other than I used a stored proc)

Actually, I'm interested in your solution.

Just to make sure I understood what you did: you bulk-populated (i.e.
with $dbh->do('COPY...'), $dbh->pg_putline(...), $dbh->pg_endcopy) the
referring tables, with their fkey constraints disabled; then you ran
stored procedure(s) that went through these referring tables and
filled in the missing fkeys; and finally you activated their fkey
constraints.  Is this right?
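For what it's worth, the workflow described above might look roughly like the following sketch, using the DBD::Pg COPY interface of that era (pg_putline/pg_endcopy). The table, column, and constraint names (ownership, personid, people, etc.) are made up for illustration, and the actual stored procedures would go where indicated:

```perl
# Hedged sketch of the bulk-load workflow: drop the fkey, COPY in the
# rows, run the fix-up procs, then re-add the constraint.  All names here
# are hypothetical; adapt to the real schema.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 });

# 1) drop the fkey constraint so the referring table can be bulk-loaded
$dbh->do('ALTER TABLE ownership DROP CONSTRAINT ownership_personid_fkey');

# 2) bulk-populate with COPY via the DBD::Pg line interface
my @rows;   # [personid, carid] pairs, assembled elsewhere from flat files
$dbh->do('COPY ownership (personid, carid) FROM STDIN');
for my $row (@rows) {
    $dbh->pg_putline(join("\t", @$row) . "\n");
}
$dbh->pg_endcopy;

# 3) ...run the stored procedure(s) that fill in the missing fkeys here...

# 4) re-add the constraint; Pg validates the existing rows at this point
$dbh->do('ALTER TABLE ownership ADD CONSTRAINT ownership_personid_fkey
            FOREIGN KEY (personid) REFERENCES people (personid)');
```

(Newer versions of DBD::Pg prefer pg_putcopydata/pg_putcopyend, but the shape of the workflow is the same.)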

I'm very much of a stored procedures dunce, so if the code for your
stored procedure is postable, please do.

TIA,

kj

PS: As an aside to the list: as a programmer, when I'm starting out in a
language, I learn more than I can say from reading source code written
by the experts, but for some reason I have had a hard time coming
across expertly written PostgreSQL stored procedures, other than the
occasional didactic snippet in the docs.  All these expertly written
procedures seem to be very STORED away indeed!  If, on the contrary,
it's just the case that I haven't looked in the right places, please
hurl me a cluebrick!



Re: [GENERAL] Populating large DB from Perl script

2007-11-02 Thread Dawid Kuroczko
On 11/1/07, Kynn Jones [EMAIL PROTECTED] wrote:
 Hi.  This is a recurrent problem that I have not been able to find a
 good solution for.  I have a large database that needs to be built from
 scratch roughly once every month.  I use a Perl script to do this.

 The tables are very large, so I avoid as much as possible using
 in-memory data structures, and instead I rely heavily on temporary
 flat files.

 The problem is the population of tables that refer to "internal" IDs
 on other tables.  By "internal" I mean IDs that have no meaning
 external to the database; they exist only to enable relational
 referencing.  They are always defined as serial integers.  So the
 script either must create and keep track of them, or it must populate
 the database in stages, letting Pg assign the serial IDs, and query
 the database for these IDs during subsequent stages.

If it is possible, perhaps you could load the raw data into a temporary
table and then create the ids from it.  For instance:

CREATE TEMP TABLE foo_raw (host text, city text, who text, value int);
-- load the raw data into foo_raw (e.g. with COPY), then:
INSERT INTO hosts (host) SELECT DISTINCT host FROM foo_raw; -- group by perhaps?
INSERT INTO [...]
INSERT INTO foo SELECT host_id, city_id, who_id, value
   FROM foo_raw
   JOIN hosts USING (host)
   JOIN cities USING (city)
   JOIN who USING (who);

This may or may not work, depending on your setup.

But perhaps a better approach, while needing more work, would be:

Your script establishes two DB connections, one for processing data
and one for maintaining IDs.  Now whenever you need to get an ID:
  1) query memcached; if found, return it
  2) query the database; if found, return it and insert it into memcached
  3) insert into the database, insert into memcached, and perhaps commit it.
Befriend thyself with the Cache::* Perl modules. :)
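As a rough illustration of the three steps above, here is a hedged sketch using Cache::Memcached and a hypothetical hosts(host_id serial, host text) table; the "host:" key prefix is invented, and the RETURNING clause needs Pg 8.2+ (older versions can use currval() on the sequence instead):

```perl
# Sketch of the memcached-backed get-or-create lookup.  Table, key prefix,
# and server address are hypothetical; requires a running memcached.
use strict;
use warnings;
use DBI;
use Cache::Memcached;

my $dbh   = DBI->connect('dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 });
my $cache = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

sub host_id {
    my ($host) = @_;

    # 1) query memcached; if found, return it
    my $id = $cache->get("host:$host");
    return $id if defined $id;

    # 2) query the database; if found, cache it and return it
    ($id) = $dbh->selectrow_array(
        'SELECT host_id FROM hosts WHERE host = ?', undef, $host);

    # 3) otherwise insert, letting the serial column assign the id
    #    (RETURNING needs Pg >= 8.2)
    ($id) = $dbh->selectrow_array(
        'INSERT INTO hosts (host) VALUES (?) RETURNING host_id',
        undef, $host) unless defined $id;

    $cache->set("host:$host", $id);
    return $id;
}
```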

   Regards,
  Dawid



Re: [GENERAL] Populating large DB from Perl script

2007-11-02 Thread andy

Kynn Jones wrote:

So... not really an answer (other than I used a stored proc)


Actually, I'm interested in your solution.

Just to make sure I understood what you did: you bulk-populated (i.e.
with $dbh->do('COPY...'), $dbh->pg_putline(...), $dbh->pg_endcopy) the
referring tables, with their fkey constraints disabled; then you ran
stored procedure(s) that went through these referring tables and
filled in the missing fkeys; and finally you activated their fkey
constraints.  Is this right?

I'm very much of a stored procedures dunce, so if the code for your
stored procedure is postable, please do.

TIA,

kj

PS: As an aside to the list: as a programmer, when I'm starting out in a
language, I learn more than I can say from reading source code written
by the experts, but for some reason I have had a hard time coming
across expertly written PostgreSQL stored procedures, other than the
occasional didactic snippet in the docs.  All these expertly written
procedures seem to be very STORED away indeed!  If, on the contrary,
it's just the case that I haven't looked in the right places, please
hurl me a cluebrick!


I'm afraid it was nothing that super.  My scripts were to insert a 
bunch of test data into a db so I could play with some very large dbs.


I have a people table and a car table.  I wanted to add ownership of 
cars to people, but thought nobody should own more than 5 cars (to make 
sure my random generator didn't bunch too many cars per person).


My Perl looks like:

my $carcount = new BerkeleyDB::Btree( -Filename => 'carcount.dbm',
    -Flags => DB_CREATE ) or die "Cannot open file: $!";

...later...

if (rand() < 0.10) {
    my $person = int(rand($maxperson));
    my $ok = 1;

    $i = 0;
    $carcount->db_get($person, $i);

    if ($i >= 5)
    {
        # shall we allow more than 5 cars?
        if (rand() > 0.90) {
            $ok = 1;
        } else {
            $ok = 0;
        }
    }

    if ($ok)
    {
        ... do the insert ...
    }
}


So I was keeping a personid => carcount map.  This worked great, as I 
said, until I got into the gigabytes size for the BerkeleyDB.


Instead I created a stored proc:

CREATE OR REPLACE FUNCTION addowner(xpersonid integer, xcarid integer)
returns void AS $$
declare
  cc integer;
begin
  select into cc count(*)
    from ownership
    where personid = xpersonid;

  if cc < 5 then
    insert into ownership(personid, carid) values (xpersonid, xcarid);
  end if;

  return;
end
$$ LANGUAGE plpgsql;


... and here is another stored proc I wrote for our website.  We have 
company shirts 'n' stuff that employees can order.  So the webpage is 
kinda like a shopping-cart thing.


create or replace function updatecart(xuserid integer, xgroupid integer,
    xprodid integer, xsizeid integer, xcolorid integer, xqty integer)
returns void as $$
declare
  xid integer;
begin
  select into xid rowid
    from vcs_ordertable
    where userid = xuserid and groupid = xgroupid and prodid = xprodid
      and sizeid = xsizeid and colorid = xcolorid;

  if not found then
    insert into vcs_ordertable(userid, groupid, prodid, sizeid, colorid, quant)
      values (xuserid, xgroupid, xprodid, xsizeid, xcolorid, xqty);
  else
    update vcs_ordertable set quant = quant + xqty where rowid = xid;
  end if;
  return;
end
$$ LANGUAGE plpgsql;


When you select a shirt (including the size, color, etc.) I only wanted 
one row per (user, group, product id, size, color); if one already 
exists in the table, I just bump its count, if not I insert it.


These two procs are about as complex as I've needed to get.

-Andy



[GENERAL] Populating large DB from Perl script

2007-11-01 Thread Kynn Jones
Hi.  This is a recurrent problem that I have not been able to find a
good solution for.  I have a large database that needs to be built from
scratch roughly once every month.  I use a Perl script to do this.

The tables are very large, so I avoid as much as possible using
in-memory data structures, and instead I rely heavily on temporary
flat files.

The problem is the population of tables that refer to "internal" IDs
on other tables.  By "internal" I mean IDs that have no meaning
external to the database; they exist only to enable relational
referencing.  They are always defined as serial integers.  So the
script either must create and keep track of them, or it must populate
the database in stages, letting Pg assign the serial IDs, and query
the database for these IDs during subsequent stages.

I have solved this general problem in various ways, all of them
unwieldy (in the latest version, the script generates the serial ids
and uses Perl's so-called tied hashes to retrieve them when needed).
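For readers unfamiliar with the technique, the tied-hash approach mentioned above boils down to a disk-backed hash mapping natural keys to script-generated serial ids. A minimal sketch using the stock DB_File module follows; the file name, key scheme, and counter handling are invented for illustration, and in a real script the counter would also have to be persisted between stages:

```perl
# Minimal sketch of a tied-hash id map: lookups hit a Btree file on disk
# rather than Perl's in-memory hash storage.
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

tie my %id_for, 'DB_File', 'ids.dbm', O_CREAT | O_RDWR, 0644, $DB_BTREE
    or die "Cannot tie ids.dbm: $!";

my $next_id = 1;   # would need to survive across runs in the real script

# Return the serial id for a key, assigning the next one on first sight.
sub id_for {
    my ($key) = @_;
    $id_for{$key} = $next_id++ unless exists $id_for{$key};
    return $id_for{$key};
}
```

As the replies in this thread note, this works well until the dbm file grows to a couple of gigabytes, at which point a stored-procedure lookup in Pg itself may scale better.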

But it occurred to me that this is a generic enough problem, and that
I'm probably re-inventing a thoroughly invented wheel.  Are there
standard techniques or resources or Pg capabilities to deal with this
sort of situation?

TIA!

kj



Re: [GENERAL] Populating large DB from Perl script

2007-11-01 Thread Jorge Godoy
On Thursday 01 November 2007 16:57:36, Kynn Jones wrote:

 But it occurred to me that this is a generic enough problem, and that
 I'm probably re-inventing a thoroughly invented wheel.  Are there
 standard techniques or resources or Pg capabilities to deal with this
 sort of situation?

You can restore the database without the constraints and then add them back 
after you have restored the last table.

You can also use pg_dump / pg_restore / psql to do that.

You can also use COPY. 

When you want to keep the referential integrity checks in place from the 
beginning, you have to respect the order in which you restore your data.

-- 
Jorge Godoy  [EMAIL PROTECTED]

