Janet Jacobsen wrote:
Hi.  We are running a data processing/analysis pipeline that
writes about 100K records to two tables on a daily basis.
The pipeline runs from about 6:00 a.m. to 10:00 a.m.

Our user base is small - about five people.  Each accesses
the database in a different way (generally using some script
- either Perl or Python).

Some people begin querying the database as soon as the new
data/analysis results start being loaded.  Others wait until the
day's run is complete, so the number of concurrent users is
small at this time.

The data/analysis results are loaded into two tables from two
files of 200 to 1,000 rows each using the COPY command,
which is executed from a Perl script that uses DBD-Pg.

Other details: Postgres 8.3.7 running on a Linux system
with eight processors.

Both of the big tables (now up to > 15 M rows each) have
indexes on several of the columns.  The indexes were
Both tables have one or two foreign key constraints.

My questions are:
(1) At the point that the data are being loaded into the tables,
are the new data indexed?

It depends on whether an index exists on the table when you load the data.
If there is one, it is updated as the rows are loaded.

(2) Should I REINDEX these two tables daily after the pipeline
completes?  Is this what other people do in practice?

It depends on whether an index exists on the table when you load the data.
But I repeat myself :-).  If an index exists, it is kept up to date during the
load, so you would not need to reindex it.  It may be faster to load a table
without an index and add the index afterwards, but that depends on whether you
need the index to enforce unique constraints.
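
For what it's worth, the load-then-index order could be sketched as below.
This is only an illustration in Python that builds the SQL statements; the
table, index and column names are made up, and it assumes none of the indexes
backs a unique or primary key constraint.

```python
def bulk_load_statements(table, indexes):
    """Return SQL for: drop indexes, COPY the data, recreate indexes.

    `indexes` maps a (hypothetical) index name to its list of columns.
    Assumes no index in `indexes` enforces a unique constraint.
    """
    stmts = [f"DROP INDEX {name}" for name in indexes]
    # The client streams the rows after issuing this statement.
    stmts.append(f"COPY {table} FROM STDIN")
    for name, cols in indexes.items():
        stmts.append(f"CREATE INDEX {name} ON {table} ({', '.join(cols)})")
    # Refresh planner statistics once the new rows are in.
    stmts.append(f"ANALYZE {table}")
    return stmts


if __name__ == "__main__":
    # Hypothetical table and index names, for illustration only.
    for s in bulk_load_statements("results", {"results_run_idx": ["run_id"]}):
        print(s + ";")
```

With only a few hundred to a thousand rows per file going into tables of
15 M rows, rebuilding the indexes would probably cost more than letting COPY
maintain them, so this mainly pays off for very large loads.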

(3) Currently the pipeline executes in serial fashion.  We'd
like to cut the wall clock time down as much as possible.
The data processing and data analysis can be done in parallel,
but can the loading of the database be done in parallel, i.e.,
can I execute four parallel COPY commands from four copies

We'd need more specifics.  Are you COPYing into two different tables at once
(that should work)?  Or into the same table with different data (that should
work too, I'd guess)?  Or the same data into a table with a unique key
(that'll break)?
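
To make the "same table, different data" case concrete, here is a rough
sketch in Python.  It assumes psycopg2 is installed and uses threads with one
connection each rather than four separate scripts; the dsn, table name and
function names are placeholders of mine, not anything from your pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_workers):
    """Split rows into n_workers disjoint, roughly equal slices."""
    return [rows[i::n_workers] for i in range(n_workers)]


def copy_chunk(dsn, table, rows):
    """Load one chunk over its own connection with COPY ... FROM STDIN."""
    import io
    import psycopg2  # assumption: psycopg2 is available

    buf = io.StringIO("".join(rows))  # rows are newline-terminated text lines
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.copy_expert(f"COPY {table} FROM STDIN", buf)
        conn.commit()
    finally:
        conn.close()
    return len(rows)


def parallel_copy(dsn, table, rows, n_workers=4, loader=copy_chunk):
    """Run one loader per non-empty chunk; return the total rows handed over."""
    chunks = [c for c in split_rows(rows, n_workers) if c]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(lambda c: loader(dsn, table, c), chunks))
```

Each chunk is committed separately, so if one worker fails the others are
still loaded; whether that is acceptable depends on your pipeline.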

Our initial attempt at doing this failed.

What was the error?

I found one
posting in the archives about parallel COPY, but it doesn't seem
to be quite on point.

They have added parallel restore to pg_restore, but I think that loads
different tables in parallel, not the same table.  Was that what you saw?

(4) Does COPY lock the table?  Do I need to explicitly
LOCK the table before the COPY command?  Does LOCK
even apply to using COPY?  If I used table locking, would
parallel COPY work?

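You do not need to lock the table yourself.  COPY takes only the same
lightweight table lock as an INSERT, which does not block readers or other
COPYs at the table level.  Explicit locking is counterproductive to multiuser
access.  Why would you think locking a table would let parallel COPY work?  A
lock gives one process exclusive access to a table, which is exactly what you
don't want.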
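
As a small illustration of why an explicit lock hurts, here is a partial
lock-conflict table written out in Python.  It covers only the three lock
modes that matter here (the server has more), and the helper name is mine.

```python
# Partial table-lock conflict map, limited to the three modes relevant here.
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "ROW EXCLUSIVE": {"ACCESS EXCLUSIVE"},
    "ACCESS EXCLUSIVE": {"ACCESS SHARE", "ROW EXCLUSIVE", "ACCESS EXCLUSIVE"},
}

# Table lock each statement takes.
STATEMENT_LOCK = {
    "SELECT": "ACCESS SHARE",
    "COPY FROM": "ROW EXCLUSIVE",
    "LOCK TABLE": "ACCESS EXCLUSIVE",  # default mode when none is given
}


def blocks(stmt_a, stmt_b):
    """True if the two statements cannot hold their table locks together."""
    return STATEMENT_LOCK[stmt_b] in CONFLICTS[STATEMENT_LOCK[stmt_a]]
```

So two COPYs, or a COPY and a SELECT, get along fine, while a plain
LOCK TABLE shuts everyone else out until the transaction ends.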

(5) If I drop the indexes and foreign key constraints, then is it
possible to COPY to a table from more than one script, i.e., do
parallel COPY?  It seems like a really bad idea to drop those
foreign key constraints.

It would be a bad idea, yes.  One thing that could stop you is a unique
constraint with two COPYs inserting the same data.  What sort of errors did
you get last time you tried this?

I have never tried two processes COPYing into the same table at the same time,
but I'd bet it's possible.
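
Before trying it, one cheap check is that the input files do not share any
key values, since that is the case I'd expect to fail.  A minimal sketch in
Python, assuming tab-delimited COPY text files with the unique key in the
first column (both are guesses about your data):

```python
from collections import Counter


def duplicate_keys(chunks, key_col=0, sep="\t"):
    """Return keys that appear more than once across all chunks, sorted."""
    counts = Counter(
        line.rstrip("\n").split(sep)[key_col]
        for chunk in chunks
        for line in chunk
    )
    return sorted(k for k, n in counts.items() if n > 1)
```

An empty result only says the files do not collide with each other; keys
already present in the table would still need checking against the database.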


Sent via pgsql-general mailing list (pgsql-general@postgresql.org)