Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-26 Thread Murthy Kambhampaty
I think you can get both benefits of multi-statement transactions for INSERT dumps 
by doing subset copies  ... without any changes in postgresql!  The method I use is 
developed for handling single table loads, but is still relatively painless even for 
database dumps; however, it is limited to text dumps.

Let's say you want to unload-load table tbl_foo from schema sch_bar from database 
db_baz and reload sch_bar.tbl_foo to database db_quux.  Try the following:

1.) Dump-restore the table schema so you create an empty table in the destination 
database. e.g.:
/usr/local/pgsql-7.4/bin/pg_dump -s -t tbl_foo --schema sch_bar db_baz | \
/usr/local/pgsql-7.4/bin/psql -d db_quux 
This can be adjusted for different hosts, etc.
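For example, if the source and destination clusters sit on different machines, the same pipe works with host options (the hostnames below are just placeholders):
/usr/local/pgsql-7.4/bin/pg_dump -h srchost -s -t tbl_foo --schema sch_bar db_baz | \
/usr/local/pgsql-7.4/bin/psql -h desthost -d db_quux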

2.) COPY the records to a file:
/usr/local/pgsql-7.4/bin/psql -d db_baz \
 -c "copy sch_bar.tbl_foo to stdout" > sch_bar.tbl_foo.dat
OR
/usr/local/pgsql-7.4/bin/psql -d db_baz -Aqt \
 -c "select * from sch_bar.tbl_foo where <some condition>" > \
 sch_bar.tbl_foo.dat
The latter is slower, but selective.  You can also use psql's -F and -R options to set the 
column and row separators to whatever you like (as with copy options). If your source data 
came from a dump file, rather than a COPY, you can strip the SQL commands to leave data only, 
or modify the commands below.
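For instance, a rough way to cut just the data rows for one table out of a plain-text dump (assuming a dump file named db_baz.dump, and that the COPY line and terminating \. look the way they do in 7.4 text dumps; adjust the pattern if the table name is qualified differently):
sed -n '/^COPY tbl_foo /,/^\\\.$/p' db_baz.dump | \
 sed -e '1d' -e '$d' > sch_bar.tbl_foo.dat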

3. Pipe the data from sch_bar.tbl_foo.dat to psql, with copy commands spliced in at 
chosen intervals (in numbers of lines) depending on your preferences for speed versus 
recoverability.  In the example below, the subset size is 2000 lines:
awk \
  -v SubSize=2000 \
  -v COPYSTMT="copy sch_bar.tbl_foo from stdin;" \
 'BEGIN{ print COPYSTMT } \
  { print $0 } \
  FNR % SubSize == 0 { \
  print "\\.\n\n" ; \
  print "\n"; \
  print COPYSTMT }' sch_bar.tbl_foo.dat | \
/usr/local/pgsql-7.4/bin/psql -U gouser -d airfrance -f -

The awk command specifies the chosen subset size (2000) and a copy statement for 
putting stdin into the selected table; at the BEGINning, a copy statement is issued 
and lines are streamed in from the text file containing the table rows; after each SubSize 
number of lines the copy stream is ended (as in text dumps, with a \.), and a new 
copy statement is inserted.
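To make the splice concrete, the stream psql ends up reading looks roughly like this (<row N> stands for the Nth data line of sch_bar.tbl_foo.dat):
copy sch_bar.tbl_foo from stdin;
<row 1>
...
<row 2000>
\.

copy sch_bar.tbl_foo from stdin;
<row 2001>
...
<row 4000>
\.
and so on, so each 2000-row block is its own COPY command and therefore its own transaction.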

For a 220,000 row table, times for the simple copy versus the subset copy were:

Simple copy:
real    0m21.704s
user    0m3.790s
sys     0m0.880s

Subset copy:
real    0m24.233s
user    0m5.710s
sys     0m1.090s

Over 10% more wall clock time, but the savings from not having to rerun the entire 
load if errors are found could be tremendous.


3a.  Alternately, you can generate a log so you easily know which subset failed (if 
any):
# LogFile=/home/postgres/load.log; \
 awk \
  -v SubSize=2000 \
  -v COPYSTMT="copy S2.air from stdin;" \
  -v LogF=$LogFile \
 'BEGIN{ print "Block Size: " SubSize > LogF; \
   print "Copy Statement: " COPYSTMT > LogF; \
   print "\n\n" > LogF; \
   close(LogF) ; \
   print COPYSTMT } \
  { print $0 } \
  FNR % SubSize == 0 { \
  print "\\.\n\n" ; \
  printf("select \047Processed %d records from line no. %d to line no. %d\047;\n", SubSize, FNR - SubSize + 1, FNR) ; \
  print "\n"; \
  print COPYSTMT }
  END{ \
  print "\\.\n\n" ; \
  printf("select \047Processed a grand total of %d lines from %s\047;\n", NR, FILENAME ) }' \
  sch_bar.tbl_foo.dat | \
/usr/local/pgsql-7.4/bin/psql -U gouser -d airfrance -Atnq -f - > $LogFile 2>&1

Errors can be located with:

[EMAIL PROTECTED] postgres]$ cat load.log | grep -B 3 -A 3 ERROR:
Processed 2000 records from line no. 192001 to line no. 194000
Processed 2000 records from line no. 194001 to line no. 196000
Processed 2000 records from line no. 196001 to line no. 198000
ERROR:  invalid input syntax for integer: "My0"
CONTEXT:  COPY tbl_foo, line 2000, column oct 02: "My0"
Processed 2000 records from line no. 198001 to line no. 200000
Processed 2000 records from line no. 200001 to line no. 202000
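Once the log pins down the failing block, you can fix the offending rows in the data file and re-run just that slice rather than the whole load. A minimal sketch, assuming the failing block spans file lines 198001 through 200000 as in the log above, and loading into the example's destination database db_quux:

( echo "copy sch_bar.tbl_foo from stdin;"; \
  sed -n '198001,200000p' sch_bar.tbl_foo.dat; \
  printf '%s\n' '\.' ) | \
/usr/local/pgsql-7.4/bin/psql -d db_quux -f -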


HTH
Murthy




 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] Behalf Of Naomi Walker
 Sent: Wednesday, March 24, 2004 5:58 PM
 To: Tom Lane
 Cc: Bruce Momjian; Tsirkin Evgeny; Naomi Walker; Mark M. Huber;
 [EMAIL PROTECTED]
 Subject: Re: [Retrieved]RE: [ADMIN] backup and recovery
 
 
 At 03:54 PM 3/24/2004, Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
   Added to TODO:
 * Have pg_dump use multi-statement transactions for 
 INSERT dumps
 
   For simple performance reasons, it would be good.  I am 
 not sure about
   allowing errors to continue loading.   Anyone else?
 
 Of course, anyone who actually cares about reload speed shouldn't be
 using INSERT-style dumps anyway ... I'm not sure why we should expend
 effort on that rather than just telling people to use the COPY mode.
 
 Understood.  I would still love this feature for when in the 
 COPY mode.
 
 
  regards, tom lane

Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-26 Thread Murthy Kambhampaty
Oops, sorry for the typo in the psql command invocation. The output of the awk command in Step 3 is piped to
/usr/local/pgsql-7.4/bin/psql -d db_quux -f - -Atnq

(in the logging alternative it goes to
/usr/local/pgsql-7.4/bin/psql -d db_quux -f - -Atnq > $LogFile 2>&1)

Cheers,
 Murthy


-Original Message-
From: [EMAIL PROTECTED] on behalf of Murthy Kambhampaty
Sent: Fri 3/26/2004 3:30 PM
To: Naomi Walker; Tom Lane
Cc: Bruce Momjian; Tsirkin Evgeny; Mark M. Huber; [EMAIL PROTECTED]
Subject: Re: [Retrieved]RE: [ADMIN] backup and recovery

I think you can get both benefits of multi-statement transactions for INSERT dumps by doing subset copies ... without any changes in postgresql! The method I use is developed for handling single table loads, but is still relatively painless even for database dumps; however, it is limited to text dumps.

Let's say you want to unload-load table tbl_foo from schema sch_bar from database db_baz and reload sch_bar.tbl_foo to database db_quux. Try the following:

1.) Dump-restore the table schema so you create an empty table in the destination database. e.g.:
 /usr/local/pgsql-7.4/bin/pg_dump -s -t tbl_foo --schema sch_bar db_baz | \
  /usr/local/pgsql-7.4/bin/psql -d db_quux
This can be adjusted for different hosts, etc.

2.) COPY the records to a file:
 /usr/local/pgsql-7.4/bin/psql -d db_baz \
 -c "copy sch_bar.tbl_foo to stdout" > sch_bar.tbl_foo.dat
OR
 /usr/local/pgsql-7.4/bin/psql -d db_baz -Aqt \
 -c "select * from sch_bar.tbl_foo where <some condition>" > sch_bar.tbl_foo.dat
The latter is slower, but selective. You can also use psql's -F and -R options to set the column and row separators to whatever you like (as with copy options). If your source data came from a dump file, rather than a COPY, you can strip the SQL commands to leave data only, or modify the commands below.

3. Pipe the data from sch_bar.tbl_foo.dat to psql, with copy commands spliced in at chosen intervals (in numbers of lines) depending on your preferences for speed versus recoverability. In the example below, the subset size is 2000 lines:
awk \
 -v SubSize=2000 \
 -v COPYSTMT="copy sch_bar.tbl_foo from stdin;" \
'BEGIN{ print COPYSTMT } \
 { print $0 } \
 FNR % SubSize == 0 { \
 print "\\.\n\n" ; \
 print "\n"; \
 print COPYSTMT }' sch_bar.tbl_foo.dat | \
 /usr/local/pgsql-7.4/bin/psql -U gouser -d airfrance -f -

The awk command specifies the chosen subset size (2000) and a copy statement for putting stdin into the selected table; at the BEGINning, a copy statement is issued and lines are streamed in from the text file containing the table rows; after each SubSize number of lines the copy stream is ended (as in text dumps, with a \.), and a new copy statement is inserted.

For a 220,000 row table, times for the simple copy versus the subset copy were:

 Simple copy:
 real 0m21.704s
 user 0m3.790s
 sys 0m0.880s

 Subset copy:
 real 0m24.233s
 user 0m5.710s
 sys 0m1.090s

Over 10% more wall clock time, but the savings from not having to rerun the entire load if errors are found could be tremendous.


3a. Alternately, you can generate a log so you easily know which subset failed (if any):
# LogFile=/home/postgres/load.log; \
awk \
 -v SubSize=2000 \
 -v COPYSTMT="copy S2.air from stdin;" \
 -v LogF=$LogFile \
'BEGIN{ print "Block Size: " SubSize > LogF; \
 print "Copy Statement: " COPYSTMT > LogF; \
 print "\n\n" > LogF; \
 close(LogF) ; \
 print COPYSTMT } \
 { print $0 } \
 FNR % SubSize == 0 { \
 print "\\.\n\n" ; \
 printf("select \047Processed %d records from line no. %d to line no. %d\047;\n", SubSize, FNR - SubSize + 1, FNR) ; \
 print "\n"; \
 print COPYSTMT }
 END{ \
 print "\\.\n\n" ; \
 printf("select \047Processed a grand total of %d lines from %s\047;\n", NR, FILENAME ) }' \
 sch_bar.tbl_foo.dat | \
 /usr/local/pgsql-7.4/bin/psql -U gouser -d airfrance -Atnq -f - > $LogFile 2>&1

Errors can be located with:

[EMAIL PROTECTED] postgres]$ cat load.log | grep -B 3 -A 3 ERROR:
Processed 2000 records from line no. 192001 to line no. 194000
Processed 2000 records from line no. 194001 to line no. 196000
Processed 2000 records from line no. 196001 to line no. 198000
ERROR: invalid input syntax for integer: "My0"
CONTEXT: COPY tbl_foo, line 2000, column oct 02: "My0"
Processed 2000 records from line no. 198001 to line no. 200000
Processed 2000 records from line no. 200001 to line no. 202000


HTH
 Murthy




 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]]On Behalf Of Naomi Walker
 Sent: Wednesday, March 24, 2004 5:58 PM
 To: Tom Lane
 Cc: Bruce Momjian; Tsirkin Evgeny; Naomi Walker; Mark M. Huber;
 [EMAIL PROTECTED]
 Subject: Re: [Retrieved]RE: [ADMIN] backup and recovery


 At 03:54 PM 3/24/2004, Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
   Added to TODO:
   * Have pg_dump use multi-statement transactions for
 INSERT dumps
 
   For simple performance reasons, it would be good. I am
 not sure about
   allowing errors to continue loading. Anyone else

Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-24 Thread Tsirkin Evgeny
Isn't it better to put this in, say, pg_restore,
or to integrate such a thing into psql?
On Tue, 23 Mar 2004 17:02:36 -0700, Naomi Walker [EMAIL PROTECTED] 
wrote:

I'm not sure of the correct protocol for getting things on the todo
list.  Whom shall we beg?
At 10:13 AM 3/22/2004, Mark M. Huber wrote:
That sounds like a brilliant idea, who do we say it to make it so?

Mark H

-Original Message-
From: Naomi Walker [mailto:[EMAIL PROTECTED]
Sent: Monday, March 22, 2004 8:19 AM
To: Mark M. Huber
Cc: Naomi Walker; [EMAIL PROTECTED]
Subject: Re: [ADMIN] backup and recovery
That brings up a good point.  It would be extremely helpful to add two
parameters to pg_dump.  One, to add how many rows to insert before a
commit, and two, to live through X number of errors before dying (and
putting the bad rows in a file).
At 10:15 AM 3/19/2004, Mark M. Huber wrote:
What it was that I guess the pg_dump makes one large transaction and our
shell script wizard wrote a perl program to add a commit transaction
every 500 rows or whatever you set. Also I should have said that we were
doing the recovery with the insert statements created from pg_dump.
So... my 50 row table recovery took  10 Min.

Thanks for your help.

Mark H


-

-
Naomi Walker Chief Information Officer
   Eldorado Computing, Inc.
[EMAIL PROTECTED]   602-604-3100
-
Forget past mistakes. Forget failures. Forget everything except what 
you're
going to do now and do it.
- William Durant, founder of General Motors

Naomi Walker Chief Information Officer
   Eldorado Computing, Inc.
[EMAIL PROTECTED]   602-604-3100
-
Forget past mistakes. Forget failures. Forget everything except what 
you're
going to do now and do it.
- William Durant, founder of General Motors





Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-24 Thread Bruce Momjian
Tsirkin Evgeny wrote:
  Uh, you just ask and we discuss it on the list.
 
  Are you using INSERTs from pg_dump?  I assume so because COPY uses a
  single transaction per command.  Right now with pg_dump -d I see:
  
  --
  -- Data for Name: has_oids; Type: TABLE DATA; Schema: public; Owner:
  postgres
  --
  
  INSERT INTO has_oids VALUES (1);
  INSERT INTO has_oids VALUES (1);
  INSERT INTO has_oids VALUES (1);
  INSERT INTO has_oids VALUES (1);
 
  Seems that should be inside a BEGIN/COMMIT for performance reasons, and
  to have the same behavior as COPY (fail if any row fails).  Comments?
 
  As far as skipping on errors, I am unsure on that one, and if we put the
  INSERTs in a transaction, we will have no way of rolling back only the
  few inserts that fail.
 
 That is right, but there are situations when you prefer at least some
 data to be inserted and not all changes to be rolled back because
 of errors.

Added to TODO:

* Have pg_dump use multi-statement transactions for INSERT dumps

For simple performance reasons, it would be good.  I am not sure about
allowing errors to continue loading.   Anyone else?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-24 Thread Naomi Walker
At 03:54 PM 3/24/2004, Tom Lane wrote:
Bruce Momjian [EMAIL PROTECTED] writes:
  Added to TODO:
* Have pg_dump use multi-statement transactions for INSERT dumps

  For simple performance reasons, it would be good.  I am not sure about
  allowing errors to continue loading.   Anyone else?

Of course, anyone who actually cares about reload speed shouldn't be
using INSERT-style dumps anyway ... I'm not sure why we should expend
effort on that rather than just telling people to use the COPY mode.

Understood.  I would still love this feature for when in the COPY mode.


 regards, tom lane

-
Naomi Walker Chief Information Officer
   Eldorado Computing, Inc.
[EMAIL PROTECTED]   602-604-3100
-
Forget past mistakes. Forget failures. Forget everything except what you're 
going to do now and do it.
- William Durant, founder of General Motors





Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-24 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Added to TODO:
  * Have pg_dump use multi-statement transactions for INSERT dumps
 
  For simple performance reasons, it would be good.  I am not sure about
  allowing errors to continue loading.   Anyone else?
 
 Of course, anyone who actually cares about reload speed shouldn't be
 using INSERT-style dumps anyway ... I'm not sure why we should expend
 effort on that rather than just telling people to use the COPY mode.

My bigger issue is that COPY will fail on a single row failure, and
nothing will be in the table, while INSERT will not.
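To see the difference concretely, here is a small hypothetical example (table and values made up): the failed COPY leaves nothing behind, while a failed INSERT costs only its own row.

CREATE TABLE t (i integer);

COPY t FROM stdin;
1
2
oops
\.
-- the bad third row aborts the COPY; t stays empty

INSERT INTO t VALUES (1);      -- committed
INSERT INTO t VALUES (2);      -- committed
INSERT INTO t VALUES ('oops'); -- fails, but rows 1 and 2 remain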

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-24 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 My bigger issue is that COPY will fail on a single row failure, and
 nothing will be in the table, while INSERT will not.

In theory, there shouldn't be any failures, because the data was known
valid when it was dumped.

Of course, practice often differs from theory, but I wonder whether we
aren't talking about palliating a symptom instead of fixing the real
problem.

regards, tom lane



Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-23 Thread Naomi Walker

I'm not sure of the correct protocol for getting things on the todo 
list.  Whom shall we beg?


At 10:13 AM 3/22/2004, Mark M. Huber wrote:
That sounds like a brilliant idea, who do we say it to make it so?

Mark H

-Original Message-
From: Naomi Walker [mailto:[EMAIL PROTECTED]
Sent: Monday, March 22, 2004 8:19 AM
To: Mark M. Huber
Cc: Naomi Walker; [EMAIL PROTECTED]
Subject: Re: [ADMIN] backup and recovery


That brings up a good point.  It would be extremely helpful to add two
parameters to pg_dump.  One, to add how many rows to insert before a
commit, and two, to live through X number of errors before dying (and
putting the bad rows in a file).


At 10:15 AM 3/19/2004, Mark M. Huber wrote:
 What it was that I guess the pg_dump makes one large transaction and our
 shell script wizard wrote a perl program to add a commit transaction
 every 500 rows or whatever you set. Also I should have said that we were
 doing the recovery with the insert statements created from pg_dump. So...
 my 50 row table recovery took  10 Min.
 
 Thanks for your help.
 
 Mark H
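(A filter like the one described above need not be Perl; as a rough illustration only, assuming one INSERT statement per line in a hypothetical file insert_dump.sql, an awk equivalent that commits every 500 rows would be:

awk -v N=500 \
 'BEGIN { print "BEGIN;" } \
  { print $0 } \
  /^INSERT / && ++n % N == 0 { print "COMMIT;"; print "BEGIN;" } \
  END { print "COMMIT;" }' insert_dump.sql | psql -d <your_db> -f - )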
 
 
 -

-
Naomi Walker Chief Information Officer
Eldorado Computing, Inc.
[EMAIL PROTECTED]   602-604-3100
-
Forget past mistakes. Forget failures. Forget everything except what you're
going to do now and do it.
- William Durant, founder of General Motors


Naomi Walker Chief Information Officer
   Eldorado Computing, Inc.
[EMAIL PROTECTED]   602-604-3100
-
Forget past mistakes. Forget failures. Forget everything except what you're 
going to do now and do it.
- William Durant, founder of General Motors





Re: [Retrieved]RE: [ADMIN] backup and recovery

2004-03-23 Thread Bruce Momjian
Naomi Walker wrote:
 
 I'm not sure of the correct protocol for getting things on the todo 
 list.  Whom shall we beg?
 

Uh, you just ask and we discuss it on the list.

Are you using INSERTs from pg_dump?  I assume so because COPY uses a
single transaction per command.  Right now with pg_dump -d I see:

--
-- Data for Name: has_oids; Type: TABLE DATA; Schema: public; Owner:
postgres
--

INSERT INTO has_oids VALUES (1);
INSERT INTO has_oids VALUES (1);
INSERT INTO has_oids VALUES (1);
INSERT INTO has_oids VALUES (1);

Seems that should be inside a BEGIN/COMMIT for performance reasons, and
to have the same behavior as COPY (fail if any row fails).  Comments?
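Presumably the same -d output would then come out wrapped something like this (a sketch of the proposal, not what pg_dump emits today):

BEGIN;
INSERT INTO has_oids VALUES (1);
INSERT INTO has_oids VALUES (1);
INSERT INTO has_oids VALUES (1);
INSERT INTO has_oids VALUES (1);
COMMIT;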

As far as skipping on errors, I am unsure on that one, and if we put the
INSERTs in a transaction, we will have no way of rolling back only the
few inserts that fail.

---

 
 That brings up a good point.  It would be extremely helpful to add two
 parameters to pg_dump.  One, to add how many rows to insert before a
 commit, and two, to live through X number of errors before dying (and
 putting the bad rows in a file).
 
 
 At 10:15 AM 3/19/2004, Mark M. Huber wrote:
  What it was that I guess the pg_dump makes one large transaction and our
  shell script wizard wrote a perl program to add a commit transaction
  every 500 rows or whatever you set. Also I should have said that we were
  doing the recovery with the insert statements created from pg_dump. So...
  my 50 row table recovery took  10 Min.
  

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
