Re: [Retrieved]RE: [ADMIN] backup and recovery
I think you can get both benefits of multi-statement transactions for INSERT dumps by doing subset copies ... without any changes in postgresql! The method I use was developed for handling single-table loads, but is still relatively painless even for database dumps; however, it is limited to text dumps.

Let's say you want to unload table tbl_foo, in schema sch_bar of database db_baz, and reload sch_bar.tbl_foo into database db_quux. Try the following:

1.) Dump-restore the table schema, so you create an empty table in the destination database, e.g.:

    /usr/local/pgsql-7.4/bin/pg_dump -s -t tbl_foo --schema sch_bar db_baz | \
        /usr/local/pgsql-7.4/bin/psql -d db_quux

This can be adjusted for different hosts, etc.

2.) COPY the records to a file:

    /usr/local/pgsql-7.4/bin/psql -d db_baz \
        -c "copy sch_bar.tbl_foo to stdout" > sch_bar.tbl_foo.dat

OR

    /usr/local/pgsql-7.4/bin/psql -d db_baz -Aqt \
        -c "select * from sch_bar.tbl_foo where <some condition>" > sch_bar.tbl_foo.dat

The latter is slower, but selective. You can also use the -P option to set the column and row separators to whatever you like (as with copy options). If your source data came from a dump file rather than a COPY, you can strip the sql commands to leave data only, or modify the commands below.

3.) Pipe the data from sch_bar.tbl_foo.dat to psql, with copy commands spliced in at chosen intervals (in numbers of lines), depending on your preferences for speed versus recoverability.
In the example below, the subset size is 2000 lines:

    awk \
        -v SubSize=2000 \
        -v COPYSTMT='copy sch_bar.tbl_foo from stdin;' \
        'BEGIN { print COPYSTMT } \
        { print $0 } \
        FNR % SubSize == 0 { \
            print "\\.\n\n"; \
            print "\n"; \
            print COPYSTMT }' sch_bar.tbl_foo.dat | \
    /usr/local/pgsql-7.4/bin/psql -U gouser -d airfrance -f -

The awk command specifies the chosen subset size (2000) and a copy statement for loading stdin into the selected table. At the BEGINning, a copy statement is issued, and lines are streamed in from the text file containing the table rows; after each SubSize number of lines, the copy stream is ended (as in text dumps, with a \.) and a new copy statement inserted.

For a 220,000-row table, times for the simple copy versus the subset copy were:

Simple copy:
    real    0m21.704s
    user    0m3.790s
    sys     0m0.880s

Subset copy:
    real    0m24.233s
    user    0m5.710s
    sys     0m1.090s

Over 10% more wall-clock time, but the savings from not having to rerun the entire load if errors are found could be tremendous.

3a.) Alternately, you can generate a log so you easily know which subset failed (if any):

    # LogFile=/home/postgres/load.log; \
    awk \
        -v SubSize=2000 \
        -v COPYSTMT='copy S2.air from stdin;' \
        -v LogF=$LogFile \
        'BEGIN { print "Block Size: " SubSize >> LogF; \
            print "Copy Statement: " COPYSTMT >> LogF; \
            print "\n\n" >> LogF; \
            close(LogF); \
            print COPYSTMT } \
        { print $0 } \
        FNR % SubSize == 0 { \
            print "\\.\n\n"; \
            printf("select \047Processed %d records from line no. %d to line no. %d\047;\n", SubSize, FNR - SubSize + 1, FNR); \
            print "\n"; \
            print COPYSTMT } \
        END { \
            print "\\.\n\n"; \
            printf("select \047Processed a grand total of %d lines from %s\047;\n", NR, FILENAME) }' \
        sch_bar.tbl_foo.dat | \
    /usr/local/pgsql-7.4/bin/psql -U gouser -d airfrance -Atnq -f - >> $LogFile 2>&1

Errors can be located with:

    [EMAIL PROTECTED] postgres]$ cat load.log | grep -B 3 -A 3 ERROR:
    Processed 2000 records from line no. 192001 to line no. 194000
    Processed 2000 records from line no. 194001 to line no. 196000
    Processed 2000 records from line no. 196001 to line no. 198000
    ERROR:  invalid input syntax for integer: "My0"
    CONTEXT:  COPY tbl_foo, line 2000, column oct 02: "My0"
    Processed 2000 records from line no. 198001 to line no. 200000
    Processed 2000 records from line no. 200001 to line no. 202000

HTH,
Murthy

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Naomi Walker
Sent: Wednesday, March 24, 2004 5:58 PM
To: Tom Lane
Cc: Bruce Momjian; Tsirkin Evgeny; Naomi Walker; Mark M. Huber; [EMAIL PROTECTED]
Subject: Re: [Retrieved]RE: [ADMIN] backup and recovery

At 03:54 PM 3/24/2004, Tom Lane wrote:
>Bruce Momjian <[EMAIL PROTECTED]> writes:
>> Added to TODO:
>>        * Have pg_dump use multi-statement transactions for INSERT dumps
>> For simple performance reasons, it would be good. I am not sure about
>> allowing errors to continue loading. Anyone else?
>
>Of course, anyone who actually cares about reload speed shouldn't be
>using INSERT-style dumps anyway ... I'm not sure why we should expend
>effort on that rather than just telling people to use the COPY mode.

Understood. I would still love this feature for when in the COPY mode.

>                        regards, tom lane
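For readers who want to try the splicing step in isolation, here is a minimal, self-contained sketch of the same awk idea; the table name t, the subset size of 3, and the sample rows r1..r5 are all illustrative placeholders, not names from the message above:

```shell
# Splice a COPY statement into a data stream every SubSize lines, ending
# each block with the \. terminator, so each block loads as its own COPY.
printf 'r1\nr2\nr3\nr4\nr5\n' |
awk -v SubSize=3 -v COPYSTMT='copy t from stdin;' '
  FNR % SubSize == 1 { print COPYSTMT }   # open a new COPY block
                     { print }            # stream the data row through
  FNR % SubSize == 0 { print "\\." }      # close the block every SubSize rows
  END                { if (FNR % SubSize != 0) print "\\." }  # close any partial block
'
```

Piping this output to psql -f - would then run each block as a separate COPY command, as in the full example above.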
Re: [Retrieved]RE: [ADMIN] backup and recovery
Oops, sorry for the typo in the psql command invocation. The output of the awk command in Step 3 is piped to

    /usr/local/pgsql-7.4/bin/psql -d db_quux -f - -Atnq

(in the logging alternative it goes to

    /usr/local/pgsql-7.4/bin/psql -d db_quux -f - -Atnq >> $LogFile 2>&1 )

Cheers,
Murthy

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Murthy Kambhampaty
Sent: Fri 3/26/2004 3:30 PM
To: Naomi Walker; Tom Lane
Cc: Bruce Momjian; Tsirkin Evgeny; Mark M. Huber; [EMAIL PROTECTED]
Subject: Re: [Retrieved]RE: [ADMIN] backup and recovery
Re: [Retrieved]RE: [ADMIN] backup and recovery
Isn't it better to put this, say, in pg_restore, or to integrate such a thing into psql?

On Tue, 23 Mar 2004 17:02:36 -0700, Naomi Walker <[EMAIL PROTECTED]> wrote:

> I'm not sure of the correct protocol for getting things on the todo
> list. Whom shall we beg?
>
> At 10:13 AM 3/22/2004, Mark M. Huber wrote:
>> That sounds like a brilliant idea; who do we say it to, to make it so?
>> Mark H
>>
>> -----Original Message-----
>> From: Naomi Walker [mailto:[EMAIL PROTECTED]]
>> Sent: Monday, March 22, 2004 8:19 AM
>> To: Mark M. Huber
>> Cc: Naomi Walker; [EMAIL PROTECTED]
>> Subject: Re: [ADMIN] backup and recovery
>>
>> That brings up a good point. It would be extremely helpful to add two
>> parameters to pg_dump: one, to set how many rows to insert before a
>> commit, and two, to live through X number of errors before dying (and
>> put the bad rows in a file).
>>
>> At 10:15 AM 3/19/2004, Mark M. Huber wrote:
>>> What it was, I guess, is that pg_dump makes one large transaction,
>>> and our shell script wizard wrote a perl program to add a commit
>>> transaction every 500 rows or whatever you set. Also, I should have
>>> said that we were doing the recovery with the insert statements
>>> created from pg_dump. So... my 50 row table recovery took 10 Min.
>>> Thanks for your help.
>>> Mark H

-- CONFIDENTIALITY NOTICE --
This message is intended for the sole use of the individual and entity to whom it is addressed, and may contain information that is privileged, confidential and exempt from disclosure under applicable law.
If you are not the intended addressee, nor authorized to receive for the intended addressee, you are hereby notified that you may not use, copy, disclose or distribute to anyone the message or any information contained in the message. If you have received this message in error, please immediately advise the sender by reply email, and delete the message. Thank you.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to [EMAIL PROTECTED] so that your
      message can get through to the mailing list cleanly
Re: [Retrieved]RE: [ADMIN] backup and recovery
Tsirkin Evgeny wrote:
>> Uh, you just ask and we discuss it on the list. Are you using INSERTs
>> from pg_dump? I assume so, because COPY uses a single transaction per
>> command. Right now with pg_dump -d I see:
>>
>>     --
>>     -- Data for Name: has_oids; Type: TABLE DATA; Schema: public; Owner: postgres
>>     --
>>     INSERT INTO has_oids VALUES (1);
>>     INSERT INTO has_oids VALUES (1);
>>     INSERT INTO has_oids VALUES (1);
>>     INSERT INTO has_oids VALUES (1);
>>
>> Seems that should be inside a BEGIN/COMMIT for performance reasons, and
>> to have the same behavior as COPY (fail if any row fails). Comments?
>> As far as skipping on errors, I am unsure on that one, and if we put
>> the INSERTs in a transaction, we will have no way of rolling back only
>> the few inserts that fail.
>
> That is right, but there are situations when you prefer at least some
> data to be inserted and not all changes to be rolled back because of
> errors.

Added to TODO:

        * Have pg_dump use multi-statement transactions for INSERT dumps

For simple performance reasons, it would be good. I am not sure about allowing errors to continue loading. Anyone else?

--
Bruce Momjian | http://candle.pha.pa.us
[EMAIL PROTECTED] | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
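Pending that TODO item, the BEGIN/COMMIT bracketing can be approximated outside pg_dump by wrapping the INSERT-style dump in a transaction before feeding it to psql. A sketch, using the has_oids lines above as placeholder dump text:

```shell
# Wrap an INSERT-style dump in a single transaction so it behaves like
# COPY: either every row loads, or none do. The dump variable stands in
# for real pg_dump -d output.
dump='INSERT INTO has_oids VALUES (1);
INSERT INTO has_oids VALUES (1);'
{ echo 'BEGIN;'; printf '%s\n' "$dump"; echo 'COMMIT;'; }
```

The braced group's output would then be piped to psql against the target database.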
Re: [Retrieved]RE: [ADMIN] backup and recovery
At 03:54 PM 3/24/2004, Tom Lane wrote:
>Bruce Momjian <[EMAIL PROTECTED]> writes:
>> Added to TODO:
>>        * Have pg_dump use multi-statement transactions for INSERT dumps
>> For simple performance reasons, it would be good. I am not sure about
>> allowing errors to continue loading. Anyone else?
>
>Of course, anyone who actually cares about reload speed shouldn't be
>using INSERT-style dumps anyway ... I'm not sure why we should expend
>effort on that rather than just telling people to use the COPY mode.

Understood. I would still love this feature for when in the COPY mode.

>                        regards, tom lane
Re: [Retrieved]RE: [ADMIN] backup and recovery
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
>> Added to TODO:
>>        * Have pg_dump use multi-statement transactions for INSERT dumps
>> For simple performance reasons, it would be good. I am not sure about
>> allowing errors to continue loading. Anyone else?
>
> Of course, anyone who actually cares about reload speed shouldn't be
> using INSERT-style dumps anyway ... I'm not sure why we should expend
> effort on that rather than just telling people to use the COPY mode.

My bigger issue is that COPY will fail on a single row failure, and nothing will be in the table, while INSERT will not.
Re: [Retrieved]RE: [ADMIN] backup and recovery
Bruce Momjian <[EMAIL PROTECTED]> writes:
> My bigger issue is that COPY will fail on a single row failure, and
> nothing will be in the table, while INSERT will not.

In theory, there shouldn't be any failures, because the data was known valid when it was dumped. Of course, practice often differs from theory, but I wonder whether we aren't talking about palliating a symptom instead of fixing the real problem.

                        regards, tom lane
Re: [Retrieved]RE: [ADMIN] backup and recovery
I'm not sure of the correct protocol for getting things on the todo list. Whom shall we beg?

At 10:13 AM 3/22/2004, Mark M. Huber wrote:
>That sounds like a brilliant idea; who do we say it to, to make it so?
>Mark H
>
>-----Original Message-----
>From: Naomi Walker [mailto:[EMAIL PROTECTED]]
>Sent: Monday, March 22, 2004 8:19 AM
>To: Mark M. Huber
>Cc: Naomi Walker; [EMAIL PROTECTED]
>Subject: Re: [ADMIN] backup and recovery
>
>That brings up a good point. It would be extremely helpful to add two
>parameters to pg_dump: one, to set how many rows to insert before a
>commit, and two, to live through X number of errors before dying (and
>put the bad rows in a file).
>
>At 10:15 AM 3/19/2004, Mark M. Huber wrote:
>>What it was, I guess, is that pg_dump makes one large transaction, and
>>our shell script wizard wrote a perl program to add a commit
>>transaction every 500 rows or whatever you set. Also, I should have
>>said that we were doing the recovery with the insert statements
>>created from pg_dump. So... my 50 row table recovery took 10 Min.
>>Thanks for your help.
>>Mark H

-
Naomi Walker
Chief Information Officer
Eldorado Computing, Inc.
[EMAIL PROTECTED]
602-604-3100
-
Forget past mistakes. Forget failures. Forget everything except what
you're going to do now and do it.
- William Durant, founder of General Motors
Re: [Retrieved]RE: [ADMIN] backup and recovery
Naomi Walker wrote:
> I'm not sure of the correct protocol for getting things on the todo
> list. Whom shall we beg?

Uh, you just ask and we discuss it on the list. Are you using INSERTs from pg_dump? I assume so, because COPY uses a single transaction per command. Right now with pg_dump -d I see:

    --
    -- Data for Name: has_oids; Type: TABLE DATA; Schema: public; Owner: postgres
    --
    INSERT INTO has_oids VALUES (1);
    INSERT INTO has_oids VALUES (1);
    INSERT INTO has_oids VALUES (1);
    INSERT INTO has_oids VALUES (1);

Seems that should be inside a BEGIN/COMMIT for performance reasons, and to have the same behavior as COPY (fail if any row fails). Comments?

As far as skipping on errors, I am unsure on that one, and if we put the INSERTs in a transaction, we will have no way of rolling back only the few inserts that fail.

---------------------------------------------------------------------------

> That brings up a good point. It would be extremely helpful to add two
> parameters to pg_dump: one, to set how many rows to insert before a
> commit, and two, to live through X number of errors before dying (and
> put the bad rows in a file).
>
> At 10:15 AM 3/19/2004, Mark M. Huber wrote:
>> What it was, I guess, is that pg_dump makes one large transaction,
>> and our shell script wizard wrote a perl program to add a commit
>> transaction every 500 rows or whatever you set. Also, I should have
>> said that we were doing the recovery with the insert statements
>> created from pg_dump. So... my 50 row table recovery took 10 Min.
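The perl wrapper Mark describes can be sketched with awk alone: stream the INSERT dump through a filter that commits and reopens a transaction every N rows. Here N=2 and the INSERT lines are illustrative placeholders, not real dump output:

```shell
# Insert BEGIN;/COMMIT; pairs into an INSERT-style dump every N rows,
# emulating the "commit every 500 rows" wrapper described above.
printf 'INSERT INTO t VALUES (1);\nINSERT INTO t VALUES (2);\nINSERT INTO t VALUES (3);\n' |
awk -v N=2 '
  NR % N == 1 { print "BEGIN;" }    # open a transaction before each block
              { print }             # pass the INSERT through unchanged
  NR % N == 0 { print "COMMIT;" }   # commit after every N rows
  END         { if (NR % N != 0) print "COMMIT;" }  # commit any partial block
'
```

The resulting stream can be fed to psql; a failed row then rolls back only its own N-row block rather than the whole load.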