Re: [GENERAL] Populating huge tables each day

2005-07-03 Thread Ben-Nes Yonatan

On Tue, Jun 28, 2005 at 10:36:58AM -0700, Dann Corbit wrote:


Nope, truncate is undoubtedly faster. But it also means you would have
downtime as you mentioned. If it were me, I'd probably make the
trade-off of using a delete inside a transaction.


For every record in a bulk loaded table?


Sure. If the data's only being loaded once a day, it probably doesn't
matter if that delete takes 10 minutes.



If it were that important that both servers be available all the time, I
would bulk load into a second table with the same shape and then rename
when completed.


Interesting idea, though the problem is that AFAIK everything will block
on the rename. If everything didn't block though, this might be a better
way to do it, although it potentially complicates the code greatly
(think about needing to add indexes, rebuild RI, etc.)


Sorry for the lack of answers; I was away and unable to reply...

I thought about the idea of loading everything into a temporary table 
and then renaming it, but as Jim mentioned, it would also complicate 
everything a lot more... I'm not sure, though, that everything will 
really be blocked during that time if it is all done inside a 
transaction - or am I wrong here?


Here are some answers to Dann's questions from an earlier mail:
How many tables are to be replicated?
* 2 tables + another one which will get modified a little.

What is the total number of expected rows for each table?
* One table is about 3.5 million rows; the second is quite hard to 
determine, but it's about 40,000. The third table's modification will 
probably be about 10 new rows each day at most, so it's really not a 
problem (it will require one query over all of the data at the end, 
though...).


How fast are the tables expected to grow?
* I'm not sure I understand the question, but if you're asking about 
the time it takes the tables to reach their final size, it should take 
minutes, I guess, since it will probably load everything at once, and I 
want to cut that time as much as I can (I'll run it under nice).


When must the new data become available online?
* Right after the changes complete (it's not really a deadline in 
seconds or a specific hour).


Are all of the tables in the database populated from a foreign source or
just some of them?
* Just those 3.

Do you also have access to the source data in its database format, or
only as text dumps?
* Only as text or XML. I prefer text because I want to use COPY, though 
any different thoughts will be gladly accepted.


** The goal of the site is to create a portal of products for end-users.


I'm starting to wonder whether I shouldn't load the New Data into a 
Temp table at all, but instead load it directly into the Main table and 
just add a status field marking it as 'in progress'. When deleting, 
I'll delete all of the rows whose status is 'active', and afterwards 
update all of the rows with status 'in progress' to 'active' (a rough 
sketch follows below).
With this solution I'll also be able to protect rows that I want to 
keep from deletion, by simply changing their status to something other 
than 'active' or 'in progress'.
It also saves the need to transfer all of the data from the Temp table 
to the Main table.
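
A rough sketch of what I mean, with made-up table and column names 
(I'm assuming a text status column on the Main table that defaults to 
'in progress' for newly loaded rows):

  -- load the day's rows next to the existing ones; the status column's
  -- default flags them as 'in progress'
  COPY main_table (col1, col2) FROM '/path/to/daily_dump.txt';

  BEGIN;
  DELETE FROM main_table WHERE status = 'active';  -- drop the old rows
  UPDATE main_table SET status = 'active'          -- promote the new rows
   WHERE status = 'in progress';
  COMMIT;

  -- rows given any other status (e.g. 'keep') survive both steps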


Two points bother me here...
1. By loading the data into the Main table before deleting its current 
content, I create a table with 7 million rows, which will stress the 
system more for every query I run on it (like the one I need for the 
deletion of the old rows).
2. I'm not sure whether I'll be able to restart the counting of the 
indexing while there is still data in the table (after all, with this 
method there won't be any period of time without data in the Main 
table).
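
If by 'the counting of the indexing' I mean the serial column's 
sequence, then as far as I understand setval can reset it at any time, 
even while the table holds data (the sequence name below is just the 
usual default for a serial column named id, as an example):

  SELECT setval('main_table_id_seq', 1, false);  -- next nextval() returns 1
  -- or, to avoid duplicate keys while old rows still exist:
  SELECT setval('main_table_id_seq', (SELECT max(id) FROM main_table));

but I would be glad for confirmation on this.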


I guess what I really want to know is how much this whole process will 
stress the server... and what I can do to let the server work on it in 
a way that won't disturb the rest of the processes.


Thanks a lot again,
Ben-Nes Yonatan
Canaan Surfing ltd.
http://www.canaan.net.il




Re: [GENERAL] Populating huge tables each day

2005-06-28 Thread Dann Corbit
 -Original Message-
 From: Jim C. Nasby [mailto:[EMAIL PROTECTED]
 Sent: Monday, June 27, 2005 6:55 PM
 To: Dann Corbit
 Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
 Subject: Re: [GENERAL] Populating huge tables each day
 
 On Mon, Jun 27, 2005 at 01:05:42PM -0700, Dann Corbit wrote:
 
   -Original Message-
   From: Jim C. Nasby [mailto:[EMAIL PROTECTED]
   Sent: Monday, June 27, 2005 12:58 PM
   To: Dann Corbit
   Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
   Subject: Re: [GENERAL] Populating huge tables each day
  
   On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
I see a lot of problems with this idea.
   
You mention that the database is supposed to be available 24x7.
While you are loading, the database table receiving data will not be
available.  Therefore, you will have to have one server online (with
  
   Why do you think that's the case?
 
  He's doing a bulk load.  I assume he will have to truncate the table
  and load it with the copy command.
 
 Don't ass-u-me; he said he'd be deleting from the main table, not
 truncating.
 
  Is there an alternative I do not know of that is equally fast?
 
 Nope, truncate is undoubtedly faster. But it also means you would have
 downtime as you mentioned. If it were me, I'd probably make the
 trade-off of using a delete inside a transaction.

For every record in a bulk loaded table?

If it were that important that both servers be available all the time, I
would bulk load into a second table with the same shape and then rename
when completed.

Be that as it may, I don't think that there is enough information yet to
give good advice.



Re: [GENERAL] Populating huge tables each day

2005-06-28 Thread Jim C. Nasby
On Tue, Jun 28, 2005 at 10:36:58AM -0700, Dann Corbit wrote:
  Nope, truncate is undoubtedly faster. But it also means you would have
  downtime as you mentioned. If it were me, I'd probably make the
  trade-off of using a delete inside a transaction.
 
 For every record in a bulk loaded table?
Sure. If the data's only being loaded once a day, it probably doesn't
matter if that delete takes 10 minutes.
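
Something like this (table and file names made up) would keep the old
data visible to readers right up to the commit:

  BEGIN;
  DELETE FROM main_table;                          -- slower than TRUNCATE,
                                                   -- but doesn't lock readers out
  COPY main_table FROM '/path/to/daily_dump.txt';  -- reload the day's data
  COMMIT;
  VACUUM ANALYZE main_table;                       -- DELETE leaves dead tuples behind

Because of MVCC, other sessions keep seeing the pre-transaction rows
until the COMMIT, so there's never a window where the table looks empty.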

 If it were that important that both servers be available all the time, I
 would bulk load into a second table with the same shape and then rename
 when completed.
Interesting idea, though the problem is that AFAIK everything will block
on the rename. If everything didn't block though, this might be a better
way to do it, although it potentially complicates the code greatly
(think about needing to add indexes, rebuild RI, etc.)
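
For reference, the swap itself is just a couple of catalog updates
(names hypothetical):

  BEGIN;
  ALTER TABLE main_table RENAME TO main_table_old;  -- takes an exclusive lock
  ALTER TABLE main_table_new RENAME TO main_table;
  COMMIT;
  DROP TABLE main_table_old;

The renames are fast, but they grab an access-exclusive lock, so every
query touching main_table queues up until the transaction commits -
that's the blocking I'm referring to.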
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [GENERAL] Populating huge tables each day

2005-06-27 Thread Dann Corbit
I see a lot of problems with this idea.

You mention that the database is supposed to be available 24x7.
While you are loading, the database table receiving data will not be
available.  Therefore, you will have to have one server online (with
only the old data), while the other one is loading.  Once the load and
all procedures are complete, you could switch the active server.

You do not mention your time constraints and the total volume of data.
If the new data must become available at some critical time, this is a
very important requirement that must be spelled out clearly.

You will need some kind of guarantee of relational integrity.  This is
always difficult to achieve when bulk loading from foreign sources.
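
One common way to get part of that guarantee, sketched here with
made-up names, is to stage the load and look for orphaned rows before
they ever reach the live tables:

  -- staged rows whose foreign key value has no match in the lookup table
  SELECT s.*
    FROM temp_table s
    LEFT JOIN second_table r ON r.id = s.second_id
   WHERE r.id IS NULL;

If that query returns anything, the batch needs repair (or those rows
need to be rejected) before the real insert.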

I think it is important to spell things out more clearly.

How many tables are to be replicated?
What is the total number of expected rows for each table?
How fast are the tables expected to grow?
When must the new data become available online?
Are all of the tables in the database populated from a foreign source or
just some of them?
Do you also have access to the source data in its database format, or
only as text dumps?

Is the goal to serve as a reporting server?  Is the goal to make the
same data as on the original server available to end-users, and in so
doing reduce the load on the original server?  What is the real purpose
of the task to be accomplished?

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-general-
 [EMAIL PROTECTED] On Behalf Of Ben-Nes Yonatan
 Sent: Monday, June 27, 2005 1:13 PM
 To: pgsql-general@postgresql.org
 Subject: [GENERAL] Populating huge tables each day
 
 Hi All,
 
 First, I apologize for the length of this email; I'm just afraid my
 problem is quite complicated to explain. Also, sorry about my English,
 which is not my native language :)
 
 I'm currently building a site for a client which will need to update
 his database on a daily routine with about 3.5 million rows from data
 files (the files vary in size and can reach up to 1 million rows per
 file). That data will be retrieved from several files (text or XML, by
 my choice). Some of the rows will need to be INSERTed, because they
 don't have any previous record; some will need to UPDATE previous
 records, or just INSERT and DELETE the previous records.
 Besides that, the new data also brings information that tells how to
 populate another table (2 fields, and the result is a lot fewer rows)
 and how to connect the two tables by a foreign key, which is written in
 each row of the main table (the one with the 3.5 million rows).
 Now, the site must work 24/7, and it will probably have 2 servers
 running PostgreSQL (working as 1 database); the scripting language I'll
 use is PHP, if that changes anything.
 
 I thought about how to accomplish this, and I would like to receive
 comments and ideas.
 I'm mostly worried about the stress it will put on the server during
 the process, and about the site displaying working information the
 whole time.
 
 A brief explanation of the definitions I'll use ahead:
 1. New data - The data retrieved from the files (text or XML).
 2. Main table - The final table, which needs to hold about 3.5 million
 rows. It can be empty before the process, or it can already hold
 information, some of which needs to get updated using the New data;
 the rest of the New data needs to get inserted into it, while the
 previous data which didn't get updated needs to get deleted.
 3. Second table - The table holding information that the data in the
 Main table needs to be connected to using foreign keys.
 4. Temp table - A temporary table which will hold the New data until
 it is ready to INSERT/UPDATE the data in the Main table (it has the
 exact same columns as the Main table).
 5. Temp table2 - A temporary table created by CREATE TEMPORARY
 TABLE AS (formerly SELECT INTO).
 
 My plan (a rough SQL sketch of step 2 follows the list):
 1. With COPY FROM, I'll insert the data into the Temp table (all of
 the files in text format).
 2. Run a PL/pgSQL function: {
 
 A. Start a transaction.
 B. DELETE the content of the currently existing Second table.
 C. INSERT data into the Second table using the Temp table - each row
 will be checked against the Second table (SELECT) to see if its values
 already exist there, and if not, an INSERT will create it in the
 Second table - I'm afraid this process will be extremely heavy on the
 server.
 D. DELETE the Main table rows.
 E. With CREATE TEMPORARY TABLE AS, I'll create Temp table2, which will
 contain all of the information of the Temp table, plus a subquery that
 retrieves the ID from the Second table for the foreign key - quite a
 heavy process, I suspect.
 F. INSERT the data from Temp table2 into the Main table.
 G. End the transaction + quit PL/pgSQL. }
 
 3. Delete all the files.
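 
 Here is roughly what step 2 means in SQL (table and column names are
 invented; I've written steps C and E as set-based statements instead
 of row-by-row checks, which should be lighter, and I emptied the Main
 table before the Second table because the foreign key would otherwise
 block the delete, unless the constraint is deferrable; also, as far as
 I know a PL/pgSQL function already runs inside the caller's
 transaction, so BEGIN/COMMIT would belong to the calling session):
 
   BEGIN;
   -- D. clear the Main table first: its rows hold the foreign key
   DELETE FROM main_table;
   -- B. clear the Second table
   DELETE FROM second_table;
   -- C. repopulate it from the distinct values staged in the Temp table
   INSERT INTO second_table (val1, val2)
   SELECT DISTINCT val1, val2 FROM temp_table;
   -- E+F. move the staged rows over, resolving the foreign key with a
   --      join rather than a per-row subquery
   INSERT INTO main_table (col1, col2, second_id)
   SELECT t.col1, t.col2, s.id
     FROM temp_table t
     JOIN second_table s ON s.val1 = t.val1 AND s.val2 = t.val2;
   COMMIT;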
 
 Thanks a lot in advance, and again, I'm sorry for the length of the
 mail :)
 
 Ben-Nes Yonatan
 Canaan Surfing ltd.
 http://www.canaan.net.il
 
 

Re: [GENERAL] Populating huge tables each day

2005-06-27 Thread Jim C. Nasby
On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
 I see a lot of problems with this idea.
 
 You mention that the database is supposed to be available 24x7.
 While you are loading, the database table receiving data will not be
 available.  Therefore, you will have to have one server online (with

Why do you think that's the case?
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [GENERAL] Populating huge tables each day

2005-06-27 Thread Dann Corbit

 -Original Message-
 From: Jim C. Nasby [mailto:[EMAIL PROTECTED]
 Sent: Monday, June 27, 2005 12:58 PM
 To: Dann Corbit
 Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
 Subject: Re: [GENERAL] Populating huge tables each day
 
 On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
  I see a lot of problems with this idea.
 
  You mention that the database is supposed to be available 24x7.
  While you are loading, the database table receiving data will not be
  available.  Therefore, you will have to have one server online (with
 
 Why do you think that's the case?

He's doing a bulk load.  I assume he will have to truncate the table and
load it with the copy command.

Is there an alternative I do not know of that is equally fast?



Re: [GENERAL] Populating huge tables each day

2005-06-27 Thread Jim C. Nasby
On Mon, Jun 27, 2005 at 01:05:42PM -0700, Dann Corbit wrote:
 
  -Original Message-
  From: Jim C. Nasby [mailto:[EMAIL PROTECTED]
  Sent: Monday, June 27, 2005 12:58 PM
  To: Dann Corbit
  Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
  Subject: Re: [GENERAL] Populating huge tables each day
  
  On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
   I see a lot of problems with this idea.
  
   You mention that the database is supposed to be available 24x7.
   While you are loading, the database table receiving data will not be
   available.  Therefore, you will have to have one server online (with
  
  Why do you think that's the case?
 
 He's doing a bulk load.  I assume he will have to truncate the table and
 load it with the copy command.

Don't ass-u-me; he said he'd be deleting from the main table, not
truncating.

 Is there an alternative I do not know of that is equally fast?
 
Nope, truncate is undoubtedly faster. But it also means you would have
downtime as you mentioned. If it were me, I'd probably make the
trade-off of using a delete inside a transaction.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
