Hi,

I have some questions about how to efficiently upload fairly large amounts of data into PostgreSQL remotely.

The tables I will be loading consist of rows containing:
a) 1 character column of no more than 20 or 30 characters,
b) a timestamp, and
c) lots of positive integers (20-40 of them). The integers will usually have small values: many will be 0, most will have 2-3 digit values, and some will have 5-6 digit values.

The biggest uploads will consist of about 4500 rows/day from a single source. There may be up to 50-60 sources (so yes, the system could generate up to 2-3 million rows/week), but it is intended that the data be automatically aggregated by day/week/month, etc. The aggregate functions would generally be SUM and AVG.

The uploads will occur over unreliable WAN links, where bandwidth is at a premium.

The obvious way to do this is

    COPY TableName FROM stdin USING DELIMITERS ','

and send a newline-delimited list of comma-separated values. For integers, the resulting data isn't very big (well, 330K for ~4300 rows).

The problem is the following: the remote systems determine the nature of the table metadata, e.g. the table and column names are derived automatically from the source files for these uploads.

The assumptions you can make here are as follows:

The remote systems' data source files will always contain 1 integer, 1 text field, and one timestamp. The integer and the text field identify the remote system. The timestamp identifies when the data was produced on the remote system. The remaining integers are all either:
a) cumulative event counters for a time period (usually 15 minutes) defined by the timestamp, or
b) rarely, a cumulative time value (in seconds) for an event in the time period, which may need to be divided by the value of an event counter and then averaged over many time periods.

If two remote systems upload data into tables with the same names, the columns with the same names mean the same things.
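To make the upload format concrete, here is a small sketch of rendering rows into the newline-delimited, comma-separated text that COPY ... FROM stdin expects. All names here (format_copy_rows, the sample column values) are illustrative, not part of the real system; \N stands for NULL as in PostgreSQL's COPY format.

```python
def format_copy_rows(rows, null_marker="\\N"):
    """Render rows (lists of values) as COPY-style input text.

    None becomes the NULL marker; everything else is str()-ed.
    Assumes values contain no commas or newlines, which holds for the
    identifier/timestamp/small-integer rows described above.
    """
    lines = []
    for row in rows:
        lines.append(",".join(null_marker if v is None else str(v)
                              for v in row))
    return "\n".join(lines) + "\n"

# Two illustrative 15-minute samples from one source; None marks a
# counter this system does not define.
sample = [
    ["node-a", "1999-03-01 12:15:00", 0, 42, 117],
    ["node-a", "1999-03-01 12:30:00", 3, None, 58301],
]
text = format_copy_rows(sample)
```

Each line of `text` is one row, ready to be streamed to the backend after the COPY statement.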
Some remote systems may have more columns defined in certain tables than others. Any columns which a system doesn't define should be set to NULL. The ordering of some columns in the source files may differ between remote systems, and may change over time. The source files have all columns tagged with names, so this isn't a problem in interpreting the source data.

If a remote system's data source files contain a column name that has not been previously used, the column should be added to the table. It is likely that all systems will want to send data containing new columns at around the same time, but the 'lag' between different systems starting to populate new columns with data may vary between minutes and weeks.

If a remote system wants to insert data into a table that has not been previously used, the table should also be created automatically. The table names and required column names can be derived from the data source file(s) on the remote system. As with the 'new columns' case, it is likely that many of the systems will want to upload data into the new tables at around the same time, but the lag between different systems wanting to populate data in the new tables may again vary.

Additions of new columns to tables and new tables should not require manual intervention on the database server. None of the remote systems will ever drop tables or want to remove columns.

The server will be used to run reports on the aggregated values of the tables created. Obviously, the reports will need to be manually updated to take account of new tables and columns, but the data collection stage should not require manual intervention to collect and aggregate data for new columns.

None of the remote systems should be required to know that any of the other remote systems exist, nor should they be required to be able to communicate with those systems. (Obviously all the systems can communicate with the central DB server.)
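A minimal sketch of the "create tables and add columns automatically" requirement, using Python's sqlite3 so it runs self-contained; against PostgreSQL the same CREATE TABLE / ALTER TABLE statements would go through whatever client library is in use, and the catalog would be inspected instead of PRAGMA table_info. All identifiers (ensure_table, event_counts, the column names) are made up for illustration. Note that columns added via ALTER TABLE default to NULL for existing rows, which matches the requirement above.

```python
import sqlite3

def ensure_table(conn, table, columns):
    """Create `table` if missing, then add any of `columns` it lacks.

    `columns` maps column name -> SQL type, as derived from the tags
    in a remote system's data source file.
    """
    cols_sql = ", ".join(f"{n} {t}" for n, t in columns.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols_sql})")
    # Find which columns already exist (row[1] is the column name).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for name, sqltype in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype}")

conn = sqlite3.connect(":memory:")
# First system to upload creates the table...
ensure_table(conn, "event_counts", {
    "system_id": "INTEGER", "system_name": "TEXT",
    "sampled_at": "TIMESTAMP", "calls_offered": "INTEGER",
})
# ...a later system defines one extra counter, which gets added.
ensure_table(conn, "event_counts", {
    "system_id": "INTEGER", "system_name": "TEXT",
    "sampled_at": "TIMESTAMP", "calls_offered": "INTEGER",
    "calls_abandoned": "INTEGER",
})
cols = [row[1] for row in conn.execute("PRAGMA table_info(event_counts)")]
```

(This sketch sidesteps the concurrent-creation race, which is exactly the problem discussed next.)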
What that means is that remote systems shouldn't need to know what columns/tables other remote systems are uploading data to. Each system will, however, have a unique identifier which should not change over time. The identifier might be changed at some stage, but manual intervention to reconcile the data is acceptable under that circumstance.

The problem that I am encountering in designing this is as follows: each remote system cannot assume that anybody else is going to create a new table/column for it. It can do one of two things:

1. Work out that a new table/column is required by its data source files and try to create it. If the creation fails, it can check whether the column/table already exists and blithely ignore the error.

2. If its data source files require a table/column, check the database system tables to see if the table/column already exists; if not, it can try to create the table. (If another system tries to create the table at the same time, the create will fail on one of them, so we have to ignore the 'table/column exists' error in this case, too.)

The problem is that the system must have some understanding of the returned error message to be able to do this. (I want to keep the mechanism as general as possible so it can be ported to different DB backends easily.) Suggestions, anyone?

The final problem, related to COPY table FROM stdin vs. INSERT statements, is as follows: data uploads from a system must be done in transactions, so the system can safely delete any data source files it knows to have been successfully uploaded. It must continue retrying ones it has failed on. Is COPY transaction-safe? Can you do:

    BEGIN;
    COPY blah FROM stdin;
    <rows>
    <failure>
    ABORT;

and no rows would be added? Would it fail if another connected DB frontend started adding columns to tables that were being copied to?
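Two self-contained sketches of the points above, using Python's sqlite3 so they run anywhere; against PostgreSQL the same SQL would go through a client library, and the error text would differ, which is exactly the portability worry just raised. All identifiers are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit txn control

# --- Option 1: try the CREATE and tolerate "already exists" -------------
def create_table_tolerant(conn, ddl):
    """Run a CREATE TABLE, succeeding even if a peer system won the race."""
    try:
        conn.execute(ddl)
        return "created"
    except sqlite3.OperationalError as exc:
        # Matching on backend-specific error text is the non-portable part.
        if "already exists" in str(exc):
            return "already existed"
        raise

first = create_table_tolerant(conn, "CREATE TABLE uploads (n INTEGER)")
second = create_table_tolerant(conn, "CREATE TABLE uploads (n INTEGER)")

# --- Transactional upload: roll back everything on failure --------------
try:
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO uploads VALUES (?)", [(1,), (2,), (3,)])
    raise ConnectionError("simulated WAN failure mid-upload")
except ConnectionError:
    conn.rollback()  # no rows survive, so the source file is retried later

remaining = conn.execute("SELECT COUNT(*) FROM uploads").fetchone()[0]
```

The first demo is option 1 above; the second shows the retry-safe shape the upload needs regardless of whether COPY or INSERT carries the rows.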
The obvious way to go on this is to use INSERT. The first problem is that systems cannot assume anything about the ordering of columns in a table, since that depends on the order the columns were added in (which could vary depending on which system gets to add a column first). This could be a big problem for using COPY as well. So you just specify the column names in the INSERT (you can't do that with a COPY).

The problem with that is that it increases the upload data size by at least 4 times, since you specify the column names for _every_ insert, and the column names are expected to be quite long. (Even 30 chars might be a limitation here, since the data source files allow 256 characters and more, as the programs that generate the data source files generate column names from internal variable names, and those are allowed to be quite long. In fact, the people who design the programs that generate the data source files are _encouraged_ to use long variable names; the variables are listed in a GUI pick-list, so the typing overhead for using a variable once entered is minimal, and the long names give a better indication of what they're good for.)

BTW, my tests show that in a transaction, new column names are not invisible to a transaction if added by another front-end process during the transaction...

What do people suggest I do here?

Thanks,

Dirk

Dirk Niggemann
[EMAIL PROTECTED]
************
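P.S. To make the column-name overhead concrete, here is a quick back-of-envelope sketch comparing one row rendered as COPY input against the same row as an INSERT that spells out every column name. The table and column names are made up, but sized like the long variable-derived names described above.

```python
# Hypothetical long column names of the kind the source programs generate.
columns = [
    "remote_system_identifier", "remote_system_name", "sample_timestamp",
    "incoming_call_attempts_counter", "abandoned_call_attempts_counter",
]
# One sample row; strings carry SQL quoting for the INSERT form.
row = ["17", "'node-a'", "'1999-03-01 12:15:00'", "42", "3"]

# COPY form: bare comma-separated values, one line per row.
copy_line = ",".join(v.strip("'") for v in row) + "\n"

# INSERT form: every statement repeats the full column list.
insert_stmt = "INSERT INTO call_stats ({}) VALUES ({});\n".format(
    ", ".join(columns), ", ".join(row))

overhead = len(insert_stmt) / len(copy_line)  # bytes-per-row ratio
```

With only five columns the ratio is already well over 4x; with the 20-40 counters per row described above, the column lists dominate the upload.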