Hi,

I have some questions about how to efficiently upload fairly large amounts of data into PostgreSQL remotely.

The tables I will be loading consist of rows containing:
a) 1 character column of no more than 20 or 30 characters,
b) a timestamp, and
c) lots of positive integers (20-40 of them). The integers will usually have small values: many will be 0, most will have 2-3 digit values, and some will have 5-6 digit values.

The biggest uploads will consist of about 4500 rows/day from a single source. There may be up to 50-60 sources (so yes, the system could generate up to 2-3 million rows/week), but it is intended that the data be automatically aggregated by day/week/month, etc. The aggregate functions would generally be SUM and AVG.

The uploads will occur over unreliable WAN links, where bandwidth is at a premium.

The obvious way to do this is

    COPY TableName FROM stdin USING DELIMITERS ','

and send a newline-delimited list of comma-separated values. For integers, the resulting data isn't very big (well, 330K for ~4300 rows).

The problem is the following: the remote systems determine the nature of the table metadata, e.g. the table and column names are derived automatically from the source files for these uploads.

The assumptions you can make here are as follows:

The remote systems' data source files will always contain 1 integer, 1 text field, and one timestamp. The integer and the text field identify the remote system. The timestamp identifies when the data was produced on the remote system. The remaining integers are all either:
a) cumulative event counters for a time period (usually 15 minutes) defined by the timestamp, or
b) rarely, a cumulative time value (in seconds) for an event in the time period, which may need to be divided by the value of an event counter and then averaged over many time periods.

If two remote systems upload data into tables with the same names, the columns with the same names mean the same things.
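To make the upload format concrete, here is a small sketch of rendering rows into the newline-delimited, comma-separated text that COPY ... FROM stdin expects. All names here (format_copy_rows, the sample column values) are illustrative, not part of the real system; \N stands for NULL as in PostgreSQL's COPY format.

```python
def format_copy_rows(rows, null_marker="\\N"):
    """Render rows (lists of values) as COPY-style input text.

    None becomes the NULL marker; everything else is str()-ed.
    Assumes values contain no commas or newlines, which holds for the
    identifier/timestamp/small-integer rows described above.
    """
    lines = []
    for row in rows:
        lines.append(",".join(null_marker if v is None else str(v)
                              for v in row))
    return "\n".join(lines) + "\n"

# Two illustrative 15-minute samples from one source; None marks a
# counter this system does not define.
sample = [
    ["node-a", "1999-03-01 12:15:00", 0, 42, 117],
    ["node-a", "1999-03-01 12:30:00", 3, None, 58301],
]
text = format_copy_rows(sample)
```

Each line of `text` is one row, ready to be streamed to the backend after the COPY statement.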
Some remote systems may have more columns defined in certain tables than others. Any columns which a system doesn't define should be set to NULL. The ordering of some columns in the source files may differ between remote systems, and may change over time. The source files have all columns tagged with names, so this isn't a problem in interpreting the source data.

If a remote system's data source files contain a column name that has not been previously used, the column should be added to the table. It is likely that all systems will want to send data containing new columns at around the same time, but the 'lag' between different systems starting to populate new columns with data may vary between minutes and weeks.

If a remote system wants to insert data into a table that has not been previously used, the table should also be created automatically. The table names and required column names can be derived from the data source file(s) on the remote system. As with the 'new columns' case, it is likely that many of the systems will want to upload data into the new tables at around the same time, but the lag between different systems wanting to populate data in the new tables may again vary.

Additions of new columns to tables and new tables should not require manual intervention on the database server. None of the remote systems will ever drop tables or want to remove columns.

The server will be used to run reports on the aggregated values of the tables created. Obviously, the reports will need to be manually updated to take account of new tables and columns, but the data collection stage should not require manual intervention to collect and aggregate data for new columns.

None of the remote systems should be required to know that any of the other remote systems exist, nor should they be required to be able to communicate with those systems. (Obviously all the systems can communicate with the central DB server.)
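A minimal sketch of the "create tables and add columns automatically" requirement, using Python's sqlite3 so it runs self-contained; against PostgreSQL the same CREATE TABLE / ALTER TABLE statements would go through whatever client library is in use, and the catalog would be inspected instead of PRAGMA table_info. All identifiers (ensure_table, event_counts, the column names) are made up for illustration. Note that columns added via ALTER TABLE default to NULL for existing rows, which matches the requirement above.

```python
import sqlite3

def ensure_table(conn, table, columns):
    """Create `table` if missing, then add any of `columns` it lacks.

    `columns` maps column name -> SQL type, as derived from the tags
    in a remote system's data source file.
    """
    cols_sql = ", ".join(f"{n} {t}" for n, t in columns.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols_sql})")
    # Find which columns already exist (row[1] is the column name).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for name, sqltype in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype}")

conn = sqlite3.connect(":memory:")
# First system to upload creates the table...
ensure_table(conn, "event_counts", {
    "system_id": "INTEGER", "system_name": "TEXT",
    "sampled_at": "TIMESTAMP", "calls_offered": "INTEGER",
})
# ...a later system defines one extra counter, which gets added.
ensure_table(conn, "event_counts", {
    "system_id": "INTEGER", "system_name": "TEXT",
    "sampled_at": "TIMESTAMP", "calls_offered": "INTEGER",
    "calls_abandoned": "INTEGER",
})
cols = [row[1] for row in conn.execute("PRAGMA table_info(event_counts)")]
```

(This sketch sidesteps the concurrent-creation race, which is exactly the problem discussed next.)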
What that means is that remote systems shouldn't need to know what columns/tables other remote systems are uploading data to. Each system will, however, have a unique identifier which should not change over time. The identifier might be changed at some stage, but manual intervention to reconcile the data is acceptable under that circumstance.

The problem that I am encountering in designing this is as follows: each remote system cannot assume that anybody else is going to create a new table/column for it. It can do one of two things:

1. Work out that a new table/column is required by its data source files and try to create it. If the creation fails, it can check whether the column/table already exists and blithely ignore the error.

2. If its data source files require a table/column, check the database system tables to see if the table/column already exists; if not, it can try to create the table. (If another system tries to create the table at the same time, the create will fail on one of them, so we have to ignore the 'table/column exists' error in this case, too.)

The problem is that the system must have some understanding of the returned error message to be able to do this. (I want to keep the mechanism as general as possible so it can be ported to different DB backends easily.) Suggestions, anyone?

The final problem, related to COPY table FROM stdin vs. INSERT statements, is as follows: data uploads from a system must be done in transactions, so the system can safely delete any data source files it knows to have been successfully uploaded. It must continue retrying ones it has failed on. Is COPY transaction-safe? Can you do:

    BEGIN;
    COPY blah FROM stdin;
    <rows>
    <failure>
    ABORT;

and no rows would be added? Would it fail if another connected DB frontend started adding columns to tables that were being copied to?
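Two self-contained sketches of the points above, using Python's sqlite3 so they run anywhere; against PostgreSQL the same SQL would go through a client library, and the error text would differ, which is exactly the portability worry just raised. All identifiers are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit txn control

# --- Option 1: try the CREATE and tolerate "already exists" -------------
def create_table_tolerant(conn, ddl):
    """Run a CREATE TABLE, succeeding even if a peer system won the race."""
    try:
        conn.execute(ddl)
        return "created"
    except sqlite3.OperationalError as exc:
        # Matching on backend-specific error text is the non-portable part.
        if "already exists" in str(exc):
            return "already existed"
        raise

first = create_table_tolerant(conn, "CREATE TABLE uploads (n INTEGER)")
second = create_table_tolerant(conn, "CREATE TABLE uploads (n INTEGER)")

# --- Transactional upload: roll back everything on failure --------------
try:
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO uploads VALUES (?)", [(1,), (2,), (3,)])
    raise ConnectionError("simulated WAN failure mid-upload")
except ConnectionError:
    conn.rollback()  # no rows survive, so the source file is retried later

remaining = conn.execute("SELECT COUNT(*) FROM uploads").fetchone()[0]
```

The first demo is option 1 above; the second shows the retry-safe shape the upload needs regardless of whether COPY or INSERT carries the rows.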
The obvious way to go on this is to use INSERT. The first problem is that systems cannot assume anything about the ordering of columns in a table, since that depends on the order the columns were added in (which could vary depending on which system gets to add a column first). This could be a big problem for using COPY as well. So you just specify the column names in the INSERT (you can't do that with a COPY).

The problem with that is that it increases the upload data size by at least 4 times, since you specify the column names for _every_ insert, and the column names are expected to be quite long. (Even 30 chars might be a limitation here, since the data source files allow 256 characters and more, as the programs that generate the data source files generate column names from internal variable names, and those are allowed to be quite long. In fact, the people who design the programs that generate the data source files are _encouraged_ to use long variable names; the variables are listed in a GUI pick-list, so the typing overhead for using a variable once entered is minimal, and the long names give a better indication of what they're good for.)

BTW, my tests show that in a transaction, new column names are not invisible to a transaction if added by another front-end process during the transaction...

What do people suggest I do here?

Thanks,

Dirk

Dirk Niggemann
[EMAIL PROTECTED]
************
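P.S. To make the column-name overhead concrete, here is a quick back-of-envelope sketch comparing one row rendered as COPY input against the same row as an INSERT that spells out every column name. The table and column names are made up, but sized like the long variable-derived names described above.

```python
# Hypothetical long column names of the kind the source programs generate.
columns = [
    "remote_system_identifier", "remote_system_name", "sample_timestamp",
    "incoming_call_attempts_counter", "abandoned_call_attempts_counter",
]
# One sample row; strings carry SQL quoting for the INSERT form.
row = ["17", "'node-a'", "'1999-03-01 12:15:00'", "42", "3"]

# COPY form: bare comma-separated values, one line per row.
copy_line = ",".join(v.strip("'") for v in row) + "\n"

# INSERT form: every statement repeats the full column list.
insert_stmt = "INSERT INTO call_stats ({}) VALUES ({});\n".format(
    ", ".join(columns), ", ".join(row))

overhead = len(insert_stmt) / len(copy_line)  # bytes-per-row ratio
```

With only five columns the ratio is already well over 4x; with the 20-40 counters per row described above, the column lists dominate the upload.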