Data transformation during a data load is a recurring requirement. The fact
that users otherwise have to run mass updates *after* the load completes to
mend the data to their liking should be reason enough to support this while
the load is happening. To go about it the right way, I think we should
support the following:
* The ability to provide per-column transformation expressions.
* The ability to use any kind of expression in the transformation.
The transformation expression should be any expression (basically anything
ExecEvalExpr can handle) that can be evaluated to produce a resulting value
and a corresponding is_null flag. It could use built-in functions (e.g.
UPPER, TRIM, TO_CHAR, TO_NUMBER, etc.) or user-defined functions.
* The transformation expression can refer to other columns involved in the
load, so that when the current row is extracted from the input file, its
current values are used to compute the new resulting values before doing a
heap_form_tuple. E.g.
(col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
"UPPER(col1 || col3)", ...)
I have spent some thoughts on how to do this and will be happy to share
the
same if the list is interested. Personally, I think data transformation
using such expressions is a pretty powerful and important activity while
doing the data load itself.
Well, since COPY is about as fast as INSERT INTO ... SELECT plus the parsing
overhead, I suggest instead adding a special SELECT form that can read from a
file instead of a table, returns tuples, and can therefore be used and abused
to the user's liking (a staging-table sketch of that equivalence follows the
list below). This is a much more powerful feature because:
- there is almost no new syntax
- it is much simpler for the user
- lots of existing machinery can be leveraged
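For reference, the equivalence between COPY and INSERT INTO ... SELECT can
already be exploited today with a staging table: COPY the raw file in, then
transform with plain SQL. A minimal sketch (the table, column and file names
are made up, and the file must be readable by the server):

CREATE TEMP TABLE mydump_staging (id integer, date text);
COPY mydump_staging FROM '/path/to/dump.txt';

INSERT INTO mytable (id, date)
SELECT id, NULLIF(date, '0000-00-00')::date
FROM mydump_staging;

What follows removes the extra write of the staging table by letting a SELECT
read the file directly.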
EXAMPLE:
Suppose I want to import a MySQL dump file (gasp!) which obviously contains
lots of crap like 0000-00-00 dates, '' instead of NULL, broken foreign keys,
etc.
Let's have a new command:

CREATE FLATFILE READER mydump (
    id   INTEGER,
    date TEXT,
    ...
) FROM file 'dump.txt'
(followed by delimiter specification syntax identical to COPY, etc.);
This command would create a set-returning function which is basically a
wrapper around the existing parser in COPY. The column definitions give a
name and type to each field in the text file, telling the parser what to
expect and what to return. It looks like a table definition, and that is
quite natural: the reader is, after all, very close to a table.

INSERT INTO mytable (id, date, ...)
SELECT id, NULLIF(date, '0000-00-00'), ...
FROM mydump
WHERE (FK checks here, dropping the broken records);
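To flesh out that WHERE clause: one concrete way to drop the rows with broken
foreign keys is a semi-join (EXISTS) against the referenced table. A sketch,
with a made-up referenced table "users" and column "user_id":

INSERT INTO mytable (id, date, user_id)
SELECT d.id,
       NULLIF(d.date, '0000-00-00')::date,
       d.user_id
FROM mydump d
WHERE EXISTS (SELECT 1 FROM users u WHERE u.id = d.user_id);

A plain JOIN works just as well; the point is that any SQL you could write
against a table also works against the reader.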
Now I can import data and transform it at will using a simple SELECT. The
advantage is that everybody will know what to do without learning a new
command or awkward syntax (transform...): you can combine columns in
expressions, JOIN to check FKs, use ORDER BY to get a clustered table,
anything you want, without any extension to the Postgres engine beyond the
creation of this file-parsing set-returning function, which should be pretty
simple.
Or, if I have a few gigabytes of logs that I am not interested in inserting
into a table at all, because I only want some statistics (or perhaps to
insert into my table some aggregate computed from this data), I would just:

CREATE FLATFILE READER accesses_dump (
    date TEXT,
    ip   INET,
    ...
) FROM file 'web_server_log.txt';
And I can compute some stats without even loading the data:

SELECT ip, count(*)
FROM accesses_dump
GROUP BY ip
HAVING count(*) > 1000
ORDER BY count(*);

Much better than having to load those gigabytes just to run one query on
them...