Data transformation during a data load is a recurring requirement. The fact
that users otherwise have to run mass updates *after* the load completes to
mend the data to their liking should be reason enough to support transformation
while the load is happening. To go about it the right way, I think we should
support the following:

* The ability to provide per-column transformation expressions
* The ability to use any kind of expression in the transformation. The
transformation expression should be any expression (basically anything
ExecEvalExpr can evaluate) that produces a resulting value and, obviously,
a corresponding is_null value too. It could use system built-in functions
(e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER, etc.) or user-defined functions.
* The ability for a transformation expression to refer to other columns
involved in the load, so that when the current row is extracted from the
input file, the current values are used to compute the new resulting values
before doing a heap_form_tuple. E.g.
(col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
"UPPER(col1 || col3)", ...)
I have given some thought to how to do this and will be happy to share it if
the list is interested; a rough sketch of the kind of syntax I have in mind
is below. Personally, I think data transformation using such expressions is
a pretty powerful and important capability to have during the load itself.
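
To make that concrete, here is a minimal sketch of what such a COPY
invocation might look like. This is purely hypothetical syntax: the TRANSFORM
keyword, the quoting of the expressions, and the table and file names are all
made up for illustration.

-- Hypothetical syntax: apply per-column expressions while loading.
-- Expressions may refer to other columns of the incoming row.
COPY mytable (
        col1 TRANSFORM 'col1 + 10',             -- adjust a numeric column
        col2 TRANSFORM 'col1 * col2',           -- combine two input columns
        col3 TRANSFORM 'UPPER(col1 || col3)'    -- built-in or user function
)
FROM '/path/to/input.csv' WITH CSV;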


Well, since COPY is about as fast as INSERT INTO ... SELECT plus the parsing overhead, I suggest adding a special SELECT form that can read from a file instead of a table, returns tuples, and can therefore be used and abused to the user's liking. This is a much more powerful feature because:

        - there is almost no new syntax
        - it is much simpler for the user
        - lots of existing stuff can be leveraged

        EXAMPLE:

Suppose I want to import a MySQL dump file (gasp!) which obviously contains lots of crap like 0000-00-00 dates, '' instead of NULL, broken foreign keys, etc.

        Let's have a new command:

CREATE FLATFILE READER mydump (
        id      INTEGER,
        date    TEXT,
        ...
) FROM file 'dump.txt'
(followed by delimiter specification syntax identical to COPY, etc)
;

This command would create a set-returning function which is basically a wrapper around the existing parser in COPY. The column definitions give a name and type to the fields in the text file and tell the parser what to expect and what to return. It looks like a table definition, and that is actually pretty normal: it is, after all, very close to a table.

INSERT INTO mytable (id, date, ...) SELECT id, NULLIF(date, '0000-00-00'), ... FROM mydump WHERE (FKs check out and the broken records are dropped);

Now I can import data and transform it at will using a simple SELECT. The advantage is that everybody will know what to do without learning a new command: there is no awkward syntax (transform ...), you can combine columns in expressions, JOIN to check FKs, use ORDER BY to get a clustered table, anything you want, without any extension to the Postgres engine besides this file-parsing set-returning function, which should be pretty simple.
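
For comparison, here is roughly what the same job looks like today with a
temporary staging table, i.e. COPY followed by INSERT INTO ... SELECT. The
names mydump_raw, parent_id, parent_table and the mytable columns are all
placeholders invented for this sketch; the point is only that the proposed
reader removes the intermediate table, not that the transformation is
impossible now.

-- Today's workaround: stage the raw text, then transform with plain SQL.
CREATE TEMP TABLE mydump_raw (id INTEGER, date TEXT, parent_id INTEGER);
COPY mydump_raw FROM '/path/to/dump.txt';

INSERT INTO mytable (id, date, parent_id)
SELECT id, NULLIF(date, '0000-00-00'), parent_id
FROM mydump_raw r
WHERE EXISTS (SELECT 1 FROM parent_table p   -- drop rows whose FK is broken
              WHERE p.id = r.parent_id);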

Or, if I have a few gigabytes of logs that I am absolutely not interested in inserting into a table, because I only want to compute some statistics, or perhaps insert into my table some aggregate computed from this data, I would just:

CREATE FLATFILE READER accesses_dump (
        date    TEXT,
        ip      INET,
        ...
) FROM file 'web_server_log.txt';

        And I can do some stats without even loading the data:

SELECT ip, count(*) FROM accesses_dump GROUP BY ip HAVING count(*) > 1000 ORDER BY count(*);

Much better than having to load those gigabytes just to make a query on them...
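
The aggregate-insert case mentioned above is just as direct. Assuming the
hypothetical accesses_dump reader and an equally hypothetical summary table
ip_hits, one could write:

-- Aggregate straight from the log file into a summary table,
-- without ever materializing the raw log rows.
INSERT INTO ip_hits (ip, hits)
SELECT ip, count(*)
FROM accesses_dump
GROUP BY ip;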
