Data transformation during a data load is a recurring requirement. The fact
that users otherwise have to run mass updates *after* the load completes to
mend the data to their liking should be reason enough to support transformation
while the load is happening. To go about it the right way, I think we should
support the following:

* The ability to provide per-column transformation expressions
* The ability to use any kind of expression in the transformation. The
transformation expression should be any expression (basically anything
ExecEvalExpr can evaluate) that produces a resulting value and, obviously,
a corresponding is_null value too. It could use system built-in functions
(e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER, etc.) or user-defined functions.
* The ability for a transformation expression to refer to other columns
involved in the load, so that when the current row is extracted from the
input file, the current values are used to compute the new resulting values
before doing a heap_form_tuple. E.g.
(col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
"UPPER(col1 || col3)", ...)
I have given some thought to how to do this and will be happy to share it if
the list is interested; a rough sketch of the kind of syntax I have in mind
is below. Personally, I think data transformation using such expressions is
a pretty powerful and important capability to have during the load itself.
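
To make that concrete, here is a minimal sketch of what such a COPY
invocation might look like. This is purely hypothetical syntax: the TRANSFORM
keyword, the quoting of the expressions, and the table and file names are all
made up for illustration.

-- Hypothetical syntax: apply per-column expressions while loading.
-- Expressions may refer to other columns of the incoming row.
COPY mytable (
        col1 TRANSFORM 'col1 + 10',             -- adjust a numeric column
        col2 TRANSFORM 'col1 * col2',           -- combine two input columns
        col3 TRANSFORM 'UPPER(col1 || col3)'    -- built-in or user function
)
FROM '/path/to/input.csv' WITH CSV;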


Well, since COPY is about as fast as INSERT INTO ... SELECT plus the parsing overhead, I suggest adding a special SELECT form that can read from a file instead of a table, returns tuples, and can therefore be used and abused to the user's liking. This is a much more powerful feature because:

        - there is almost no new syntax
        - it is much simpler for the user
        - lots of existing stuff can be leveraged

        EXAMPLE:

Suppose I want to import a MySQL dump file (gasp!) which obviously contains lots of crap like 0000-00-00 dates, '' instead of NULL, broken foreign keys, etc.

        Let's have a new command:

CREATE FLATFILE READER mydump (
        id      INTEGER,
        date    TEXT,
        ...
) FROM file 'dump.txt'
(followed by delimiter specification syntax identical to COPY, etc)
;

This command would create a set-returning function which is basically a wrapper around the existing parser in COPY. The column definitions give a name and type to the fields in the text file and tell the parser what to expect and what to return. It looks like a table definition, and that is actually pretty normal: it is, after all, very close to a table.

INSERT INTO mytable (id, date, ...) SELECT id, NULLIF(date, '0000-00-00'), ... FROM mydump WHERE (FKs check out and the broken records are dropped);

Now I can import data and transform it at will using a simple SELECT. The advantage is that everybody will know what to do without learning a new command: there is no awkward syntax (transform ...), you can combine columns in expressions, JOIN to check FKs, use ORDER BY to get a clustered table, anything you want, without any extension to the Postgres engine besides this file-parsing set-returning function, which should be pretty simple.
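
For comparison, here is roughly what the same job looks like today with a
temporary staging table, i.e. COPY followed by INSERT INTO ... SELECT. The
names mydump_raw, parent_id, parent_table and the mytable columns are all
placeholders invented for this sketch; the point is only that the proposed
reader removes the intermediate table, not that the transformation is
impossible now.

-- Today's workaround: stage the raw text, then transform with plain SQL.
CREATE TEMP TABLE mydump_raw (id INTEGER, date TEXT, parent_id INTEGER);
COPY mydump_raw FROM '/path/to/dump.txt';

INSERT INTO mytable (id, date, parent_id)
SELECT id, NULLIF(date, '0000-00-00'), parent_id
FROM mydump_raw r
WHERE EXISTS (SELECT 1 FROM parent_table p   -- drop rows whose FK is broken
              WHERE p.id = r.parent_id);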

Or, if I have a few gigabytes of logs that I am absolutely not interested in inserting into a table, because I only want to compute some statistics, or perhaps insert into my table some aggregate computed from this data, I would just:

CREATE FLATFILE READER accesses_dump (
        date    TEXT,
        ip      INET,
        ...
) FROM file 'web_server_log.txt';

        And I can do some stats without even loading the data:

SELECT ip, count(*) FROM accesses_dump GROUP BY ip HAVING count(*) > 1000 ORDER BY count(*);

Much better than having to load those gigabytes just to make a query on them...
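
The aggregate-insert case mentioned above is just as direct. Assuming the
hypothetical accesses_dump reader and an equally hypothetical summary table
ip_hits, one could write:

-- Aggregate straight from the log file into a summary table,
-- without ever materializing the raw log rows.
INSERT INTO ip_hits (ip, hits)
SELECT ip, count(*)
FROM accesses_dump
GROUP BY ip;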
