On 03/30/2015 01:43 PM, Guillaume Drolet wrote:
Thanks Adrian,

In option 2, when you say "this is automated in an external Python
script", do you mean that you use something like psycopg2 to perform the
queries against the database (e.g. for comparing data in the holding
table with the older table)?

Yes. Basically I use the dbf module I mentioned previously to read the DBF files and output the data I need, writing it into an in-memory CSV file, which I then load into the Postgres database with the psycopg2 COPY functions. The files I am working with are not as large as the ones you mention, so doing all this in memory is workable. The script then calls a series of user functions in Postgres to do the comparing and manipulating. At the time I did this, plpythonu was less featured than it is now, so doing what I wanted made more sense in an external script. Also, the script pulls the DBF files from elsewhere, and I felt more comfortable doing that outside the database than in it.
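A minimal sketch of that pipeline, with hypothetical table and column names: the dbf-module read is represented by a plain list of record dicts, and psycopg2 is only imported inside the loader so the CSV step stands on its own.

```python
import csv
import io

def records_to_csv(records, columns):
    """Write records (dicts) into an in-memory CSV buffer, in column order."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for rec in records:
        writer.writerow([rec[col] for col in columns])
    buf.seek(0)
    return buf

def copy_into_postgres(conn, table, columns, buf):
    """Bulk-load the CSV buffer with COPY (far faster than row-by-row INSERTs)."""
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
                table, ", ".join(columns)),
            buf,
        )
    conn.commit()

# In the real script these records come from reading the DBF file
# with the dbf module; the names here are made up for illustration.
records = [
    {"id_pet_mes": 1, "etage": "A", "type_couv": "R"},
    {"id_pet_mes": 2, "etage": "B", "type_couv": "F"},
]
buf = records_to_csv(records, ["id_pet_mes", "etage", "type_couv"])
```

After building the buffer, `copy_into_postgres(conn, "naipf_holding", [...], buf)` would stream it into a holding table over an open psycopg2 connection.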

Though more and more I seem to be using pandas (pandas.pydata.org) to do data conversions. It saves a lot of the steps in the above. In this case you would still need to get the data out of the DBF files first.
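As an illustration of the pandas route, a sketch with hypothetical column names: an outer merge with `indicator=True` flags which rows exist only in the new snapshot (in practice both DataFrames would be built from the DBF records).

```python
import pandas as pd

# Old and new snapshots of the same table (made-up data for illustration).
old = pd.DataFrame({"id_pet_mes": [1, 2], "etage": ["A", "B"], "hauteur": [10, 20]})
new = pd.DataFrame({"id_pet_mes": [2, 3], "etage": ["B", "C"], "hauteur": [25, 30]})

# Merging on all columns marks each row left_only / right_only / both.
merged = old.merge(new, how="outer", indicator=True)

# Rows present only in the new file: additions or changed versions to inspect/load.
to_load = merged[merged["_merge"] == "right_only"].drop(columns="_merge")
```

Here row (2, B) changed its hauteur and row (3, C) is new, so both land in `to_load` while the unchanged rows drop out.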


Thanks.

2015-03-30 9:53 GMT-04:00 Adrian Klaver <adrian.kla...@aklaver.com>:

    On 03/30/2015 06:04 AM, Guillaume Drolet wrote:

        Hello,

        I need your help speeding up the procedure I will explain below.
        I am looking for improvements to my method or different
        approaches/ideas that would help in this matter.

        I have a set of DBF files that I load into my database using a
        plpython function and a call to ogr2ogr
        (http://www.gdal.org/drv_pg.html). Once in a while, I'll have to
        load updated versions of these tables to get the latest additions
        and possible corrections to older versions.

        In my plpython script, if a table is loaded for the first time, I
        first load it empty, then I create a trigger function on insert
        (execute on row) that will check for duplicates on each insert.
        Depending on the type of data I load, my trigger first checks for
        equality in a subset of columns (between 1 and 3 columns that
        would be like my primary key(s)) and if true, I check if all
        columns are equal between NEW and the matching row from my table.
        When this condition is true, I return NULL; else I store the rows
        (i.e. NEW.* and the matching row(s)) in a new table called
        "duplicate" for further manual investigation. Here's an example
        for one table:

        CREATE OR REPLACE FUNCTION check_naipf_insert()
          RETURNS trigger AS
        ' BEGIN
            IF EXISTS (SELECT 1
                       FROM public.naipf
                       WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
                       AND etage IS NOT DISTINCT FROM NEW.etage) THEN
              IF EXISTS (SELECT 1
                         FROM public.naipf
                         WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
                         AND etage IS NOT DISTINCT FROM NEW.etage
                         AND type_couv IS NOT DISTINCT FROM NEW.type_couv
                         AND densite IS NOT DISTINCT FROM NEW.densite
                         AND hauteur IS NOT DISTINCT FROM NEW.hauteur
                         AND cl_age IS NOT DISTINCT FROM NEW.cl_age) THEN
                RETURN NULL;
              ELSE
                INSERT INTO public.duplic_naipf SELECT NEW.*;
                INSERT INTO public.duplic_naipf
                  (SELECT *
                   FROM public.naipf
                   WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
                   AND etage IS NOT DISTINCT FROM NEW.etage);
                RETURN NULL;
              END IF;
            END IF;
            RETURN NEW;
          END; '
          LANGUAGE plpgsql VOLATILE COST 100;

        CREATE TRIGGER check_insert_naipf
            BEFORE INSERT
            ON public.pet4_naipf
            FOR EACH ROW
            EXECUTE PROCEDURE check_naipf_insert();

        (in this case, duplicate rows that need investigation are rows
        that may have changed relative to an older version of the DBF
        file but show no change in what I call their primary keys; they
        are not really PKs since I don't want to raise errors at loading)

        Once this is done, ogr2ogr is called a second time to load the
        data. It is quite fast for small tables (tens of thousands of
        rows, tens of columns) but for large tables it takes forever. For
        example, I started loading a table with 3.5 million rows/33
        columns last Friday at 3 PM and now, Monday morning at 9, some
        3 million rows have been loaded.

        My question is: what other approaches would make this procedure
        faster? How is this kind of task usually implemented in
        PostgreSQL? Would it be better to load everything with no check
        and then apply some functions to find duplicate rows (although
        this would involve more manual work)?


    I guess it depends on what the end purpose of the above is. If you
    are just trying to keep relatively up-to-date information from the
    DBF sources, would it not be easier just to load them into a new
    table?

    So, where existing table is some_dbf_data:

    1) CREATE TABLE new_some_dbf_data(...)
    2) Dump the DBF file into new_some_dbf_data
    3) In a transaction, rename/drop some_dbf_data and rename
       new_some_dbf_data to some_dbf_data
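The steps above, and step 3 in particular, can be sketched like this, assuming psycopg2 and the table names from the list (the statement-building is kept separate so it can be shown without a live database):

```python
def swap_table_sql(live, staging):
    """Build the statements for an atomic swap: drop the live table,
    then rename the freshly loaded staging table into its place."""
    return [
        "DROP TABLE IF EXISTS {}".format(live),
        "ALTER TABLE {} RENAME TO {}".format(staging, live),
    ]

def swap_tables(conn, live, staging):
    """Run the swap inside one transaction so readers never see a gap."""
    with conn:  # psycopg2: commits on success, rolls back on error
        with conn.cursor() as cur:
            for stmt in swap_table_sql(live, staging):
                cur.execute(stmt)

stmts = swap_table_sql("some_dbf_data", "new_some_dbf_data")
```

Because both DDL statements run in a single transaction, concurrent queries see either the old table or the new one, never a missing table.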


    Option 2 is what I do for a similar procedure:

    1) Dump the DBF data into a holding table.
    2) Use SQL in function(s) to compare the old/new tables and make
    appropriate adjustments. Doing the SQL in bulk is a lot faster than
    checking each row, or at least that is what I found. In any case the
    way you are doing it looks to involve 3.5 million inserts with a
    trigger action on each, and that is bound to be slow :)
    3) This is automated in an external Python script.
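The bulk comparison in step 2 can be sketched in plain Python with hypothetical column names, mirroring the trigger's logic set-wise: a row whose quasi-key matches an old row with identical values is skipped, a matching key with changed values goes to a duplicates pile, and everything else is new.

```python
def classify_rows(old_rows, new_rows, key_cols):
    """Split new_rows into (to_insert, duplicates) against old_rows.

    old_rows/new_rows are lists of dicts; key_cols play the role of
    the quasi-primary-key columns from the trigger example."""
    old_by_key = {tuple(r[c] for c in key_cols): r for r in old_rows}
    to_insert, duplicates = [], []
    for row in new_rows:
        key = tuple(row[c] for c in key_cols)
        match = old_by_key.get(key)
        if match is None:
            to_insert.append(row)            # genuinely new row
        elif match == row:
            continue                         # identical row: nothing to do
        else:
            duplicates.append((row, match))  # same key, changed values
    return to_insert, duplicates

old = [{"id_pet_mes": 1, "etage": "A", "hauteur": 10}]
new = [
    {"id_pet_mes": 1, "etage": "A", "hauteur": 10},  # identical: skipped
    {"id_pet_mes": 1, "etage": "A", "hauteur": 99},  # same key, changed: duplicate
    {"id_pet_mes": 2, "etage": "B", "hauteur": 20},  # new key: insert
]
inserts, dups = classify_rows(old, new, ["id_pet_mes", "etage"])
```

In practice the same classification would be done in SQL against the holding table (one set-based query instead of millions of per-row trigger firings), which is where the speedup comes from.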

    Option 3:

    Use dbf (https://pypi.python.org/pypi/dbf/0.88.16) and do the
    comparisons in the DBF files outside Postgres, importing only what
    has changed.
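A sketch of option 3, with the DBF reads represented as plain record tuples (in a real script each list would come from reading a file with the dbf module; the snapshots here are made up):

```python
def changed_records(old_records, new_records):
    """Return records present in the new DBF snapshot but not in the
    old one, i.e. additions and modified rows; only these need importing."""
    old_set = set(old_records)
    return [rec for rec in new_records if rec not in old_set]

# Hypothetical snapshots of the same file at two points in time.
old_snapshot = [(1, "A", 10), (2, "B", 20)]
new_snapshot = [(1, "A", 10), (2, "B", 25), (3, "C", 30)]
delta = changed_records(old_snapshot, new_snapshot)
```

Only the rows in `delta` (the changed row for key 2 and the new row for key 3) would then be loaded into Postgres, keeping the import small.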





        Thanks a lot for your help!






    --
    Adrian Klaver
    adrian.kla...@aklaver.com




--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
