Re: [GENERAL] Using the database to validate data

2015-07-27 Thread 林士博
No, when you catch an insert exception, you can save the information you
need to an array or something like that, and continue with the next insert.
In your application, you can write code as follows:

begin transaction
loop over the input rows
    try
        do insert
    catch exception
        save error info
    end try
end loop
if any errors were saved
    print the errors
    rollback transaction
else
    commit transaction
end if

Or you can do that in a PostgreSQL procedure.
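For illustration, a minimal PL/pgSQL sketch of that procedure approach; the table names (staging_t, t) and columns (id, val) are hypothetical placeholders:

-- Sketch only: insert row by row inside one function call, collecting errors
-- instead of aborting. Each BEGIN ... EXCEPTION block uses an implicit
-- savepoint, so one failed INSERT does not poison the rest of the work.
CREATE OR REPLACE FUNCTION load_with_error_report()
RETURNS TABLE (bad_id integer, error_message text) AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM staging_t LOOP
        BEGIN
            INSERT INTO t (id, val) VALUES (r.id, r.val);
        EXCEPTION WHEN OTHERS THEN
            bad_id := r.id;
            error_message := SQLERRM;
            RETURN NEXT;
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Call it inside a transaction; if it returns any rows, report them and
-- ROLLBACK, otherwise COMMIT.
SELECT * FROM load_with_error_report();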




2015-07-24 22:51 GMT+09:00 JPLapham :

> 林士博 wrote
> > If I am following correctly, you can do it in your application as
> > follows.
> > 1. Begin a transaction.
> > 2. Insert each row. Catch any db exception,
> > and save the exception message and other information you need to an array.
> > 3. At the end, you have all the information about the wrong data in the
> > array, if there is any,
> > and then you can decide whether to rollback or to commit.
>
> Yes, I agree that I could do that, which I believe is my "IDEA 1" from my
> original message. This method will naturally work, but it is a very slow
> iterative process because you can only catch the *first* error, after which
> new INSERTs are not allowed. If you have a data input with, say, 1000 records
> and there are 50 errors, it would require 50 iterations of fixing the input
> data and running it again to find them all.
>
>
> 林士博 wrote
> > By the way, this is about programming, not PostgreSQL.
>
> I was hoping that there would be a way to have PostgreSQL run in a mode
> where it allows INSERTs within a transaction even after an error. Naturally,
> when the error condition occurs, COMMIT would not be allowed at the end of
> the transaction block.
>
> But this way, you could collect all the error information in one pass.
> Seemed PostgreSQL-related to me.  :)





Re: [GENERAL] Using the database to validate data

2015-07-27 Thread Adam Brusselback
A little late to the party, but I'll share how I do my data imports /
validation for anyone interested.

I have a bunch of data that comes in from various sources, and it isn't
always guaranteed to be in the correct format, have the right foreign keys,
or even the right data types.

I have a staging table that is in the format of the feed I have coming in,
with all columns text and no constraints at all on the data columns.
Example:

> CREATE TABLE import_sale
> (
>   client_id uuid NOT NULL,
>   row_id uuid NOT NULL DEFAULT gen_random_uuid(),
>   row_date timestamp with time zone NOT NULL DEFAULT now(),
>   file_id uuid NOT NULL,
>   sale_number character varying,
>   company_number character varying,
>   invoice_number character varying,
>   invoice_date character varying,
>   order_date character varying,
>   ship_date character varying,
>   sale_date character varying,
>   product_number character varying,
>   quantity character varying,
>   quantity_uom character varying,
>   price character varying,
>   reduction character varying,
>   direct_indicator character varying,
>   redistributor_company_number character varying,
>   freight numeric,
>   processed_ind boolean DEFAULT false,
>   CONSTRAINT import_sales_client_id_fkey FOREIGN KEY (client_id)
>   REFERENCES client (client_id) MATCH SIMPLE
>   ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT import_sales_file_id_fkey FOREIGN KEY (file_id)
>   REFERENCES import_file (file_id) MATCH SIMPLE
>   ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT import_sales_row_id_unique UNIQUE (row_id)
> );


I use a Talend package, or COPY, to get the data into this table. However
you want to do that is up to you.
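For the COPY route, loading a tab-delimited feed from psql might look like the line below. The file name is a placeholder, and the feed is assumed to already carry client_id and file_id so the NOT NULL columns are satisfied; \copy's default text format expects tab-delimited input.

\copy import_sale (client_id, file_id, sale_number, company_number, invoice_number, invoice_date, order_date, ship_date, sale_date, product_number, quantity, quantity_uom, price, reduction, direct_indicator, redistributor_company_number) from 'sales_feed.tsv'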

I have a final table that I want all this data to eventually get to once
there are no issues with it.  Example:

> CREATE TABLE sale
> (
>   sale_id uuid NOT NULL DEFAULT gen_random_uuid(),
>   client_id uuid NOT NULL,
>   source_row_id uuid NOT NULL,
>   sale_number character varying NOT NULL,
>   company_id uuid NOT NULL,
>   invoice_number character varying NOT NULL,
>   invoice_date date,
>   order_date date,
>   ship_date date,
>   sale_date date NOT NULL,
>   product_id uuid NOT NULL,
>   quantity numeric NOT NULL,
>   uom_type_id uuid NOT NULL,
>   price numeric NOT NULL,
>   reduction numeric NOT NULL,
>   redistributor_company_id uuid,
>   freight numeric,
>   active_range tstzrange DEFAULT tstzrange(now(), NULL::timestamp with
> time zone),
>   CONSTRAINT sale_pkey PRIMARY KEY (sale_id),
>   CONSTRAINT sale_client_id_fkey FOREIGN KEY (client_id)
>   REFERENCES client (client_id) MATCH SIMPLE
>   ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT sale_company_id_fkey FOREIGN KEY (company_id)
>   REFERENCES company (company_id) MATCH SIMPLE
>   ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT sale_product_id_fkey FOREIGN KEY (product_id)
>   REFERENCES product (product_id) MATCH SIMPLE
>   ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT sale_redistributor_company_id_fkey FOREIGN KEY
> (redistributor_company_id)
>   REFERENCES company (company_id) MATCH SIMPLE
>   ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT sale_source_row_id_fkey FOREIGN KEY (source_row_id)
>   REFERENCES import_sale (row_id) MATCH SIMPLE
>   ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT sale_uom_type_id_fkey FOREIGN KEY (uom_type_id)
>   REFERENCES uom_type (uom_type_id) MATCH SIMPLE
>   ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT sale_sale_number_active_range_excl EXCLUDE
>   USING gist (sale_number WITH =, (client_id::character varying) WITH =,
> active_range WITH &&),
>   CONSTRAINT sale_unique UNIQUE (sale_number, client_id, active_range)
> );


I then have a couple of functions which run over the data and do the
validation / insert / update where necessary.

This one validates that the data can be mapped to all the foreign keys,
that the data types can be converted properly, and that NOT NULL constraints
are enforced.

> CREATE OR REPLACE FUNCTION import_validate_sale()
>   RETURNS void AS
> $BODY$
> /*
> Remove any prior exceptions
> */
> DELETE FROM import_sale_error
> WHERE EXISTS (
> SELECT 1
> FROM import_sale
> WHERE import_sale_error.row_id = import_sale.row_id);
> /*
> Null checks for required fields
> */
> INSERT INTO import_sale_error(row_id, error_message)
> SELECT s.row_id, 'sale_number is null, but required.'
> FROM import_sale s
> WHERE s.sale_number IS NULL;



> INSERT INTO import_sale_error(row_id, error_message)
> SELECT s.row_id, 'distributor company_number is null, but required.'
> FROM import_sale s
> WHERE s.company_number IS NULL;



> INSERT INTO import_sale_error(row_id, error_message)
> SELECT s.row_id, 'invoice_number is null, but required.'
> FROM import_sale s
> WHERE s.invoice_number IS NULL;



> INSERT INTO import_sale_error(row_id, error_message)
> SELECT s.row_id, 'sale_date is null, but required.'
> FROM import_sale s
> WHERE s.sale_date IS NULL;
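The foreign-key mapping and type-conversion checks mentioned above can be written in the same style. A sketch, not the original function body; the company_number lookup column on company and the expected date format are assumptions:

INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'company_number does not match any known company.'
FROM import_sale s
WHERE s.company_number IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM company c
                  WHERE c.company_number = s.company_number);

INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'sale_date is not in YYYY-MM-DD format.'
FROM import_sale s
WHERE s.sale_date IS NOT NULL
  AND s.sale_date !~ '^\d{4}-\d{2}-\d{2}$';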

Re: [GENERAL] Using the database to validate data

2015-07-27 Thread JPLapham
Zdeněk Bělehrádek wrote
> What about creating a SAVEPOINT before each INSERT, and if the INSERT
> returns 
> an error, then ROLLBACK TO SAVEPOINT? This way you will have all the 
> insertable data in your table, and you can still ROLLBACK the whole 
> transaction, or COMMIT it if there were no errors.
> 
> It will probably be quite slow, but if you have only thousands of lines,
> it 
> should be fast enough for your usecase IMHO.
> 
> -- Zdeněk Bělehrádek

Hmmm, interesting. Thanks, if that works, it would be exactly what I'm
looking for! 

You are right, speed is not an issue.

-Jon





Re: [GENERAL] Using the database to validate data

2015-07-27 Thread JPLapham
林士博 wrote
> If I am following correctly, you can do it in your application as follows.
> 1. Begin a transaction.
> 2. Insert each row. Catch any db exception,
> and save the exception message and other information you need to an array.
> 3. At the end, you have all the information about the wrong data in the
> array, if there is any,
> and then you can decide whether to rollback or to commit.

Yes, I agree that I could do that, which I believe is my "IDEA 1" from my
original message. This method will naturally work, but it is a very slow
iterative process because you can only catch the *first* error, after which
new INSERTs are not allowed. If you have a data input with, say, 1000 records
and there are 50 errors, it would require 50 iterations of fixing the input
data and running it again to find them all.


林士博 wrote
> By the way, this is about programming, not PostgreSQL.

I was hoping that there would be a way to have PostgreSQL run in a mode
where it allows INSERTs within a transaction even after an error. Naturally,
when the error condition occurs, COMMIT would not be allowed at the end of
the transaction block.

But this way, you could collect all the error information in one pass.
Seemed PostgreSQL-related to me.  :)





Re: [GENERAL] Using the database to validate data

2015-07-26 Thread Allan Kamau
To add to what others have said: I would use a bash script (and awk) to
prepend to each record of the raw CSV file a dataset name, the name of the
file, a timestamp and a serial number, and place the newly generated data into
a new file. In this bash script, the values of the dataset name, file name
and timestamp would come from bash variables declared and initialized
at the start, ensuring that all the records of a given file have the
same values for these three fields.
The dataset name helps identify the dataset to which the data belong
when we load them into the database. The file name indicates the
name of the file from which the data were obtained. The timestamp is
mainly used for versioning purposes. The sequential numbers (one unique
number per row) are for auditing purposes.

I would then create a staging schema inside the PostgreSQL database into
which you intend to load the data. In this staging schema I would create a
table containing all the fields present in the newly generated CSV file,
using the TEXT datatype (and allowing NULLs) for all these fields, with the
exception of the field that would hold the serial number, which would be set
to INTEGER.

Now load the data using COPY staging_schema.sometable(field1, field2, ...)
FROM STDIN WITH (FORMAT csv, ...)



Now that we have the data loaded into the "staging_schema.sometable" table,
we can create another table ("staging_schema.sometable_good"), also in
the staging schema, having the same structure as the previous table but this
time with all the constraints as well as the "correct" datatypes you plan
to have. Create yet another table identical to the first table and call it
something like "staging_schema.sometable_bad"; to this table we shall write
the records that fail insertion into the "staging_schema.sometable_good"
table.

Create a PL/pgSQL script (with exception handling) that reads the staging
table one record at a time (perhaps using cursors) and populates the
"staging_schema.sometable_good" table with the data just read from
"staging_schema.sometable". On insert failure, write the record
into the "staging_schema.sometable_bad" table, as sketched below.


After processing the data, check whether the
"staging_schema.sometable_bad" table has any records matching the specific
dataset name, file name and timestamp (the first three fields). If any
records are found, have a look at them to determine why the insertions into
the "staging_schema.sometable_good" table failed.

If no matching records are found in the "staging_schema.sometable_bad"
table, write the data to another table ("schema.sometable") in the main schema
of the same database. This table should have the same structure as
"staging_schema.sometable_good". Simply insert into this table the records
returned by a query on "staging_schema.sometable_good" filtered on the
given dataset name, file name and timestamp (the first three fields).
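Concretely, that final promotion step might look like this (the column names and literal values are placeholders following the layout described above):

INSERT INTO schema.sometable
SELECT *
FROM staging_schema.sometable_good g
WHERE g.dataset_name = 'dataset_a'
  AND g.file_name = 'dataset_a_20150726.csv'
  AND g.load_ts = '2015-07-26 12:00:00+00';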


Allan.



On Sat, Jul 25, 2015 at 4:19 PM, rob stone  wrote:

> On Thu, 2015-07-23 at 15:34 -0700, JPLapham wrote:
> > Tim Clarke wrote
> > > Shouldn't be too difficult to import those new rows into one table,
> > > write a procedure that inserts them into the real table one by one
> > > and
> > > logs the validation failure if any - committing good rows and
> > > rolling
> > > back bad. In fact if you could then write the failures to a third
> > > table
> > > with a completely relaxed (or no) validation?
> >
> > Tim-
> >
> > Thanks for your response. Yes, you are right, it shouldn't be too
> > difficult,
> > and in fact I have already implemented basically what you suggest,
> > see "Idea
> > 2" from my original message.
> >
> > The problem with this approach is that it fails to find violations
> > such as
> > UNIQUE (there are probably others) from within the input data until
> > after
> > the first has been committed to the database. But, the error may have
> > been
> > with that earlier row, not the later.
> >
> > I want my users to fix all the problems with their data and then load
> > it in
> > an "all or none" fashion.
> >
> > -Jon
> >
> If you have multiple users loading (possibly) multiple files and
> (possibly) concurrently, then the only solution is to write some code
> to process the data.
> You also need to consider load sequence. If user A creates a file that
> contains data that will end up creating a new primary key and a file
> from user B refers to that, then user B needs to wait until user A's
> file has been processed successfully.
> Without knowing all the details I can envisage a scenario where data
> being loaded could reference "good" data already in the DB as well as
> referencing data that exists within that file load, possibly giving a
> rejection or an update of the "good" data.
>
> My 2 cents worth!
>
> Cheers,
> Rob (in Floripa)
>
>
>

Re: [GENERAL] Using the database to validate data

2015-07-25 Thread rob stone
On Thu, 2015-07-23 at 15:34 -0700, JPLapham wrote:
> Tim Clarke wrote
> > Shouldn't be too difficult to import those new rows into one table,
> > write a procedure that inserts them into the real table one by one 
> > and
> > logs the validation failure if any - committing good rows and 
> > rolling
> > back bad. In fact if you could then write the failures to a third 
> > table
> > with a completely relaxed (or no) validation?
> 
> Tim-
> 
> Thanks for your response. Yes, you are right, it shouldn't be too 
> difficult,
> and in fact I have already implemented basically what you suggest, 
> see "Idea
> 2" from my original message.
> 
> The problem with this approach is that it fails to find violations 
> such as
> UNIQUE (there are probably others) from within the input data until 
> after
> the first has been committed to the database. But, the error may have 
> been
> with that earlier row, not the later. 
> 
> I want my users to fix all the problems with their data and then load 
> it in
> an "all or none" fashion. 
> 
> -Jon
> 
If you have multiple users loading (possibly) multiple files and
(possibly) concurrently, then the only solution is to write some code
to process the data.
You also need to consider load sequence. If user A creates a file that
contains data that will end up creating a new primary key and a file
from user B refers to that, then user B needs to wait until user A's
file has been processed successfully.
Without knowing all the details I can envisage a scenario where data
being loaded could reference "good" data already in the DB as well as
referencing data that exists within that file load, possibly giving a
rejection or an update of the "good" data.

My 2 cents worth!

Cheers,
Rob (in Floripa)





Re: [GENERAL] Using the database to validate data

2015-07-25 Thread Jon Lapham

On 07/23/2015 03:02 PM, Adrian Klaver wrote:

http://pgloader.io/


Ok, thanks, I'll look into pgloader's data validation abilities.

However, my naive understanding of pgloader is that it is used to 
quickly load data into a database, which is not what I am looking to do. 
I want to validate data integrity *before* putting it into the database. 
If there is a problem with any part of the data, I don't want any of it 
in the database.


-Jon

--
-**-*-*---*-*---*-*---*-*-*-*---*-*---*-*-*-*-*---
 Jon Lapham  Rio de Janeiro, Brasil
 Personal: http://www.jandr.org/
***-*--**---***---





Re: [GENERAL] Using the database to validate data

2015-07-25 Thread JPLapham
Tim Clarke wrote
> Shouldn't be too difficult to import those new rows into one table,
> write a procedure that inserts them into the real table one by one and
> logs the validation failure if any - committing good rows and rolling
> back bad. In fact if you could then write the failures to a third table
> with a completely relaxed (or no) validation?

Tim-

Thanks for your response. Yes, you are right, it shouldn't be too difficult,
and in fact I have already implemented basically what you suggest, see "Idea
2" from my original message.

The problem with this approach is that it fails to find violations such as
UNIQUE (there are probably others) from within the input data until after
the first has been committed to the database. But the error may have been
with that earlier row, not the later one.

I want my users to fix all the problems with their data and then load it in
an "all or none" fashion. 

-Jon



Re: [GENERAL] Using the database to validate data

2015-07-24 Thread Jeff Janes
On Fri, Jul 24, 2015 at 5:17 AM, Zdeněk Bělehrádek <
zdenek.belehra...@superhosting.cz> wrote:

> What about creating a SAVEPOINT before each INSERT, and if the INSERT
> returns
> an error, then ROLLBACK TO SAVEPOINT?


Make sure you release the savepoint if there was no error.  Otherwise you
will quickly run out of memory.

Also, if there was an error, then after rolling back to the savepoint
either release it, or refrain from starting a new one at the beginning of
the next insert.

Cheers,

Jeff


Re: [GENERAL] Using the database to validate data

2015-07-24 Thread Jeff Janes
On Thu, Jul 23, 2015 at 5:55 AM, JPLapham  wrote:

> Hello,
>
> I have an application that occasionally performs large batch inserts of
> user
> hand-generated data. Input is a tab delimited file with typically hundreds
> to a thousand lines of data.
>
> Because the data is generated by hand, there are always many
> transaction-stopping errors in a typical input run. For example, missing
> datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
> mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
> we chose PostgreSQL exactly because of these problems, because of the
> robust
> transactional control, rollback on errors, etc.
>
> My question is the following. I would like to *test* the data input for
> integrity in such a way that I can create a report to the user informing
> them of exactly where in their input file to correct the problems.
>
> IDEA 1: My first attempt at this was to simply slurp the data into the
> database, collect the errors, and then rollback. Of course (as I now know),
> this doesn't work because after the first problem, the database reports,
> "current transaction is aborted, commands ignored until end of transaction
> block". This means that I can only report to the user the location of the
> first problem, and then they run the data again, and keep looping through
> the process until the data is good, a huge waste of time.
>

This is how I usually do it, until it becomes unbearable or an order comes
down from on high to do it differently.  If the errors are due to people
being sloppy, then it ought to be annoying for them to be sloppy.  Why make
it convenient for them?  If the errors are more excusable than just
sloppiness, or if the annoyance is more to you than to the people creating
the errors, then you have to go on to other methods.


>
> IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
> This allows me to check for things like NOT NULL and data type issues, but
> not violations of UNIQUE within the new data.
>

Issue a savepoint before each insert, and then issue a "release savepoint"
if the insert succeeds or a "rollback to savepoint" if it does not.  If you
roll back to a savepoint, remember that fact so that at the end you roll back
the entire transaction instead of committing it.  I rarely actually resort to
this.  It might miss some errors in which one failed row failed for
multiple reasons, or where one row would have failed had another row not
already failed for a different reason.
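In SQL terms, the per-row flow looks roughly like this (the table, columns, and values are placeholders):

BEGIN;

SAVEPOINT current_row;
INSERT INTO target_table (col_a, col_b) VALUES ('value 1', 'value 2');
RELEASE SAVEPOINT current_row;        -- the insert succeeded
-- If the INSERT had failed, the application would instead issue
--   ROLLBACK TO SAVEPOINT current_row;
-- record the failure, and carry on with the next row.

-- ... repeat the SAVEPOINT / INSERT / RELEASE pattern for each remaining row ...

COMMIT;   -- or ROLLBACK, if any row had to be rolled back to its savepoint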


>
> IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
> like I'm writing my own database! Checking for FKEY constraints, UNIQUE,
> etc
> is not trivial. It seems ridiculous to do this when I have the *actual*
> database available to test against!
>

I do this one a lot when I can't get away with method 1, and I don't see
why it is a nightmare.  Writing queries against the existing database to
see if the new proposed keys exist *is* trivial.  You write them in SQL,
not a low-level language.

Testing for internal duplication within the new dataset is a bit harder; I
usually do that in Perl with a hash.  There might be cases where Perl and
PostgreSQL disagree about when two values are equal, but I've almost never
run into them in practice.
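For instance, both checks can be written as plain SQL against a text staging table like the import_sale example earlier in the thread (the company_number lookup column on company is an assumption):

-- Proposed foreign keys that do not exist yet in the target table:
SELECT s.row_id, s.company_number
FROM import_sale s
LEFT JOIN company c ON c.company_number = s.company_number
WHERE s.company_number IS NOT NULL
  AND c.company_number IS NULL;

-- Duplicates *within* the new data that would break a unique constraint:
SELECT s.sale_number, count(*)
FROM import_sale s
GROUP BY s.sale_number
HAVING count(*) > 1;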

Check constraints or character encoding issues or typing issues can be
harder to deal with.  If those are likely to be a problem, create a temp or
unlogged table with the same check constraints as the real table but
without the unique or foreign key constraints and see if each row inserts.
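A minimal version of that, borrowing the sale table from earlier in the thread:

-- LIKE ... INCLUDING CONSTRAINTS copies CHECK (and NOT NULL) constraints,
-- but not unique, primary key, or foreign key constraints.
CREATE UNLOGGED TABLE sale_typecheck (LIKE sale INCLUDING CONSTRAINTS);
-- Try inserting each candidate row here first; a failure points to a typing
-- or check-constraint problem without touching the real table.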

Cheers,

Jeff


Re: [GENERAL] Using the database to validate data

2015-07-24 Thread Zdeněk Bělehrádek
What about creating a SAVEPOINT before each INSERT, and if the INSERT returns 
an error, then ROLLBACK TO SAVEPOINT? This way you will have all the 
insertable data in your table, and you can still ROLLBACK the whole 
transaction, or COMMIT it if there were no errors.

It will probably be quite slow, but if you have only thousands of lines, it 
should be fast enough for your usecase IMHO.

-- Zdeněk Bělehrádek

> Hello,
> 
> I have an application that occasionally performs large batch inserts of user
> hand-generated data. Input is a tab delimited file with typically hundreds
> to a thousand lines of data.
> 
> Because the data is generated by hand, there are always many
> transaction-stopping errors in a typical input run. For example, missing
> datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
> mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
> we chose PostgreSQL exactly because of these problems, because of the robust
> transactional control, rollback on errors, etc.
> 
> My question is the following. I would like to *test* the data input for
> integrity in such a way that I can create a report to the user informing
> them of exactly where in their input file to correct the problems.
> 
> IDEA 1: My first attempt at this was to simply slurp the data into the
> database, collect the errors, and then rollback. Of course (as I now know),
> this doesn't work because after the first problem, the database reports,
> "current transaction is aborted, commands ignored until end of transaction
> block". This means that I can only report to the user the location of the
> first problem, and then they run the data again, and keep looping through
> the process until the data is good, a huge waste of time.
> 
> IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
> This allows me to check for things like NOT NULL and data type issues, but
> not violations of UNIQUE within the new data.
> 
> IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
> like I'm writing my own database! Checking for FKEY constraints, UNIQUE, etc
> is not trivial. It seems ridiculous to do this when I have the *actual*
> database available to test against!
> 
> Has anyone dealt with this kind of issue before? What are your opinions on
> best practice for this? Of course I do not want to actually COMMIT until the
> data is perfect!
> 
> Thanks for your time!
> -Jon
> 
> 
> 





Re: [GENERAL] Using the database to validate data

2015-07-23 Thread 林士博
Yes, you are right.
Doing it in a PostgreSQL procedure is faster than doing it in the application.


Re: [GENERAL] Using the database to validate data

2015-07-23 Thread Adrian Klaver

On 07/23/2015 08:20 PM, 林士博 wrote:

If I am following correctly, you can do it in your application as follows.
1. Begin a transaction.
2. Insert each row. Catch any db exception,
and save the exception message and other information you need to an array.
3. At the end, you have all the information about the
wrong data in the array, if there is any,
and then you can decide whether to rollback or to commit.
By the way, this is about programming, not PostgreSQL.


Not sure I am following; the OP was asking the best way to catch any 
errors Postgres might throw using built-in database features as much as 
possible. That seems to be Postgres related.




2015-07-24 5:58 GMT+09:00 Tim Clarke <tim.cla...@manifest.co.uk>:

Shouldn't be too difficult to import those new rows into one table,
write a procedure that inserts them into the real table one by one and
logs the validation failure if any - committing good rows and rolling
back bad. In fact if you could then write the failures to a third table
with a completely relaxed (or no) validation?

Tim Clarke

On 23/07/15 21:48, Adrian Klaver wrote:
 > On 07/23/2015 12:04 PM, Jon Lapham wrote:
 >> On 07/23/2015 03:02 PM, Adrian Klaver wrote:
 >>> http://pgloader.io/
 >>
 >> Ok, thanks, I'll look into pgloader's data validation abilities.
 >>
 >> However, my naive understanding of pgloader is that it is used to
 >> quickly load data into a database, which is not what I am
looking to do.
 >> I want to validate data integrity *before* putting it into the
database.
 >> If there is a problem with any part of the data, I don't want
any of it
 >> in the database.
 >
 > I misunderstood, I thought you just wanted  information on the rows
 > that did not get in. pgloader does this by including the rejected
data
 > in *.dat and the Postgres log of why it was rejected in *.log.
 >
 > 
 >
 > I could still see making use of this by using the --before <file_name>,
 > where file_name contains a CREATE TEMPORARY TABLE
 > some_table script that mimics the permanent table. Then it would load
 > against the temporary table, write out any errors and then drop the
 > table at the end. This would not put data into the permanent table on
 > complete success though. That would require some magic in AFTER LOAD
 > EXECUTE that I have not come up with yet:)
 >
 > 
 >>
 >> -Jon
 >>
 >
 >










--
Adrian Klaver
adrian.kla...@aklaver.com




Re: [GENERAL] Using the database to validate data

2015-07-23 Thread 林士博
If I am following correctly, you can do it in your application as follows.
1. Begin a transaction.
2. Insert each row. Catch any db exception,
and save the exception message and other information you need to an array.
3. At the end, you have all the information about the wrong data in the array,
if there is any,
and then you can decide whether to rollback or to commit.

By the way, this is about programming, not PostgreSQL.

2015-07-24 5:58 GMT+09:00 Tim Clarke :

> Shouldn't be too difficult to import those new rows into one table,
> write a procedure that inserts them into the real table one by one and
> logs the validation failure if any - committing good rows and rolling
> back bad. In fact if you could then write the failures to a third table
> with a completely relaxed (or no) validation?
>
> Tim Clarke
>
> On 23/07/15 21:48, Adrian Klaver wrote:
> > On 07/23/2015 12:04 PM, Jon Lapham wrote:
> >> On 07/23/2015 03:02 PM, Adrian Klaver wrote:
> >>> http://pgloader.io/
> >>
> >> Ok, thanks, I'll look into pgloader's data validation abilities.
> >>
> >> However, my naive understanding of pgloader is that it is used to
> >> quickly load data into a database, which is not what I am looking to do.
> >> I want to validate data integrity *before* putting it into the database.
> >> If there is a problem with any part of the data, I don't want any of it
> >> in the database.
> >
> > I misunderstood, I thought you just wanted  information on the rows
> > that did not get in. pgloader does this by including the rejected data
> > in *.dat and the Postgres log of why it was rejected in *.log.
> >
> > 
> >
> > I could still see making use of this by using the --before <file_name>,
> > where file_name contains a CREATE TEMPORARY TABLE
> > some_table script that mimics the permanent table. Then it would load
> > against the temporary table, write out any errors and then drop the
> > table at the end. This would not put data into the permanent table on
> > complete success though. That would require some magic in AFTER LOAD
> > EXECUTE that I have not come up with yet:)
> >
> > 
> >>
> >> -Jon
> >>
> >
> >
>
>
>





Re: [GENERAL] Using the database to validate data

2015-07-23 Thread Tim Clarke
Shouldn't be too difficult to import those new rows into one table,
write a procedure that inserts them into the real table one by one and
logs the validation failure if any - committing good rows and rolling
back bad. In fact if you could then write the failures to a third table
with a completely relaxed (or no) validation?

Tim Clarke

On 23/07/15 21:48, Adrian Klaver wrote:
> On 07/23/2015 12:04 PM, Jon Lapham wrote:
>> On 07/23/2015 03:02 PM, Adrian Klaver wrote:
>>> http://pgloader.io/
>>
>> Ok, thanks, I'll look into pgloader's data validation abilities.
>>
>> However, my naive understanding of pgloader is that it is used to
>> quickly load data into a database, which is not what I am looking to do.
>> I want to validate data integrity *before* putting it into the database.
>> If there is a problem with any part of the data, I don't want any of it
>> in the database.
>
> I misunderstood, I thought you just wanted  information on the rows
> that did not get in. pgloader does this by including the rejected data
> in *.dat and the Postgres log of why it was rejected in *.log.
>
> 
>
> I could still see making use of this by using the --before <file_name>,
> where file_name contains a CREATE TEMPORARY TABLE
> some_table script that mimics the permanent table. Then it would load
> against the temporary table, write out any errors and then drop the
> table at the end. This would not put data into the permanent table on
> complete success though. That would require some magic in AFTER LOAD
> EXECUTE that I have not come up with yet:)
>
> 
>>
>> -Jon
>>
>
>





Re: [GENERAL] Using the database to validate data

2015-07-23 Thread Adrian Klaver

On 07/23/2015 12:04 PM, Jon Lapham wrote:

On 07/23/2015 03:02 PM, Adrian Klaver wrote:

http://pgloader.io/


Ok, thanks, I'll look into pgloader's data validation abilities.

However, my naive understanding of pgloader is that it is used to
quickly load data into a database, which is not what I am looking to do.
I want to validate data integrity *before* putting it into the database.
If there is a problem with any part of the data, I don't want any of it
in the database.


I misunderstood, I thought you just wanted  information on the rows that 
did not get in. pgloader does this by including the rejected data in 
*.dat and the Postgres log of why it was rejected in *.log.




I could still see making use of this by using the --before <file_name>, 
where file_name contains a CREATE TEMPORARY TABLE some_table script that 
mimics the permanent table. Then it would load against the temporary 
table, write out any errors and then drop the table at the end. This 
would not put data into the permanent table on complete success though. 
That would require some magic in AFTER LOAD EXECUTE that I have not come 
up with yet:)





-Jon




--
Adrian Klaver
adrian.kla...@aklaver.com




Re: [GENERAL] Using the database to validate data

2015-07-23 Thread Adrian Klaver

On 07/23/2015 05:55 AM, JPLapham wrote:

Hello,

I have an application that occasionally performs large batch inserts of user
hand-generated data. Input is a tab delimited file with typically hundreds
to a thousand lines of data.

Because the data is generated by hand, there are always many
transaction-stopping errors in a typical input run. For example, missing
datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
we chose PostgreSQL exactly because of these problems, because of the robust
transactional control, rollback on errors, etc.

My question is the following. I would like to *test* the data input for
integrity in such a way that I can create a report to the user informing
them of exactly where in their input file to correct the problems.

IDEA 1: My first attempt at this was to simply slurp the data into the
database, collect the errors, and then rollback. Of course (as I now know),
this doesn't work because after the first problem, the database reports,
"current transaction is aborted, commands ignored until end of transaction
block". This means that I can only report to the user the location of the
first problem, and then they run the data again, and keep looping through
the process until the data is good, a huge waste of time.

IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
This allows me to check for things like NOT NULL and data type issues, but
not violations of UNIQUE within the new data.

IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
like I'm writing my own database! Checking for FKEY constraints, UNIQUE, etc
is not trivial. It seems ridiculous to do this when I have the *actual*
database available to test against!


That has already been done:

http://pgloader.io/




Has anyone dealt with this kind of issue before? What are your opinions on
best practice for this? Of course I do not want to actually COMMIT until the
data is perfect!

Thanks for your time!
-Jon








--
Adrian Klaver
adrian.kla...@aklaver.com

