[web2py] Re: CSV Import performance improvement idea
I use gluon.scheduler to import large csv files in a production environment, and execute db.commit() every 100 records in the task. This works well for me.

Kenji

On Feb 14, 4:34 PM, Johann Spies johann.sp...@gmail.com wrote:

On 14 February 2012 00:54, Omi Chiba ochib...@gmail.com wrote:

I have a problem with the performance of CSV import. I assume it generates an INSERT statement for every record, so there will be 8000 statements if the csv file has 8000 records. Can we use the bulk_insert method instead, so there is always only one INSERT statement? That should improve performance significantly.

As I understand it, the database (at least PostgreSQL) uses the COPY statement to import csv files. That is much quicker than a series of individual inserts.

Regards, Johann -- Because experiencing your loyal love is better than life itself, my lips will praise you. (Psalm 63:3)
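Kenji's pattern of committing every 100 records can be sketched without a full web2py setup. Below is a minimal stand-in using the stdlib sqlite3 and csv modules instead of the DAL; the table, columns, and sample data are invented for illustration, but the batching logic is the same one a scheduler task would run.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a large uploaded csv file.
CSV_DATA = "code,qty\n" + "\n".join("P%d,%d" % (i, i) for i in range(250))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (code TEXT, qty INTEGER)")

reader = csv.DictReader(io.StringIO(CSV_DATA))
for n, row in enumerate(reader, start=1):
    db.execute(
        "INSERT INTO product (code, qty) VALUES (?, ?)",
        (row["code"], int(row["qty"])),
    )
    if n % 100 == 0:  # commit every 100 records, as described above
        db.commit()
db.commit()  # flush the final partial batch

count = db.execute("SELECT COUNT(*) FROM product").fetchone()[0]
print(count)  # 250
```

The periodic commit keeps each transaction small, which is the point of running the import as a background task; the trade-off around consistency is discussed further down the thread.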
[web2py] Re: CSV Import performance improvement idea
Johann and Kenji san, thank you! DB2 has the CPYFRMIMPF command (I think it's the same as COPY on Postgres), so I will try it.
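COPY and CPYFRMIMPF need server-side support (with psycopg2 you would reach COPY via the cursor's copy_expert method against a live PostgreSQL server), so they can't be shown self-contained here. The underlying idea of "one statement for many rows instead of one INSERT per row" can still be sketched with the stdlib: executemany sends a single prepared statement for the whole batch. Table and data below are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical csv file contents.
csv_text = "sku,price\nA1,10\nA2,20\nA3,30\n"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE item (sku TEXT, price REAL)")

rows = [
    (r["sku"], float(r["price"]))
    for r in csv.DictReader(io.StringIO(csv_text))
]
# One prepared statement for the whole batch, rather than a
# round-trip per record -- the same idea COPY takes further.
db.executemany("INSERT INTO item (sku, price) VALUES (?, ?)", rows)
db.commit()

total = db.execute("SELECT COUNT(*) FROM item").fetchone()[0]
print(total)  # 3
```

A server-side COPY is faster still because the rows never pass through per-statement parsing at all, but batching alone already removes most of the per-record overhead.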
[web2py] Re: CSV Import performance improvement idea
I remember I hit that performance problem, importing 8,300 records from a csv file using appadmin and a remote database (over 10 min). The following improvements worked for me:
- if the database is local, importing from appadmin is fast
- if the database is remote:
  A. use another DB manager to import the csv file (phpMyAdmin, SQL Buddy, etc.)
  B. copy the file to the remote host and call importcsv from code
[web2py] Re: CSV Import performance improvement idea
You need to distinguish between methods. web2py stands in the middle: not all databases have facilities like PostgreSQL's COPY command. Also, there are methods to export/import csv both at the database level and at the table level. If you need to restore a table from the admin app, you have to take into account the time to physically transfer the csv file. web2py's methods can also use uuid fields to do upserts, deal with references, list:* types, etc. If you have huge tables, or machines that aren't performant enough, you shouldn't use web2py's methods; code something tailored to your needs. I don't have issues for normal tables with 50 records (file uploaded to the remote host, imported from the local file directly from code, using the DAL). Of course, if your table contains several columns or large blobs, YMMV.

PS @ Kenji: normally, if you restore a table or import something, you want to know which records were committed and which weren't. If you're not using pkeys or uuids for every single record, committing 100 records at a time is normally considered bad behaviour: if your file contains a bad record, you can't assure the consistency of the table. The time taken to import 8000 records with a commit every 100 is more than importing 8000 records and doing a single final commit, and you'll retain consistency: if something falls apart, you have a functioning table instead of a broken one.

Again, @ all, YMMV: if your csvs are millions of records and you don't mind database consistency (e.g. all records in the tables are later polished or filtered in a smart way), you can commit every n records and store the last committed line somewhere else, so that if your import crashes you know where to restart. (Be aware: writing performant checks and logic to import incrementally into dbs and make them work in every possible condition is a quite daunting task.)
My point is: don't be shy about using transactions, commits and rollbacks. They are simple to use, very performant, and maybe the most wonderful features of a relational database!
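The all-or-nothing import Niphlod advocates can be sketched as a single transaction with a rollback on the first bad record. This is a minimal stand-in using stdlib sqlite3 rather than the DAL; the table and the deliberately broken sample row are invented for illustration.

```python
import csv
import io
import sqlite3

# One row is deliberately bad: "oops" is not an integer quantity.
csv_text = "code,qty\nA,1\nB,oops\nC,3\n"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stock (code TEXT, qty INTEGER)")
db.commit()

try:
    for row in csv.DictReader(io.StringIO(csv_text)):
        db.execute(
            "INSERT INTO stock (code, qty) VALUES (?, ?)",
            (row["code"], int(row["qty"])),  # raises ValueError on "oops"
        )
    db.commit()  # single final commit: all records or none
except ValueError:
    db.rollback()  # bad record: the table stays consistent (empty here)

remaining = db.execute("SELECT COUNT(*) FROM stock").fetchone()[0]
print(remaining)  # 0
```

Because nothing was committed before the error, the rollback leaves the table exactly as it was, which is the consistency argument for a single final commit over commit-every-100.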
[web2py] Re: CSV Import performance improvement idea
@ Niphlod: The situation for me is that non-programmer operators upload 10,000-100,000 product records, mainly for inventory updates, via an admin interface. It's not so huge, but large enough to exceed a server response time limit. The upload is not a bulk copy operation but insert/update operations for each record, which has a pkey, so I think the upload keeps consistency even when aborted halfway. I am concerned that a single commit would cause high memory usage, which should be avoided in a live environment. Regards, Kenji
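Kenji's per-record insert/update keyed on a pkey is an upsert; in the DAL the closest method is update_or_insert, and most databases expose a native form. A minimal stand-in using sqlite3's INSERT ... ON CONFLICT syntax (available in SQLite 3.24+; table and data are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (code TEXT PRIMARY KEY, qty INTEGER)")

def upsert(code, qty):
    # Insert a new row, or update the existing one when the pkey
    # already exists -- each record lands in a consistent state, so
    # an aborted upload leaves valid rows behind.
    db.execute(
        "INSERT INTO product (code, qty) VALUES (?, ?) "
        "ON CONFLICT(code) DO UPDATE SET qty = excluded.qty",
        (code, qty),
    )

for code, qty in [("A", 1), ("B", 2)]:    # initial upload
    upsert(code, qty)
for code, qty in [("B", 20), ("C", 3)]:   # later inventory update
    upsert(code, qty)
db.commit()

rows = dict(db.execute("SELECT code, qty FROM product ORDER BY code"))
print(rows)  # {'A': 1, 'B': 20, 'C': 3}
```

Because each upsert is idempotent for a given csv line, re-running an aborted upload from the start (or from a saved checkpoint) converges to the same result, which is what makes the chunked-commit approach safe in this particular workload.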