Alan Gauld wrote:
> "Shadab Sayani" <[EMAIL PROTECTED]> wrote
>
>> The data I need to deal with is in 100s of GB.
>> I am using postgresql backend and SQLALCHEMY ORM.
>
> All ORMs will introduce a significant performance hit.
> If you really need high speed, and populating a 100G+ database
> probably is included in that, then you should look at raw SQL.
> In fact for really big data loads most databases provide a
> C API that goes under the SQL, because even SQL is
> relatively slow.
I've had this experience, too. I routinely load a couple million records
from a tab-delimited file into our PostgreSQL database with Python. My
import script:

1. iterates over the tab file a line/record at a time
2. does some data validation and data manipulation
3. writes an SQL file with the massaged data
4. calls psql <dbname> -f <python-generated sql file>

Generating the SQL file in Python goes amazingly fast, considering the
amount of validation and manipulation (normalizing addresses, generating
hashes, etc.): about 1.5 minutes per million records. The SQL it generates
does the usual bulk-load housekeeping: drop indexes and constraints, COPY
FROM stdin, regenerate indexes, restore constraints. (A rough sketch of
this pattern is at the end of this message.)

In my experience, psql is MUCH faster at these kinds of imports than even
direct Python interaction with psycopg. Adding an ORM on top of the
database connector (psycopg in my case) slows things down even more. As a
rough metric, the import I described takes a little over 2 hours on my
2GHz/1GB laptop.

Good luck,
Eric
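For anyone who wants to try this approach, here is a minimal sketch of the
kind of loader described above. The table name (people), column list, input
file name, index name, and the validation step are all made up for
illustration; it assumes psql is on the PATH and that the data contains no
embedded tabs or backslashes that would need COPY escaping.

    #!/usr/bin/env python
    # Sketch of the pattern above: read a tab-delimited file, massage each
    # record in Python, write one SQL script that wraps the data in
    # COPY ... FROM stdin, then hand the whole script to psql.

    import subprocess

    INPUT = "records.tab"          # hypothetical tab-delimited source file
    SQL_OUT = "load_records.sql"   # generated SQL script (step 3)
    DBNAME = "mydb"                # hypothetical database name

    def massage(fields):
        """Validate/normalize one record; return None to skip bad rows."""
        if len(fields) != 3:
            return None
        name, street, city = (f.strip() for f in fields)
        if not name:
            return None
        street = street.upper()   # stand-in for real address normalization
        return [name, street, city]

    def write_sql(infile, outfile):
        with open(infile) as src, open(outfile, "w") as out:
            # Drop indexes up front so the COPY runs unencumbered.
            out.write("DROP INDEX IF EXISTS people_name_idx;\n")
            out.write("COPY people (name, street, city) FROM stdin;\n")
            for line in src:                              # step 1
                row = massage(line.rstrip("\n").split("\t"))  # step 2
                if row is not None:
                    out.write("\t".join(row) + "\n")
            out.write("\\.\n")     # end-of-data marker for COPY
            # Rebuild what was dropped.
            out.write("CREATE INDEX people_name_idx ON people (name);\n")

    if __name__ == "__main__":
        write_sql(INPUT, SQL_OUT)                               # steps 1-3
        subprocess.check_call(["psql", DBNAME, "-f", SQL_OUT])  # step 4

The point of the COPY ... FROM stdin block terminated by \. is that psql
streams the entire data section as a single statement, which is why this
beats issuing millions of INSERTs (or ORM saves) through psycopg.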