Re: [Tutor] Optimal solution in dealing with huge databases in python

2007-01-25 Thread Eike Welk
You could try Pytables:
http://www.pytables.org/moin

It is not a database, but a storage library for large amounts of data.
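
A minimal sketch of what writing records with PyTables looks like (the file
name and table layout below are invented for illustration; recent PyTables
releases spell the calls open_file/create_table, older ones use camelCase
names such as openFile):

import tables

class Sequence(tables.IsDescription):
    # hypothetical record layout, just for illustration
    name = tables.StringCol(64)
    length = tables.Int64Col()

h5file = tables.open_file("bio_data.h5", mode="w", title="Bio data")
table = h5file.create_table("/", "sequences", Sequence, "Sequence metadata")

row = table.row
for name, length in [("seq_a", 1200), ("seq_b", 880)]:
    row["name"] = name
    row["length"] = length
    row.append()          # buffered in memory, written out on flush

table.flush()
h5file.close()

The data lands in an HDF5 file, which copes with datasets much larger than
RAM and can be appended to incrementally.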

Regards,
Eike.



Re: [Tutor] Optimal solution in dealing with huge databases in python

2007-01-25 Thread Shadab Sayani
Hi,
  
I got your point. But before inserting the data I need to store it in a file
in a format supported by PostgreSQL. Won't this operation incur a performance
hit, since it involves writing to a file on disk?

Thanks and Regards,
Shadab.
  
johnf <[EMAIL PROTECTED]> wrote:

On Wednesday 24 January 2007 19:22, Shadab Sayani wrote:
> Hi,
>   I am working on a bio-databases project. The data I need to deal with
> is in the 100s of GB. I am using a PostgreSQL backend and the SQLAlchemy
> ORM. I need to read the bio data files, parse them, and then store them
> in the database. I am in the process of storing them. I used the
> session/flush approach in SQLAlchemy. Initially I flushed every query
> immediately. Later I realised that the queries are independent of each
> other, so I started flushing 3-5 lakh (300,000-500,000) insert queries
> at a time. This improved the response time, but memory is overflowing.
> I then released not-in-use memory using the del statement in Python,
> but that did not help much, as del can only free part of the memory.
> I need to increase the 3-5 lakh figure to a much larger one to get a
> real-time response; otherwise my estimate is that it will take a year
> just to insert the data into the database. On the PostgreSQL side I
> have also turned off WAL. Please suggest a viable solution for handling
> such enormous data from Python. Is there a better solution than
> SQLAlchemy? Any solution that speeds up my program is highly appreciated.
> Thanks and Regards,
>   Shadab.

To tell the truth I know nothing about the SQLAlchemy ORM.  But I do know how 
to load large amounts of data into Postgres.  You use the 'COPY ... FROM' 
command.  I once had 2 gigs of data and was surprised at how fast it added the 
data.  Two things to remember:
1.  no indexes
2.  you have to format the data correctly - see the Postgres docs

BTW, I use psycopg for my connection
-- 
John Fabiani


Re: [Tutor] Optimal solution in dealing with huge databases in python

2007-01-25 Thread Eric Walstad
Alan Gauld wrote:
> "Shadab Sayani" <[EMAIL PROTECTED]> wrote 
> 
>> The data I need to deal with is  in 100s of GB.
>> I am using postgresql backend and SQLALCHEMY ORM.
> 
> All ORMs will introduce a significant performance hit.
> If you really need high speed, and populating a 100G+ database 
> probably is included in that, then you should look at raw SQL.
> In fact for really big data loads most databases provide a 
> C api that goes under the SQL, because even SQL is 
> relatively slow.

I've had this experience, too.  I routinely load a couple million
records from a tab delimited file into our PostgreSQL database with
Python.  My import script:
 1. iterates over the tab file a line/record at a time
 2. does some data validation, some data manipulation
 3. writes an SQL file with the massaged data
 4. calls psql -f on the generated SQL file

Generating the SQL file in Python goes amazingly fast, considering the
amount of validation and manipulation (normalizing addresses, generating
hashes, etc) - about 1.5 minutes per million records.

The SQL it generates does the usual DB stuff, including dropping indexes
and constraints, COPY ... FROM stdin, regenerating indexes, and restoring
constraints.  In my experience, psql is MUCH faster at these kinds of
imports than even direct Python interaction with psycopg.  Adding an ORM
on top of the database connector (psycopg in my case) slows things down
even more.
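
A rough sketch of that pipeline in Python (the table, column, and file names
are invented, and a real script would do far more validation plus proper
COPY-style escaping of the data):

import subprocess

IN_PATH = "records.tab"    # hypothetical tab-delimited input
SQL_PATH = "load.sql"      # generated SQL file that gets fed to psql

src = open(IN_PATH)
out = open(SQL_PATH, "w")

out.write("DROP INDEX IF EXISTS records_name_idx;\n")
out.write("COPY records (name, value) FROM stdin;\n")
for line in src:
    name, value = line.rstrip("\n").split("\t")
    # validation / normalization of each record would happen here
    out.write("%s\t%s\n" % (name, value))
out.write("\\.\n")         # terminates the COPY FROM stdin data block
out.write("CREATE INDEX records_name_idx ON records (name);\n")

src.close()
out.close()

# let psql stream the whole file; the COPY step is what makes this fast
subprocess.call(["psql", "-d", "mydb", "-f", SQL_PATH])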

As a rough metric, the import I described takes a little over 2 hours on
my 2GHz/1GB laptop.

Good luck,

Eric.


Re: [Tutor] Optimal solution in dealing with huge databases in python

2007-01-25 Thread johnf
On Wednesday 24 January 2007 19:22, Shadab Sayani wrote:
> Hi,
>   I am working on a bio-databases project. The data I need to deal with
> is in the 100s of GB. I am using a PostgreSQL backend and the SQLAlchemy
> ORM. I need to read the bio data files, parse them, and then store them
> in the database. I am in the process of storing them. I used the
> session/flush approach in SQLAlchemy. Initially I flushed every query
> immediately. Later I realised that the queries are independent of each
> other, so I started flushing 3-5 lakh (300,000-500,000) insert queries
> at a time. This improved the response time, but memory is overflowing.
> I then released not-in-use memory using the del statement in Python,
> but that did not help much, as del can only free part of the memory.
> I need to increase the 3-5 lakh figure to a much larger one to get a
> real-time response; otherwise my estimate is that it will take a year
> just to insert the data into the database. On the PostgreSQL side I
> have also turned off WAL. Please suggest a viable solution for handling
> such enormous data from Python. Is there a better solution than
> SQLAlchemy? Any solution that speeds up my program is highly appreciated.
> Thanks and Regards,
>   Shadab.

To tell the truth I know nothing about the SQLAlchemy ORM.  But I do know how 
to load large amounts of data into Postgres.  You use the 'COPY ... FROM' 
command.  I once had 2 gigs of data and was surprised at how fast it added the 
data.  Two things to remember:
1.  no indexes
2.  you have to format the data correctly - see the Postgres docs

BTW, I use psycopg for my connection
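
If you want to drive COPY from Python rather than the psql shell, psycopg2
exposes it as copy_from; a minimal sketch, with a made-up connection string,
file, and table name:

import psycopg2

conn = psycopg2.connect("dbname=biodb")   # hypothetical connection string
cur = conn.cursor()

# stream a pre-formatted tab-delimited file straight into the table via COPY
data = open("sequences.tsv")
cur.copy_from(data, "sequences", sep="\t")
data.close()

conn.commit()
cur.close()
conn.close()
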
-- 
John Fabiani


Re: [Tutor] Optimal solution in dealing with huge databases in python

2007-01-25 Thread Chuck Coker
This page has some info on C/C++ for PostgreSQL:
http://www.postgresql.org/docs/8.0/interactive/xfunc-c.html

Shadab Sayani wrote:

> Do you have any idea about the C API for PostgreSQL and
> some documentation on how to use it?

-- 
==
Chuck Coker, Software Developer      [EMAIL PROTECTED]
Tyrell Software Corporation          http://www.tyrell.com
Office: +1 949 458 1911 x 203        Cell: +1 714 326 5939
==



Re: [Tutor] Optimal solution in dealing with huge databases in python

2007-01-25 Thread Shadab Sayani
Hi Alan,
Thank you very much for the immediate response. I didn't get
the point about loading the data using SQL. What does that
mean? Does it mean a bulk file load? If so, there will be an
I/O performance hit, right? I would have to write the parsed
tokens to a file in CSV format, and only after that could I
bulk load this file.
Do you have any idea about the C API for PostgreSQL and
some documentation on how to use it?
Thanks and Regards,
Shadab.
--- Alan Gauld <[EMAIL PROTECTED]> wrote:

> "Shadab Sayani" <[EMAIL PROTECTED]> wrote 
> 
> > The data I need to deal with is  in 100s of GB.
> > I am using postgresql backend and SQLALCHEMY ORM.
> 
> All ORMs will introduce a significant performance
> hit.
> If you really need high speed, and populating a
> 100G+ database 
> probably is included in that, then you should look
> at raw SQL.
> In fact for really big data loads most databases
> provide a 
> C api that goes under the SQL, because even SQL is 
> relatively slow.
> 
> As an example, we use a large Oracle database at
> work.
> Loading about 3 Terabytes of data through an ORM
> took 
> over 3 months! Loading it through SQL took about 3
> days.
> Loading it through the C API took less than a day.
> 
> Your mileage may vary since a lot depends on locks, 
> indexes etc etc. And of course the server spec!
> 
> But for loading large data volumes get as close to
> the metal 
> as possible. Once the data is loaded you can use the
> 
> ORM to simplify the application code for extracting
> and 
> modifying the data.
> 
> > I need  to read the bio datafiles and parse them
> and 
> > then store them in  database.
> 
> Parsing them and preparing the SQL statements can 
> be done in Python. But the actual loading I suggest 
> should be done in SQL if possible (the C API should 
> be a last resort - it's fraught with danger!)
> 
> > Please suggest some viable solution to handle such
> 
> > enormous data from  python.
> 
> A few hundred gigabytes is not too enormous these
> days
> but you are never going to achieve times of less
> than hours.
> You do need to be realistic about that. And if you
> are using
> a standard PC spec server instead of a large multi
> CPU 
> box with SCSI/RAID disc arrays etc then you could be
> 
> looking at days.
> 
> The other important factor is your data schema. The
> more 
> tables, joins, indexes etc the database has to
> maintain the 
> more work it takes and the slower it gets. The 3TB
> example 
> I gave had over 2000 tables, so it was always going
> to be 
> slow. If you have a single unindexed table then it
> will be 
> much simpler. (But the queries later will be much
> harder!)
> 
> -- 
> Alan Gauld
> Author of the Learn to Program web site
> http://www.freenetpages.co.uk/hp/alan.gauld
> 
> ___
> Tutor maillist  -  Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
> 




Re: [Tutor] Optimal solution in dealing with huge databases in python

2007-01-25 Thread Alan Gauld
"Shadab Sayani" <[EMAIL PROTECTED]> wrote 

> The data I need to deal with is  in 100s of GB.
> I am using postgresql backend and SQLALCHEMY ORM.

All ORMs will introduce a significant performance hit.
If you really need high speed, and populating a 100G+ database 
probably is included in that, then you should look at raw SQL.
In fact, for really big data loads most databases provide a 
C API that goes under the SQL, because even SQL is 
relatively slow.

As an example, we use a large Oracle database at work.
Loading about 3 Terabytes of data through an ORM took 
over 3 months! Loading it through SQL took about 3 days.
Loading it through the C API took less than a day.

Your mileage may vary since a lot depends on locks, 
indexes etc etc. And of course the server spec!

But for loading large data volumes get as close to the metal 
as possible. Once the data is loaded you can use the 
ORM to simplify the application code for extracting and 
modifying the data.

> I need  to read the bio datafiles and parse them and 
> then store them in  database.

Parsing them and preparing the SQL statements can 
be done in Python. But the actual loading, I suggest, 
should be done in SQL if possible (the C API should 
be a last resort - it's fraught with danger!)
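
As a minimal sketch of that split - parsing in Python, loading with plain
parameterized SQL through psycopg2, no ORM in between - with invented file
format, table, and column names:

import psycopg2

def parse(path):
    # hypothetical parser: yields (name, sequence) tuples from a data file
    for line in open(path):
        name, seq = line.rstrip("\n").split("\t")
        yield name, seq

conn = psycopg2.connect("dbname=biodb")       # hypothetical DSN
cur = conn.cursor()

batch = []
for record in parse("chr1.dat"):
    batch.append(record)
    if len(batch) >= 100000:                  # commit in large batches
        cur.executemany(
            "INSERT INTO sequences (name, seq) VALUES (%s, %s)", batch)
        conn.commit()
        batch = []
if batch:
    cur.executemany(
        "INSERT INTO sequences (name, seq) VALUES (%s, %s)", batch)
conn.commit()
cur.close()
conn.close()

Even this is much slower than COPY or a C-level loader, but it removes the
ORM overhead.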

> Please suggest some viable solution to handle such 
> enormous data from  python.

A few hundred gigabytes is not too enormous these days
but you are never going to achieve times of less than hours.
You do need to be realistic about that. And if you are using
a standard PC spec server instead of a large multi CPU 
box with SCSI/RAID disc arrays etc then you could be 
looking at days.

The other important factor is your data schema. The more 
tables, joins, indexes etc the database has to maintain the 
more work it takes and the slower it gets. The 3TB example 
I gave had over 2000 tables, so it was always going to be 
slow. If you have a single unindexed table then it will be 
much simpler. (But the queries later will be much harder!)

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
