On May 24, 2007, at 5:52 AM, Julien Cigar wrote:

>
> Hello,
>
> I have to import, clean, and merge data from different sources
> (Excel sheets, Access, text files, etc.) into a new database
> (PostgreSQL 8.1).
> Some of those Excel sheets, which have been converted to CSV files,
> are very big (about 500,000 lines).
> For this new project I decided to use SQLAlchemy to import the data
> (in the past I used it only for the web interface), and it's really
> a big time saver.
> However, after ~15000 iterations my script consumes about 60% of
> memory and things become very, very slow ...
> Some parts of the script can legitimately be slow, because for each
> line in file A I have to look up the corresponding line in file B
> (but file B is not so big in this case).
>
> I don't know if this problem is related to SQLAlchemy or not ...  
> but I didn't have this problem before.
>
> Here is the script: http://rafb.net/p/RvsH3133.html (it's quite
> long but the details aren't important)
>
> Does somebody have an idea what could cause this high memory
> consumption?

Here is the formula:

ORM (i.e., using mappers/sessions) + > 1000 of anything - *extremely*
careful usage of relationships/cascade/sessions == major memory usage

The ORM is largely oriented towards keeping itself "correct" as well
as towards being able to load data on demand; therefore, if you have
a large recordset that is linked by a relation, it's very easy for
that whole recordset to get loaded into memory.  Additionally, SA has
to keep track of a lot of state information for each loaded entity.
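For example, with a hypothetical Invoice/LineItem mapping (names made
up here, not taken from your script), a single attribute access can
pull an entire related collection in:

invoice = session.query(Invoice).get(42)

# touching the lazy-loaded collection pulls every related LineItem row
# into the session at once; with hundreds of thousands of child rows,
# that's all of them in memory, each with its own tracked state
items = invoice.line_items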

To analyze this program, you should do things like setting the
'sqlalchemy.engine' logger to logging.DEBUG, so that you can see not
just the queries being issued but also the number of rows being
loaded in.  You should also monitor the size of your session at
various points, using something like:

print "session size:", len(session.identity_map), len(session.new),  
len(session.dirty)
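
To turn that logging on, something like this works (plain stdlib
logging; passing echo=True to create_engine is the quick version for
just the SQL):

import logging

logging.basicConfig()   # send log output to stderr
# DEBUG on 'sqlalchemy.engine' shows each statement plus the rows coming back
logging.getLogger('sqlalchemy.engine').setLevel(logging.DEBUG)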

It also helps to look at the actual objects in memory.  You can use a
debugger like pdb for this, though I am lazy and usually just play
with gc, doing things like:

import gc

# how many objects the interpreter is tracking overall
print len(gc.get_objects())

# list just the objects that come from your own package
for obj in gc.get_objects():
    if getattr(obj, '__module__', None) is not None and \
            obj.__module__.startswith('mypackage.name'):
        print repr(obj)

You should try to ensure that the total number of objects present in
the session at any given time stays in the hundreds, making liberal
use of clear()/expunge() etc. to keep the size down.
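
For a big CSV import that usually means flushing and clearing in
batches; a rough sketch (Record, rows, and BATCH are placeholder
names standing in for your own mapped class and parsed CSV data):

BATCH = 500

for i, row in enumerate(rows):
    rec = Record()                  # your mapped class
    rec.name = row['name']          # populate columns from the CSV line
    session.save(rec)               # schedule the new object for INSERT

    if (i + 1) % BATCH == 0:
        session.flush()             # write the pending batch to the DB
        session.clear()             # empty the identity map so the session
                                    # never holds more than one batch

session.flush()                     # flush and clear whatever is left over
session.clear()

That way the session never grows past a few hundred objects no matter
how long the file is.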

