On May 24, 2007, at 5:52 AM, Julien Cigar wrote:
> Hello,
>
> I have to import, clean and merge data from different sources (Excel
> sheets, Access, text files, etc.) into a new database (PostgreSQL 8.1).
> Some of those Excel sheets, which have been converted to CSV files, are
> very, very big (about 500,000 lines).
> For this new project I decided to use SQLAlchemy to import those data
> (I used it only for the web interface in the past), and it's really a
> big time saver.
> However, after ~15,000 iterations my script consumes about 60% of the
> memory and things become very, very slow...
> For some parts of the script things can be slow, because for each line
> in file A I have to search for the corresponding line in file B (but
> file B is not so big in this case).
>
> I don't know if this problem is related to SQLAlchemy or not... but I
> didn't have this problem before.
>
> Here is the script: http://rafb.net/p/RvsH3133.html (it's quite huge
> but the details aren't important)
>
> Does somebody have an idea what could cause this high memory
> consumption?

here is the formula:

    ORM (i.e., using mappers/sessions)
    + >1000 of anything
    - *extremely* careful usage of relationships/cascade/sessions
    == major memory usage

the ORM is largely oriented towards keeping itself "correct" as well as towards being able to load data on demand... therefore if you have a large recordset that is linked by a relation, it's very easy for that whole recordset to get loaded into memory. additionally, SA has to keep track of a lot of state information for each loaded entity.

to analyze this program you should be doing things like setting the 'sqlalchemy.engine' logger to logging.DEBUG so that you can see not just the queries being issued but also the numbers of rows being loaded in, and you should also be monitoring the size of your session at various points, using something like:

    print "session size:", len(session.identity_map), \
        len(session.new), len(session.dirty)

as well as looking at objects in memory.
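the logging setup and session-size check above can be sketched as two small helpers (modernized to Python 3 syntax; `enable_sql_logging` and `session_size` are hypothetical helper names, and the latter only assumes the session exposes `identity_map`, `new`, and `dirty`):

```python
import logging

def enable_sql_logging():
    # 'sqlalchemy.engine' is the logger name SQLAlchemy emits SQL on;
    # at DEBUG level it also echoes the result rows being fetched.
    logging.basicConfig()
    logging.getLogger('sqlalchemy.engine').setLevel(logging.DEBUG)

def session_size(session):
    # identity_map: instances currently loaded into the session
    # new / dirty: pending and modified instances awaiting flush
    return len(session.identity_map), len(session.new), len(session.dirty)
```

calling `session_size(session)` before and after each batch of work makes it obvious where objects are accumulating.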
you can use a debugger like PDB for this, though i am lazy and i usually just play with "gc", doing things like:

    import gc

    print len(gc.get_objects())
    for obj in gc.get_objects():
        if getattr(obj, '__module__', None) is not None and \
           obj.__module__.startswith('mypackage.name'):
            print repr(obj)

you should try to ensure that the total number of objects present in the session at any given time is in the hundreds, making liberal use of clear()/expunge() etc. to keep the size down.

--
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
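one way to keep the session in the hundreds is to flush and clear in fixed-size batches; a minimal sketch (Python 3 syntax; `make_entity` is a stand-in for whatever builds a mapped object from a CSV row, and `expunge_all()` is the later-API spelling of the `clear()` mentioned above):

```python
def import_in_batches(session, rows, make_entity, batch_size=1000):
    # Periodically flush pending INSERTs and empty the session so it
    # never holds more than batch_size objects at once.
    for i, row in enumerate(rows, 1):
        session.add(make_entity(row))
        if i % batch_size == 0:
            session.flush()        # push pending rows to the database
            session.expunge_all()  # drop all instances from the session
    session.flush()  # flush the final partial batch
```

the same pattern works with `save()`/`clear()` on the 0.3-era session; the point is only that nothing stays referenced by the session across batches.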