On Fri, Feb 23, 2007, Arild B. Næss wrote:
>Hi,
>
>I'm working on a python script for a task in statistical language
>processing. Briefly put it all boils down to counting different
>things in very large text files, doing simple computations on these
>counts and storing the results. I have been using python's dictionary
>type as my basic data structure of storing the counts. This has been
>a nice and simple solution, but turns out to be a bad idea in the
>long run, since the dictionaries become _very_ large, and create
>MemoryErrors when I try to run my script on texts of a certain size.
>
>It seems that an SQL database would probably be the way to go, but I
>am a bit concerned about speed issues (even though running time is
>not all that crucial here). In any case it would probably take me a
>while to get a database up and running and I need to hand in some
>preliminary results pretty soon, so for now I think I'll postpone the
>SQL and try to tweak my current script to be able to run it on
>slightly longer texts than it can handle now.

You would probably be better off using one of the hash databases
(Berkeley DB, gdbm, etc.; see the anydbm documentation). These can be
used much like dictionaries in Python, except that keys and values must
be strings and the data lives on disk rather than in memory. For simple
key lookups like your counts they are probably orders of magnitude
faster than an SQL database.

Bill
--
Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/
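
To make the suggestion concrete, here is a minimal sketch of counting words into an on-disk hash database. It assumes Python 3, where the anydbm module is called dbm; the function names count_words and read_count and the whitespace tokenizing are made up for illustration and are not from the original script. Counts are stored as decimal strings because dbm files only hold bytes.

```python
import dbm
import os
import tempfile


def count_words(lines, db_path):
    """Accumulate word counts in an on-disk dbm file instead of a dict."""
    db = dbm.open(db_path, "c")  # "c": open read/write, create if missing
    try:
        for line in lines:
            for word in line.split():
                key = word.encode("utf-8")  # keys are stored as bytes
                current = int(db[key].decode("ascii")) if key in db else 0
                # values are bytes too, so the count is kept as a decimal string
                db[key] = str(current + 1).encode("ascii")
    finally:
        db.close()


def read_count(db_path, word):
    """Return the stored count for word, or 0 if it was never seen."""
    db = dbm.open(db_path, "r")  # "r": open read-only
    try:
        key = word.encode("utf-8")
        return int(db[key].decode("ascii")) if key in db else 0
    finally:
        db.close()


if __name__ == "__main__":
    # Hypothetical demo data; a real run would iterate over an open text file.
    path = os.path.join(tempfile.mkdtemp(), "counts")
    count_words(["the cat sat on the mat", "the end"], path)
    print(read_count(path, "the"))
```

Every increment here costs a disk lookup and a write, so this trades speed for memory. If it turns out too slow, one option is to count a chunk of text in an ordinary dict and merge that into the database every so often.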