On Tue, 22 Nov 2016 10:27 am, Fillmore wrote:

> Hi there, Python newbie here.
>
> I am working with large files. For this reason I figured that I would
> capture the large input into a list and serialize it with pickle for
> later (faster) usage.
> Everything has worked beautifully until today when the large data (1GB)
> file caused a MemoryError :(
At what point do you run out of memory? When building the list? If so, then
you need more memory, or smaller lists, or to avoid creating a giant list in
the first place. If you can successfully build the list, but then run out of
memory when trying to pickle it, then you may need another approach.

But as always, to really be sure what is going on, we need to see the full
traceback (not just the "MemoryError" part) and preferably a short, simple
example that replicates the error:

http://www.sscce.org/

> Question for experts: is there a way to refactor this so that data may
> be filled/written/released as the scripts go and avoid the problem?

I'm not sure what you are doing with this data. I guess you're not just:

- read the input, one line at a time
- create a giant data list
- pickle the list

and then never look at the pickle again. I imagine that you want to process
the list in some way, but how and where and when is a mystery. Most likely
you will later do:

- unpickle the list, creating a giant data list again
- process the data list

So I'm not sure what advantage the pickle is, except as make-work. Maybe
I've missed something, but if you're running out of memory processing the
giant list, perhaps a better approach is:

- read the input, one line at a time
- process that line

and avoid building the giant list or the pickle at all.

> code below.
> Thanks
>
> data = list()
> for line in sys.stdin:
>     try:
>         parts = line.strip().split("\t")
>         t = parts[0]
>         w = parts[1]
>         u = parts[2]
>         #let's retain in-memory copy of data
>         data.append({"ta": t,
>                      "wa": w,
>                      "ua": u
>                      })
>     except IndexError:
>         print("Problem with line :"+line, file=sys.stderr)
>         pass
>
> #time to save data object into a pickle file
>
> fileObject = open(filename,"wb")
> pickle.dump(data,fileObject)
> fileObject.close()

Let's re-write some of your code to make it better:

    data = []
    for line in sys.stdin:
        try:
            t, w, u = line.strip().split("\t")
        except ValueError:
            print("Problem with line:", line, file=sys.stderr)
            continue
        data.append({"ta": t, "wa": w, "ua": u})

    with open(filename, "wb") as fileObject:
        pickle.dump(data, fileObject)

(Note the `continue` after reporting a bad line: without it, the loop would
fall through and append stale -- or on the first line, undefined -- values.)

It's not obvious where you are running out of memory, but my guess is that
it is most likely while building the giant list. You have a LOT of small
dicts, each one with exactly the same set of keys. You can probably save a
lot of memory by using a tuple, or better, a namedtuple.

py> from collections import namedtuple
py> struct = namedtuple("struct", "ta wa ua")
py> x = struct("abc", "def", "ghi")
py> y = {"ta": "abc", "wa": "def", "ua": "ghi"}
py> sys.getsizeof(x)
36
py> sys.getsizeof(y)
144

So each of those little dicts {"ta": t, "wa": w, "ua": u} in your list
potentially uses as much as four times the memory as a namedtuple would
use. So using a namedtuple might very well save enough memory to avoid the
MemoryError altogether.
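For what it's worth, here is an untested sketch combining both ideas:
parse one line at a time, and pickle each record as you go, so that neither
the whole list nor the whole pickle ever has to sit in memory at once. The
helper names (`dump_records`, `load_records`) and the in-memory buffer in
the demo are my own inventions, not anything from your post; with real data
you would pass `sys.stdin` and a file opened with `open(filename, "wb")` /
`open(filename, "rb")` instead.

```python
import io
import pickle
import sys
from collections import namedtuple

struct = namedtuple("struct", "ta wa ua")

def dump_records(lines, fileobj):
    # Write one pickle frame per record; only the current record
    # is ever held in memory.
    for line in lines:
        try:
            t, w, u = line.strip().split("\t")
        except ValueError:
            print("Problem with line:", line, file=sys.stderr)
            continue
        pickle.dump(struct(t, w, u), fileobj)

def load_records(fileobj):
    # Read the frames back lazily, one record at a time,
    # until pickle.load() hits end-of-file.
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            return

# Demo with an in-memory buffer standing in for the real file.
buf = io.BytesIO()
dump_records(["a\tb\tc", "bad line", "x\ty\tz"], buf)
buf.seek(0)
records = list(load_records(buf))  # bad line is skipped
```

Whether this helps depends on where the MemoryError actually happens, of
course -- which is why the full traceback matters.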
Here is your code rewritten to use a namedtuple:

    from collections import namedtuple

    struct = namedtuple("struct", "ta wa ua")
    data = []
    for line in sys.stdin:
        try:
            t, w, u = line.strip().split("\t")
        except ValueError:
            print("Problem with line:", line, file=sys.stderr)
            continue
        data.append(struct(t, w, u))

    with open(filename, "wb") as fileObject:
        pickle.dump(data, fileObject)

And as a bonus, when you come to use the record, instead of having to
write:

    line["ta"]

to access the first field, you can write:

    line.ta


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list