Hello all. I'm still searching for my dream scientific language :)
The current codebase I am working with is language translation software written in C++. I wanted to re-implement parts of it in Python and/or Julia, both to learn the code (I didn't write the C++ parts) and maybe to make it available to other people who are interested. I looked at Pyston last night, then came back to PyPy.

As a first step, I tried to parse a 300MB structured text file containing 1.1M lines like this:

0 ||| I love you mother . ||| label=number1 number2 number3 number4 label2=number5 number6 ... number19 ||| number20

Line-by-line access was actually pretty fast, *but* trying to store the lines in a Python list drains the RAM on my 4G laptop. This is disappointing: a raw UTF-8 text file of 300MB takes more than 1GB of memory.

Today I went through PyPy and did some benchmarks. Line parsing is as follows:

- Split on " ||| ".
- Convert the 1st field to int and the 4th field to float.
- Clean up the label= prefixes from the 3rd field using re.sub().
- Append a dict (1) or a class instance (2) representing each line to a list.

# Dict (1):
#   PyPy:    ~1.4G RAM, ~12.7 seconds
#   CPython: ~1.2G RAM,  28.7 seconds
# Class (2):
#   PyPy:    ~1.2G RAM, ~11.1 seconds
#   CPython: ~1.3G RAM, ~32 seconds

The memory measurements are not precise; I tracked them visually using top :)

I'm attaching the code. I'm not an optimization guru, and I'm pretty sure there are suboptimal parts in it, but the crucial issue is memory complexity. These text files are normally ~1GB on disk, which means I can't represent them in memory with Python using this code. This is bad. Any suggestions? Thanks!
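To get a feel for where the 300MB-to-1GB blow-up comes from, here is a rough back-of-the-envelope check (the sample line and exact overheads are illustrative; object sizes are CPython-specific and vary by version and platform):

```python
import sys

# A hypothetical line in the n-best format, with made-up scores.
line = "0 ||| I love you mother . ||| label=0.1 0.2 0.3 ||| -3.5"
fields = line.split(" ||| ")

# Bytes this line occupies on disk (payload plus newline).
raw = len(line) + 1

# In memory, every field becomes a separate str object carrying its own
# header, before even counting the dict or class instance wrapping them.
in_mem = sum(sys.getsizeof(f) for f in fields)

print("on disk: %d bytes, field objects alone: %d bytes" % (raw, in_mem))
```

The per-object headers alone already exceed the raw byte count for short fields, and the containing dict or instance adds more on top.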
#!/usr/bin/env python
from __future__ import print_function

import sys
import re

# Dict:
#   PyPy:    ~1.4G RAM, ~12.7 seconds
#   CPython: ~1.2G RAM,  28.7 seconds
# Class:
#   PyPy:    ~1.2G RAM, ~11.1 seconds
#   CPython: ~1.3G RAM, ~32 seconds


class Hypothesis(object):
    def __init__(self, idx, translation, scores, global_score, alignments=None):
        self.idx = idx
        self.translation = translation
        self.scores = scores
        self.global_score = global_score
        self.alignments = alignments

    def __repr__(self):
        return "<Hypothesis: id:%d, translation:<%s>>" % (self.idx, self.translation)


class NBestFile(object):
    def __init__(self, fname, delim=" ||| "):
        self.fname = fname
        self.delim = delim
        self.hypotheses = []
        self.nb_phrases = 0
        self.tm_pos = 0
        self.f = open(self.fname)
        # Source wordlist
        self.wl_src_fname = None
        # Target wordlist
        self.wl_tgt_fname = None

    def __repr__(self):
        return "<NBestFile: %d hypotheses for %d different phrases>" % \
            (len(self.hypotheses), self.nb_phrases)

    def read(self):
        pat = re.compile("[A-Za-z]+[0-9]=")
        delim = self.delim
        for line in self.f:
            fields = line[:-1].split(delim)
            # Clean up score labels
            str_scores = pat.sub("", fields[2]).split()
            # Dict variant (1), kept for reference:
            # self.hypotheses.append({
            #     "idx": int(fields[0]),
            #     "tr": fields[1],
            #     "sc": str_scores,
            #     "global": float(fields[3]),
            #     "ali": fields[4] if len(fields) == 5 else None})
            self.hypotheses.append(Hypothesis(int(fields[0]),
                                              fields[1],
                                              str_scores,
                                              float(fields[3]),
                                              fields[4] if len(fields) == 5 else None))
        self.f.close()
        # Assuming the file is ordered by phrase id's
        if self.hypotheses:
            self.nb_phrases = self.hypotheses[-1].idx


def main(args):
    try:
        nbest_fname = args[1]
    except IndexError:
        print("Usage: %s <nbest file>" % args[0])
        return 1
    nb = NBestFile(nbest_fname)
    nb.read()
    print(nb)


if __name__ == '__main__':
    sys.exit(main(sys.argv))
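One direction I haven't benchmarked yet, sketched below: declaring __slots__ on the class to drop the per-instance __dict__, and packing the scores into array('f') instead of a list of strings (this assumes the score strings really are floats; SlimHypothesis is just a name for this sketch, not part of the attached code):

```python
from array import array


class SlimHypothesis(object):
    # __slots__ removes the per-instance __dict__; array('f') stores the
    # scores as packed C floats instead of separate Python str objects.
    __slots__ = ("idx", "translation", "scores", "global_score", "alignments")

    def __init__(self, idx, translation, scores, global_score, alignments=None):
        self.idx = idx
        self.translation = translation
        self.scores = array("f", scores)  # scores: iterable of floats
        self.global_score = global_score
        self.alignments = alignments


# Illustrative values only, mirroring the parsed fields of one line.
h = SlimHypothesis(0, "I love you mother .", [0.1, 0.2, 0.3], -3.5)
```

Whether this helps much on PyPy is an open question, since PyPy already compresses instance layouts, but on CPython it should cut the per-hypothesis footprint noticeably.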
_______________________________________________ pypy-dev mailing list pypy-dev@python.org https://mail.python.org/mailman/listinfo/pypy-dev