Hello all. I'm still searching for my dream scientific language :)
The current codebase I am working with is language translation software written in C++. I wanted to re-implement parts of it in Python and/or Julia, both to learn the code (I didn't write the C++ parts) and maybe to make it available to other people who are interested. I looked at Pyston last night, then came back to PyPy.

As a first step, I tried to parse a 300MB structured text file containing 1.1M lines like this:

0 ||| I love you mother . ||| label=number1 number2 number3 number4 label2=number5 number6 ... number19 ||| number20

Line-by-line access was actually pretty fast, *but* trying to store the lines in a Python list drains the RAM on my 4G laptop. This is disappointing: a raw UTF-8 text file of 300MB takes more than 1GB of memory.

Today I went through PyPy and did some benchmarks. Line parsing is as follows:

- Split on " ||| ".
- Convert the 1st field to int and the 4th field to float.
- Clean up the label= prefixes from the 3rd field using re.sub().
- Append a dict (1) or a class instance (2) representing each line to a list.

# Dict (1):
#   PyPy:    ~1.4G RAM, ~12.7 seconds
#   CPython: ~1.2G RAM,  28.7 seconds
# Class (2):
#   PyPy:    ~1.2G RAM, ~11.1 seconds
#   CPython: ~1.3G RAM, ~32 seconds

The memory measurements are not precise; I tracked them visually using top :)

I'm attaching the code. I'm not an optimization guru, and I'm pretty sure there are suboptimal parts in it, but the crucial issue is memory complexity. These text files are normally ~1GB on disk, which means I can't represent them in memory with Python using this code. This is bad. Any suggestions? Thanks!
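To get a feel for where the 300MB-to-1GB blow-up comes from, here is a rough back-of-the-envelope check (the sample line and exact overheads are illustrative; object sizes are CPython-specific and vary by version and platform):

```python
import sys

# A hypothetical line in the n-best format, with made-up scores.
line = "0 ||| I love you mother . ||| label=0.1 0.2 0.3 ||| -3.5"
fields = line.split(" ||| ")

# Bytes this line occupies on disk (payload plus newline).
raw = len(line) + 1

# In memory, every field becomes a separate str object carrying its own
# header, before even counting the dict or class instance wrapping them.
in_mem = sum(sys.getsizeof(f) for f in fields)

print("on disk: %d bytes, field objects alone: %d bytes" % (raw, in_mem))
```

The per-object headers alone already exceed the raw byte count for short fields, and the containing dict or instance adds more on top.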
#!/usr/bin/env python
from __future__ import print_function

import sys
import re

# Dict:
#   PyPy:    ~1.4G RAM, ~12.7 seconds
#   CPython: ~1.2G RAM,  28.7 seconds
# Class:
#   PyPy:    ~1.2G RAM, ~11.1 seconds
#   CPython: ~1.3G RAM, ~32 seconds


class Hypothesis(object):
    def __init__(self, idx, translation, scores, global_score, alignments=None):
        self.idx = idx
        self.translation = translation
        self.scores = scores
        self.global_score = global_score
        self.alignments = alignments

    def __repr__(self):
        return "<Hypothesis: id:%d, translation:<%s>>" % (self.idx, self.translation)


class NBestFile(object):
    def __init__(self, fname, delim=" ||| "):
        self.fname = fname
        self.delim = delim
        self.hypotheses = []
        self.nb_phrases = 0
        self.tm_pos = 0
        self.f = open(self.fname)
        # Source wordlist
        self.wl_src_fname = None
        # Target wordlist
        self.wl_tgt_fname = None

    def __repr__(self):
        return "<NBestFile: %d hypotheses for %d different phrases>" % \
            (len(self.hypotheses), self.nb_phrases)

    def read(self):
        pat = re.compile("[A-Za-z]+[0-9]=")
        delim = self.delim
        for line in self.f:
            fields = line[:-1].split(delim)
            # Clean up score labels
            str_scores = pat.sub("", fields[2]).split()
            # Dict variant (1), kept for reference:
            # self.hypotheses.append({
            #     "idx": int(fields[0]),
            #     "tr": fields[1],
            #     "sc": str_scores,
            #     "global": float(fields[3]),
            #     "ali": fields[4] if len(fields) == 5 else None})
            self.hypotheses.append(Hypothesis(int(fields[0]),
                                              fields[1],
                                              str_scores,
                                              float(fields[3]),
                                              fields[4] if len(fields) == 5 else None))
        self.f.close()
        # Assuming the file is ordered by phrase id's
        if self.hypotheses:
            self.nb_phrases = self.hypotheses[-1].idx


def main(args):
    try:
        nbest_fname = args[1]
    except IndexError:
        print("Usage: %s <nbest file>" % args[0])
        return 1
    nb = NBestFile(nbest_fname)
    nb.read()
    print(nb)


if __name__ == '__main__':
    sys.exit(main(sys.argv))
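One direction I haven't benchmarked yet, sketched below: declaring __slots__ on the class to drop the per-instance __dict__, and packing the scores into array('f') instead of a list of strings (this assumes the score strings really are floats; SlimHypothesis is just a name for this sketch, not part of the attached code):

```python
from array import array


class SlimHypothesis(object):
    # __slots__ removes the per-instance __dict__; array('f') stores the
    # scores as packed C floats instead of separate Python str objects.
    __slots__ = ("idx", "translation", "scores", "global_score", "alignments")

    def __init__(self, idx, translation, scores, global_score, alignments=None):
        self.idx = idx
        self.translation = translation
        self.scores = array("f", scores)  # scores: iterable of floats
        self.global_score = global_score
        self.alignments = alignments


# Illustrative values only, mirroring the parsed fields of one line.
h = SlimHypothesis(0, "I love you mother .", [0.1, 0.2, 0.3], -3.5)
```

Whether this helps much on PyPy is an open question, since PyPy already compresses instance layouts, but on CPython it should cut the per-hypothesis footprint noticeably.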
_______________________________________________ pypy-dev mailing list pypy-dev@python.org https://mail.python.org/mailman/listinfo/pypy-dev