Dear colleagues,

I was writing a small program to classify log files for a cluster of PCs; I just wanted to simplify a rather repetitive task of finding errors and the like.
My first naive implementation was something like:

    patterns = []
    while logs:
        pattern = logs[0]
        new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
        entry = (len(logs) - len(new_logs), pattern)
        patterns.append(entry)
        logs = new_logs

where dist(...) is the Levenshtein distance (i.e., edit distance) and logs is a list of roughly 1.5M log lines (a 700 MB file).

I figured Python would be an easy, if not especially fast, choice, so I was not surprised when the first iteration of the while loop took ~10 min. I thought, "not bad, let's see how long the rest takes." However, the second iteration seemed to never finish. To see what was going on, I replaced the list comprehension with an explicit loop and a print:

    new_logs = []
    for count, l in enumerate(logs):
        print count
        if dist(pattern, l) > THRESHOLD:
            new_logs.append(l)

The surprise was that the displayed counter ran about 10 times slower during the second iteration of the while loop than during the first.

I am a little lost. Does anyone know the reason for this behavior? How should I write a program that deals with large data sets in Python?

Thanks a lot!
marc magrans de abril
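
P.S. In case a runnable version helps, here is a minimal self-contained sketch of the approach, with a plain dynamic-programming edit distance standing in for my real dist(). The THRESHOLD value and the sample log lines are just illustrative, not the ones from my actual run:

    def dist(a, b):
        # Textbook dynamic-programming Levenshtein (edit) distance,
        # included only so the example runs end to end.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a):
            cur = [i + 1]
            for j, cb in enumerate(b):
                cur.append(min(prev[j + 1] + 1,        # deletion
                               cur[j] + 1,             # insertion
                               prev[j] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    THRESHOLD = 5  # illustrative value only

    # Toy stand-ins for the 1.5M real log lines.
    logs = ["ERROR node01 disk full",
            "ERROR node02 disk full",
            "WARN  node01 high load",
            "WARN  node03 high load"]

    patterns = []
    while logs:
        pattern = logs[0]
        # Keep only the lines that do NOT match the current pattern.
        new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
        patterns.append((len(logs) - len(new_logs), pattern))
        logs = new_logs

    print(patterns)  # [(lines_matched, representative_line), ...]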