Maybe it is a good idea to use Disco (http://discoproject.org/) to
process your data. Disco is a MapReduce framework that lets you write
jobs in Python, so the per-line work could be spread across your
cluster of PCs instead of running in a single process.
Yours faithfully,
Alexander Abushkevich

On Sat, Jan 30, 2010 at 10:36 PM, marc magrans de abril
<marcmagransdeab...@gmail.com> wrote:
> Dear colleagues,
>
> I was writing a small program to classify log files for a cluster of
> PCs; I just wanted to simplify a fairly repetitive task of finding
> errors and the like.
>
> My first naive implementation was something like:
>
>     patterns = []
>     while logs:
>         pattern = logs[0]
>         new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
>         entry = (len(logs) - len(new_logs), pattern)
>         patterns.append(entry)
>         logs = new_logs
>
> where dist(...) is the Levenshtein distance (i.e. edit distance) and
> logs is something like 1.5M log lines (a 700 MB file). I thought
> Python would be an easy choice, although not a really fast one.
>
> I was not surprised when the first iteration of the while loop took
> ~10 min. I thought "not bad, let's see how long it takes". However,
> the second iteration never seemed to finish.
>
> My surprise was big when I replaced the list comprehension with a
> print:
>
>     new_logs = []
>     for count, l in enumerate(logs):
>         print count
>         if dist(pattern, l) > THRESHOLD:
>             new_logs.append(l)
>
> The surprise was that the displayed counter was running ~10 times
> slower on the second iteration of the while loop.
>
> I am a little lost. Does anyone know the reason for this behavior?
> How should I write a program that deals with large data sets in
> Python?
>
> Thanks a lot!
> marc magrans de abril