Alf P. Steinbach wrote:
* marc magrans de abril:
Dear colleagues,

I was writing a small program to classify log files for a cluster of
PCs; I just wanted to simplify a rather repetitive task of looking for
errors and the like.

My first naive implementation was something like:
    patterns = []
    while logs:
        pattern = logs[0]
        # keep only the logs that are far (edit distance above the
        # threshold) from the current pattern
        new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
        entry = (len(logs) - len(new_logs), pattern)
        patterns.append(entry)
        logs = new_logs

Here dist(...) is the Levenshtein distance (i.e. edit distance), and
logs holds something like 1.5M log lines (a 700 MB file). I thought
Python would be an easy choice, although not a really fast one.
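
For reference, dist is essentially the classic dynamic-programming
edit distance; a minimal pure-Python sketch (the real code may well
use a C extension, but the cost has the same shape):

    def dist(a, b):
        # Classic edit distance, O(len(a) * len(b)).
        # prev[j] holds the distance between the current prefix
        # of a and b[:j].
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

Note that the work grows with the product of the two line lengths.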

I was not surprised when the first iteration of the while loop took
~10 min. I thought "not bad, let's see how long the rest takes".
However, the second iteration seemingly never finished.

I was quite surprised by what happened when I replaced the list
comprehension with an explicit loop and a print:

    new_logs = []
    for count, l in enumerate(logs):
        print count
        if dist(pattern, l) > THRESHOLD:
            new_logs.append(l)

The displayed counter ran ~10 times slower on the second iteration of
the while loop.

I am a little lost. Does anyone know the reason for this behavior?

It's on line 42 of your program. :-) That is, it's in the dist function. Evidently it doesn't like a more complex 'pattern'.

Find out which pattern is being used on the second iteration and then
try it on the first iteration. Is it just as slow? If so, then it's
down to the length/complexity of that pattern: the edit-distance
computation is proportional to the product of the two string lengths,
so a much longer pattern makes every comparison correspondingly
slower.
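
A quick way to check, as a sketch (first_pattern, second_pattern and
sample_logs are hypothetical names; dist and THRESHOLD are as in your
code):

    import time

    def time_pass(pattern, logs, threshold):
        # Time a single filtering pass against one fixed pattern.
        t0 = time.time()
        kept = [l for l in logs if dist(pattern, l) > threshold]
        return time.time() - t0, len(kept)

    # t1, n1 = time_pass(first_pattern, sample_logs, THRESHOLD)
    # t2, n2 = time_pass(second_pattern, sample_logs, THRESHOLD)

If the second pattern is much slower here too, the cost is in dist()
itself, not in the surrounding loop.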

How should I write a program that deals with large data sets in Python?

As in any other language: try to avoid repeating the same
computations, and try to make the data fit the computational task.
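
For instance, the edit distance between two strings is never smaller
than the difference of their lengths, so a cheap length comparison can
skip most of the expensive dist() calls. A sketch, reusing dist, logs,
pattern and THRESHOLD from the original post:

    def keep(pattern, line, threshold):
        # abs(len(pattern) - len(line)) is a lower bound on the edit
        # distance, so if the lengths alone differ by more than the
        # threshold the line certainly survives the filter and the
        # expensive dist() call can be skipped.
        if abs(len(pattern) - len(line)) > threshold:
            return True
        return dist(pattern, line) > threshold

    new_logs = [l for l in logs if keep(pattern, l, THRESHOLD)]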

True. Basically, you're computing the distance between every pair of
logs!
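
One cheap way to shrink that quadratic work, as a sketch of the same
loop: identical log lines need only one distance computation each, so
collapse the 1.5M lines to their distinct values first and carry the
multiplicities along:

    from collections import Counter

    counts = Counter(logs)      # distinct line -> number of occurrences
    unique_logs = list(counts)

    patterns = []
    while unique_logs:
        pattern = unique_logs[0]
        new_logs = [l for l in unique_logs if dist(pattern, l) > THRESHOLD]
        # weight the match count by how often each collapsed line occurs
        matched = (sum(counts[l] for l in unique_logs)
                   - sum(counts[l] for l in new_logs))
        patterns.append((matched, pattern))
        unique_logs = new_logs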