>> A small note about performance here. If your log files are very
>> large (say, hundreds of thousands or millions of lines), you will
>> find that this part is *horribly, horribly slow*. There are two
>> problems, a minor one and a major one.
>
> Ah, actually I was mistaken about that. I forgot that for built-in
> lists, += augmented assignment is equivalent to calling
> list.extend(), so it actually does make the modifications in place.
> So I was wrong to say:
No problem. Linus's Law wins again. :P
(http://en.wikipedia.org/wiki/Linus's_Law)  Thank goodness for public
mailing lists, where we can share our successes and learning
experiences together!

With regard to the earlier part about doing the decoding at the call
to open() rather than on each individual line: next time, you'll want
to make the point that it's better to decode at open() time not
because it's more efficient, but because it's more correct.
Correctness needs to be the winning argument here.

An encoding is a property of the entire file, not of its individual
lines. In fact, we can get into trouble by doing the decoding
piecewise across lines, because certain encodings are multi-byte in
nature. What this means is that what looks like a newline in the
uninterpreted bytes of a file may be deceptive: that "newline" byte
might actually be part of a multi-byte character! (The two sketches
at the end of this message demonstrate this concretely.)

Let's see if we can construct an example to demonstrate.

########################################################################
## Python 2 code: unichr() and implicit byte strings. In Python 3,
## you'd use chr() and test for b'\n' instead.
for encoding in ('utf-8', 'utf-16', 'utf-32'):
    for i in range(0x110000):
        aChar = unichr(i)
        try:
            someBytes = aChar.encode(encoding)
            if '\n' in someBytes:
                print("%r contains a newline in its bytes encoded"
                      " with %s" % (aChar, encoding))
        except:
            ## Normally, a try/except with a bare except clause is a
            ## bad idea. Here, this is toy code, and we're just
            ## exploring.
            pass
########################################################################

This toy code goes through all possible Unicode code points and
encodes each one with three different codecs. We look to see whether
any of the encoded characters have newline bytes in them, and report
those that do. Try running it. Notice how many characters get
reported. :P

Hopefully, this makes the point clearer: we must not try to decode
individual lines. By then, the damage has already been done: breaking
the file into lines by naively looking for newline bytes is invalid
when certain characters' encoded bytes can themselves contain newline
bytes.
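Here's a small, concrete sketch of that failure mode. (The filename
'demo.txt' is made up for illustration, and I picked U+0A0A, a
Gurmukhi letter, simply because its UTF-16 encoding is the byte pair
'\x0a\x0a': two "newline" bytes that are really one character.)

########################################################################
import io

## One real line of text containing U+0A0A.
text = u'before \u0a0a after\n'

## Write the file correctly, letting the codec handle everything.
with io.open('demo.txt', 'w', encoding='utf-16') as f:
    f.write(text)

## Naive approach: split the raw bytes on newline bytes, then decode
## each piece. The multi-byte character fools us.
with io.open('demo.txt', 'rb') as f:
    raw_pieces = f.read().split(b'\n')

print(len(raw_pieces))  ## 4: it finds "lines" that do not exist!
########################################################################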
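And here's the remedy: hand the encoding to open() itself, so the
decoding happens before any line splitting. (Again a minimal sketch;
it reuses the 'demo.txt' file written above.)

########################################################################
import io

## Decode at open() time: the codec applies to the whole file, and
## the stream hands us already-decoded unicode lines.
with io.open('demo.txt', encoding='utf-16') as f:
    lines = f.readlines()

print(len(lines))  ## 1: the single line the file really contains
########################################################################

The stream machinery knows those two '\x0a' bytes belong to a single
character, so it never mistakes them for a line break.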