On 2015-04-24 16:39:51 -0500, David Wright wrote:
[...]
> And another: it's probably faster to slurp bigger chunks of each file
> (with an intelligent guess of the best buffer size) and use a fast
> search for \nMessage-ID rather than reading and checking line by line.

This is not that simple. I want my script to be very reliable. In
particular, if there is a message without a Message-ID and with
"\nMessage-ID" in the body, I want to detect it. This kind of thing
really happens in practice (though rarely), e.g. due to buggy mail
software that breaks the headers and puts part of them in the body.
I also want to check the format of the headers and detect possible
duplicate Message-IDs.

What my script really does is:

  while (<FILE>)
    {
      # Skip continuation lines of a folded header field.
      /^[\t ]/ and next;
      # Any other line must be a header field, or a single "From " line.
      # The blank line ending the headers fails this test too, so a
      # message without a Message-ID is rejected here.
      /^\S+:/ || (!$from++ && /^From /)
        or die "$proc: bad message format ($file)";
      # Keep reading until the Message-ID field is found.
      /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i or next;
      defined $files{$1} and die
        "$proc: duplicate message-id $1 ($files{$1} and $file)\n";
      $files{$1} = $file;
      last;
    }

[...]
> And should you read the whole directory by specifying <directory-name>/*,
> you lose the benefit and thrash the disk again.

With zsh, I often do things like:

  grep ... <directory-name>/**/*.c

One can choose to sort the result, but zsh doesn't support sorting by
inode number. I've sent a feature request.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
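
As an illustration of the inode ordering discussed in this thread, here is a minimal Perl sketch that sorts a list of file names by inode number. It is not taken from the script quoted above: the helper name sort_by_inode is hypothetical, and silently dropping paths that cannot be examined with lstat is an assumption of the sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the given paths ordered by inode number.
# Paths for which lstat fails are dropped (an assumption of this sketch).
sub sort_by_inode {
    my @paths = @_;
    return map  { $_->[1] }                      # keep only the path
           sort { $a->[0] <=> $b->[0] }          # numeric sort on the inode
           map  { my @st = lstat $_;             # element 1 of lstat is the inode
                  @st ? [ $st[1], $_ ] : () }
           @paths;
}

# Print the paths given on the command line in inode order.
print "$_\n" for sort_by_inode(@ARGV);
```

From zsh, one could then pass the glob to the sketch and feed the result to another command, for example by saving it as a script and running it with <directory-name>/**/*.c as arguments. File names containing newlines would need a different output separator.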