Quoting Vincent Lefevre (vinc...@vinc17.net):

> [The context: *very basic* header validation of e-mail messages]
>
> On 2015-04-28 10:27:40 +0200, Nicolas George wrote:
> > L'octidi 8 floréal, an CCXXIII, Vincent Lefevre a écrit :
> > > I don't understand the point. Accumulating in strings (which involves
> > > copies and possible reallocations) and doing a split is much slower
> > > than reading lines one by one and treating them separately.
> >
> > First: not necessarily, because once the header is loaded in a string, you
> > can apply regexps to the whole header at once instead of using a loop. This
> > may prove faster.
>
> I've finally tried this solution (i.e. accumulating, then applying the
> regexp to the full string) and it takes about 60% more time when
> the data are in the disk cache.
I can't quite understand Nicolas's sentence because I'm not sure whether
by "the header" and "the whole header" he means the several lines of
headers taken together. However, in

  https://lists.debian.org/debian-user/2015/04/msg01265.html

I was perhaps less ambiguous (point 2):

  "In which case, if you want to know how come mutt is so fast, take a
  look at the source. Just to mention one optimisation I would consider:
  slurp the directory and sort the entries by inode. Open the files in
  inode order. And another: it's probably faster to slurp bigger chunks
  of each file (with an intelligent guess of the best buffer size) and
  use a fast search for \nMessage-ID rather than reading and checking
  line by line."

> This is not surprising, IMHO, for the following reasons:
>
> First, as I've said, accumulating lines in a string may involve copies
> and reallocations because the string grows (I don't know whether there
> is a way to solve that without obfuscating the code).

By slurp, I meant for you to try reading the top of the file as a single
chunk using a read-a-load-of-bytes method rather than a repetitive
readline method. In Python's terms (because I don't know the Perl ones),
a call of read():

  class io.RawIOBase
    read(size=-1)
      Read up to size bytes from the object and return them. As a
      convenience, if size is unspecified or -1, readall() is called.
      Otherwise, only one system call is ever made...

rather than readline():

  class io.IOBase
    readline(size=-1)
      Read and return one line from the stream. If size is specified,
      at most size bytes will be read.

> Then I don't think that in the particular case of header validation,
> there is much gain applying regexp's on the full header at once; the
> reason is that my regexp's use the end of line as a separator (things
> like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
> line by line, I already do a part of the job of regexp matching.
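To make the slurp-and-search suggestion concrete, here is a minimal
Python sketch along those lines. It is only an illustration of the
technique, not anyone's actual script: the function name, the 8 KiB
buffer size, and the file layout assumptions (headers terminated by a
blank line) are mine.

```python
def find_message_id(path, bufsize=8192):
    """Slurp the top of the file with one read() call and do a fast
    search for the Message-ID header line, instead of calling
    readline() repeatedly."""
    with open(path, "rb") as f:
        chunk = f.read(bufsize)      # one chunk, one system call (usually)
    # Headers end at the first blank line; ignore anything after it.
    end = chunk.find(b"\n\n")
    if end != -1:
        chunk = chunk[:end]
    # Prepend "\n" so a header on the very first line still matches;
    # "\nmessage-id:" can then only match at the start of a header line.
    hay = b"\n" + chunk
    pos = hay.lower().find(b"\nmessage-id:")
    if pos == -1:
        return None
    line_end = hay.find(b"\n", pos + 1)
    if line_end == -1:
        line_end = len(hay)
    return hay[pos + 1:line_end].decode("ascii", "replace")
```

The point of the sketch is that bytes.find() (like strstr in C) scans
the whole chunk in optimised library code, with no per-line Python
overhead.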
But I would assume that the regexp engines in languages like Perl and
Python are far more optimised than reading files line by line. So you
would search for \nmessage-id:.*?\n (where .*? is non-greedy).

> And finally, for each test, the header has to be read several times.

I'm not sure why, without knowing the tests to apply (or did I miss
seeing them?).

> In my case, I don't need to deal with folded headers, except validating
> the format, which is very easy with a line-by-line parsing.

You did mention validating Message-ID and other headers and checking for
missing ones, but do your scripts throw all this work away and, if so,
why? For example, if you add your own distinctive Message-ID header to
any file that doesn't have one, then that's one test you never have to
repeat.

> I may have other scripts that need to deal with them, but in this case,
> I accumulate physical lines into a single logical one. AFAIK, this is
> what mail processors do (postfix header filtering, procmail...). But
> there is no need to accumulate the full header in a single string.

Why not think of it this way: the "full header" (i.e. all the header
lines of a message) *is* a single string: it's the beginning of the
file, terminated by \n\n. I wonder how much speed-up you could achieve
with a C function using strstr to find the end of the headers and
returning them as a single string.

Cheers,
David.

-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: https://lists.debian.org/20150523020101.GA8528@alum
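[Editor's sketch of the two ideas in the message above — the whole
header block as a single string delimited by \n\n, plus the non-greedy
\nmessage-id:.*?\n search. It is written in Python rather than C, with
bytes.find playing the role of strstr; the names and the 8 KiB buffer
size are illustrative assumptions, not part of the original post.]

```python
import re

# Non-greedy, case-insensitive search for a whole Message-ID header
# line, anchored to a line start by the leading "\n".
MSGID_RE = re.compile(rb"\nmessage-id:.*?\n", re.IGNORECASE)

def header_block(path, bufsize=8192):
    """Return all the header lines of a message as one string:
    the start of the file up to and including the newline before
    the first blank line (the strstr(buf, "\\n\\n") idea)."""
    with open(path, "rb") as f:
        chunk = f.read(bufsize)
    end = chunk.find(b"\n\n")        # bytes.find is Python's strstr
    return chunk if end == -1 else chunk[:end + 1]

def has_message_id(path):
    # Prepend "\n" so a Message-ID on the very first line still matches.
    return MSGID_RE.search(b"\n" + header_block(path)) is not None
```

Because the search runs over one string, the regexp engine does all the
scanning in a single call, and occurrences of "Message-ID" in the body
are excluded by construction, since the block stops at the blank line.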