Re: What to learn from the BSD grep case [Was: why GNU grep is fast]
On Mon, Aug 23, 2010 at 5:04 PM, Gabor Kovesdan wrote:
> 4, We really need a good regex library. From the comments, it seems
> there's no such thing in the open source world. GNU libregex isn't
> efficient because GNU grep uses those workarounds that Mike kindly
> pointed out. Oniguruma was extremely slow when I checked it. PCRE
> supports Perl-style syntax with a POSIX-like API but not POSIX regex.
> Google RE2 is the same with additional egrep syntax but doesn't have
> support for standard POSIX regexes. Plan 9 regex only supports egrep
> syntax. It seems that TRE is the best choice. It is BSD-licensed,
> supports wchar and POSIX(ish) regexes and it is quite fast.

I know it's C++ and not exactly what you're needing, but have you evaluated Boost::Regex? Isn't there some code that could be retrofitted into a C library?

http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/index.html

> I don't know the theoretical background of regex engines but I'm
> wondering if it's possible to provide an alternative API with
> byte-counted buffers and use the heuristic speedup with fixed string
> matching. As Mike pointed out, the POSIX API is quite limiting because
> it works on NUL-terminated strings and not on byte-counted buffers, so
> we couldn't just do it with a POSIX-conformant library, but it would be
> nice if we could implement it in such a library with an alternative
> interface.
>
> Gabor

-cpghost.

--
Cordula's Web. http://www.cordula.ws/

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: What to learn from the BSD grep case [Was: why GNU grep is fast]
On Aug 23, 2010, at 9:04 AM, Gabor Kovesdan wrote:
> Hi all,
>
> there are some conclusions that we can draw from the grep case. Here I'd
> like to add a summary, which raises some questions. All comments are
> welcome.
>
> 1, When grep entered -CURRENT and bugs were found, I immediately got
> kind bug reports and sharp criticism as well. According to my
> understanding, -CURRENT is for development and it's fine to expose new
> pieces of work there, but now I'm in doubt about that because of the
> complaints. On the other hand, an earlier version of BSD grep has been
> in the ports tree for a very long time and users reported some problems,
> which have been fixed, but there are still a lot of bugs that weren't
> reported at that time. If users don't volunteer to test new code,
> somehow we have to make them test it, so I think committing BSD grep to
> -CURRENT was a good decision in the first round.

You did everything right. You were responsive, you were open to suggestions, and you got the code in. Even more importantly, you got the code in a year before 9.0, instead of waiting until the last minute, months from now, and creating a dilemma for the release engineers. Software is an iterative process of feedback and improvement. The way that you've handled this should be a model for the project.

Scott
What to learn from the BSD grep case [Was: why GNU grep is fast]
Hi all,

there are some conclusions that we can draw from the grep case. Here I'd like to add a summary, which raises some questions. All comments are welcome.

1, When grep entered -CURRENT and bugs were found, I immediately got kind bug reports and sharp criticism as well. According to my understanding, -CURRENT is for development and it's fine to expose new pieces of work there, but now I'm in doubt about that because of the complaints. On the other hand, an earlier version of BSD grep has been in the ports tree for a very long time and users reported some problems, which have been fixed, but there are still a lot of bugs that weren't reported at that time. If users don't volunteer to test new code, somehow we have to make them test it, so I think committing BSD grep to -CURRENT was a good decision in the first round.

2, This issue also brought up some bottlenecks and potential optimization points (like memchr() and mmap), which other software may benefit from. This is another reason to let such pieces of work in. But unfortunately, it also means that no one had profiled the other utilities, because these bottlenecks remained undiscovered. Neither did I. It's a lesson that we have to learn from this particular case.

3, Because of point 2, we need more content in the developers-handbook to help development with such ideas and best practices. It has also been raised on another list that our end-user documentation isn't as shiny and cool as it used to be, and the developers-handbook has never been "finished" to the point of being more or less complete. If someone looks at it, it looks like a sketch, not a book. I'll see if I can write a section on profiling.

4, We really need a good regex library. From the comments, it seems there's no such thing in the open source world. GNU libregex isn't efficient because GNU grep uses those workarounds that Mike kindly pointed out. Oniguruma was extremely slow when I checked it.
PCRE supports Perl-style syntax with a POSIX-like API but not POSIX regex. Google RE2 is the same with additional egrep syntax but doesn't have support for standard POSIX regexes. Plan 9 regex only supports egrep syntax. It seems that TRE is the best choice. It is BSD-licensed, supports wchar and POSIX(ish) regexes and it is quite fast.

I don't know the theoretical background of regex engines but I'm wondering if it's possible to provide an alternative API with byte-counted buffers and use the heuristic speedup with fixed string matching. As Mike pointed out, the POSIX API is quite limiting because it works on NUL-terminated strings and not on byte-counted buffers, so we couldn't just do it with a POSIX-conformant library, but it would be nice if we could implement it in such a library with an alternative interface.

Gabor
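[For what it's worth, a byte-counted entry point along these lines already exists as a common extension: the REG_STARTEND flag to regexec(), available in the BSDs and in glibc, takes the search window from pmatch[0] instead of relying on a NUL terminator. A minimal sketch; the match_range() wrapper is hypothetical, not an existing API:]

```c
#define _GNU_SOURCE
#include <regex.h>

/* Search a byte-counted window of 'buf' using the non-standard but
 * widely available REG_STARTEND extension: on entry, pmatch[0] carries
 * the window [rm_so, rm_eo); on success it is overwritten with the
 * match offsets, relative to 'buf'.  Returns 1 on match, 0 otherwise. */
static int
match_range(const regex_t *re, const char *buf, size_t off, size_t len,
    regmatch_t *m)
{
	m->rm_so = (regoff_t)off;
	m->rm_eo = (regoff_t)(off + len);
	return (regexec(re, buf, 1, m, REG_STARTEND) == 0);
}
```

[This doesn't remove all of the POSIX API's limits, but it does let a caller hand the matcher an arbitrary slice of a larger buffer.]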
Re: why GNU grep is fast
> Later on, he summarizes some of the existing implementations, including
> comments about the Plan 9 implementation and his own RE2, both of which
> efficiently handle international text (which seems to be a major concern
> of Gabor's).
>
> I believe Gabor is considering TRE for a good replacement regex library.

Yes. Oniguruma is slow, Google RE2 only supports Perl and fgrep syntax but not standard regex, and the Plan 9 implementation IIRC only supports fgrep syntax and Unicode but not wchar_t in general.

> The key comment in Mike's GNU grep notes is the one about not breaking
> the input into lines. That's simply double-scanning the input; instead,
> run the matcher over blocks of text and, when it finds a match, work
> backwards from the match to find the appropriate line beginning. This is
> efficient because most lines don't match.
>
> I do like the idea.

So do I.

> BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a
> faster search bypassing the regex library. :) It certainly was not
> written with any encodings in mind; it was purely ASCII. As I have not
> kept up with it, I do not know if anyone has improved it or not.

It has been made wchar-compliant.

Gabor
Re: why GNU grep is fast
On 2010.08.21. 4:31, Mike Haertel wrote:
> Hi Gabor,
>
> I am the original author of GNU grep. I am also a FreeBSD user, although
> I live on -stable (and older) and rarely pay attention to -current.
>
> However, while searching the -current mailing list for an unrelated
> reason, I stumbled across some flamage regarding BSD grep vs GNU grep
> performance. You may have noticed that discussion too...
>
> Anyway, just FYI, here's a quick summary of where GNU grep gets its
> speed. Hopefully you can carry these ideas over to BSD grep.

Thanks, Mike, I'll check how I can adopt these ideas into BSD grep. I think your kind help was a very nice gesture, and the open source world should be based on such cooperation instead of meaningless competition.

Gabor
Re: why GNU grep is fast
Mike Haertel writes:
> Dag-Erling Smørgrav writes:
>> You don't really need to "isolate the containing line" unless you have
>> an actual match, do you?
>
> Theoretically no. However, suppose the pattern was /foo.*blah/. The
> Boyer-Moore search will be for "blah", since that's the longest fixed
> substring. But verifying a match for the full regexp either requires a
> regexp matcher with the feature "start here, at THIS point in the middle
> of the RE and THAT point in the middle of the buffer, and match
> backwards and forwards", or else running a more standard RE matcher
> starting from the beginning of the line.

Stupidly enough, I only considered the case where the fixed search string is at the end of the regexp :)

For the general case, you'd have to either build a second FSA that mirrors the first, or design your engine and data structures in such a way that the FSA can be traversed in either direction. Then you note which states correspond to the beginning and end of the fixed substring, and run the FSA forward from the end and backward from the beginning. That could prove advantageous when the input consists of very long lines and the frequency of partial matches is relatively high; large files with very long lines are very common in bioinformatics.

> As you mentioned, you can avoid searching forwards for the next newline
> if your RE matcher supports using newline as an exit marker.

Umm, I don't understand why it needs to support using *anything* as an exit marker. The newline character will simply not match any of the transitions from whichever state the FSA is currently in, and neither will the NUL character. The only bit that needs to know about them is the bit that handles end-of-line and end-of-word anchors.

DES
--
Dag-Erling Smørgrav - d...@des.no
Re: why GNU grep is fast
"Sean C. Farley" writes: > Dag-Erling Smørgrav writes: > > Aho-Corasick is not really a search algorithm, but an algorithm for > > constructing a table-driven finite state machine that will match > > either of the search strings you fed it. I believe it is less > > efficient than Boyer-Moore for small numbers of search terms, since > > it scans the entire input. I don't see the point in using it in > > grep, because grep already has an algorithm for constructing finite > > state machines: regcomp(3). > especially those that could be useful for fgrep functionality Not entirely sure what you mean by that, but in most cases, I think Boyer-Moore is a better fit for fgrep. For large numbers of search terms, I might use Aho-Corasick... if it weren't for the fact that, as mentioned, grep already has a regexp engine, which is a superset of Aho-Corasick. It might be a tad slower at building the FSA, because it's more general, and the FSA might occupy a tad more memory, but the complexity is *exactly* the same, and adding Aho-Corasick just for fgrep is, for lack of a better word, pedantry. For all you know, the increased code size could very well offset whatever performance advantage Aho-Corasick might offer. > Glimpse may be a POS; I have not used it personally. I only noted its > algorithm for possible use within fgrep. Apples and oranges. The problem glimpse tries to solve (fixed corpus, variable search terms) is precisely the opposite of the one fgrep tries to solve (fixed search terms, variable corpus). Glimpse does include grep-like functionality, but in the form of agrep, which is designed for approximate matching. DES -- Dag-Erling Smørgrav - d...@des.no ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: why GNU grep is fast
On Sun, 22 Aug 2010, Tim Kientzle wrote:
> On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
>> On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
>>> Mike Haertel writes:
>>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first
>>>> for the final letter of the target string, and uses a lookup table to
>>>> tell it how far ahead it can skip in the input whenever it finds a
>>>> non-matching character.
>>>
>>> Boyer-Moore is for fixed search strings. I don't see how that
>>> optimization can work with a regexp search unless the regexp is so
>>> simple that you break it down into a small number of cases with known
>>> length and final character.
>>
>> When I was working on making FreeGrep faster (years ago), I wrote down
>> a few notes about possible algorithms, especially those that could be
>> useful for fgrep functionality. I am just passing these on to the list.
>>
>> Some algorithms:
>> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
>> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
>> 3. GNU fgrep: Commentz-Walter
>> 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)
>>
>> Also, this may be a useful book:
>> http://www.dcc.uchile.cl/~gnavarro/FPMbook/
>
> And of course, Russ Cox's excellent series of articles starting at:
> http://swtch.com/~rsc/regexp/regexp1.html

I saved that link from an e-mail earlier because it looked very interesting.

> Later on, he summarizes some of the existing implementations, including
> comments about the Plan 9 implementation and his own RE2, both of which
> efficiently handle international text (which seems to be a major concern
> of Gabor's).

I believe Gabor is considering TRE for a good replacement regex library.

> The key comment in Mike's GNU grep notes is the one about not breaking
> the input into lines. That's simply double-scanning the input; instead,
> run the matcher over blocks of text and, when it finds a match, work
> backwards from the match to find the appropriate line beginning. This is
> efficient because most lines don't match.

I do like the idea.
> Boyer-Moore is great for fixed strings (a very common use case for grep)
> and for more complex patterns that contain long fixed strings (it helps
> to discard most lines early). Sophisticated regex matchers implement a
> number of strategies and choose different ones depending on the pattern.

That is what fastgrep (in bsdgrep) attempts to accomplish with very simple regex patterns (beginning of line, end of line and dot).

> In the case of bsdgrep, it might make sense to use the regex library for
> the general case but implement a hand-tuned search for fixed strings
> that can be heavily optimized for that case. Of course, international
> text support complicates the picture; you have to consider the input
> character set (if you want to auto-detect Unicode encodings by looking
> for leading BOMs, for example, you either need to translate the
> fixed-string pattern to match the input encoding or vice versa).

BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a faster search bypassing the regex library. :) It certainly was not written with any encodings in mind; it was purely ASCII. As I have not kept up with it, I do not know if anyone has improved it or not.

Sean
--
s...@freebsd.org
Re: why GNU grep is fast
On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
> "Sean C. Farley" writes:
>> Some algorithms:
>> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
>
> Aho-Corasick is not really a search algorithm, but an algorithm for
> constructing a table-driven finite state machine that will match either
> of the search strings you fed it. I believe it is less efficient than
> Boyer-Moore for small numbers of search terms, since it scans the entire
> input. I don't see the point in using it in grep, because grep already
> has an algorithm for constructing finite state machines: regcomp(3).

>> especially those that could be useful for fgrep functionality

I was mainly talking about algorithms useful for the fgrep portion within FreeGrep. fgrep would run (still runs?) over the same text for each pattern. Therefore, Aho-Corasick had to be mentioned for the reason referenced within the link: "The Aho-Corasick string matching algorithm formed the basis of the original Unix command fgrep."

>> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
>
> It doesn't seem to compare favorably to the far older Aho-Corasick. It
> uses slightly less memory, but memory is usually not an issue with grep.

I agree, yet I like to keep alternative algorithms in mind in case a variant would be useful.

>> 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)
>
> Glimpse is a POS... and not really comparable, because grep is designed
> to search for a single search string in multiple texts, while glimpse is
> designed to search a large amount of text over and over with different
> search strings. I believe it uses suffix tables to construct its index,
> and Boyer-Moore only to locate specific matches, since the index lists
> only files, not exact positions. For anything other than fixed strings,
> it reverts to agrep, but I assume (I haven't looked at the code) that if
> the regexp has one or more fixed components, it uses those to narrow the
> search space before running agrep.
Glimpse may be a POS; I have not used it personally. I only noted its algorithm for possible use within fgrep. Of course, there may be much better algorithms out there to boost fgrep's speed, but these are what I had found at one time.

Sean
--
s...@freebsd.org
Re: why GNU grep is fast
Dag-Erling Smørgrav writes:
> Mike Haertel writes:
>> For example if your regex is /foo.*bar/, the initial Boyer-Moore search
>> is (probably) searching for foo.
>>
>> If the initial search succeeds, GNU grep isolates the containing line,
>> and then runs the full regex matcher on that line to make sure.
>
> You don't really need to "isolate the containing line" unless you have
> an actual match, do you? There are two cases:

Theoretically no. However, suppose the pattern was /foo.*blah/. The Boyer-Moore search will be for "blah", since that's the longest fixed substring. But verifying a match for the full regexp either requires a regexp matcher with the feature "start here, at THIS point in the middle of the RE and THAT point in the middle of the buffer, and match backwards and forwards", or else running a more standard RE matcher starting from the beginning of the line.

So, in practice you pretty much have to at least search backwards for the preceding newline. As you mentioned, you can avoid searching forwards for the next newline if your RE matcher supports using newline as an exit marker. But if the workload characteristics are such that matching lines are scarce compared to the input, this is an optimization that just won't matter much either way.
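[The "search backwards for the preceding newline" step Mike describes can be sketched in a few lines. This is an illustration rather than GNU grep's actual code; memrchr() is a GNU/BSD extension, and a plain backwards loop works elsewhere:]

```c
#define _GNU_SOURCE     /* for memrchr() on glibc */
#include <string.h>

/* Given a match at offset 'pos' in buf[0..len), find the containing
 * line: scan backwards for the previous '\n' and forwards for the next
 * one.  Returns a pointer to the line start; *linelen excludes the
 * terminating newline.  Only runs when the fast search already hit. */
static const char *
containing_line(const char *buf, size_t len, size_t pos, size_t *linelen)
{
	const char *start, *end, *nl;

	nl = memrchr(buf, '\n', pos);             /* newline before match */
	start = (nl != NULL) ? nl + 1 : buf;
	end = memchr(buf + pos, '\n', len - pos); /* newline after match */
	if (end == NULL)
		end = buf + len;
	*linelen = (size_t)(end - start);
	return (start);
}
```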
Re: why GNU grep is fast
Mike Haertel writes:
> For example if your regex is /foo.*bar/, the initial Boyer-Moore search
> is (probably) searching for foo.
>
> If the initial search succeeds, GNU grep isolates the containing line,
> and then runs the full regex matcher on that line to make sure.

You don't really need to "isolate the containing line" unless you have an actual match, do you? There are two cases:

1) The regexp does not use any character classes, including /./, so the FSA will stop if it hits EOL before it reaches an accepting state.

2) The regexp uses character classes, and you rewrite them to exclude \n: /[^bar]/ becomes /[^bar\n]/, /./ becomes /[^\n]/, etc., and the FSA will stop if it hits EOL before it reaches an accepting state.

DES
--
Dag-Erling Smørgrav - d...@des.no
Re: why GNU grep is fast
"Sean C. Farley" writes: > Some algorithms: > 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm Aho-Corasick is not really a search algorithm, but an algorithm for constructing a table-driven finite state machine that will match either of the search strings you fed it. I believe it is less efficient than Boyer-Moore for small numbers of search terms, since it scans the entire input. I don't see the point in using it in grep, because grep already has an algorithm for constructing finite state machines: regcomp(3). > 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm It doesn't seem to compare favorably to the far older Aho-Corasick. It uses slightly less memory, but memory is usually not an issue with grep. > 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore > variant) Glimpse is a POS... and not really comparable, because grep is designed to search for a single search string in multiple texts, while glimpse is designed to search a large amount of text over and over with different search strings. I believe it uses suffix tables to construct its index, and Boyer-Moore only to locate specific matches, since the index lists only files, not exact positions. For anything other than fixed strings, it reverts to agrep, but I assume (I haven't looked at the code) that if the regexp has one or more fixed components, it uses those to narrow the search space before running agrep. DES -- Dag-Erling Smørgrav - d...@des.no ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: why GNU grep is fast
Dag-Erling Smørgrav writes:
> Mike Haertel writes:
>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first
>> for the final letter of the target string, and uses a lookup table to
>> tell it how far ahead it can skip in the input whenever it finds a
>> non-matching character.
>
> Boyer-Moore is for fixed search strings. I don't see how that
> optimization can work with a regexp search unless the regexp is so
> simple that you break it down into a small number of cases with known
> length and final character.

GNU grep uses heuristics to look for a fixed string that any string matching the regex *must* contain, and uses that fixed string as the basis for its initial Boyer-Moore search. For example, if your regex is /foo.*bar/, the initial Boyer-Moore search is (probably) searching for foo.

If the initial search succeeds, GNU grep isolates the containing line, and then runs the full regex matcher on that line to make sure.

This is the sort of thing that a good regex library could do internally. Unfortunately, you can't do this with a library that conforms to the !@#%$!@#% POSIX regex API. The problem is that regexec()'s interface is based on NUL-terminated strings, rather than byte-counted buffers. So POSIX regexec() is necessarily character-at-a-time, because it has to look for that input-terminating NUL byte, and you also can't use it to search binary data that might contain NULs. (GNU grep works fine with arbitrary input files, as long as it can allocate enough memory to hold the longest line.)

For these reasons a good grep implementation is pretty much doomed to bundle its own regex matcher.
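[The skip-table search quoted at the top can be illustrated with the Horspool simplification of Boyer-Moore, which keeps only the bad-character table. This is a sketch of the idea, not GNU grep's implementation:]

```c
#include <stddef.h>
#include <string.h>

/* Minimal Boyer-Moore-Horspool: a 256-entry table gives, for each byte
 * value, how far the search window may shift when that byte appears
 * under the window's last position.  Comparison starts from the final
 * letter of the pattern, so most input bytes are skipped, not read. */
static const char *
bmh_search(const char *hay, size_t hlen, const char *pat, size_t plen)
{
	size_t skip[256], i;

	if (plen == 0)
		return (hay);
	if (plen > hlen)
		return (NULL);
	for (i = 0; i < 256; i++)
		skip[i] = plen;             /* byte not in pattern: full shift */
	for (i = 0; i < plen - 1; i++)
		skip[(unsigned char)pat[i]] = plen - 1 - i;
	for (i = 0; i + plen <= hlen;
	    i += skip[(unsigned char)hay[i + plen - 1]]) {
		if (hay[i + plen - 1] == pat[plen - 1] &&
		    memcmp(hay + i, pat, plen - 1) == 0)
			return (hay + i);
	}
	return (NULL);
}
```

[Note how a mismatch on a byte that does not occur in the pattern at all advances the window by the full pattern length in one step.]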
Re: why GNU grep is fast
On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
> On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
>> Mike Haertel writes:
>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first
>>> for the final letter of the target string, and uses a lookup table to
>>> tell it how far ahead it can skip in the input whenever it finds a
>>> non-matching character.
>>
>> Boyer-Moore is for fixed search strings. I don't see how that
>> optimization can work with a regexp search unless the regexp is so
>> simple that you break it down into a small number of cases with known
>> length and final character.
>
> When I was working on making FreeGrep faster (years ago), I wrote down a
> few notes about possible algorithms, especially those that could be
> useful for fgrep functionality. I am just passing these on to the list.
>
> Some algorithms:
> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
> 3. GNU fgrep: Commentz-Walter
> 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)
>
> Also, this may be a useful book:
> http://www.dcc.uchile.cl/~gnavarro/FPMbook/

And of course, Russ Cox's excellent series of articles starting at:
http://swtch.com/~rsc/regexp/regexp1.html

Later on, he summarizes some of the existing implementations, including comments about the Plan 9 implementation and his own RE2, both of which efficiently handle international text (which seems to be a major concern of Gabor's).

The key comment in Mike's GNU grep notes is the one about not breaking the input into lines. Line-splitting is simply double-scanning the input; instead, run the matcher over blocks of text and, when it finds a match, work backwards from the match to find the appropriate line beginning. This is efficient because most lines don't match.
Boyer-Moore is great for fixed strings (a very common use case for grep) and for more complex patterns that contain long fixed strings (it helps to discard most lines early). Sophisticated regex matchers implement a number of strategies and choose different ones depending on the pattern.

In the case of bsdgrep, it might make sense to use the regex library for the general case but implement a hand-tuned search for fixed strings that can be heavily optimized for that case. Of course, international text support complicates the picture; you have to consider the input character set (if you want to auto-detect Unicode encodings by looking for leading BOMs, for example, you either need to translate the fixed-string pattern to match the input encoding or vice versa).

Tim
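[The strategy-selection idea can be sketched as a toy dispatcher: an ERE containing no metacharacters matches only itself, so a grep could route it to a fixed-string searcher (memmem/Boyer-Moore) and skip the regex engine entirely. is_literal_pattern() is a hypothetical helper; real matchers go further and extract required substrings from non-literal patterns too:]

```c
#include <stdbool.h>
#include <string.h>

/* Toy strategy check: if the pattern contains none of the ERE
 * metacharacters, it can only match itself verbatim, so a fast
 * fixed-string search path applies.  Anything else falls through to
 * the general regex engine.  (A sketch: it deliberately errs on the
 * side of "not literal" -- e.g. the escaped literal "\." is rejected.) */
static bool
is_literal_pattern(const char *pat)
{
	return (strpbrk(pat, ".[]()*+?{}|^$\\") == NULL);
}
```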
Re: why GNU grep is fast
In article <20100822163644.gu2...@hoeg.nl> you write:
> I think that implementing a simple fgrep boils down to mmap()ing a file
> and calling memmem() on the mapping to search for the input string. Of
> course this relies on having an efficient memmem() implementation, for
> example using one of the algorithms mentioned in this thread.

It's actually more complicated than that, because you have to ensure that you are not matching the middle of a multibyte character when the current locale specifies a character set with a multibyte encoding. Likewise when searching for the newlines that delimit the matched line. (I'm not sure whether FreeBSD supports any character encodings that would be ambiguous in that way.) I don't think this was considered an issue when Mike Haertel was developing GNU grep.

It seems reasonable to implement BMG or some other fast search in memmem(). Note that if you can't (or don't want to) mmap the whole file at once, you'll need special handling for the boundary conditions -- both at the string-search level and at the search for line delimiters. This is much easier in the fgrep case, obviously, since the length of the query puts a finite upper bound on the amount of the old buffer you need to keep -- with regexps you really need your regexp engine to be able to report its matching state, or else limit your input to strictly conforming POSIX text files (i.e., line lengths limited to {LINE_MAX}).

-GAWollman
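[The finite carry-over for the fixed-string case might look like the sketch below: retain the last plen-1 bytes of each block so a match straddling a block boundary is still found. The struct and buffer sizes are assumptions made for illustration; memmem() is a GNU/BSD extension:]

```c
#define _GNU_SOURCE     /* for memmem() on glibc */
#include <string.h>

/* Streamed fixed-string search: the caller feeds arbitrary chunks, and
 * the last plen-1 bytes are retained in 'carry' so a match straddling
 * a chunk boundary is not missed.  Since a full match needs plen bytes
 * and the carry holds at most plen-1, no match is counted twice. */
struct stream_search {
	const char *pat;
	size_t plen;            /* sketch assumes 0 < plen < 64 */
	char carry[64];
	size_t clen;
	size_t count;           /* occurrences seen so far */
};

static void
feed(struct stream_search *s, const char *chunk, size_t n)
{
	char buf[8192];         /* sketch assumes carry + chunk fits */
	const char *p, *end;
	size_t len;

	memcpy(buf, s->carry, s->clen);
	memcpy(buf + s->clen, chunk, n);
	len = s->clen + n;

	for (p = buf, end = buf + len;
	    (p = memmem(p, (size_t)(end - p), s->pat, s->plen)) != NULL; p++)
		s->count++;

	s->clen = (len < s->plen - 1) ? len : s->plen - 1;
	memcpy(s->carry, buf + (len - s->clen), s->clen);
}
```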
Re: why GNU grep is fast
On Aug 22, 2010, at 8:02 AM, Garrett Wollman wrote:
> In article <86k4nikglg@ds4.des.no> you write:
>> Mike Haertel writes:
>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first
>>> for the final letter of the target string, and uses a lookup table to
>>> tell it how far ahead it can skip in the input whenever it finds a
>>> non-matching character.
>>
>> Boyer-Moore is for fixed search strings. I don't see how that
>> optimization can work with a regexp search unless the regexp is so
>> simple that you break it down into a small number of cases with known
>> length and final character.
>
> The common case of regexps used in the "grep" utility (and, for obvious
> reasons, universal in the "fgrep" utility) is fixed-length search
> strings. Even non-fixed-length regexps typically consist of one or two
> variable-length parts.

This is an important point: a good grep implementation will use different strategies depending on the input regexp. Fixed-string matching is a very important special case.

> Matching a completely variable-length regexp is just hard,
> computationally,

See Russ Cox's articles for why this is not true. It does require considerable sophistication to build an efficient DFA, but the actual matcher, once built, can run very fast indeed:

http://swtch.com/~rsc/regexp/

Tim
Re: why GNU grep is fast
* Mike Haertel wrote:
> Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking
> for newlines would slow grep down by a factor of several times,
> because to find the newlines it would have to look at every byte!

I think that implementing a simple fgrep boils down to mmap()ing a file and calling memmem() on the mapping to search for the input string. Of course this relies on having an efficient memmem() implementation, for example using one of the algorithms mentioned in this thread.

--
Ed Schouten
WWW: http://80386.nl/
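[Ed's sketch, made slightly more concrete. The function names and the abbreviated error handling are assumptions made for illustration; memmem() is a GNU/BSD extension rather than standard C:]

```c
#define _GNU_SOURCE     /* for memmem() on glibc */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count non-overlapping occurrences of a fixed string in a
 * byte-counted buffer -- the memmem() half of the sketch. */
static size_t
count_fixed(const char *buf, size_t len, const char *pat)
{
	size_t plen = strlen(pat), n = 0;
	const char *p = buf, *end = buf + len;

	while ((p = memmem(p, (size_t)(end - p), pat, plen)) != NULL) {
		n++;
		p += plen;      /* advance past the match */
	}
	return (n);
}

/* The mmap() half: map the whole file read-only and search it in
 * place.  Error handling is abbreviated for the sketch. */
static size_t
count_in_file(const char *path, const char *pat)
{
	struct stat st;
	size_t n = 0;
	char *map;
	int fd;

	if ((fd = open(path, O_RDONLY)) < 0)
		return (0);
	if (fstat(fd, &st) == 0 && st.st_size > 0) {
		map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE,
		    fd, 0);
		if (map != MAP_FAILED) {
			n = count_fixed(map, (size_t)st.st_size, pat);
			munmap(map, (size_t)st.st_size);
		}
	}
	close(fd);
	return (n);
}
```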
Re: why GNU grep is fast
On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
> Mike Haertel writes:
>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first
>> for the final letter of the target string, and uses a lookup table to
>> tell it how far ahead it can skip in the input whenever it finds a
>> non-matching character.
>
> Boyer-Moore is for fixed search strings. I don't see how that
> optimization can work with a regexp search unless the regexp is so
> simple that you break it down into a small number of cases with known
> length and final character.

When I was working on making FreeGrep faster (years ago), I wrote down a few notes about possible algorithms, especially those that could be useful for fgrep functionality. I am just passing these on to the list.

Some algorithms:
1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
3. GNU fgrep: Commentz-Walter
4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)

Also, this may be a useful book:
http://www.dcc.uchile.cl/~gnavarro/FPMbook/

Sean
--
s...@freebsd.org
Re: why GNU grep is fast
2010/8/22 Dag-Erling Smørgrav : > Amen. The current bottleneck in BSD grep is the memchr() that looks for > '\n' in the input buffer. FYI, I actually have a rewritten memchr() which is faster than the current one here: http://people.freebsd.org/~delphij/for_review/memchr.c Review/comments welcome. I've done some preliminary validation/benchmarking on this but still need to compare it with some hand-optimized assembler implementations that I have seen and see if it's worthwhile. Cheers, -- Xin LI http://www.delphij.net
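Xin LI's patch isn't reproduced here, but a common way to speed up memchr() is word-at-a-time scanning with the classic zero-byte bit trick. A minimal sketch of that general technique (real libc versions add alignment handling and unrolling on top):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Word-at-a-time memchr() sketch (NOT Xin LI's actual code): examine one
 * machine word per iteration.  XORing the word with `c` repeated in every
 * byte turns matching bytes into zero bytes, which the expression
 * (w - ones) & ~w & high detects exactly. */
void *memchr_swar(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const unsigned char ch = (unsigned char)c;

    const uintptr_t ones = (uintptr_t)-1 / 0xff;  /* 0x0101...01 */
    const uintptr_t high = ones << 7;             /* 0x8080...80 */
    const uintptr_t rep  = ones * ch;             /* ch in every byte */

    while (n >= sizeof(uintptr_t)) {
        uintptr_t w;
        memcpy(&w, p, sizeof w);    /* sidesteps alignment/aliasing issues */
        w ^= rep;                   /* a matching byte becomes zero */
        if ((w - ones) & ~w & high) /* some byte of this word matches... */
            break;                  /* ...locate it byte by byte below */
        p += sizeof(uintptr_t);
        n -= sizeof(uintptr_t);
    }
    for (; n > 0; p++, n--)
        if (*p == ch)
            return (void *)p;
    return NULL;
}
```

The win over a naive byte loop is that the common no-match case costs a handful of instructions per 8 bytes instead of per byte, which matters precisely when memchr() is the bottleneck DES describes.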
Re: why GNU grep is fast
In article <86k4nikglg@ds4.des.no> you write: >Mike Haertel writes: >> GNU grep uses the well-known Boyer-Moore algorithm, which looks >> first for the final letter of the target string, and uses a lookup >> table to tell it how far ahead it can skip in the input whenever >> it finds a non-matching character. > >Boyer-Moore is for fixed search strings. I don't see how that >optimization can work with a regexp search unless the regexp is so >simple that you break it down into a small number of cases with known >length and final character. The common case of regexps used in the "grep" utility (and, for obvious reasons, universal in the "fgrep" utility) is fixed-length search strings. Even non-fixed-length regexps typically consist of one or two variable-length parts. Matching a completely variable-length regexp is just hard, computationally, so it's OK for it to be slower. There are other tricks you can do, such as turning the anchors ^ and $ into explicit newlines in your search -- "^foo" is a very common regexp to search for, and it's really a fixed-string search for "\nfoo" which is entirely amenable to the B-M treatment. You just have to remember that a matched newline isn't part of the result. The GNU regexp library also uses the Boyer-Moore (or is it Boyer-Moore-Gosper?) strategy. -GAWollman
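Wollman's "^foo" trick can be sketched like this, with memmem() standing in for the Boyer-Moore matcher. The helper is hypothetical; note the buffer-start edge case, since the first line has no preceding '\n':

```c
#define _GNU_SOURCE   /* for memmem() on glibc */
#include <stddef.h>
#include <string.h>

/* Find a line-anchored occurrence of `word` ("^word" in regexp terms) by
 * doing a fixed-string search for "\n" + word, remembering that the
 * matched '\n' is not part of the result.  Returns the offset where the
 * anchored word begins, or -1 if there is none. */
long find_anchored(const char *buf, size_t len, const char *word)
{
    size_t wlen = strlen(word);

    /* Special case: a match at the very start of the buffer. */
    if (len >= wlen && memcmp(buf, word, wlen) == 0)
        return 0;

    /* Build "\n" + word (this sketch assumes a short pattern). */
    char pat[256];
    if (wlen + 1 >= sizeof(pat))
        return -1;
    pat[0] = '\n';
    memcpy(pat + 1, word, wlen + 1);

    const char *hit = memmem(buf, len, pat, wlen + 1);
    return hit ? (long)(hit - buf) + 1 : -1;   /* +1 skips the '\n' */
}
```

The same idea applies to "$": append '\n' to the pattern and trim it from the reported match.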
Re: why GNU grep is fast
Mike Haertel writes: > GNU grep uses the well-known Boyer-Moore algorithm, which looks > first for the final letter of the target string, and uses a lookup > table to tell it how far ahead it can skip in the input whenever > it finds a non-matching character. Boyer-Moore is for fixed search strings. I don't see how that optimization can work with a regexp search unless the regexp is so simple that you break it down into a small number of cases with known length and final character. > GNU grep uses raw Unix input system calls and avoids copying data > after reading it. Yes, that was the first thing we looked at (and fixed) in BSD grep. > Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking > for newlines would slow grep down by a factor of several times, > because to find the newlines it would have to look at every byte! Amen. The current bottleneck in BSD grep is the memchr() that looks for '\n' in the input buffer. DES -- Dag-Erling Smørgrav - d...@des.no
Re: why GNU grep is fast
That's a good read for others as well as Mike, thanks for taking the time to pass on this knowledge :) - Original Message - From: "Mike Haertel" To: Anyway, just FYI, here's a quick summary of where GNU grep gets its speed. Hopefully you can carry these ideas over to BSD grep. #1 trick: GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE. #2 trick: GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it *does* look at. ...
why GNU grep is fast
Hi Gabor, I am the original author of GNU grep. I am also a FreeBSD user, although I live on -stable (and older) and rarely pay attention to -current. However, while searching the -current mailing list for an unrelated reason, I stumbled across some flamage regarding BSD grep vs GNU grep performance. You may have noticed that discussion too... Anyway, just FYI, here's a quick summary of where GNU grep gets its speed. Hopefully you can carry these ideas over to BSD grep. #1 trick: GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE. #2 trick: GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it *does* look at. GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character. GNU grep also unrolls the inner loop of Boyer-Moore, and sets up the Boyer-Moore delta table entries in such a way that it doesn't need to do the loop exit test at every unrolled step. The result of this is that, in the limit, GNU grep averages fewer than 3 x86 instructions executed for each input byte it actually looks at (and it skips many bytes entirely). See "Fast String Searching", by Andrew Hume and Daniel Sunday, in the November 1991 issue of Software Practice & Experience, for a good discussion of Boyer-Moore implementation tricks. It's available as a free PDF online. Once you have fast search, you'll find you also need fast input. GNU grep uses raw Unix input system calls and avoids copying data after reading it. Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte! 
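Mike's description maps closely onto the simplified Horspool variant of Boyer-Moore; here is a sketch of the skip-table idea (GNU grep's real matcher also unrolls the inner loop and arranges the delta table to skip the loop-exit test, which is omitted here):

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool sketch: look at the input byte aligned with the
 * END of the pattern and advance by delta[byte].  A byte that does not
 * occur in the pattern lets us skip a full pattern length without ever
 * examining the bytes in between -- this is how grep avoids looking at
 * every input byte.  Returns offset of first match, or -1. */
long bmh_search(const unsigned char *text, size_t n,
                const unsigned char *pat, size_t m)
{
    if (m == 0 || m > n)
        return m == 0 ? 0 : -1;

    size_t delta[256];
    for (size_t i = 0; i < 256; i++)
        delta[i] = m;                 /* byte absent from pattern: skip m */
    for (size_t i = 0; i + 1 < m; i++)
        delta[pat[i]] = m - 1 - i;    /* distance from rightmost occurrence */

    size_t pos = 0;
    while (pos + m <= n) {
        if (text[pos + m - 1] == pat[m - 1] &&
            memcmp(text + pos, pat, m) == 0)
            return (long)pos;
        pos += delta[text[pos + m - 1]];
    }
    return -1;
}
```

For a pattern like "123456789abcdef" on typical text, almost every probe hits a byte not in the pattern, so the loop leaps 15 bytes at a time: the "fewer than 3 instructions per byte looked at" regime Mike describes.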
So instead of using line-oriented input, GNU grep reads raw data into a large buffer, searches the buffer using Boyer-Moore, and only when it finds a match does it go and look for the bounding newlines. (Certain command line options like -n disable this optimization.) Finally, when I was last the maintainer of GNU grep (15+ years ago...), GNU grep also tried very hard to set things up so that the *kernel* could ALSO avoid handling every byte of the input, by using mmap() instead of read() for file input. At the time, using read() caused most Unix versions to do extra copying. Since GNU grep passed out of my hands, it appears that use of mmap became non-default, but you can still get it via --mmap. And at least in cases where the data is already in the file system buffer cache, mmap is still faster: $ time sh -c 'find . -type f -print | xargs grep -l 123456789abcdef' real 0m1.530s user 0m0.230s sys 0m1.357s $ time sh -c 'find . -type f -print | xargs grep --mmap -l 123456789abcdef' real 0m1.201s user 0m0.330s sys 0m0.929s [workload was a 648 megabyte MH mail folder containing 41000 messages] So even nowadays, using --mmap can be worth a >20% speedup. Summary: - Use Boyer-Moore (and unroll its inner loop a few times). - Roll your own unbuffered input using raw system calls. Avoid copying the input bytes before searching them. (Do, however, use buffered *output*. The normal grep scenario is that the amount of output is small compared to the amount of input, so the overhead of output buffer copying is small, while savings due to avoiding many small unbuffered writes can be large.) - Don't look for newlines in the input until after you've found a match. - Try to set things up (page-aligned buffers, page-sized read chunks, optionally use mmap) so the kernel can ALSO avoid copying the bytes. The key to making programs fast is to make them do practically nothing.
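The search-first, find-newlines-later strategy can be sketched as below. The helper is hypothetical, and memmem() (a BSD/GNU extension) stands in for the Boyer-Moore matcher:

```c
#define _GNU_SOURCE   /* for memmem() on glibc */
#include <stddef.h>
#include <string.h>

/* Search the whole buffer first; only after a hit scan outward for the
 * bounding newlines of the matching line.  On a miss, no byte of the
 * buffer is ever examined for '\n'.  Returns a pointer to the start of
 * the matching line and its length, or NULL if there is no match. */
const char *matching_line(const char *buf, size_t len, const char *pat,
                          size_t *line_len)
{
    const char *hit = memmem(buf, len, pat, strlen(pat));
    if (hit == NULL)
        return NULL;

    /* Scan backward for the preceding '\n' (or the buffer start)... */
    const char *start = hit;
    while (start > buf && start[-1] != '\n')
        start--;

    /* ...and forward for the terminating '\n' (or the buffer end). */
    const char *end = memchr(hit, '\n', len - (size_t)(hit - buf));
    if (end == NULL)
        end = buf + len;

    *line_len = (size_t)(end - start);
    return start;
}
```

As Mike notes, options like -n break this scheme, since line numbering forces the program back to counting every newline.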
;-) Regards, Mike