Re: What to learn from the BSD grep case [Was: why GNU grep is fast]

2010-08-23 Thread C. P. Ghost
On Mon, Aug 23, 2010 at 5:04 PM, Gabor Kovesdan  wrote:
> 4, We really need a good regex library. From the comments, it seems there's
> no such library in the open source world. GNU libregex isn't efficient because GNU
> grep uses those workarounds that Mike kindly pointed out. Oniguruma was
> extremely slow when I checked it. PCRE supports Perl-style syntax with a
> POSIX-like API but not POSIX regex. Google RE2 is the same with additional
> egrep syntax but doesn't have support for standard POSIX regexes. Plan 9
> regex only supports egrep syntax. It seems that TRE is the best choice. It
> is BSD-licensed, supports wchar and POSIX(ish) regexes and it is quite fast.

I know it's C++ and not exactly what you're needing, but have you evaluated
Boost::Regex? Isn't there some code that can be retrofitted into a C lib?

http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/index.html

> I don't know the theoretical background of regex engines but I'm wondering
> if it's possible to provide an alternative API with byte-counted buffers
> and use the heuristic speedup with fixed string matching. As Mike pointed
> out the POSIX API is quite limiting because it works on NUL-terminated
> strings and not on byte-counted buffers, so we couldn't just do it with a
> POSIX-conformant library but it would be nice if we could implement it in
> such a library with an alternative interface.
>
> Gabor

-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: What to learn from the BSD grep case [Was: why GNU grep is fast]

2010-08-23 Thread Scott Long

On Aug 23, 2010, at 9:04 AM, Gabor Kovesdan wrote:

> Hi all,
> 
> there are some lessons we can draw from the grep case. Here I'd like
> to add a summary, which raises some questions. All comments are welcome.
> 
> 1, When grep entered -CURRENT and bugs were found I immediately got kind bug 
> reports and sharp criticism, as well. According to my understanding, -CURRENT 
> is for development and it's fine to expose new pieces of work there but now 
> I'm in doubt about that because of complaining people. On the other hand, an 
> earlier version of BSD grep has been in the ports tree for a very long time 
> and users reported some problems, which have been fixed, but there are still a
> lot of bugs that weren't reported at the time. If users don't test new pieces
> of code on a volunteer basis, somehow we have to make them test it, so I think
> committing BSD grep to -CURRENT was a good
> decision in the first round.

You did everything right.  You were responsive, you were open to suggestions, 
and you got the code in.  Even more importantly, you got the code in a year 
before 9.0, instead of waiting until the last minute, months from now, and 
creating a dilemma for the release engineers.  Software is an iterative process 
of feedback and improvement.  The way that you've handled this should be a 
model for the project.

Scott



What to learn from the BSD grep case [Was: why GNU grep is fast]

2010-08-23 Thread Gabor Kovesdan

 Hi all,

there are some lessons we can draw from the grep case. Here I'd
like to add a summary, which raises some questions. All comments are 
welcome.


1, When grep entered -CURRENT and bugs were found, I immediately got kind 
bug reports as well as sharp criticism. According to my understanding, 
-CURRENT is for development and it's fine to expose new pieces of work 
there, but now I'm in doubt about that because of the complaints. On 
the other hand, an earlier version of BSD grep has been in the ports 
tree for a very long time and users reported some problems, which have 
been fixed, but there are still a lot of bugs that weren't reported at 
the time. If users don't test new pieces of code on a volunteer basis, 
somehow we have to make them test it, so I think 
committing BSD grep to -CURRENT was a good decision in the first round.


2, This issue also brought up some bottlenecks and potential 
optimization points (like memchr() and mmap), which other software may 
benefit from. This is another reason to let such pieces of work in. But 
unfortunately, it also means that no one had profiled other utilities, 
because these bottlenecks remained undiscovered. Neither had I. It's a lesson 
that we have to learn from this particular case.


3, Because of point 2, we need more content in the developers-handbook to 
support development with such ideas and best practices. It has also been 
raised on another list that our end-user documentation isn't as shiny 
and cool as it used to be, and the developers-handbook has never 
been "finished" to the point of being more or less complete. If someone 
looks at it, it looks like a sketch, not a book. I'll see if I can write a section on 
profiling.


4, We really need a good regex library. From the comments, it seems 
there's no such library in the open source world. GNU libregex isn't efficient 
because GNU grep uses those workarounds that Mike kindly pointed out. 
Oniguruma was extremely slow when I checked it. PCRE supports Perl-style 
syntax with a POSIX-like API but not POSIX regex. Google RE2 is the same 
with additional egrep syntax but doesn't have support for standard POSIX 
regexes. Plan 9 regex only supports egrep syntax. It seems that TRE is 
the best choice. It is BSD-licensed, supports wchar and POSIX(ish) 
regexes and it is quite fast. I don't know the theoretical background of 
regex engines but I'm wondering if it's possible to provide an 
alternative API with byte-counted buffers and use the heuristic 
speedup with fixed string matching. As Mike pointed out the POSIX API is 
quite limiting because it works on NUL-terminated strings and not on 
byte-counted buffers, so we couldn't just do it with a POSIX-conformant 
library but it would be nice if we could implement it in such a library 
with an alternative interface.


Gabor


Re: why GNU grep is fast

2010-08-23 Thread Gabor Kovesdan




Later on, he summarizes some of the existing implementations, 
including comments about the Plan 9 implementation and his own RE2, 
both of which efficiently handle international text (which seems to 
be a major concern of Gabor's).


I believe Gabor is considering TRE for a good replacement regex library.
Yes. Oniguruma is slow; Google RE2 only supports Perl and fgrep syntax 
but not standard regexes; and the Plan 9 implementation, IIRC, only supports 
fgrep syntax and Unicode but not wchar_t in general.


The key comment in Mike's GNU grep notes is the one about not 
breaking into lines.  That's simply double-scanning the input; 
instead, run the matcher over blocks of text and, when it finds a 
match, work backwards from the match to find the appropriate line 
beginning.  This is efficient because most lines don't match.


I do like the idea.

So do I.


BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a 
faster search bypassing the regex library.  :)  It certainly was not 
written with any encodings in mind; it was purely ASCII.  As I have 
not kept up with it, I do not know if anyone improved it or not.



It has been made wchar-compliant.

Gabor


Re: why GNU grep is fast

2010-08-23 Thread Gabor Kovesdan

 On 2010.08.21 4:31, Mike Haertel wrote:

Hi Gabor,

I am the original author of GNU grep.  I am also a FreeBSD user,
although I live on -stable (and older) and rarely pay attention
to -current.

However, while searching the -current mailing list for an unrelated
reason, I stumbled across some flamage regarding BSD grep vs GNU grep
performance.  You may have noticed that discussion too...

Anyway, just FYI, here's a quick summary of where GNU grep gets
its speed.  Hopefully you can carry these ideas over to BSD grep.
Thanks, Mike, I'll check how I can adapt these ideas to BSD grep. I 
think your kind help was a nice gesture; the open source world 
should be based on such cooperation instead of meaningless competition.


Gabor


Re: why GNU grep is fast

2010-08-23 Thread Dag-Erling Smørgrav
Mike Haertel  writes:
> Dag-Erling Smørgrav  writes:
> > You don't really need to "isolate the containing line" unless you have
> > an actual match, do you?
> Theoretically no.  However, suppose the pattern was /foo.*blah/.
> The Boyer-Moore search will be for "blah", since that's the longest
> fixed substring.  But verifying a match for the full regexp either
> requires a regexp matcher with the feature "start here, at THIS point
> in the middle of the RE and THAT point in the middle of the buffer,
> and match backwards and forwards", or else running a more standard
> RE matcher starting from the beginning of the line.

Stupidly enough, I only considered the case where the fixed search
string is at the end of the regexp :)  For the general case, you'd have
to either build a second FSA that mirrors the first, or design your
engine and data structures in such a way that the FSA can be traversed
in either direction.  Then you note which states correspond to the
beginning and end of the fixed substring, and run the FSA forward from
the end and backward from the beginning.  That could prove advantageous
when the input consists of very long lines and the frequency of partial
matches is relatively high; large files with very long lines are very
common in bioinformatics.

> As you mentioned, you can avoid searching forwards for the next
> newline if your RE matcher supports using newline as an exit marker.

Umm, I don't understand why it needs to support using *anything* as an
exit marker.  The newline character will simply not match any of the
transitions from whichever state the FSA is currently in, and neither
will the null character.  The only bit that needs to know about them is
the bit that handles end-of-line and end-of-word anchors.

DES
-- 
Dag-Erling Smørgrav - d...@des.no


Re: why GNU grep is fast

2010-08-23 Thread Dag-Erling Smørgrav
"Sean C. Farley"  writes:
> Dag-Erling Smørgrav  writes:
> > Aho-Corasick is not really a search algorithm, but an algorithm for
> > constructing a table-driven finite state machine that will match
> > either of the search strings you fed it.  I believe it is less
> > efficient than Boyer-Moore for small numbers of search terms, since
> > it scans the entire input.  I don't see the point in using it in
> > grep, because grep already has an algorithm for constructing finite
> > state machines: regcomp(3).
> especially those that could be useful for fgrep functionality

Not entirely sure what you mean by that, but in most cases, I think
Boyer-Moore is a better fit for fgrep.  For large numbers of search
terms, I might use Aho-Corasick...  if it weren't for the fact that, as
mentioned, grep already has a regexp engine, which is a superset of
Aho-Corasick.  It might be a tad slower at building the FSA, because
it's more general, and the FSA might occupy a tad more memory, but the
complexity is *exactly* the same, and adding Aho-Corasick just for fgrep
is, for lack of a better word, pedantry.  For all you know, the
increased code size could very well offset whatever performance
advantage Aho-Corasick might offer.

> Glimpse may be a POS; I have not used it personally.  I only noted its
> algorithm for possible use within fgrep.

Apples and oranges.  The problem glimpse tries to solve (fixed corpus,
variable search terms) is precisely the opposite of the one fgrep tries
to solve (fixed search terms, variable corpus).

Glimpse does include grep-like functionality, but in the form of agrep,
which is designed for approximate matching.

DES
-- 
Dag-Erling Smørgrav - d...@des.no


Re: why GNU grep is fast

2010-08-22 Thread Sean C. Farley

On Sun, 22 Aug 2010, Tim Kientzle wrote:


On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:

On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:

Mike Haertel  writes:
GNU grep uses the well-known Boyer-Moore algorithm, which looks 
first for the final letter of the target string, and uses a lookup 
table to tell it how far ahead it can skip in the input whenever it 
finds a non-matching character.


Boyer-Moore is for fixed search strings.  I don't see how that 
optimization can work with a regexp search unless the regexp is so 
simple that you break it down into a small number of cases with 
known length and final character.


When I was working on making FreeGrep faster (years ago), I wrote 
down a few notes about possible algorithms, especially those that 
could be useful for fgrep functionality.  I am just passing these 
onto the list.


Some algorithms:
1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
3. GNU fgrep:  Commentz-Walter
4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)

Also, this may be a useful book:
http://www.dcc.uchile.cl/~gnavarro/FPMbook/


And of course, Russ Cox' excellent series of articles starting at:

 http://swtch.com/~rsc/regexp/regexp1.html


I saved that link from an E-mail earlier because it looked very 
interesting.


Later on, he summarizes some of the existing implementations, 
including comments about the Plan 9 implementation and his own RE2, 
both of which efficiently handle international text (which seems to be 
a major concern of Gabor's).


I believe Gabor is considering TRE for a good replacement regex library.

The key comment in Mike's GNU grep notes is the one about not breaking 
into lines.  That's simply double-scanning the input; instead, run the 
matcher over blocks of text and, when it finds a match, work backwards 
from the match to find the appropriate line beginning.  This is 
efficient because most lines don't match.


I do like the idea.

Boyer-Moore is great for fixed strings (a very common use case for 
grep) and for more complex patterns that contain long fixed strings 
(helps to discard most lines early).  Sophisticated regex matchers 
implement a number of strategies and choose different ones depending 
on the pattern.


That is what fastgrep (in bsdgrep) attempts to accomplish with very 
simple regexes (beginning of line, end of line, and dot).


In the case of bsdgrep, it might make sense to use the regex library 
for the general case but implement a hand-tuned search for fixed 
strings that can be heavily optimized for that case.  Of course, 
international text support complicates the picture; you have to 
consider the input character set (if you want to auto-detect Unicode 
encodings by looking for leading BOMs, for example, you either need to 
translate the fixed-string pattern to match the input encoding or 
vice-versa).


BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a 
faster search bypassing the regex library.  :)  It certainly was not 
written with any encodings in mind; it was purely ASCII.  As I have not 
kept up with it, I do not know if anyone improved it or not.


Sean
--
s...@freebsd.org

Re: why GNU grep is fast

2010-08-22 Thread Sean C. Farley

On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:


"Sean C. Farley"  writes:

Some algorithms:
1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm


Aho-Corasick is not really a search algorithm, but an algorithm for 
constructing a table-driven finite state machine that will match 
either of the search strings you fed it.  I believe it is less 
efficient than Boyer-Moore for small numbers of search terms, since it 
scans the entire input.  I don't see the point in using it in grep, 
because grep already has an algorithm for constructing finite state 
machines: regcomp(3).


especially those that could be useful for fgrep functionality

I was mainly talking about algorithms useful for the fgrep portion 
within FreeGrep.  fgrep would run (still runs?) over the same text for 
each pattern.


Therefore, Aho–Corasick had to be mentioned for the reason referenced 
within the link:

The Aho–Corasick string matching algorithm formed the basis of the
original Unix command fgrep.


2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm


It doesn't seem to compare favorably to the far older Aho-Corasick. 
It uses slightly less memory, but memory is usually not an issue with 
grep.


I agree, yet I like to keep alternative algorithms in mind in case a 
variant would be useful.



4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore
variant)


Glimpse is a POS...  and not really comparable, because grep is 
designed to search for a single search string in multiple texts, while 
glimpse is designed to search a large amount of text over and over 
with different search strings.  I believe it uses suffix tables to 
construct its index, and Boyer-Moore only to locate specific matches, 
since the index lists only files, not exact positions.  For anything 
other than fixed strings, it reverts to agrep, but I assume (I haven't 
looked at the code) that if the regexp has one or more fixed 
components, it uses those to narrow the search space before running 
agrep.


Glimpse may be a POS; I have not used it personally.  I only noted its 
algorithm for possible use within fgrep.


Of course, there may be much better algorithms out there to boost 
fgrep's speed, but these are what I had found at one time.


Sean
--
s...@freebsd.org

Re: why GNU grep is fast

2010-08-22 Thread Mike Haertel
Dag-Erling Smørgrav  writes:
> Mike Haertel  writes:
> > For example if your regex is /foo.*bar/, the initial Boyer-Moore search
> > is (probably) searching for foo.
> >
> > If the initial search succeeds, GNU grep isolates the containing line,
> > and then runs the full regex matcher on that line to make sure.
> 
> You don't really need to "isolate the containing line" unless you have
> an actual match, do you?  There are two cases:

Theoretically no.  However, suppose the pattern was /foo.*blah/.
The Boyer-Moore search will be for "blah", since that's the longest
fixed substring.  But verifying a match for the full regexp either
requires a regexp matcher with the feature "start here, at THIS point
in the middle of the RE and THAT point in the middle of the buffer,
and match backwards and forwards", or else running a more standard
RE matcher starting from the beginning of the line.

So, in practice you pretty much have to at least search backwards
for the preceding newline.

As you mentioned, you can avoid searching forwards for the next
newline if your RE matcher supports using newline as an exit marker.
But if the workload characteristics are that matching lines are
scarce compared to the input, this is an optimization that just
won't matter much either way.



Re: why GNU grep is fast

2010-08-22 Thread Dag-Erling Smørgrav
Mike Haertel  writes:
> For example if your regex is /foo.*bar/, the initial Boyer-Moore search
> is (probably) searching for foo.
>
> If the initial search succeeds, GNU grep isolates the containing line,
> and then runs the full regex matcher on that line to make sure.

You don't really need to "isolate the containing line" unless you have
an actual match, do you?  There are two cases:

 1) The regexp does not use any character classes, including /./, so the
FSA will stop if it hits EOL before it reaches an accepting state.

 2) The regexp uses character classes, and you rewrite them to exclude
\n: /[^bar]/ becomes /[^bar\n]/, /./ becomes /[^\n]/, etc., and the
FSA will stop if it hits EOL before it reaches an accepting state.

DES
-- 
Dag-Erling Smørgrav - d...@des.no


Re: why GNU grep is fast

2010-08-22 Thread Dag-Erling Smørgrav
"Sean C. Farley"  writes:
> Some algorithms:
> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm

Aho-Corasick is not really a search algorithm, but an algorithm for
constructing a table-driven finite state machine that will match either
of the search strings you fed it.  I believe it is less efficient than
Boyer-Moore for small numbers of search terms, since it scans the entire
input.  I don't see the point in using it in grep, because grep already
has an algorithm for constructing finite state machines: regcomp(3).

> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm

It doesn't seem to compare favorably to the far older Aho-Corasick.  It
uses slightly less memory, but memory is usually not an issue with grep.

> 4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore
> variant)

Glimpse is a POS...  and not really comparable, because grep is designed
to search for a single search string in multiple texts, while glimpse is
designed to search a large amount of text over and over with different
search strings.  I believe it uses suffix tables to construct its index,
and Boyer-Moore only to locate specific matches, since the index lists
only files, not exact positions.  For anything other than fixed strings,
it reverts to agrep, but I assume (I haven't looked at the code) that if
the regexp has one or more fixed components, it uses those to narrow the
search space before running agrep.

DES
-- 
Dag-Erling Smørgrav - d...@des.no


Re: why GNU grep is fast

2010-08-22 Thread Mike Haertel
Dag-Erling Smørgrav  writes:
> Mike Haertel  writes:
> > GNU grep uses the well-known Boyer-Moore algorithm, which looks
> > first for the final letter of the target string, and uses a lookup
> > table to tell it how far ahead it can skip in the input whenever
> > it finds a non-matching character.
> 
> Boyer-Moore is for fixed search strings.  I don't see how that
> optimization can work with a regexp search unless the regexp is so
> simple that you break it down into a small number of cases with known
> length and final character.

GNU grep uses heuristics to look for a fixed string that any string
matching the regex *must* contain, and uses that fixed string as the
bases for its initial Boyer-Moore search.

For example if your regex is /foo.*bar/, the initial Boyer-Moore search
is (probably) searching for foo.

If the initial search succeeds, GNU grep isolates the containing line,
and then runs the full regex matcher on that line to make sure.

This is the sort of thing that a good regex library could do internally.

Unfortunately, you can't do this with a library that conforms to
the !...@#%$!@#% POSIX regex API.

The problem is that regexec()'s interface is based on NUL-terminated
strings, rather than byte-counted buffers.  So POSIX regexec() is
necessarily character-at-a-time, because it has to look for that
input-terminating NUL byte, and also you can't use it to search binary
data that might contain NULs.  (GNU grep works fine with arbitrary
input files, as long as it can allocate enough memory to hold the
longest line.)

For these reasons a good grep implementation is pretty much doomed
to bundle its own regex matcher.
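
To illustrate the kind of skip-table search Mike describes, here is a
sketch of a Boyer-Moore-Horspool scan over a byte-counted buffer (an
illustration of the technique only, not GNU grep's actual code):

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search over a byte-counted buffer.
 * Returns a pointer to the first occurrence of needle, or NULL.
 * No NUL terminators involved, so it works on binary data too. */
static const char *
bmh_search(const char *hay, size_t haylen, const char *needle, size_t nlen)
{
    size_t skip[256];
    size_t i;

    if (nlen == 0 || nlen > haylen)
        return (nlen == 0) ? hay : NULL;

    /* Default skip distance: the whole pattern length. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* Characters occurring in the pattern (except its last byte)
     * allow only smaller skips. */
    for (i = 0; i < nlen - 1; i++)
        skip[(unsigned char)needle[i]] = nlen - 1 - i;

    for (i = 0; i + nlen <= haylen; ) {
        if (memcmp(hay + i, needle, nlen) == 0)
            return hay + i;
        /* Look at the byte aligned with the final pattern letter and
         * skip as far as the table allows. */
        i += skip[(unsigned char)hay[i + nlen - 1]];
    }
    return NULL;
}
```

For /foo.*bar/, such a scan would be run with "foo" (or "bar") as the
needle, and the full regex matcher invoked only around candidate hits.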




Re: why GNU grep is fast

2010-08-22 Thread Tim Kientzle
On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
> On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
>> Mike Haertel  writes:
>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first for 
>>> the final letter of the target string, and uses a lookup table to tell it 
>>> how far ahead it can skip in the input whenever it finds a non-matching 
>>> character.
>> 
>> Boyer-Moore is for fixed search strings.  I don't see how that optimization 
>> can work with a regexp search unless the regexp is so simple that you break 
>> it down into a small number of cases with known length and final character.
> 
> When I was working on making FreeGrep faster (years ago), I wrote down a few 
> notes about possible algorithms, especially those that could be useful for 
> fgrep functionality.  I am just passing these onto the list.
> 
> Some algorithms:
> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
> 3. GNU fgrep:  Commentz-Walter
> 4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)
> 
> Also, this may be a useful book:
> http://www.dcc.uchile.cl/~gnavarro/FPMbook/

And of course, Russ Cox' excellent series of articles starting at:

  http://swtch.com/~rsc/regexp/regexp1.html

Later on, he summarizes some of the existing implementations,
including comments about the Plan 9 implementation and his own RE2, both
of which efficiently handle international text (which seems to
be a major concern of Gabor's).

The key comment in Mike's GNU grep notes is the one about not breaking
into lines.  That's simply double-scanning the input; instead, run
the matcher over blocks of text and, when it finds a match, work
backwards from the match to find the appropriate line beginning.
This is efficient because most lines don't match.
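
The backward scan described above can be sketched like this (names are
illustrative, not from any grep source):

```c
#include <stddef.h>
#include <string.h>

/* Given a match at 'match' inside buf[0..len), locate the enclosing
 * line. Only matched regions pay the cost of finding newlines; the
 * bulk of the input is never broken into lines at all. */
static void
enclosing_line(const char *buf, size_t len, const char *match,
    const char **line_start, const char **line_end)
{
    const char *p;

    /* Walk backwards from the match to the previous '\n' (or buffer start). */
    p = match;
    while (p > buf && p[-1] != '\n')
        p--;
    *line_start = p;

    /* Walk forwards to the next '\n' (or buffer end). */
    p = memchr(match, '\n', len - (size_t)(match - buf));
    *line_end = (p != NULL) ? p : buf + len;
}
```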

Boyer-Moore is great for fixed strings (a very common use case
for grep) and for more complex patterns that contain long fixed
strings (helps to discard most lines early).  Sophisticated regex
matchers implement a number of strategies and choose different
ones depending on the pattern.

In the case of bsdgrep, it might make sense to use the regex library
for the general case but implement a hand-tuned search for
fixed strings that can be heavily optimized for that case.
Of course, international text support complicates the picture;
you have to consider the input character set (if you want to auto-detect
Unicode encodings by looking for leading BOMs, for example, you
either need to translate the fixed-string pattern to match the input
encoding or vice-versa).

Tim



Re: why GNU grep is fast

2010-08-22 Thread Garrett Wollman
In article <20100822163644.gu2...@hoeg.nl> you write:

>I think that implementing a simple fgrep boils down to mmap()ing a file
>and calling memmem() on the mapping to search for the input string. Of
>course this relies on having an efficient memmem() implementation, for
>example using one of the algorithms mentioned in this thread.

It's actually more complicated than that, because you have to ensure
that you are not matching the middle of a multibyte character, when
the current locale specifies a character set with a multibyte
encoding.  Likewise when searching for the newlines that delimit the
matched line.  (I'm not sure whether FreeBSD supports any character
encodings that would be ambiguous in that way.)  I don't think this
was considered an issue when Mike Haertel was developing GNU grep.

It seems reasonable to implement BMG or some other fast search in
memmem().  Note that if you can't (or don't want to) mmap the whole
file at once, you'll need special handling for the boundary conditions
-- both at the string search level and at the search for line
delimiters.  This is much easier in the fgrep case, obviously, since
the length of the query puts a finite upper bound on the amount of the
old buffer you need to keep -- with regexps you really need your
regexp engine to be able to report its matching state, or else limit
your input to strictly conforming POSIX text files (i.e., line lengths
limited to {LINE_MAX}).

-GAWollman



Re: why GNU grep is fast

2010-08-22 Thread Tim Kientzle
On Aug 22, 2010, at 8:02 AM, Garrett Wollman wrote:
> In article <86k4nikglg@ds4.des.no> you write:
>> Mike Haertel  writes:
>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks
>>> first for the final letter of the target string, and uses a lookup
>>> table to tell it how far ahead it can skip in the input whenever
>>> it finds a non-matching character.
>> 
>> Boyer-Moore is for fixed search strings.  I don't see how that
>> optimization can work with a regexp search unless the regexp is so
>> simple that you break it down into a small number of cases with known
>> length and final character.
> 
> The common case of regexps used in the "grep" utility (and, for
> obvious reasons, universal in the "fgrep" utility) is fixed-length
> search strings.  Even non-fixed-length regexps typically consist of
> one or two variable-length parts.

This is an important point:  A good grep implementation will
use different strategies depending on the input regexp.
Fixed-string matching is a very important special case.

>  Matching a completely
> variable-length regexp is just hard, computationally, 

See Russ Cox' articles for why this is not true.  It does require
considerable sophistication to build an efficient DFA but the
actual matcher, once built, can run very fast indeed:

  http://swtch.com/~rsc/regexp/

Tim



Re: why GNU grep is fast

2010-08-22 Thread Ed Schouten
* Mike Haertel  wrote:
> Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES.  Looking
> for newlines would slow grep down by a factor of several times,
> because to find the newlines it would have to look at every byte!

I think that implementing a simple fgrep boils down to mmap()ing a file
and calling memmem() on the mapping to search for the input string. Of
course this relies on having an efficient memmem() implementation, for
example using one of the algorithms mentioned in this thread.
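
Spelled out, that sketch is roughly the following (a toy core only:
error reporting and line output are elided, and memmem() is a GNU/BSD
extension rather than POSIX):

```c
#define _GNU_SOURCE  /* for memmem() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Toy fgrep core: mmap() the file and let memmem() do the scanning.
 * Returns 1 if the file contains needle, 0 otherwise. */
static int
file_contains(const char *path, const char *needle)
{
    struct stat st;
    int fd, found = 0;
    char *map;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return 0;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return 0;
    }
    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map != MAP_FAILED) {
        found = memmem(map, (size_t)st.st_size,
            needle, strlen(needle)) != NULL;
        munmap(map, (size_t)st.st_size);
    }
    close(fd);
    return found;
}
```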

-- 
 Ed Schouten 
 WWW: http://80386.nl/




Re: why GNU grep is fast

2010-08-22 Thread Sean C. Farley

On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:

> Mike Haertel  writes:
>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first
>> for the final letter of the target string, and uses a lookup table to
>> tell it how far ahead it can skip in the input whenever it finds a
>> non-matching character.
>
> Boyer-Moore is for fixed search strings.  I don't see how that
> optimization can work with a regexp search unless the regexp is so
> simple that you break it down into a small number of cases with known
> length and final character.

When I was working on making FreeGrep faster (years ago), I wrote down a
few notes about possible algorithms, especially those that could be
useful for fgrep functionality.  I am just passing these on to the list.


Some algorithms:
1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
3. GNU fgrep:  Commentz-Walter
4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)

Also, this may be a useful book:
http://www.dcc.uchile.cl/~gnavarro/FPMbook/

Sean
--
s...@freebsd.org

Re: why GNU grep is fast

2010-08-22 Thread Xin LI
2010/8/22 Dag-Erling Smørgrav :
> Amen.  The current bottleneck in BSD grep is the memchr() that looks for
> '\n' in the input buffer.

FYI I actually have a rewritten memchr() which is faster than the
current one here:

http://people.freebsd.org/~delphij/for_review/memchr.c

Review/comments welcome.  I've done some preliminary
validation/benchmark on this but still need to compare it with some
hand optimized assembler implementations that I have seen and see if
it's worthy.

Cheers,
-- 
Xin LI  http://www.delphij.net


Re: why GNU grep is fast

2010-08-22 Thread Garrett Wollman
In article <86k4nikglg@ds4.des.no> you write:
>Mike Haertel  writes:
>> GNU grep uses the well-known Boyer-Moore algorithm, which looks
>> first for the final letter of the target string, and uses a lookup
>> table to tell it how far ahead it can skip in the input whenever
>> it finds a non-matching character.
>
>Boyer-Moore is for fixed search strings.  I don't see how that
>optimization can work with a regexp search unless the regexp is so
>simple that you break it down into a small number of cases with known
>length and final character.

The common case of regexps used in the "grep" utility (and, for
obvious reasons, universal in the "fgrep" utility) is fixed-length
search strings.  Even non-fixed-length regexps typically consist of
one or two variable-length parts.  Matching a completely
variable-length regexp is just hard, computationally, so it's OK for
it to be slower.  There are other tricks you can do, such as
turning the anchors ^ and $ into explicit newlines in your search --
"^foo" is a very common regexp to search for, and it's really a
fixed-string search for "\nfoo" which is entirely amenable to the B-M
treatment.  You just have to remember that a matched newline isn't
part of the result.

The GNU regexp library also uses the Boyer-Moore (or is it
Boyer-Moore-Gosper?) strategy.

-GAWollman



Re: why GNU grep is fast

2010-08-22 Thread Dag-Erling Smørgrav
Mike Haertel  writes:
> GNU grep uses the well-known Boyer-Moore algorithm, which looks
> first for the final letter of the target string, and uses a lookup
> table to tell it how far ahead it can skip in the input whenever
> it finds a non-matching character.

Boyer-Moore is for fixed search strings.  I don't see how that
optimization can work with a regexp search unless the regexp is so
simple that you break it down into a small number of cases with known
length and final character.

> GNU grep uses raw Unix input system calls and avoids copying data
> after reading it.

Yes, that was the first thing we looked at (and fixed) in BSD grep.

> Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES.  Looking
> for newlines would slow grep down by a factor of several times,
> because to find the newlines it would have to look at every byte!

Amen.  The current bottleneck in BSD grep is the memchr() that looks for
'\n' in the input buffer.

DES
-- 
Dag-Erling Smørgrav - d...@des.no


Re: why GNU grep is fast

2010-08-21 Thread Steven Hartland

That's a good read for other things as well, Mike; thanks for taking the
time to pass on this knowledge :)

- Original Message - 
From: "Mike Haertel" 

To: 


Anyway, just FYI, here's a quick summary of where GNU grep gets
its speed.  Hopefully you can carry these ideas over to BSD grep.

#1 trick: GNU grep is fast because it AVOIDS LOOKING AT
EVERY INPUT BYTE.

#2 trick: GNU grep is fast because it EXECUTES VERY FEW
INSTRUCTIONS FOR EACH BYTE that it *does* look at.

...





why GNU grep is fast

2010-08-20 Thread Mike Haertel
Hi Gabor,

I am the original author of GNU grep.  I am also a FreeBSD user,
although I live on -stable (and older) and rarely pay attention
to -current.

However, while searching the -current mailing list for an unrelated
reason, I stumbled across some flamage regarding BSD grep vs GNU grep
performance.  You may have noticed that discussion too...

Anyway, just FYI, here's a quick summary of where GNU grep gets
its speed.  Hopefully you can carry these ideas over to BSD grep.

#1 trick: GNU grep is fast because it AVOIDS LOOKING AT
EVERY INPUT BYTE.

#2 trick: GNU grep is fast because it EXECUTES VERY FEW
INSTRUCTIONS FOR EACH BYTE that it *does* look at.

GNU grep uses the well-known Boyer-Moore algorithm, which looks
first for the final letter of the target string, and uses a lookup
table to tell it how far ahead it can skip in the input whenever
it finds a non-matching character.

GNU grep also unrolls the inner loop of Boyer-Moore, and sets up
the Boyer-Moore delta table entries in such a way that it doesn't
need to do the loop exit test at every unrolled step.  The result
of this is that, in the limit, GNU grep averages fewer than 3 x86
instructions executed for each input byte it actually looks at
(and it skips many bytes entirely).

See "Fast String Searching", by Andrew Hume and Daniel Sunday,
in the November 1991 issue of Software Practice & Experience, for
a good discussion of Boyer-Moore implementation tricks.  It's
available as a free PDF online.

Once you have fast search, you'll find you also need fast input.

GNU grep uses raw Unix input system calls and avoids copying data
after reading it.

Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES.  Looking
for newlines would slow grep down by a factor of several times,
because to find the newlines it would have to look at every byte!

So instead of using line-oriented input, GNU grep reads raw data into
a large buffer, searches the buffer using Boyer-Moore, and only when
it finds a match does it go and look for the bounding newlines.
(Certain command line options like -n disable this optimization.)

Finally, when I was last the maintainer of GNU grep (15+ years ago...),
GNU grep also tried very hard to set things up so that the *kernel*
could ALSO avoid handling every byte of the input, by using mmap()
instead of read() for file input.  At the time, using read() caused
most Unix versions to do extra copying.  Since GNU grep passed out
of my hands, it appears that use of mmap became non-default, but you
can still get it via --mmap.  And at least in cases where the data
is already in the file system buffer cache, mmap is still faster:

  $ time sh -c 'find . -type f -print | xargs grep -l 123456789abcdef'
  real  0m1.530s
  user  0m0.230s
  sys   0m1.357s
  $ time sh -c 'find . -type f -print | xargs grep --mmap -l 123456789abcdef'
  real  0m1.201s
  user  0m0.330s
  sys   0m0.929s

[workload was a 648 megabyte MH mail folder containing 41000 messages]
So even nowadays, using --mmap can be worth a >20% speedup.

Summary:

- Use Boyer-Moore (and unroll its inner loop a few times).

- Roll your own unbuffered input using raw system calls.  Avoid copying
  the input bytes before searching them.  (Do, however, use buffered
  *output*.  The normal grep scenario is that the amount of output is
  small compared to the amount of input, so the overhead of output
  buffer copying is small, while savings due to avoiding many small
  unbuffered writes can be large.)

- Don't look for newlines in the input until after you've found a match.

- Try to set things up (page-aligned buffers, page-sized read chunks,
  optionally use mmap) so the kernel can ALSO avoid copying the bytes.

The key to making programs fast is to make them do practically nothing. ;-)

Regards,

Mike