Re: [RFC] Replacing our regex implementation
Thanks to all for the valuable comments. I've decided to continue the work with TRE because - Many people don't like the idea of adding C++ code into libc - I'm not very skilled in C++ However, in a later phase of development, I think re2 may be a good source of ideas in optimizing the performance. For now, I think if the new code without optimizations is at least as fast as the old regex code, that's fine. Imho, the motivation of replacing it is not just the performance but also the fact it is outdated code and not maintained any more and also the lack of wide character support. Once it is feature-complete to replace the old code, we can start thinking of optimizations. And as proposed, I'll separate those from libc for now and later I can propose some changes to the TRE maintainer. Gabor ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
Em 09-05-2011 06:44, Tim Kientzle escreveu: Have you asked the TRE maintainers if they would accept this change? If they would, then getting this change into TRE would benefit a lot more people than just FreeBSD's libc. This is a longer term goal. First, I prefer having something to talk about with the maintainer and then I can work out the patches to integrate it if he accepts it. For now, I prefer keeping them separate and if he is open to those ideas then I can easily refactor the code to fit the vendor code. Gabor ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
--- On Sun, 5/8/11, Bakul Shah wrote: ... > > C++ may be an impediment for it to go into libc but one > can certainly put a C interface on a C++ library. > I wouldn't think it's very consistent to use C++ in libc. Perhaps we could have the best of both worlds by using libtre as the libc regex replacement and re2 for grep and diff? As an extra benefit using Re2 would make it easier to support --perl-regex in grep. cheers, Pedro. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
On Mon, 09 May 2011 17:51:46 EDT David Schultz wrote: > On Sun, May 08, 2011, Bakul Shah wrote: > > On Sun, 08 May 2011 21:35:04 CDT Zhihao Yuan wrote: > > > 1. This lib accepts many popular grammars (PCRE, POSIX, vim, etc.), > > > but it does not allow you to change the mode. > > > http://code.google.com/p/re2/source/browse/re2/re2.h > > > > The mode is decided when an RE2 object is instantiated so this > > is ok. You can certainly instantiate multiple objects with > > different options if so desired. > > > > > 2. It focuses on speed and features, not stability and standardization. > > > > Look at the open issues. Seems stable enough to me. re2 has a > > posix only mode. It also does unicode. s/posix only mode/posix only mode as well/ > > > > > 3. It uses C++. We seldom accepts C++ code in base system, and does > > > not accept it in libc. > > > > This is the show stopper. > > Use of C++ is a clear show-stopper if it introduces new runtime > requirements, e.g., dependencies on STL or exceptions. Aside from > that, however, I can't think of any fundamental, technical reasons > why a component of libc couldn't be written in C++. (Perhaps the > toolchain maintainers could name some, and they'd be the best > authority on the matter.) You can expect some resistance > regardless, however, so make sure the technical merits of RE2 are > worth the trouble. Ok, I just verified there are no additional runtime requirements by running a simple test, where I added a C wrapper around an RE2 C++ call, compiled it with c++, then compiled the client C code with cc, and linked everything with cc. This works (tested on on x86_64, under 8.1). I do think RE2 is very well done (see swtch.com/~rsc/regexp articles) and it is actively maintained, has a battery of pretty exhaustive tests. Seems TRE's author also likes re2: http://hackerboss.com/is-your-regex-matcher-up-to-snuff/ So if we want to consider this, it is a real possibility. > IIRC, some of the prior discussions on using more C++ in the base > system got derailed by tangents on multiple inheritance, operator > overloading, misfeatures of STL, and what subset of C++ ought to > be considered kosher in FreeBSD. You don't have to get involved > in any of that because you'd only be proposing to import a > self-contained third-party library. Indeed; we would just use it via a C wrapper API. But I can see someone thinking this is the camel's nose in the tent :-) ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
On Sun, May 08, 2011, Bakul Shah wrote: > On Sun, 08 May 2011 21:35:04 CDT Zhihao Yuan wrote: > > 1. This lib accepts many popular grammars (PCRE, POSIX, vim, etc.), > > but it does not allow you to change the mode. > > http://code.google.com/p/re2/source/browse/re2/re2.h > > The mode is decided when an RE2 object is instantiated so this > is ok. You can certainly instantiate multiple objects with > different options if so desired. > > > 2. It focuses on speed and features, not stability and standardization. > > Look at the open issues. Seems stable enough to me. re2 has a > posix only mode. It also does unicode. > > > 3. It uses C++. We seldom accepts C++ code in base system, and does > > not accept it in libc. > > This is the show stopper. Use of C++ is a clear show-stopper if it introduces new runtime requirements, e.g., dependencies on STL or exceptions. Aside from that, however, I can't think of any fundamental, technical reasons why a component of libc couldn't be written in C++. (Perhaps the toolchain maintainers could name some, and they'd be the best authority on the matter.) You can expect some resistance regardless, however, so make sure the technical merits of RE2 are worth the trouble. IIRC, some of the prior discussions on using more C++ in the base system got derailed by tangents on multiple inheritance, operator overloading, misfeatures of STL, and what subset of C++ ought to be considered kosher in FreeBSD. You don't have to get involved in any of that because you'd only be proposing to import a self-contained third-party library. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
On Sun, 08 May 2011 21:35:04 CDT Zhihao Yuan wrote: > 1. This lib accepts many popular grammars (PCRE, POSIX, vim, etc.), > but it does not allow you to change the mode. > http://code.google.com/p/re2/source/browse/re2/re2.h The mode is decided when an RE2 object is instantiated so this is ok. You can certainly instantiate multiple objects with different options if so desired. > 2. It focuses on speed and features, not stability and standardization. Look at the open issues. Seems stable enough to me. re2 has a posix only mode. It also does unicode. > 3. It uses C++. We seldom accepts C++ code in base system, and does > not accept it in libc. This is the show stopper. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
On May 8, 2011, at 5:29 PM, Gabor Kovesdan wrote: > 2, Optimizations for matching with a fixed pattern heuristic > ... First, I was thinking of putting it into TRE but now I consider a better > solution building a small library, libregexutils or such. It would decouple > this optimization from the vendor code, ... Have you asked the TRE maintainers if they would accept this change? If they would, then getting this change into TRE would benefit a lot more people than just FreeBSD's libc. (BTW, I agree that C++ in libc is a bad idea.) Tim ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
On Mon, 09 May 2011 08:30:57 +0400 Lev Serebryakov wrote: > Hello, Bakul. > You wrote 9 =EC=E0=FF 2011 =E3., 5:17:09: > > > As per the following URLs re2 is much faster than TRE (on the > > benchmarks they ran): > > > http://lh3lh3.users.sourceforge.net/reb.shtml > > http://sljit.sourceforge.net/regex_perf.html > re2 is much faster at price of memory. I don't remember details now, > but I've found (simple) situations when re2 consumes a HUGE amount of > memory (read: hundreds of megabytes). It work faster than tre, yes. If > you have this memory to RE engine alone. As per http://swtch.com/~rsc/regexp/regexp3.html RE2 requires about 10 KB per regexp, in contrast to PCRE's half a KB. This is not excessive in this day and age. But 100s of megabytes sounds very strange I'd appreciate a reference to an actual example (and I am sure so would the author of re2). But I do not want to defend re2 here. My intent was to just make sure re2 was at least considered. Mainly because it was actually quite surprising to see TRE is 10 to 45 times slower than re2! ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
Hello, Bakul. You wrote 9 мая 2011 г., 5:17:09: > As per the following URLs re2 is much faster than TRE (on the > benchmarks they ran): > http://lh3lh3.users.sourceforge.net/reb.shtml > http://sljit.sourceforge.net/regex_perf.html re2 is much faster at price of memory. I don't remember details now, but I've found (simple) situations when re2 consumes a HUGE amount of memory (read: hundreds of megabytes). It work faster than tre, yes. If you have this memory to RE engine alone. -- // Black Lion AKA Lev Serebryakov ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
On Sun, May 8, 2011 at 8:49 PM, Bakul Shah wrote: > On Mon, 09 May 2011 02:37:10 BST Gabor Kovesdan wrote: >> Em 09-05-2011 02:17, Bakul Shah escreveu: >> > As per the following URLs re2 is much faster than TRE (on the >> > benchmarks they ran): >> > >> > http://lh3lh3.users.sourceforge.net/reb.shtml >> > http://sljit.sourceforge.net/regex_perf.html >> > >> > re2 is in C++& has a PCRE API, while TRE is in C& has a >> > POSIX API. Both have BSD copyright. Is it worth considering >> > making re2 posix compliant? >> Is it wchar-clean and is it actively maintained? C++ is quite >> anticipated for the base system and I'm not very skilled in it so atm I >> couldn't promise to use re2 instead of TRE. And anyway, can C++ go into >> libc? According to POSIX, the regex code has to be there. But let's see >> what others say... If we happen to use re2 later, my extensions that I >> talked about in points 2, and 3, would still be useful. >> >> Anyway, according to some earlier vague measures, TRE seems to be slower >> in small matching tasks but scales well. These tests seem to compare >> only short runs with the same regex. It should be seem how they compare >> e.g. if you grep the whole ports tree with the same pattern. If the >> matching scales well once the pattern is compiled, that's more important >> than the overall result for such short tasks, imho. > > re2 is certainly maintained. Don't know about whcar cleanliness. > See > http://code.google.com/p/re2/ > Also check out Russ Cox's excellent articles on implementing it > http://swtch.com/~rsc/regexp/ > and this: > > http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html > > C++ may be an impediment for it to go into libc but one can > certainly put a C interface on a C++ library. 1. This lib accepts many popular grammars (PCRE, POSIX, vim, etc.), but it does not allow you to change the mode. http://code.google.com/p/re2/source/browse/re2/re2.h 2. It focuses on speed and features, not stability and standardization. 3. It uses C++. We seldom accepts C++ code in base system, and does not accept it in libc. So, as far as I concerned, re2 is good as a re engine in some applications, but may not fit the requirements for a regex in libc. > ___ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" > -- Zhihao Yuan The best way to predict the future is to invent it. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
On Mon, 09 May 2011 02:37:10 BST Gabor Kovesdan wrote: > Em 09-05-2011 02:17, Bakul Shah escreveu: > > As per the following URLs re2 is much faster than TRE (on the > > benchmarks they ran): > > > > http://lh3lh3.users.sourceforge.net/reb.shtml > > http://sljit.sourceforge.net/regex_perf.html > > > > re2 is in C++& has a PCRE API, while TRE is in C& has a > > POSIX API. Both have BSD copyright. Is it worth considering > > making re2 posix compliant? > Is it wchar-clean and is it actively maintained? C++ is quite > anticipated for the base system and I'm not very skilled in it so atm I > couldn't promise to use re2 instead of TRE. And anyway, can C++ go into > libc? According to POSIX, the regex code has to be there. But let's see > what others say... If we happen to use re2 later, my extensions that I > talked about in points 2, and 3, would still be useful. > > Anyway, according to some earlier vague measures, TRE seems to be slower > in small matching tasks but scales well. These tests seem to compare > only short runs with the same regex. It should be seem how they compare > e.g. if you grep the whole ports tree with the same pattern. If the > matching scales well once the pattern is compiled, that's more important > than the overall result for such short tasks, imho. re2 is certainly maintained. Don't know about whcar cleanliness. See http://code.google.com/p/re2/ Also check out Russ Cox's excellent articles on implementing it http://swtch.com/~rsc/regexp/ and this: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html C++ may be an impediment for it to go into libc but one can certainly put a C interface on a C++ library. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
Em 09-05-2011 02:17, Bakul Shah escreveu: As per the following URLs re2 is much faster than TRE (on the benchmarks they ran): http://lh3lh3.users.sourceforge.net/reb.shtml http://sljit.sourceforge.net/regex_perf.html re2 is in C++& has a PCRE API, while TRE is in C& has a POSIX API. Both have BSD copyright. Is it worth considering making re2 posix compliant? Is it wchar-clean and is it actively maintained? C++ is quite anticipated for the base system and I'm not very skilled in it so atm I couldn't promise to use re2 instead of TRE. And anyway, can C++ go into libc? According to POSIX, the regex code has to be there. But let's see what others say... If we happen to use re2 later, my extensions that I talked about in points 2, and 3, would still be useful. Anyway, according to some earlier vague measures, TRE seems to be slower in small matching tasks but scales well. These tests seem to compare only short runs with the same regex. It should be seem how they compare e.g. if you grep the whole ports tree with the same pattern. If the matching scales well once the pattern is compiled, that's more important than the overall result for such short tasks, imho. Gabor ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
As per the following URLs re2 is much faster than TRE (on the benchmarks they ran): http://lh3lh3.users.sourceforge.net/reb.shtml http://sljit.sourceforge.net/regex_perf.html re2 is in C++ & has a PCRE API, while TRE is in C & has a POSIX API. Both have BSD copyright. Is it worth considering making re2 posix compliant? ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: [RFC] Replacing our regex implementation
Hello; Thanks Gabor for this cool project! --- On Sun, 5/8/11, Gabor Kovesdan wrote: ... > - It doesn't provide the REG_STARTEND macro, which is our > non-POSIX extension. Still, it is useful and easy to > implement so it is not a problem either. Our sed requires REG_STARTEND (the illumos guys noted so when they ported our sed). While on this you should also look at bin/153257 and check if TRE supports the sysv legacy delimiters. FWIW, the now defunct lang/gpc used those delimiters to distinguish between our sed and GNU sed. There are still more than 20 ports that depend on gsed. > Now I'm working on this little feature and on building a > libc with TRE. After that I'll publish a patch for testing > and will also ask portmgr to run it on the cluster. > It may be interesting, although unnecessarily risky, to see if diff can also use libtre (it uses GNU regex now), but you will probably want to see this as a future step. (NEVERMIND.. you explained this right ahead ;) ) > > 3, Adding support for GNU-specific permissive regex syntax > GNU grep accepts regexes that are invalid in POSIX, like > [a|]. This is necessary for grep and diff in the base > system. If we don't have them we can never trow out the GNU > regex implementation. However, we should not make them > default, so I'm thinking of implementing them somewhere > else, e.g. in the previously mentioned library. > > So far, these are the plans, please comments if you have > something in your mind about it. > Sounds good! Pedro. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"