Re: [RFC] Replacing our regex implementation

2011-05-10 Thread Gabor Kovesdan

Thanks to all for the valuable comments.
I've decided to continue the work with TRE because
- Many people don't like the idea of adding C++ code into libc
- I'm not very skilled in C++

However, in a later phase of development, I think re2 may be a good 
source of ideas in optimizing the performance. For now, I think if the 
new code without optimizations is at least as fast as the old regex 
code, that's fine. Imho, the motivation of replacing it is not just the 
performance but also the fact it is outdated code and not maintained any 
more and also the lack of wide character support. Once it is 
feature-complete to replace the old code, we can start thinking of 
optimizations. And as proposed, I'll separate those from libc for now 
and later I can propose some changes to the TRE maintainer.


Gabor
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-10 Thread Gabor Kovesdan

Em 09-05-2011 06:44, Tim Kientzle escreveu:

Have you asked the TRE maintainers if they would accept this change?

If they would, then getting this change into TRE would benefit a lot more 
people than just FreeBSD's libc.
This is a longer term goal. First, I prefer having something to talk 
about with the maintainer and then I can work out the patches to 
integrate it if he accepts it. For now, I prefer keeping them separate 
and if he is open to those ideas then I can easily refactor the code to 
fit the vendor code.


Gabor
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-09 Thread Pedro F. Giffuni

--- On Sun, 5/8/11, Bakul Shah  wrote:

...

> 
> C++ may be an impediment for it to go into libc but one
> can certainly put a C interface on a C++ library.
>

I wouldn't think it's very consistent to use C++ in libc.
Perhaps we could have the best of both worlds by using
libtre as the libc regex replacement and re2 for grep
and diff? As an extra benefit using Re2 would make it
easier to support --perl-regex in grep.

cheers,

Pedro.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-09 Thread Bakul Shah
On Mon, 09 May 2011 17:51:46 EDT David Schultz   wrote:
> On Sun, May 08, 2011, Bakul Shah wrote:
> > On Sun, 08 May 2011 21:35:04 CDT Zhihao Yuan   wrote:
> > > 1. This lib accepts many popular grammars (PCRE, POSIX, vim, etc.),
> > > but it does not allow you to change the mode.
> > > http://code.google.com/p/re2/source/browse/re2/re2.h
> > 
> > The mode is decided when an RE2 object is instantiated so this
> > is ok. You can certainly instantiate multiple objects with
> > different options if so desired.
> > 
> > > 2. It focuses on speed and features, not stability and standardization.
> > 
> > Look at the open issues. Seems stable enough to me. re2 has a
> > posix only mode. It also does unicode.

s/posix only mode/posix only mode as well/

> > 
> > > 3. It uses C++. We seldom accepts C++ code in base system, and does
> > > not accept it in libc.
> > 
> > This is the show stopper.
> 
> Use of C++ is a clear show-stopper if it introduces new runtime
> requirements, e.g., dependencies on STL or exceptions.  Aside from
> that, however, I can't think of any fundamental, technical reasons
> why a component of libc couldn't be written in C++.  (Perhaps the
> toolchain maintainers could name some, and they'd be the best
> authority on the matter.)  You can expect some resistance
> regardless, however, so make sure the technical merits of RE2 are
> worth the trouble.

Ok, I just verified there are no additional runtime
requirements by running a simple test, where I added a C
wrapper around an RE2 C++ call, compiled it with c++, then
compiled the client C code with cc, and linked everything with
cc. This works (tested on on x86_64, under 8.1).

I do think RE2 is very well done (see swtch.com/~rsc/regexp
articles) and it is actively maintained, has a battery of
pretty exhaustive tests.  Seems TRE's author also likes re2:
http://hackerboss.com/is-your-regex-matcher-up-to-snuff/

So if we want to consider this, it is a real possibility.

> IIRC, some of the prior discussions on using more C++ in the base
> system got derailed by tangents on multiple inheritance, operator
> overloading, misfeatures of STL, and what subset of C++ ought to
> be considered kosher in FreeBSD.  You don't have to get involved
> in any of that because you'd only be proposing to import a
> self-contained third-party library.

Indeed; we would just use it via a C wrapper API.  But I can
see someone thinking this is the camel's nose in the tent :-)
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-09 Thread David Schultz
On Sun, May 08, 2011, Bakul Shah wrote:
> On Sun, 08 May 2011 21:35:04 CDT Zhihao Yuan   wrote:
> > 1. This lib accepts many popular grammars (PCRE, POSIX, vim, etc.),
> > but it does not allow you to change the mode.
> > http://code.google.com/p/re2/source/browse/re2/re2.h
> 
> The mode is decided when an RE2 object is instantiated so this
> is ok. You can certainly instantiate multiple objects with
> different options if so desired.
> 
> > 2. It focuses on speed and features, not stability and standardization.
> 
> Look at the open issues. Seems stable enough to me. re2 has a
> posix only mode. It also does unicode.
> 
> > 3. It uses C++. We seldom accepts C++ code in base system, and does
> > not accept it in libc.
> 
> This is the show stopper.

Use of C++ is a clear show-stopper if it introduces new runtime
requirements, e.g., dependencies on STL or exceptions.  Aside from
that, however, I can't think of any fundamental, technical reasons
why a component of libc couldn't be written in C++.  (Perhaps the
toolchain maintainers could name some, and they'd be the best
authority on the matter.)  You can expect some resistance
regardless, however, so make sure the technical merits of RE2 are
worth the trouble.

IIRC, some of the prior discussions on using more C++ in the base
system got derailed by tangents on multiple inheritance, operator
overloading, misfeatures of STL, and what subset of C++ ought to
be considered kosher in FreeBSD.  You don't have to get involved
in any of that because you'd only be proposing to import a
self-contained third-party library.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Bakul Shah
On Sun, 08 May 2011 21:35:04 CDT Zhihao Yuan   wrote:
> 1. This lib accepts many popular grammars (PCRE, POSIX, vim, etc.),
> but it does not allow you to change the mode.
> http://code.google.com/p/re2/source/browse/re2/re2.h

The mode is decided when an RE2 object is instantiated so this
is ok. You can certainly instantiate multiple objects with
different options if so desired.

> 2. It focuses on speed and features, not stability and standardization.

Look at the open issues. Seems stable enough to me. re2 has a
posix only mode. It also does unicode.

> 3. It uses C++. We seldom accepts C++ code in base system, and does
> not accept it in libc.

This is the show stopper.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Tim Kientzle
On May 8, 2011, at 5:29 PM, Gabor Kovesdan wrote:
> 2, Optimizations for matching with a fixed pattern heuristic
> ... First, I was thinking of putting it into TRE but now I consider a better 
> solution building a small library, libregexutils or such. It would decouple 
> this optimization from the vendor code, ...

Have you asked the TRE maintainers if they would accept this change?

If they would, then getting this change into TRE would benefit a lot more 
people than just FreeBSD's libc.

(BTW, I agree that C++ in libc is a bad idea.)

Tim

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Bakul Shah
On Mon, 09 May 2011 08:30:57 +0400 Lev Serebryakov   wrote:
> Hello, Bakul.
> You wrote 9 =EC=E0=FF 2011 =E3., 5:17:09:
> 
> > As per the following URLs re2 is much faster than TRE (on the
> > benchmarks they ran):
> 
> > http://lh3lh3.users.sourceforge.net/reb.shtml
> > http://sljit.sourceforge.net/regex_perf.html
>   re2 is much faster at price of memory. I don't remember details now,
> but I've found (simple) situations when re2 consumes a HUGE amount of
> memory (read: hundreds of megabytes). It work faster than tre, yes. If
> you have this memory to RE engine alone.

As per http://swtch.com/~rsc/regexp/regexp3.html RE2 requires
about 10 KB per regexp, in contrast to PCRE's half a KB.  This
is not excessive in this day and age. But 100s of megabytes
sounds very strange  I'd appreciate a reference to an
actual example (and I am sure so would the author of re2).

But I do not want to defend re2 here. My intent was to just
make sure re2 was at least considered.  Mainly because it was
actually quite surprising to see TRE is 10 to 45 times slower
than re2!

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Lev Serebryakov
Hello, Bakul.
You wrote 9 мая 2011 г., 5:17:09:

> As per the following URLs re2 is much faster than TRE (on the
> benchmarks they ran):

> http://lh3lh3.users.sourceforge.net/reb.shtml
> http://sljit.sourceforge.net/regex_perf.html
  re2 is much faster at price of memory. I don't remember details now,
but I've found (simple) situations when re2 consumes a HUGE amount of
memory (read: hundreds of megabytes). It work faster than tre, yes. If
you have this memory to RE engine alone.

-- 
// Black Lion AKA Lev Serebryakov 

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Zhihao Yuan
On Sun, May 8, 2011 at 8:49 PM, Bakul Shah  wrote:
> On Mon, 09 May 2011 02:37:10 BST Gabor Kovesdan   wrote:
>> Em 09-05-2011 02:17, Bakul Shah escreveu:
>> > As per the following URLs re2 is much faster than TRE (on the
>> > benchmarks they ran):
>> >
>> > http://lh3lh3.users.sourceforge.net/reb.shtml
>> > http://sljit.sourceforge.net/regex_perf.html
>> >
>> > re2 is in C++&  has a PCRE API, while TRE is in C&  has a
>> > POSIX API.  Both have BSD copyright. Is it worth considering
>> > making re2 posix compliant?
>> Is it wchar-clean and is it actively maintained? C++ is quite
>> anticipated for the base system and I'm not very skilled in it so atm I
>> couldn't promise to use re2 instead of TRE. And anyway, can C++ go into
>> libc? According to POSIX, the regex code has to be there. But let's see
>> what others say... If we happen to use re2 later, my extensions that I
>> talked about in points 2, and 3, would still be useful.
>>
>> Anyway, according to some earlier vague measures, TRE seems to be slower
>> in small matching tasks but scales well. These tests seem to compare
>> only short runs with the same regex. It should be seem how they compare
>> e.g. if you grep the whole ports tree with the same pattern. If the
>> matching scales well once the pattern is compiled, that's more important
>> than the overall result for such short tasks, imho.
>
> re2 is certainly maintained. Don't know about whcar cleanliness.
> See
>    http://code.google.com/p/re2/
> Also check out Russ Cox's excellent articles on implementing it
>    http://swtch.com/~rsc/regexp/
> and this:
>    
> http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html
>
> C++ may be an impediment for it to go into libc but one can
> certainly put a C interface on a C++ library.

1. This lib accepts many popular grammars (PCRE, POSIX, vim, etc.),
but it does not allow you to change the mode.
http://code.google.com/p/re2/source/browse/re2/re2.h

2. It focuses on speed and features, not stability and standardization.

3. It uses C++. We seldom accepts C++ code in base system, and does
not accept it in libc.

So, as far as I concerned, re2 is good as a re engine in some
applications, but may not fit the requirements for a regex in libc.

> ___
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
>



-- 
Zhihao Yuan
The best way to predict the future is to invent it.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Bakul Shah
On Mon, 09 May 2011 02:37:10 BST Gabor Kovesdan   wrote:
> Em 09-05-2011 02:17, Bakul Shah escreveu:
> > As per the following URLs re2 is much faster than TRE (on the
> > benchmarks they ran):
> >
> > http://lh3lh3.users.sourceforge.net/reb.shtml
> > http://sljit.sourceforge.net/regex_perf.html
> >
> > re2 is in C++&  has a PCRE API, while TRE is in C&  has a
> > POSIX API.  Both have BSD copyright. Is it worth considering
> > making re2 posix compliant?
> Is it wchar-clean and is it actively maintained? C++ is quite 
> anticipated for the base system and I'm not very skilled in it so atm I 
> couldn't promise to use re2 instead of TRE. And anyway, can C++ go into 
> libc? According to POSIX, the regex code has to be there. But let's see 
> what others say... If we happen to use re2 later, my extensions that I 
> talked about in points 2, and 3, would still be useful.
> 
> Anyway, according to some earlier vague measures, TRE seems to be slower 
> in small matching tasks but scales well. These tests seem to compare 
> only short runs with the same regex. It should be seem how they compare 
> e.g. if you grep the whole ports tree with the same pattern. If the 
> matching scales well once the pattern is compiled, that's more important 
> than the overall result for such short tasks, imho.

re2 is certainly maintained. Don't know about whcar cleanliness.
See 
http://code.google.com/p/re2/
Also check out Russ Cox's excellent articles on implementing it
http://swtch.com/~rsc/regexp/
and this:

http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

C++ may be an impediment for it to go into libc but one can
certainly put a C interface on a C++ library.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Gabor Kovesdan

Em 09-05-2011 02:17, Bakul Shah escreveu:

As per the following URLs re2 is much faster than TRE (on the
benchmarks they ran):

http://lh3lh3.users.sourceforge.net/reb.shtml
http://sljit.sourceforge.net/regex_perf.html

re2 is in C++&  has a PCRE API, while TRE is in C&  has a
POSIX API.  Both have BSD copyright. Is it worth considering
making re2 posix compliant?
Is it wchar-clean and is it actively maintained? C++ is quite 
anticipated for the base system and I'm not very skilled in it so atm I 
couldn't promise to use re2 instead of TRE. And anyway, can C++ go into 
libc? According to POSIX, the regex code has to be there. But let's see 
what others say... If we happen to use re2 later, my extensions that I 
talked about in points 2, and 3, would still be useful.


Anyway, according to some earlier vague measures, TRE seems to be slower 
in small matching tasks but scales well. These tests seem to compare 
only short runs with the same regex. It should be seem how they compare 
e.g. if you grep the whole ports tree with the same pattern. If the 
matching scales well once the pattern is compiled, that's more important 
than the overall result for such short tasks, imho.


Gabor
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Bakul Shah
As per the following URLs re2 is much faster than TRE (on the
benchmarks they ran):

http://lh3lh3.users.sourceforge.net/reb.shtml
http://sljit.sourceforge.net/regex_perf.html

re2 is in C++ & has a PCRE API, while TRE is in C & has a
POSIX API.  Both have BSD copyright. Is it worth considering
making re2 posix compliant?
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFC] Replacing our regex implementation

2011-05-08 Thread Pedro F. Giffuni
Hello;

Thanks Gabor for this cool project!

--- On Sun, 5/8/11, Gabor Kovesdan  wrote:
...
> - It doesn't provide the REG_STARTEND macro, which is our
> non-POSIX extension. Still, it is useful and easy to
> implement so it is not a problem either.
 
Our sed requires REG_STARTEND (the illumos guys noted so
when they ported our sed).

While on this you should also look at bin/153257 and check
if TRE supports the sysv legacy delimiters. FWIW, the now
defunct lang/gpc used those delimiters to distinguish
between our sed and GNU sed. There are still more than 20
ports that depend on gsed.

> Now I'm working on this little feature and on building a
> libc with TRE. After that I'll publish a patch for testing
> and will also ask portmgr to run it on the cluster.
> 

It may be interesting, although unnecessarily risky, to
see if diff can also use libtre (it uses GNU regex now),
but you will probably want to see this as a future step.
(NEVERMIND.. you explained this right ahead ;) )

> 
> 3, Adding support for GNU-specific permissive regex syntax
> GNU grep accepts regexes that are invalid in POSIX, like
> [a|]. This is necessary for grep and diff in the base
> system. If we don't have them we can never trow out the GNU
> regex implementation. However, we should not make them
> default, so I'm thinking of implementing them somewhere
> else, e.g. in the previously mentioned library.
> 
> So far, these are the plans, please comments if you have
> something in your mind about it.
> 

Sounds good!

Pedro.

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"