g_once_init*() (Re: Performance implications of GRegex structure)

2007-03-20 Thread Tim Janik

On Sat, 17 Mar 2007, Owen Taylor wrote:

> On Sat, 2007-03-17 at 16:14 +0100, Marco Barisione wrote:

>> Owen: what exactly should G_STATIC_REGEX_INIT do?
>
> I was imagining:
>
> struct _GStaticRegex {
>     GOnce once;
>     GRegex *regex;
>     const gchar *pattern;
>     GRegexCompileFlags flags;
> };
>
> #define G_STATIC_REGEX_INIT(pattern, flags) { \
>     G_ONCE_INIT, \
>     NULL, \
>     pattern, \
>     flags \
> }

hm, i'd like to point out that there really should not be
extra macro magic around for handling cross-thread regexps.
with GOnce, GLib provides means for thread-safe one time
initialization already, and if the use of that is too cumbersome,
we're currently working on a more convenient alternative here:
http://bugzilla.gnome.org/show_bug.cgi?id=65041
with that, you should be able to write:
   static GRegex *pattern = NULL;
   static gsize initialized = 0;
   if (g_once_init_enter (&initialized))
     {
       pattern = g_regex_new (...);
       g_once_init_leave (&initialized, 1);
     }
providing a g_once_*() convenience API that can be used everywhere
is a much better way than to re-invent G_STATIC_*() magic for every
component that's potentially used across multiple threads.
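
A minimal sketch of the enter/leave pattern described above, written in
plain C11 so that it stands alone. Everything named here is an
assumption made for illustration: once_enter() and once_leave() are
hypothetical stand-ins for the proposed g_once_init_*() calls, and
POSIX regcomp() stands in for the regex constructor. The busy-wait is
only there to keep the sketch short.

```c
#include <regex.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the proposed enter/leave calls.
 * States: 0 = not started, 1 = in progress, 2 = done. */
static int once_enter(atomic_int *state)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(state, &expected, 1))
        return 1;                      /* this caller must initialize */
    while (atomic_load(state) != 2)
        ;                              /* someone else is initializing: wait */
    return 0;
}

static void once_leave(atomic_int *state)
{
    atomic_store(state, 2);            /* publish the initialized data */
}

static int compile_count = 0;          /* only here to make the behaviour observable */

/* Returns the shared compiled pattern, compiling it on first use only. */
const regex_t *get_word_pattern(void)
{
    static regex_t pattern;
    static atomic_int initialized = 0;

    if (once_enter(&initialized)) {
        regcomp(&pattern, "^[a-z]+$", REG_EXTENDED | REG_NOSUB);
        compile_count++;
        once_leave(&initialized);
    }
    return &pattern;
}
```

Every caller goes through get_word_pattern(); only the first one pays
for compilation, and no per-component G_STATIC_*() macro is involved.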

>   - Owen

---
ciaoTJ
___
gtk-devel-list mailing list
gtk-devel-list@gnome.org
http://mail.gnome.org/mailman/listinfo/gtk-devel-list

Re: Performance implications of GRegex structure

2007-03-18 Thread Yevgen Muntyan

Gustavo J. A. M. Carneiro wrote:
>   I can't resist stating my opinion on this :P
>
>   I think it's OK to have a single GRegex object, with no separate match
> or matcher, IF g_regex_copy is basically a lightweight copy[1].
>   
It is.
>   I think this matches well with the rest of the GLib APIs wrt. thread
> safety.  None[2] of the other GLib data structures are thread-safe. E.g.
> you can't share a GList between threads, you have to protect it by a
> mutex or have one copy for each thread.  So why should GRegex follow a
> different pattern?
>   
As far as I understand, it won't be made magically threadsafe.
You will have to have a per-thread copy or protect it with mutex
or something [1].

The real problem is that g_regex_copy() is not what people think it is,
and people are used to object names which contain "match". It sounds
heavy enough to want to change the API, to make API names reflect
what objects and methods do.

I personally am fine with GRegex, and I'd prefer it stay as it is,
but having to use Matcher in place of Regex doesn't seem
to be high price for everybody being happy, so why not?

As to language bindings, IMO no public API which is intended to
be used from bindings should use GRegex. Don't know why, but I
am pretty sure it will lead to troubles or at least to unnecessary
work in bindings and non-C code using those bindings. GRegex
properties or signal arguments are a bad idea, as with GHashTable.

Yevgen

[1] It's possible to make regex_match() return some Match (as opposed
to "Matcher") objects, like in Python, but then these objects would need
to keep state, not just the "result", and you would need to pass Match
object to next regex_match(), which would be totally weird.

Re: Performance implications of GRegex structure

2007-03-18 Thread Gustavo J. A. M. Carneiro

  I can't resist stating my opinion on this :P

  I think it's OK to have a single GRegex object, with no separate match
or matcher, IF g_regex_copy is basically a lightweight copy[1].

  I think this matches well with the rest of the GLib APIs wrt. thread
safety.  None[2] of the other GLib data structures are thread-safe. E.g.
you can't share a GList between threads, you have to protect it by a
mutex or have one copy for each thread.  So why should GRegex follow a
different pattern?

  And there is no doubt that, with C's manual memory management,
managing a single object is easier than managing multiple objects.  And
for language bindings, wrapping one object is easier than wrapping
two :P

  Just my .02€,

[1]  Without looking at the code, I think g_regex_copy can easily become
lightweight by internally splitting the GRegex into a shared
read-only/immutable part and a non-shared state part, wherein
g_regex_copy would incref the shared part and create a new non-shared
part.  Any attempt to modify a property that belongs to the immutable
part would trigger a new copy of that part (copy-on-write).

[2] Well, maybe except GAsyncQueue, but it was designed specifically for
threading; otherwise it is more or less a GList.
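
A rough sketch of the split described in footnote [1], assuming POSIX
regex.h as the engine. The type and function names (SharedPattern,
Regex, regex_copy and so on) are hypothetical and are not the GRegex
API. The copy-on-write step is left out, and the reference count is a
plain int, so this version is not itself thread-safe.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical layout: the compiled program is shared and read-only,
 * the match state is private to each copy. */
typedef struct {
    regex_t compiled;
    int     ref_count;       /* a real implementation would use atomic ops */
} SharedPattern;

typedef struct {
    SharedPattern *shared;   /* immutable, reference counted */
    regmatch_t     last;     /* per-copy state: span of the last match */
} Regex;

Regex *regex_new(const char *pattern)
{
    SharedPattern *sp = malloc(sizeof *sp);
    Regex *re = malloc(sizeof *re);
    if (!sp || !re || regcomp(&sp->compiled, pattern, REG_EXTENDED) != 0) {
        free(sp);
        free(re);
        return NULL;
    }
    sp->ref_count = 1;
    re->shared = sp;
    re->last.rm_so = re->last.rm_eo = -1;
    return re;
}

/* Lightweight copy: no recompilation, just a new reference and fresh state. */
Regex *regex_copy(const Regex *src)
{
    Regex *re = malloc(sizeof *re);
    if (!re)
        return NULL;
    re->shared = src->shared;
    re->shared->ref_count++;
    re->last.rm_so = re->last.rm_eo = -1;
    return re;
}

void regex_free(Regex *re)
{
    if (--re->shared->ref_count == 0) {
        regfree(&re->shared->compiled);
        free(re->shared);
    }
    free(re);
}

/* Matching only writes to the caller's private state. */
int regex_match(Regex *re, const char *text)
{
    return regexec(&re->shared->compiled, text, 1, &re->last, 0) == 0;
}
```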

On Sat, 2007-03-17 at 18:37 -0500, Yevgen Muntyan wrote:
> Owen Taylor wrote:
> > On Sat, 2007-03-17 at 16:08 -0500, Yevgen Muntyan wrote:
> >   
> >> Yevgen Muntyan wrote:
> >> 
> >>> [snip]
> >>> To me here the only good argument in favor of separate Match objects is 
> >>> multi-thread uses.
> >>> Simply because we already have Match object, just hidden. If the best 
> >>> way to fix GRegex
> >>> for multi-threading is a separate match object, then it should be a 
> >>> separate match object.
> >>>   
> >>>   
> >> In fact it's not a solution, right? Since if it's separate Match
> >> structure, then Regex still needs to keep state.
> >> So, the solution is to rename some stuff to make GRegex be
> >> a GRegexExp or something, and move the actual functionality
> >> to some new GMatcher, i.e. not change anything conceptually but
> >> explicitly separate Pattern and Matcher. Did I get it right?
> >> 
> >
> > Yes, I think you've understood what I was talking about with a
> > matcher object ... almost all  the methods in GRegex currently other
> > than g_regex_new()/g_regex_optimize() are conceptually matcher methods.
> >
> > I don't have any objection to a matcher object with state; what I don't
> >   
> > like is binding it together with the pattern into a single indivisible
> > object.
> >   
> What I was arguing against (if you ignore the "don't change it, period"
> part) was creating new match objects every time you perform a match
> (it's what's done in Python). Basically I don't want every match()
> method to hand me a newly allocated structure which has to be freed.
> And given it wouldn't work anyway, I was arguing against something
> which wouldn't work anyway :)
> Making a cool new API which would be nice is certainly not a bad
> thing.
> 
> One thing should be taken care of: how all those things will
> be copied/referenced. The language bindings concern led
> to this silly g_regex_copy(); so we can get same funny
> thing when someone says "not bindings-friendly" about
> new API.
> Perhaps making Matcher and Regex ref-countable (perhaps
> internally for Regex) wouldn't be bad, would it?
> 
> It would be great if concerned people [1] commented  about it in
> http://bugzilla.gnome.org/show_bug.cgi?id=419368
> 
> Yevgen
> 
> [1] Owen :)
> 
-- 
Gustavo J. A. M. Carneiro
<[EMAIL PROTECTED]> <[EMAIL PROTECTED]>
The universe is always one step beyond logic

Re: Performance implications of GRegex structure

2007-03-18 Thread Freddie Unpenstein


> When you evaluate an API, you have to look at a number of things:
> - Is the API complete? Can it do what is needed?
> - Does the API allow getting common things done in a few lines of
>   code?
> - Is the API easy to figure out?
> - Is the resulting code legible and easy to read?
> - Does the API encourage writing efficient and correct code?

Personally, my main interests in this thread, based on tasks I've had to 
perform in the past, are these;


*) To be able to compile a pattern that allows a given search to be applied 
repeatedly, as efficiently as possible.

Once a pattern is compiled, there should be no reason for the compiled version 
to change.  It may even be possible to incorporate a pre-compiled regexp 
directly, if it makes any sense to do so.  As much work as possible should be 
done on the pattern so that matching against the compiled form should be as 
fast as possible.

A pattern intended to be used solely from one thread could, however, be 
partially compiled, and then finished off if and when it's needed.  There 
should, if this is the case, be a function to call to fully boil it down 
immediately if it's going to be shared.

[history] I have had a situation where I needed to search several MB of log 
files, for all occurrences of some 20 regular expressions.


*) to be able to search for a pattern repeatedly within a potentially infinite 
length data stream.

Even in a single-threaded application, a given pattern could be used 
concurrently on several different data streams.  So keeping the search state 
separate from the pattern would be a good idea.

Being able to obtain both the start and end of the last match will allow an 
application developer to decide whether to allow over-lapping matches or not.
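
To make the point about separate search state concrete, here is a small
sketch that assumes POSIX regex.h as the engine. Matcher, matcher_init()
and matcher_next() are hypothetical names. The compiled pattern is only
read, each stream has its own Matcher, and the caller chooses between
overlapping and non-overlapping matches by saying where the next search
resumes.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical matcher: all search state lives here, not in the pattern. */
typedef struct {
    const regex_t *pattern;  /* shared, never modified by matching */
    const char    *text;
    size_t         pos;      /* where the next search starts */
    size_t         start;    /* start of the last match */
    size_t         end;      /* end of the last match */
    int            done;     /* set once the text is exhausted */
} Matcher;

void matcher_init(Matcher *m, const regex_t *pattern, const char *text)
{
    m->pattern = pattern;
    m->text = text;
    m->pos = m->start = m->end = 0;
    m->done = 0;
}

/* Finds the next match; returns 1 on success, 0 when there are no more. */
int matcher_next(Matcher *m, int allow_overlap)
{
    regmatch_t span;
    int flags = m->pos > 0 ? REG_NOTBOL : 0;

    if (m->done)
        return 0;
    if (regexec(m->pattern, m->text + m->pos, 1, &span, flags) != 0) {
        m->done = 1;
        return 0;
    }
    m->start = m->pos + (size_t)span.rm_so;
    m->end   = m->pos + (size_t)span.rm_eo;

    /* Resume one past the start (overlapping, or after an empty match)
     * or at the end of the match (non-overlapping). */
    if (allow_overlap || m->end == m->start)
        m->pos = m->start + 1;
    else
        m->pos = m->end;

    if (m->text[m->start] == '\0')
        m->done = 1;         /* empty match at the very end of the text */
    return 1;
}
```

Two Matcher values can walk different buffers with the same compiled
pattern, interleaved in any order, because nothing in the pattern
changes during a search.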

[history] The closest I've come to this, is searching for one of several 
patterns within the first 4KB of a data stream.  These patterns were acceptance 
and rejection patterns of a TCP data stream.   Definite negatives on the 
acceptance patterns, and a definite positive on any rejection pattern, with as 
few accepted packets as possible, would have been exceedingly useful.


*) I personally don't care if it's a single object, or separate objects.  As 
long as it can be used where it's needed, without unnecessary duplication of 
the compiled regular expression, or unnecessary re-compilation.

Refcounting on the search pattern object may allow it to be shared without any 
unnecessary duplication, and would most certainly simplify the whole API.  
However, it doesn't allow for multiple simultaneous searches without some 
pretty bizarre magic.  And a _copy() function isn't really that obvious, at 
least to me.  I'd be expecting it to copy everything, including the present 
search state, as opposed to just the search pattern.  Which isn't what you want 
(there may still be some use to a _copy() function that does copy the current 
search state).  At least with the separate object method, it's obvious that 
you're taking a pattern, and starting a search on it.  Reference counting on 
the pattern will allow it to be automatically destroyed when no longer needed, 
regardless of how many concurrent searches are running.


*) if any part of it is going to be called GMatch, then it should be done as a 
generic pattern searching API.  This could work well for allowing one routine 
to search for regexp's, fixed strings, file-system style glob expressions, and 
just about anything else someone can imagine.

Create a "pattern" object with g_match_regexp_new(), or g_match_fixed_new(), 
etc., and pass it to the GMatch functions to actually do the search.  The class 
of the pattern object would know how to interpret the object data correctly.  
The "fixed" pattern compiler could, for example, build a Boyer-Moore style skip 
table.  The "glob" pattern compiler might simply function much as a simplified 
regexp.  More interestingly, this opens the way rather nicely to supporting 
different "levels" of regexp.
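
As an illustration of what a "fixed" pattern compiler might precompute,
here is a sketch of the Horspool simplification of Boyer-Moore, which
keeps only the bad-character skip table. FixedPattern and both function
names are hypothetical, and this is not an existing GLib interface.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "compiled" fixed-string pattern. */
typedef struct {
    const char *needle;
    size_t      len;
    size_t      skip[UCHAR_MAX + 1];  /* shift, indexed by the last byte of the window */
} FixedPattern;

void fixed_pattern_compile(FixedPattern *p, const char *needle)
{
    p->needle = needle;
    p->len = strlen(needle);
    /* Bytes that do not occur in the needle allow a full-length shift. */
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        p->skip[i] = p->len;
    /* Bytes that do occur (except in the last position) shift less. */
    for (size_t i = 0; i + 1 < p->len; i++)
        p->skip[(unsigned char)needle[i]] = p->len - 1 - i;
}

/* Returns a pointer to the first occurrence, or NULL if there is none. */
const char *fixed_pattern_find(const FixedPattern *p,
                               const char *text, size_t text_len)
{
    size_t pos = 0;

    if (p->len == 0)
        return text;
    while (pos + p->len <= text_len) {
        if (memcmp(text + pos, p->needle, p->len) == 0)
            return text + pos;
        pos += p->skip[(unsigned char)text[pos + p->len - 1]];
    }
    return NULL;
}
```

The table is built once per pattern, so the cost is paid at "compile"
time and every later search reuses it, which is the property asked for
in the first point of this message.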

[history] This type of functionality has been quite useful in some scripting 
languages I have used on occasion.  Unfortunately, in those cases, I had to use 
some rather dirty tricks.  I imagine similar circumstances must occasionally 
occur in C also, and will require just this kind of structure.


*) wrappers should be written for the common cases.  g_match_regexp_simple() 
would search a buffer for a given regular expression.  It will essentially 
compile (g_match_regexp_new), match (g_match_???), destroy (g_object_unref), 
and return the basic boolean result.  Another variant can accept pointers into 
which to deposit the first several "results".  If allowing other search pattern 
types, similar functions would exist for them also.

GMatch would have a similar function which takes a compiled pattern, and runs 
it against a buffer, disregarding any result strings, and just returning the 
simple boolean result.  For any more complex tasks, the full weight of GMatch 
is available.
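
A sketch of the two wrappers just described, using POSIX regex.h as a
stand-in engine. The names regex_match_simple() and regex_matches() are
placeholders chosen for this example and are not taken from any
existing header.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical one-shot helper: compile, match, destroy, report a boolean.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile. */
int regex_match_simple(const char *pattern, const char *text)
{
    regex_t re;
    int matched;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    matched = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return matched;
}

/* Counterpart for an already compiled pattern: no per-call compilation,
 * no result strings, just the boolean. */
int regex_matches(const regex_t *re, const char *text)
{
    return regexec(re, text, 0, NULL, 0) == 0;
}
```

The first form is convenient for a single test; inside a loop over many
buffers the second form avoids recompiling the pattern each time.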

[history

Re: Performance implications of GRegex structure

2007-03-17 Thread Yevgen Muntyan

Owen Taylor wrote:
> On Sat, 2007-03-17 at 16:08 -0500, Yevgen Muntyan wrote:
>   
>> Yevgen Muntyan wrote:
>> 
>>> [snip]
>>> To me here the only good argument in favor of separate Match objects is 
>>> multi-thread uses.
>>> Simply because we already have Match object, just hidden. If the best 
>>> way to fix GRegex
>>> for multi-threading is a separate match object, then it should be a 
>>> separate match object.
>>>   
>>>   
>> In fact it's not a solution, right? Since if it's separate Match
>> structure, then Regex still needs to keep state.
>> So, the solution is to rename some stuff to make GRegex be
>> a GRegexExp or something, and move the actual functionality
>> to some new GMatcher, i.e. not change anything conceptually but
>> explicitly separate Pattern and Matcher. Did I get it right?
>> 
>
> Yes, I think you've understood what I was talking about with a
> matcher object ... almost all  the methods in GRegex currently other
> than g_regex_new()/g_regex_optimize() are conceptually matcher methods.
>
> I don't have any objection to a matcher object with state; what I don't
>   
> like is binding it together with the pattern into a single indivisible
> object.
>   
What I was arguing against (if you ignore the "don't change it, period"
part) was creating new match objects every time you perform a match
(it's what's done in Python). Basically I don't want every match()
method to hand me a newly allocated structure which has to be freed.
And given it wouldn't work anyway, I was arguing against something
which wouldn't work anyway :)
Making a cool new API which would be nice is certainly not a bad
thing.

One thing should be taken care of: how all those things will
be copied/referenced. The language bindings concern led
to this silly g_regex_copy(); so we can get same funny
thing when someone says "not bindings-friendly" about
new API.
Perhaps making Matcher and Regex ref-countable (perhaps
internally for Regex) wouldn't be bad, would it?

It would be great if concerned people [1] commented  about it in
http://bugzilla.gnome.org/show_bug.cgi?id=419368

Yevgen

[1] Owen :)

Re: Performance implications of GRegex structure

2007-03-17 Thread Owen Taylor

On Sat, 2007-03-17 at 16:08 -0500, Yevgen Muntyan wrote:
> Yevgen Muntyan wrote:
> > [snip]
> > To me here the only good argument in favor of separate Match objects is 
> > multi-thread uses.
> > Simply because we already have Match object, just hidden. If the best 
> > way to fix GRegex
> > for multi-threading is a separate match object, then it should be a 
> > separate match object.
> >   
> In fact it's not a solution, right? Since if it's separate Match
> structure, then Regex still needs to keep state.
> So, the solution is to rename some stuff to make GRegex be
> a GRegexExp or something, and move the actual functionality
> to some new GMatcher, i.e. not change anything conceptually but
> explicitly separate Pattern and Matcher. Did I get it right?

Yes, I think you've understood what I was talking about with a
matcher object ... almost all  the methods in GRegex currently other
than g_regex_new()/g_regex_optimize() are conceptually matcher methods.

I don't have any objection to a matcher object with state; what I don't
like is binding it together with the pattern into a single indivisible
object.

- Owen



Re: Performance implications of GRegex structure

2007-03-17 Thread Owen Taylor

On Sat, 2007-03-17 at 15:45 -0500, Yevgen Muntyan wrote:
> Owen Taylor wrote:
[...]

> > If we can identify the most common patterns of usage, I think we can
> > add convenience functions that make usage of an immutable pattern object
> > almost as convenient as the current GRegex.
> >
> > You can have functions like:
> >
> >  if (g_regex_matches(regex, str, -1, 0))
> > ...
> >
> >  if (g_regex_get_matches(regex, str, -1, 0,
> >                          0, &whole_match,
> >                          1, &first_substring,
> >                          -1))
> > ...
> >
> >  if (g_regex_get_named_matches(regex, str, -1, 0,
> >                                "firstName", &first_name,
> >                                "lastName", &last_name,
> >                                NULL))
> > ...
> >
> > The first two cover 98% of all cases when I've ever used a regular
> > expression ... I either want a boolean match / doesn't match, or I
> > want to match against a pattern, and if succeeds, do something with
> > several substrings.

>  It won't cover usage of EggRegex in gtksourceview. The second variant 
>  seems to be nice for "usual" uses, while the third is not - if your
>  named pattern didn't match you get NULL and if whole regex didn't
>  match, you get NULL too. You really want to match, get to know if
>  whole thing matched, and then look at subpatterns or whatnot.

I didn't provide API docs for my examples! :-) Anyway, my intent
was that the third example was just like the second example, but for
the funky (?P<name>...) named subpatterns. In both cases the boolean
return value is whether the match succeeded.
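
To pin down those semantics, here is a sketch of the numbered variant
on top of POSIX regex.h. The signature is an assumption modelled
loosely on the g_regex_get_matches() example quoted above;
regex_get_matches() and MAX_GROUPS are hypothetical. The return value
reports whether the whole pattern matched, and a group that took no
part in a successful match comes back as NULL, so the two cases stay
distinguishable.

```c
#include <regex.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUPS 10   /* assumption: group indexes 0..9 are enough */

/* After the text come (int index, char **out) pairs, ended by -1.
 * Returns 1 if the whole pattern matched, 0 otherwise. On a match each
 * *out receives a malloc'd copy of the group, or NULL if that group
 * did not take part in the match. On failure the outputs are untouched. */
int regex_get_matches(const regex_t *re, const char *text, ...)
{
    regmatch_t groups[MAX_GROUPS];
    va_list ap;
    int index;

    if (regexec(re, text, MAX_GROUPS, groups, 0) != 0)
        return 0;

    va_start(ap, text);
    while ((index = va_arg(ap, int)) != -1) {
        char **out = va_arg(ap, char **);
        *out = NULL;
        if (index >= 0 && index < MAX_GROUPS && groups[index].rm_so != -1) {
            size_t len = (size_t)(groups[index].rm_eo - groups[index].rm_so);
            *out = malloc(len + 1);
            if (*out) {
                memcpy(*out, text + groups[index].rm_so, len);
                (*out)[len] = '\0';
            }
        }
    }
    va_end(ap);
    return 1;
}
```

A named variant would follow the same shape, with a lookup from group
name to index in front of the copy step.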

I haven't looked at the GtkSourceView code, but my assumption is that
there are only a few places in the code where it is creating regular
expressions, since its regular expressions are configured in files.
So, adding a few extra lines of code in those places isn't a big deal,
and it doesn't seem to me like the best example of what we need
on the convenience end of things.

It's probably an excellent example of what is needed for performance.

> > That to me, would relegate the matcher object to cases where the
> > annoyance of an extra object is small compared to the complexity of
> > the operation.
> >   
> > You could also take the above functions and have the same thing for:
> >
> >  - Strings (like the current _simple() convenience functions)
> >  - Something like my GStaticRegex proposal
> >
> > As always, the question about convenience functions is "where do you
> > stop?"...

>  Right here, I guess. Let me stress: it's not about *convenience
>  functions*. It's about convenience in using non-simple GRegex API.

Maybe I don't understand your concern. Obviously GRegex needs
to work well for complex uses, but if I have 50 lines of code
manipulating a single regular expression, then changing that to
52 lines of code isn't a big deal.

>  Perhaps it's just that I already have adapted code to changes in
>  EggRegex, not once, and I naturally don't want to do it once again,
>  because some people are used to some stuff in Java...

You are probably half-joking here, but I'll answer it anyways:
once it's in a stable release of GLib, it's in there forever. This is
our only chance to get the API right.

>  To me here the only good argument in favor of separate Match objects is
>  multi-thread uses. Simply because we already have Match object, just
>  hidden. If the best way to fix GRegex for multi-threading is a
>  separate match object, then it should be a separate match object. The
>  rest is really philosophy - if one thinks separate object in code
>  makes it something different conceptually, then he's wrong (it does
>  make API less convenient to use though).

When you evaluate an API, you have to look at a number of things:

 - Is the API complete? Can it do what is needed?
 - Does the API allow getting common things done in a few lines of code?
 - Is the API easy to figure out?
 - Is the resulting code legible and easy to read?
 - Does the API encourage writing efficient and correct code?

That last element is an important one; you can't ignore the psychology
of the person using your API.

>  A separate Match*er* object, which would actually have functionality of
>  current GRegex, is not a good idea, since it would only add an extra
>  object without any change in functionality, in particular it would not
>  be thread-safe (some_get_matcher() or some_new_matcher() would be
>  simply equivalent to current g_regex_copy()).

As I demonstrated earlier, g_regex_copy() *does* provide a way of using
GRegex in a thread safe manner, but it's unintuitive and a little
clumsy. I think we can do better than that.

- Owen

Re: Performance implications of GRegex structure

2007-03-17 Thread Yevgen Muntyan

Yevgen Muntyan wrote:
> [snip]
> To me here the only good argument in favor of separate Match objects is 
> multi-thread uses.
> Simply because we already have Match object, just hidden. If the best 
> way to fix GRegex
> for multi-threading is a separate match object, then it should be a 
> separate match object.
>   
In fact it's not a solution, right? Since if it's separate Match
structure, then Regex still needs to keep state.
So, the solution is to rename some stuff to make GRegex be
a GRegexExp or something, and move the actual functionality
to some new GMatcher, i.e. not change anything conceptually but
explicitly separate Pattern and Matcher. Did I get it right?

Then usage would be something like

/* Compile regex for future uses */
GRegexExp *re = g_compile_pattern("foo");
GMatcher *m = g_matcher_new (re);
g_regex_exp_free(re);
...
g_matcher_find_something(m, "blah");
...
/* Finally free it when program exits */
g_matcher_free(m);

instead of current

/* Compile regex for future uses */
GRegex *re = g_regex_new("foo");
...
g_regex_match(re, "blah");
...
/* Finally free it when program exits */
g_regex_free(re);

And in multithreaded case threads would do g_matcher_new()
instead of g_regex_copy().

g_regex_copy() is weird, that's clear. API is weird, true. But
functionality right now is no different from what would be
in case of some GMatcher objects, correct?

Best regards,
Yevgen

Re: Performance implications of GRegex structure

2007-03-17 Thread Yevgen Muntyan

Owen Taylor wrote:
> On Fri, 2007-03-16 at 21:30 +0100, Marco Barisione wrote:
>   
>> On Thu, 15/03/2007 at 10.18 -0400, Owen Taylor wrote:
>> 
>>> But looking over the header file, there is something that puzzles me
>>> about the way that it's set up: there is no distinction between a
>>> "pattern/regular expression" object and a match/matcher object.
>>>   
>> The internal code in GRegex was deeply modified but the API is quite
>> similar to the original one written by Scott Wimer and then modified by
>> Matthias Clasen, so I kept a single GRegex object but with lots of
>> doubts.
>>
>> In the end I decided to keep a single object because I prefer this
>> approach when using languages without a garbage collector and because
>> QRegExp (the equivalent object in Qt) is a single object.
>>
>> This matter was brought up on the mailing list and in bugzilla but only
>> Havoc Pennington and Yevgen Muntyan expressed their opinion saying that
>> they prefer a single object.
>> 
>
> I apologize for not speaking up on the bugzilla bug. I must admit that
> though I saw the discussion, I didn't really pay a lot of attention
> until the header file appeared in CVS.
>
> I certainly appreciate the arguments for convenience in C; it's a valid
> concern. But I don't think we should let convenience be the overriding
> factor over everything else; after all, the user *is* writing in C,
> so convenience almost certainly wasn't utmost on their mind ;-)
>   

If he uses GRegex instead of raw pcre, then one could say it *is* about 
convenience ;)
> If we can identify the most common patterns of usage, I think we can
> add convenience functions that make usage of an immutable pattern object
> almost as convenient as the current GRegex.
>
> You can have functions like:
>
>  if (g_regex_matches(regex, str, -1, 0))
> ...
>
>  if (g_regex_get_matches(regex, str, -1, 0,
>                          0, &whole_match,
>                          1, &first_substring,
>                          -1))
> ...
>
>  if (g_regex_get_named_matches(regex, str, -1, 0,
>                                "firstName", &first_name,
>                                "lastName", &last_name,
>                                NULL))
> ...
>
> The first two cover 98% of all cases when I've ever used a regular
> expression ... I either want a boolean match / doesn't match, or I
> want to match against a pattern, and if succeeds, do something with
> several substrings.
>   
It won't cover usage of EggRegex in gtksourceview. The second variant
seems to be nice for "usual" uses, while the third is not - if your
named pattern didn't match you get NULL and if whole regex didn't
match, you get NULL too. You really want to match, get to know if
whole thing matched, and then look at subpatterns or whatnot.
> That to me, would relegate the matcher object to cases where the
> annoyance of an extra object is small compared to the complexity of
> the operation.
>   
> You could also take the above functions and have the same thing for:
>
>  - Strings (like the current _simple() convenience functions)
>  - Something like my GStaticRegex proposal
>
> As always, the question about convenience functions is "where do you
> stop?"...
>   
Right here, I guess. Let me stress: it's not about *convenience
functions*. It's about convenience in using non-simple GRegex API.
Perhaps it's just that I already have adapted code to changes in
EggRegex, not once, and I naturally don't want to do it once again,
because some people are used to some stuff in Java...

To me here the only good argument in favor of separate Match objects is
multi-thread uses. Simply because we already have Match object, just
hidden. If the best way to fix GRegex for multi-threading is a
separate match object, then it should be a separate match object. The
rest is really philosophy - if one thinks separate object in code
makes it something different conceptually, then he's wrong (it does
make API less convenient to use though).

A separate Match*er* object, which would actually have functionality of
current GRegex, is not a good idea, since it would only add an extra
object without any change in functionality, in particular it would not
be thread-safe (some_get_matcher() or some_new_matcher() would be
simply equivalent to current g_regex_copy()).

Best regards,
Yevgen

Re: Performance implications of GRegex structure

2007-03-17 Thread Owen Taylor

On Sat, 2007-03-17 at 16:19 +0100, Marco Barisione wrote:
> On Sat, 17/03/2007 at 10.07 -0400, Matthias Clasen wrote:
> > Btw, one thing we might want to consider doing (regardless if we go
> > with separate pattern and matcher objects) is to make the pattern
> > optimization an optional part of the constructor rather than a
> > separate
> > function. That will ensure that the pattern is truly immutable after
> > construction. 
> 
> I prefer to keep it a separate function, what are the advantages of
> having a truly immutable GRegex? Note that the _optimize function is
> already thread-safe.

As long as it is thread safe (and it looks like it is) I don't think
there is a big advantage. Still, I would wonder why you don't have
"optimize" just as another compile flag. PCRE can't do that because
of the allocation of an extra block of memory, but that doesn't
affect GRegex.

- Owen

P.S. - The docs could really use more specific guidance on optimize;
 I'd say something like "call optimize() (use the OPTIMIZE flag) if
 you are going to match the regular expression against large amounts
 of text. Optimization provides faster operation in that case, at the
 expense of using more memory and slightly greater compilation time."
 The current docs will make it very hard for people to know whether they
 should optimize or not. 

P.P.S. - Though unless it actually provides a 50% or more speedup in
 matching and a 50% slowdown in compilation, I'm not sure it's worth
 making people decide whether to optimize or not.

P.P.P.S. - I'm very glad you didn't call it _study(), since it is
 something entirely different than Perl's study() function. What was
 the PCRE author thinking with pcre_study?

Re: Performance implications of GRegex structure

2007-03-17 Thread Owen Taylor

On Sat, 2007-03-17 at 16:14 +0100, Marco Barisione wrote:
> I opened bug #419368[1] to track this issue. The API used by Owen in the
> examples could be inefficient in some cases, so in the next few days I'm
> going to think of a usable and efficient API.
> What should I call the match object? GRegexMatcher? GMatcher? GMatch?
> GRegexMatch?

The choice of whether it's "Matcher" or "Match" depends on whether
it is a Java-style "Matcher" object or a Python-style "Match" object - 
that is, whether it is created before matching, or returned in the
case of a successful match.

(http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html,
http://docs.python.org/lib/match-objects.html)

My feeling was that the Java style object was a better fit, because
it allows for iteration through matches, as you currently
have with g_regex_match_next().

I don't think there is a strong argument either way for GRegexMatcher
vs. GMatcher. GMatcher produces shorter function names and less
typing...

> Owen: what exactly should G_STATIC_REGEX_INIT do?

I was imagining:

struct _GStaticRegex {
    GOnce once;
    GRegex *regex;
    const gchar *pattern;
    GRegexCompileFlags flags;
};

#define G_STATIC_REGEX_INIT(pattern, flags) { \
    G_ONCE_INIT, \
    NULL, \
    pattern, \
    flags \
}

- Owen

Re: Performance implications of GRegex structure

2007-03-17 Thread Marco Barisione

On Sat, 17/03/2007 at 10.07 -0400, Matthias Clasen wrote:
> Btw, one thing we might want to consider doing (regardless if we go
> with separate pattern and matcher objects) is to make the pattern
> optimization an optional part of the constructor rather than a
> separate
> function. That will ensure that the pattern is truly immutable after
> construction. 

I prefer to keep it a separate function, what are the advantages of
having a truly immutable GRegex? Note that the _optimize function is
already thread-safe.

-- 
Marco Barisione
http://www.barisione.org/

Re: Performance implications of GRegex structure

2007-03-17 Thread Marco Barisione

I opened bug #419368[1] to track this issue. The API used by Owen in the
examples could be inefficient in some cases, so in the next few days I'm
going to think of a usable and efficient API.
What should I call the match object? GRegexMatcher? GMatcher? GMatch?
GRegexMatch?

Owen: what exactly should G_STATIC_REGEX_INIT do?

I opened two other bug reports for minor problems with GRegex (#419371
[2], #419376 [3]); I will fix them after fixing bug #419368.


[1] http://bugzilla.gnome.org/show_bug.cgi?id=419368
[2] http://bugzilla.gnome.org/show_bug.cgi?id=419371
[3] http://bugzilla.gnome.org/show_bug.cgi?id=419376

-- 
Marco Barisione
http://www.barisione.org/

Re: Performance implications of GRegex structure

2007-03-17 Thread Owen Taylor

On Fri, 2007-03-16 at 21:15 -0500, Yevgen Muntyan wrote:
> Matthias Clasen wrote:
> > On 3/16/07, Marco Barisione <[EMAIL PROTECTED]> wrote:
> >
> >   
> >> BTW if you want I can split GRegex in two separate objects.
> >> 
> >
> > Since that seems to be the overwhelming preference,
> "overwhelming"?
> >  that might
> > be a good idea. I hope this shouldn't be too bad, since GRegex
> > is already split into pattern and match objects, internally.
> >   
> Exactly, internally. Compare
> 
> return g_regex_match(re, "foobar", 0, 0);
> 
> and
> 
> Match *m;
> gboolean result;
> m = g_regex_match(re, "foobar", 0, 0);
> result = m != NULL;
> g_regex_match_free(m);
> return result;

This is certainly a common case, and it's important to optimize
for it. But I'd rather see it optimized with a convenience function
than by forcing an (in my opinion) unnatural structure on GRegex.

As said in my reply to Marco, it's certainly very possible to 
have:

  return g_regex_matches(re, "foobar", -1, 0);

Without storing state in the regular expression.

- Owen




Re: Performance implications of GRegex structure

2007-03-17 Thread Owen Taylor

On Fri, 2007-03-16 at 21:30 +0100, Marco Barisione wrote:
> On Thu, 15/03/2007 at 10.18 -0400, Owen Taylor wrote:
> > But looking over the header file, there is something that puzzles me
> > about the way that it's set up: there is no distinction between a
> > "pattern/regular expression" object and a match/matcher object.
> 
> The internal code in GRegex was deeply modified but the API is quite
> similar to the original one written by Scott Wimer and then modified by
> Matthias Clasen, so I kept a single GRegex object but with lots of
> doubts.
> 
> In the end I decided to keep a single object because I prefer this
> approach when using languages without a garbage collector and because
> QRegExp (the equivalent object in Qt) is a single object.
> 
> This matter was brought up on the mailing list and in bugzilla but only
> Havoc Pennington and Yevgen Muntyan expressed their opinion saying that
> they prefer a single object.

I apologize for not speaking up on the bugzilla bug. I must admit that
though I saw the discussion, I didn't really pay a lot of attention
until the header file appeared in CVS.

I certainly appreciate the arguments for convenience in C; it's a valid
concern. But I don't think we should let convenience be the overriding
factor over everything else; after all, the user *is* writing in C,
so convenience almost certainly wasn't utmost on their mind ;-)

If we can identify the most common patterns of usage, I think we can
add convenience functions that make usage of an immutable pattern object
almost as convenient as the current GRegex.

You can have functions like:

 if (g_regex_matches(regex, str, -1, 0))
     ...

 if (g_regex_get_matches(regex, str, -1, 0,
                         0, &whole_match,
                         1, &first_substring,
                         -1))
     ...

 if (g_regex_get_named_matches(regex, str, -1, 0,
                               "firstName", &first_name,
                               "lastName", &last_name,
                               NULL))
     ...

The first two cover 98% of all cases where I've ever used a regular
expression ... I either want a boolean match / doesn't match, or I
want to match against a pattern and, if it succeeds, do something with
several substrings.
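
The two shapes described above can be sketched roughly as follows. This is
only an illustration written against POSIX <regex.h>, not the GRegex API
under discussion; re_matches(), re_get_matches() and MAX_GROUPS are
hypothetical names, and the pattern is assumed to have been compiled
without REG_NOSUB. The point is that the compiled pattern is passed as
const and all match state stays on the caller's stack.

```c
#include <assert.h>
#include <regex.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUPS 16  /* illustrative limit, not taken from the thread */

/* Boolean match: the compiled pattern is only read, never modified. */
static int
re_matches(const regex_t *re, const char *str)
{
    assert(re != NULL && str != NULL);
    return regexec(re, str, 0, NULL, 0) == 0;
}

/* Match once and copy out selected groups.  The variable arguments are
 * (int group, char **out) pairs terminated by a group number of -1.
 * Returns 1 on a match, 0 otherwise; the caller frees each *out. */
static int
re_get_matches(const regex_t *re, const char *str, ...)
{
    regmatch_t m[MAX_GROUPS];  /* per-call match state lives on the stack */
    va_list ap;
    int group;

    assert(re != NULL && str != NULL);
    if (regexec(re, str, MAX_GROUPS, m, 0) != 0)
        return 0;

    va_start(ap, str);
    while ((group = va_arg(ap, int)) != -1) {
        char **out = va_arg(ap, char **);
        *out = NULL;
        if (group >= 0 && group < MAX_GROUPS && m[group].rm_so != -1) {
            size_t len = (size_t)(m[group].rm_eo - m[group].rm_so);
            *out = malloc(len + 1);
            if (*out != NULL) {
                memcpy(*out, str + m[group].rm_so, len);
                (*out)[len] = '\0';
            }
        }
    }
    va_end(ap);
    return 1;
}
```

Because nothing is written back into the regex_t, the same compiled pattern
could be handed to either function from several call sites without a
separate matcher object being visible to the caller.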

That to me, would relegate the matcher object to cases where the
annoyance of an extra object is small compared to the complexity of
the operation.

You could also take the above functions and have the same thing for:

 - Strings (like the current _simple() convenience functions)
 - Something like my GStaticRegex proposal

As always, the question about convenience functions is "where do you
stop?"...

- Owen


Re: Performance implications of GRegex structure

2007-03-17 Thread mark

On Sat, Mar 17, 2007 at 12:48:15AM -0500, Yevgen Muntyan wrote:
> I am suggesting something which is currently used in real code.
> Simple, nice, and working. *If* it's not as good as it should be,
> *then* it should be changed.

It's not as good as it should be. :-)

1) In terms of interface, it does not appear to be designed to be
   thread-safe, and the documentation does not hint that it is
   thread-safe. Some sort of guarantee would have to be made, such
   as "the match function does not modify GRegexp state". PCRE makes
   this guarantee:

   The  compiled form of a regular expression is not altered during
   matching, so the same compiled pattern can safely be used by
   several threads at once.

2) In terms of functionality, matching for anything beyond the very
   simplest cases *does* have state that should be kept and
   maintained. This state should not be kept in GRegexp, because of 1).
   It should be kept in a GMatcher object.
   This state includes:
- the compiled regexp being used
- the string being matched against
- the offset into the string that the next match will begin
- the captured parameters from the last match
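
As a rough sketch of the state listed above, here is what such a matcher
could look like on top of POSIX <regex.h>. The Matcher type,
matcher_init(), matcher_next() and MATCHER_MAX_GROUPS are hypothetical
names used for illustration only; this is not a proposed GLib API.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MATCHER_MAX_GROUPS 8  /* illustrative limit */

/* All mutable matching state lives here; the regex_t is only read. */
typedef struct {
    const regex_t *re;       /* the compiled regexp being used */
    const char    *subject;  /* the string being matched against */
    size_t         offset;   /* where the next match will begin */
    int            finished; /* set once no further match is possible */
    regmatch_t     groups[MATCHER_MAX_GROUPS]; /* captures from the last
                                                  match, relative to subject */
} Matcher;

static void
matcher_init(Matcher *m, const regex_t *re, const char *subject)
{
    assert(m != NULL && re != NULL && subject != NULL);
    m->re = re;
    m->subject = subject;
    m->offset = 0;
    m->finished = 0;
}

/* Advance to the next non-overlapping match; returns 1 if one was found. */
static int
matcher_next(Matcher *m)
{
    const char *start;
    int i;

    if (m->finished)
        return 0;
    start = m->subject + m->offset;
    if (regexec(m->re, start, MATCHER_MAX_GROUPS, m->groups,
                m->offset > 0 ? REG_NOTBOL : 0) != 0) {
        m->finished = 1;
        return 0;
    }
    /* regexec reports offsets relative to start; rebase onto subject. */
    for (i = 0; i < MATCHER_MAX_GROUPS; i++) {
        if (m->groups[i].rm_so != -1) {
            m->groups[i].rm_so += (regoff_t)m->offset;
            m->groups[i].rm_eo += (regoff_t)m->offset;
        }
    }
    if (m->groups[0].rm_eo > m->groups[0].rm_so)
        m->offset = (size_t)m->groups[0].rm_eo;
    else if (m->subject[m->groups[0].rm_eo] != '\0')
        m->offset = (size_t)m->groups[0].rm_eo + 1; /* step past empty match */
    else
        m->finished = 1;  /* empty match at the very end of the string */
    return 1;
}
```

Two Matcher values can walk different strings with the same compiled
pattern at the same time, since neither writes to the regex_t.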

This isn't to say that a GMatcher object should be *required*. There is
nothing wrong with having convenience functions for the simple cases.
This is why I suggested:

> > If you want simple - give up on GRegexp altogether. Try something like:
> > if (g_string_regexp_match(s, pattern))
> > ...
> > If it happens to do some sort of internal caching - great. If not?
> > At least it is simple. For Java, this maps to:
> > if (string.matches(pattern))
> > ...
> Should I say "if you want cool - give up on C altogether"? I guess
> not. I mean, it's not like I want simplicity at all costs. No need to
> provide fancy-shmancy java stuff. It's a real serious question: what
> would be the uses of GRegex, where it would be convenient
> to have separate Match structure and where it would not.

There should be simple wrapper functions for simple uses. If there was
a 'gboolean g_string_regexp_match(GString* string, gchar* pattern)' or a
'gboolean g_strmatch(gchar* string, gchar* pattern)', I would probably
use them. When using regular expressions, I find myself in two positions:

1) I want to express a complicated match using very simple code.
   I am not concerned about performance. This would be best served
   by the above two functions, that would require no object-handling
   at all (at least for my program). If the patterns happened to be
   cached, great. If not, at least my implementation is simple, which
   represents: a) fewer product defects, and b) easier to maintain.

2) I have a need for a heavy-duty performance-optimized match function,
   in which case my needs are beyond the very simple, and I require
   it to perform well, and scale. For this, having two objects works
   well in other languages. The compilation is performed as early as
   required. The matching can then be freely performed by any thread
   without worry. For example, suppose I have a multi-threaded server
   with a thread-pool to manage active connections. I may use a regular
   expression pattern to ensure that various parts of the protocol
   are syntactically correct, and to extract data from the records
   sent by the client. I do not wish to recompile the pattern every
   time a new client is accepted, and I do not want to have a mutex
   lock around any use of the GRegexp. I want any number of clients
   to be at any number of match states, on the same pre-compiled
   regular expression.

An important factor in all of this, is that regular expressions are
powerful because they allow complex expressions to be written safely.
Without a good regular expression library available to me in C, I've
found it necessary to write my own pattern matching code. This is very
error-prone. Even if I get it right - who is to say somebody who changes
my code will keep it right?

> > Having a matcher object serves more purposes than just thread
> > scaleability. What if I wish to walk through the string, finding
> > each match, processing each match as it is found?
> Possible.
> > Why should I
> > have to search the entire string before I can display the first
> > match?
> You can't do the contrary - find all matches and display them.
> (I guess Marco should know better, I've never done stuff like
> this)
> > In Perl, this functionality is available as:
> > while ($scalar =~ /(pattern)/g) {
> > ... each match ...
> > }
> > With a Matcher o

Re: Performance implications of GRegex structure

2007-03-17 Thread Matthias Clasen

On 3/17/07, Yevgen Muntyan <[EMAIL PROTECTED]> wrote:

> >  Why should I
> > have to search the entire string before I can display the first
> > match?
> >
> You can't do the contrary - find all matches and display them.
> (I guess Marco should know better, I've never done stuff like
> this)
> > In Perl, this functionality is available as:
> >
> > while ($scalar =~ /(pattern)/g) {
> > ... each match ...
> > }
> >
> > With a Matcher object, the same can be accomplished in a thread-safe
> > manner.
> >
> Could you show how it would be done (i.e. show C code)?

With GRegex you call g_regex_match_next() until it returns FALSE to
find all non-overlapping matches.

Since a GRegex is a combination of pattern+matcher, it is not too surprising
that you can do all the things that you can do with a separate matcher
with a GRegex, too.

I can certainly appreciate the arguments of both sides here:

pro single object: makes for simpler application code in simple cases

contra single object: complicated semantics of g_regex_copy() and a
somewhat unnatural pattern for sharing between threads.

Personally, I'd tend towards giving the contra arguments more weight,
since in many of the simple cases you can probably either get away
with something like g_regex_match_simple() (which saves you dealing
with the single GRegex object, too), or you can write your own similar
wrapper that handles the match object for you.

Btw, one thing we might want to consider doing (regardless if we go
with separate pattern and matcher objects) is to make the pattern
optimization an optional part of the constructor rather than a separate
function. That will ensure that the pattern is truly immutable after
construction.

Matthias

Re: Performance implications of GRegex structure

2007-03-16 Thread Yevgen Muntyan

Hey,

I looked at gtksourceview and its patterns. The syntax
highlighting engine uses regular expressions with up
to 56 subpatterns (the length of the patterns was the reason
for egg_regex_ref()), which amounts to a 670-byte array
to store offsets. The match structure in this case is
some 40 bytes plus those 670 bytes. I am not sure whether
that's a lot. Is it?
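
For scale, here is a back-of-the-envelope calculation. It assumes the
offsets are kept in a PCRE-style output vector, which needs three ints per
reported substring (start, end, and one of workspace), with the whole match
counted as one substring, and it assumes 4-byte ints. The helper names are
made up for illustration, and how the 56 above was counted is a guess.

```c
#include <assert.h>
#include <stddef.h>

/* Number of ints in a PCRE-style offset vector: three per substring,
 * for every capture group plus the whole match. */
static size_t
offset_vector_ints(size_t capture_groups)
{
    assert(capture_groups < 65536);  /* sanity bound for this sketch */
    return (capture_groups + 1) * 3;
}

/* The same size expressed in bytes on the current platform. */
static size_t
offset_vector_bytes(size_t capture_groups)
{
    return offset_vector_ints(capture_groups) * sizeof(int);
}
```

With 55 capture groups plus the whole match (56 substrings) this gives 168
ints, or 672 bytes with 4-byte ints, which is in line with the figure
quoted above.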

Yevgen


Re: Performance implications of GRegex structure

2007-03-16 Thread Yevgen Muntyan

[Mark, I apologize, I accidentally sent it to you in private]

[EMAIL PROTECTED] wrote:
> On Fri, Mar 16, 2007 at 09:15:37PM -0500, Yevgen Muntyan wrote:
>   
>> I do understand that a separate match object is a good idea.
>> But "separate match object in C API is a good idea" is questionable.
>> While thread-safety is important, it doesn't seem likely that a single
>> GRegex object will be used from different threads to match something
>> in *many* cases. Maybe it makes sense to add thread safety
>> in some other way? The single-object version is certainly more
>> convenient than a version with a separate match object.
>> By the way, I don't know about Java, but having re.match()
>> return an object very often gets in your way in Python (for
>> different reasons but it does say something about "it's done
>> so in Python").
>> 
>
> It looks to me like you are suggesting the worst of all worlds.
> Not thread-safe, not scaleable, and not simple.
>   

I am suggesting something which is currently used in real code.
Simple, nice, and working. *If* it's not as good as it should be,
*then* it should be changed.

> If you want simple - give up on GRegexp altogether. Try something like:
>
> if (g_string_regexp_match(s, pattern))
> ...
>
> If it happens to do some sort of internal caching - great. If not?
> At least it is simple. For Java, this maps to:
>
> if (string.matches(pattern))
> ...
>   

Should I say "if you want cool - give up on C altogether"? I guess
not. I mean, it's not like I want simplicity at all costs. No need to
provide fancy-shmancy java stuff. It's a real serious question: what
would be the uses of GRegex, where it would be convenient
to have separate Match structure and where it would not.

> Having a matcher object serves more purposes than just thread
> scaleability. What if I wish to walk through the string, finding
> each match, processing each match as it is found?
Possible.
>  Why should I
> have to search the entire string before I can display the first
> match?
>   
You can't do the contrary - find all matches and display them.
(I guess Marco should know better, I've never done stuff like
this)
> In Perl, this functionality is available as:
>
> while ($scalar =~ /(pattern)/g) {
> ... each match ...
> }
>
> With a Matcher object, the same can be accomplished in a thread-safe
> manner.
>   
Could you show how it would be done (i.e. show C code)? And
what's "Matcher"? Is it something that performs matching, or
is it a result of search (match)? I guess it could make sense to
collect all results after searching repeatedly, but it doesn't seem
to be what you are talking about.

Yevgen



Re: Performance implications of GRegex structure

2007-03-16 Thread mark

On Fri, Mar 16, 2007 at 09:15:37PM -0500, Yevgen Muntyan wrote:
> I do understand that a separate match object is a good idea.
> But "separate match object in C API is a good idea" is questionable.
> While thread-safety is important, it doesn't seem likely that a single
> GRegex object will be used from different threads to match something
> in *many* cases. Maybe it makes sense to add thread safety
> in some other way? The single-object version is certainly more
> convenient than a version with a separate match object.
> By the way, I don't know about Java, but having re.match()
> return an object very often gets in your way in Python (for
> different reasons but it does say something about "it's done
> so in Python").

It looks to me like you are suggesting the worst of all worlds.
Not thread-safe, not scaleable, and not simple.

If you want simple - give up on GRegexp altogether. Try something like:

if (g_string_regexp_match(s, pattern))
...

If it happens to do some sort of internal caching - great. If not?
At least it is simple. For Java, this maps to:

if (string.matches(pattern))
...

Having a matcher object serves more purposes than just thread
scaleability. What if I wish to walk through the string, finding
each match, processing each match as it is found? Why should I
have to search the entire string before I can display the first
match?

In Perl, this functionality is available as:

while ($scalar =~ /(pattern)/g) {
... each match ...
}

With a Matcher object, the same can be accomplished in a thread-safe
manner.

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


Re: Performance implications of GRegex structure

2007-03-16 Thread Yevgen Muntyan

Matthias Clasen wrote:
> On 3/16/07, Marco Barisione <[EMAIL PROTECTED]> wrote:
>
>   
>> BTW if you want I can split GRegex in two separate objects.
>> 
>
> Since that seems to be the overwhelming preference,
"overwhelming"?
>  that might
> be a good idea. I hope this shouldn't be too bad, since GRegex
> is already split into pattern and match objects, internally.
>   
Exactly, internally. Compare

return g_regex_match(re, "foobar", 0, 0);

and

Match *m;
gboolean result;
m = g_regex_match(re, "foobar", 0, 0);
result = m != NULL;
g_regex_match_free(m);
return result;

I do understand that a separate match object is a good idea.
But "separate match object in C API is a good idea" is questionable.
While thread-safety is important, it doesn't seem likely that a single
GRegex object will be used from different threads to match something
in *many* cases. Maybe it makes sense to add thread safety
in some other way? The single-object version is certainly more
convenient than a version with a separate match object.
By the way, I don't know about Java, but having re.match()
return an object very often gets in your way in Python (for
different reasons but it does say something about "it's done
so in Python").

Best regards,
Yevgen


Re: Performance implications of GRegex structure

2007-03-16 Thread Matthias Clasen

On 3/16/07, Marco Barisione <[EMAIL PROTECTED]> wrote:

> BTW if you want I can split GRegex in two separate objects.

Since that seems to be the overwhelming preference, that might
be a good idea. I hope this shouldn't be too bad, since GRegex
is already split into pattern and match objects, internally.

If you are doing this, it might also be nice to expand the doc section
on using GRegex with threads with an example that shows how
to share a compiled pattern between multiple threads. The docs
currently say:

If you have two threads manipulating the same #GRegex, they must use a
lock to synchronize their operation, as these functions are not threadsafe.
Creating and manipulating different #GRegex structures from
different threads is not a problem.


Matthias

Re: GRegex(win32) : 500 tests passed, 3 failed

2007-03-16 Thread Marco Barisione

On Thu, 15/03/2007 at 18.41 +0100, Hans Breuer wrote:
> with only small modifications I was able to compile GRegex with msvc,
> thanks for providing an almost working makefile.msc ;-)
> [...]
> But now for the question: are these 3 failed specific to my build so I
> should investigate them further?

It's my fault, I wrote makefile.msc (without testing it) before the
release of PCRE 7.
PCRE 6.x can recognize as a newline one of \n, \r or \r\n. PCRE 7.x
added the ability to match any newline character, so I changed the
default value from 10 (\n) to -1 (PCRE_NEWLINE_ANY) in Makefile.am but
not in makefile.msc.

Sorry :)

-- 
Marco Barisione
http://www.barisione.org/


Re: Performance implications of GRegex structure

2007-03-16 Thread Marco Barisione

On Thu, 15/03/2007 at 10.18 -0400, Owen Taylor wrote:
> But looking over the header file, there is something that puzzles me
> about the way that it's set up: there is no distinction between a
> "pattern/regular expression" object and a match/matcher object.

The internal code in GRegex was deeply modified but the API is quite
similar to the original one written by Scott Wimer and then modified by
Matthias Clasen, so I kept a single GRegex object but with lots of
doubts.

In the end I decided to keep a single object because I prefer this
approach when using languages without a garbage collector and because
QRegExp (the equivalent object in Qt) is a single object.

This matter was brought up on the mailing list and in bugzilla but only
Havoc Pennington and Yevgen Muntyan expressed their opinion saying that
they prefer a single object.

BTW if you want I can split GRegex in two separate objects.

-- 
Marco Barisione
http://www.barisione.org/


Re: Performance implications of GRegex structure

2007-03-16 Thread mark

On Fri, Mar 16, 2007 at 08:20:11AM +0100, Mathieu Lacage wrote:
> On Thu, 2007-03-15 at 10:56 -0400, Owen Taylor wrote:
> > Well, I could imagine (maybe, barely) that someone could show me numbers
> > that showed that with a variety of long and complicated regular
> > expressions, compiling them was still 10x as fast as matching them
> > against very short strings.
> > 
> > But in general, yes, part of my concern is that there are situations
> > where you are going to matching the same regular expression against
> > thousands of strings, and in that situation, unless compilation is very,
> > very, fast, the need to repeatedly recompile will inevitably produce
> > measurable overhead.
> If this were to happen, could you not just put together a per-thread
> 5-entry double hash table to cache the 5 most recently used regex
> strings ? It really seems like a no-brainer. Am I missing something ?

No. If so-required (for example, if implemented using regcomp()/regexec()),
this would be one possible way to emulate the Matcher object. It's silly,
in that GLIB-users should not need to each work their own way around it.
It's also silly, in that a PCRE-based implementation would not require
this hackery.

I already made my suggestion for emulation with libraries that don't
provide both. Pattern becomes a pool object. Matcher grabs from the pool,
or allocates for the pool, returning it to the pool on completion. Steady
state is reached. In your no-brainer example, you have assumed that there
are only 5 common RE's in use. If there were 6 that cycled, it would break.
Having each Pattern have its own pool allows for a minimum allocation.
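
A minimal sketch of that pool idea, assuming an underlying library whose
compiled form cannot be shared between threads. POSIX <regex.h> and a
pthread mutex stand in for such a library here, and Pattern, PoolEntry and
the pattern_*() functions are hypothetical names for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <regex.h>
#include <stdlib.h>

/* One compiled instance; assumed NOT safe to share between threads. */
typedef struct PoolEntry {
    regex_t           compiled;
    struct PoolEntry *next;
} PoolEntry;

/* The shareable "Pattern": source text plus a pool of idle instances. */
typedef struct {
    const char      *source;
    int              cflags;
    pthread_mutex_t  lock;
    PoolEntry       *idle;
} Pattern;

static void
pattern_init(Pattern *p, const char *source, int cflags)
{
    assert(p != NULL && source != NULL);
    p->source = source;
    p->cflags = cflags;
    p->idle = NULL;
    pthread_mutex_init(&p->lock, NULL);
}

/* Take an idle instance, or compile a new one if the pool is empty. */
static PoolEntry *
pattern_acquire(Pattern *p)
{
    PoolEntry *e;

    pthread_mutex_lock(&p->lock);
    e = p->idle;
    if (e != NULL)
        p->idle = e->next;
    pthread_mutex_unlock(&p->lock);

    if (e == NULL) {
        e = malloc(sizeof *e);
        if (e != NULL && regcomp(&e->compiled, p->source, p->cflags) != 0) {
            free(e);
            e = NULL;
        }
    }
    return e;
}

/* Return an instance to the pool so that the next caller can reuse it. */
static void
pattern_release(Pattern *p, PoolEntry *e)
{
    pthread_mutex_lock(&p->lock);
    e->next = p->idle;
    p->idle = e;
    pthread_mutex_unlock(&p->lock);
}

/* Free every idle instance; all acquired ones must have been released. */
static void
pattern_destroy(Pattern *p)
{
    while (p->idle != NULL) {
        PoolEntry *e = p->idle;
        p->idle = e->next;
        regfree(&e->compiled);
        free(e);
    }
    pthread_mutex_destroy(&p->lock);
}
```

The lock is held only while the free list is touched, never during
compilation or matching, and the pool grows no larger than the peak number
of simultaneous users of that one pattern.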

Cheers,
mark



Re: Performance implications of GRegex structure

2007-03-16 Thread mark

On Fri, Mar 16, 2007 at 02:18:23PM -0400, Owen Taylor wrote:
> On Fri, 2007-03-16 at 10:57 -0700, David Moffatt wrote:
> >   char *
> >   get_leading_digits(const char *str)
> >   {
> >  static GRegex *regex = NULL;
> >  char *result = NULL;
> > 
> >  if (!regex)
> >  regex = g_regex_new("^\\d+", 0, 0, NULL);
> > 
> >  if (g_regex_match(regex, str, 0))
> >  result = g_regex_fetch(regex, 0);
> > 
> >  return result;
> >   }
> > 
> > That code bothers me a fair bit; not so much because it's
> > not thread safe, but because it exhibits a pattern that is *inherently*
> > not thread safe or re-entrant
> > 
> > Actually if you look at it carefully you realize it is thread safe and
> > depending on what you call the pattern it too is thread safe.  I think
> > this is a great way of doing things.  I work in embedded so we are very
> > sensitive to things like library global constructor creating regex.
> > That is overhead you pay whether or not you use the function.  That is
> > also overhead that you pay at launch time which for us is a big deal.
> 
> OK, maybe I'm just not looking at it right, but to me, it's not 
> thread safe two ways:
> 
>  A) The initialization of the regex isn't thread safe. If multiple
> threads race to construct the regex, there will be a memory
> leak.
>  B) GRegex has internal state that is modified by g_regex_match()

Agree with both.

> None of the examples I showed used global constructors.

Correct. Overhead is only on first use.

Dave: Your detector is not sensitive enough I think... :-)

Cheers,
mark



RE: Performance implications of GRegex structure

2007-03-16 Thread Owen Taylor

On Fri, 2007-03-16 at 10:57 -0700, David Moffatt wrote:
>   char *
>   get_leading_digits(const char *str)
>   {
>  static GRegex *regex = NULL;
>  char *result = NULL;
> 
>  if (!regex)
>  regex = g_regex_new("^\\d+", 0, 0, NULL);
> 
>  if (g_regex_match(regex, str, 0))
>  result = g_regex_fetch(regex, 0);
> 
>  return result;
>   }
> 
> That code bothers me a fair bit; not so much because it's
> not thread safe, but because it exhibits a pattern that is *inherently*
> not thread safe or re-entrant
> 
> Actually if you look at it carefully you realize it is thread safe and
> depending on what you call the pattern it too is thread safe.  I think
> this is a great way of doing things.  I work in embedded so we are very
> sensitive to things like library global constructor creating regex.
> That is overhead you pay whether or not you use the function.  That is
> also overhead that you pay at launch time which for us is a big deal.

OK, maybe I'm just not looking at it right, but to me, it's not 
thread safe two ways:

 A) The initialization of the regex isn't thread safe. If multiple
threads race to construct the regex, there will be a memory
leak.
 B) GRegex has internal state that is modified by g_regex_match()
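
Point A can be avoided without a lock. The following is a minimal sketch
using POSIX <regex.h> and C11 atomics as stand-ins, since it does not
depend on any particular GRegex API; get_digits_regex() and
leading_digits_len() are hypothetical names. Each racing thread compiles a
candidate, one of them publishes it with a compare-and-swap, and the
others free their copy, so nothing leaks.

```c
#include <assert.h>
#include <regex.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Shared compiled pattern, published at most once. */
static _Atomic(regex_t *) digits_re = NULL;

/* Return the shared compiled pattern, compiling it on first use.
 * Returns NULL only if allocation or compilation fails. */
static const regex_t *
get_digits_regex(void)
{
    regex_t *re = atomic_load(&digits_re);
    regex_t *expected = NULL;

    if (re != NULL)
        return re;

    re = malloc(sizeof *re);
    if (re == NULL)
        return NULL;
    if (regcomp(re, "^[0-9]+", REG_EXTENDED) != 0) {
        free(re);
        return NULL;
    }
    if (!atomic_compare_exchange_strong(&digits_re, &expected, re)) {
        /* Another thread won the race: discard our copy, use theirs. */
        regfree(re);
        free(re);
        re = expected;
    }
    return re;
}

/* Length of the leading run of digits in str, or 0 if there is none.
 * Match state is a local regmatch_t, so point B does not arise here. */
static size_t
leading_digits_len(const char *str)
{
    const regex_t *re = get_digits_regex();
    regmatch_t m;

    assert(str != NULL);
    if (re == NULL || regexec(re, str, 1, &m, 0) != 0)
        return 0;
    return (size_t)(m.rm_eo - m.rm_so);
}
```

The cost of this approach is that a pattern may be compiled more than once
under contention. A once-style primitive that blocks the losing threads
avoids the wasted work at the price of a lock on the first call.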

None of the examples I showed used global constructors.

- Owen




RE: Performance implications of GRegex structure

2007-03-16 Thread David Moffatt

>   char *
>   get_leading_digits(const char *str)
>   {
>  static GRegex *regex = NULL;
>  char *result = NULL;
> 
>  if (!regex)
>  regex = g_regex_new("^\\d+", 0, 0, NULL);
> 
>  if (g_regex_match(regex, str, 0))
>  result = g_regex_fetch(regex, 0);
> 
>  return result;
>   }
> 
> That code bothers me a fair bit; not so much because it's
> not thread safe, but because it exhibits a pattern that is *inherently*
> not thread safe or re-entrant

Actually if you look at it carefully you realize it is thread safe and
depending on what you call the pattern it too is thread safe.  I think
this is a great way of doing things.  I work in embedded so we are very
sensitive to things like library global constructor creating regex.
That is overhead you pay whether or not you use the function.  That is
also overhead that you pay at launch time which for us is a big deal.



Re: Performance implications of GRegex structure

2007-03-16 Thread Behdad Esfahbod

On Fri, 2007-03-16 at 09:36 -0400, Morten Welinder wrote:
> Is there a guarantee that for GRegex (unlike, say, GDate) multiple
> threads can use
> the same object at the same time?
> 
> I.e., two threads cannot call g_date_get_weekday on the same date, so why
> would we expect that two threads can call g_regex_copy or anything like it?

There are the following reasons to have separate regex and match
objects:

1) Less prone to reentrancy problems
2) Less prone to writing inefficient code
3) More familiar to people who have used any scripting language
4) The GRegex object already has separate match and regex objects
   internally, and has to copy them differently to get the best
   performance possible.  This is at best a hack.

So, what's the deal?  The question is not "can we please change it?".
It's "why is it this way?".

And as someone already showed, safe simple wrappers can be written
around the regex object for simple boolean queries.

> Morten

-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin, 1759


Re: Performance implications of GRegex structure

2007-03-16 Thread Dimi Paun


On Fri, March 16, 2007 08:28, Owen Taylor wrote:
>   char *
>   get_leading_digits(const char *str)
>   {
>  GStaticRegex regex = G_STATIC_REGEX_INIT("^\\d+", 0);
>  GMatcher *matcher;
>  char *result = NULL;
>
>  matcher = g_matcher_new_static(&regex, str, 0);
>  if (g_matcher_matches(matcher))
>  result = g_matcher_get(regex, 0);
>
>  g_matcher_free(matcher);
>
>  return result;
>   }
>
> I'm not going to argue that the current GRegex API is unworkable,
> but I think it obscures the nature of the system - first you compile a
> regular expression, then you match against it - and that's going to make
> it harder for people to write correct, efficient code.

IMO the separation between regex and the matcher is so obvious
(IIRC Java and Python do it) that's not even worth discussing.

As for the examples, the last version seems the most readable.
And for the case where you're not interested in performance, it
can be simplified to:

  char *
  get_leading_digits(const char *str)
  {
 GStaticRegex regex = G_STATIC_REGEX_INIT("^\\d+", 0);
 return g_static_regex_matches(&regex, str);
  }

where g_static_regex_matches() is a helper defined as:

  char *
  g_static_regex_matches(const GStaticRegex *regex, const char *str)
  {
 GMatcher *matcher;
 char *result = NULL;

 matcher = g_matcher_new_static(regex, str, 0);
 if (g_matcher_matches(matcher))
 result = g_matcher_get(regex, 0);

 g_matcher_free(matcher);

 return result;
  }

IOW separating the compiled regex from the matcher does not have to
result in more complicated usage pattern for people that don't care
about performance.

-- 
Dimi Paun <[EMAIL PROTECTED]>
Lattica, Inc.



Re: Performance implications of GRegex structure

2007-03-16 Thread Morten Welinder

Is there a guarantee that for GRegex (unlike, say, GDate) multiple
threads can use
the same object at the same time?

I.e., two threads cannot call g_date_get_weekday on the same date, so why
would we expect that two threads can call g_regex_copy or anything like it?

Morten

Re: Performance implications of GRegex structure

2007-03-16 Thread Owen Taylor

On Thu, 2007-03-15 at 14:16 -0500, Yevgen Muntyan wrote:
> [Owen, I apologize, I hit Reply instead of Reply All]
> 
> Owen Taylor wrote:
> > So, the regular expression code has been committed to CVS finally. Yay!
> >
> > But looking over the header file, there is something that puzzles me
> > about the way that it's set up: there is no distinction between a
> > "pattern/regular expression" object and a match/matcher object.
> ...
> > Or to Javascript, Perl, etc. (Javascript and Perl hide the issue a bit
> > by having regular expression literals.) While I have never actually done
> > timings on the matter, I've always assumed that the reason that regular
> > expression API's are set up this way is compiling a regular expression
> > has a significant expense.
> 
> EggRegex contains two structures internally - Pattern and Match.
> egg_regex_copy() references Pattern and creates new Match structure.
> If you got a copy of EggRegex object using egg_regex_copy(), you
> can use it in another thread. Don't know how convenient it is in
> multi-thread setup. The idea was that it's inconvenient to create
> some match objects on heap.

If people will forgive a code-verbose mail, I want to show some examples
of how things look in practice.

So, start off with a simple function written with the current API:

  char *
  get_leading_digits(const char *str)
  {
     GRegex *regex = g_regex_new("^\\d+", 0, 0, NULL);
     char *result = NULL;

     /* g_regex_match() takes the regex as its first argument */
     if (g_regex_match(regex, str, 0))
         result = g_regex_fetch(regex, 0);

     g_regex_free(regex);

     return result;
  }

Short, simple, thread-safe, but inherently slow. We could write
the same thing with an API with an explicit match object:

  char *
  get_leading_digits(const char *str)
  {
     GRegex *regex = g_regex_new("^\\d+", 0, 0, NULL);
     GMatcher *matcher = g_matcher_new(regex, str, 0);
     char *result = NULL;

     /* the match result is read from the matcher, not the regex */
     if (g_matcher_matches(matcher))
         result = g_matcher_get(matcher, 0);

     g_matcher_free(matcher);
     g_regex_free(regex);

     return result;
  }

It gets two lines longer and slightly more cumbersome, but it's not
tremendously worse. But really, repeatedly compiling a regular
expression is a bad idea - something we don't want to encourage. So,
the simplest optimization of the original code looks something like:

  char *
  get_leading_digits(const char *str)
  {
     static GRegex *regex = NULL;
     char *result = NULL;

     if (!regex)
         regex = g_regex_new("^\\d+", 0, 0, NULL);

     /* match state is stored in the shared static regex */
     if (g_regex_match(regex, str, 0))
         result = g_regex_fetch(regex, 0);

     return result;
  }

That code bothers me a fair bit; not so much because it's not thread
safe, but because it exhibits a pattern that is *inherently* not
thread safe or re-entrant. People should have built-in reflexes
against writing such code. I'm much happier with the alternative:

  char *
  get_leading_digits(const char *str)
  {
     static GRegex *regex = NULL;
     GMatcher *matcher;
     char *result = NULL;

     if (!regex)
         regex = g_regex_new("^\\d+", 0, NULL);

     /* per-call match state lives in the matcher */
     matcher = g_matcher_new(regex, str, 0);
     if (g_matcher_matches(matcher))
         result = g_matcher_get(matcher, 0);

     g_matcher_free(matcher);

     return result;
  }

Even though it's longer and no more thread safe. It's not thread
safe, but can be made so without changing the basic pattern:

  gpointer
  get_leading_digits_regex(gpointer data)
  {
     return g_regex_new("^\\d+", 0, NULL);
  }

  char *
  get_leading_digits(const char *str)
  {
     static GOnce once = G_ONCE_INIT;
     /* g_once() runs the init function once and returns its result */
     GRegex *regex = g_once(&once, get_leading_digits_regex, NULL);
     GMatcher *matcher;
     char *result = NULL;

     matcher = g_matcher_new(regex, str, 0);
     if (g_matcher_matches(matcher))
         result = g_matcher_get(matcher, 0);

     g_matcher_free(matcher);

     return result;
  }
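
For comparison, the same compile-once shape can be sketched outside
GLib with POSIX pthread_once() and regcomp()/regexec(). This is only an
illustration under stated assumptions: it is not the proposed GRegex
API, the helper names are made up, the POSIX pattern "^[0-9]+" stands
in for "^\\d+", and it assumes the platform's regexec() may be called
concurrently on one compiled regex_t.

```c
#include <assert.h>
#include <pthread.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

static pthread_once_t digits_once = PTHREAD_ONCE_INIT;
static regex_t digits_regex;
static int digits_regex_ok;

/* Runs exactly once, no matter how many threads race to get here. */
static void
compile_digits_regex(void)
{
    digits_regex_ok = (regcomp(&digits_regex, "^[0-9]+", REG_EXTENDED) == 0);
}

/* Returns a malloc()ed copy of the leading digits of str, or NULL. */
char *
get_leading_digits(const char *str)
{
    regmatch_t m[1];    /* per-call match state lives on the stack */
    size_t len;
    char *result;

    assert(str != NULL);

    pthread_once(&digits_once, compile_digits_regex);
    if (!digits_regex_ok)
        return NULL;

    if (regexec(&digits_regex, str, 1, m, 0) != 0)
        return NULL;

    len = (size_t)(m[0].rm_eo - m[0].rm_so);
    result = malloc(len + 1);
    if (result == NULL)
        return NULL;
    memcpy(result, str + m[0].rm_so, len);
    result[len] = '\0';
    return result;
}
```

The compiled pattern is shared and never written after initialization;
everything that varies per call sits in local variables, which is the
property the matcher-based versions above are after.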

As Yevgen points out, we can do the same thing with the current GRegex
API, but I don't think the result is very readable. It isn't code that
"says what it does".

  gpointer
  get_leading_digits_regex(gpointer data)
  {
     return g_regex_new("^\\d+", 0, 0, NULL);
  }

  char *
  get_leading_digits(const char *str)
  {
     static GOnce once = G_ONCE_INIT;
     GRegex *template = g_once(&once, get_leading_digits_regex, NULL);
     GRegex *regex;
     char *result = NULL;

     /* the copy carries this call's match state */
     regex = g_regex_copy(template);
     if (g_regex_match(regex, str, 0))
         result = g_regex_fetch(regex, 0);

     g_regex_free(regex);

     return result;
  }

Now, a regular expression that you want to compile once and keep in a
static variable is actually the most common use of a regular
expression. So, maybe the right version of the function actually
looks like:

Re: Performance implications of GRegex structure

2007-03-16 Thread Nikolai Weibull

On 3/16/07, Mathieu Lacage <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-03-15 at 10:56 -0400, Owen Taylor wrote:
>
> > Well, I could imagine (maybe, barely) that someone could show me numbers
> > that showed that with a variety of long and complicated regular
> > expressions, compiling them was still 10x as fast as matching them
> > against very short strings.
> >
> > But in general, yes, part of my concern is that there are situations
> > where you are going to be matching the same regular expression against
> > thousands of strings, and in that situation, unless compilation is very,
> > very, fast, the need to repeatedly recompile will inevitably produce
> > measurable overhead.
>
> If this were to happen, could you not just put together a per-thread
> 5-entry double hash table to cache the 5 most recently used regex
> strings ? It really seems like a no-brainer. Am I missing something ?

Now it's beginning to get out of hand.  What's conceptually so
wrong/difficult with having one object for the regex and one object
for the matches generated when matching this regex against some input?
 It's how basically every other library/language does it.  It's easy
to understand and easy to use.

Here follows a way too academic description of why it doesn't make
sense to keep the matches with the regex.

Here's a diagram of how pattern matching is done at a very abstract level:

         Pattern
            |
            v
Pattern Matcher Generator
            |
            v
Input -> Matcher -> Yes/No

The Pattern is the regular expression string, for example, "a*b".  The
Pattern Matcher Generator is what turns the Pattern into the finite
automaton Matcher that recognizes the (not always) regular language
that is described by the regular expression.  Input is the string that
we want to determine if it is part of the regular language that is
described by the regular expression and accepted by the finite
automaton.  Running a finite control for the finite automaton and the
input will yield a Yes or No answer.

We can substitute that Yes/No answer with a Matches object that tells
us more information than simply if the Input matches the Pattern or
not.

In no way does it make sense to keep the answer in the Matcher.

Mark Mielke wrote:

> To answer Owen - I expect this is because the base regcomp()/regexec()
> libraries to not make this distinction.

Sure they do.  The function regcomp() returns an opaque type regex_t,
while regexec() returns an array of transparent type regmatch_t.
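
A minimal sketch of that split with the POSIX calls, assuming only
<regex.h>: the regex_t is produced once by regcomp(), while each call
to regexec() fills a caller-owned regmatch_t array. The helper names
below are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Compiles the pattern; returns 0 on success, as regcomp() does. */
static int
compile_leading_digits(regex_t *compiled)
{
    return regcomp(compiled, "^[0-9]+", REG_EXTENDED);
}

/* Returns a malloc()ed copy of the leading digits of str, or NULL.
 * The compiled pattern is only read; the match offsets are written
 * into the local regmatch_t array. */
static char *
leading_digits(const regex_t *compiled, const char *str)
{
    regmatch_t m[1];
    size_t len;
    char *out;

    assert(compiled != NULL && str != NULL);

    if (regexec(compiled, str, 1, m, 0) != 0)
        return NULL;

    len = (size_t)(m[0].rm_eo - m[0].rm_so);
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, str + m[0].rm_so, len);
    out[len] = '\0';
    return out;
}
```

The regmatch_t array plays the role of the Matches object in the
diagram: it belongs to the caller, not to the compiled pattern.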

On the whole multithreaded issue, I feel that that's beside the point.
 And, besides, the PCRE documentation seems clear about this issue:

The compiled form of a regular expression is not altered during
matching, so the same compiled pattern can safely be used by several
threads at once.

   (pcreapi(3) Manual Page)

  nikolai

Re: Performance implications of GRegex structure

2007-03-15 Thread Mathieu Lacage

On Thu, 2007-03-15 at 10:56 -0400, Owen Taylor wrote:

> Well, I could imagine (maybe, barely) that someone could show me numbers
> that showed that with a variety of long and complicated regular
> expressions, compiling them was still 10x as fast as matching them
> against very short strings.
> 
> But in general, yes, part of my concern is that there are situations
> where you are going to be matching the same regular expression against
> thousands of strings, and in that situation, unless compilation is very,
> very, fast, the need to repeatedly recompile will inevitably produce
> measurable overhead.

If this were to happen, could you not just put together a per-thread
5-entry double hash table to cache the 5 most recently used regex
strings ? It really seems like a no-brainer. Am I missing something ?
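
A rough sketch of what such a per-thread cache could look like, using
POSIX regcomp() and a plain five-slot array with least-recently-used
eviction instead of a hash table (an assumption made to keep the
example short). All names here are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 5

struct cache_entry {
    char *pattern;              /* owned copy; NULL if the slot is empty */
    regex_t compiled;
    unsigned long last_used;
};

/* One cache per thread, so no locking is needed. */
static _Thread_local struct cache_entry cache[CACHE_SLOTS];
static _Thread_local unsigned long cache_clock;
static _Thread_local unsigned long cache_compiles;  /* regcomp() calls so far */

/* Returns a compiled regex for pattern, compiling it only on a miss.
 * The pointer is valid until the entry is evicted by later calls. */
static const regex_t *
cached_regex(const char *pattern)
{
    struct cache_entry *victim = &cache[0];
    size_t len;
    int i;

    assert(pattern != NULL);

    for (i = 0; i < CACHE_SLOTS; i++) {
        struct cache_entry *e = &cache[i];
        if (e->pattern != NULL && strcmp(e->pattern, pattern) == 0) {
            e->last_used = ++cache_clock;
            return &e->compiled;
        }
        /* Prefer an empty slot; otherwise the least recently used one. */
        if (victim->pattern != NULL &&
            (e->pattern == NULL || e->last_used < victim->last_used))
            victim = e;
    }

    if (victim->pattern != NULL) {
        regfree(&victim->compiled);
        free(victim->pattern);
        victim->pattern = NULL;
    }
    if (regcomp(&victim->compiled, pattern, REG_EXTENDED) != 0)
        return NULL;
    cache_compiles++;

    len = strlen(pattern) + 1;
    victim->pattern = malloc(len);
    if (victim->pattern == NULL) {
        regfree(&victim->compiled);
        return NULL;
    }
    memcpy(victim->pattern, pattern, len);
    victim->last_used = ++cache_clock;
    return &victim->compiled;
}
```

Even in this small form the cache needs a string comparison per lookup
and an eviction policy, and callers must not hold on to a returned
pointer across later lookups.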

Mathieu
-- 


Re: Performance implications of GRegex structure

2007-03-15 Thread Yevgen Muntyan

[Owen, I apologize, I hit Reply instead of Reply All]

Owen Taylor wrote:
> So, the regular expression code has been committed to CVS finally. Yay!
>
> But looking over the header file, there is something that puzzles me
> about the way that it's set up: there is no distinction between a
> "pattern/regular expression" object and a match/matcher object.
...
> Or to Javascript, Perl, etc. (Javascript and Perl hide the issue a bit
> by having regular expression literals.) While I have never actually done
> timings on the matter, I've always assumed that the reason that regular
> expression API's are set up this way is compiling a regular expression
> has a significant expense.

EggRegex contains two structures internally - Pattern and Match.
egg_regex_copy() references Pattern and creates new Match structure.
If you got a copy of EggRegex object using egg_regex_copy(), you
can use it in another thread. Don't know how convenient it is in
multi-thread setup. The idea was that it's inconvenient to create
some match objects on heap.

Best regards,
Yevgen


Re: GRegex(win32) : 500 tests passed, 3 failed

2007-03-15 Thread Hans Breuer

On 15.03.2007 18:45, Jake Goulding wrote:
> Having newlines seems suspicious. What kind of newlines are they?
>
\r\n and \r were failing (NEWLINE was defined to 10 in makefile.msc but -1
in Makefile.am).

I have modified my patch, the crash was somewhere else than expected, i.e.
completely in the test application ;)
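
The underlying problem can be sketched with plain snprintf(), as an
illustration only (the helper names are hypothetical, and this is not
the test program's code): once a message has been formatted, it has to
be emitted through "%s", otherwise a '%' that came from the data is
interpreted as a conversion.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Step 1: build the message. The arguments are data here, so a '%'
 * inside them is copied through unchanged. */
static int
build_message(char *buf, size_t size, const char *string, const char *pattern)
{
    assert(buf != NULL && string != NULL && pattern != NULL);
    return snprintf(buf, size, "matching \"%s\" against \"%s\"",
                    string, pattern);
}

/* Step 2: emit an already formatted message. Using msg itself as the
 * format string would make printf-style functions interpret any '%'
 * in it; "%s" treats it as plain text. */
static int
emit_message(char *buf, size_t size, const char *msg)
{
    assert(buf != NULL && msg != NULL);
    return snprintf(buf, size, "%s", msg);
}
```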

Now it also prints readable newlines for the failed cases:

D:\devel\from-svn\glib\glib>..\tests\regex-test.exe
failed  (unexpected mismatch) '^b$' against 'a\r\nb\r\nc'
failed  (unexpected mismatch) '^b$' against 'a\rb\rc'
failed  (unexpected match) 'a#\rb' against 'a'

But after NEWLINE=-1 it does not fail anymore.

Thanks,
Hans

> Hans Breuer wrote:
>> with only small modifications I was able to compile GRegex with msvc,
>> thanks for providing an almost working makefile.msc ;-)
>>
>> The first attempt to run
>>
>>  regex-test.exe --noisy
>>
>> did crash due to gnulib not liking
>>
>> g_strdup_vprintf ("matching \"%s\" against \"%s\" \t", "%", "\p{Common}")
>>
>> The attached patch works around this and also removes the
>> #include  from gregex.h. I think it is better to only include
>> required sub-headers like almost all glib/*.h do.
>>
>> But now for the question: are these 3 failed specific to my build so I
>> should investigate them further?
>>
>> Thanks,
>>  Hans
>>
>> matching "a
>>
>> b
>>
>> c" against "^b$" (start: 0, len: -1) failed  (unexpected mismatch)
>> matching "a
>> b
>> c" against "^b$" (start: 0, len: -1) failed  (unexpected mismatch)
>>
>> matching "a" against "a#
>> b" (start: 0, len: -1)   failed  (unexpected match)
>>
>>
>>  Hans "at" Breuer "dot" Org ---
>> Tell me what you need, and I'll tell you how to
>> get along without it.-- Dilbert
>>   
>> 
>>
>> Index: glib/gregex.h
>> ===
>> --- glib/gregex.h(revision 5410)
>> +++ glib/gregex.h(working copy)
>> @@ -22,7 +22,8 @@
>>  #ifndef __G_REGEX_H__
>>  #define __G_REGEX_H__
>>  
>> -#include 
>> +#include 
>> +#include 
>>  
>>  G_BEGIN_DECLS
>>  
>> Index: tests/regex-test.c
>> ===
>> --- tests/regex-test.c   (revision 5409)
>> +++ tests/regex-test.c   (working copy)
>> @@ -230,7 +230,10 @@
>> gbooleanexpected)
>>  {
>>gboolean match;
>> -  
>> +  
>> +  if (string[0] == '%' && string[1] == '\0')
>> +  string = "%%";
>> +
>>verbose ("matching \"%s\" against \"%s\" \t", string, pattern);
>>  
>>match = g_regex_match_simple (pattern, string, compile_opts, match_opts);
>>   
>> 
>>
>>   
> 


-- 
 Hans "at" Breuer "dot" Org ---
Tell me what you need, and I'll tell you how to
get along without it.-- Dilbert
Index: glib/gregex.h
===
--- glib/gregex.h   (revision 5410)
+++ glib/gregex.h   (working copy)
@@ -22,7 +22,8 @@
 #ifndef __G_REGEX_H__
 #define __G_REGEX_H__
 
-#include 
+#include 
+#include 
 
 G_BEGIN_DECLS
 
Index: tests/regex-test.c
===
--- tests/regex-test.c  (revision 5409)
+++ tests/regex-test.c  (working copy)
@@ -87,7 +87,7 @@
   va_end (args);
 
   if (noisy) 
-g_print (msg);
+g_print ("%s", msg);
   g_free (msg);
 }
 
@@ -230,8 +230,8 @@
   gbooleanexpected)
 {
   gboolean match;
-  
-  verbose ("matching \"%s\" against \"%s\" \t", string, pattern);
+  
+  verbose ("matching \"%s\" against \"%s\" \t", string, pattern);
 
   match = g_regex_match_simple (pattern, string, compile_opts, match_opts);
   if (match != expected)
@@ -274,8 +274,12 @@
   match = g_regex_match_full (regex, string, string_len,
  start_position, match_opts2, NULL);
   if (match != expected)
-{
-  g_print ("failed \t(unexpected %s)\n", match ? "match" : "mismatch");
+{
+  gchar *e1 = g_strescape (pattern, NULL);
+  gchar *e2 = g_strescape (string, NULL);
+  g_print ("failed \t(unexpected %s) '%s' against '%s'\n", match ? "match" : "mismatch", e1, e2);
+  g_free (e1);
+  g_free (e2);
   g_regex_free (regex);
   return FALSE;
 }

Re: GRegex(win32) : 500 tests passed, 3 failed

2007-03-15 Thread Jake Goulding

Having newlines seems suspicious. What kind of newlines are they?

Hans Breuer wrote:
> with only small modifications I was able to compile GRegex with msvc,
> thanks for providing an almost working makefile.msc ;-)
>
> The first attempt to run
>
>   regex-test.exe --noisy
>
> did crash due to gnulib not liking
>
> g_strdup_vprintf ("matching \"%s\" against \"%s\" \t", "%", "\p{Common}")
>
> The attached patch works around this and also removes the
> #include  from gregex.h. I think it is better to only include
> required sub-headers like almost all glib/*.h do.
>
> But now for the question: are these 3 failed specific to my build so I
> should investigate them further?
>
> Thanks,
>   Hans
>
> matching "a
>
> b
>
> c" against "^b$" (start: 0, len: -1)  failed  (unexpected mismatch)
> matching "a
> b
> c" against "^b$" (start: 0, len: -1)  failed  (unexpected mismatch)
>
> matching "a" against "a#
> b" (start: 0, len: -1)failed  (unexpected match)
>
>
>  Hans "at" Breuer "dot" Org ---
> Tell me what you need, and I'll tell you how to
> get along without it.-- Dilbert
>   
> 
>
> Index: glib/gregex.h
> ===
> --- glib/gregex.h (revision 5410)
> +++ glib/gregex.h (working copy)
> @@ -22,7 +22,8 @@
>  #ifndef __G_REGEX_H__
>  #define __G_REGEX_H__
>  
> -#include 
> +#include 
> +#include 
>  
>  G_BEGIN_DECLS
>  
> Index: tests/regex-test.c
> ===
> --- tests/regex-test.c(revision 5409)
> +++ tests/regex-test.c(working copy)
> @@ -230,7 +230,10 @@
>  gbooleanexpected)
>  {
>gboolean match;
> -  
> +  
> +  if (string[0] == '%' && string[1] == '\0')
> +  string = "%%";
> +
>verbose ("matching \"%s\" against \"%s\" \t", string, pattern);
>  
>match = g_regex_match_simple (pattern, string, compile_opts, match_opts);
>   
> 
>
>   

-- 

JAKE GOULDING
Software Engineer
[EMAIL PROTECTED]

Vivísimo [Search Done Right]
1710 Murray Avenue
Pittsburgh, PA 15217 USA
tel: +1.412.422.2499 x105
fax: +1.412.422.2495
vivisimo.com  clusty.com


GRegex(win32) : 500 tests passed, 3 failed

2007-03-15 Thread Hans Breuer

with only small modifications I was able to compile GRegex with msvc,
thanks for providing an almost working makefile.msc ;-)

The first attempt to run

regex-test.exe --noisy

did crash due to gnulib not liking

g_strdup_vprintf ("matching \"%s\" against \"%s\" \t", "%", "\p{Common}")

The attached patch works around this and also removes the
#include  from gregex.h. I think it is better to only include
required sub-headers like almost all glib/*.h do.

But now for the question: are these 3 failures specific to my build,
so that I should investigate them further?

Thanks,
Hans

matching "a

b

c" against "^b$" (start: 0, len: -1)failed  (unexpected mismatch)
matching "a
b
c" against "^b$" (start: 0, len: -1)failed  (unexpected mismatch)

matching "a" against "a#
b" (start: 0, len: -1)  failed  (unexpected match)


 Hans "at" Breuer "dot" Org ---
Tell me what you need, and I'll tell you how to
get along without it.-- Dilbert
Index: glib/gregex.h
===
--- glib/gregex.h   (revision 5410)
+++ glib/gregex.h   (working copy)
@@ -22,7 +22,8 @@
 #ifndef __G_REGEX_H__
 #define __G_REGEX_H__
 
-#include 
+#include 
+#include 
 
 G_BEGIN_DECLS
 
Index: tests/regex-test.c
===
--- tests/regex-test.c  (revision 5409)
+++ tests/regex-test.c  (working copy)
@@ -230,7 +230,10 @@
   gbooleanexpected)
 {
   gboolean match;
-  
+  
+  if (string[0] == '%' && string[1] == '\0')
+  string = "%%";
+
   verbose ("matching \"%s\" against \"%s\" \t", string, pattern);
 
   match = g_regex_match_simple (pattern, string, compile_opts, match_opts);

Re: Performance implications of GRegex structure

2007-03-15 Thread mark

On Thu, Mar 15, 2007 at 10:56:57AM -0400, Owen Taylor wrote:
> The  compiled form of a regular expression is not altered during matching, 
> so the same compiled pattern can safely be used by several threads at once.
> ...
> Well, I could imagine (maybe, barely) that someone could show me numbers
> that showed that with a variety of long and complicated regular
> expressions, compiling them was still 10x as fast as matching them
> against very short strings.

To answer Owen - I expect this is because the base regcomp()/regexec()
libraries do not make this distinction. To emulate the higher
performing libraries that separate the Pattern from the Matcher would
require jumping through some hoops.

There are two cases I see. One is multithreaded scalability. If this
was important, simulation for these older libraries could be performed
using a pool of pre-compiled regular expression objects. For example,
if "give me a new matcher object" would pull the compiled regular
expression from the pool, or if none is available, compile a new one,
and once complete, it would return the regular expression to the
pool. At some point, it would reach a steady state where new
compilation was not required. I expect it would begin to line up with
the number of threads using it.
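
The pool idea could look roughly like the sketch below. It is only an
illustration under assumptions: POSIX regcomp() and a pthread mutex, a
fixed upper bound on idle objects, and hypothetical names throughout.

```c
#include <assert.h>
#include <pthread.h>
#include <regex.h>
#include <stdlib.h>

#define POOL_MAX 8

typedef struct {
    const char *pattern;        /* pattern text, not owned by the pool */
    pthread_mutex_t lock;
    regex_t *idle[POOL_MAX];    /* compiled objects not currently in use */
    int n_idle;
    int n_compiled;             /* successful regcomp() calls so far */
} RegexPool;

static void
pool_init(RegexPool *pool, const char *pattern)
{
    assert(pool != NULL && pattern != NULL);
    pool->pattern = pattern;
    pthread_mutex_init(&pool->lock, NULL);
    pool->n_idle = 0;
    pool->n_compiled = 0;
}

/* Takes an idle compiled regex, or compiles a new one if none is idle. */
static regex_t *
pool_acquire(RegexPool *pool)
{
    regex_t *re = NULL;

    pthread_mutex_lock(&pool->lock);
    if (pool->n_idle > 0)
        re = pool->idle[--pool->n_idle];
    pthread_mutex_unlock(&pool->lock);
    if (re != NULL)
        return re;

    re = malloc(sizeof *re);
    if (re == NULL)
        return NULL;
    if (regcomp(re, pool->pattern, REG_EXTENDED) != 0) {
        free(re);
        return NULL;
    }
    pthread_mutex_lock(&pool->lock);
    pool->n_compiled++;
    pthread_mutex_unlock(&pool->lock);
    return re;
}

/* Returns a regex to the pool, or frees it if the pool is full. */
static void
pool_release(RegexPool *pool, regex_t *re)
{
    int kept = 0;

    pthread_mutex_lock(&pool->lock);
    if (pool->n_idle < POOL_MAX) {
        pool->idle[pool->n_idle++] = re;
        kept = 1;
    }
    pthread_mutex_unlock(&pool->lock);

    if (!kept) {
        regfree(re);
        free(re);
    }
}
```

After a warm-up period n_compiled stops growing, which is the steady
state described above.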

The second case is ability to re-use a compiled pattern from the same
thread. I believe this is possible using the provided interface, although
the freedom to use more than one Matcher at the same time might be
convenient.

To illustrate the cost of compile-every-time vs compile-once (19X slower!):

Using the regcomp()/regexec() that comes with my FC6 system with
compile each time:

-- CUT --
$ cat r.c
#include <sys/types.h>
#include <regex.h>

int main ()
{
    regex_t regex;
    int i;

    for (i = 0; i < 100; i++) {
        /* compile, match and free on every iteration */
        regcomp(&regex, "constant", 0);
        regexec(&regex, "text that contains constant somewhere", 0, 0, 0);
        regfree(&regex);
    }

    return 0;
}

$ gcc -O3 -o r r.c

$ time ./r
./r  15.04s user 0.04s system 99% cpu 15.223 total
-- CUT --

Using the regcomp()/regexec() that comes with my FC6 system with
compile once:

-- CUT --
$ cat r2.c
#include <sys/types.h>
#include <regex.h>

int main ()
{
    regex_t regex;
    int i;

    /* compile once, outside the loop */
    regcomp(&regex, "constant", 0);
    for (i = 0; i < 100; i++) {
        regexec(&regex, "text that contains constant somewhere", 0, 0, 0);
    }
    regfree(&regex);

    return 0;
}
$ gcc -O3 -o r2 r2.c
$ time ./r2
./r2  0.77s user 0.00s system 100% cpu 0.773 total
-- CUT --
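
The two programs can also be folded into one harness that times both
variants in-process with clock(). This is a sketch with hypothetical
helper names; it makes no claim about what ratio any particular system
will show.

```c
#include <assert.h>
#include <regex.h>
#include <time.h>

#define SUBJECT "text that contains constant somewhere"

/* Compiles the pattern on every iteration; returns the match count. */
static long
match_compile_each_time(long iterations)
{
    long i, hits = 0;

    for (i = 0; i < iterations; i++) {
        regex_t re;
        if (regcomp(&re, "constant", 0) != 0)
            continue;
        if (regexec(&re, SUBJECT, 0, NULL, 0) == 0)
            hits++;
        regfree(&re);
    }
    return hits;
}

/* Compiles the pattern once and reuses it; returns the match count. */
static long
match_compile_once(long iterations)
{
    regex_t re;
    long i, hits = 0;

    if (regcomp(&re, "constant", 0) != 0)
        return 0;
    for (i = 0; i < iterations; i++) {
        if (regexec(&re, SUBJECT, 0, NULL, 0) == 0)
            hits++;
    }
    regfree(&re);
    return hits;
}

/* Runs fn(iterations), stores the CPU seconds used in *seconds and
 * returns fn's result. */
static long
timed(long (*fn)(long), long iterations, double *seconds)
{
    clock_t start;
    long hits;

    assert(fn != NULL && seconds != NULL);
    start = clock();
    hits = fn(iterations);
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return hits;
}
```

Calling timed() with each function and the same iteration count gives
two CPU times for the same number of matches, so the comparison does
not depend on process start-up cost.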

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


Re: Performance implications of GRegex structure

2007-03-15 Thread Owen Taylor

On Thu, 2007-03-15 at 10:38 -0400, Morten Welinder wrote:
> [Re PCRE]
> 
> > (There is no match[er] object here, but the equivalent is in all the in
> > and out parameters ...)
> 
> Is it?  If PCRE is as glibc, there is lots of state in the compiled expression
> and you cannot use it threaded.  However, once the match call is done,
> another thread can use the compiled regexp.

PCRE has no relation to glibc, and the man page says:

 The  compiled form of a regular expression is not altered during matching, 
 so the same compiled pattern can safely be used by several threads at once.

> > Neither is very appealing to me as a coder, though I could be convinced
> > that the second [==re-compile] is OK by suitable performance timings. Do we
> > have such numbers?
> 
> It's hard to see what kind of numbers would make sense to use as an argument
> here.  The numbers will depend heavily (orders of magnitude) on the regexps
> and the data.

Well, I could imagine (maybe, barely) that someone could show me numbers
that showed that with a variety of long and complicated regular
expressions, compiling them was still 10x as fast as matching them
against very short strings.

But in general, yes, part of my concern is that there are situations
where you are going to be matching the same regular expression against
thousands of strings, and in that situation, unless compilation is very,
very, fast, the need to repeatedly recompile will inevitably produce
measurable overhead.
- Owen




Re: Performance implications of GRegex structure

2007-03-15 Thread Morten Welinder

[Re PCRE]

> (There is no match[er] object here, but the equivalent is in all the in
> and out parameters ...)

Is it?  If PCRE is as glibc, there is lots of state in the compiled expression
and you cannot use it threaded.  However, once the match call is done,
another thread can use the compiled regexp.

> Neither is very appealing to me as a coder, though I could be convinced
> that the second [==re-compile] is OK by suitable performance timings. Do we
> have such numbers?

It's hard to see what kind of numbers would make sense to use as an argument
here.  The numbers will depend heavily (orders of magnitude) on the regexps
and the data.

M.

Performance implications of GRegex structure

2007-03-15 Thread Owen Taylor

So, the regular expression code has been committed to CVS finally. Yay!

But looking over the header file, there is something that puzzles me
about the way that it's set up: there is no distinction between a
"pattern/regular expression" object and a match/matcher object.


GRegex  *g_regex_new       (const gchar         *pattern,
                            GRegexCompileFlags   compile_options,
                            GRegexMatchFlags     match_options,
                            GError             **error);
gboolean g_regex_match     (GRegex              *regex,
                            const gchar         *string,
                            GRegexMatchFlags     match_options);
gboolean g_regex_fetch_pos (const GRegex        *regex,
                            gint                 match_num,
                            gint                *start_pos,
                            gint                *end_pos);

Compare that to Java:

 Pattern pattern = Pattern.compile("(.*?)-(.*)");
 Matcher m = pattern.matcher(str);
 if (m.matches()) {
     before_dash = m.group(1);
 }

Or to Python:

 pattern = re.compile("(.*?)-(.*)")
 m = pattern.match(str)
 if m:
     before_dash = m.group(1)

Or to PCRE:

 pcre *compiled = pcre_compile("(.*?)-(.*)", 0, &err, &err_offset, NULL);
 [...]
 if (pcre_exec(compiled, NULL,
               str, strlen(str), 0, 0,
               ovector, G_N_ELEMENTS(ovector)) >= 0) {
     before_dash = g_strndup(str + ovector[2], ovector[3] - ovector[2]);
 }

(There is no match[er] object here, but the equivalent is in all the in
and out parameters ...)

Or to Javascript, Perl, etc. (Javascript and Perl hide the issue a bit
by having regular expression literals.) While I have never actually done
timings on the matter, I've always assumed that the reason that regular
expression API's are set up this way is compiling a regular expression
has a significant expense.

With the GRegex structure I seem to have two choices:

 - Compile the regular expression once, and use it in a non-thread-safe,
   non-reentrant way. (shades of strtok)

 - Compile a new regular expression every time I want to do a match.

Neither is very appealing to me as a coder, though I could be convinced
that the second is OK by suitable performance timings. Do we have such
numbers?

- Owen



Re: GRegex

2006-10-29 Thread Marco Barisione

Il giorno sab, 28/10/2006 alle 19.35 +0200, Murray Cumming ha scritto:
> If it's possible, it would be nice to avoid making it a GObject just to
> add easy reference counting. That tends to restrict how it can be
> wrapped by language bindings for whom automatic memory management is not
> the default.

It can't be a GObject because GRegex will be in libglib.

> I don't know exactly how it might be done in C (it's easy in C++), but I
> would hope that there's some way to reference-count anything without
> forcing the object itself to do the reference counting.

What do you mean? GRegex handles ref counting as other structures in
GLib.

-- 
Marco Barisione
http://www.barisione.org/


Re: GRegex

2006-10-28 Thread Havoc Pennington

If the regex object is immutable, language bindings can treat it as a 
"value object" (like a string or GdkColor or GdkRectangle) even though 
it's refcounted for efficiency. Also immutable objects can be shared 
among threads without locking.
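
As a sketch of that idea, assuming C11 atomics and a hypothetical
Pattern type (not the GRegex implementation): the payload is written
once at construction, and the only field that changes afterwards is an
atomically updated reference count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    atomic_int ref_count;   /* the only field modified after creation */
    char *pattern;          /* set once in pattern_new(), then read-only */
} Pattern;

static Pattern *
pattern_new(const char *text)
{
    Pattern *p;
    size_t len;

    assert(text != NULL);

    p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    len = strlen(text) + 1;
    p->pattern = malloc(len);
    if (p->pattern == NULL) {
        free(p);
        return NULL;
    }
    memcpy(p->pattern, text, len);
    atomic_init(&p->ref_count, 1);
    return p;
}

static Pattern *
pattern_ref(Pattern *p)
{
    atomic_fetch_add(&p->ref_count, 1);
    return p;
}

/* Drops one reference; returns 1 if this call freed the object. */
static int
pattern_unref(Pattern *p)
{
    if (atomic_fetch_sub(&p->ref_count, 1) == 1) {
        free(p->pattern);
        free(p);
        return 1;
    }
    return 0;
}

/* Readers need no lock because the text never changes. */
static const char *
pattern_get_text(const Pattern *p)
{
    return p->pattern;
}
```

There is deliberately no setter: a binding can hand out references as
if they were copies of a value.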

A GObject on the other hand is never immutable since the GObject base 
class has post-construct-modifiable state.

Havoc

Re: GRegex

2006-10-28 Thread Murray Cumming

On Tue, 2006-10-24 at 22:53 +0200, Marco Barisione wrote:
> GtkSourceView needs only to have regexes ref counted.

If it's possible, it would be nice to avoid making it a GObject just to
add easy reference counting. That tends to restrict how it can be
wrapped by language bindings for whom automatic memory management is not
the default.

I don't know exactly how it might be done in C (it's easy in C++), but I
would hope that there's some way to reference-count anything without
forcing the object itself to do the reference counting.

-- 
Murray Cumming
[EMAIL PROTECTED]
www.murrayc.com
www.openismus.com


Re: GRegex

2006-10-25 Thread Yevgen Muntyan

Murray Cumming wrote:

>On Tue, 2006-10-24 at 22:05 +0200, Marco Barisione wrote:
>  
>
>>Il giorno mar, 24/10/2006 alle 13.17 -0400, Dominic Lachowicz ha
>>scritto:
>>
>>
>>>1) Please don't name variables 'string', as there may be a conflict
>>>with C++'s std::string
>>>  
>>>
>>I think they were called "string" in the original version of GRegex
>>written by Scott Wimer in 1999. PCRE calls the string "subject".
>>
>>However it's not a problem with C++, this program is valid:
>>#include <iostream>
>>#include <string>
>>
>>using namespace std;
>>
>>int main ()
>>{
>>  string string = "hello";
>>  cout << string << endl;
>>}
>>
>>
>
>It's not necessary to challenge every compiler and every build
>environment with that. A rename is easy.
>  
>
There are already stdin, stdout, and stderr forbidden thanks
to nice C macros. Are you saying that now we must not use a nice
word "string" because there may be a broken C++ compiler?
Which C++ compiler will break on "void func (const char *string);" ?

>>>2) I noticed that there are g_regex_ref/unref() methods. Why did you
>>>choose to do this, rather than subclass GObject? You would also then
>>>have easy GObject-style accessors for the regex's "pattern" and
>>>"match_options".
>>>  
>>>
>>The original plan was to include directly GRegex in GLib, so it cannot
>>depend on GObject. This could be changed if we decide to include GRegex
>>in a separate library.
>>
>>However is really necessary to have a real object?
>>
>>I added _ref and _unref because the only two programs that are currently
>>using my modified version of EggRegex are GtkSourceView and MooEdit. Both
>>programs need reference counting for regular expressions.
>>
>>
>[snip]
>
>Do they need to reference count plain strings too?
>  
>
Of course we do, we also reference count plain ints and chars.

Regards,
Yevgen



Re: GRegex

2006-10-25 Thread Marco Barisione

On 10/24/06, Marco Barisione <[EMAIL PROTECTED]> wrote:
> As discussed some times ago [1] I propose to add a PCRE wrapper to GLib.
> Bug #50075 [2] contains a patch that adds it as a separate libgregex.
> The documentation of the new API is at [3] (yes, there are some
> unresolved problems with gtk-doc).
>
> Owen Taylor would prefer to have GRegex directly in the main GLib
> library:
> [...]

To give you an idea of the size of libgregex and libpcre, these are
the sizes of the stripped .so files on my computer:

libgregex with internal PCRE      138 KB
libgregex with system PCRE         24 KB
libpcre with Unicode support      125 KB
libpcre without Unicode support    96 KB

-- 
Marco Barisione
http://www.barisione.org/

Re: GRegex

2006-10-25 Thread Marco Barisione

On 10/24/06, Behdad Esfahbod <[EMAIL PROTECTED]> wrote:
> On Tue, 2006-10-24 at 16:05 -0400, Marco Barisione wrote:
> This is broken.  It should err at configure time, not run time.  The
> user shouldn't need to check the output of g_regex_new for failures,
> just like any other thing we do with glib.

I have just uploaded a new patch that corrects this and some other problems.

I kept the run-time check, it's useful if cross-compiling or if the
installed PCRE library is updated.

-- 
Marco Barisione
http://www.barisione.org/

Re: GRegex

2006-10-25 Thread Paul LeoNerd Evans

On Tue, 24 Oct 2006 16:38:30 -0400
Behdad Esfahbod <[EMAIL PROTECTED]> wrote:

> This is broken.  It should err at configure time, not run time.  The
> user shouldn't need to check the output of g_regex_new for failures,
> just like any other thing we do with glib.

I would argue it should do both.

Don't compile against a PCRE that doesn't do UTF-8, but also check the
installed library on each startup. That would guard against the system's
library being upgraded at some later time, to one that doesn't support
UTF-8.

-- 
Paul "LeoNerd" Evans

[EMAIL PROTECTED]
ICQ# 4135350   |  Registered Linux# 179460
http://www.leonerd.org.uk/


Re: GRegex

2006-10-24 Thread Marco Barisione

Il giorno mar, 24/10/2006 alle 16.48 -0400, Dominic Lachowicz ha
scritto:
> It should be possible to write an auto* check that basically checks
> whether something like:
> 
> #include <pcre.h>
> int main(int argc, char ** argv) {
>   int has_utf8_support = 0;
>   /* pcre_config() returns 0 on success */
>   if (pcre_config(PCRE_CONFIG_UTF8, &has_utf8_support) == 0)
>     return has_utf8_support;
>   return 0;
> }
> 
> returns '1' or '0'. If so, we should probably favor the system
> installation of PCRE over the glib-supplied one.

I would prefer to always default to the internal version of PCRE because
it uses GLib for Unicode properties and UTF-8. Note that the tables used
by GLib and PCRE for Unicode are really big.


-- 
Marco Barisione
http://www.barisione.org/


Re: GRegex

2006-10-24 Thread Behdad Esfahbod

On Tue, 2006-10-24 at 16:48 -0400, Dominic Lachowicz wrote:
> On 10/24/06, Behdad Esfahbod <[EMAIL PROTECTED]> wrote:
> > On Tue, 2006-10-24 at 16:05 -0400, Marco Barisione wrote:
> > >
> > > If you prefer you can pass --enable-system-pcre to use the
> > > system-supplied library but, if it's compiled without utf-8 support,
> > > g_regex_new fails.
> >
> > This is broken.  It should err at configure time, not run time.  The
> > user shouldn't need to check the output of g_regex_new for failures,
> > just like any other thing we do with glib.
> 
> It should be possible to write an auto* check that basically checks
> whether something like:
> 
> #include <pcre.h>
> int main(int argc, char ** argv) {
>   int has_utf8_support;
>   if(pcre_config(PCRE_CONFIG_UTF8,  &has_utf8_support))
>     return has_utf8_support;
>   return 0;
> }
> 
> returns '1' or '0'. If so, we should probably favor the system
> installation of PCRE over the glib-supplied one.

At the expense of relying on whatever older version of the Unicode
Character Database it is using, and of course loading two sets of
Unicode data tables into memory.  PCRE itself is rather small compared
to the data tables, so last time the conclusion was that using glib's
probably makes more sense as they are already in memory anyway.

> Best,
> Dom
-- 
behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
-- Dan Bern, "New American Language"


Re: GRegex

2006-10-24 Thread Marco Barisione

On Tue, 24/10/2006 at 16.38 -0400, Behdad Esfahbod wrote:
> On Tue, 2006-10-24 at 16:05 -0400, Marco Barisione wrote: 
> > If you prefer you can pass --enable-system-pcre to use the
> > system-supplied library but, if it's compiled without utf-8 support,
> > g_regex_new fails. 
> 
> This is broken.  It should err at configure time, not run time.  The
> user shouldn't need to check the output of g_regex_new for failures,
> just like any other thing we do with glib.

I will modify it.

-- 
Marco Barisione
http://www.barisione.org/


Re: GRegex

2006-10-24 Thread Marco Barisione

On Tue, 24/10/2006 at 22.31 +0200, Murray Cumming wrote:
> > > 1) Please don't name variables 'string', as there may be a conflict
> > > with C++'s std::string
> > 
> > I think they were called "string" in the original version of GRegex
> > written by Scott Wimer in 1999. PCRE calls the string "subject".
> > 
> > However it's not a problem with C++, this program is valid:
> > [...]
> It's not necessary to challenge every compiler and every build
> environment with that. A rename is easy.

IMHO string is easier to understand than subject or any other name.

There are lots of functions using parameters called string, such as
g_strsplit, g_string_append, g_pattern_match or g_quark_from_string.
 
> > I added _ref and _unref because the only two programs that are currently
> > using my modified version of EggRegex are GtkSourceView and MooEdit. Both
> > programs need reference counting for regular expressions.
> [snip]
> 
> Do they need to reference count plain strings too?

GtkSourceView needs only to have regexes ref counted.

-- 
Marco Barisione
http://www.barisione.org/


Re: GRegex

2006-10-24 Thread Dominic Lachowicz

On 10/24/06, Behdad Esfahbod <[EMAIL PROTECTED]> wrote:
> On Tue, 2006-10-24 at 16:05 -0400, Marco Barisione wrote:
> >
> > If you prefer you can pass --enable-system-pcre to use the
> > system-supplied library but, if it's compiled without utf-8 support,
> > g_regex_new fails.
>
> This is broken.  It should err at configure time, not run time.  The
> user shouldn't need to check the output of g_regex_new for failures,
> just like any other thing we do with glib.

It should be possible to write an auto* check that basically checks
whether something like:

#include <pcre.h>
int main(int argc, char ** argv) {
  int has_utf8_support;
  if(pcre_config(PCRE_CONFIG_UTF8,  &has_utf8_support))
    return has_utf8_support;
  return 0;
}

returns '1' or '0'. If so, we should probably favor the system
installation of PCRE over the glib-supplied one.

Best,
Dom
-- 
Counting bodies like sheep to the rhythm of the war drums.

Re: GRegex

2006-10-24 Thread Behdad Esfahbod

On Tue, 2006-10-24 at 16:05 -0400, Marco Barisione wrote:
> 
> If you prefer you can pass --enable-system-pcre to use the
> system-supplied library but, if it's compiled without utf-8 support,
> g_regex_new fails. 

This is broken.  It should err at configure time, not run time.  The
user shouldn't need to check the output of g_regex_new for failures,
just like any other thing we do with glib.

-- 
behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
-- Dan Bern, "New American Language"


Re: GRegex

2006-10-24 Thread Murray Cumming

On Tue, 2006-10-24 at 22:05 +0200, Marco Barisione wrote:
> On Tue, 24/10/2006 at 13.17 -0400, Dominic Lachowicz wrote:
> > 1) Please don't name variables 'string', as there may be a conflict
> > with C++'s std::string
> 
> I think they were called "string" in the original version of GRegex
> written by Scott Wimer in 1999. PCRE calls the string "subject".
> 
> However it's not a problem with C++, this program is valid:
> #include <iostream>
> #include <string>
> 
> using namespace std;
> 
> int main ()
> {
>   string string = "hello";
>   cout << string << endl;
> }

It's not necessary to challenge every compiler and every build
environment with that. A rename is easy.

> > 2) I noticed that there are g_regex_ref/unref() methods. Why did you
> > choose to do this, rather than subclass GObject? You would also then
> > have easy GObject-style accessors for the regex's "pattern" and
> > "match_options".
> 
> The original plan was to include GRegex directly in GLib, so it cannot
> depend on GObject. This could be changed if we decide to include GRegex
> in a separate library.
> 
> However, is it really necessary to have a real object?
> 
> I added _ref and _unref because the only two programs that are currently
> using my modified version of EggRegex are GtkSourceView and MooEdit. Both
> programs need reference counting for regular expressions.
[snip]

Do they need to reference count plain strings too?


-- 
Murray Cumming
[EMAIL PROTECTED]
www.murrayc.com
www.openismus.com


Re: GRegex

2006-10-24 Thread Marco Barisione

On Tue, 24/10/2006 at 13.17 -0400, Dominic Lachowicz wrote:
> 1) Please don't name variables 'string', as there may be a conflict
> with C++'s std::string

I think they were called "string" in the original version of GRegex
written by Scott Wimer in 1999. PCRE calls the string "subject".

However it's not a problem with C++, this program is valid:
#include <iostream>
#include <string>

using namespace std;

int main ()
{
  string string = "hello";
  cout << string << endl;
}

> 2) I noticed that there are g_regex_ref/unref() methods. Why did you
> choose to do this, rather than subclass GObject? You would also then
> have easy GObject-style accessors for the regex's "pattern" and
> "match_options".

The original plan was to include GRegex directly in GLib, so it cannot
depend on GObject. This could be changed if we decide to include GRegex
in a separate library.

However, is it really necessary to have a real object?

I added _ref and _unref because the only two programs that are currently
using my modified version of EggRegex are GtkSourceView and MooEdit. Both
programs need reference counting for regular expressions.

In GLib there are other structures that are reference counted without
being objects, such as GHashTable, GAsyncQueue, GIOChannel and others.
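
To illustrate what "reference counted without being an object" can look
like, here is a minimal plain-C sketch. It is not the GRegex or GLib
implementation: the Pattern type, the pattern_new/pattern_ref/pattern_unref
names and the live_patterns counter are all hypothetical, and the counter
is not atomic, so the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure; this is NOT the real GRegex layout. */
typedef struct {
    int   ref_count;  /* number of current owners */
    char *pattern;    /* private copy of the pattern text */
} Pattern;

/* Hypothetical bookkeeping so that the effect of unref is observable. */
static int live_patterns = 0;

/* Creates a pattern holder with a single reference owned by the caller. */
Pattern *pattern_new(const char *text)
{
    Pattern *p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->pattern = malloc(strlen(text) + 1);
    if (p->pattern == NULL) {
        free(p);
        return NULL;
    }
    strcpy(p->pattern, text);
    p->ref_count = 1;
    live_patterns++;
    return p;
}

/* Adds one owner and returns the same pointer for convenience. */
Pattern *pattern_ref(Pattern *p)
{
    assert(p != NULL);
    p->ref_count++;
    return p;
}

/* Drops one owner; the memory is released when the last owner is gone. */
void pattern_unref(Pattern *p)
{
    assert(p != NULL && p->ref_count > 0);
    if (--p->ref_count == 0) {
        free(p->pattern);
        free(p);
        live_patterns--;
    }
}
```

A caller that stores the pointer in a second place calls pattern_ref once
and each holder later calls pattern_unref; no GObject type system is
involved, which is the point made above.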

> 3) Should there be a "GRegexMatch" object too? For instance, at least
> Python and Java have a notion of a read-only "Pattern" and a "Match
> Set". Your design combines the two into a single GRegex object. Having
> the pattern be read-only gets around your thread-safety "gotcha"
> comment in the docs.

I know this, but using them in a language with a garbage collector is
easier. The regex class in Qt uses the same approach as GRegex.

> 4) Python's search() and match() methods have a "start position" and
> an "end position" argument, while your match_full() has only a "start
> position" argument. Is there a reason for this? Could it be
> implemented?

It has a length argument.

> 5) I didn't fully investigate, but Java and Python have a concept of
> "search vs. match" with slightly different semantics. Is this semantic
> distinction easily expressible in your API?
> 
> http://docs.python.org/lib/re-objects.html

In Python, match matches only at the start of the string, search at any
position. You can get the match behavior by adding a "^" at the beginning
of the pattern, or by passing the compile option G_REGEX_ANCHORED or the
match option G_REGEX_MATCH_ANCHORED.

I prefer to have only one function, as I always found this distinction in
Python a bit confusing.
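
The search/match distinction can be sketched with a single matching
function plus anchoring. The sketch below does not use GRegex; it swaps in
POSIX regcomp/regexec so that it builds without GLib, and the helper names
regex_search and regex_match_at_start are hypothetical. The pattern is
wrapped in a group before "^" is prepended, on the assumption that a bare
"^" prefix would otherwise bind only to the first alternative of a pattern
such as "x|world".

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Like Python's search(): true if the pattern matches anywhere in subject. */
bool regex_search(const char *pattern, const char *subject)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;               /* treat a bad pattern as "no match" */
    bool found = (regexec(&re, subject, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}

/* Like Python's match(): true only if a match starts at offset 0.
 * Implemented by anchoring the pattern and reusing the one search function. */
bool regex_match_at_start(const char *pattern, const char *subject)
{
    size_t len = strlen(pattern) + 4;   /* "^(" + pattern + ")" + NUL */
    char *anchored = malloc(len);
    if (anchored == NULL)
        return false;
    snprintf(anchored, len, "^(%s)", pattern);
    bool found = regex_search(anchored, subject);
    free(anchored);
    return found;
}
```

With GRegex itself the wrapper would presumably be unnecessary, since the
anchoring is requested through G_REGEX_ANCHORED or G_REGEX_MATCH_ANCHORED
as described above.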

> 6) GRegex requires that PCRE be built with UTF-8 support, which some
> existing installations aren't. For reference, Gnumeric and Goffice get
> around this by including a copy of PCRE in their distribution and
> statically link it in. How do you ensure that GRegex finds a version
> of PCRE compiled with UTF-8 support?

The default for GRegex is to use its internal copy of PCRE. This is
automatically patched to use GLib for Unicode and memory management.

If you prefer you can pass --enable-system-pcre to use the
system-supplied library but, if it's compiled without utf-8 support,
g_regex_new fails.


-- 
Marco Barisione
http://www.barisione.org/


Re: GRegex

2006-10-24 Thread Brian J. Tarricone

On 10/24/2006 10:17 AM, Dominic Lachowicz wrote:
> Hi Marco,
> 
> Please take my review with a grain of salt. I've been wanting a
> convenience API on top of PCRE for a while now, and it would be great
> if we could get something like GRegex into Glib proper.
[...]
> 2) I noticed that there are g_regex_ref/unref() methods. Why did you
> choose to do this, rather than subclass GObject? You would also then
> have easy GObject-style accessors for the regex's "pattern" and
> "match_options".

In that case, GRegex couldn't be included in libglib proper.  It would
have to be in libgobject, or in a separate (libgregex?) library that
depends on libgobject.

-brian


Re: GRegex

2006-10-24 Thread Dominic Lachowicz

Hi Marco,

Please take my review with a grain of salt. I've been wanting a
convenience API on top of PCRE for a while now, and it would be great
if we could get something like GRegex into Glib proper.

1) Please don't name variables 'string', as there may be a conflict
with C++'s std::string

2) I noticed that there are g_regex_ref/unref() methods. Why did you
choose to do this, rather than subclass GObject? You would also then
have easy GObject-style accessors for the regex's "pattern" and
"match_options".

3) Should there be a "GRegexMatch" object too? For instance, at least
Python and Java have a notion of a read-only "Pattern" and a "Match
Set". Your design combines the two into a single GRegex object. Having
the pattern be read-only gets around your thread-safety "gotcha"
comment in the docs.

4) Python's search() and match() methods have a "start position" and
an "end position" argument, while your match_full() has only a "start
position" argument. Is there a reason for this? Could it be
implemented?

5) I didn't fully investigate, but Java and Python have a concept of
"search vs. match" with slightly different semantics. Is this semantic
distinction easily expressible in your API?

http://docs.python.org/lib/re-objects.html

6) GRegex requires that PCRE be built with UTF-8 support, which some
existing installations aren't. For reference, Gnumeric and Goffice get
around this by including a copy of PCRE in their distribution and
statically link it in. How do you ensure that GRegex finds a version
of PCRE compiled with UTF-8 support?

Thanks,
Dom

On 10/24/06, Marco Barisione <[EMAIL PROTECTED]> wrote:
> As discussed some time ago [1] I propose to add a PCRE wrapper to GLib.
> Bug #50075 [2] contains a patch that adds it as a separate libgregex.
> The documentation of the new API is at [3] (yes, there are some
> unresolved problems with gtk-doc).
>
> Owen Taylor would prefer to have GRegex directly in the main GLib
> library:
> (17:38:55) owen: is the latest plan for gregex really a separate
> library?
> (17:39:45) mclasen: owen: you would prefer it folded in ?
> (17:40:16) owen: mclasen: I think it makes tons more sense folded in. A
> regular expression facility is most useful if you can just use it when
> you need it
> (17:40:36) owen: mclasen: And on the desktop, having it folded in is
> purely a performance win
> (17:41:36) owen: if there is an embedded problem (how big is it
> anyways?) then a --without-regex configure option would be better
> (17:43:19) mclasen: owen: you are probably right
>
> What are your ideas?
>
>
> I would like to add to the documentation a simple and short tutorial on
> regular expressions and GRegex API. Does someone know something good
> (and with a compatible license) to copy?
>
>
> [1]
> http://mail.gnome.org/archives/gtk-devel-list/2006-July/msg00099.html
>
> [2] http://bugzilla.gnome.org/show_bug.cgi?id=50075
>
> [3] http://www.barisione.org/gregex/
>
>
> --
> Marco Barisione
> http://www.barisione.org/
>

-- 
Counting bodies like sheep to the rhythm of the war drums.

GRegex

2006-10-24 Thread Marco Barisione

As discussed some time ago [1] I propose to add a PCRE wrapper to GLib.
Bug #50075 [2] contains a patch that adds it as a separate libgregex.
The documentation of the new API is at [3] (yes, there are some
unresolved problems with gtk-doc).

Owen Taylor would prefer to have GRegex directly in the main GLib
library:
(17:38:55) owen: is the latest plan for gregex really a separate
library?
(17:39:45) mclasen: owen: you would prefer it folded in ?
(17:40:16) owen: mclasen: I think it makes tons more sense folded in. A
regular expression facility is most useful if you can just use it when
you need it
(17:40:36) owen: mclasen: And on the desktop, having it folded in is
purely a performance win
(17:41:36) owen: if there is an embedded problem (how big is it
anyways?) then a --without-regex configure option would be better
(17:43:19) mclasen: owen: you are probably right

What are your ideas?


I would like to add to the documentation a simple and short tutorial on
regular expressions and GRegex API. Does someone know something good
(and with a compatible license) to copy?


[1]
http://mail.gnome.org/archives/gtk-devel-list/2006-July/msg00099.html

[2] http://bugzilla.gnome.org/show_bug.cgi?id=50075

[3] http://www.barisione.org/gregex/


-- 
Marco Barisione
http://www.barisione.org/
