Hi Giuseppe, I'll start by saying tl;dr. But I didn't stop because of your e-mail, I'm actually referring to the API.
I started looking at it and it seems too cluttered. Specially this early in the process. It's hard to review something that is trying to be everything or maybe it's just exposing too many things. I would like to challenge you to do the opposite: give us the minimum API that can offer the most important features. To get there start from short self-contained code examples showing the features in use (I'm interested in seeing those) -- API design is about how it gets used, not so much about the number of features. For instance, in Perl, regular expressions are essentially a string, a couple of operators and some magic variables for the captures (and lots of magic everywhere...). (Now, I'm not saying we should do regular expressions in Qt as they are done in Perl) The API, as it stands seems too hard-wired to the implementation or feature set of the engine giving it little opportunity for evolving. Another issue is that it is hard for me to see if (or where) the API itself is imposing performance penalties on the implementation. Finally, note that it is usually ok to add API as time goes by, but our BC promises mean we have to maintain most of what we add for a long time. Cheers, João ________________________________________ From: development-bounces+joao.abecasis=nokia....@qt-project.org [development-bounces+joao.abecasis=nokia....@qt-project.org] on behalf of ext Giuseppe D'Angelo [dange...@gmail.com] Sent: 15 December 2011 17:43 To: development@qt-project.org Subject: [Development] QRegularExpression -- first round of API review Hi everybody. Sorry for the length of thjis message, but doing API reviews by mail is hard, and I needed to explain many decisions here and there (and, of course, the API itself). :-( Attached to this mail (and also here: <http://pastebin.com/KzsGFXJC> -- if you don't want to download and open a file) you can find a stripped down version of the QRegularExpression API I'm working on. It's deliberately stripped down because at this stage I'd like to have feedbacks about the API and not about the code itself. We're still free of changing all parts of it, so, even if you dislike the tiniest bit of something, please go ahead and say it. And it's deliberaltely without comments, so the reader's exercise now is reading it and finding out if it's understandable at a first glance. Once you're done, I'll start commenting the non-obvious points. General As discussed, two implicitly shared, copy on write, reentrant value classes: the regexp itself (pattern + options) and the result of a match against a subject string (fixes the T7 in <http://developer.qt.nokia.com/wiki/Regexp_engine_in_Qt5> ). The regexp also stores the compiled form of itself as an hidden optimization -- most const methods actually do modify the shared instance. The backend is PCRE so the feature set is quite big (fixing T1-T4). I have already planned to expand the enums to support even more options. QRegularExpression PatternOptions The PatternOptions flag contains the pattern options, i.e. modify the characters that the pattern should match. * CaseInsensitiveOption is the /i flag (case Insensitive Match). In QRegExp it was set by a separate method (setCaseSensitivity), holding a Qt::CaseSensitivity value. Here IMHO it just clutters the API to have different getters and setters for the options. * InvertedGreedinessOption is the /U flag (not present in Perl) inverts the greediness of the quantifiers: they became not greedy by default, and greedy if followed by ?. See also the T3 in the list. * DotMatchesEverythingOption is the /s flag (Single line match): the dot (.) metacharacter matches everything including newlines. * MultilineMatchOption is the /m flag (Multi line match): the caret (^) and the dollar ($) metacharacters are allowed to match at (and just before) any newline inside the string (as well as the ordinary begin and end of the string) * ExtendedPatternSyntaxOption is the /x flag (eXtended pattern syntax): whitespace data inside the regexp is ignored, and an unescaped # outside a character class starts a comment that goes till the next newline (inside the pattern!). It's probably not very useful in C++, where one can use the default rules for string literals, but it may be handy if one stores complicated regexps in an external file. * UseUnicodePropertiesOption enables the use Unicode properties for character classes such as \w, \s, \d, etc. (and obviously their counterparts \W, \S, etc.), instead of making them just match the standard ASCII ranges * AllowDuplicatedNamesOption allows duplicated named capturing patterns (that is (?<NAME> ... ) patterns) * DontCaptureOption disables the use of unnamed capturing parenthesis -- they all behave as if they all were (?:...) parenthesis. Named ones continue to work. * MatchFirstlineOnlyOption restricts an unanchored pattern to start to match only before the first newline (although the match can continue over it). * DollarMatchesOnlyAtEndOption changes the meaning of the dollar ($) metacharacter to match only at the very end of the string, while usually it matches at the end AND at a newline right before the end CompileHints The CompileHints are hints to the regexp engine itself. Basically they control the optimization of the compiled regexp (the PCRE API is pcre_study). * StudyPatternCompileHint tells PCRE engine to study the pattern, which basically means performing very simple optimization tasks (f.i., the engine records the minimum size of a string to get a successful match, and if the subject string is shorter the match aborts immediately) * JitCompileHint implies StudyPatternCompileHint, and enables the JIT compilation of the pattern. Of course this has a cost, and there's a tradeoff to consider between spending time compilating the regexp instead of just matching against a string. MatchOptions The MatchOptions control the match itself. * AnchoredMatch forces the match to start at the first matching point inside the subject string, even if the regexp doesn't have carets or other metacharacters that normally force such a match. In QRegExp this is what CaretAtOffset did. * NonEmptyMatch tells the engine that an empty match is not considered a success (obviously, even if the regexp allowed one); therefore, the engine continue searching for a non-empty match. * NonEmptyAtStartMatch tells the engine that an empty match at the beginning of the subject string is not acceptable as a match, but an empty match occuring elsewhere is accepted. Error reporting isValid isValid() compiles the pattern and returns true if it was successfully compiled, false otherwise (f.i. the pattern contains a syntax error). A default constructed QRegularExpression is NOT valid. patternErrorOffset patternErrorOffset() returns the offset (in UTF-16 codepoints) inside the pattern string where a syntax error was found, or -1 if this is not applicable. errorString errorString() returns a translated, textual description of the regexp error. What should it return in case of no error? "No error"? An empty string? Quoting escape It implements perl's quotemeta -- _all_ characters outside [a-zA-Z0-9_] get escaped by a backslash. The Matching Functions Here we go with the hardcore parts. :-) All the matching functions return a QRegularExpressionMatch that can be used to know if there was a full/partial match, to extract named subpatterns, and to repeat the match on the subject string emulating the /g option (see below for more info). All of them take a string to search (the _subject_ string), plus other options. match / exactMatch They basically do the same thing: look for a match inside the subject. match() succeeds if there's a substring of the subject string that matches against the pattern; exactMatch() requires that the whole subject string matches (in practice, it surrounds the pattern with \A and \z). A difference I'd like to introduce w.r.t. QRegExp is that a failed match with both of these functions does NOT return (in the QRegularExpressionMatch object) the offsets of the partial match, if any. If one is looking for partial matches then partialMatch should be used. lastMatch What lastMatch does is to return the last match that you would get by starting at the beginning of the string and matching forward (as in /g). See also the discussion below. This is really tricky to do correctly -- PCRE doesn't support it directly. Also, obviously, one wants to avoid to do the /g loop and take the last result. Perhaps a trick could be rewriting the pattern as "(?s:.*?)pattern"). As of now, lastMatch doesn't accept any option -- QRegExp accepted an offset from the end of the string, I guess I could simply add it (plus a starting offset and the usual matching options?). partialMatch partialMatch is used whenever we're looking for a partial match inside the subject string. A partial match happens when, while we're successfully matching a pattern against a subject, the end of the subject string is reached. In an ordinary context usually this is a failed match, but there are use cases in which we'd like instead to be notified about the partial match found. Doing a partial match is quite inefficient compared to an ordinary match because most of the optimizations cannot be employed (f.i. the aforementioned check on the string length >= the minimum required by the pattern, or backward matching, etc.). Note that of course we can get a complete match while attempting a partial match. The use cases for this are basically: * validators: if we have, say, a QRegExpValidator on a QLineEdit, then while the user is typing, a partial match denotes an "Intermediate" status of the validator -- the string may become fully Valid when the user finishes to enter it. * incremental matching (see T6): in order to scan a big text, we would like to provide it in chunks to the regexp engine. The obvious problem is "what if the text that matches the pattern is across 2+ chunks" -- that's where you could use partial matching. A partial match becomes a way of saying "feed me with more input". ... plus others that I'm obviously forgetting right now. TRIVIA: if I try to match the string "abca" against the pattern "(abc)+", what should be the result of partialMatch? 1) No match 2) Partial match found 3) Full match found 4) I deserve to be punched in the face because I ask tricky questions The answer is (tadah!) the fourth. In fact, both partial and full matches are good answers to that question. It all depends if we prefer to have a full match over a partial match (f.i.: the validator) or not (f.i.: incremental matching). In fact, note that if we are doing incremental matching an answer of "full match" would be wrong in case after "abca" the text continues with "bc" -- the + quantifier is greedy, thus it should have matched the whole "abcabc". This whole behaviour is controlled by the preferCompleteMatch parameter. Setting it to true enables the PCRE SOFT partial matching, which means that it tries to find a complete match, and if it fails, it reports a partial match (if any). Setting it to false enables the PCRE HARD partial matching, which means that the matches bails out as soon as we hit the end of the subject string. See also man 3 pcrepartial for more details about the two matching modes. (I'm not even sure it should be a boolean parameter -- perhaps I should change it into an enum? If so, any suggestion for the names?) QRegularExpressionMatch This class contains the result of a match (plus methods to implement /g, see below). Information about the match hasMatch() and hasPartialMatch() carry the obvious information. Error reporting isValid() returns false in case the QRegularExpressionMatch was default-constructed, or there was a matching error (for instance, backtracking limit reached). Unfortunately, PCRE provides no builtin strings to translate matching errors to strings. Should we build ours and provide an errorString() method as well? Capture group information This also includes information about the whole match, which is "implicitly" inside the capture group 0. capturedTextsCount Returns the number of capturing groups that matched. But actually I need another name for this, because I would like to return the highest index of a capturing group that captured something (so that one can extract all the substrings). Note that not all capturing groups are required to match something. "(a)(b)?(c)" matches "ac" with 2 groups (#1 and #3, group #2 doesn't match), and the highest index is therefore 3. cap, capRef, capturedTexts, capturedTextsForName cap and capRef return the QString (resp. QStringRef) that was captured by the i-th group (int overload). For the QString overload, they return the QString/QStringRef that was captured by the leftmost named group with that name (in the order they appear in the pattern) that matched something. This is coherent with Perl's %+ hash (cf. perldoc perlre) -- other RE engines (Python) don't support duplicated names inside the pattern at all. We support them, provided that the right pattern option is used. capturedTexts simply returns all the captured texts. capturedTextsForName returns all the captured texts by the capturing groups with a given name. Should I add QList<QStringRef> variants for these two last methods as well? Although it's discouraged to distinguish between null and empty QStrings, here they DO make a difference -- a null QString means that a certain group did not capture at all, while an empty QString that it captured an empty string. I think the semantic is clear and does not lead to ambiguities. pos, matchedLength, endPos They return the start position, length, and end position (in UTF-16 codepoints) of each capturing group inside the subject string -- same semantic than cap() apply. I'm missing the methods for returning ALL the starting positions/lengths/end positions for all the groups for a given name (or all groups). Any suggestion for such methods? Iterators Implementing /g (Global match) correctly (aka: as Perl does) is tricky because of patterns that may potentially match an empty string (that is, match 0 characters). For instance, "a*b*|c" inside "ccaabbd" should return, under /g: 1) match at position 0, length 0 2) match at position 0, length 1 (c) 3) match at position 1, length 0 4) match at position 1, length 1 (c) 5) match at position 2, length 4 (aabb) 6) match at position 6, length 0 7) match at position 7, length 0 Note that the QRegExp "loop" solution in the docs fails miserably at this (matching "a*" inside "b" causes an endless loop due to matchingLength() = 0). The *right* algorithm is: * Find the first match normally * If the last match didn't match an empty string, do a normal match from the current offset inside the subject. * If the last match matched an empty string, then attempt an ANCHORED, NON EMPTY AT START match at the current offset inside the subject. If it succeeds, then that's the next match. Otherwise, advance by 1 character and perform another ordinary match. Although doable with our API, this is TOO tricky to be correctly performed by an user. That's why I proposed to have methods to perform this. Note: these methods iterate on the SAME subject string -- an "high level" incremental global matcher is not offered by this API. nextMatch(), operator++ nextMatch() advances the match of "this" in the subject string following the algorithm above. operator++ does the same. The intended API usage is QRegularExpression rx(...); QRegularExpressionMatch m = rx.match(str); while (m.hasMatch()) { ... m.nextMatch(); } or for (auto m = rx.match(str); m.hasMatch(); ++m) { ... } I am just a bit skeptical about operator++ because it leads to wanting operator-- (and previousMatch) for backwards iteration. previousMatch, operator-- These do backwards iteration in the /g loop, but I'm very skeptical about implementing it -- it's too harsh to do it correctly (again, unless one does the whole /g loop forward from the start every time, or caches the result). There's a valid use case, though -- backward search in an editor, for instance. Opinions? Maybe an editor could simply save the offsets of a forward /g search and use those? The End! Well. If you survived, thanks for reading and sharing your feedbacks :-) _______________________________________________ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development