On Sat, 17 Jun 2017, Fernando Cabral wrote:
> Still beating my head against the wall due to my lack of knowledge about
> the PCRE methods and properties... Because of this, I have progressed not
> only very slowly but also -- I feel -- in a very inelegant way. So perhaps
> you guys who are more acquainted with PCRE might be able to hint me on a
> better solution.
>
> I want to search a long string that can contain a sentence, a paragraph or
> even a full text. I wanna find and isolate every word it contains. A word
> is defined as any sequence of alphabetic characters followed by a
> non-alphabetic character.
>

The mathematician in me can't resist pointing this out: you hopefully wanted
to define "word in a string" as "a *longest* sequence of alphabetic
characters followed by a non-alphabetic character (or the end of the
string)". Using your definition above, the words in "abc:" would be "c",
"bc" and "abc", whereas you probably only wanted "abc" (the longest of
those).

> The sample code below does work, but I don't feel it is as elegant and as
> fast as it could and should be. Especially the way I am traversing the
> string from the beginning to the end. It looks awkward and slow. There must
> be a more efficient way, like working only with offsets and lengths instead
> of copying the string again and again.
>

You think worse of String.Mid() than it deserves, IMHO. Gambas strings are
triples of a pointer to some data, a start index and a length, and the
built-in string functions take care not to copy a string when it's not
necessary.

The plain Mid$() function (dealing with ASCII strings only) is implemented
as a constant-time operation which simply takes your input string and
adjusts the start index and length to give you the requested portion of the
string. The string doesn't even have to be read, much less copied, to do
this.

Now, the String.Mid() function is somewhat more complicated, because UTF-8
strings have variable-width characters, which makes it difficult to map
byte indices to character positions. To implement String.Mid(), your string
has to be read, but, again, not copied. Extracting a part of a string is a
non-destructive operation in Gambas and no copying takes place.
(Concatenating strings, on the other hand, will copy.)

So, there is some reading overhead (if you need UTF-8 strings), but it's
smaller than you probably thought.

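To make the byte-versus-character distinction concrete, here is a minimal,
untested sketch. The sample string is only an illustration, and the values
in the comments assume the script file is saved as UTF-8, so that "ä" takes
two bytes. Len() and Mid$() count bytes, while String.Len() and String.Mid()
count characters:

```gambas
#!/usr/bin/gbs3

Public Sub Main()
  ' Assumption: this source file is UTF-8, so "ä" is stored as two bytes.
  Dim s As String = "äbc"

  Print Len(s)             ' byte count: 4
  Print String.Len(s)      ' character count: 3
  Print Mid$(s, 3)         ' from byte 3 onwards: bc
  Print String.Mid(s, 2)   ' from character 2 onwards: bc
End
```

Both calls return the same substring here, but String.Mid() has to walk over
the two-byte "ä" to find where character 2 starts, whereas Mid$() can jump
straight to byte 3.
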
> Dim Alphabetics as string "abc...zyzABC...ZYZ"
> Dim re as RegExp
> Dim matches as String []
> Dim RawText as String
>
> re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)",
> RegExp.utf8)
> RawText = "abc12345def ghi jklm mno p1"
>
> Do While RawText
>   re.Exec(RawText)
>   matches.add(re[1].text)
>   RawText = String.Mid(RawText, String.Len(re.text) + 1)
> Loop
>
> For i = 0 To matches.Count - 1
>   Print matches[i]
> Next
>
>
> Above code correctly finds "abc, def, ghi, jklm, mno, p". But the tricks I
> have used are cumbersome (like advancing with string.mid() and resorting to
> re[1].text and re.text).
>

Well, I think you can't use PCRE alone to solve your problem, if you want to
capture a variable number of words in your submatches. I did a bit of
reading and from what I gather [1][2] capturing group numbers are assigned
based on the verbatim regular expression, i.e. the number of submatches you
can receive is limited by the number of "(...)" constructs in your
expression; and the (otherwise very nifty) recursion operator (?R) does not
give you an unlimited number of capturing groups, sadly.

Anyway, I think by changing your regular expression, you can let PCRE take
care of the string advancement, like so:

    #!/usr/bin/gbs3

    Use "gb.pcre"

    Public Sub Main()
      Dim r As New RegExp
      Dim s As String

      ' Group 1: the next word; group 2: everything after the separator
      r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
      s = "abc12345def ghi jklm mno p1"
      Print "Subject:";; s
      Do
        r.Exec(s)
        If r.Offset = -1 Then Break   ' no further match
        Print " ->";; r[1].Text
        s = r[2].Text                 ' continue with the rest of the subject
      Loop While s
    End

Output:

    Subject: abc12345def ghi jklm mno p1
     -> abc
     -> def
     -> ghi
     -> jklm
     -> mno
     -> p

But, I think, this is less efficient than using String.Mid(). The trailing
group (.*$) _may_ make the PCRE library read the entire subject every time.
And I believe gb.pcre will copy your submatch string when returning it.

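If the goal is to avoid both the trailing (.*$) group and the character
counting, one possible variant is to match only the word itself and advance
with the plain Mid$(). This is an untested sketch. It assumes that r.Offset
is a zero-based byte offset (which is what the PCRE library reports), so
that it can be combined with the byte-based Len() and Mid$(). The array
name "words" is made up for illustration:

```gambas
#!/usr/bin/gbs3

Use "gb.pcre"

Public Sub Main()
  Dim r As New RegExp
  Dim s As String
  Dim words As New String[]
  Dim w As String

  ' Match only the word; a greedy "+" already yields the longest run
  r.Compile("[[:alpha:]]+", RegExp.UTF8)
  s = "abc12345def ghi jklm mno p1"

  Do While s
    r.Exec(s)
    If r.Offset = -1 Then Break   ' no more words
    words.Add(r.Text)
    ' Assumption: r.Offset counts bytes from 0, so the byte-based
    ' Mid$() can skip past the match without reading the string.
    s = Mid$(s, r.Offset + Len(r.Text) + 1)
  Loop

  For Each w In words
    Print w
  Next
End
```

Note that this pattern also accepts a word that runs up to the end of the
string, in line with the "(or the end of the string)" definition. Whether it
is actually faster still depends on what gb.pcre copies internally.
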
If you care deeply about this, you'll have to trace the code in gb.pcre and
main/gbx (the interpreter) to see what copies strings and what doesn't.

Regards,
Tobi

[1] http://www.regular-expressions.info/recursecapture.html
    (Capturing Groups Inside Recursion or Subroutine Calls)
[2] http://www.rexegg.com/regex-recursion.html
    (Groups Contents and Numbering in Recursive Expressions)

-- 
"There's an old saying: Don't change anything... ever!" -- Mr. Monk