I think I would do something like:

Dim ii As Integer
Dim sStr As String = "abc defg hijkl"
Dim sWords As String[]
sWords = Split(sStr, " ")
For ii = 0 To sWords.Max
  Print sWords[ii]
Next

Jussi

On Sun, Jun 18, 2017 at 2:57 AM, Fernando Cabral <fernandojosecab...@gmail.com> wrote:

> Tobi
>
> One more thing about the way I wish it could work (I remember having done
> this in C perhaps 30 years ago). The pseudo-code below is pretty
> schematic, but I think it will clarify the issue.
>
> Let p and l be arrays of integers and s be the string "abc defg hijkl"
>
> So, after traversing the string we would have the following result:
> p[0] = offset of "a" (0)
> l[0] = length of "abc" (3)
> p[1] = offset of "d" (4)
> l[1] = length of "defg" (4)
> p[2] = offset of "h" (9)
> l[2] = length of "hijkl" (5).
>
> After this, each word could be retrieved in the following manner:
>
> for i = 0 to 2
>     print mid(s, p[i], l[i])
> next
>
> I think this would be the most efficient way to do it. But I can't find how
> to do it in Gambas using Regex.
>
> Regards
>
> - fernando
>
>
> 2017-06-17 18:06 GMT-03:00 Tobias Boege <tabo...@gmail.com>:
>
> > On Sat, 17 Jun 2017, Fernando Cabral wrote:
> > > Still beating my head against the wall due to my lack of knowledge about
> > > the PCRE methods and properties... Because of this, I have progressed not
> > > only very slowly but also -- I feel -- in a very inelegant way. So perhaps
> > > you guys who are more acquainted with PCRE might be able to point me to a
> > > better solution.
> > >
> > > I want to search a long string that can contain a sentence, a paragraph or
> > > even a full text. I want to find and isolate every word it contains. A word
> > > is defined as any sequence of alphabetic characters followed by a
> > > non-alphabetic character.
> > >
> >
> > The Mathematician in me can't resist pointing this out: you hopefully wanted
> > to define "word in a string" as "a *longest* sequence of alphabetic characters
> > followed by a non-alphabetic character (or the end of the string)".
> > Using your definition above, the words in "abc:" would be "c", "bc" and
> > "abc", whereas you probably only wanted "abc" (the longest of those).
> >
> > > The sample code below does work, but I don't feel it is as elegant and as
> > > fast as it could and should be. Especially the way I am traversing the
> > > string from the beginning to the end. It looks awkward and slow. There must
> > > be a more efficient way, like working only with offsets and lengths instead
> > > of copying the string again and again.
> > >
> >
> > You think worse of String.Mid() than it deserves, IMHO. Gambas strings
> > are triples of a pointer to some data, a start index and a length, and
> > the built-in string functions take care not to copy a string when it's
> > not necessary. The plain Mid$() function (dealing with ASCII strings only)
> > is implemented as a constant-time operation which simply takes your input
> > string and adjusts the start index and length to give you the requested
> > portion of the string. The string doesn't even have to be read, much less
> > copied, to do this.
> >
> > Now, the String.Mid() function is somewhat more complicated, because
> > UTF-8 strings have variable-width characters, which makes it difficult
> > to map byte indices to character positions. To implement String.Mid(),
> > your string has to be read, but, again, not copied.
> >
> > Extracting a part of a string is a non-destructive operation in Gambas
> > and no copying takes place. (Concatenating strings, on the other hand,
> > will copy.) So, there is some reading overhead (if you need UTF-8 strings),
> > but it's smaller than you probably thought.
> >
> > > Dim Alphabetics As String = "abc...zyzABC...ZYZ"
> > > Dim re As New RegExp
> > > Dim matches As New String[]
> > > Dim RawText As String
> > > Dim i As Integer
> > >
> > > re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)", RegExp.UTF8)
> > > RawText = "abc12345def ghi jklm mno p1"
> > >
> > > Do While RawText
> > >   re.Exec(RawText)
> > >   matches.Add(re[1].Text)
> > >   RawText = String.Mid(RawText, String.Len(re.Text) + 1)
> > > Loop
> > >
> > > For i = 0 To matches.Count - 1
> > >   Print matches[i]
> > > Next
> > >
> > >
> > > The code above correctly finds "abc, def, ghi, jklm, mno, p". But the tricks I
> > > have used are cumbersome (like advancing with String.Mid() and resorting to
> > > re[1].Text and re.Text).
> > >
> >
> > Well, I think you can't use PCRE alone to solve your problem, if you want
> > to capture a variable number of words in your submatches. I did a bit of
> > reading and from what I gather [1][2] capturing group numbers are assigned
> > based on the verbatim regular expression, i.e. the number of submatches
> > you can receive is limited by the number of "(...)" constructs in your
> > expression; and the (otherwise very nifty) recursion operator (?R) does
> > not give you an unlimited number of capturing groups, sadly.
> >
> > Anyway, I think by changing your regular expression, you can let PCRE take
> > care of the string advancement, like so:
> >
> >   #!/usr/bin/gbs3
> >
> >   Use "gb.pcre"
> >
> >   Public Sub Main()
> >     Dim r As New RegExp
> >     Dim s As String
> >
> >     ' Group 1: the next word; group 2: everything after the separator
> >     r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
> >     s = "abc12345def ghi jklm mno p1"
> >     Print "Subject:";; s
> >     Do
> >       r.Exec(s)
> >       ' Stop as soon as nothing matches any more
> >       If r.Offset = -1 Then Break
> >       Print " ->";; r[1].Text
> >       ' Continue with the rest of the subject
> >       s = r[2].Text
> >     Loop While s
> >   End
> >
> > Output:
> >
> >   Subject: abc12345def ghi jklm mno p1
> >    -> abc
> >    -> def
> >    -> ghi
> >    -> jklm
> >    -> mno
> >    -> p
> >
> > But, I think, this is less efficient than using String.Mid(). The trailing
> > group (.*$) _may_ make the PCRE library read the entire subject every time.
> > And I believe gb.pcre will copy your submatch string when returning it.
> > If you care deeply about this, you'll have to trace the code in gb.pcre
> > and main/gbx (the interpreter) to see what copies strings and what doesn't.
> >
> > Regards,
> > Tobi
> >
> > [1] http://www.regular-expressions.info/recursecapture.html (Capturing
> >     Groups Inside Recursion or Subroutine Calls)
> > [2] http://www.rexegg.com/regex-recursion.html (Groups Contents and
> >     Numbering in Recursive Expressions)
> >
> > --
> > "There's an old saying: Don't change anything... ever!" -- Mr. Monk
> >
> > ------------------------------------------------------------------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, Slashdot.org!
> > http://sdm.link/slashdot
> > _______________________________________________
> > Gambas-user mailing list
> > Gambas-user@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/gambas-user
> >
>
> --
> Fernando Cabral
> Blog: http://fernandocabral.org
> Twitter: http://twitter.com/fjcabral
> e-mail: fernandojosecab...@gmail.com
> Facebook: f...@fcabral.com.br
> Telegram: +55 (37) 99988-8868
> Wickr ID: fernandocabral
> WhatsApp: +55 (37) 99988-8868
> Skype: fernandojosecabral
> Landline: +55 (37) 3521-2183
> Mobile: +55 (37) 99988-8868
>
> As long as there is a single person in the world without a home or without
> food, no politician or scientist can boast of anything.
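
For comparison, Fernando's offset-and-length idea (the p[] and l[] arrays in his pseudo-code) can be sketched in a few lines. This is only an illustration in Python, not Gambas, and it does not use gb.pcre; the helper name word_spans and the pattern [^\W\d_]+ (one or more letters) are assumptions made for this example. The offsets are zero-based as in the pseudo-code, so a Gambas version using Mid$() would need p + 1, since Mid$() counts positions from 1.

```python
import re

def word_spans(text):
    """Return (offset, length) pairs, one per longest run of letters.

    Hypothetical helper for illustration: a single scan records where
    each word starts and how long it is, without building the words.
    """
    # [^\W\d_]+ = word characters minus digits and underscore, i.e. letters
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[^\W\d_]+", text)]

s = "abc defg hijkl"
spans = word_spans(s)

# Retrieve each word afterwards from its offset and length,
# mirroring "print mid(s, p[i], l[i])" in the pseudo-code.
for p, l in spans:
    print(s[p:p + l])
```

With the sample string, the recorded pairs are (0, 3), (4, 4) and (9, 5), the same values listed in the pseudo-code. Whether this beats Split() or String.Mid() in Gambas would have to be measured there.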