Dear Bill,

Indeed, there are other cases as well - as documented.

Various Regex sites give the warning to avoid the legacy syntax
"[[:<:]]", so this is the alternative syntax:

strsplit(split="\\b(?=\\w)", "One, two; three!", perl=TRUE)
# "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"

gsub("\\b(?=\\w)", "#", "One, two; three!", perl=TRUE)
# "#One, #two; #three!"

Sincerely,

Leonard

On 5/5/2023 6:19 PM, Bill Dunlap wrote:
> https://eu01.z.antigena.com/l/BgIBOxsm88PwDTBiTTrQ784MFk2oGZVOA3RMHiarAZuyoEemKrcnpfJeD8X0FgxRDG33qHZho~NriRCbhv9_Ffr3EOfqn2vpaNUAlCDjQ8nOyVUgPM2iGnHi-qpN54kl1YVO_gHimn0m2ZJ68ntGtysras~0mRMDuAgwbTXsQcQ~
>
> (from 2016, still labelled 'UNCONFIRMED') contains some other examples
> of strsplit misbehaving when using 0-length perl look-behinds.  E.g.,
>
> > strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
> [1] "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"
> > gsub(pattern="[[:<:]]", "#", "One, two; three!", perl=TRUE)
> [1] "#One, #two; #three!"
>
> The bug report includes the comment
>
>     It may be possible that strsplit is not using the startoffset argument
>     to pcre_exec
>
>     pcre/pcre/doc/html/pcreapi.html
>     A non-zero starting offset is useful when searching for another match
>     in the same subject by calling pcre_exec() again after a previous
>     success. Setting startoffset differs from just passing over a
>     shortened string and setting PCRE_NOTBOL in the case of a pattern that
>     begins with any kind of lookbehind.
>
> or it could be something else.
>
>
> On Fri, May 5, 2023 at 3:25 AM Ivan Krylov <krylov.r...@gmail.com> wrote:
>
>     On Thu, 4 May 2023 23:59:33 +0300
>     Leonard Mada via R-help <r-help@r-project.org> wrote:
>
>     > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
>     > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>     >
>     > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
>     > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>     >
>     > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
>     > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>     >
>     >
>     > Is this correct?
>
>     Perl seems to return the results you expect:
>
>     $ perl -E '
>     say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
>     for (
>     qr[ |(?=,)|(?<=,)(?![ ])],
>     qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
>     qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
>     )'
>     (?^u: |(?=,)|(?<=,)(?![ ])):
>      "a" "bc" "," "def" "," "adef" "," "," "gh"
>     (?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
>      "a" "bc" "," "def" "," "adef" "," "," "gh"
>     (?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
>      "a" "bc" "," "def" "," "adef" "," "," "gh"
>
>     The same thing happens when I ask R to replace the separators instead
>     of splitting by them:
>
>     sapply(setNames(nm = c(
>      " |(?=,)|(?<=,)(?![ ])",
>      " |(?<! )(?=,)|(?<=,)(?![ ])",
>      " |(?<! )(?=,)|(?<=,)(?=[^ ])")
>     ), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
>     #              |(?=,)|(?<=,)(?![ ])         |(?<! )(?=,)|(?<=,)(?![ ])
>     # "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh"
>     #        |(?<! )(?=,)|(?<=,)(?=[^ ])
>     # "a[]bc[],[]def[],[]adef[],[],[]gh"
>
>     I think that something strange happens when the delimiter pattern
>     matches more than once in the same place:
>
>     gsub(
>      '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
>      perl = TRUE
>     )
>     # [1] "split here -->[]<-- split here"
>
>     (Both Perl's split() and s///g agree with R's gsub() here, although I
>     would have accepted "split here -->[][]<-- split here" too.)
>
>     On the other hand, the following doesn't look right:
>
>     strsplit(
>      'split here --><-- split here', '(?=<--)|(?<=-->)',
>      perl = TRUE
>     )
>     # [[1]]
>     # [1] "split here -->" "<"              "-- split here"
>
>     The "<" is definitely not followed by "<--", and the rightmost "--" is
>     definitely not preceded by "-->".
>
>     Perhaps strsplit() incorrectly advances the match position after one
>     match?
>
>     --
>     Best regards,
>     Ivan
>

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
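
The examples in this thread show gsub() placing zero-length matches where Perl does, while strsplit() does not. That suggests a possible workaround, sketched here under stated assumptions: let gsub() mark the split points with a character that does not occur in the input, then split on that literal marker. The helper name split_at and its marker argument are hypothetical (not part of base R), and the sketch assumes that "\001" never appears in the strings being split.

```r
# Hypothetical helper, not part of base R: split `x` at the places where
# gsub() finds `pattern`, instead of relying on strsplit()'s own matching.
# Assumption: the marker character (default "\001") never occurs in `x`.
split_at <- function(x, pattern, marker = "\001") {
  stopifnot(!any(grepl(marker, x, fixed = TRUE)))
  # Step 1: let gsub() replace every match (zero-length or not) by the marker.
  marked <- gsub(pattern, marker, x, perl = TRUE)
  # Step 2: split on the literal marker; no regex is involved at this point.
  strsplit(marked, marker, fixed = TRUE)
}

split_at("split here --><-- split here", "(?=<--)|(?<=-->)")[[1]]
# "split here -->" "<-- split here"

split_at("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])")[[1]]
# "a" "bc" "," "def" "," "adef" "," "," "gh"
```

Like strsplit() itself, this removes whatever text the pattern consumes (the space in the second call). A match at the very start of the string, as with "\\b(?=\\w)" on "One, two; three!", leaves a leading "" in the result, which is how strsplit() treats a separator at the beginning of a string.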