https://bugs.r-project.org/show_bug.cgi?id=16745 (from 2016, still labelled "UNCONFIRMED") contains some other examples of strsplit misbehaving when using zero-length Perl look-behinds. E.g.,

> strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
 [1] "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"
> gsub(pattern="[[:<:]]", "#", "One, two; three!", perl=TRUE)
[1] "#One, #two; #three!"

The bug report includes the comment

  It may be possible that strsplit is not using the startoffset argument
  to pcre_exec

    pcre/pcre/doc/html/pcreapi.html
      A non-zero starting offset is useful when searching for another
      match in the same subject by calling pcre_exec() again after a
      previous success. Setting startoffset differs from just passing
      over a shortened string and setting PCRE_NOTBOL in the case of a
      pattern that begins with any kind of lookbehind.

or it could be something else.

On Fri, May 5, 2023 at 3:25 AM Ivan Krylov <krylov.r...@gmail.com> wrote:

> On Thu, 4 May 2023 23:59:33 +0300
> Leonard Mada via R-help <r-help@r-project.org> wrote:
>
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> > perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> >
> > Is this correct?
>
> Perl seems to return the results you expect:
>
> $ perl -E '
> say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
> for (
>  qr[ |(?=,)|(?<=,)(?![ ])],
>  qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
>  qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
> )'
> (?^u: |(?=,)|(?<=,)(?![ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> The same thing happens when I ask R to replace the separators instead
> of splitting by them:
>
> sapply(setNames(nm = c(
>  " |(?=,)|(?<=,)(?![ ])",
>  " |(?<! )(?=,)|(?<=,)(?![ ])",
>  " |(?<! )(?=,)|(?<=,)(?=[^ ])")
> ), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
> #               |(?=,)|(?<=,)(?![ ])        |(?<! )(?=,)|(?<=,)(?![ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh"
> #       |(?<! )(?=,)|(?<=,)(?=[^ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh"
>
> I think that something strange happens when the delimiter pattern
> matches more than once in the same place:
>
> gsub(
>  '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
>  perl = TRUE
> )
> # [1] "split here -->[]<-- split here"
>
> (Both Perl's split() and s///g agree with R's gsub() here, although I
> would have accepted "split here -->[][]<-- split here" too.)
>
> On the other hand, the following doesn't look right:
>
> strsplit(
>  'split here --><-- split here', '(?=<--)|(?<=-->)',
>  perl = TRUE
> )
> # [[1]]
> # [1] "split here -->" "<" "-- split here"
>
> The "<" is definitely not followed by "<--", and the rightmost "--" is
> definitely not preceded by "-->".
>
> Perhaps strsplit() incorrectly advances the match position after one
> match?
>
> --
> Best regards,
> Ivan
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
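
A possible workaround, as a rough sketch only: take the match positions from gregexpr() and cut the string with substring(), rather than calling strsplit(). The helper name split_at_matches is made up for illustration and is not part of base R. The sketch assumes that gregexpr(perl = TRUE) searches the whole string from an offset, so that look-behinds can still see the preceding text; the gsub() results quoted above suggest this, but I have not checked it against the R sources. It also drops an empty match that sits directly after the previous match, which is what gsub() and Perl appear to do in the examples above.

```r
# Hypothetical helper (not part of base R): split a single string at the
# matches reported by gregexpr() instead of relying on strsplit().
split_at_matches <- function(x, pattern) {
  stopifnot(is.character(x), length(x) == 1L)
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  if (m[1] == -1L) return(x)          # no match: nothing to split

  pos <- as.integer(m)                # 1-based start of each match
  len <- attr(m, "match.length")      # 0 for pure look-around matches

  # Drop an empty match that begins exactly where the previous match
  # ended (or at the very start of the string), so that it does not
  # produce an extra "" piece.
  prev_end <- c(0L, head(pos + len - 1L, -1L))
  keep <- !(len == 0L & pos == prev_end + 1L)
  pos <- pos[keep]
  len <- len[keep]

  # Pieces run from the end of one match to the start of the next.
  substring(x, c(1L, pos + len), c(pos - 1L, nchar(x)))
}

split_at_matches("split here --><-- split here", "(?=<--)|(?<=-->)")
split_at_matches("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])")
```

Unlike Perl's split(), this keeps trailing empty pieces, and it handles one string at a time; wrap it in lapply() for a vector.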