Re: [Rd] Feature request: non-dropping regmatches/strextract
That sounds great! Thank you for your consideration. Best, CG __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Feature request: non-dropping regmatches/strextract
After some discussion within R core, we decided that a "nomatch" argument on regmatches() may be a good initial step. We might add a new function later that combines the regexpr() and regmatches() steps. The gregexpr() and regexec() inputs are both lists, so it's not clear whether a "nomatch" value would be relevant in those cases (the elements are empty).

On Mon, Sep 2, 2019 at 11:38 AM Cyclic Group Z_1 wrote:
> I think that's a good reason for not including this in regmatches; you're
> right, its name is somewhat suggestive of yielding matches. Also, that sounds
> like a great design for strcapture with an atomic prototype.
>
> Best,
> CG

--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Re: [Rd] Feature request: non-dropping regmatches/strextract
I think that's a good reason for not including this in regmatches; you're right, its name is somewhat suggestive of yielding matches. Also, that sounds like a great design for strcapture with an atomic prototype.

Best,
CG
Re: [Rd] Feature request: non-dropping regmatches/strextract
Just started thinking about this. The name of regmatches() suggests that it will only extract the matches but not return anything for the non-matches. We might need another function that returns a value for non-matches. Perhaps the value should be the empty string for non-matches and NA for matches to NA. The rationale is that we delegate to regexpr() (at least conceptually), and it returns a "matching region" which would be empty when there is no match.

We could allow strcapture() to accept an atomic vector as a prototype, which would do what you want for regexec() (NA on no match, empty string on empty capture). Then we could call the regexpr()-based function strextract(). What do you think?

Michael

On Thu, Aug 29, 2019 at 3:27 PM Cyclic Group Z_1 wrote:
> Thank you! I greatly appreciate your consideration, though of course it is up
> to you. I think many people switch to stringr/stringi simply because
> functions in those packages have some consistent design choices, for example,
> they do not drop empty/missing matches, which facilitates array-based
> programming. For example, in the cases where one needs to make a new column
> in a data.frame (data.table, tibble, etc.) of regex extractions. Or in any
> other case where there needs to be an element-wise correspondence between
> input and output. I think insertion of NA_character_ to prevent dropping
> indices seems like the natural choice for an array language (which, I think,
> motivated the creation of stringr/stringi). While those are great packages
> and this behavior can be easily replicated with simple wrappers, string
> operations are normally easy to accomplish in base languages, so this seems
> like something that would be appropriate to have in base. For example, MATLAB
> and Pandas regex both allow non-dropping empty matches (though of course I
> acknowledge Pandas is not a base language).
>
> Best,
> CG
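The semantics proposed in this message (empty string when nothing matches, NA when the input itself is NA) can be sketched in a few lines. This is only an illustration under stated assumptions: the name `strextract_sketch` is hypothetical, and it is not the non-exported utils function nor anything R core committed to.

```r
# Hypothetical sketch: one output element per input element.
#   no match  -> ""  (an empty "matching region")
#   NA input  -> NA
strextract_sketch <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)      # -1 for no match, NA for NA input
  out <- character(length(x))        # start with "" everywhere
  out[is.na(m)] <- NA_character_     # propagate missing inputs
  hit <- !is.na(m) & m > -1L         # positions that actually matched
  out[hit] <- regmatches(x, m)       # regmatches() returns only the hits
  out
}

res <- strextract_sketch("[0-9]+", c("a1", NA, "b"))
```

Because the result always has the same length as `x`, it can be assigned directly to a data.frame column.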
Re: [Rd] Feature request: non-dropping regmatches/strextract
I'd be happy to entertain patches or at least more specific suggestions to improve strextract() and strcapture(). I hadn't exported strextract(), because I wasn't quite sure how it should behave. This feedback should be helpful.

Thanks,
Michael

On Thu, Aug 29, 2019 at 2:20 PM Cyclic Group Z_1 via R-devel wrote:
> Thank you, I am aware that there are packages that can accomplish this. I
> mentioned stringr::str_extract as a function that does not drop empty
> matches. I think that the behavior of regmatches(..., regexpr(...)) in base R
> should permit an option to prevent dropping of empty matches both for sake of
> consistency with the rest of the language (missing data does not yield a
> dropped index in other sorts of R functions, and an empty match conceptually
> corresponds with missing data) and facility of use in data.frames. The
> behavior of regmatches(..., gregexpr(...)) is not objectionable to me, as
> lists do not drop indices when they contain character(0) vectors.
> Alternatively, perhaps this should be reflected in the (currently
> non-exported) strextract.
>
> Best,
> CG
Re: [Rd] Feature request: non-dropping regmatches/strextract
Using strcapture seems like a great workaround for use cases of this kind, at least in base R. I agree as well that filling with NA for regmatches(..., gregexpr(...)) makes less sense, given the outputs are lists and thus are retained in the list. Also, I suppose in the meantime the stringr package can be used when non-dropping vector outputs are desired.

However, I do think that non-dropping regex string extraction/matching in vector outputs from regmatches(..., regexpr(...)) or strextract would be a great (optional) design feature to have in base R for sake of consistency with the rest of the language (missing values, denoted by NA, are generally not dropped from vectors elsewhere and seem to agree conceptually with empty matches) and would help R to reach greater feature parity with MATLAB and Pandas in this respect (granted, Pandas is not technically a language on its own).

Although I have written personal wrappers and used stringr to accomplish the non-dropping behavior in the past, I have nevertheless found the behavior of base R string operations mildly astonishing (in the sense of POLA) and think others may have as well. As the stringr documentation puts it, "they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R." Since consistent, robust string operations are often a standard base feature of other data science and scientific programming languages, I think this minor change would be a great improvement to the language and hopefully help promote adoption of R, especially given the surge in text-based data analysis in recent years.

Alternatively, although I generally don't use the Tidyverse packages very often, stringr seems like a great candidate for inclusion in base or recommended R if the R Core team and the package developer see it fitting (just a suggestion and probably a long shot). However, I will try not to belabor this point further.

In any case, thank you!

Best,
CG
Re: [Rd] Feature request: non-dropping regmatches/strextract
Using a non-capturing group, "(?:...)" instead of "(...)", simplifies my example a bit:

> x <- c("Groucho ", "", "Harpo")
> strcapture("([[:alpha:]]+)?(?: *<([[:alpha:]. ]+@[[:alpha:]. ]+)>)?", x,
+     proto=data.frame(Name=character(), Address=character(), stringsAsFactors=FALSE))
     Name          Address
1 Groucho grou...@marx.com
2           ch...@marx.com
3   Harpo

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Aug 15, 2019 at 1:04 PM William Dunlap wrote:
> [...]
Re: [Rd] Feature request: non-dropping regmatches/strextract
I don't care much for regmatches and haven't tried strextract, but I think replacing the character(0) by NA_character_ is almost always inappropriate if the match information comes from gregexpr.

I think strcapture() does a pretty good job of what I think you are trying to do. Perhaps adding an argument to map no match to NA instead of "" would give you just what you wanted.

> x <- c("Groucho ", "", "Harpo")
> d <- strcapture("([[:alpha:]]+)?( *<([[:alpha:]. ]+@[[:alpha:]. ]+)>)?", x,
+     proto=data.frame(Name=character(), Junk=character(), Address=character(),
+                      stringsAsFactors=FALSE))
> d[c("Name", "Address")]
     Name          Address
1 Groucho grou...@marx.com
2           ch...@marx.com
3   Harpo
> str(.Last.value)
'data.frame': 3 obs. of 2 variables:
 $ Name   : chr "Groucho" "" "Harpo"
 $ Address: chr "grou...@marx.com" "ch...@marx.com" ""

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Aug 15, 2019 at 11:31 AM Cyclic Group Z_1 wrote:
> [...]
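As a small illustration of why strcapture() suits the one-row-per-input use case, the sketch below uses made-up inputs (the example.com address and the simplified pattern are assumptions, not the data from the message above). Unlike the all-optional pattern in the message, this pattern can fail to match, and strcapture() then fills the whole row with NA rather than dropping it.

```r
# Made-up inputs: the second element has no "<address>" part and so
# does not match the pattern at all.
x <- c("Groucho <groucho@example.com>", "Harpo")

# One column per capture group; column names and types come from 'proto'.
proto <- data.frame(Name = character(), Address = character(),
                    stringsAsFactors = FALSE)
d <- strcapture("([[:alpha:]]+) <([^>]+)>", x, proto = proto)

# d has one row per element of x; the non-matching row is all NA.
```

The result keeps the element-wise correspondence with `x`, so `cbind(x, d)` lines up without any subsetting.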
Re: [Rd] Feature request: non-dropping regmatches/strextract
I do think keeping the default behavior is desirable for backwards compatibility; my suggestion is not to change default behavior but to add an optional argument that allows a different behavior. Although this can be implemented in a user-defined function, retaining empty matches facilitates programmatic use, and seems to be something that should be available in base R. It is available, for example, in MATLAB, a comparable array language.

Alternatively, perhaps a nomatch (or maybe emptymatch) argument in the spirit of `[.data.table`? That is, an argument nomatch where nomatch = NULL (the default) results in drops for vector outputs and character(0) for list outputs, nomatch = NA results in insertion of NA_character_, and nomatch = '' results in insertion of the empty string.

I can submit proposed patch code if others think this is a good idea.

What are your thoughts on the proposed alteration to (currently nonexported) strextract? I assume (maybe wrongly) that the plan is to eventually export that function.

Thank you,
CG
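The nomatch semantics described above could look roughly like the following user-level wrapper. The function name `regmatches_nomatch` is hypothetical and the code is a sketch of the proposal, not a patch to regmatches() itself.

```r
# Hypothetical wrapper around regmatches() with a 'nomatch' argument.
#   nomatch = NULL : current behaviour (drop / character(0))
#   nomatch = NA   : insert NA_character_ where nothing matched
#   nomatch = ""   : insert the empty string where nothing matched
regmatches_nomatch <- function(x, m, nomatch = NULL) {
  res <- regmatches(x, m)
  if (is.null(nomatch)) return(res)
  fill <- as.character(nomatch)[1L]
  if (is.list(m)) {
    # gregexpr()/regexec() input: replace empty elements, keep the list
    res[lengths(res) == 0L] <- list(fill)
    return(res)
  }
  # regexpr() input: rebuild a vector aligned with x
  hit <- !is.na(m) & m > -1L
  out <- rep(fill, length(x))
  out[hit] <- res
  out
}

x <- c("a1", "b", "c22")
m <- regexpr("[0-9]+", x)
dropped  <- regmatches_nomatch(x, m)                # "1" "22"
with_na  <- regmatches_nomatch(x, m, nomatch = NA)  # "1" NA "22"
with_str <- regmatches_nomatch(x, m, nomatch = "")  # "1" "" "22"
```

Keeping NULL as the default leaves existing code untouched, which is the backwards-compatibility point made in the message.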
Re: [Rd] Feature request: non-dropping regmatches/strextract
Changing the default behavior of regmatches would break its use with gregexpr, where the number of matches per input element varies, so a zero-length character vector makes more sense than NA_character_.

> x <- c("John Doe", "e e cummings", "Juan de la Madrid")
> m <- gregexpr("[A-Z]", x)
> regmatches(x,m)
[[1]]
[1] "J" "D"

[[2]]
character(0)

[[3]]
[1] "J" "M"

> vapply(.Last.value, function(x)paste(paste0(x, "."),collapse=""), "")
[1] "J.D." "."    "J.M."

(We don't want e e cummings's initials mapped to "NA.")

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Aug 15, 2019 at 12:15 AM Cyclic Group Z_1 via R-devel <r-devel@r-project.org> wrote:
> [...]
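The point about "NA." can be checked directly. The sketch below reruns the initials example and then, purely for comparison, replaces the empty list elements with NA_character_ (the helper name `initials` is introduced here for illustration).

```r
x <- c("John Doe", "e e cummings", "Juan de la Madrid")
caps <- regmatches(x, gregexpr("[A-Z]", x))

# Collapse each element's capital letters into dotted initials.
initials <- function(v) paste(paste0(v, "."), collapse = "")

# Current behaviour: character(0) collapses to a lone "."
with_empty <- vapply(caps, initials, "")

# If empty matches were replaced by NA_character_, paste0() would
# turn the NA into the literal string "NA".
caps_na <- caps
caps_na[lengths(caps_na) == 0L] <- NA_character_
with_na <- vapply(caps_na, initials, "")
```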
[Rd] Feature request: non-dropping regmatches/strextract
A very common use case for regmatches is to extract regex matches into a new column in a data.frame (or data.table, etc.) or otherwise use the extracted strings alongside the input. However, the default behavior is to drop empty matches, which results in mismatches in column length if reassignment is done without subsetting.

For consistency with other R functions and compatibility with this use case, it would be nice if regmatches did not automatically drop empty matches and would instead insert an NA_character_ value (similar to stringr::str_extract). This alternative regmatches could be implemented through an optional drop argument, a new function, or mentioned in the documentation (a la resample in ?sample).

Alternatively, at the moment, there is a non-exported function strextract in utils which is very similar to stringr::str_extract. It would be great if this function, once exported, were to include a drop argument to prevent dropping positions with no matches.

An example solution (last option):

strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop = TRUE) {
  m <- regexec(pattern, x, perl = perl, useBytes = useBytes)
  result <- regmatches(x, m)

  if (isTRUE(drop)) {
    # current behavior: elements with no match disappear
    unlist(result)
  } else if (isFALSE(drop)) {
    # keep one element per input by filling non-matches with NA
    unlist({result[lengths(result) == 0] <- NA_character_; result})
  } else {
    stop("Invalid argument for `drop`")
  }
}

Based on Ricardo Saporta's response to "How to prevent regmatches drop non matches?"

--CG
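To make the column-length mismatch concrete, here is a minimal sketch with made-up data (the vector `x` and the column name `num` are illustrative, not from the post). It shows the shortened result, the failed column assignment, and one way to realign the matches by hand.

```r
# Illustrative input: the second element contains no digits.
x <- c("apple 12", "banana", "cherry 7")
m <- regexpr("[0-9]+", x)

# regmatches() returns only the matched substrings, so the result is
# shorter than the input.
matched <- regmatches(x, m)

# Assigning the shortened vector as a column fails: 2 values, 3 rows.
df <- data.frame(x = x, stringsAsFactors = FALSE)
assign_error <- tryCatch({ df$num <- matched; NULL },
                         error = function(e) conditionMessage(e))

# Realign by hand: start from all-NA and fill the positions that matched.
hit <- !is.na(m) & m > -1L
df$num <- NA_character_
df$num[hit] <- matched
```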