Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-09-02 Thread Cyclic Group Z_1 via R-devel
That sounds great! Thank you for your consideration.

Best,
CG

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-09-02 Thread Michael Lawrence via R-devel
After some discussion within R core, we decided that a "nomatch"
argument on regmatches() may be a good initial step. We might add a
new function later that combines the regexpr() and regmatches() steps.
The gregexpr() and regexec() inputs are both lists so it's not clear
whether a "nomatch" value would be relevant (the elements are empty)
in those cases.
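
The distinction between the vector and list cases can be seen in a minimal base R sketch. The input vector here is made up for illustration and is not taken from the thread.

```r
# Illustrative input; the values are made up for this sketch.
x <- c("a1", "b", "c3")

# regexpr(): one match position per input element, -1 where nothing matched.
m <- regexpr("[0-9]", x)
v <- regmatches(x, m)   # atomic vector; the non-matching element is dropped
length(v) < length(x)   # TRUE

# gregexpr(): regmatches() returns a list with one element per input string.
g <- regmatches(x, gregexpr("[0-9]", x))
lengths(g)              # the non-match is retained as a zero-length element
```

In the list case every input keeps its position, which is why a "nomatch" fill value is less obviously needed there.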

-- 
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
micha...@gene.com



Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-09-02 Thread Cyclic Group Z_1 via R-devel
I think that's a good reason for not including this in regmatches; you're 
right, its name is somewhat suggestive of yielding matches. Also, that sounds 
like a great design for strcapture with an atomic prototype.

Best,
CG



Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-08-29 Thread Michael Lawrence via R-devel
Just started thinking about this. The name of regmatches() suggests
that it will only extract the matches but not return anything for the
non-matches. We might need another function that returns a value for
non-matches. Perhaps the value should be the empty string for
non-matches and NA for matches to NA. The rationale is that we
delegate to regexpr() (at least conceptually), and it returns a
"matching region" which would be empty when there is no match. We
could allow strcapture() to accept an atomic vector as a prototype,
which would do what you want for regexec() (NA on no match, empty
string on empty capture). Then we could call the regexpr()-based
function strextract().
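
A minimal sketch of the semantics described above: empty string for a non-match, NA when the input itself is NA. The name strextract_sketch is hypothetical. This is not the non-exported utils function, only one way the idea might look when built on regexpr().

```r
# Hypothetical helper illustrating the proposed semantics; not the utils code.
strextract_sketch <- function(pattern, x, perl = FALSE) {
  m <- regexpr(pattern, x, perl = perl)
  out <- rep("", length(x))          # an empty "matching region" by default
  out[is.na(x)] <- NA_character_     # matching against NA yields NA
  hit <- !is.na(m) & m != -1L        # positions that actually matched
  out[hit] <- regmatches(x, m)       # regmatches() returns only the hits
  out
}

res <- strextract_sketch("[0-9]+", c("a12", "b", NA))
```

The result has the same length as the input, so it can be assigned to a data.frame column directly.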

What do you think?

Michael

On Thu, Aug 29, 2019 at 3:27 PM Cyclic Group Z_1 wrote:
>
> Thank you! I greatly appreciate your consideration, though of course it is up 
> to you. I think many people switch to stringr/stringi simply because 
> functions in those packages have some consistent design choices, for example, 
> they do not drop empty/missing matches, which facilitates array-based 
> programming. For example, in the cases where one needs to make a new column 
> in a data.frame (data.table, tibble, etc.) of regex extractions. Or in any 
> other case where there needs to be an element-wise correspondence between 
> input and output. I think insertion of NA_character_ to prevent dropping 
> indices seems like the natural choice for an array language (which, I think, 
> motivated the creation of stringr/stringi). While those are great packages 
> and this behavior can be easily replicated with simple wrappers, string 
> operations are normally easy to accomplish in base languages, so this seems 
> like something that would be appropriate to have in base. For example, MATLAB 
> and Pandas regex both allow non-dropping empty matches (though of course I 
> acknowledge Pandas is not a base language).
>
> Best,
> CG





Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-08-29 Thread Michael Lawrence via R-devel
I'd be happy to entertain patches or at least more specific
suggestions to improve strextract() and strcapture(). I hadn't
exported strextract(), because I wasn't quite sure how it should
behave. This feedback should be helpful.

Thanks,
Michael

On Thu, Aug 29, 2019 at 2:20 PM Cyclic Group Z_1 via R-devel wrote:
>
> Thank you, I am aware that there are packages that can accomplish this. I 
> mentioned stringr::str_extract as a function that does not drop empty 
> matches. I think that the behavior of regmatches(..., regexpr(...)) in base R 
> should permit an option to prevent dropping of empty matches both for sake of 
> consistency with the rest of the language (missing data does not yield a 
> dropped index in other sorts of R functions, and an empty match conceptually 
> corresponds with missing data) and facility of use in data.frames. The 
> behavior of regmatches(..., gregexpr(...)) is not objectionable to me, as 
> lists do not drop indices when they contain character(0) vectors. 
> Alternatively, perhaps this should be reflected in the (currently 
> non-exported) strextract.
>
> Best,
> CG





Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-08-16 Thread Cyclic Group Z_1 via R-devel
Using strcapture seems like a great workaround for use cases of this kind, at 
least in base R. I agree as well that filling with NA for regmatches(..., 
gregexpr(...)) makes less sense, given that the outputs are lists and empty 
matches are therefore retained as list elements. Also, I suppose in the 
meantime the stringr package can be used when non-dropping vector outputs are 
desired.

However, I do think that non-dropping regex string extraction/matching in 
vector outputs from regmatches(..., regexpr(...)) or strextract would be a 
great (optional) design feature to have in base R for sake of consistency with 
the rest of the language (missing values, denoted by NA, are generally not 
dropped from vectors elsewhere and seem to agree conceptually with empty 
matches) and would help R to reach greater feature parity with MATLAB and 
Pandas in this respect (granted, Pandas is not technically a language on its 
own).

Although I have written personal wrappers and used stringr to accomplish the 
non-dropping behavior in the past, I have nevertheless found the behavior of 
base R string operations mildly astonishing (in the sense of POLA) and think 
others may have as well. As the stringr documentation puts it, "they lag behind 
the string operations in other programming languages, so that some things that 
are easy to do in languages like Ruby or Python are rather hard to do in R." 
Since consistent, robust string operations are often a standard base feature of 
other data science and scientific programming languages, I think this minor 
change would be a great improvement to the language and hopefully help promote 
adoption of R, especially given the surge in text-based data analysis in recent 
years.

Alternatively, although I generally don't use the Tidyverse packages very 
often, stringr seems like a great candidate for inclusion in base or 
recommended R if the R Core team and the package developer see it fitting (just 
a suggestion and probably a long shot). 

However, I will try not to belabor this point further. In any case, thank you!

Best,
CG


Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-08-15 Thread William Dunlap via R-devel
Using a non-capturing group, "(?:...)" instead of "(...)", simplifies my
example a bit:

> x <- c("Groucho ", "", "Harpo")
> strcapture("([[:alpha:]]+)?(?: *<([[:alpha:]. ]+@[[:alpha:]. ]+)>)?", x,
+            proto=data.frame(Name=character(), Address=character(),
+                             stringsAsFactors=FALSE))
     Name          Address
1 Groucho grou...@marx.com
2           ch...@marx.com
3   Harpo
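
For readers who want to run this, here is a self-contained variant. The input strings use placeholder example.org addresses, which are an assumption for illustration and not the addresses from the message above.

```r
# Hypothetical inputs: the addresses are placeholders chosen for this sketch.
x <- c("Groucho <groucho@example.org>", "<chico@example.org>", "Harpo")

# One row per input string; each capture group becomes a column of `proto`.
d <- utils::strcapture(
  "([[:alpha:]]+)?(?: *<([[:alpha:]. ]+@[[:alpha:]. ]+)>)?", x,
  proto = data.frame(Name = character(), Address = character(),
                     stringsAsFactors = FALSE))

nrow(d) == length(x)   # TRUE: no input position is dropped
```

Because the non-capturing group does not produce a column, no "Junk" column is needed in the prototype.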

Bill Dunlap
TIBCO Software
wdunlap tibco.com





Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-08-15 Thread William Dunlap via R-devel
I don't care much for regmatches and haven't tried strextract, but I think
replacing the character(0) by NA_character_ is almost always inappropriate
if the match information comes from gregexpr.

I think strcapture() does a pretty good job of what I think you are trying
to do.  Perhaps adding an argument to map no match to NA instead of ""
would give you just what you wanted.

> x <- c("Groucho ", "", "Harpo")
> d <- strcapture("([[:alpha:]]+)?( *<([[:alpha:]. ]+@[[:alpha:]. ]+)>)?",
+                 x, proto=data.frame(Name=character(), Junk=character(),
+                                     Address=character(), stringsAsFactors=FALSE))
> d[c("Name", "Address")]
     Name          Address
1 Groucho grou...@marx.com
2           ch...@marx.com
3   Harpo
> str(.Last.value)
'data.frame':   3 obs. of  2 variables:
 $ Name   : chr  "Groucho" "" "Harpo"
 $ Address: chr  "grou...@marx.com" "ch...@marx.com" ""

Bill Dunlap
TIBCO Software
wdunlap tibco.com




Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-08-15 Thread Cyclic Group Z_1 via R-devel
I do think keeping the default behavior is desirable for backwards 
compatibility; my suggestion is not to change default behavior but to add an 
optional argument that allows a different behavior. Although this can be 
implemented in a user-defined function, retaining empty matches facilitates 
programmatic use, and seems to be something that should be available in base R. 
It is available, for example, in MATLAB, a comparable array language.

Alternatively, perhaps a nomatch (or maybe emptymatch) argument in the spirit 
of `[.data.table`? That is, an argument nomatch where nomatch = NULL (the 
default) results in drops for vector outputs and character(0) for list outputs 
and nomatch = NA results in insertion of NA_character_, and nomatch = '' 
results in insertion of empty string.
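
A rough sketch of how such an argument could behave for the regexpr() (vector) case. The wrapper name regmatches_nomatch is hypothetical and this is not a proposed patch to base R; the list case is left out.

```r
# Hypothetical wrapper around regmatches() for regexpr()-style match data.
regmatches_nomatch <- function(x, m, nomatch = NULL) {
  found <- regmatches(x, m)            # current behaviour: non-matches dropped
  if (is.null(nomatch)) return(found)  # nomatch = NULL keeps the default
  out <- rep(as.character(nomatch), length.out = length(x))
  out[!is.na(m) & m != -1L] <- found   # fill matched positions only
  out
}

x <- c("id 42", "none", "id 7")
m <- regexpr("[0-9]+", x)
dropped  <- regmatches_nomatch(x, m)                # length 2
with_na  <- regmatches_nomatch(x, m, nomatch = NA)  # length 3, NA in position 2
with_str <- regmatches_nomatch(x, m, nomatch = "")  # length 3, "" in position 2
```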

I can submit proposed patch code if others think this is a good idea.

What are your thoughts on the proposed alteration to (currently nonexported) 
strextract? I assume (maybe wrongly) that the plan is to eventually export that 
function.

Thank you,
CG



Re: [Rd] Feature request: non-dropping regmatches/strextract

2019-08-15 Thread William Dunlap via R-devel
Changing the default behavior of regmatches would break its use with
gregexpr, where the number of matches per input element varies, so a
zero-length character vector makes more sense than NA_character_.

> x <- c("John Doe", "e e cummings", "Juan de la Madrid")
> m <- gregexpr("[A-Z]", x)
> regmatches(x,m)
[[1]]
[1] "J" "D"

[[2]]
character(0)

[[3]]
[1] "J" "M"

> vapply(.Last.value, function(x)paste(paste0(x, "."),collapse=""), "")
[1] "J.D." "."    "J.M."

(We don't want e e cummings initials mapped to "NA.")
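
To make the concern concrete, the following sketch simulates what would happen if empty results were replaced by NA before pasting. The replacement step is a hypothetical stand-in for the changed default, not existing behavior.

```r
x <- c("John Doe", "e e cummings", "Juan de la Madrid")
caps <- regmatches(x, gregexpr("[A-Z]", x))
initials <- function(v) paste(paste0(v, "."), collapse = "")

# Current behaviour: the empty element pastes to a lone ".".
with_empty <- vapply(caps, initials, "")

# Hypothetical changed default: character(0) replaced by NA_character_.
caps_na <- lapply(caps, function(v) if (length(v) == 0L) NA_character_ else v)
with_na <- vapply(caps_na, initials, "")   # second element becomes "NA."
```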

Bill Dunlap
TIBCO Software
wdunlap tibco.com





[Rd] Feature request: non-dropping regmatches/strextract

2019-08-15 Thread Cyclic Group Z_1 via R-devel
A very common use case for regmatches is to extract regex matches into a new 
column in a data.frame (or data.table, etc.) or otherwise use the extracted 
strings alongside the input. However, the default behavior is to drop empty 
matches, which results in mismatches in column length if reassignment is done 
without subsetting.
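
A small illustration of the length mismatch, using a made-up data.frame, followed by the subsetting workaround that is currently needed:

```r
# Made-up data for illustration.
df <- data.frame(txt = c("a1", "b", "c3"), stringsAsFactors = FALSE)
m <- regexpr("[0-9]", df$txt)
matches <- regmatches(df$txt, m)   # only 2 values for 3 rows

# Direct assignment fails because the lengths differ.
direct <- tryCatch({ df$digit <- matches; "ok" }, error = function(e) "error")

# Workaround: pre-fill with NA and assign only into the matching rows.
df$digit <- NA_character_
df$digit[m != -1L] <- matches
```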

For consistency with other R functions and compatibility with this use case, it 
would be nice if regmatches did not automatically drop empty matches and would 
instead insert an NA_character_ value (similar to stringr::str_extract). This 
alternative regmatches could be implemented through an optional drop argument, 
a new function, or mentioned in the documentation (a la resample in ?sample). 

Alternatively, at the moment, there is a non-exported function strextract in 
utils which is very similar to stringr::str_extract. It would be great if this 
function, once exported, were to include a drop argument to prevent dropping 
positions with no matches. 

An example solution (last option):

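strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop = TRUE) {
  # regexec() match data gives a list, so non-matches survive as character(0)
  m <- regexec(pattern, x, perl = perl, useBytes = useBytes)
  result <- regmatches(x, m)

  if (isTRUE(drop)) {
    # Default: positions without a match are dropped
    unlist(result)
  } else if (isFALSE(drop)) {
    # Keep one element per input by filling non-matches with NA
    result[lengths(result) == 0] <- NA_character_
    unlist(result)
  } else {
    stop("Invalid argument for `drop`")
  }
}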

Based on Ricardo Saporta's response to "How to prevent regmatches drop non 
matches?"

--CG
