I'm new to R and very excited about its possibilities. But I'm struggling with some very simple things, probably because I haven't found the correct documentation. Here's a simple example which illustrates several of my problems.
Suppose I want to match a regexp against each string in a vector, and return all the matching substrings as vectors of strings.

    regexp <- "[ab]+"
    strlist <- c( "abc", "dbabddadd", "aaa" )
    matches <- gregexpr(regexp,strlist)

With this input, I'd want to return list( list("ab"), list("bab", "a"), list("aaa") ).

Now the matches object prints out as

    [[1]]
    [1] 1
    attr(,"match.length")
    [1] 2

    [[2]]
    [1] 2 7
    attr(,"match.length")
    [1] 3 1

    [[3]]
    [1] 1
    attr(,"match.length")
    [1] 3

which, if I'm interpreting this correctly, means that it is a list (not a vector, because vectors can only have atomic elements) of three elements, each of which is a vector of integers (the matching positions) with an attribute match.length (the lengths of the corresponding matches), which is in turn a vector of integers.

Question: is there a more compact standard print format for this? It's a bit disconcerting that printing the 2x2 list list(list(1,2),list(3,4)) takes 16 lines while the corresponding 2x2 array takes 2 lines! (I guess that arrays are "more native").

Now, matches[[1]], the first element of matches, describes the matches in the first string. To extract those strings, I can write

    substr( strlist[[1]],
            matches[[1]],
            attr(matches[[1]],"match.length")+matches[[1]]-1 )

which correctly gives "ab".

Question: This looks awfully clumsy; is there some more idiomatic way to do this, in particular to refer to the match.length attribute without using a quoted string or the attr function? attributes(matches[[1]])$match.length and attributes(matches[[1]])[[1]] work, but seem even clumsier.

Question: R uses names like xxx.yyy in many places. Is this just a convention to represent spaces (the way most languages use "_"), or is there some semantics attached to "."?

Question: Is it good practice in R to treat a string as a vector of characters so that R's powerful vector operations can be used on it? How would I do that?

Now suppose I want to list *all* the matches in matches[[2]].
I try:

    substr( strlist[[2]],
            matches[[2]],
            attr(matches[[2]],"match.length")+matches[[2]]-1 )

but only get the first one, so it seems that the recycling rule for vectors doesn't apply here (same thing with [2] instead of [[2]]). Where does recycling apply and where does it not?

Question: Is there some operator (using promises?) to make strlist[[2]] into a (lazy) infinite vector/list?

Now suppose I want to list *all* the matches in all the strings. How would I do that? The naive way, substr(strlist,matches, ...), doesn't work, partly because the attr operator doesn't distribute over lists (I see why it can't, but...).

Thanks in advance for your patience with these very elementary questions,

-s

Stavros Macrakis, Cambridge, MA
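
For reference, here is a minimal sketch of one way the extraction described above might be done in base R. It is not offered as the definitive idiom. It assumes an R version that provides regmatches(); the names starts, ends, second, extract_one, by_hand, all_matches, chars and is_ab are made up for this illustration. The key point is that substring(), unlike substr(), recycles the string along the start and stop vectors.

```r
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

# All matches in one string: substring() is vectorised over first/last,
# whereas substr() is only vectorised over its first argument.
starts <- matches[[2]]
ends   <- starts + attr(starts, "match.length") - 1
second <- substring(strlist[[2]], starts, ends)   # "bab" "a"

# All matches in all strings, by hand: walk the strings and the
# gregexpr() results in parallel. gregexpr() reports "no match" as -1.
extract_one <- function(s, m) {
  if (m[1] == -1) return(character(0))
  substring(s, m, m + attr(m, "match.length") - 1)
}
by_hand <- unname(Map(extract_one, strlist, matches))

# The same result with the helper that pairs with gregexpr()
# (assumption: regmatches() is available in the installed R).
all_matches <- regmatches(strlist, matches)

# A more compact display of a nested structure than print().
str(all_matches)

# Treating a string as a vector of single characters.
chars <- strsplit(strlist[[2]], "")[[1]]
is_ab <- chars %in% c("a", "b")   # logical vector, one entry per character
```

Note that the results are character vectors inside a list, rather than lists of lists, which seems the more natural shape in R because each match is itself just a string.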