Re: [R] Regular expressions and 2 dots

2019-06-28 Thread Rui Barradas

Hello,

Please always cc the list.

To know more about the regular expressions used by R, read

help("regex")

The one I used is not very complicated.

\\.   matches a dot; it is a metacharacter, so it needs to be escaped.
{2,}  repeated at least 2 times, with no upper limit.
.*    any character (.) repeated zero or more times (*).
$     end of string.

All together now.

"\\.{2,}.*$"  matches at least 2 dots followed by anything until
  the end of the string.

The replacement for this is the empty string "", so what is matched is 
removed.
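
To see each piece at work, the pattern can be probed with regmatches() before substituting. This is only a small sketch; the vector is the one from the original question and the names matched and cleaned are just for illustration.

```r
s <- c("colone..xx.", "coltwo.ft..rr.", "colthree.gh..az.", "colfour.DG..lm.")

# What the whole pattern matches in each string (the part that gets removed)
matched <- regmatches(s, regexpr("\\.{2,}.*$", s))
matched
# [1] "..xx." "..rr." "..az." "..lm."

# Replacing the match with "" leaves what comes before the first run of dots
cleaned <- sub("\\.{2,}.*$", "", s)
cleaned
# [1] "colone"      "coltwo.ft"   "colthree.gh" "colfour.DG"
```

Note that the single dot in "coltwo.ft" is left alone because {2,} requires at least two consecutive dots.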



Hope this helps,

Rui Barradas



On 28/06/19 at 09:33, ptit_b...@yahoo.fr wrote:

Just a small message to thank you for your fast and working solution.
I now have to understand it ...
Nice day.




__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regular expressions and 2 dots

2019-06-28 Thread Rui Barradas

Hello,

Try

s <- c( "colone..xx.","coltwo.ft..rr.","colthree.gh..az.","colfour.DG..lm.")

sub("\\.{2,}.*$", "", s)
#[1] "colone"  "coltwo.ft"   "colthree.gh" "colfour.DG"

On 28/06/19 at 09:00, lionel sicot via R-help wrote:

c( "colone..xx.","coltwo.ft..rr.","colthree.gh..az.","colfour.DG..lm.")




[R] Regular expressions and 2 dots

2019-06-28 Thread lionel sicot via R-help
Hello,
I have files from a piece of equipment with column names that include dots. I would
like to simplify these names, but all my attempts with sub and regular expressions
have been unsuccessful.
I have
c("colone..xx.","coltwo.ft..rr.","colthree.gh..az.","colfour.DG..lm.")
and I would like to have
c("colone","coltwo.ft","colthree.gh","colfour.DG")

That means deleting everything from the two dots onwards but keeping the string after
an intermediate dot, as in colfour.DG.
Thanks in advance for your help (a working solution would be good;
explanations with the working solution would be ideal).
Nice day,
Ptit Bleu.


Re: [R] Regular expressions, genbank

2014-02-06 Thread arun
Hi,
Maybe this helps:
lines1 <- readLines(textConnection('text to be ignored...
 CDS 687..3158
 /gene="AXL2"
 /note="plasma membrane glycoprotein"

other text to be ignored...

 CDS complement(3300..4037)
 /gene="REV7"

other text to be ignored...

 CDS <4500..4550
 /gene="REV7"

other text to be ignored...

 CDS complement(join(30708..31700,31931..31984))
 /gene="REV7"'))

library(gsubfn)  # provides strapply()
lines2 <- lines1[grep("CDS",lines1)]
lines3 <- lines2[!grepl("[<>]",lines2)]
indx <- grepl("complement",lines3)*1
mapply(`c`,indx,strapply(lines3,"([0-9]+)",as.numeric))
#[[1]]
#[1]    0  687 3158
#
#[[2]]
#[1]    1 3300 4037
#
#[[3]]
#[1] 1 30708 31700 31931 31984


If you want to have "," as sep:
 
lapply(mapply(`c`,indx,strapply(lines3,"([0-9]+)",as.numeric)),paste,collapse=", ")
A.K.





For sure, maybe I could provide a more realistic sample of what I have 
rather than the vector. Here is a chunk of the text I'll be processing:
text to be ignored... CDS 687..3158 /gene="AXL2" /note="plasma 
membrane glycoprotein"
other text to be ignored...
CDS complement(3300..4037) /gene="REV7"
other text to be ignored...
CDS <4500..4550 /gene="REV7"
other text to be ignored...
CDS complement(join(30708..31700,31931..31984)) /gene="REV7"
and so on ...
processing this text, I want the following output (let's say) in a list called 
output with as many elements as there are valid "CDS" (i.e. CDS without "<" or 
">"), where the first component of each element of the list is a 0/1 number 
that tells if what followed "CDS" included the word "complement" or not. Here 
is what I would like to get for the above text:
output:
[[1]] 0, 687, 3158
[[2]] 1, 3300, 4037
[[3]] 1, 30708, 31700, 31931, 31984

Thanks again for the help!





Thank you very much for the response! This is a major improvement on 
what I was getting! I need to read and understand what is done as I need to 
modify it a little bit. The exact requirement for me is to not only 
recognize the numbers that follow "CDS" but also be able to 
differentiate between the 4 accepted cases: 
"CDS             3300..4037" 
or 
"CDS             complement(3300..4037)" 
or 
"CDS             join(21467..26641,27577..28890)" 
or 
"CDS             complement(join(30708..31700,31931..31984))" 

I need to do different things for each for example, when "join" 
follows the gap, I need to join the ranges (e.g. in this case have two 
intervals [21467 26641] U [27577 28890]) in one set. Many thanks though 
for getting me going! 
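
One way to keep the four cases apart is to record them as flags next to the numbers. The following is only a sketch in base R under the assumptions stated in the comments; parse_cds is a hypothetical helper name, not something from a package or from the replies in this thread.

```r
# Hypothetical helper: parse one CDS location line.
# Assumes the numbers always come in start..end pairs.
parse_cds <- function(line) {
  if (grepl("[<>]", line)) return(NULL)   # skip entries marked with < or >
  nums <- as.numeric(regmatches(line, gregexpr("[0-9]+", line))[[1]])
  list(
    complement = grepl("complement", line, fixed = TRUE),
    join       = grepl("join", line, fixed = TRUE),
    # one row per interval, so join(...) yields several rows
    ranges     = matrix(nums, ncol = 2, byrow = TRUE,
                        dimnames = list(NULL, c("start", "end")))
  )
}

cds <- c("CDS             3300..4037",
         "CDS             complement(3300..4037)",
         "CDS             join(21467..26641,27577..28890)",
         "CDS             complement(join(30708..31700,31931..31984))",
         "CDS             <4500..4550")

# Drop the NULLs returned for the excluded lines
parsed <- Filter(Negate(is.null), lapply(cds, parse_cds))
length(parsed)        # the "<4500..4550" line is dropped
parsed[[3]]$ranges    # two rows: 21467-26641 and 27577-28890
```

Each element then carries the complement/join flags, so later code can branch on them instead of re-reading the text.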


On Thursday, February 6, 2014 2:20 PM, arun  wrote:
You could also try:
library(gsubfn)


strapply(gsub("\\d+<|>\\d+","",vec1),"([0-9]+)",as.numeric,simplify=c)

A.K.



On Thursday, February 6, 2014 1:55 PM, arun  wrote:
Hi,
One way would be: 


vec1 <- c("CDS 3300..4037", "CDS complement(3300..4037)", "CDS 3300<..4037",
          "CDS join(21467..26641,27577..28890)",
          "CDS complement(join(30708..31700,31931..31984))", "CDS 3300<..>4037")
library(stringr)
as.numeric(unlist(strsplit(str_trim(gsub("\\D+", " ",
  gsub("\\d+<|>\\d+", "", vec1))), " ")))
# [1]  3300  4037  3300  4037  4037 21467 26641 27577 28890 30708 31700 31931
#[13] 31984
A.K.


Hi, 

I have been using R for the past 1.5 years and usually have 
found topics to be relatively easy to learn on your own, but I am 
finding the learning curve with the regular expressions to be a little 
steep especially since I haven't found any good tutorials. While I 
intend to spend more time systematically learning proper ways of making 
regular expressions, I have a project that is coming due and can't wait 
for that so I was hoping to get some direct help. 
I need to extract all the numbers in lines with following formats: 

"CDS             3300..4037" 
or 
"CDS             complement(3300..4037)" 
or 
"CDS             join(21467..26641,27577..28890)" 
or 
"CDS             complement(join(30708..31700,31931..31984))" 

but not if any of the numbers are preceded by "<" or followed by ">" 
Many thanks in advance!



Re: [R] Regular expressions, genbank

2014-02-06 Thread arun
You could also try:
library(gsubfn)


strapply(gsub("\\d+<|>\\d+","",vec1),"([0-9]+)",as.numeric,simplify=c)

A.K.


On Thursday, February 6, 2014 1:55 PM, arun  wrote:
Hi,
One way would be: 


vec1 <- c("CDS 3300..4037", "CDS complement(3300..4037)", "CDS 3300<..4037",
          "CDS join(21467..26641,27577..28890)",
          "CDS complement(join(30708..31700,31931..31984))", "CDS 3300<..>4037")
library(stringr)
as.numeric(unlist(strsplit(str_trim(gsub("\\D+", " ",
  gsub("\\d+<|>\\d+", "", vec1))), " ")))
# [1]  3300  4037  3300  4037  4037 21467 26641 27577 28890 30708 31700 31931
#[13] 31984
A.K.


Hi, 

I have been using R for the past 1.5 years and usually have 
found topics to be relatively easy to learn on your own, but I am 
finding the learning curve with the regular expressions to be a little 
steep especially since I haven't found any good tutorials. While I 
intend to spend more time systematically learning proper ways of making 
regular expressions, I have a project that is coming due and can't wait 
for that so I was hoping to get some direct help. 
I need to extract all the numbers in lines with following formats: 

"CDS             3300..4037" 
or 
"CDS             complement(3300..4037)" 
or 
"CDS             join(21467..26641,27577..28890)" 
or 
"CDS             complement(join(30708..31700,31931..31984))" 

but not if any of the numbers are preceded by "<" or followed by ">" 
Many thanks in advance!



Re: [R] Regular expressions, genbank

2014-02-06 Thread arun
Hi,
One way would be: 


vec1 <- c("CDS 3300..4037", "CDS complement(3300..4037)", "CDS 3300<..4037",
          "CDS join(21467..26641,27577..28890)",
          "CDS complement(join(30708..31700,31931..31984))", "CDS 3300<..>4037")
library(stringr)
as.numeric(unlist(strsplit(str_trim(gsub("\\D+", " ",
  gsub("\\d+<|>\\d+", "", vec1))), " ")))
# [1]  3300  4037  3300  4037  4037 21467 26641 27577 28890 30708 31700 31931
#[13] 31984
A.K.


Hi, 

I have been using R for the past 1.5 years and usually have 
found topics to be relatively easy to learn on your own, but I am 
finding the learning curve with the regular expressions to be a little 
steep especially since I haven't found any good tutorials. While I 
intend to spend more time systematically learning proper ways of making 
regular expressions, I have a project that is coming due and can't wait 
for that so I was hoping to get some direct help. 
I need to extract all the numbers in lines with following formats: 

"CDS             3300..4037" 
or 
"CDS             complement(3300..4037)" 
or 
"CDS             join(21467..26641,27577..28890)" 
or 
"CDS             complement(join(30708..31700,31931..31984))" 

but not if any of the numbers are preceded by "<" or followed by ">" 
Many thanks in advance!



Re: [R] Regular expressions on filenames

2014-01-15 Thread Wojtek Poppe
Try  sub("\\.[^.]+$", "",  basename(FILELIST))

Thanks,
Wojtek


On Wed, Jan 15, 2014 at 4:37 PM, Fisher Dennis  wrote:

> R 3.0.2
> OS X
>
> Colleagues
>
> I am writing code to read a large number of files in a particular folder.
>  In some situations, there may be two versions of the file with different
> extensions, e.g.:
> FILE.csv
> FILE.xls
> I extracted the portion before the extension with:
> sub("\\..*$", "", basename(FILELIST))
> then used
> duplicated
> to find duplicates.  All was well until I encountered files named:
> FILE.XXX.csv
> FILE.YYY.xls
>
> My regular expression extracted only the “FILE” portion of the text and
> claimed that the filenames (without the extensions) matched.  Can someone
> provide me with the appropriate regular expression to deal with this?
>  Thanks.
>
> Dennis
>
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
>


Re: [R] Regular expressions on filenames

2014-01-15 Thread David Winsemius

On Jan 15, 2014, at 4:37 PM, Fisher Dennis wrote:

> R 3.0.2
> OS X
> 
> Colleagues
> 
> I am writing code to read a large number of files in a particular folder.  In 
> some situations, there may be two versions of the file with different 
> extensions, e.g.:
>   FILE.csv
>   FILE.xls
> I extracted the portion before the extension with:
>   sub("\\..*$", "", basename(FILELIST))
> then used 
>   duplicated
> to find duplicates.  All was well until I encountered files named:
>   FILE.XXX.csv
>   FILE.YYY.xls
> 
> My regular expression extracted only the “FILE” portion of the text and 
> claimed that the filenames (without the extensions) matched.  Can someone 
> provide me with the appropriate regular expression to deal with this?  Thanks.

Why not:

sub("\\..{3}$", "", basename(FILELIST))

See ?regex

-- 

David Winsemius
Alameda, CA, USA



Re: [R] Regular expressions on filenames

2014-01-15 Thread Jeff Newmiller
You want to match a period and anything that follows to the end of the string, 
as long as what follows has no period in it.
"\\.[^.]*$"

Fisher Dennis  wrote:
>R 3.0.2
>OS X
>
>Colleagues
>
>I am writing code to read a large number of files in a particular
>folder.  In some situations, there may be two versions of the file with
>different extensions, e.g.:
>   FILE.csv
>   FILE.xls
>I extracted the portion before the extension with:
>   sub("\\..*$", "", basename(FILELIST))
>then used 
>   duplicated
>to find duplicates.  All was well until I encountered files named:
>   FILE.XXX.csv
>   FILE.YYY.xls
>
>My regular expression extracted only the “FILE” portion of the text and
>claimed that the filenames (without the extensions) matched.  Can
>someone provide me with the appropriate regular expression to deal with
>this?  Thanks.
>
>Dennis
>
>
>Dennis Fisher MD
>P < (The "P Less Than" Company)
>Phone: 1-866-PLessThan (1-866-753-7784)
>Fax: 1-866-PLessThan (1-866-753-7784)
>www.PLessThan.com
>


Re: [R] Regular expressions on filenames

2014-01-15 Thread arun
Hi,
Try:
 FILELIST <- list.files()
FILELIST 
#[1] "FILE.csv" "FILE.XXX.csv" "FILE.YYY.xls"

  sub("(.*)\\..*$", "\\1", basename(FILELIST))
#[1] "FILE" "FILE.XXX" "FILE.YYY"


A.K.


On Wednesday, January 15, 2014 7:35 PM, Fisher Dennis  
wrote:
R 3.0.2
OS X

Colleagues

I am writing code to read a large number of files in a particular folder.  In 
some situations, there may be two versions of the file with different 
extensions, e.g.:
    FILE.csv
    FILE.xls
I extracted the portion before the extension with:
    sub("\\..*$", "", basename(FILELIST))
then used 
    duplicated
to find duplicates.  All was well until I encountered files named:
    FILE.XXX.csv
    FILE.YYY.xls

My regular expression extracted only the “FILE” portion of the text and claimed 
that the filenames (without the extensions) matched.  Can someone provide me 
with the appropriate regular expression to deal with this?  Thanks.

Dennis


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com



Re: [R] Regular expressions on filenames

2014-01-15 Thread jim holtman
try this:

> x <- c(  "FILE.XXX.csv"
+ , "FILE.YYY.xls")
> sub("\\.[^.]*$", "", x)
[1] "FILE.XXX" "FILE.YYY"
>

the '[^.]*' says to match anything BUT a period.
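
For comparison, here is a small sketch that runs the original pattern and the suggested one side by side on all four file names from the question. The vector files and the name stems are made up for the illustration; tools::file_path_sans_ext() comes from the tools package that ships with R.

```r
files <- c("FILE.csv", "FILE.xls", "FILE.XXX.csv", "FILE.YYY.xls")

# Original pattern: strips from the FIRST dot, so everything collapses to "FILE"
sub("\\..*$", "", files)
# [1] "FILE" "FILE" "FILE" "FILE"

# Strips only the last dot and whatever follows it
stems <- sub("\\.[^.]*$", "", files)
stems
# [1] "FILE"     "FILE"     "FILE.XXX" "FILE.YYY"

# Same idea without a hand-written regular expression
tools::file_path_sans_ext(files)
# [1] "FILE"     "FILE"     "FILE.XXX" "FILE.YYY"

# Only the genuine duplicate is flagged now
duplicated(stems)
# [1] FALSE  TRUE FALSE FALSE
```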

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Wed, Jan 15, 2014 at 7:37 PM, Fisher Dennis  wrote:
> R 3.0.2
> OS X
>
> Colleagues
>
> I am writing code to read a large number of files in a particular folder.  In 
> some situations, there may be two versions of the file with different 
> extensions, e.g.:
> FILE.csv
> FILE.xls
> I extracted the portion before the extension with:
> sub("\\..*$", "", basename(FILELIST))
> then used
> duplicated
> to find duplicates.  All was well until I encountered files named:
> FILE.XXX.csv
> FILE.YYY.xls
>
> My regular expression extracted only the “FILE” portion of the text and 
> claimed that the filenames (without the extensions) matched.  Can someone 
> provide me with the appropriate regular expression to deal with this?  Thanks.
>
> Dennis
>
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
>


[R] Regular expressions on filenames

2014-01-15 Thread Fisher Dennis
R 3.0.2
OS X

Colleagues

I am writing code to read a large number of files in a particular folder.  In 
some situations, there may be two versions of the file with different 
extensions, e.g.:
FILE.csv
FILE.xls
I extracted the portion before the extension with:
sub("\\..*$", "", basename(FILELIST))
then used 
duplicated
to find duplicates.  All was well until I encountered files named:
FILE.XXX.csv
FILE.YYY.xls

My regular expression extracted only the “FILE” portion of the text and claimed 
that the filenames (without the extensions) matched.  Can someone provide me 
with the appropriate regular expression to deal with this?  Thanks.

Dennis


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com



Re: [R] R Regular Expressions - Metacharacters

2013-02-05 Thread David Winsemius

On Feb 5, 2013, at 9:49 AM, Seth Dickey wrote:

> I thought that I can use metacharacters such as \w to match word characters
> with one backslash.  But for some reason, I need to include two backslashes.
> 
>> grepl(pattern='\w', x="what")
> Error: '\w' is an unrecognized escape in character string starting "\w"
> 
>> grepl(pattern='\\w', x="what")
> [1] TRUE
> 
> I can't find the reason for this on the help pages.  Does anyone know why?

The help page for ?regex says near the top ...

"Any metacharacter with special meaning may be quoted by preceding it with a 
backslash. The metacharacters in EREs are . \ | ( ) [ { ^ $ * + ?, but note 
that whether these have a special meaning depends on the context."
> 
> Thanks!
> 

David Winsemius
Alameda, CA, USA



Re: [R] R Regular Expressions - Metacharacters

2013-02-05 Thread Duncan Murdoch

On 05/02/2013 12:49 PM, Seth Dickey wrote:

I thought that I can use metacharacters such as \w to match word characters
with one backslash.  But for some reason, I need to include two backslashes.

> grepl(pattern='\w', x="what")
Error: '\w' is an unrecognized escape in character string starting "\w"

> grepl(pattern='\\w', x="what")
[1] TRUE

I can't find the reason for this on the help pages.  Does anyone know why?


grepl wants a string containing a single backslash.  R uses the 
backslash as an escape character, so you need to double it in your 
source, so the string ends up containing just one.


"\w" is interpreted by R as an escaped w, which doesn't make sense.

"\\w" is interpreted by R as a backslash followed by a w, and then the 
\w is interpreted by grepl the way you want.
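
The two layers can be checked directly at the console. This is a small sketch; the last call assumes R 4.0.0 or later, where raw strings are available (they did not exist when this thread was written).

```r
# The source text "\\w" is a two-character string: a backslash and a w
nchar("\\w")
# [1] 2
cat("\\w\n")   # prints \w, which is what the regex engine receives

grepl("\\w", "what")
# [1] TRUE

# Assumption: R >= 4.0.0. A raw string avoids doubling the backslash.
grepl(r"(\w)", "what")
# [1] TRUE
```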


Duncan Murdoch



[R] R Regular Expressions - Metacharacters

2013-02-05 Thread Seth Dickey
I thought that I can use metacharacters such as \w to match word characters
with one backslash.  But for some reason, I need to include two backslashes.

> grepl(pattern='\w', x="what")
Error: '\w' is an unrecognized escape in character string starting "\w"

> grepl(pattern='\\w', x="what")
[1] TRUE

I can't find the reason for this on the help pages.  Does anyone know why?

Thanks!



Re: [R] Regular expressions: stuck again...

2012-08-24 Thread Noia Raindrops
Hello,

try this:

x <- c("SELECT [public_tblFiche].[Fichenr], [public_tblArtnr].[Artnr]",
       "SELECT public_tblFiche.Fichenr, public_tblArtnr.Artnr")

# > The square brackets [ and ] should be removed
x <- gsub("[][]", "", x)

# > and xxx_xxx.xxx should become \"xxx\".\"xxx\".\"xxx\"
x <- gsub("([[:alpha:]]+)_([[:alpha:]]+)\\.([[:alpha:]]+)",
          "\"\\1\".\"\\2\".\"\\3\"", x)


-- 
Noia Raindrops
noia.raindr...@gmail.com
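
The two substitutions in this reply cover the SELECT list. For the full query in the original question, column names containing an underscore (Artnr_ID) and bare table names in the FROM clause need a little more. The following is only a sketch of one way to extend the idea; access_to_pg is a hypothetical name, and the function assumes every identifier has the form schema_table or schema_table.column with no underscore in the schema name.

```r
# Hypothetical helper, under the assumptions stated above
access_to_pg <- function(q) {
  q <- gsub("[][]", "", q)   # drop [ and ]
  # schema_table.column -> "schema"."table"."column"
  q <- gsub("\\b([[:alpha:]]+)_([[:alnum:]]+)\\.([[:alnum:]_]+)",
            "\"\\1\".\"\\2\".\"\\3\"", q, perl = TRUE)
  # remaining bare schema_table (e.g. in FROM) -> "schema"."table";
  # the lookarounds skip names that are already quoted
  gsub("(?<![\"[:alnum:]_.])([[:alpha:]]+)_([[:alnum:]]+)(?![\"[:alnum:]_.])",
       "\"\\1\".\"\\2\"", q, perl = TRUE)
}

q <- paste("SELECT [public_tblFiche].[Fichenr], [public_tblArtnr].[Artnr]",
           "FROM [public_tblFiche], [public_tblArtnr]",
           "WHERE [public_tblFiche].[Artnr_ID] = [public_tblArtnr].[Artnr_ID];")
cat(access_to_pg(q), "\n")
```

Both substitutions use perl = TRUE because the second one relies on lookbehind and lookahead. SQL that contains underscores elsewhere, for instance inside string literals, would need extra care.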



[R] Regular expressions: stuck again...

2012-08-23 Thread Bart Joosen
Hi,

I'm currently reworking a report that originates from an MS Access database but
should be implemented in R.
Now I'm facing the task to convert a lot of queries to postgreSQL.

What I want to do is make a function which takes the MS Access query as an
argument and returns the pgSQL version.
So:
SELECT [public_tblFiche].[Fichenr], [public_tblArtnr].[Artnr] FROM
[public_tblFiche], [public_tblArtnr] WHERE [public_tblFiche].[Artnr_ID] =
[public_tblArtnr].[Artnr_ID];
or 
SELECT public_tblFiche.Fichenr, public_tblArtnr.Artnr FROM public_tblFiche,
public_tblArtnr WHERE public_tblFiche.Artnr_ID = public_tblArtnr.Artnr_ID;

Should become: 
SELECT \"public\".\"tblFiche\".\"Fichenr\",
\"public\".\"tblArtnr\".\"Artnr\" FROM \"public\".\"tblFiche\",
\"public\".\"tblArtnr\" WHERE \"public\".\"tblFiche\".\"Artnr_ID\" =
\"public\".\"tblArtnr\".\"Artnr_ID\";

Concretely:
The square brackets [ and ] should be removed
and xxx_xxx.xxx should become \"xxx\".\"xxx\".\"xxx\"


When there were only queries with square brackets, I used 
gsub('[', '\"', x, fixed=TRUE), 
gsub(']', '\"', x, fixed=TRUE), 
gsub('_', '\"', x, fixed=TRUE), 

but I can't get a grip on how to do the trick with regular expressions.

Anyone who can give me some help?


Thanks

Bart












Re: [R] Regular Expressions in grep - Solution and function to determine significant figures of a number

2012-08-23 Thread Dr. Holger van Lishaut
On 22.08.2012 at 21:46, Dr. Holger van Lishaut wrote:



SignifStellen<-function(x){
 strx=as.character(x)
 nchar(regmatches(strx, regexpr("[1-9][0-9]*\\.[0-9]*[1-9]",strx)))-1
}

returns the significant figures of a number. Perhaps this can help  
someone.


Sorry, to work, it must read:

SignifStellen<-function(x){
  strx=as.character(x)
  intFront <- nchar(regmatches(strx, regexpr("[1-9][0-9]*\\.", strx)))
  intEnd <- nchar(regmatches(strx, regexpr("\\.[0-9]*[1-9]", strx)))
  intFront+intEnd-2
}
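
A short usage sketch follows, with the corrected function repeated so that the block runs on its own. The last call illustrates a limitation rather than a feature: when either pattern finds nothing (no decimal point, or no non-zero digit before it), regmatches() returns an empty vector and so does the function.

```r
SignifStellen <- function(x) {
  strx <- as.character(x)
  intFront <- nchar(regmatches(strx, regexpr("[1-9][0-9]*\\.", strx)))
  intEnd   <- nchar(regmatches(strx, regexpr("\\.[0-9]*[1-9]", strx)))
  intFront + intEnd - 2   # both pieces include the dot, hence the -2
}

SignifStellen(1020.9092)          # 4 digits before the dot + 4 after
# [1] 8
SignifStellen("-01020.909200")    # leading and trailing zeros are ignored
# [1] 8

# No decimal point in "100", so neither pattern matches
SignifStellen(100)
# numeric(0)
```

For the same reason the function is best applied to one value at a time; with a vector, elements without a match are silently dropped and the results no longer line up with the input.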

Best regards
H. van Lishaut



Re: [R] Regular Expressions in grep - Solution and function to determine significant figures of a number

2012-08-22 Thread Bert Gunter
...

On Wed, Aug 22, 2012 at 12:46 PM, Dr. Holger van Lishaut
 wrote:
> Dear all,
>
> regmatches works.
>
> And, since this has been asked here before:
>
> SignifStellen<-function(x){
> strx=as.character(x)
> nchar(regmatches(strx, regexpr("[1-9][0-9]*\\.[0-9]*[1-9]",strx)))-1
> }
>
> returns the significant figures of a number. Perhaps this can help someone.

except that ?signif already does this, no?

-- Bert

>
> Thanks & best regards
> H. van Lishaut



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm



Re: [R] Regular Expressions in grep - Solution and function to determine significant figures of a number

2012-08-22 Thread Dr. Holger van Lishaut

Dear all,

regmatches works.

And, since this has been asked here before:

SignifStellen<-function(x){
strx=as.character(x)
nchar(regmatches(strx, regexpr("[1-9][0-9]*\\.[0-9]*[1-9]",strx)))-1
}

returns the significant figures of a number. Perhaps this can help someone.

Thanks & best regards
H. van Lishaut



Re: [R] Regular Expressions in grep

2012-08-21 Thread arun
Hi,
Try this:
gsub("^-\\d(\\d{4}.).*","\\1",a)
#[1] "1020."
gsub("^.*(.\\d{5}).","\\1",a)
#[1] ".90920"
A.K.



- Original Message -
From: Dr. Holger van Lishaut 
To: "r-help@r-project.org" 
Cc: 
Sent: Tuesday, August 21, 2012 3:24 PM
Subject: [R]  Regular Expressions in grep

Dear r-help members,

I have a number in the form of a string, say:

a<-"-01020.909200"

I'd like to extract "1020." as well as ".9092"

Front<-grep(pattern="[1-9]+[0-9]*\\.", value=TRUE, x=a, fixed=FALSE)
End<-grep(pattern="\\.[0-9]*[1-9]+", value=TRUE, x=a, fixed=FALSE)

However, both strings give "-01020.909200", exactly a.
Could you please point me to what is wrong?

Thanks and best regards
H. van Lishaut



Re: [R] Regular Expressions in grep

2012-08-21 Thread R. Michael Weylandt
You're misreading the docs: from grep,

   value: if ‘FALSE’, a vector containing the (‘integer’) indices of
  the matches determined by ‘grep’ is returned, and if ‘TRUE’,
  a vector containing the matching elements themselves is
  returned.

Since there's a match somewhere in a[1], all of a[1] is returned (it
is a matching element), not just the matching bit: grep(pattern, x, value =
TRUE) is something like x[grepl(pattern, x)] to my mind.

I think you want ?regexpr or possibly just substitute out the
non-match with gsub.
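
A minimal sketch of the difference, using the string from the question and one of its patterns:

```r
a <- "-01020.909200"

# grep(value = TRUE) returns the whole elements that contain a match
grep("\\.[0-9]*[1-9]", a, value = TRUE)
# [1] "-01020.909200"

# which is the same as subsetting with grepl()
a[grepl("\\.[0-9]*[1-9]", a)]
# [1] "-01020.909200"

# regexpr() locates the match; regmatches() extracts just that part
m <- regexpr("\\.[0-9]*[1-9]", a)
regmatches(a, m)
# [1] ".9092"
```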

Cheers,
Michael

On Tue, Aug 21, 2012 at 2:24 PM, Dr. Holger van Lishaut
 wrote:
> Dear r-help members,
>
> I have a number in the form of a string, say:
>
> a<-"-01020.909200"
>
> I'd like to extract "1020." as well as ".9092"
>
> Front<-grep(pattern="[1-9]+[0-9]*\\.", value=TRUE, x=a, fixed=FALSE)
> End<-grep(pattern="\\.[0-9]*[1-9]+", value=TRUE, x=a, fixed=FALSE)
>
> However, both strings give "-01020.909200", exactly a.
> Could you please point me to what is wrong?
>
> Thanks and best regards
> H. van Lishaut
>


Re: [R] Regular Expressions in grep

2012-08-21 Thread Noia Raindrops
'grep' does not change strings. Use 'gsub' or 'regmatches':

# gsub
Front <- gsub("^.*?([1-9][0-9]*\\.).*?$", "\\1", a)
End <- gsub("^.*?(\\.[0-9]*[1-9]).*?$", "\\1", a)
# regexpr and regmatches (R >= 2.14.0)
Front <- regmatches(a, regexpr("[1-9][0-9]*\\.", a))
End <- regmatches(a, regexpr("\\.[0-9]*[1-9]", a))

Front
## [1] "1020."
End
## [1] ".9092"

-- 
Noia Raindrops
noia.raindr...@gmail.com



Re: [R] Regular Expressions in grep

2012-08-21 Thread Bert Gunter
grep() returns the elements that contain a match (or their indices), not the
matched substrings. You want regexpr() and regmatches().

-- Bert

On Tue, Aug 21, 2012 at 12:24 PM, Dr. Holger van Lishaut
 wrote:
> Dear r-help members,
>
> I have a number in the form of a string, say:
>
> a<-"-01020.909200"
>
> I'd like to extract "1020." as well as ".9092"
>
> Front<-grep(pattern="[1-9]+[0-9]*\\.", value=TRUE, x=a, fixed=FALSE)
> End<-grep(pattern="\\.[0-9]*[1-9]+", value=TRUE, x=a, fixed=FALSE)
>
> However, both strings give "-01020.909200", exactly a.
> Could you please point me to what is wrong?
>
> Thanks and best regards
> H. van Lishaut
>



-- 

Bert Gunter
Genentech Nonclinical Biostatistics



[R] Regular Expressions in grep

2012-08-21 Thread Dr. Holger van Lishaut

Dear r-help members,

I have a number in the form of a string, say:

a<-"-01020.909200"

I'd like to extract "1020." as well as ".9092"

Front<-grep(pattern="[1-9]+[0-9]*\\.", value=TRUE, x=a, fixed=FALSE)
End<-grep(pattern="\\.[0-9]*[1-9]+", value=TRUE, x=a, fixed=FALSE)

However, both calls return "-01020.909200", i.e. exactly a.
Could you please point me to what is wrong?

Thanks and best regards
H. van Lishaut



Re: [R] Regular Expressions + Matrices

2012-08-10 Thread Fred G
Thanks Bill! Works great! Thanks again guys!


Re: [R] Regular Expressions + Matrices

2012-08-10 Thread William Dunlap
If you think about this as a runs problem you can get a loopless solution
that I think is easier to read (once the requisite functions are defined).

First define the function to canonicalize the name
   nickname <- function(x) sub(" .*", "", x)
then define some handy runs functions
  isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
  isJustBefore <- function(x) c(x[-1], FALSE) # x should be logical
then use those functions on your dataset
  > nearDup <- !isFirstInRun(nickname(d$NAME)) & isFirstInRun(d$YEAR)
  > d[ nearDup | isJustBefore(nearDup), ]
    ID             NAME YEAR      SOURCE
  1  1    New York Mets 1900        ESPN
  2  2 New York Yankees 1920 Cooperstown
See how it works with triplicates as well
  > dd <- rbind(d, data.frame(ID=6:8,
        NAME=c("Chicago Blacksox", "Chicago Cubs", "Chicago Whitesox"),
        YEAR=1701:1703, SOURCE=rep("made up", 3)))
  > nearDup <- !isFirstInRun(nickname(dd$NAME)) & isFirstInRun(dd$YEAR)
  > dd[ nearDup | isJustBefore(nearDup), ]
    ID             NAME YEAR      SOURCE
  1  1    New York Mets 1900        ESPN
  2  2 New York Yankees 1920 Cooperstown
  6  6 Chicago Blacksox 1701     made up
  7  7     Chicago Cubs 1702     made up
  8  8 Chicago Whitesox 1703     made up
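
To see what the two runs helpers do on their own, it may help to apply them to short vectors. This is only a sketch: first_word and id are made-up stand-ins for nickname(d$NAME) and d$ID. Note that the session above tests YEAR in the second condition; the sketch uses an ID vector instead, on the assumption that the ID comparison from the original question is what is wanted.

```r
isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
isJustBefore <- function(x) c(x[-1], FALSE)  # x should be logical

first_word <- c("New", "New", "Boston", "Chicago", "Chicago", "Chicago")
id <- 1:6

# TRUE wherever a value differs from the one before it.
run_start <- isFirstInRun(first_word)
print(run_start)  # TRUE FALSE TRUE TRUE FALSE FALSE

# Same first word as the row above, but a different ID.
near_dup <- !run_start & isFirstInRun(id)

# Also keep the row just above each flagged row.
keep <- near_dup | isJustBefore(near_dup)
print(which(keep))  # 1 2 4 5 6
```

Because every step is a vector operation, no explicit loop over rows is needed, which is the point of treating this as a runs problem.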

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com




Re: [R] Regular Expressions + Matrices

2012-08-10 Thread Fred G
Thanks so much, and thanks for the clarification. "New York" ---> "New"
should not match "Other New" because "New" is not the first.

Thanks so much, testing it on my data now.


Re: [R] Regular Expressions + Matrices

2012-08-10 Thread Rui Barradas

Hello,

My code doesn't account for a point you've made clear in this post. See my
comments inline.
On 10-08-2012 19:05, Fred G wrote:

> Thanks Arun. The only issue is that I need the code to be very
> generalizable, such that the grep() really has to be if the first string up
> to the whitespace in a row (ie "New", "Boston", "Washington", "Detroit"
> below) is the same as the first string up to the whitespace in the row
> directly below it

Does this mean that "New York" ---> "New" in one row shouldn't match
"Other New" in the next row because "New" is not the first string up to
the whitespace? If this is the case, modify my earlier code to

fun <- function(i, x){
    if(x[i, "ID"] != x[i + 1, "ID"]){
        # keep only the first string of each name
        s1 <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
        s2 <- unlist(strsplit(x[i + 1, "NAME"], "[[:space:]]"))[1]
        if(grepl(s1, s2)) return(TRUE)
    }
    FALSE
}

If it isn't the case, do nothing.

Rui Barradas

> , AND the ID's are different, then copy.  The actual file
> has thousands of different IDs and names...
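
One caveat about grepl(s1, s2): it does a regular-expression substring match, so a first word of "New" would also match a first word of "Newark". If strict equality of the first words is wanted, a vectorized comparison is one option. The following is only a sketch under that assumption; the data frame contains made-up rows, and first_word, pair and keep are illustrative names.

```r
d <- data.frame(
  ID   = 1:6,
  NAME = c("New York Mets", "New York Yankees", "Newark Bears",
           "Other New", "Boston Redsox", "Detroit Tigers"),
  stringsAsFactors = FALSE
)

# First whitespace-delimited word of each name.
first_word <- sub("[[:space:]].*$", "", d$NAME)

n <- nrow(d)
# pair[i] is TRUE when rows i and i + 1 share a first word but not an ID.
pair <- first_word[-n] == first_word[-1] & d$ID[-n] != d$ID[-1]

# Flag both rows of every matching pair.
keep <- c(pair, FALSE) | c(FALSE, pair)
result <- d[keep, ]
print(result)

# For comparison, the substring behaviour of grepl():
print(grepl("New", "Newark"))  # TRUE
```

With this input only rows 1 and 2 are kept; "Newark Bears" and "Other New" are not, because their first words are not equal to "New".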



Re: [R] Regular Expressions + Matrices

2012-08-10 Thread arun


Hi,

Try this:
dat1<-read.table(text="
ID,NAME,YEAR,SOURCE
1,New York Mets,1900,ESPN
2,New York Yankees,1920,Cooperstown
3,Boston Redsox,1918,ESPN
4,Washington Nationals,2010,ESPN
5,Detroit Tigers,1990,ESPN
",sep=",",header=TRUE,stringsAsFactors=FALSE)

index<-grep("New York.*",dat1$NAME)
dat1[index,]
#  ID             NAME YEAR      SOURCE
#1  1    New York Mets 1900        ESPN
#2  2 New York Yankees 1920 Cooperstown

A.K.





Re: [R] Regular Expressions + Matrices

2012-08-10 Thread Rui Barradas

Hello,

Try the following.


d <- read.table(textConnection("
ID NAME  YEAR SOURCE
1  'New York Mets'   1900  ESPN
2  'New York Yankees'  1920 Cooperstown
3  'Boston Redsox'   1918  ESPN
4  'Washington Nationals'  2010 ESPN
5  'Detroit Tigers'  1990  ESPN
"), header=TRUE)

d$NAME <- as.character(d$NAME)

fun <- function(i, x){
    if(x[i, "ID"] != x[i + 1, "ID"]){
        s <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
        if(grepl(s, x[i + 1, "NAME"])) return(TRUE)
    }
    FALSE
}

inx <- sapply(seq_len(nrow(d) - 1), fun, d)
inx <- c(inx, FALSE) | c(FALSE, inx)
d[inx, ]
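
The step c(inx, FALSE) | c(FALSE, inx) may deserve a word. sapply() returns one logical per adjacent pair of rows (element i compares rows i and i + 1), so the result is one element shorter than the data. Padding it once at the end and once at the start, then combining the two with |, flags both members of every matching pair. A toy sketch, with a hand-written pair vector standing in for the sapply() result:

```r
# Results for the pairs (1,2), (2,3), (3,4), (4,5) of a five-row table.
pair <- c(TRUE, FALSE, FALSE, TRUE)

upper <- c(pair, FALSE)  # row i is the upper row of a matching pair
lower <- c(FALSE, pair)  # row i is the lower row of a matching pair
rows <- upper | lower

print(which(rows))  # 1 2 4 5
```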

Hope this helps,

Rui Barradas


Re: [R] Regular Expressions + Matrices

2012-08-10 Thread Fred G
Thanks Arun. The only issue is that I need the code to be very
generalizable: the test really has to be whether the first string up to
the whitespace in a row (i.e. "New", "Boston", "Washington", "Detroit"
below) is the same as the first string up to the whitespace in the row
directly below it, AND the IDs are different; if so, copy both rows.
The actual file has thousands of different IDs and names...



[R] Regular Expressions + Matrices

2012-08-10 Thread Fred G
Hi all,

My code looks like the following:
inname = read.csv("ID_error_checker.csv", as.is=TRUE)
outname = read.csv("output.csv", as.is=TRUE)

#My algorithm is the following:
#for line in inname
#  if first string up to whitespace in row in inname$name = first string up
#  to whitespace in row + 1 in inname$name
#  AND ID in inname$ID for the top row NOT EQUAL ID in inname$ID for the row
#  below it
#    copy these two lines to a new file

In other words, if the name (up to the first whitespace) in the first row
equals the name in the second row (etc for whole file) and the ID in the
first row does not equal the ID in the second row, copy both of these rows
in full to a new file.  Only caveat is that I want a regular expression not
to take the full names, but just the first string up to the first
whitespace in the inname$name column (ie if row1 has a name of: New York
Mets and row2 has a name of New York Yankees, I would want both of these
rows to be copied in full since "New" is the same in both...)

Here is some example data:
ID  NAME                  YEAR  SOURCE       NOTES
1   New York Mets         1900  ESPN
2   New York Yankees      1920  Cooperstown
3   Boston Redsox         1918  ESPN
4   Washington Nationals  2010  ESPN
5   Detroit Tigers        1990  ESPN

The desired output would be:
ID  NAME              YEAR  SOURCE
1   New York Mets     1900  ESPN
2   New York Yankees  1920  Cooperstown

Thanks so much!



Re: [R] regular expressions in R

2011-12-21 Thread jim holtman
To be correct for the regular expression, it should be:

dir(pattern = "\\.(txt|doc)$")

The form

dir(pattern="*.txt")

will match 'txt' appearing anywhere in the name; this looks like the
argument you would have used to "Sys.glob" which is a UNIX style file
name match and not a regular expression.  "." matches any character
unless you escape it to mean a 'period'.
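
A quick way to check a pattern before passing it to dir() is to try it with grepl() on a few file names. The sketch below does that for the pattern above and also shows glob2rx() from the utils package, which translates a wildcard pattern into the equivalent regular expression. The file names are invented for illustration.

```r
files <- c("a.txt", "b.doc", "c.docx", "notes_txt.R", "readme.txt.bak")

# Anchored regular expression: a literal dot, then txt or doc, at the end.
wanted <- files[grepl("\\.(txt|doc)$", files)]
print(wanted)  # "a.txt" "b.doc"

# glob2rx() converts a shell-style wildcard into a regular expression.
rx <- glob2rx("*.txt")
print(rx)

from_glob <- files[grepl(rx, files)]
print(from_glob)  # "a.txt"
```

The same strings can be passed as the pattern argument of dir(), for example dir(pattern = glob2rx("*.txt")).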




-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.



Re: [R] regular expressions in R

2011-12-21 Thread R. Michael Weylandt
Do you wish to include .docx files as well or just .doc?

Michael



Re: [R] regular expressions in R

2011-12-21 Thread Sarah Goslee
From the help for dir:

 File naming conventions are platform dependent.  The pattern
 matching works with the case of file names as returned by the OS

On my linux system, this works:

> dir(pattern="*.txt")
[1] "a.txt" "b.txt"
>
> dir(pattern="*.doc")
[1] "c.doc"
>
> dir(pattern="*.doc|*.txt")
[1] "a.txt" "b.txt" "c.doc"

You don't tell us your OS, so I have no idea whether it will work for you.
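
If wildcard patterns such as *.txt are what you have in mind, Sys.glob() is the base R function that interprets them as wildcards, whereas dir() / list.files() treat the pattern argument as a regular expression. A small sketch in a scratch directory (the file names are invented, and the directory is created with tempfile() purely for the demonstration):

```r
tmp <- tempfile("regex_demo_")
dir.create(tmp)
invisible(file.create(
  file.path(tmp, c("a.txt", "b.txt", "c.doc", "c.docx", "mytxt.csv"))
))

# Regular expression, anchored at the end of the file name.
by_regex <- list.files(tmp, pattern = "\\.(txt|doc)$")

# Wildcards, expanded by Sys.glob(); basename() drops the directory part.
by_glob <- sort(basename(Sys.glob(file.path(tmp, c("*.txt", "*.doc")))))

print(by_regex)
print(by_glob)

unlink(tmp, recursive = TRUE)  # clean up the scratch directory
```

Both calls select a.txt, b.txt and c.doc here, and neither picks up c.docx or mytxt.csv.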

Sarah





-- 
Sarah Goslee
http://www.functionaldiversity.org



[R] regular expressions in R

2011-12-21 Thread Alaios
Dear all
I would like the dir function in R (?dir) to return only the files that
end with .txt or .doc.

The dir function supports the use of patterns (are these not regular
expressions?) for doing that.

  print(dir(i,full.names=TRUE,pattern=.))

Could you please help me compose such a pattern?

B.R
Alex


Re: [R] Regular expressions in R

2011-11-16 Thread Michael Griffiths
Thanks to everyone who contributed to my questions. As ever, I am extremely
grateful to all those on the R-list who make it what it is.

Regards

Mike Griffiths

On Tue, Nov 15, 2011 at 5:47 PM, Joshua Wiley wrote:

> Hi Michael,
>
> Your strings were long so I made a bit smaller example.  Sarah made
> one good point, you want to be using gsub() not sub(), but when I use
> your code, I do not think it even works precisely for one instance.
> Try this on for size, you were 99% there:
>
> ## simplified cases
> form1 <- c('product + action * mean + CTA + help + mean * product')
> form2 <- c('product+action*mean+CTA+help+mean*product')
>
> ## what I believe your desired output is
> 'product + CTA + help'
> 'product+CTA+help'
>
> gsub("\\s\\+\\s[[:alnum:]]*\\s\\*\\s[[:alnum:]]*", "", form1)
> gsub("\\+[[:alnum:]]*\\*[[:alnum:]]*", "", form2)
>
> ## your code (using gsub() instead of sub())
> gsub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]", "", form1)
>
>
>  Running on r57586 Windows x64 
> > gsub("\\s\\+\\s[[:alnum:]]*\\s\\*\\s[[:alnum:]]*", "", form1)
> [1] "product + CTA + help"
> > gsub("\\+[[:alnum:]]*\\*[[:alnum:]]*", "", form2)
> [1] "product+CTA+help"
> >
> > ## your code (using gsub() instead of sub())
> > gsub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]", "", form1)
> [1] "product ean + CTA + help roduct"
>
> Hope this helps,
>
> Josh
>
>
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, ATS Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/
>



-- 

Michael Griffiths, Ph.D
Statistician

Upstream Systems

8th Floor
Portland House
Bressenden Place
SW1E 5BH

Tel   +44 (0) 20 7869 5147
Fax  +44 207 290 1321
Mob +44 789 4944 145

www.upstreamsystems.com

griffi...@upstreamsystems.com





Re: [R] Regular expressions in R

2011-11-15 Thread Joshua Wiley
Hi Michael,

Your strings were long so I made a bit smaller example.  Sarah made
one good point, you want to be using gsub() not sub(), but when I use
your code, I do not think it even works precisely for one instance.
Try this on for size, you were 99% there:

## simplified cases
form1 <- c('product + action * mean + CTA + help + mean * product')
form2 <- c('product+action*mean+CTA+help+mean*product')

## what I believe your desired output is
'product + CTA + help'
'product+CTA+help'

gsub("\\s\\+\\s[[:alnum:]]*\\s\\*\\s[[:alnum:]]*", "", form1)
gsub("\\+[[:alnum:]]*\\*[[:alnum:]]*", "", form2)

## your code (using gsub() instead of sub())
gsub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]", "", form1)


 Running on r57586 Windows x64 
> gsub("\\s\\+\\s[[:alnum:]]*\\s\\*\\s[[:alnum:]]*", "", form1)
[1] "product + CTA + help"
> gsub("\\+[[:alnum:]]*\\*[[:alnum:]]*", "", form2)
[1] "product+CTA+help"
>
> ## your code (using gsub() instead of sub())
> gsub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]", "", form1)
[1] "product ean + CTA + help roduct"
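
The leftover "ean" and "roduct" above come from the tail of the original pattern: \\*.[[:alnum:]] matches the star, any one character, and then exactly one alphanumeric character, so only the first letter of the second name is consumed. As a sketch (not the only way to do it), a single pattern with optional whitespace and + quantifiers on the names seems to cover both the spaced and the unspaced form. The variable name pat is just for illustration.

```r
form1 <- "product + action * mean + CTA + help + mean * product"
form2 <- "product+action*mean+CTA+help+mean*product"

# Optional whitespace, "+", optional whitespace, a name,
# optional whitespace, "*", optional whitespace, a second name.
pat <- "\\s*\\+\\s*[[:alnum:]]+\\s*\\*\\s*[[:alnum:]]+"

res1 <- gsub(pat, "", form1)
res2 <- gsub(pat, "", form2)
print(res1)  # "product + CTA + help"
print(res2)  # "product+CTA+help"
```

Note that [[:alnum:]] does not include "_" or ".", so variable names containing those characters would need a wider character class such as [[:alnum:]_.].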

Hope this helps,

Josh

On Tue, Nov 15, 2011 at 9:18 AM, Michael Griffiths
 wrote:
> Good afternoon list,
>
> I have the following character strings; one with spaces between the maths
> operators and variable names, and one without said spaces.
>
> form<-c('~ Sentence + LEGAL + Intro + Intro / Intro1 + Intro * LEGAL +
> benefit + benefit / benefit1 + product + action * mean + CTA + help + mean
> * product')
> form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action*mean+CTA+help+mean*product')
>
> I would like to remove the following target strings, either:
>
> 1. '+ Intro * LEGAL' which is  '+ space name space * space name'
> 2. '+Intro*LEGAL' which is  '+ nospace name nospace * nospace name'
>
> Having delved into a variety of sites (e.g.
> http://www.zytrax.com/tech/web/regex.htm#search) investigating regular
> expressions I now have a basic grasp, but I am having difficulties removing
> ALL of the instances of 1. or 2.
>
> The code below removes just a SINGLE instance of the target string, but I
> was expecting it to remove all instances as I have \\*.[[:alnum:]]. I did
> try \\*.[[:alnum:]]*, but this did not work.
>
> form<-sub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]", "", form)
>
> I am obviously still not understanding something. If the list could offer
> some guidance I would be most grateful.
>
> Regards
>
> Mike Griffiths
>
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/



Re: [R] Regular expressions in R

2011-11-15 Thread Sarah Goslee
Hi Michael,

You need to take another look at the examples you were given, and at
the help for ?sub():

 The two ‘*sub’ functions differ only in that ‘sub’ replaces only
 the first occurrence of a ‘pattern’ whereas ‘gsub’ replaces all
 occurrences.  If ‘replacement’ contains backreferences which are
 not defined in ‘pattern’ the result is undefined (but most often
 the backreference is taken to be ‘""’).
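
To make the difference concrete, here is a small sketch on a made-up string (the string and variable names are illustrative only): the same pattern is passed to both functions, and only gsub() removes every occurrence.

```r
s <- "a+b*c+d*e"
pat <- "\\+[[:alnum:]]+\\*[[:alnum:]]+"

first_only <- sub(pat, "", s)   # removes only "+b*c"
all_hits   <- gsub(pat, "", s)  # removes "+b*c" and "+d*e"
print(first_only)  # "a+d*e"
print(all_hits)    # "a"
```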

Sarah

On Tue, Nov 15, 2011 at 12:18 PM, Michael Griffiths
 wrote:
> Good afternoon list,
>
> I have the following character strings; one with spaces between the maths
> operators and variable names, and one without said spaces.
>
> form<-c('~ Sentence + LEGAL + Intro + Intro / Intro1 + Intro * LEGAL +
> benefit + benefit / benefit1 + product + action * mean + CTA + help + mean
> * product')
> form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action*mean+CTA+help+mean*product')
>
> I would like to remove the following target strings, either:
>
> 1. '+ Intro * LEGAL' which is  '+ space name space * space name'
> 2. '+Intro*LEGAL' which is  '+ nospace name nospace * nospace name'
>
> Having delved into a variety of sites (e.g.
> http://www.zytrax.com/tech/web/regex.htm#search) investigating regular
> expressions I now have a basic grasp, but I am having difficulties removing
> ALL of the instances of 1. or 2.
>
> The code below removes just a SINGLE instance of the target string, but I
> was expecting it to remove all instances as I have \\*.[[:alnum:]]. I did
> try \\*.[[:alnum:]]*, but this did not work.
>
> form<-sub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]", "", form)
>
> I am obviously still not understanding something. If the list could offer
> some guidance I would be most grateful.
>
> Regards
>
> Mike Griffiths
>
>
>
-- 
Sarah Goslee
http://www.functionaldiversity.org



[R] Regular expressions in R

2011-11-15 Thread Michael Griffiths
Good afternoon list,

I have the following character strings; one with spaces between the maths
operators and variable names, and one without said spaces.

form<-c('~ Sentence + LEGAL + Intro + Intro / Intro1 + Intro * LEGAL +
benefit + benefit / benefit1 + product + action * mean + CTA + help + mean
* product')
form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action*mean+CTA+help+mean*product')

I would like to remove the following target strings, either:

1. '+ Intro * LEGAL' which is  '+ space name space * space name'
2. '+Intro*LEGAL' which is  '+ nospace name nospace * nospace name'

Having delved into a variety of sites (e.g.
http://www.zytrax.com/tech/web/regex.htm#search) investigating regular
expressions I now have a basic grasp, but I am having difficulties removing
ALL of the instances of 1. or 2.

The code below removes just a SINGLE instance of the target string, but I
was expecting it to remove all instances as I have \\*.[[:alnum:]]. I did
try \\*.[[:alnum:]]*, but this did not work.

form<-sub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]", "", form)

I am obviously still not understanding something. If the list could offer
some guidance I would be most grateful.

Regards

Mike Griffiths





Re: [R] Regular Expressions for "Large" Data Set

2011-06-07 Thread Marc Schwartz
On Jun 7, 2011, at 3:55 PM, Abraham Mathew wrote:

> I'm running R 2.13 on Ubuntu 10.10
> 
> I have a data set which is comprised of character strings.
> 
> site = readLines('http://www.census.gov/tiger/tms/gazetteer/zips.txt')
> 
> dat <- c("01, 35004, AL, ACMAR, 86.51557, 33.584132, 6055, 0.001499")
> dat
> 
> I want to loop through the data and construct a data frame with the zip
> code, state abbreviation, and city name in separate columns. Given the size
> of this data set, I was wondering if there was an efficient way to get the
> desired results.
> Thanks
> Abraham


Since the original text file is a CSV file (without a header), just use:

> system.time(DF <- read.csv("http://www.census.gov/tiger/tms/gazetteer/zips.txt",
+                            header = FALSE))
   user  system elapsed 
  0.385   0.033   1.832 


> str(DF)
'data.frame':   29470 obs. of  8 variables:
 $ V1: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2: int  35004 35005 35006 35007 35010 35014 35016 35019 35020 35023 ...
 $ V3: Factor w/ 51 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ V4: Factor w/ 16698 levels "02821","04465",..: 150 168 180 7710 10434 348 547 812 1250 7044 ...
 $ V5: num  86.5 87 87.2 86.8 86 ...
 $ V6: num  33.6 33.6 33.4 33.2 32.9 ...
 $ V7: int  6055 10616 3205 14218 19942 3062 13650 1781 40549 39677 ...
 $ V8: num  0.001499 0.002627 0.000793 0.003519 0.004935 ...


> head(DF)
  V1    V2 V3         V4       V5       V6    V7       V8
1  1 35004 AL      ACMAR 86.51557 33.58413  6055 0.001499
2  1 35005 AL ADAMSVILLE 86.95973 33.58844 10616 0.002627
3  1 35006 AL      ADGER 87.16746 33.43428  3205 0.000793
4  1 35007 AL   KEYSTONE 86.81286 33.23687 14218 0.003519
5  1 35010 AL   NEW SITE 85.95109 32.94145 19942 0.004935
6  1 35014 AL     ALPINE 86.20893 33.33116  3062 0.000758
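
Since the census URL may no longer be reachable, the same approach can be tried offline. The sketch below is only an illustration: it feeds two records (values taken from the output above, the second row with rounded coordinates) to read.csv() through its text argument, and the column names fips, zip, state, city, lon, lat, pop and alloc are assumptions, not names from the file. Reading the first four columns as character keeps leading zeros in codes such as "01".

```r
# Two illustrative records in the same comma-separated layout.
lines <- c("01, 35004, AL, ACMAR, 86.51557, 33.584132, 6055, 0.001499",
           "01, 35005, AL, ADAMSVILLE, 86.95973, 33.58844, 10616, 0.002627")

# strip.white = TRUE drops the blanks after each comma in character fields;
# the column names are hypothetical labels chosen for this example.
DF <- read.csv(text = lines, header = FALSE, strip.white = TRUE,
               col.names = c("fips", "zip", "state", "city",
                             "lon", "lat", "pop", "alloc"),
               colClasses = c(rep("character", 4), rep("numeric", 4)))

# Keep only the three columns asked for in the question.
zips <- DF[, c("zip", "state", "city")]
print(zips)
```

No loop or regular expression is needed here: read.csv() splits every line in one vectorised pass, which is why it is fast even for the full file.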


HTH,

Marc Schwartz



[R] Regular Expressions for "Large" Data Set

2011-06-07 Thread Abraham Mathew
I'm running R 2.13 on Ubuntu 10.10

I have a data set which is comprised of character strings.

site = readLines('http://www.census.gov/tiger/tms/gazetteer/zips.txt')

dat <- c("01, 35004, AL, ACMAR, 86.51557, 33.584132, 6055, 0.001499")
dat

I want to loop through the data and construct a data frame with the zip
code, state abbreviation, and city name in separate columns. Given the size
of this data set, I was wondering if there was an efficient way to get the
desired results.

Thanks
Abraham




Re: [R] Regular Expressions in Column Headings

2011-03-09 Thread Gabor Grothendieck
On Wed, Mar 9, 2011 at 8:52 AM, Matthew DeAngelis  wrote:
> Hi all,
>
> I am hoping that someone can help me with a problem I am having with column
> headings.  I have read a table into R using read.table: the rows are
> documents, and the columns are counts of regular expression matches (so that
> the column heading is the given regular expression).  My problem is that
> read.table seems to be trying to interpret the regular expressions, or has
> trouble with the special characters, so that the column headings are not
> coming out correctly.  For example, a column headed with: \bV\.?A\.?T\.?
> will come out as X.bV...A...T...  This would not be a problem, since the
> regular expressions are still readable, except that I have a number of other
> tables that I will need to intersect with these column headings.  In some of
> those tables, the regular expressions are data, and they are coming in
> correctly (although R seems to be doubling "\"s, which is fine so long as it
> does this consistently).
>
> I have also tried importing the column names as a vector and specifying that
> vector explicitly using col.names, but R still transforms the provided names
> as above.  Is it possible to force R to read in regular expressions
> completely literally, with no interpretation?  Alternately, can I force R to
> interpret the column headings in the same way that it interprets data (i.e.
> adding the extra slash), so that I can match on these values?
>

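See the check.names argument of read.table.  With check.names = FALSE the
column names are kept exactly as given instead of being passed through
make.names().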
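
As a rough sketch of the effect, the snippet below reads the same small tab-separated table twice. The table contents (the doc column and the counts) are invented for illustration; only the regular-expression heading comes from the question.

```r
# A tiny tab-separated table whose second heading is a regular expression.
# In R source the backslashes are doubled; the text itself holds single ones.
txt <- "doc\t\\bV\\.?A\\.?T\\.?\nd1\t3\nd2\t0\n"

# Default behaviour: headings are made syntactically valid by make.names().
mangled <- read.table(text = txt, header = TRUE, sep = "\t")
print(names(mangled))

# check.names = FALSE leaves the headings untouched.
literal <- read.table(text = txt, header = TRUE, sep = "\t",
                      check.names = FALSE)
print(names(literal))

# Such columns are then accessed by the literal string.
print(literal[["\\bV\\.?A\\.?T\\.?"]])
```

The doubled backslashes seen when R prints such strings are only the printed representation; cat() or nchar() shows that each one is a single character, so the names should match regular expressions stored as data in other tables.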

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



[R] Regular Expressions in Column Headings

2011-03-09 Thread Matthew DeAngelis
Hi all,

I am hoping that someone can help me with a problem I am having with column
headings.  I have read a table into R using read.table: the rows are
documents, and the columns are counts of regular expression matches (so that
the column heading is the given regular expression).  My problem is that
read.table seems to be trying to interpret the regular expressions, or has
trouble with the special characters, so that the column headings are not
coming out correctly.  For example, a column headed with: \bV\.?A\.?T\.?
will come out as X.bV...A...T...  This would not be a problem, since the
regular expressions are still readable, except that I have a number of other
tables that I will need to intersect with these column headings.  In some of
those tables, the regular expressions are data, and they are coming in
correctly (although R seems to be doubling "\"s, which is fine so long as it
does this consistently).

I have also tried importing the column names as a vector and specifying that
vector explicitly using col.names, but R still transforms the provided names
as above.  Is it possible to force R to read in regular expressions
completely literally, with no interpretation?  Alternately, can I force R to
interpret the column headings in the same way that it interprets data (i.e.
adding the extra slash), so that I can match on these values?


Thanks,
Matt



Re: [R] Regular Expressions

2010-11-05 Thread Gabor Grothendieck
2010/11/5 Brian Diggs :
> Is there a standard, built in way to get both (all) backreferences at the
> same time with just one call to sub (or the appropriate function)? I can
> cobble something together specifically for 2 backreferences (not extensively
> tested):
>
> both_backrefs <- function(pattern, x) {
>        s <- sub(pattern, "\\1\034\\2", x)
>        matrix(unlist(strsplit(s,"\034")), ncol=2, byrow=TRUE)
> }
>
> both_backrefs(regex, x)
>
> However, putting the parts back together into a string (with a delimiter
> that hopefully won't be in the string otherwise) just to use strsplit to
> pull them apart seems inelegant (as does making multiple calls to sub()).
>  sub() (and siblings) surely already have the backreferences as strings at
> some point in the processing, but I don't see a way to return them as a
> vector or matrix, only to substitute using backreferences (sub) or return
> indices pointing to where the matches start (regexpr) or return the whole
> string matches (grep with value=TRUE).
>

The gsubfn package has gsubfn, which is like gsub except that it can take a
function in place of the replacement string.  The function's arguments
are the match or the backreferences, and the function's output replaces
the match.  It also has strapply, which does the same thing except that,
instead of inserting the function's output, it returns the
function's output.  See the home page at http://gsubfn.googlecode.com
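
For readers on a later version of R than the one current in this thread, base R also offers regexec() and regmatches(), which return the whole match together with every parenthesized group in one call. The following is a minimal sketch of that alternative, not the gsubfn approach described above; the second input string is made up for illustration.

```r
x <- c("10 Nov 13.00 (PFE1020K13)",
       "05 Dec 14.50 (XYZ0520L14)")   # second string is invented
regex <- "([[:digit:]]{2})[[:space:]]([[:alpha:]]{3})"

# regexec() records the positions of the match and of each group;
# regmatches() turns those positions into substrings.
m <- regmatches(x, regexec(regex, x))

# Element 1 of each result is the whole match; drop it to keep the groups.
groups <- do.call(rbind, lapply(m, function(v) v[-1]))
print(groups)
```

Each row of groups corresponds to one input string and each column to one capture group, which avoids gluing the pieces together with a delimiter and splitting them again.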

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular Expressions

2010-11-05 Thread Brian Diggs

On 11/5/2010 12:09 AM, Prof Brian Ripley wrote:
> On Thu, 4 Nov 2010, Noah Silverman wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to use capturing parentheses in regular
>> expressions in R. (Doing this in Perl, Java, etc. is fairly trivial,
>> but I can't seem to find the functionality in R.)
>>
>> For example, given the string: "10 Nov 13.00 (PFE1020K13)"
>>
>> I want to capture the first two digits and then the month abbreviation.
>>
>> In Perl, this would be
>>
>> /^(\d\d)\s(\w\w\w)\s/
>>
>> Then I have the variables $1 and $2 assigned to the capturing
>> parentheses.
>>
>> I've found the grep and sub commands in R, but the docs don't indicate
>> any way to capture things.
>>
>> Any suggestions?
>
> Read the link to ?regexp. It *does* 'indicate the way to capture
> things'.
>
>  The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
>  previously matched by the Nth parenthesized subexpression of the
>  regular expression. (This is an extension for extended regular
>  expressions: POSIX defines them only for basic ones.)
>
> and there is an example on the help page for grep():
>
>  ## Double all 'a' or 'b's; "\" must be escaped, i.e., 'doubled'
>  gsub("([ab])", "\\1_\\1_", "abc and ABC")
>
> In your example
>
> x <- "10 Nov 13.00 (PFE1020K13)"
> regex <- "(\\d\\d)\\s(\\w\\w\\w).*"
> sub(regex, "\\1", x, perl = TRUE)
> sub(regex, "\\2", x, perl = TRUE)
>
> A better way to do this would be something like
>
> regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"
>
> which is also a POSIX extended regexp.


Is there a standard, built in way to get both (all) backreferences at 
the same time with just one call to sub (or the appropriate function)? 
I can cobble something together specifically for 2 backreferences (not 
extensively tested):


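both_backrefs <- function(pattern, x) {
    # Join the two backreferences with a separator unlikely to occur in x
    s <- sub(pattern, "\\1\034\\2", x)
    # Split them apart again: one row per input string, one column per group
    matrix(unlist(strsplit(s, "\034")), ncol = 2, byrow = TRUE)
}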

both_backrefs(regex, x)

However, putting the parts back together into a string (with a delimiter 
that hopefully won't be in the string otherwise) just to use strsplit to 
pull them apart seems inelegant (as does making multiple calls to 
sub()).  sub() (and siblings) surely already have the backreferences as 
strings at some point in the processing, but I don't see a way to return 
them as a vector or matrix, only to substitute using backreferences 
(sub) or return indices pointing to where the matches start (regexpr) 
or return the whole string matches (grep with value=TRUE).


--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University



Re: [R] Regular Expressions

2010-11-05 Thread Noah Silverman
That's perfect! 

Don't know how I missed that.

I want to start playing with some modeling of financial data and the
only format I can download is rather ugly.  So my plan is to use a
series of Regex to extract what I want.

Noticed that you are a Prof. in applied stats.  I'm at UCLA working on
an MS in stats.  My department is fairly flexible, so I'm taking several
finance courses as part of my work.  Currently debating if I want to
graduate with an MS in June, or roll everything into a PhD and be
finished in an extra 1-2 years.


Thanks!

-N

On 11/5/10 12:09 AM, Prof Brian Ripley wrote:
> On Thu, 4 Nov 2010, Noah Silverman wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to use capturing parentheses in regular
>> expressions in R.  (Doing this in Perl, Java, etc. is fairly trivial,
>> but I can't seem to find the functionality in R.)
>>
>> For example, given the string: "10 Nov 13.00 (PFE1020K13)"
>>
>> I want to capture the first two digits and then the month abbreviation.
>>
>> In Perl, this would be
>>
>> /^(\d\d)\s(\w\w\w)\s/
>>
>> Then I have the variables $1 and $2 assigned to the capturing
>> parentheses.
>>
>> I've found the grep and sub commands in R, but the docs don't
>> indicate any way to capture things.
>>
>> Any suggestions?
>
> Read the link to ?regexp.  It *does* 'indicate the way to capture
> things'.
>
>  The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
>  previously matched by the Nth parenthesized subexpression of the
>  regular expression.  (This is an extension for extended regular
>  expressions: POSIX defines them only for basic ones.)
>
> and there is an example on the help page for grep():
>
>  ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
>  gsub("([ab])", "\\1_\\1_", "abc and ABC")
>
> In your example
>
> x <- "10 Nov 13.00 (PFE1020K13)"
> regex <- "(\\d\\d)\\s(\\w\\w\\w).*"
> sub(regex, "\\1", x, perl = TRUE)
> sub(regex, "\\2", x, perl = TRUE)
>
> A better way to do this would be something like
>
> regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"
>
> which is also a POSIX extended regexp.
>



Re: [R] Regular Expressions

2010-11-05 Thread Prof Brian Ripley

On Thu, 4 Nov 2010, Noah Silverman wrote:

> Hi,
>
> I'm trying to figure out how to use capturing parentheses in regular
> expressions in R.  (Doing this in Perl, Java, etc. is fairly trivial, but I
> can't seem to find the functionality in R.)
>
> For example, given the string: "10 Nov 13.00 (PFE1020K13)"
>
> I want to capture the first two digits and then the month abbreviation.
>
> In Perl, this would be
>
> /^(\d\d)\s(\w\w\w)\s/
>
> Then I have the variables $1 and $2 assigned to the capturing parentheses.
>
> I've found the grep and sub commands in R, but the docs don't indicate any
> way to capture things.
>
> Any suggestions?

Read the link to ?regexp.  It *does* 'indicate the way to capture
things'.


 The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
 previously matched by the Nth parenthesized subexpression of the
 regular expression.  (This is an extension for extended regular
 expressions: POSIX defines them only for basic ones.)

and there is an example on the help page for grep():

 ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
 gsub("([ab])", "\\1_\\1_", "abc and ABC")

In your example

x <- "10 Nov 13.00 (PFE1020K13)"
regex <- "(\\d\\d)\\s(\\w\\w\\w).*"
sub(regex, "\\1", x, perl = TRUE)
sub(regex, "\\2", x, perl = TRUE)

A better way to do this would be something like

regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"

which is also a POSIX extended regexp.
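
As a small sketch of how the POSIX-class version behaves, the snippet below uses it with R's default regex engine (no perl = TRUE). The leading ^ anchor and the variable names day, month and swapped are additions for illustration; they mirror the Perl pattern in the question rather than anything stated in this reply.

```r
x <- "10 Nov 13.00 (PFE1020K13)"
regex <- "^([[:digit:]]{2})\\s([[:alpha:]]{3}).*"

# The trailing .* makes the pattern cover the whole string, so the
# replacement leaves only what the backreferences put back.
day     <- sub(regex, "\\1", x)
month   <- sub(regex, "\\2", x)
swapped <- sub(regex, "\\2-\\1", x)   # both groups in one replacement
print(c(day, month, swapped))         # "10" "Nov" "Nov-10"
```

One caveat worth keeping in mind: when the pattern does not match at all, sub() returns the input unchanged, so it may be safer to check with grepl() first when the data are messy.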

--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


[R] Regular Expressions

2010-11-04 Thread Noah Silverman

Hi,

I'm trying to figure out how to use capturing parentheses in regular 
expressions in R.  (Doing this in Perl, Java, etc. is fairly trivial, 
but I can't seem to find the functionality in R.)


For example, given the string: "10 Nov 13.00 (PFE1020K13)"

I want to capture the first two digits and then the month abbreviation.

In Perl, this would be

/^(\d\d)\s(\w\w\w)\s/

Then I have the variables $1 and $2 assigned to the capturing parentheses.

I've found the grep and sub commands in R, but the docs don't indicate 
any way to capture things.


Any suggestions?

Thanks!

-N



Re: [R] Regular expressions: offsets of groups

2010-09-30 Thread Titus von der Malsburg
Ok, we decided to have a shot at modifying gregexpr.  Let's see how it
works out.  If anybody is interested in discussing this please contact
me.  R-help doesn't seem like the right place for further discussion.
Is there a default place for discussing things like that?

Thanks everybody for your responses!

  Titus



Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Titus von der Malsburg
On Wed, Sep 29, 2010 at 1:58 PM, Michael Bedward
 wrote:
> How is your C coding ? Bill ? Anyone else ?  I could have a go at
> writing some prototype code to test in the next few days, though if
> someone else with decent C skills is itching to do it please speak up.

We have a skilled C- and R-programmer who could work on it. I'll talk to him.

   Titus



Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Michael Bedward
I'd definitely be a customer for it Titus. And it does seem like an
obvious hole in regex processing in R that cries out to be filled.

Um, ggregexpr isn't the sexiest of function names :)  Perhaps we can
think of something a little easier ?

How is your C coding ? Bill ? Anyone else ?  I could have a go at
writing some prototype code to test in the next few days, though if
someone else with decent C skills is itching to do it please speak up.

Michael

On 29 September 2010 20:08, Titus von der Malsburg  wrote:
> Bill, Michael,
>
> good to see I'm not the only one who sees potential for improvements
> in the regexpr domain.  Adding a subpattern argument is certainly a
> step in the right direction and would make my life much easier.
> However, in my application I need to know not only the position of one
> group but also the position of the overall match in the original
> string.  The ideal solution would provide positions and match lengths
> for the whole pattern and for all groups if desired.  Only this would
> solve all related issues.  One possibility is to have a subpattern
> argument that accepts a vector of numbers (0 refers to the whole
> pattern):
>
>  > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
>  [[1]]:
>  [[1]][[1]]:
>  [1] 1 5
>  attr(, "match.length"):
>  [1] 2 4
>  [[1]][[2]]:
>  [1] 2 7
>  attr(, "match.length"):
>  [1] 1 2
>
> A weakness of this solution is that the structure of the return values
> changes if length(subpattern)>1.  An alternative is to have a separate
> function, say ggregexpr for group gregexpr, that returns a list of
> lists as in the above example.  This function would always return
> positions and match lengths of the whole pattern (group 0) and all
> groups.  The original gregexpr could still have the subpattern
> argument but it would only accept single numbers.  This way the return
> format of gregexpr remains the same.
>
> Best,
>
>  Titus
>
>
> On Wed, Sep 29, 2010 at 2:42 AM, Michael Bedward
>  wrote:
>> Ah, that's interesting - thanks Bill. That's certainly on the right
>> track for me (Titus, you too ?) especially if the subpattern argument
>> accepted a vector of multiple group indices.
>>
>> As you say, this is straightforward in C. I'd be happy to (try to)
>> make a patch for the R sources if there was some consensus on the best
>> way to implement it, ie. as a new R function or by extending existing
>> function(s).
>



Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Titus von der Malsburg
Bill, Michael,

good to see I'm not the only one who sees potential for improvements
in the regexpr domain.  Adding a subpattern argument is certainly a
step in the right direction and would make my life much easier.
However, in my application I need to know not only the position of one
group but also the position of the overall match in the original
string.  The ideal solution would provide positions and match lengths
for the whole pattern and for all groups if desired.  Only this would
solve all related issues.  One possibility is to have a subpattern
argument that accepts a vector of numbers (0 refers to the whole
pattern):

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
 [[1]]:
 [[1]][[1]]:
 [1] 1 5
 attr(, "match.length"):
 [1] 2 4
 [[1]][[2]]:
 [1] 2 7
 attr(, "match.length"):
 [1] 1 2

A weakness of this solution is that the structure of the return values
changes if length(subpattern)>1.  An alternative is to have a separate
function, say ggregexpr for group gregexpr, that returns a list of
lists as in the above example.  This function would always return
positions and match lengths of the whole pattern (group 0) and all
groups.  The original gregexpr could still have the subpattern
argument but it would only accept single numbers.  This way the return
format of gregexpr remains the same.
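
For readers of the archive: a minimal sketch of how the same
information can be read off in later R releases, assuming a version in
which gregexpr() with perl = TRUE attaches "capture.start" and
"capture.length" attributes to its result.  The variable names are made
up for illustration, and this is not the subpattern interface proposed
above.

```r
# One search; positions of the whole match and of group 1 are both available.
m <- gregexpr("a+(b+)", "abcdaabbc", perl = TRUE)[[1]]

whole_start  <- as.vector(m)                      # 1 5
whole_length <- attr(m, "match.length")           # 2 4

# Capture attributes are matrices: one row per match, one column per group.
group_start  <- attr(m, "capture.start")[, 1]     # 2 7
group_length <- attr(m, "capture.length")[, 1]    # 1 2
```

The return value of gregexpr() keeps its usual shape, so existing code
that only looks at the whole match is unaffected.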

Best,

  Titus


On Wed, Sep 29, 2010 at 2:42 AM, Michael Bedward
 wrote:
> Ah, that's interesting - thanks Bill. That's certainly on the right
> track for me (Titus, you too ?) especially if the subpattern argument
> accepted a vector of multiple group indices.
>
> As you say, this is straightforward in C. I'd be happy to (try to)
> make a patch for the R sources if there was some consensus on the best
> way to implement it, ie. as a new R function or by extending existing
> function(s).



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Michael Bedward
Ah, that's interesting - thanks Bill. That's certainly on the right
track for me (Titus, you too ?) especially if the subpattern argument
accepted a vector of multiple group indices.

As you say, this is straightforward in C. I'd be happy to (try to)
make a patch for the R sources if there was some consensus on the best
way to implement it, ie. as a new R function or by extending existing
function(s).

Michael

On 29 September 2010 01:46, William Dunlap wrote:
>
> S+ has a subpattern=number argument to regexpr and
> related functions.  It means that the text matched
> by the subpattern'th parenthesized expression in the
> pattern will be considered the matched text.  E.g.,
> to find runs of b's that come immediately after a's:
>
>  > gregexpr("a+(b+)", "abcdaabbc", subpattern=1)
>  [[1]]:
>  [1] 2 7
>  attr(, "match.length"):
>  [1] 1 2
>
> or to find bc's that come after 2 or more ab's
>  > gregexpr("(ab){2,}bc", "abbcabababbcabcababbc", subpattern=1)
>
> regexpr() and strsplit() have this argument in S+ 8.1 but
> gregexpr() is not yet in a released version of S+.
>
> subpattern=0, the default, means to use the entire
> pattern.  regexpr allows subpattern=-1, which means
> to return a list with one element for each subpattern.
> I don't know if the extra complexity is worth it.
> (gregexpr does not allow subpattern=-1.)
>
> The usual C regexec() returns this information.
> Perhaps it would be handy to have it in R.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread William Dunlap

> -Original Message-
> From: r-help-boun...@r-project.org 
> [mailto:r-help-boun...@r-project.org] On Behalf Of Michael Bedward
> Sent: Tuesday, September 28, 2010 12:46 AM
> To: Titus von der Malsburg
> Cc: r-help@r-project.org
> Subject: Re: [R] Regular expressions: offsets of groups
> 
> What Titus wants to do is akin to retrieving capturing groups from a
> Matcher object in Java. I also thought there must be an existing,
> elegant solution to this some time ago and searched for it, including
> looking at the sources (albeit with not much expertise) but came up
> blank.
> 
> I also looked at the stringr package (which is nice) but it doesn't
> quite do it either.

S+ has a subpattern=number argument to regexpr and
related functions.  It means that the text matched
by the subpattern'th parenthesized expression in the
pattern will be considered the matched text.  E.g.,
to find runs of b's that come immediately after a's:

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=1)
  [[1]]:
  [1] 2 7
  attr(, "match.length"):
  [1] 1 2

or to find bc's that come after 2 or more ab's
  > gregexpr("(ab){2,}bc", "abbcabababbcabcababbc", subpattern=1)

regexpr() and strsplit() have this argument in S+ 8.1 but
gregexpr() is not yet in a released version of S+.

subpattern=0, the default, means to use the entire
pattern.  regexpr allows subpattern=-1, which means
to return a list with one element for each subpattern.
I don't know if the extra complexity is worth it.
(gregexpr does not allow subpattern=-1.)

The usual C regexec() returns this information.
Perhaps it would be handy to have it in R.
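
A note for readers of the archive: R did later gain a regexec()
function.  A minimal sketch, assuming a version of R that provides it;
it reports the whole match first and then each parenthesized group, but
only for the first match in each string.

```r
# First element: whole match; following elements: groups in order.
m <- regexec("a+(b+)", "abcdaabbc")[[1]]

starts <- as.vector(m)               # 1 2  (whole match, then group 1)
lens   <- attr(m, "match.length")    # 2 1
```

For all matches rather than the first, the capture attributes of
gregexpr(..., perl = TRUE) are an alternative.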

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> Michael
> 
> On 28 September 2010 01:48, Titus von der Malsburg 
>  wrote:
> > Dear list!
> >
> >> gregexpr("a+(b+)", "abcdaabbc")
> > [[1]]
> > [1] 1 5
> > attr(,"match.length")
> > [1] 2 4
> >
> > What I want is the offsets of the matches for the group (b+), i.e. 2
> > and 7, not the offsets of the complete matches.  Is there a way in R
> > to get that?
> >
> > I know about gsubfn and strapply, but they only give me the strings
> > matched by groups not their offsets.
> >
> > I could write something myself that first takes the above matches
> > ("ab" and "aabb") and then searches again using only the group (b+).
> > For this to work, I'd have to parse the regular expression 
> and search
> > several times (> 2, for nested groups) instead of just 
> once.  But I'm
> > sure there is a better way to do this.
> >
> > Thanks for any suggestion!
> >
> >   Titus
> >
> > __
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Gabor Grothendieck
On Tue, Sep 28, 2010 at 6:52 AM, Titus von der Malsburg
 wrote:
> On Tue, Sep 28, 2010 at 9:46 AM, Michael Bedward
>  wrote:
>> What Titus wants to do is akin to retrieving capturing groups from a
>> Matcher object in Java.
>
> Precisely.  Here's the description:
>
>  http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#start(int)
>
> Gabor's lookbehind trick solves some special cases but it's not the

The only limitation is that in the regular expressions supported by R
you cannot have repetition in the (?<=...) portion, but none of your
examples -- neither the one you gave nor the one below -- requires that,
since if the prior expression ends in X+ you can just use X.  Are
you sure it does not cover all your actual situations?

If you truly do have situations that require repetition, gregexpr
plus gsubfn will do it in one line.   Parenthesize the
portion of the regular expression you want to capture and replace
every character in it with X (or some other character that does not
otherwise occur).  Then find the positions and lengths of strings of
X.

> gregexpr("X+", gsubfn("a(b+)", ~ gsub(".", "X", x), "abcdaabbcbbb"))
[[1]]
[1] 1 5
attr(,"match.length")
[1] 1 2

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Titus von der Malsburg
On Tue, Sep 28, 2010 at 9:46 AM, Michael Bedward
 wrote:
> What Titus wants to do is akin to retrieving capturing groups from a
> Matcher object in Java.

Precisely.  Here's the description:

  http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#start(int)

Gabor's lookbehind trick solves some special cases but it's not the
kind of general solution I'm looking for.  Let me explain what I'm
trying to achieve here.  I'm working on a package that provides tools
for processing and analyzing eye movements (we're doing reading
research).  In most situations, eye movements consist of fixations
where the eyes are relatively stationary and saccades, quick movements
between fixations.  A common way to represent eye movements is as
strings of symbols, where each symbol corresponds to a fixation on a
particular region.  AABC means two fixations on A followed by a fixation on
B and then C.  When people analyze eye movements it's often necessary
to find specific events in the eye movement record like: fixations on
the word C preceded by fixations on words D-F and followed by
fixations on words A-C.  This event can be specified using this
regexpr: "[D-F]+(C)[A-C]+"  The group (in parenthesis) indicates the
substring for which I'd like to know the position in the overall
string.  Another application is the extraction of subsequences from a
sequence of fixations.  Note that in some situations people might have
to use more groups in their regexprs and that groups can be nested.
In this case the user would have to indicate for which group he/she
wants to know the offset.  I'm not an expert on regexp engines, but
I'm pretty sure the necessary information is available in the engine.
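
To make the request concrete, here is a sketch of the kind of helper
meant above.  It assumes an R version in which gregexpr() with
perl = TRUE returns "capture.start" and "capture.length" attributes;
group_offsets() and its arguments are hypothetical names, not part of
any package.

```r
# Hypothetical helper: start and length of one capture group for every
# match of `pattern` in a single string `fixations`.
group_offsets <- function(pattern, fixations, group = 1) {
  m <- gregexpr(pattern, fixations, perl = TRUE)[[1]]
  if (m[1] == -1) {
    # No match at all: return an empty table with the same columns.
    return(data.frame(start = integer(0), length = integer(0)))
  }
  data.frame(start  = attr(m, "capture.start")[, group],
             length = attr(m, "capture.length")[, group])
}

# Fixations on C preceded by D-F and followed by A-C.
res <- group_offsets("[D-F]+(C)[A-C]+", "ABDEFCABCDDCA")
res$start    # 6 12
res$length   # 1 1
```

With several (possibly nested) groups in the pattern, the group
argument selects which one to report; groups are numbered by their
opening parenthesis.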

Gabor, I see you're the author of gsubfn (fantastic package!).  Do you
see a relatively simple way to expose information about group offsets
and their corresponding match lengths?  I think this could be useful
for other applications as well.  At least it seems Michael could use
it, too.  We can cook up something for ourselves but a general
solution would benefit the larger community.

   Titus



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Michael Bedward
What Titus wants to do is akin to retrieving capturing groups from a
Matcher object in Java. I also thought there must be an existing,
elegant solution to this some time ago and searched for it, including
looking at the sources (albeit with not much expertise) but came up
blank.

I also looked at the stringr package (which is nice) but it doesn't
quite do it either.

Michael

On 28 September 2010 01:48, Titus von der Malsburg  wrote:
> Dear list!
>
>> gregexpr("a+(b+)", "abcdaabbc")
> [[1]]
> [1] 1 5
> attr(,"match.length")
> [1] 2 4
>
> What I want is the offsets of the matches for the group (b+), i.e. 2
> and 7, not the offsets of the complete matches.  Is there a way in R
> to get that?
>
> I know about gsubfn and strapply, but they only give me the strings
> matched by groups not their offsets.
>
> I could write something myself that first takes the above matches
> ("ab" and "aabb") and then searches again using only the group (b+).
> For this to work, I'd have to parse the regular expression and search
> several times (> 2, for nested groups) instead of just once.  But I'm
> sure there is a better way to do this.
>
> Thanks for any suggestion!
>
>   Titus
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Gabor Grothendieck
On Mon, Sep 27, 2010 at 1:34 PM, Titus von der Malsburg
 wrote:
> On Mon, Sep 27, 2010 at 7:29 PM, Gabor Grothendieck
>  wrote:
>> Try this zero width negative look behind expression:
>>
>>> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
>> [[1]]
>> [1] 2 7
>> attr(,"match.length")
>> [1] 1 2
>
> Thanks Gabor, but this gives me the same result as
>
>  gregexpr("b+", "abcdaabbc", perl = TRUE)
>
> which is wrong if the string is "abcdaabbcbbb".
>

Sorry, try this:

>  gregexpr("(?<=a)b+", "abcdaabbcbbb", perl = TRUE)
[[1]]
[1] 2 7
attr(,"match.length")
[1] 1 2

Note that it does not give the same answer as:

>  gregexpr("b+", "abcdaabbcbbb", perl = TRUE)
[[1]]
[1]  2  7 10
attr(,"match.length")
[1] 1 2 3
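
A related sketch for the case where the prefix has variable length,
which a lookbehind cannot express: with perl = TRUE, PCRE's \K discards
everything matched so far from the reported match.  This assumes the
PCRE build used by R supports \K.

```r
x <- "abcdaabbcbbb"

# a+ must be present, but \K resets the match start to the first b.
m <- gregexpr("a+\\Kb+", x, perl = TRUE)[[1]]

as.vector(m)               # 2 7
attr(m, "match.length")    # 1 2
```

The trailing "bbb" is not reported because no a precedes it.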


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Henrique Dallazuanna
Have you tried:

gregexpr("b+", "abcdaabbc")


On Mon, Sep 27, 2010 at 12:48 PM, Titus von der Malsburg  wrote:

> Dear list!
>
> > gregexpr("a+(b+)", "abcdaabbc")
> [[1]]
> [1] 1 5
> attr(,"match.length")
> [1] 2 4
>
> What I want is the offsets of the matches for the group (b+), i.e. 2
> and 7, not the offsets of the complete matches.  Is there a way in R
> to get that?
>
> I know about gsubfn and strapply, but they only give me the strings
> matched by groups not their offsets.
>
> I could write something myself that first takes the above matches
> ("ab" and "aabb") and then searches again using only the group (b+).
> For this to work, I'd have to parse the regular expression and search
> several times (> 2, for nested groups) instead of just once.  But I'm
> sure there is a better way to do this.
>
> Thanks for any suggestion!
>
>   Titus
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Henrique Dallazuanna
You could do this:

gregexpr("ab+", "abcdaabbcbb")[[1]] + 1

On Mon, Sep 27, 2010 at 2:25 PM, Titus von der Malsburg
wrote:

> On Mon, Sep 27, 2010 at 7:16 PM, Henrique Dallazuanna 
> wrote:
> > You've tried:
> >
> > gregexpr("b+", "abcdaabbc")
>
> But this would match the third occurrence of b+ in "abcdaabbcbb".  But
> in this example I'm only interested in b+ if it's preceded by a+.
>
>  Titus
>



-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
On Mon, Sep 27, 2010 at 7:29 PM, Gabor Grothendieck
 wrote:
> Try this zero width negative look behind expression:
>
>> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
> [[1]]
> [1] 2 7
> attr(,"match.length")
> [1] 1 2

Thanks Gabor, but this gives me the same result as

  gregexpr("b+", "abcdaabbc", perl = TRUE)

which is wrong if the string is "abcdaabbcbbb".

  Titus



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Gabor Grothendieck
On Mon, Sep 27, 2010 at 11:48 AM, Titus von der Malsburg
 wrote:
> Dear list!
>
>> gregexpr("a+(b+)", "abcdaabbc")
> [[1]]
> [1] 1 5
> attr(,"match.length")
> [1] 2 4
>
> What I want is the offsets of the matches for the group (b+), i.e. 2
> and 7, not the offsets of the complete matches.  Is there a way in R
> to get that?
>
> I know about gsubfn and strapply, but they only give me the strings
> matched by groups not their offsets.
>
> I could write something myself that first takes the above matches
> ("ab" and "aabb") and then searches again using only the group (b+).
> For this to work, I'd have to parse the regular expression and search
> several times (> 2, for nested groups) instead of just once.  But I'm
> sure there is a better way to do this.
>

Try this zero width negative look behind expression:

> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
[[1]]
[1] 2 7
attr(,"match.length")
[1] 1 2

See ?regexp for more info.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
On Mon, Sep 27, 2010 at 7:16 PM, Henrique Dallazuanna  wrote:
> You've tried:
>
> gregexpr("b+", "abcdaabbc")

But this would match the third occurrence of b+ in "abcdaabbcbb".  But
in this example I'm only interested in b+ if it's preceded by a+.

  Titus



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
Thank you Jim, but just as the solution that I discussed, your
proposal involves deconstructing the pattern and searching several
times.  I'm looking for a general and efficient solution.  Internally,
the regexpr engine has all necessary information after one pass
through the string.  What I need is an interface that exposes this
information.

  Titus

On Mon, Sep 27, 2010 at 6:43 PM, jim holtman  wrote:
> try this:
>
>> x <-  gregexpr("a+(b+)", "abcdaabbcaaacaaab")
>> justA <-  gregexpr("a+", "abcdaabbcaaacaaab")
>> # find matches in 'x' for 'justA'
>> indx <- which(justA[[1]] %in% x[[1]])
>> # now determine where 'b' starts
>> justA[[1]][indx] + attr(justA[[1]], 'match.length')[indx]
> [1]  2  7 17
>>
>
>
> On Mon, Sep 27, 2010 at 11:48 AM, Titus von der Malsburg
>  wrote:
>> Dear list!
>>
>>> gregexpr("a+(b+)", "abcdaabbc")
>> [[1]]
>> [1] 1 5
>> attr(,"match.length")
>> [1] 2 4
>>
>> What I want is the offsets of the matches for the group (b+), i.e. 2
>> and 7, not the offsets of the complete matches.  Is there a way in R
>> to get that?
>>
>> I know about gsubfn and strapply, but they only give me the strings
>> matched by groups not their offsets.
>>
>> I could write something myself that first takes the above matches
>> ("ab" and "aabb") and then searches again using only the group (b+).
>> For this to work, I'd have to parse the regular expression and search
>> several times (> 2, for nested groups) instead of just once.  But I'm
>> sure there is a better way to do this.
>>
>> Thanks for any suggestion!
>>
>>   Titus
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread jim holtman
try this:

> x <-  gregexpr("a+(b+)", "abcdaabbcaaacaaab")
> justA <-  gregexpr("a+", "abcdaabbcaaacaaab")
> # find matches in 'x' for 'justA'
> indx <- which(justA[[1]] %in% x[[1]])
> # now determine where 'b' starts
> justA[[1]][indx] + attr(justA[[1]], 'match.length')[indx]
[1]  2  7 17
>


On Mon, Sep 27, 2010 at 11:48 AM, Titus von der Malsburg
 wrote:
> Dear list!
>
>> gregexpr("a+(b+)", "abcdaabbc")
> [[1]]
> [1] 1 5
> attr(,"match.length")
> [1] 2 4
>
> What I want is the offsets of the matches for the group (b+), i.e. 2
> and 7, not the offsets of the complete matches.  Is there a way in R
> to get that?
>
> I know about gsubfn and strapply, but they only give me the strings
> matched by groups not their offsets.
>
> I could write something myself that first takes the above matches
> ("ab" and "aabb") and then searches again using only the group (b+).
> For this to work, I'd have to parse the regular expression and search
> several times (> 2, for nested groups) instead of just once.  But I'm
> sure there is a better way to do this.
>
> Thanks for any suggestion!
>
>   Titus
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



[R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
Dear list!

> gregexpr("a+(b+)", "abcdaabbc")
[[1]]
[1] 1 5
attr(,"match.length")
[1] 2 4

What I want is the offsets of the matches for the group (b+), i.e. 2
and 7, not the offsets of the complete matches.  Is there a way in R
to get that?

I know about gsubfn and strapply, but they only give me the strings
matched by groups not their offsets.

I could write something myself that first takes the above matches
("ab" and "aabb") and then searches again using only the group (b+).
For this to work, I'd have to parse the regular expression and search
several times (> 2, for nested groups) instead of just once.  But I'm
sure there is a better way to do this.

Thanks for any suggestion!

   Titus



Re: [R] regular expressions

2009-10-26 Thread baptiste auguie
Perfect, thanks!

baptiste

2009/10/26 Gabor Grothendieck :
> Assuming only START fields match pat:
>
>> ## this one has more fields: how do I generalize the regular expression?
>> st2 = c("START text1 1 text2 2.3 text3 5", "whatever intermediate text",
> + "START text1 23.4 text2 3.1415 text3 6")
>>
>> pat <- "[[:alnum:]]+ +([0-9.]+)"
>> s <- strapply(st2, pat, c, simplify = rbind)
>>
>> pat2 <- "([[:alnum:]]+) +[0-9.]+"
>> colnames(s) <- strapply(st2[1], pat2, c, simplify = rbind)
>> s
>     text1  text2    text3
> [1,] "1"    "2.3"    "5"
> [2,] "23.4" "3.1415" "6"
>
> If there are non-START fields that do match pat then grep out the
> START fields first.
>
> On Mon, Oct 26, 2009 at 9:30 AM, baptiste auguie
>  wrote:
>> Dear list,
>>
>> I have the following text to parse (originating from readLines as some
>> lines have unequal size),
>>
>> st = c("START text1 1 text2 2.3", "whatever intermediate text", "START
>> text1 23.4 text2 3.1415")
>>
>> from which I'd like to extract the lines starting with "START", and
>> group the subsequent fields in a data.frame in this format:
>>
>>  text1  text2
>>     1    2.3
>>  23.4 3.1415
>>
>>
>> All the lines containing "START" have the same number of fields, but
>> this number may vary from file to file.
>>
>> I have managed to get this minimal example work, but I am at a loss as
>> for handling an arbitrary number of couples (text value),
>>
>> library(gsubfn)
>>
>> ( parsed =
>> strapply(st, "^START +([[:alnum:]]+) +([0-9.]+) +([[:alnum:]]+)
>> +([0-9.]+)",c, simplify=rbind,combine=c) )
>>
>> d = data.frame(parsed[ ,c(2,4)])
>> names(d) <- apply(parsed[ ,c(1,3)], 2, unique)
>> d
>>
>> ## this one has more fields: how do I generalize the regular expression?
>> st2 = c("START text1 1 text2 2.3 text3 5", "whatever intermediate
>> text", "START text1 23.4 text2 3.1415 text3 6")
>>
>> Best regards,
>>
>>
>> Baptiste
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>



Re: [R] regular expressions

2009-10-26 Thread Gabor Grothendieck
Assuming only START fields match pat:

> ## this one has more fields: how do I generalize the regular expression?
> st2 = c("START text1 1 text2 2.3 text3 5", "whatever intermediate text",
+ "START text1 23.4 text2 3.1415 text3 6")
>
> pat <- "[[:alnum:]]+ +([0-9.]+)"
> s <- strapply(st2, pat, c, simplify = rbind)
>
> pat2 <- "([[:alnum:]]+) +[0-9.]+"
> colnames(s) <- strapply(st2[1], pat2, c, simplify = rbind)
> s
     text1  text2    text3
[1,] "1"    "2.3"    "5"
[2,] "23.4" "3.1415" "6"

If there are non-START fields that do match pat then grep out the
START fields first.
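
For comparison, a base R sketch of the same parsing, including the
grep step mentioned above.  It assumes every START line alternates
name and value separated by spaces and that all START lines share the
same names; the variable names are made up for illustration.

```r
st2 <- c("START text1 1 text2 2.3 text3 5",
         "whatever intermediate text",
         "START text1 23.4 text2 3.1415 text3 6")

# Keep only the START lines and drop the leading keyword.
start_lines <- grep("^START ", st2, value = TRUE)
fields <- strsplit(sub("^START +", "", start_lines), " +")

# Each line alternates name, value, name, value, ...
is_name <- rep(c(TRUE, FALSE), length.out = length(fields[[1]]))

# One row of numeric values per START line.
values <- do.call(rbind, lapply(fields, function(f) as.numeric(f[!is_name])))
d <- as.data.frame(values)
names(d) <- fields[[1]][is_name]
d
```

Nothing here depends on the number of (name, value) pairs, so the same
code handles both st and st2.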

On Mon, Oct 26, 2009 at 9:30 AM, baptiste auguie
 wrote:
> Dear list,
>
> I have the following text to parse (originating from readLines as some
> lines have unequal size),
>
> st = c("START text1 1 text2 2.3", "whatever intermediate text", "START
> text1 23.4 text2 3.1415")
>
> from which I'd like to extract the lines starting with "START", and
> group the subsequent fields in a data.frame in this format:
>
>  text1  text2
>     1    2.3
>  23.4 3.1415
>
>
> All the lines containing "START" have the same number of fields, but
> this number may vary from file to file.
>
> I have managed to get this minimal example work, but I am at a loss as
> for handling an arbitrary number of couples (text value),
>
> library(gsubfn)
>
> ( parsed =
> strapply(st, "^START +([[:alnum:]]+) +([0-9.]+) +([[:alnum:]]+)
> +([0-9.]+)",c, simplify=rbind,combine=c) )
>
> d = data.frame(parsed[ ,c(2,4)])
> names(d) <- apply(parsed[ ,c(1,3)], 2, unique)
> d
>
> ## this one has more fields: how do I generalize the regular expression?
> st2 = c("START text1 1 text2 2.3 text3 5", "whatever intermediate
> text", "START text1 23.4 text2 3.1415 text3 6")
>
> Best regards,
>
>
> Baptiste
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



[R] regular expressions

2009-10-26 Thread baptiste auguie
Dear list,

I have the following text to parse (originating from readLines as some
lines have unequal size),

st = c("START text1 1 text2 2.3", "whatever intermediate text",
       "START text1 23.4 text2 3.1415")

from which I'd like to extract the lines starting with "START", and
group the subsequent fields in a data.frame in this format:

 text1  text2
     1    2.3
  23.4 3.1415


All the lines containing "START" have the same number of fields, but
this number may vary from file to file.

I have managed to get this minimal example to work, but I am at a loss as
to how to handle an arbitrary number of (text, value) pairs,

library(gsubfn)

( parsed =
    strapply(st, "^START +([[:alnum:]]+) +([0-9.]+) +([[:alnum:]]+) +([0-9.]+)",
             c, simplify = rbind, combine = c) )

d = data.frame(parsed[ ,c(2,4)])
names(d) <- apply(parsed[ ,c(1,3)], 2, unique)
d

## this one has more fields: how do I generalize the regular expression?
st2 = c("START text1 1 text2 2.3 text3 5", "whatever intermediate text",
        "START text1 23.4 text2 3.1415 text3 6")

Best regards,


Baptiste



Re: [R] Regular expressions: bug or misunderstanding?

2008-07-06 Thread Duncan Murdoch

On 06/07/2008 7:37 PM, Gabor Grothendieck wrote:

Look at the discussion of zero width lookahead assertions in ?regex .
Use perl = TRUE as previously indicated.


Thanks, this seems to work:

gsub( "(?



Re: [R] Regular expressions: bug or misunderstanding?

2008-07-06 Thread Gabor Grothendieck
Look at the discussion of zero width lookahead assertions in ?regex .
Use perl = TRUE as previously indicated.

On Sun, Jul 6, 2008 at 7:29 PM, Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> On 06/07/2008 5:37 PM, (Ted Harding) wrote:
>>
>> On 06-Jul-08 21:17:04, Duncan Murdoch wrote:
>>>
>>> I'm trying to write a gsub() call that takes a string and escapes all the
>>> unescaped quote marks in it.  So the string
>>>
>>> \"
>>>
>>> would be left unchanged, but
>>>
>>> \\"
>>>
>>> would be changed to
>>>
>>> \\\"
>>>
>>> because the double backslash doesn't act as an escape for the quote,
>>> the first just escapes the second.  I have the usual problems of
>>> writing regular expressions involving backslashes which make
>>> everything I write completely unreadable, so I'm going to change
>>> the problem for this post:  I will define E to be the escape
>>> character, and q to be the quote; the gsub() call would leave
>>>
>>> Eq
>>>
>>> unchanged, but would change
>>>
>>> EEq
>>>
>>> to EEEq, etc.
>>>
>>> The expression I have come up with after this change is
>>>
>>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)
>>>
>>> i.e. "(start of line, or non-escape, followed by an even number of
>>> escapes), all of which we call expression 1, followed by a quote,
>>> is replaced by expression 1 followed by an escape and a quote".
>>>
>>> This works sometimes, but not always:
>>>
>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
>>> [1] "Eq"
>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
>>> [1] "EEEq"
>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
>>> [1] "EqaEq"
>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
>>> [1] "qEq"
>>>
>>> Notice that in the final example, the first quote doesn't get escaped.
>>> Why not?
>>
>> I think (without having done the "experimental diagnostics")
>> that it's because in "qq" the first q matches (^|[^E]) because
>> it matches [^E] (i.e. is a "non-escape"); since it is followed
>> by q, it is the second q which gets the escape. Possibly you
>> need to include "^q" as an additional alternative match at the
>> start of the line.
>
> Thanks, that sounds right, but now I can't see how to fix it.  Is there
> syntax to say:  match A only if it follows B, but don't match the B part?
>
> Duncan Murdoch


Re: [R] Regular expressions: bug or misunderstanding?

2008-07-06 Thread Duncan Murdoch

On 06/07/2008 5:37 PM, (Ted Harding) wrote:
> On 06-Jul-08 21:17:04, Duncan Murdoch wrote:
>> I'm trying to write a gsub() call that takes a string and escapes all
>> the unescaped quote marks in it.  So the string
>>
>> \"
>>
>> would be left unchanged, but
>>
>> \\"
>>
>> would be changed to
>>
>> \\\"
>>
>> because the double backslash doesn't act as an escape for the quote,
>> the first just escapes the second.  I have the usual problems of
>> writing regular expressions involving backslashes which make
>> everything I write completely unreadable, so I'm going to change
>> the problem for this post:  I will define E to be the escape
>> character, and q to be the quote; the gsub() call would leave
>>
>> Eq
>>
>> unchanged, but would change
>>
>> EEq
>>
>> to EEEq, etc.
>>
>> The expression I have come up with after this change is
>>
>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)
>>
>> i.e. "(start of line, or non-escape, followed by an even number of
>> escapes), all of which we call expression 1, followed by a quote,
>> is replaced by expression 1 followed by an escape and a quote".
>>
>> This works sometimes, but not always:
>>
>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
>> [1] "Eq"
>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
>> [1] "EEEq"
>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
>> [1] "EqaEq"
>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
>> [1] "qEq"
>>
>> Notice that in the final example, the first quote doesn't get escaped.
>> Why not?
>
> I think (without having done the "experimental diagnostics")
> that it's because in "qq" the first q matches (^|[^E]) because
> it matches [^E] (i.e. is a "non-escape"); since it is followed
> by q, it is the second q which gets the escape. Possibly you
> need to include "^q" as an additional alternative match at the
> start of the line.


Thanks, that sounds right, but now I can't see how to fix it.  Is there 
syntax to say:  match A only if it follows B, but don't match the B part?


Duncan Murdoch



Re: [R] Regular expressions: bug or misunderstanding?

2008-07-06 Thread Ted Harding
On 06-Jul-08 21:17:04, Duncan Murdoch wrote:
> I'm trying to write a gsub() call that takes a string and escapes all 
> the unescaped quote marks in it.  So the string
> 
> \"
> 
> would be left unchanged, but
> 
> \\"
> 
> would be changed to
> 
> \\\"
> 
> because the double backslash doesn't act as an escape for the quote,
> the first just escapes the second.  I have the usual problems of
> writing regular expressions involving backslashes which make
> everything I write completely unreadable, so I'm going to change
> the problem for this post:  I will define E to be the escape
> character, and q to be the quote; the gsub() call would leave
> 
> Eq
> 
> unchanged, but would change
> 
> EEq
> 
> to EEEq, etc.
> 
> The expression I have come up with after this change is
> 
> gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)
> 
> i.e. "(start of line, or non-escape, followed by an even number of 
> escapes), all of which we call expression 1, followed by a quote,
> is replaced by expression 1 followed by an escape and a quote".
> 
> This works sometimes, but not always:
> 
>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
> [1] "Eq"
>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
> [1] "EEEq"
>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
> [1] "EqaEq"
>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
> [1] "qEq"
> 
> Notice that in the final example, the first quote doesn't get escaped.
> Why not?

I think (without having done the "experimental diagnostics")
that it's because in "qq" the first q matches (^|[^E]) because
it matches [^E] (i.e. is a "non-escape"); since it is followed
by q, it is the second q which gets the escape. Possibly you
need to include "^q" as an additional alternative match at the
start of the line.

Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Jul-08   Time: 22:37:10
-- XFMail --



Re: [R] Regular expressions: bug or misunderstanding?

2008-07-06 Thread Gabor Grothendieck
Try adding perl = TRUE

On Sun, Jul 6, 2008 at 5:17 PM, Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> I'm trying to write a gsub() call that takes a string and escapes all the
> unescaped quote marks in it.  So the string
>
> \"
>
> would be left unchanged, but
>
> \\"
>
> would be changed to
>
> \\\"
>
> because the double backslash doesn't act as an escape for the quote, the
> first just escapes the second.  I have the usual problems of writing regular
> expressions involving backslashes which make everything I write completely
> unreadable, so I'm going to change the problem for this post:  I will define
> E to be the escape character, and q to be the quote; the gsub() call would
> leave
>
> Eq
>
> unchanged, but would change
>
> EEq
>
> to EEEq, etc.
>
> The expression I have come up with after this change is
>
> gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)
>
> i.e. "(start of line, or non-escape, followed by an even number of escapes),
> all of which we call expression 1, followed by a quote, is replaced by
> expression 1 followed by an escape and a quote".
>
> This works sometimes, but not always:
>
>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
> [1] "Eq"
>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
> [1] "EEEq"
>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
> [1] "EqaEq"
>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
> [1] "qEq"
>
> Notice that in the final example, the first quote doesn't get escaped.  Why
> not?
>
> Duncan Murdoch


[R] Regular expressions: bug or misunderstanding?

2008-07-06 Thread Duncan Murdoch
I'm trying to write a gsub() call that takes a string and escapes all 
the unescaped quote marks in it.  So the string


\"

would be left unchanged, but

\\"

would be changed to

\\\"

because the double backslash doesn't act as an escape for the quote, the 
first just escapes the second.  I have the usual problems of writing 
regular expressions involving backslashes which make everything I write 
completely unreadable, so I'm going to change the problem for this 
post:  I will define E to be the escape character, and q to be the 
quote; the gsub() call would leave


Eq

unchanged, but would change

EEq

to EEEq, etc.

The expression I have come up with after this change is

gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)

i.e. "(start of line, or non-escape, followed by an even number of 
escapes), all of which we call expression 1, followed by a quote, is 
replaced by expression 1 followed by an escape and a quote".


This works sometimes, but not always:

> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
[1] "Eq"
> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
[1] "EEEq"
> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
[1] "EqaEq"
> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
[1] "qEq"

Notice that in the final example, the first quote doesn't get escaped.
Why not?


Duncan Murdoch



Re: [R] Regular Expressions

2008-05-13 Thread Gabor Grothendieck
On Tue, May 13, 2008 at 5:02 AM, Shubha Vishwanath Karanth
<[EMAIL PROTECTED]> wrote:
> Suppose,
>
> S=c("World_is_beautiful", "one_two_three_four","My_book")
>
> I need to extract the last but one element of the strings. So, my output 
> should look like:
>
> Ans=c("is","three","My")
>
> gsub() can do this...but wondering how do I give the regular expression
>

As others have mentioned strsplit is probably easier in this case but it can
be done with a regular expression as shown below, where [^_]+ matches
any string of characters not containing _ :

> re <- "^([^_]+_)*([^_]+)_([^_]+)$"
> gsub(re, "\\2", S)
[1] "is"    "three" "My"

The strapply function in the gsubfn package can also be used.
out below has the same value as strsplit(S, "_"):

library(gsubfn)
out <- strapply(S, "[^_]+")
sapply(out, function(x) tail(x, 2)[1])
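
For readers without gsubfn installed, the same tokenizing idea can be sketched in base R. The use of regmatches() with gregexpr() in place of strapply() is an editorial substitution, not part of the suggestion above, and the names tokens and ans are illustrative only.

```r
S <- c("World_is_beautiful", "one_two_three_four", "My_book")

# Extract every run of non-underscore characters from each string.
tokens <- regmatches(S, gregexpr("[^_]+", S))

# Take the last two tokens of each string and keep the first of them,
# i.e. the last-but-one element.
ans <- sapply(tokens, function(x) tail(x, 2)[1])
ans
# [1] "is"    "three" "My"
```

A string with only one token would return that single token here, so inputs of that kind need separate handling if they can occur.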



Re: [R] Regular Expressions

2008-05-13 Thread Richard . Cotton
> S=c("World_is_beautiful", "one_two_three_four","My_book")

> I need to extract the last but one element of the strings. So, my 
> output should look like:
 
> Ans=c("is","three","My")

> gsub() can do this...but wondering how do I give the regular
> expression

sapply(strsplit(S, "_"), function(x) x[length(x)-1])

You could use regular expressions, but I think it would only be 
complicating things.

Regards,
Richie.

Mathematical Sciences Unit
HSL





Re: [R] Regular Expressions

2008-05-13 Thread Dimitris Rizopoulos

try this:

S <- c("World_is_beautiful", "one_two_three_four","My_book")

sapply(strsplit(S, "_"), tail, n = 2)[1, ]
# or
sapply(strsplit(S, "_"), function(x) x[length(x) - 1])


I hope it helps.

Best,
Dimitris


Dimitris Rizopoulos
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
http://www.student.kuleuven.be/~m0390867/dimitris.htm


- Original Message -
From: "Shubha Vishwanath Karanth" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, May 13, 2008 11:02 AM
Subject: [R] Regular Expressions

Hi R,

Again stuck with regular expressions...

Suppose,

S=c("World_is_beautiful", "one_two_three_four","My_book")

I need to extract the last but one element of the strings. So, my
output should look like:

Ans=c("is","three","My")

gsub() can do this...but wondering how do I give the regular
expression






[R] Regular Expressions

2008-05-13 Thread Shubha Vishwanath Karanth
Hi R,

Again stuck with regular expressions...

Suppose,

S=c("World_is_beautiful", "one_two_three_four","My_book")

I need to extract the last but one element of the strings. So, my output should
look like:

Ans=c("is","three","My")

gsub() can do this...but wondering how do I give the regular expression

Shubha Karanth | Amba Research
Ph +91 80 3980 8031 | Mob +91 94 4886 4510
Bangalore * Colombo * London * New York * San José * Singapore *
www.ambaresearch.com



Re: [R] Regular Expressions Help

2008-04-19 Thread Hans-Jörg Bibiko
On 19.04.2008, at 06:46, maud wrote:

> I am having some trouble learning regular expressions. Let me describe
> the general problem I am dealing with. Consider the following setup:
>
> Joe   <- c(1,2,3)
> Bob   <- c(2,4,6)
> Alice <- c(9,8,7)
>
> Matrix <- cbind(Joe, Bob, Alice)
> St <- c("Bob", "Alice", "Alice:Bob")
> [...]
> I have been reading over various post on regular expressions, but
> really haven't made any progress. As far as I can tell there aren't
> standard string functions in R. (Also, as an aside, is there a
> wildcard character in R? I'd want something so if x="Bob" a statement
> of the form x== "B*b" would evaluate true where * is the wildcard.)


I'm not really sure if I understood you correctly.
If you're looking for a way to match St against something like "B*b",
you can make use of ordinary regular expressions (e.g. with grep()).

For details begin with ?regexp and ?grep

Then you can try for instance:

grep("B.b", St)

to get
[1] 1 3

(the indices of the vector St)

or directly

St[grep("B.b", St)]
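
Note that in a regular expression `.` stands for exactly one character, so "B.b" is not quite the shell-style wildcard "B*b". A small sketch of two ways to get wildcard-like behaviour follows; glob2rx() is from the utils package, and the variable name pattern is illustrative only.

```r
St <- c("Bob", "Alice", "Alice:Bob")

# glob2rx() translates a shell-style wildcard into an anchored regular expression.
pattern <- glob2rx("B*b")   # "^B.*b$"

# grepl() returns TRUE/FALSE per element; grep() returns indices instead.
grepl(pattern, St)
# [1]  TRUE FALSE FALSE

# Unanchored variant: "B", then any characters, then "b", anywhere in the string.
grepl("B.*b", St)
# [1]  TRUE FALSE  TRUE
```

The anchored form is closest to the `x == "B*b"` test asked about, since it requires the whole string to fit the wildcard.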

Cheers,
--Hans



[R] Regular Expressions Help

2008-04-19 Thread maud
I am having some trouble learning regular expressions. Let me describe
the general problem I am dealing with. Consider the following setup:

Joe   <- c(1,2,3)
Bob   <- c(2,4,6)
Alice <- c(9,8,7)

Matrix <- cbind(Joe, Bob, Alice)
St <- c("Bob", "Alice", "Alice:Bob")

Now I want to make a new matrix having only the columns listed in St
that were in Matrix (which I can do). In addition, however, for
elements of the form "Alice:Bob" I want to make a new column in my new
matrix with the product of these columns. I am not sure if a colon
is a valid symbol for a column name; if not, we can replace it with
something like qqq.
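
One way the task described above might be sketched, without any regular expressions, is to split each entry of St on the colon and multiply the named columns. This is an illustration only, not an answer given in the thread; the names parts and New are hypothetical. A colon is accepted in a matrix column name, so no substitute such as qqq is needed here.

```r
Joe   <- c(1, 2, 3)
Bob   <- c(2, 4, 6)
Alice <- c(9, 8, 7)

Matrix <- cbind(Joe, Bob, Alice)
St <- c("Bob", "Alice", "Alice:Bob")

# Split each entry on the literal ":" to get the column names it refers to.
parts <- strsplit(St, ":", fixed = TRUE)

# For each entry, multiply the named columns row by row
# (a single name simply returns that column unchanged).
New <- sapply(parts, function(p) apply(Matrix[, p, drop = FALSE], 1, prod))
colnames(New) <- St
New
```

The "Alice:Bob" column then holds 18, 32 and 42, the row-wise products of the Alice and Bob columns.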

I have been reading over various posts on regular expressions, but
really haven't made any progress. As far as I can tell there aren't
standard string functions in R. (Also, as an aside, is there a
wildcard character in R? I'd want something so if x="Bob" a statement
of the form x== "B*b" would evaluate true where * is the wildcard.)

 Any help would be appreciated!

(also, thanks again to those who helped on my last question!)



Re: [R] regular expressions

2008-03-12 Thread Christos Hatzis
Try this one:

> gsub("^(plif)([a-z]*)", "\\1ONE", words)
[1] "plifONE"   "plafboum"  "ploufbang" "plifONE"  
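
A slightly more general sketch of the same idea, in case the text after the prefix is not limited to lower-case letters: `.*` matches all remaining characters, whereas a single `.` matches exactly one. The variable names prefix and res are introduced for illustration only, and the prefix is assumed to contain no regex metacharacters.

```r
words <- c("plifboum", "plafboum", "ploufbang", "plifplaf")

prefix <- "plif"  # hypothetical variable; assumed free of regex metacharacters

# Capture the prefix at the start of the string, drop everything after it,
# and append "ONE". Strings that do not start with the prefix are untouched.
res <- sub(paste0("^(", prefix, ").*$"), "\\1ONE", words)
res
# [1] "plifONE"   "plafboum"  "ploufbang" "plifONE"
```

Because the pattern is anchored with ^ and can match at most once per string, sub() is enough here and gsub() is not needed.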

-Christos

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of GOUACHE David
> Sent: Wednesday, March 12, 2008 12:15 PM
> To: [EMAIL PROTECTED]
> Subject: [R] regular expressions
> 
> Hello all,
>  
> Still fighting with regular expressions and such, I am again stuck:
>  
> Suppose I have a vector of character chains. In this vector, 
> I wish to identify which character chains start with a given 
> pattern, and then replace everything that comes after said pattern.
>  
> Here is a quick example :
>  
> words<-c("plifboum","plafboum","ploufbang","plifplaf")
>  
> I want to end up with something like this:
>  
> "plifONE","plafboum","ploufbang","plifONE"
>  
> All I can produce so far is this :
> gsub("\\bplif.","ONE",words,perl=T)
>  
> which turns out :
> "ONEboum","plafboum","ploufbang","ONEplaf"
>  
> Thanks in advance for your help.
>  
> David Gouache
> 
> Arvalis - Institut du Végétal
> 
> Station de La Minière
> 
> 78280 Guyancourt
> 
> Tel: 01.30.12.96.22 / Port: 06.86.08.94.32
> 
>  
>  
> 


[R] regular expressions

2008-03-12 Thread GOUACHE David
Hello all,
 
Still fighting with regular expressions and such, I am again stuck:
 
Suppose I have a vector of character strings. In this vector, I wish to identify 
which strings start with a given pattern, and then replace everything 
that comes after said pattern.
 
Here is a quick example :
 
words<-c("plifboum","plafboum","ploufbang","plifplaf")
 
I want to end up with something like this:
 
"plifONE","plafboum","ploufbang","plifONE"
 
All I can produce so far is this :
gsub("\\bplif.","ONE",words,perl=T)
 
which turns out :
"ONEboum","plafboum","ploufbang","ONEplaf"
 
Thanks in advance for your help.
 
David Gouache

Arvalis - Institut du Végétal

Station de La Minière

78280 Guyancourt

Tel: 01.30.12.96.22 / Port: 06.86.08.94.32

 
 
