Hi all,

I have thousands of strings like these:

"1159_1; YP_177963; PPE FAMILY PROTEIN"

"1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"

"1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"

and various others.

I would like to extract the protein code from each string (in these examples: 
YP_177963, CAA15575 and CAA17111).

I have found only one common criterion that identifies the protein codes in ALL 
my strings. The sequence of characters I need is delimited in this way:

start:

the first capital letter that is followed by a digit three characters later

end:

the last digit of the run of digits that follows, i.e. the digit just before a 
non-digit character or the end of the string.
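
Read literally, the rule above can be written as a regular expression. The following is only a minimal sketch under one reading of the rule (an assumption): a capital letter, then any two characters, then a run of digits, so that the first digit is the third character after the capital letter. The names x, pattern, m and codes are illustrative only; regexpr() and regmatches() are base R functions, and perl = TRUE is used so that [A-Z] means ASCII capital letters only.

```r
# The three example strings from this message.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# Assumed pattern: one capital letter, any two characters, one or more digits.
pattern <- "[A-Z].{2}[0-9]+"

# regexpr() locates the first (leftmost) match in each string;
# regmatches() then extracts the matched text.
m <- regexpr(pattern, x, perl = TRUE)
codes <- regmatches(x, m)
print(codes)
```

Note that regmatches() silently drops strings with no match, so the result can be shorter than x; checking m == -1 shows which strings did not match.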

 

Tricky, isn't it?

Well, I'm not an expert. I have played a lot with regular expressions and the 
sub() function without much success, and also with substring.location() in the 
Hmisc package (but there I don't know how to use regular expressions).

Maybe there are other, more useful functions, or maybe it is just a matter of 
using regular expressions in a better way...
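
For reference, a sub()-based attempt might look like the sketch below. It is not a definitive solution: it assumes the rule means "a capital letter, any two characters, then a run of digits", and the names x and codes_sub are hypothetical. The pattern captures the code in a group and replaces the whole string with that group.

```r
# The three example strings from this message.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# ".*?" lazily skips everything before the first candidate code,
# the parentheses capture the code itself, and the final ".*" consumes
# the rest of the string, so "\\1" leaves only the captured code.
codes_sub <- sub(".*?([A-Z].{2}[0-9]+).*", "\\1", x, perl = TRUE)
print(codes_sub)
```

One caveat with this form: when a string contains no match, sub() returns it unchanged rather than signalling a failure, so unmatched strings would have to be detected separately, for example with grepl().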

 

Can anybody help me?

Thanks a lot in advance...



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
