Hi all,
I have thousands of strings like these:

"1159_1; YP_177963; PPE FAMILY PROTEIN"
"1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"
"1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"

and various others. I would like to extract the protein code from each string (in these examples: YP_177963, CAA15575, CAA17111).

I have found only one criterion that identifies the protein code in ALL my strings. I need the sequence of characters selected in this way:

start: the first capital letter that is followed, after three characters, by a digit
end: the last digit that follows before a non-digit character, or before the end of the string

Tricky, isn't it? I'm not an expert. I have played a lot with regular expressions and the sub() command without much success. I also tried substring.location in the Hmisc package, but I don't know how to use regular expressions with it. Maybe other functions are better suited, or maybe it is just a matter of using regular expressions in a better way.

Can anybody help me? Thanks a lot in advance.
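To make the criterion above concrete, here is a minimal sketch in base R. It rests on one reading of the rule, which is an assumption: a capital letter, then any three characters, then one or more digits, with the leftmost such match taken as the code. The names `x`, `pattern` and `codes` are only illustrative, and the pattern has been checked against the three example strings only, not against the full data.

```r
# The three example strings from the post
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# Assumed reading of the criterion:
#   [A-Z]   a capital letter
#   .{3}    any three characters
#   [0-9]+  a run of digits, which stops at the first non-digit or at the end
pattern <- "[A-Z].{3}[0-9]+"

# regexpr() returns the position of the first (leftmost) match in each string,
# or -1 where there is no match
m <- regexpr(pattern, x, perl = TRUE)

# Keep NA for strings without a match so the result stays aligned with x
codes <- rep(NA_character_, length(x))
codes[m != -1] <- regmatches(x, m)

codes
```

Strings in which some earlier capital letter happens to be followed, four characters later, by a digit would match too early under this reading, so it is worth spot-checking the result on a sample of the real data.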