I was guessing that the ".1" was part of the protein code in the
third example, and looking at:
<http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0003840.s002>
I see quite a few protein codes of that form. I am a complete
ignoramus with regex strings, but I am guessing that the OP will need a
"." added to the termination pattern. Experimentation shows that
simply adding a period after the "9", inside the character class, works
for this example:
pat <- ".*(\\b[A-Z]..[0-9.]+).*"
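For anyone who wants to check the difference, here is a small sketch that runs both patterns on the three strings from the original post. The variable names (pat_digits, pat_version and so on) are mine, not from the thread, and the third string is joined onto one line on the assumption that the break in the post was only e-mail wrapping.

```r
# The three example strings from the original post.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# Jim's pattern: the code ends at the last digit.
pat_digits  <- ".*(\\b[A-Z]..[0-9]+).*"
# Variant with "." added to the class, so a ".1" suffix is kept.
pat_version <- ".*(\\b[A-Z]..[0-9.]+).*"

codes_digits  <- sub(pat_digits,  "\\1", x)
codes_version <- sub(pat_version, "\\1", x)

print(codes_digits)
# [1] "YP_177963" "CAA15575"  "CAA17111"
print(codes_version)
# [1] "YP_177963"  "CAA15575"   "CAA17111.1"
```

One thing to watch, as far as I can tell: with "." in the class, a code followed directly by a sentence-ending period would pick up that period too, so it may be worth checking the real data for that case.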
--
David
On Sep 16, 2009, at 10:15 AM, jim holtman wrote:
This should do it for you:
pat <- ".*(\\b[A-Z]..[0-9]+).*"
grep(pat, x)
[1] 1 3 5
sub(pat, '\\1', x)
[1] "YP_177963" "" "CAA15575" "" "CAA17111"
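A note on the elements that do not match: sub() returns its input unchanged when the pattern is not found, so a line without a code would come back whole rather than as "". The sketch below is one possible way to mark those as NA instead. The vector x is my reconstruction: the empty strings at positions 2 and 4 are inferred from the output above, and the sixth element is a made-up line with no code in it.

```r
# Reconstructed input: blanks at 2 and 4 are an assumption based on the
# output shown above; element 6 is a hypothetical line without a code.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE",
       "1200_2; HYPOTHETICAL PROTEIN")

pat <- ".*(\\b[A-Z]..[0-9]+).*"

# TRUE where the pattern is found at all.
hit <- grepl(pat, x)

# Extract the code where there is a match, NA otherwise.
codes <- ifelse(hit, sub(pat, "\\1", x), NA_character_)

print(which(hit))
# [1] 1 3 5
print(codes)
# [1] "YP_177963" NA          "CAA15575"  NA          "CAA17111"  NA
```

Using NA rather than "" or the untouched input makes the failures easy to find afterwards with is.na(codes).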
On Wed, Sep 16, 2009 at 9:53 AM, Giulio Di Giovanni
<perimessagg...@hotmail.com> wrote:
Hi all,
I have thousands of strings like these ones:
"1159_1; YP_177963; PPE FAMILY PROTEIN"
"1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"
"1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE
DEHYDROGENASE"
and various others..
I'm interested in extracting the code for the protein (in this
example: YP_177963, CAA15575, CAA17111).
I found only one common criterion to identify the protein codes in
ALL my strings:
I need a sequence of characters selected in this way:
start:
the first alphabetic capital letter followed after three characters
by a digit
end:
the last following digit before a non-digit character, or nothing.
Tricky, isn't it?
Well, I'm not an expert, and I played a lot with regular
expressions and the sub() command with no great results. I also tried
substring.location in the Hmisc package (but there I don't know how to
use regular expressions).
Maybe there are other, more useful functions, or maybe it is just a
matter of using regular expressions in a better way...
Can anybody help me?
Thanks a lot in advance...
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
David Winsemius, MD
Heritage Laboratories
West Hartford, CT