[R] Getting a list of unique gene names from a list with semi-colons

2012-01-06 Thread Kurinji Pandiyan
Hello,

I have one column in my data frame that holds the gene names of interest.
Unfortunately, because some probes lie between two genes or between two
transcripts of a gene, the column looks something like this -

FAM81A
LOC283050;LOC283050;LOC283050;ZMIZ1
PINK1;PINK1
MRPL12;MRPL12
C1orf114
MMS19;UBTD1

I would like to know how to get a list of all the names, with the
semi-colons removed and the duplicates dropped. I would like the end
result to look like -

FAM81A
LOC283050
ZMIZ1
PINK1
MRPL12
C1orf114
MMS19
UBTD1

Thanks a lot for your help!
Kurinji

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Getting a list of unique gene names from a list with semi-colons

2012-01-06 Thread R. Michael Weylandt michael.weyla...@gmail.com
I think you can do this with something like this (untested):

unique(unlist(strsplit(XXX, ",")))

Michael



Re: [R] Getting a list of unique gene names from a list with semi-colons

2012-01-06 Thread Gabor Grothendieck
This uses strapply in gsubfn:

x <- "FAM81A  LOC283050;LOC283050;LOC283050;ZMIZ1  PINK1;PINK1"
library(gsubfn)
unique(strapply(x, "\\w+", c)[[1]])

If x is very long, the development version of gsubfn has a faster
variant of strapply, called strapplyc, that is specialized for the case
where the function is c. For example, see this post on extracting
275,000 words from a novel:

https://groups.google.com/group/corpling-with-r/msg/b85f7ff917cccb5d?dmode=source&output=gplain&noredirect&pli=1
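
For readers who do not have gsubfn installed, the same word extraction can be sketched in base R with gregexpr and regmatches. This is an editorial illustration, not part of the reply above; the variable name words is an assumption chosen for the example.

```r
# Same sample string as in the strapply example above
x <- "FAM81A  LOC283050;LOC283050;LOC283050;ZMIZ1  PINK1;PINK1"

# gregexpr() locates every run of word characters (letters, digits, "_");
# regmatches() extracts the matched substrings, one list element per input string
words <- regmatches(x, gregexpr("\\w+", x))[[1]]

# unique() drops duplicates and keeps the order of first appearance
unique(words)
```

Note that "\\w+" stops at any character that is not a letter, digit or underscore, so a gene symbol containing a hyphen would be split into two pieces. If that matters for your data, a pattern such as "[^;[:space:]]+" keeps everything between the separators.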





-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Getting a list of unique gene names from a list with semi-colons

2012-01-06 Thread R. Michael Weylandt michael.weyla...@gmail.com
Sorry, the separator in my earlier reply should be a semi-colon, not a comma: unique(unlist(strsplit(XXX, ";")))

Michael Weylandt
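
As a worked illustration of the corrected one-liner, here is a minimal sketch applied to a data frame column. The data frame df and the column name gene are hypothetical stand-ins for the poster's data, which is not shown in full in the thread.

```r
# Hypothetical data frame; the column name `gene` is an assumption
df <- data.frame(
  gene = c("FAM81A",
           "LOC283050;LOC283050;LOC283050;ZMIZ1",
           "PINK1;PINK1",
           "MRPL12;MRPL12",
           "C1orf114",
           "MMS19;UBTD1"),
  stringsAsFactors = FALSE
)

# as.character() guards against the column having been read in as a factor;
# fixed = TRUE treats ";" as a literal separator rather than a regular expression
parts <- strsplit(as.character(df$gene), ";", fixed = TRUE)

# Flatten the list of pieces into one vector, then drop the duplicates
genes <- unique(unlist(parts))
genes
```

unique() keeps the first occurrence of each name, so the result follows the order in which the genes appear in the column.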
