[R] Getting a list of unique gene names from a list with semi-colons
Hello,

I have one column in my dataframe that has gene names of interest. Unfortunately, because some probes lie between two genes or two transcripts of a gene, it looks something like this:

    FAM81A
    LOC283050;LOC283050;LOC283050;ZMIZ1
    PINK1;PINK1
    MRPL12;MRPL12
    C1orf114
    MMS19;UBTD1

I would like to know how to get a list of all the names, with no semi-colons and with the replicates removed. I would like the end result to look like this:

    FAM81A
    LOC283050
    ZMIZ1
    PINK1
    MRPL12
    C1orf114
    MMS19
    UBTD1

Thanks a lot for your help!
Kurinji

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Getting a list of unique gene names from a list with semi-colons
I think you can do this with something like this (untested):

    unique(unlist(strsplit(XXX, ",")))

Michael

On Jan 6, 2012, at 8:05 PM, Kurinji Pandiyan kurinji.pandi...@gmail.com wrote:
> I would like to know how to get a list with all the names with no semi-colons and removing the replicates.
Re: [R] Getting a list of unique gene names from a list with semi-colons
On Fri, Jan 6, 2012 at 9:05 PM, Kurinji Pandiyan kurinji.pandi...@gmail.com wrote:
> I would like to know how to get a list with all the names with no semi-colons and removing the replicates.

This uses strapply in gsubfn:

    x <- "FAM81A LOC283050;LOC283050;LOC283050;ZMIZ1 PINK1;PINK1"
    library(gsubfn)
    unique(strapply(x, "\\w+", c)[[1]])

If x is very long, there is a high-speed version of strapply, specialized to using c, called strapplyc in the development version of gsubfn. For example, see this example of extracting 275,000 words from a novel:
https://groups.google.com/group/corpling-with-r/msg/b85f7ff917cccb5d?dmode=sourceoutput=gplainnoredirectpli=1

--
Statistics Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
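For readers without gsubfn installed, the same extraction can be sketched in base R with gregexpr() and regmatches(). This is only an illustration of the idea, not the strapply implementation. Note that `\\w` matches letters, digits and underscore only, so a hyphenated symbol would be cut at the hyphen; the second pattern below (an assumption about what a name may contain, namely anything except semi-colons and whitespace) avoids that. The string `y` is made-up test input, not data from the thread.

```r
x <- "FAM81A LOC283050;LOC283050;LOC283050;ZMIZ1 PINK1;PINK1"

# gregexpr() locates every run of word characters; regmatches() extracts them.
genes <- unique(regmatches(x, gregexpr("\\w+", x))[[1]])
print(genes)

# Alternative pattern: a name is any run of characters that are
# neither ";" nor whitespace, so hyphenated names stay in one piece.
y <- "HLA-DRB1;HLA-DRB1 PINK1"   # hypothetical input for illustration
genes_alt <- unique(regmatches(y, gregexpr("[^;[:space:]]+", y))[[1]])
print(genes_alt)
```

Either way the result is a plain character vector in order of first appearance, because unique() keeps the first occurrence of each value.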
Re: [R] Getting a list of unique gene names from a list with semi-colons
Sorry, that should be a semi-colon below, i.e. unique(unlist(strsplit(XXX, ";"))).

Michael Weylandt

On Jan 6, 2012, at 8:17 PM, R. Michael Weylandt michael.weyla...@gmail.com wrote:
> I think you can do this with something like this (untested):
>
>     unique(unlist(strsplit(XXX, ",")))
>
> Michael
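To make the corrected one-liner concrete, here is a minimal sketch run on the sample data from the original post. The data frame `df` and the column name `gene` are made up for illustration, since the thread never names them (XXX stands for that column).

```r
# Hypothetical data frame; `df` and `gene` are assumed names, not from the thread.
df <- data.frame(
  gene = c("FAM81A",
           "LOC283050;LOC283050;LOC283050;ZMIZ1",
           "PINK1;PINK1",
           "MRPL12;MRPL12",
           "C1orf114",
           "MMS19;UBTD1"),
  stringsAsFactors = FALSE
)

# Split each entry on ";", flatten the resulting list into one vector,
# and drop duplicates. as.character() guards against the column being a
# factor, which strsplit() would reject as a non-character argument.
genes <- unique(unlist(strsplit(as.character(df$gene), ";", fixed = TRUE)))
print(genes)
```

With fixed = TRUE the separator is treated as a literal string rather than a regular expression, which is all that is needed for a plain semi-colon.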