Try this: > x <- read.table(textConnection("snp ensembl_gene_id + rs8032583 + rs1071600 ENSG00000101605 + rs13406898 ENSG00000167165 + rs7030479 ENSG00000107249 + rs1244414 ENSG00000165629 + rs1005636 ENSG00000230681 + rs927913 ENSG00000151655;ENSG00000227546 + rs4832680 + rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817 + rs7035549 + rs12707538 ENSG00000186472"), header=TRUE, fill=TRUE) > closeAllConnections() > x.new <- do.call(rbind, apply(x, 1, function(.row){ + .ids <- unlist(strsplit(.row[2], ';')) + # check for no data in second column; substitute a blank + if (length(.ids) == 0) return(cbind(.row[1], "")) + else return(cbind(.row[1], .ids)) + })) > x.new .ids snp "rs8032583" "" snp "rs1071600" "ENSG00000101605" snp "rs13406898" "ENSG00000167165" snp "rs7030479" "ENSG00000107249" snp "rs1244414" "ENSG00000165629" snp "rs1005636" "ENSG00000230681" ensembl_gene_id1 "rs927913" "ENSG00000151655" ensembl_gene_id2 "rs927913" "ENSG00000227546" snp "rs4832680" "" ensembl_gene_id1 "rs4435168" "ENSG00000229164" ensembl_gene_id2 "rs4435168" "ENSG00000225227" ensembl_gene_id3 "rs4435168" "ENSG00000211817" snp "rs7035549" "" snp "rs12707538" "ENSG00000186472" >
On Sun, Jul 26, 2009 at 3:26 PM, Felix Müller-Sarnowski<drfl...@googlemail.com> wrote: > Dear R colleagues, > > I annotated a list of single nuclotide polymorphiosms (SNP) with the > corresponding genes using biomaRt. The result is the following > data.frame (pasted from R): > > snp ensembl_gene_id > 1 rs8032583 > 2 rs1071600 ENSG00000101605 > 3 rs13406898 ENSG00000167165 > 4 rs7030479 ENSG00000107249 > 5 rs1244414 ENSG00000165629 > 6 rs1005636 ENSG00000230681 > 7 rs927913 ENSG00000151655;ENSG00000227546 > 8 rs4832680 > 9 rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817 > 10 rs7035549 > 11 rs12707538 ENSG00000186472 > > As you can see, the SNP with the identifier rs4435168 corresponds to 3 > gene ids, rs927913 corresponds to 2 gene ids. As I'd like to perform a > join of several data.frames using the ensembl_gene_id later on, I'd > like to split columns with multiple gene identifiers into rows with > only one ensembl gene identifier each. So for the example of rs4435168 > it should look like this (faked output): > > snp ensembl_gene_id > ... > 9 rs4435168 ENSG00000229164 > 10 rs4435168 ENSG00000225227 > 11 rs4435168 ENSG00000211817 > ... > > This is just a simple example. Finally there will be a lot of other > columns, which should be replicated like the snp column. > > Does anyone know how to do this? I tried strsplit, which splits nicely > the multiple entries in column ensembl_gene_id. But how to go on? > > I'd appreciate any kind of help very much! > Best regards from Munich, > Felix > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.