Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment
Thank you, Andreas. Yes, I am trying to manipulate an alignment. This is a nice trick, although it returns an empty alignment regardless of the threshold value used (I do have some data in the alignment :-)...
Have a nice weekend,
V.

On Friday, 27 October 2017, 17:02:45 CEST, you wrote:
> Hello V.
> Because you speak of columns I assume you are handling an alignment,
> right? If you handle an alignment all sequences have the same length and
> you can do as.matrix
>
> Like this?
>
> library(magrittr)
> #maximum number of n's
> thresh <- 0.005 #0.5%
> seq <- as.matrix(seq)
> temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>%
>   table
> seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]
>
> Greetings,
> Andreas
--
Vojtěch Zeisek
https://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/

Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
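One possible contributor to the empty result reported in this message, offered as an assumption and not confirmed anywhere in the thread: in R, negating an empty index vector selects nothing, so when no column is flagged for removal, an expression of the form x[, -integer(0)] returns zero columns instead of all of them. A minimal base-R sketch on a made-up toy matrix (aln, to_drop and keep are illustrative names):

```r
# Toy character "alignment": 3 sequences x 4 sites (made-up data)
aln <- matrix(c("a", "c", "g", "t",
                "a", "n", "g", "t",
                "a", "c", "g", "n"),
              nrow = 3, byrow = TRUE)

# With a very high cutoff no column qualifies, so the index vector is empty
to_drop <- which(colMeans(aln == "n") > 0.9)

# Pitfall: a negated empty index selects zero columns, not all of them
empty_result <- aln[, -to_drop, drop = FALSE]

# Safer: a logical mask keeps every column that is not flagged
keep <- !(seq_len(ncol(aln)) %in% to_drop)
safe_result <- aln[, keep, drop = FALSE]
```

Using a logical mask (or checking length(to_drop) > 0 before negating) avoids the problem whatever the threshold is.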
Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment
Hi Vojtěch,

Here's something you could do. First, make a copy of del.colgapsonly:

    toto <- del.colgapsonly

Then, edit this copy (e.g., with fix(toto)), find this line:

    foo <- function(x) sum(x == 4)

and replace 4 by 240. Save and close. Now you can use toto() in the same way as del.colgapsonly(); for instance, to get the number of N's in each site of the woodmouse data:

    R> toto(woodmouse, freq.only = TRUE)
    [1] 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

If you wonder where the values 4 and 240 come from:

    R> as.integer(as.DNAbin("-"))
    [1] 4
    R> as.integer(as.DNAbin("N"))
    [1] 240

This gives a lot of possibilities to hack the function. For instance, if you want to find the sites with R/Y, first get the integer codes:

    R> as.integer(as.DNAbin("R"))
    [1] 192
    R> as.integer(as.DNAbin("Y"))
    [1] 48

Then, change the above line to:

    foo <- function(x) sum(x == 192 | x == 48)

HTH

Best,
Emmanuel

On 27/10/2017 at 17:02, Vojtěch Zeisek wrote:
> Hm, I tried a dirty hack: I exported the DNAbin object using
> ape::write.dna, replaced all occurrences of "n" in any sequence by "-",
> and imported the file back into R with ape::read.dna. Then I tried the
> mentioned functions. They did nothing. When I exported the file to disk,
> the FASTA file did not contain any "-", only "n". Am I doing something
> wrong, or is there a bug in ape, as it seems to confuse "n" and "-"?
> Sincerely,
> V.
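The same count can also be sketched without editing a copy of the function, by comparing the raw bytes of the DNAbin matrix against the code for N given above (240) and then applying a threshold. This is only a sketch under stated assumptions: the alignment is a DNAbin matrix (as.matrix() is applied to be safe), and n_code, thresh, n_per_site and keep are illustrative names, not part of ape. Only bytes exactly equal to 240 are counted, so other ambiguity codes are ignored.

```r
library(ape)
data(woodmouse)                 # example alignment shipped with ape

aln <- as.matrix(woodmouse)     # make sure we have the matrix (alignment) form
n_code <- as.raw(240)           # raw code of "N", per as.integer(as.DNAbin("N"))
thresh <- 0.1                   # hypothetical cutoff: max fraction of N per site

# Count N's per column by comparing the underlying raw matrix with the N code
n_per_site <- colSums(unclass(aln) == n_code)

# Logical mask of sites whose proportion of N is at or below the cutoff
keep <- (n_per_site / nrow(aln)) <= thresh

# Subset the alignment; a logical mask is safe even when nothing is dropped
filtered <- aln[, keep]
```

The result can then be written out with ape::write.dna as usual. To treat other codes as missing too, the comparison can be extended in the same spirit as the R/Y example, e.g. unclass(aln) == as.raw(240) | unclass(aln) == as.raw(4).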
Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment
Hm, I tried a dirty hack: I exported the DNAbin object using ape::write.dna, replaced all occurrences of "n" in any sequence by "-", and imported the file back into R with ape::read.dna. Then I tried the mentioned functions. They did nothing. When I exported the file to disk, the FASTA file did not contain any "-", only "n". Am I doing something wrong, or is there a bug in ape, as it seems to confuse "n" and "-"?
Sincerely,
V.

On Friday, 27 October 2017, 16:25:02 CEST, you wrote:
> Hello,
> I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
> They delete columns containing missing values, but I also need to delete
> columns containing base "N" (all columns where the amount of Ns is over a
> certain threshold).
> [...]
> Any idea how to remove columns with a given proportion of "N" in the
> sequences?
> Sincerely,
> V.
Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment
Hello V.

Because you speak of columns I assume you are handling an alignment, right? If you handle an alignment, all sequences have the same length and you can use as.matrix.

Like this?

    library(magrittr)
    # maximum number of n's
    thresh <- 0.005 # 0.5%
    seq <- as.matrix(seq)
    temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>% table
    seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]

Greetings,
Andreas

On 2017-10-27 at 16:25, Vojtěch Zeisek wrote:
> Hello,
> I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
> They delete columns containing missing values, but I also need to delete
> columns containing base "N" (all columns where the amount of Ns is over a
> certain threshold).
> [...]
> Any idea how to remove columns with a given proportion of "N" in the
> sequences?
> Sincerely,
> V.
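A column-wise variant of this idea, as a hedged sketch and not the code posted above: it works on a plain character matrix, computes the fraction of "n" per site with the number of sequences (rows) as the denominator, and selects columns with a logical mask. The function name drop_n_columns and the toy data are made up for illustration. For a DNAbin object one would first convert it, for example with as.character(as.matrix(x)), assuming ape returns one letter per cell there.

```r
# Drop alignment columns whose fraction of "n"/"N" exceeds `thresh`.
# `aln` is a character matrix: one row per sequence, one column per site.
drop_n_columns <- function(aln, thresh) {
  # Fraction of sequences carrying "n" at each site (denominator: number of rows)
  n_frac <- colMeans(aln == "n" | aln == "N")
  # Logical mask, so "nothing to drop" keeps all columns
  aln[, n_frac <= thresh, drop = FALSE]
}

# Made-up toy alignment: 4 sequences x 4 sites
toy <- matrix(c("a", "n", "g", "t",
                "a", "n", "g", "n",
                "a", "c", "g", "t",
                "a", "n", "g", "t"),
              nrow = 4, byrow = TRUE)

# Site 2 has 3/4 "n" and is dropped; site 4 has 1/4 "n" and is kept
res <- drop_n_columns(toy, thresh = 0.5)
```

The filtered character matrix can be turned back into a DNAbin object with ape::as.DNAbin() if needed.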
[R-sig-phylo] Removing columns containing "N" in DNA alignment
Hello,

I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells. They delete columns containing missing values, but I also need to delete columns containing base "N" (all columns where the amount of Ns is over a certain threshold).

Actually, ips::deleteEmptyCells has the option nset=c("-", "n", "?"), so it is supposed to remove columns/rows containing only the given characters, but if I use it and export the data (ape::write.dna or ape::write.nexus.data), some samples consist only of N characters...

The DNAbin object being processed was originally imported from VCF using vcfR (read.vcfR(file="my.vcf")) and converted: vcfR2DNAbin(x=myvcf, consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE).

I checked the source code of the above functions, but they seem to only count NAs and then drop the respective columns. And as sequences in DNAbin are stored in binary format, I'm struggling a bit here... :(

Any idea how to remove columns with a given proportion of "N" in the sequences?

Sincerely,
V.