Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment

2017-10-27 Thread Vojtěch Zeisek
Thank You,
Andreas, yes, I try to manipulate an alignment. This is nice trick, although 
it returns empty alignment regardless threshold value used (I do have some 
data in the alignment:-)...
Have a nice weekend,
V.

Dne pátek 27. října 2017 17:02:45 CEST jste napsal(a):
> Hello V.
> Because you speak of columns I assume you are handling an alignment,
> right? If you handle an alignment all sequences have the same length and
> you can do as.matrix
> 
> Like this?
> 
> library(magrittr)
> #maximum number of n's
> thresh <- 0.005  #0.5%
> seq <- as.matrix(seq)
> temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>%
> table
> seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]
> 
> Greetings,
> Andreas
> 
> Am 2017-10-27 16:25, schrieb Vojtěch Zeisek:
> > Hello,
> > I checked ape::del.colgapsonly, ips::deleteGaps and
> > ips::deleteEmptyCells.
> > They delete columns containing missing values, but I need also to
> > delete
> > columns containing base "N" (all columns with amount of Ns over certain
> > threshold).
> > Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it
> > is suppose to remove columns/rows containing only the given characters,
> > but if I
> > use it and export data (ape::write.dna or ape::write.nexus.data), some
> > samples consist only of N characters...
> > The DNAbin object being processed was originally imported from VCF
> > using vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
> > consensus=TRUE,
> > extract.haps=FALSE, unphased_as_NA=FALSE)).
> > I checked source code of the above functions, but they seem to only
> > count NAs
> > and then drop respective columns. And as sequences in DNAbin are stored
> > in binary format, I'm bit struggled here... :(
> > Any idea how to remove columns with given portion of "N" in sequences?
> > Sincerely,
> > V.
-- 
Vojtěch Zeisek
https://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/

Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/


signature.asc
Description: This is a digitally signed message part.
___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment

2017-10-27 Thread Emmanuel Paradis

Hi Vojtěch,

Here's something you could do. First, make a copy of del.colgapsonly:

toto <- del.colgapsonly

Then, edit this copy (e.g., with fix(toto)), find this line:

foo <- function(x) sum(x == 4)

and replace 4 by 240. Save and close. Now you can use toto() in the same 
way than del.colgapsonly(); for instance, to get the number of N's in 
each site of the woodmouse data:


R> toto(woodmouse, freq.only = TRUE)
  [1]  3  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1


If you wonder where the values 4 and 240 come from:

R> as.integer(as.DNAbin("-"))
[1] 4
R> as.integer(as.DNAbin("N"))
[1] 240

This gives a lot of possibilities to hack the function. For instance, if 
you want to find the sites with R/Y, first, get the integer codes:


R> as.integer(as.DNAbin("R"))
[1] 192
R> as.integer(as.DNAbin("Y"))
[1] 48

Then, change the above line for:

foo <- function(x) sum(x == 192 | x == 48)

HTH

Best,

Emmanuel

Le 27/10/2017 à 17:02, Vojtěch Zeisek a écrit :

Hm, I tried a dirty hack: I exported the DNAbin object using ape::write.dna
and replaced all occurrences of "n" in any sequence by "-" and imported the
file back to R with ape::read.dna. Then I tried the mentioned functions. They
did nothing. When I exported the file to disk, the FASTA file did not contain
any "-", only "n". DO I do something wrong, or is there a bug in ape as it
seems to confuse "n" and "-"?
Sincerely,
V.

Dne pátek 27. října 2017 16:25:02 CEST jste napsal(a):

Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
suppose to remove columns/rows containing only the given characters, but if
I use it and export data (ape::write.dna or ape::write.nexus.data), some
samples consist only of N characters...
The DNAbin object being processed was originally imported from VCF using
vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count
NAs and then drop respective columns. And as sequences in DNAbin are stored
in binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.


___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment

2017-10-27 Thread Vojtěch Zeisek
Hm, I tried a dirty hack: I exported the DNAbin object using ape::write.dna 
and replaced all occurrences of "n" in any sequence by "-" and imported the 
file back to R with ape::read.dna. Then I tried the mentioned functions. They 
did nothing. When I exported the file to disk, the FASTA file did not contain 
any "-", only "n". DO I do something wrong, or is there a bug in ape as it 
seems to confuse "n" and "-"?
Sincerely,
V.

Dne pátek 27. října 2017 16:25:02 CEST jste napsal(a):
> Hello,
> I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
> They delete columns containing missing values, but I need also to delete
> columns containing base "N" (all columns with amount of Ns over certain
> threshold).
> Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
> suppose to remove columns/rows containing only the given characters, but if
> I use it and export data (ape::write.dna or ape::write.nexus.data), some
> samples consist only of N characters...
> The DNAbin object being processed was originally imported from VCF using
> vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
> consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
> I checked source code of the above functions, but they seem to only count
> NAs and then drop respective columns. And as sequences in DNAbin are stored
> in binary format, I'm bit struggled here... :(
> Any idea how to remove columns with given portion of "N" in sequences?
> Sincerely,
> V.
-- 
Vojtěch Zeisek
https://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/

Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/


signature.asc
Description: This is a digitally signed message part.
___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment

2017-10-27 Thread Andreas Kolter

Hello V.
Because you speak of columns I assume you are handling an alignment, 
right? If you handle an alignment all sequences have the same length and 
you can do as.matrix


Like this?

library(magrittr)
#maximum number of n's
thresh <- 0.005  #0.5%
seq <- as.matrix(seq)
temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>% 
table

seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]

Greetings,
Andreas


Am 2017-10-27 16:25, schrieb Vojtěch Zeisek:

Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and 
ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to 
delete

columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it 
is
suppose to remove columns/rows containing only the given characters, 
but if I
use it and export data (ape::write.dna or ape::write.nexus.data), some 
samples

consist only of N characters...
The DNAbin object being processed was originally imported from VCF 
using vcfR
(read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf, 
consensus=TRUE,

extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only 
count NAs
and then drop respective columns. And as sequences in DNAbin are stored 
in

binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at 
http://www.mail-archive.com/r-sig-phylo@r-project.org/


___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

[R-sig-phylo] Removing columns containing "N" in DNA alignment

2017-10-27 Thread Vojtěch Zeisek
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells. 
They delete columns containing missing values, but I need also to delete 
columns containing base "N" (all columns with amount of Ns over certain 
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is 
suppose to remove columns/rows containing only the given characters, but if I 
use it and export data (ape::write.dna or ape::write.nexus.data), some samples 
consist only of N characters...
The DNAbin object being processed was originally imported from VCF using vcfR 
(read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf, consensus=TRUE, 
extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count NAs 
and then drop respective columns. And as sequences in DNAbin are stored in 
binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.

-- 
Vojtěch Zeisek
https://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/

Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/


signature.asc
Description: This is a digitally signed message part.
___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/