Hi all,
I had a quick look at the code and found a few ways to speed up the
read.nexus.data function. I have added Emmanuel to the recipients so that
he can include it in the next ape release if it works.
Generally I agree with Johan that if speed matters, FASTA files are the
way to go. Nexus files are ugly to parse and contain many
inconsistencies, like parameters inside comments [].
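To make the comment problem concrete, here is a minimal base R sketch. The attached function reads the file with scan(comment.char = "["), which drops everything from the first "[" to the end of the line, so a setting that follows a comment on the same line is lost. The helper strip_bracket_comments and the two example lines are made up for illustration and are not part of ape; the helper only handles comments that open and close on the same line.

```r
# Hypothetical helper, not part of ape: remove [ ... ] comments that open
# and close on the same line, keeping the text on either side.
strip_bracket_comments <- function(x) {
  gsub("\\[[^]]*\\]", "", x)
}

# Made-up header lines with comments in the middle of each line.
nexus_lines <- c("dimensions ntax=4 [four taxa] nchar=10;",
                 "format datatype=dna [&gap=-] missing=?;")

# Comments removed, settings on both sides kept.
cleaned <- strip_bracket_comments(nexus_lines)

# The same lines read the way the attached function reads a file:
# everything after the first "[" on a line is discarded.
scanned <- scan(text = nexus_lines, what = character(), sep = "\n",
                quiet = TRUE, comment.char = "[", strip.white = TRUE)
```

Comparing cleaned with scanned shows which settings survive each approach. Comments that span several lines would need more than this one-line pattern.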
Regards,
Klaus
On 4/9/13, Johan Nylander johan.nylan...@abc.se wrote:
Dear All,
Just to avoid confusion, the readNexus function is in the phylobase
package. And as Ben pointed out, other packages have their own functions
for reading the data part from a nexus-formatted file, see e.g., read.nex
in phyloch.
On a related note, I wrote read.nexus.data as a temporary, crude parsing
function while waiting for the phylobase project to take off (phylobase
uses NCL by Lewis and Holder - _the_ nexus parser), so expect
read.nexus.data to have its limitations.
Furthermore, if speed is the concern, it would perhaps be preferable to
first convert the Nexus data to Fasta, and then use one of the many
fast(er) parsers implemented in numerous R packages.
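As a rough illustration of the conversion step, the sketch below writes a named list of character vectors (the shape that read.nexus.data returns) to a FASTA file using base R only. The helper name write_fasta_sketch and the toy data are made up for this example; in practice ape's write.dna(..., format = "fasta") and read.dna(..., format = "fasta") could serve the same purpose.

```r
# Hypothetical helper (not part of ape): write a named list of character
# vectors, one element per taxon, as a FASTA file.
write_fasta_sketch <- function(seqs, file) {
  stopifnot(is.list(seqs), !is.null(names(seqs)))
  # Collapse each per-site vector into a single sequence string.
  body <- vapply(seqs, paste, character(1), collapse = "")
  # Interleave ">name" header lines with the sequence lines.
  lines <- as.vector(rbind(paste0(">", names(seqs)), body))
  writeLines(lines, file)
  invisible(lines)
}

# Toy data standing in for the result of read.nexus.data on a real file.
toy <- list(t1 = c("a", "c", "g", "t"), t2 = c("a", "c", "-", "t"))
fasta_file <- tempfile(fileext = ".fasta")
write_fasta_sketch(toy, fasta_file)
```

The conversion only has to be done once per data set; after that the FASTA file can be read with whichever faster parser is preferred.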
Cheers
Johan
On 04/07/2013 02:59 PM, Ben Bolker wrote:
On 13-04-05 01:29 PM, Jessica Sabo wrote:
Hi All,
I am wondering if there is any way to increase the speed of the
read.nexus.data parser, or if there is an alternative, faster
parser for Nexus data files.
Thanks, Jess
I don't know if it's faster or not, but there is ?readNexus in the
'ape' package. Also see library(sos); findFn("read {nexus format}")
___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
http://www.mail-archive.com/r-sig-phylo@r-project.org/
--
Klaus Schliep
Phylogenomics Lab at the University of Vigo, Spain
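The function below allocates each taxon's vector up front (Obj[[i]] = rep(NA, nchar)) and then fills it by index, rather than appending to it. As a hedged illustration of why that can matter for large interleaved files, the sketch below contrasts the two patterns on made-up data. The names fill_grow and fill_prealloc are hypothetical and are not part of ape.

```r
# Hypothetical helpers for illustration only; neither is part of ape.
# Grow the result one block at a time with c(): each step copies the vector.
fill_grow <- function(blocks) {
  out <- character(0)
  for (b in blocks) out <- c(out, b)
  out
}

# Preallocate the full length, then write each block into its slot.
fill_prealloc <- function(blocks, total) {
  out <- rep(NA_character_, total)
  pos <- 0
  for (b in blocks) {
    out[pos + seq_along(b)] <- b
    pos <- pos + length(b)
  }
  out
}

# Made-up data: 2000 interleaved blocks of 50 characters each.
blocks <- replicate(2000, sample(c("a", "c", "g", "t"), 50, replace = TRUE),
                    simplify = FALSE)
grown <- fill_grow(blocks)
filled <- fill_prealloc(blocks, total = 2000 * 50)
```

Both helpers return the same vector; wrapping each call in system.time() gives a feel for the difference on a given machine.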
read.nexus.data <- function (file)
{
    find.ntax <- function(x) {
        for (i in 1:NROW(x)) {
            if (any(f <- grep("\\bntax", x[i], ignore.case = TRUE))) {
                ntax <- as.numeric(sub("(.+?)(ntax\\s*\\=\\s*)(\\d+)(.+)",
                    "\\3", x[i], perl = TRUE, ignore.case = TRUE))
                break
            }
        }
        ntax
    }
    find.nchar <- function(x) {
        for (i in 1:NROW(x)) {
            if (any(f <- grep("\\bnchar", x[i], ignore.case = TRUE))) {
                nchar <- as.numeric(sub("(.+?)(nchar\\s*\\=\\s*)(\\d+)(.+)",
                    "\\3", x[i], perl = TRUE, ignore.case = TRUE))
                break
            }
        }
        nchar
    }
    find.matrix.line <- function(x) {
        for (i in 1:NROW(x)) {
            if (any(f <- grep("\\bmatrix\\b", x[i], ignore.case = TRUE))) {
                matrix.line <- as.numeric(i)
                break
            }
        }
        matrix.line
    }
    trim.whitespace <- function(x) {
        gsub("\\s+", "", x)
    }
    trim.semicolon <- function(x) {
        gsub(";", "", x)
    }
    X <- scan(file = file, what = character(), sep = "\n", quiet = TRUE,
        comment.char = "[", strip.white = TRUE)
    ntax <- find.ntax(X)
    nchar <- find.nchar(X)
    matrix.line <- find.matrix.line(X)
    start.reading <- matrix.line + 1
    Obj <- vector("list", ntax)
    for (i in 1:ntax) Obj[[i]] = rep(NA, nchar)
    i <- 1
    pos <- 0
    tot.nchar <- 0
    tot.ntax <- 0
    for (j in start.reading:NROW(X)) {
        Xj <- trim.semicolon(X[j])
        if (Xj == "") {
            break
        }
        if (any(jtmp <- grep("\\bend\\b", X[j], perl = TRUE, ignore.case = TRUE))) {
            break
        }
        ts <- unlist(strsplit(Xj, "(?<=\\S)(\\s+)(?=\\S)", perl = TRUE))
        #browser()
        if (length(ts) > 2) {
            stop("nexus parser does not handle spaces in sequences or taxon names (ts>2)")
        }
        if (length(ts) != 2) {
            stop("nexus parser failed to read the sequences (ts!=2)")
        }
        Seq <- trim.whitespace(ts[2])
        Name <- trim.whitespace(ts[1])
        nAME <- paste(c("\\b", Name, "\\b"), collapse = "")
        if (any(l <- grep(nAME, names(Obj)))) {
            tsp <- strsplit(Seq, NULL)[[1]]
            Obj[[l]][pos + c(1:length(tsp))] <- tsp
            chars.done <- length(tsp)
        }
        else {
            names(Obj)[i] <- Name
            tsp <- strsplit(Seq, NULL)[[1]]
            Obj[[i]][pos + c(1:length(tsp))] <- tsp
            chars.done <- length(tsp)
        }
        tot.ntax <- tot.ntax + 1
        if (tot.ntax == ntax) {
            i <- 1
            tot.ntax <- 0
            tot.nchar <- tot.nchar + chars.done
            if (tot.nchar == nchar * ntax) {
                print("ntot was more than nchar*ntax")
                break
            }
            pos <- tot.nchar
        }
        else {
            i <- i + 1
        }
    }
    if (tot.ntax != 0) {
        cat("ntax:", ntax, "differ from actual number of taxa in file?\n")
        stop("nexus parser did not read names correctly (tot.ntax!=0)")
    }
    for (i in 1:length(Obj)) {
        if (length(Obj[[i]]) != nchar) {
            cat(names(Obj[i]), "has", length(Obj[[i]]), "characters\n")