Hi,
Yes, that is much faster. Thank you. I will use that for now and hope for
Hervé's faster implementation of write.XStringSet() in the future.
Best,
Hans-Ulrich
Martin Morgan wrote:
On 05/05/2010 02:35 PM, Martin Morgan wrote:
On 05/05/2010 02:09 PM, Hervé Pagès wrote:
Hans-Ulrich Klein wrote:
Hi,
I have the same problem. I want to write ~4 million short (25 bp)
sequences into one FASTA file. write.XStringSet() is very slow. Also,
writeFASTA() is very slow. Only about 1500 sequences are written per
minute.
If 'dna' is a DNAStringSet with names, and for this case where reads
are < 80 characters, then maybe
fasta = paste(paste(">", names(dna), sep=""),
              as.character(dna), sep="\n", collapse="\n")
fl = tempfile()
writeLines(fasta, fl)
or probably better
fasta = character(2 * length(dna))
fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="")
fasta[c(FALSE, TRUE)] = as.character(dna)
writeLines(fasta, fl)
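A minimal, self-contained sketch of the interleaving idea above. It uses a plain named character vector in place of a DNAStringSet so that it runs without Biostrings; the function name write_fasta_fast is hypothetical, and passing as.character(dna) (which is assumed to keep the names) is how it would be fed from a DNAStringSet. Like the snippet above, it assumes each sequence fits on one line.

```r
# Hypothetical helper (not part of Biostrings): write short sequences as
# FASTA using a single vectorized writeLines() call.
# 'seqs' is a named character vector, e.g. as.character(dna) (assumption).
write_fasta_fast <- function(seqs, path) {
  stopifnot(is.character(seqs), !is.null(names(seqs)))
  fasta <- character(2 * length(seqs))
  fasta[c(TRUE, FALSE)] <- paste0(">", names(seqs))  # odd lines: headers
  fasta[c(FALSE, TRUE)] <- seqs                      # even lines: sequences
  writeLines(fasta, path)
  invisible(path)
}

# Toy usage with made-up reads
seqs <- c(read1 = "ACGTACGTAC", read2 = "TTGACCA")
fl <- tempfile(fileext = ".fasta")
write_fasta_fast(seqs, fl)
readLines(fl)
```

The logical index vectors c(TRUE, FALSE) and c(FALSE, TRUE) are recycled over the output vector, so headers and sequences alternate without any loop.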
Martin
Martin
OK, I guess it's time to bite the bullet as they say.
It has been on my TODO list for a long time to implement
write.XStringSet() in C so I will work on this and let you
know when it's ready.
Cheers,
H.
Are there any alternatives?
Best wishes,
Hans-Ulrich
> sessionInfo()
R version 2.11.0 RC (2010-04-19 r51778)
x86_64-pc-linux-gnu
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.6.2 Rsamtools_1.0.1 lattice_0.18-5
[4] Biostrings_2.16.0 GenomicRanges_1.0.1 IRanges_1.6.0
loaded via a namespace (and not attached):
[1] Biobase_2.8.0 grid_2.11.0 hwriter_1.2 tools_2.11.0
Steffen Neumann wrote:
Hi,
I have some major performance problems writing fasta files
with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one
DNAString,
and writing that to a file takes ages, as you see from the strace output
below: I obtain ~5 lines (80 chars each) per second. The runtime
of the system call (in angle brackets) is negligible.
library(Biostrings)
chromosome <- read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
Is there a fundamental flaw in my thinking?
Is there an alternative to write.XStringSet()?
This happens both on my laptop and a beefy server.
I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
and get ~11 lines per second.
Yours,
Steffen
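For a single long sequence such as a whole chromosome, one possible workaround is to cut the sequence into 80-character lines in one vectorized step and hand them all to writeLines() at once. The sketch below is base R only and has not been benchmarked against a 30 MByte sequence; the helper name wrap_sequence and the toy input are hypothetical, and with Biostrings the input string would presumably come from as.character() on the DNAString.

```r
# Hypothetical helper: split one long string into lines of at most 'width'
# characters using vectorized substring().
wrap_sequence <- function(s, width = 80L) {
  stopifnot(is.character(s), length(s) == 1L)
  n <- nchar(s)
  if (n == 0L) return(character(0))
  starts <- seq.int(1L, n, by = width)
  substring(s, starts, pmin(starts + width - 1L, n))
}

# Toy usage: a made-up 200-character sequence
chr <- paste(rep("ACGT", 50), collapse = "")
lines <- wrap_sequence(chr)
nchar(lines)  # 80 80 40

# Write header plus wrapped lines in a single call
fl <- tempfile(fileext = ".fasta")
writeLines(c(">Chr1", lines), fl)
```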
13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 <0.000137>
13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 <0.000142>
13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 <0.000133>
13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 <0.000159>
13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 <0.000133>
13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 <0.000136>
13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 <0.000594>
sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.14.12 IRanges_1.4.16
loaded via a namespace (and not attached):
[1] Biobase_2.6.0
--
Hans-Ulrich Klein
Department of Medical Informatics and Biomathematics
University of Münster
Domagkstrasse 9
48149 Münster, Germany
Tel.: +49 (0)251 83-58405
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing