It seems the version in Biostrings is slightly broken (the argument
checking is weak and it chokes on various use cases), but it
works for something like dumping a whole BSgenome, for example:

  writeFASTA(x = Scerevisiae,
             desc = paste("chr", seq_along(seqnames(Scerevisiae)), sep = ""),
             file = "bsgenome_scerevisiae.fa")

It looks like the entire write.XStringSet should be re-thought a bit.
I'll look into this hopefully today (unless someone else beats me to
it).
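
In the meantime, for the single-sequence case from the original question, here is a minimal base-R sketch of the general idea: build all the wrapped lines in one vectorized step and hand them to writeLines() once, instead of issuing one small write per 80-character line. The function name write_fasta_wrapped and its arguments are hypothetical (this is neither the patch mentioned below nor anything in Biostrings), and I have not timed it on a full chromosome.

```r
# Hypothetical helper, NOT part of Biostrings: write one sequence as FASTA
# by cutting it into fixed-width lines up front and writing them in one call.
write_fasta_wrapped <- function(seq, desc, file, width = 80L) {
  # Plain character strings pass through unchanged; a DNAString should also
  # convert here when Biostrings is loaded (assumption, not tested here).
  seq <- as.character(seq)
  n <- nchar(seq)
  if (n == 0L) {
    lines <- character(0)
  } else {
    starts <- seq.int(1L, n, by = width)   # first position of each line
    ends <- pmin(starts + width - 1L, n)   # last position, clipped to length
    lines <- substring(seq, starts, ends)  # vectorized over all lines
  }
  # One header line followed by the wrapped sequence lines.
  writeLines(c(paste(">", desc, sep = ""), lines), con = file)
  invisible(length(lines))
}

# Usage with a small made-up sequence (200 bases):
s <- paste(rep("ACGT", 50), collapse = "")
tmp <- tempfile(fileext = ".fa")
write_fasta_wrapped(s, "chr1", tmp)
```

For a DNAStringSet with several sequences you would call this once per element, appending to an open connection rather than a file name.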
Kasper
On Fri, Apr 16, 2010 at 9:55 AM, Kasper Daniel Hansen
<[email protected]> wrote:
> I don't know if there has been a refactoring of the code, but a while
> ago I sent a patch to writeFASTA making it orders of magnitude faster, so you
> should perhaps try that one. The patch makes it pretty fast to dump
> entire BSgenomes into FASTA files.
>
> Kasper
>
> On Fri, Apr 16, 2010 at 9:17 AM, Steffen Neumann <[email protected]>
> wrote:
>> Hi,
>>
>> I have some major performance problems writing FASTA files
>> with Biostrings. I have the full Arabidopsis Chr1 (30 MByte) in one DNAString,
>> and writing that to a file takes ages, as you can see from the strace output
>> below: I obtain ~5 lines (80 chars each) per second. The runtime
>> of each system call (shown in angle brackets) is negligible.
>>
>> library(Biostrings)
>> chromosome <- read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
>> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>>
>> Is there a fundamental flaw in my thinking?
>> Is there an alternative to write.XStringSet()?
>> This happens both on my laptop and a beefy server.
>>
>> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
>> and get ~11 lines per second.
>>
>> Yours,
>> Steffen
>>
>> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 <0.000137>
>> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 <0.000142>
>> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 <0.000133>
>> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 <0.000159>
>> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 <0.000133>
>> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 <0.000136>
>> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 <0.000594>
>>
>> sessionInfo()
>> R version 2.10.0 (2009-10-26)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Biostrings_2.14.12 IRanges_1.4.16
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.6.0
>>
>> --
>> IPB Halle AG Massenspektrometrie & Bioinformatik
>> Dr. Steffen Neumann http://www.IPB-Halle.DE
>> Weinberg 3 http://msbi.bic-gh.de
>> 06120 Halle Tel. +49 (0) 345 5582 - 1470
>> +49 (0) 345 5582 - 0
>> sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>