Hi, I should filter a big fastq file (HiSeq ca. >30'000'000, which does not fit in memory) for low qualities e.g. to many N in the sequence.
I plan to read the file in chunks, filter the ShortReads and write them into a new file. This is no problem for fasta files, see the example: filename <- "data/DmS2DRSC_RNA_seqs_subsample.fa" eof <- FALSE append <- FALSE cycle <- 1L while(!eof){ chunk <- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec) nFilt <- nFilter(5) writeFasta(chunk[nFilt(chunk)], sprintf("%s/filtered_%s", dirname(filename), basename(filename)), append=append) append <- TRUE cycle <- cycle + 1 if(length(chunk) == 0L) eof <- TRUE } If I try to do the same for fastq file e.g. filename <- "data/test_data.fq" eof <- FALSE mode <- "w" cycle <- 1L while(!eof){ chunk <- readFastq(filename) nFilt <- nFilter(5) writeFastq(chunk[nFilt(chunk)], sprintf("%s/filtered_%s", dirname(filename), basename(filename)), mode=mode) mode <- "a" cycle <- cycle + 1 if(length(chunk) == 0L) eof <- TRUE } I facing some problems when I read the ShortReads: - read.DNAStringSet ignores the quality sequence. - readFastq can't read in chunks - FastqSampler choose a random sample of ShortRead Of course I can do a workaround e.g writing my own fastq parser, but I prefer a clean solution or to solve the problem. So my questions, is there a clean solution I'm not aware of it? If I'm right and there is this problem, what is the plan of the community to solve this problem or is this already work in progress? Can I help to solve this problem e.g. rewriting the readFastq() function. Thanks for the answer. Greetings Anita > sessionInfo() R Under development (unstable) (2011-09-20 r57033) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.11.37 latticeExtra_0.6-18 RColorBrewer_1.0-5 Rsamtools_1.5.65 [5] lattice_0.19-33 Biostrings_2.21.9 GenomicRanges_1.5.47 IRanges_1.11.26 loaded via a namespace (and not attached): [1] Biobase_2.13.9 BiocInstaller_1.1.27 BSgenome_1.21.3 grid_2.14.0 [5] hwriter_1.3 QuasR_0.1.1 RCurl_1.6-10 rtracklayer_1.13.13 [9] tools_2.14.0 XML_3.4-3 zlibbioc_0.1.7 -- Anita Lerch Friedrich Miescher Institute Maulbeerstrasse 66 WRO-1066.P22 4058 Basel Phone: +41 (0)61 697 5172 Email: anita.le...@fmi.ch _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing