On Tue, 2011-09-27 at 05:33 -0700, Martin Morgan wrote: > On 09/27/2011 02:39 AM, Anita Lerch wrote: > > Hi, > > I should filter a big fastq file (HiSeq ca.>30'000'000, which does not > > fit in memory) for low qualities e.g. to many N in the sequence. > > > > I plan to read the file in chunks, filter the ShortReads and write them > > into a new file. > > This is no problem for fasta files, see the example: > > > > filename<- "data/DmS2DRSC_RNA_seqs_subsample.fa" > > eof<- FALSE > > append<- FALSE > > cycle<- 1L > > while(!eof){ > > chunk<- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec) > > nFilt<- nFilter(5) > > writeFasta(chunk[nFilt(chunk)], > > sprintf("%s/filtered_%s", > > dirname(filename), basename(filename)), > > append=append) > > append<- TRUE > > cycle<- cycle + 1 > > if(length(chunk) == 0L) > > eof<- TRUE > > } > > > > If I try to do the same for fastq file e.g. > > > > filename<- "data/test_data.fq" > > eof<- FALSE > > mode<- "w" > > cycle<- 1L > > while(!eof){ > > chunk<- readFastq(filename) > > nFilt<- nFilter(5) > > writeFastq(chunk[nFilt(chunk)], > > sprintf("%s/filtered_%s", > > dirname(filename), basename(filename)), > > mode=mode) > > mode<- "a" > > cycle<- cycle + 1 > > if(length(chunk) == 0L) > > eof<- TRUE > > } > > > > I facing some problems when I read the ShortReads: > > - read.DNAStringSet ignores the quality sequence. > > - readFastq can't read in chunks > > - FastqSampler choose a random sample of ShortRead > > > > Of course I can do a workaround e.g writing my own fastq parser, but I > > prefer a clean solution or to solve the problem. > > So my questions, is there a clean solution I'm not aware of it? > > If I'm right and there is this problem, what is the plan of the > > community to solve this problem or is this already work in progress? > > Can I help to solve this problem e.g. rewriting the readFastq() > > function. > > Hi Anita -- this needs to be added to ShortRead, and I can do so later > this week. But if you'd like to help, I would like to create a reference > class > > .FastqStreamer <- setRefClass("FastqStreamer", <...>) > > a constructor > > FastqStreamer <- function(con, n=1e6..., verbose=FALSE) > > and a method > > yield,FastqStreamer-method > > so that streaming through a file looks like > > f <- FastqStreamer(con, n=1e6, verbose=FALSE) > while (length(srq <- yield(f))) { > ## work > } > > I think .FastqStreamer would share some infrastructure with > .FastqSampler, and would take the same approach as in > R/methods-Sampler.R -- treat the file at the R level as a connection > that is input with readBin() so that R's connection interface can be > used, do some R-level buffering, and translate the raw() to ShortReadQ > objects in C (like src/sampler.c:sampler_rec_parser). Probably there is > some code re-factoring and a little reference class hierarchy .FastqFile > as parent of .FastqStreamer and .FastqSampler.
Great. I think that is the right way. I know the reference classes, but I don't have any working experience with them. I start with implementing and keep you informed. Anita -- Anita Lerch Friedrich Miescher Institute Maulbeerstrasse 66 WRO-1066.P22 4058 Basel Phone: +41 (0)61 697 5172 Email: anita.le...@fmi.ch _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing