On 09/27/2011 02:39 AM, Anita Lerch wrote:
Hi,
I should filter a big fastq file (HiSeq ca.>30'000'000, which does not
fit in memory) for low qualities e.g. to many N in the sequence.

I plan to read the file in chunks, filter the ShortReads and write them
into a new file.
This is no problem for fasta files, see the example:

filename<- "data/DmS2DRSC_RNA_seqs_subsample.fa"
eof<- FALSE
append<- FALSE
cycle<- 1L
while(!eof){
   chunk<- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec)
   nFilt<- nFilter(5)
   writeFasta(chunk[nFilt(chunk)],
              sprintf("%s/filtered_%s",
                      dirname(filename), basename(filename)),
              append=append)
   append<- TRUE
   cycle<- cycle + 1
   if(length(chunk) == 0L)
       eof<- TRUE
}

If I try to do the same for fastq file e.g.

filename<- "data/test_data.fq"
eof<- FALSE
mode<- "w"
cycle<- 1L
while(!eof){
   chunk<- readFastq(filename)
   nFilt<- nFilter(5)
   writeFastq(chunk[nFilt(chunk)],
              sprintf("%s/filtered_%s",
                      dirname(filename), basename(filename)),
              mode=mode)
   mode<- "a"
   cycle<- cycle + 1
   if(length(chunk) == 0L)
       eof<- TRUE
   }

I facing some problems when I read the ShortReads:
- read.DNAStringSet ignores the quality sequence.
- readFastq can't read in chunks
- FastqSampler choose a random sample of ShortRead

Of course I can do a workaround e.g writing my own fastq parser, but I
prefer a clean solution or to solve the problem.
So my questions, is there a clean solution I'm not aware of it?
If I'm right and there is this problem, what is the plan of the
community to solve this problem or is this already work in progress?
Can I help to solve this problem e.g. rewriting the readFastq()
function.

Hi Anita -- this needs to be added to ShortRead, and I can do so later this week. But if you'd like to help, I would like to create a reference class

  .FastqStreamer <- setRefClass("FastqStreamer", <...>)

a constructor

  FastqStreamer <- function(con, n=1e6..., verbose=FALSE)

and a method

  yield,FastqStreamer-method

so that streaming through a file looks like

  f <- FastqStreamer(con, n=1e6, verbose=FALSE)
  while (length(srq <- yield(f))) {
       ## work
  }

I think .FastqStreamer would share some infrastructure with .FastqSampler, and would take the same approach as in R/methods-Sampler.R -- treat the file at the R level as a connection that is input with readBin() so that R's connection interface can be used, do some R-level buffering, and translate the raw() to ShortReadQ objects in C (like src/sampler.c:sampler_rec_parser). Probably there is some code re-factoring and a little reference class hierarchy .FastqFile as parent of .FastqStreamer and .FastqSampler.

Martin


Thanks for the answer.
Greetings
Anita

sessionInfo()
R Under development (unstable) (2011-09-20 r57033)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    
LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.11.37    latticeExtra_0.6-18  RColorBrewer_1.0-5   
Rsamtools_1.5.65
[5] lattice_0.19-33      Biostrings_2.21.9    GenomicRanges_1.5.47 
IRanges_1.11.26

loaded via a namespace (and not attached):
  [1] Biobase_2.13.9       BiocInstaller_1.1.27 BSgenome_1.21.3      grid_2.14.0
  [5] hwriter_1.3          QuasR_0.1.1          RCurl_1.6-10         
rtracklayer_1.13.13
  [9] tools_2.14.0         XML_3.4-3            zlibbioc_0.1.7



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to