[Bioc-sig-seq] Read big fastq files in chunks

Anita Lerch Tue, 27 Sep 2011 02:39:43 -0700

Hi,
I should filter a big fastq file (HiSeq ca. >30'000'000, which does not
fit in memory) for low qualities e.g. to many N in the sequence.


I plan to read the file in chunks, filter the ShortReads and write them
into a new file.
This is no problem for fasta files, see the example:

filename <- "data/DmS2DRSC_RNA_seqs_subsample.fa"
eof <- FALSE
append <- FALSE
cycle <- 1L
while(!eof){
  chunk <- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec)
  nFilt <- nFilter(5)
  writeFasta(chunk[nFilt(chunk)],
             sprintf("%s/filtered_%s", 
                     dirname(filename), basename(filename)),
             append=append)
  append <- TRUE
  cycle <- cycle + 1
  if(length(chunk) == 0L)
      eof <- TRUE
}

If I try to do the same for fastq file e.g.

filename <- "data/test_data.fq"
eof <- FALSE
mode <- "w"
cycle <- 1L
while(!eof){
  chunk <- readFastq(filename)
  nFilt <- nFilter(5)
  writeFastq(chunk[nFilt(chunk)],
             sprintf("%s/filtered_%s", 
                     dirname(filename), basename(filename)),
             mode=mode)
  mode <- "a"
  cycle <- cycle + 1
  if(length(chunk) == 0L)
      eof <- TRUE
  }

I facing some problems when I read the ShortReads:
- read.DNAStringSet ignores the quality sequence. 
- readFastq can't read in chunks
- FastqSampler choose a random sample of ShortRead

Of course I can do a workaround e.g writing my own fastq parser, but I
prefer a clean solution or to solve the problem.
So my questions, is there a clean solution I'm not aware of it?
If I'm right and there is this problem, what is the plan of the
community to solve this problem or is this already work in progress?
Can I help to solve this problem e.g. rewriting the readFastq()
function.

Thanks for the answer.
Greetings
Anita

> sessionInfo()
R Under development (unstable) (2011-09-20 r57033)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8  
     
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    
LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C         
     
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C  
     

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ShortRead_1.11.37    latticeExtra_0.6-18  RColorBrewer_1.0-5   
Rsamtools_1.5.65    
[5] lattice_0.19-33      Biostrings_2.21.9    GenomicRanges_1.5.47 
IRanges_1.11.26     

loaded via a namespace (and not attached):
 [1] Biobase_2.13.9       BiocInstaller_1.1.27 BSgenome_1.21.3      grid_2.14.0 
        
 [5] hwriter_1.3          QuasR_0.1.1          RCurl_1.6-10         
rtracklayer_1.13.13 
 [9] tools_2.14.0         XML_3.4-3            zlibbioc_0.1.7    
 
-- 
Anita Lerch
Friedrich Miescher Institute
Maulbeerstrasse 66
WRO-1066.P22
4058 Basel
Phone: +41 (0)61 697 5172
Email: anita.le...@fmi.ch

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

[Bioc-sig-seq] Read big fastq files in chunks

Reply via email to