On 09/27/2011 02:39 AM, Anita Lerch wrote:
Hi,
I should filter a big fastq file (HiSeq ca.>30'000'000, which does not
fit in memory) for low qualities e.g. to many N in the sequence.
I plan to read the file in chunks, filter the ShortReads and write them
into a new file.
This is no problem for fasta files, see the example:
filename<- "data/DmS2DRSC_RNA_seqs_subsample.fa"
eof<- FALSE
append<- FALSE
cycle<- 1L
while(!eof){
chunk<- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec)
nFilt<- nFilter(5)
writeFasta(chunk[nFilt(chunk)],
sprintf("%s/filtered_%s",
dirname(filename), basename(filename)),
append=append)
append<- TRUE
cycle<- cycle + 1
if(length(chunk) == 0L)
eof<- TRUE
}
If I try to do the same for fastq file e.g.
filename<- "data/test_data.fq"
eof<- FALSE
mode<- "w"
cycle<- 1L
while(!eof){
chunk<- readFastq(filename)
nFilt<- nFilter(5)
writeFastq(chunk[nFilt(chunk)],
sprintf("%s/filtered_%s",
dirname(filename), basename(filename)),
mode=mode)
mode<- "a"
cycle<- cycle + 1
if(length(chunk) == 0L)
eof<- TRUE
}
I facing some problems when I read the ShortReads:
- read.DNAStringSet ignores the quality sequence.
- readFastq can't read in chunks
- FastqSampler choose a random sample of ShortRead
Of course I can do a workaround e.g writing my own fastq parser, but I
prefer a clean solution or to solve the problem.
So my questions, is there a clean solution I'm not aware of it?
If I'm right and there is this problem, what is the plan of the
community to solve this problem or is this already work in progress?
Can I help to solve this problem e.g. rewriting the readFastq()
function.
Hi Anita -- this needs to be added to ShortRead, and I can do so later
this week. But if you'd like to help, I would like to create a reference
class
.FastqStreamer <- setRefClass("FastqStreamer", <...>)
a constructor
FastqStreamer <- function(con, n=1e6..., verbose=FALSE)
and a method
yield,FastqStreamer-method
so that streaming through a file looks like
f <- FastqStreamer(con, n=1e6, verbose=FALSE)
while (length(srq <- yield(f))) {
## work
}
I think .FastqStreamer would share some infrastructure with
.FastqSampler, and would take the same approach as in
R/methods-Sampler.R -- treat the file at the R level as a connection
that is input with readBin() so that R's connection interface can be
used, do some R-level buffering, and translate the raw() to ShortReadQ
objects in C (like src/sampler.c:sampler_rec_parser). Probably there is
some code re-factoring and a little reference class hierarchy .FastqFile
as parent of .FastqStreamer and .FastqSampler.
Martin
Thanks for the answer.
Greetings
Anita
sessionInfo()
R Under development (unstable) (2011-09-20 r57033)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.11.37 latticeExtra_0.6-18 RColorBrewer_1.0-5
Rsamtools_1.5.65
[5] lattice_0.19-33 Biostrings_2.21.9 GenomicRanges_1.5.47
IRanges_1.11.26
loaded via a namespace (and not attached):
[1] Biobase_2.13.9 BiocInstaller_1.1.27 BSgenome_1.21.3 grid_2.14.0
[5] hwriter_1.3 QuasR_0.1.1 RCurl_1.6-10
rtracklayer_1.13.13
[9] tools_2.14.0 XML_3.4-3 zlibbioc_0.1.7
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing