Re: [Bioc-sig-seq] filtering by adapters in QA report

Harris A. Jaffee Sun, 27 Mar 2011 14:50:17 -0700

Just regarding your remarks on trimLRPatterns / vmatchPattern ...

I don't know how to approach partial adaptors, but I think non-flanking whole adaptors can be handled essentially by trimLRPatterns. That is, a front-end can alter your adaptor and mismatch limits for you, then call trimLRPatterns.


Here, 3 N's are prepended to the adaptor, and Lfixed set to "subject":

> trimNonFlankingPatterns(Lpattern="TTT", subject=DNAStringSet(c("ATTTCG","AATTTC")))

  A DNAStringSet instance of length 2
    width seq
[1]     2 CG
[2]     1 C

Here, the max.Lmismatch vector must be enlarged; the simplest way is just to replicate the last element (which is done here, 3 times in this case):

> trimNonFlankingPatterns(Lpattern="TTT", subject=DNAStringSet(c("ATATCG","AATTCG")), max.Lmismatch=1)

  A DNAStringSet instance of length 2
    width seq
[1]     2 CG
[2]     1 G
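
The source of trimNonFlankingPatterns is not included in this message. The following is only a minimal sketch of such a front-end, assuming it works as described above: N's are prepended to the adaptor, Lfixed is set to "subject", and the mismatch limit is repeated once per prepended N. The name trimNonFlankingSketch and the max.offset argument are hypothetical, not part of Biostrings.

```r
library(Biostrings)

# Hypothetical front-end to trimLRPatterns(); not the function used above.
# max.offset (assumed name) is the number of N's prepended to the adaptor,
# i.e. how many arbitrary bases may precede the adaptor in a read.
trimNonFlankingSketch <- function(Lpattern, subject, max.Lmismatch = 0L,
                                  max.offset = 3L) {
    nLp <- nchar(Lpattern)
    padded <- paste0(strrep("N", max.offset), Lpattern)
    # One limit per suffix length of the padded pattern. -1 forbids trimming
    # when only part of the adaptor would be aligned; the caller's limit is
    # repeated for every length that contains the whole adaptor.
    limits <- c(rep(-1L, nLp - 1L),
                rep(as.integer(max.Lmismatch), max.offset + 1L))
    # Lfixed = "subject": the N's in the pattern match any base in the reads.
    trimLRPatterns(Lpattern = padded, subject = subject,
                   max.Lmismatch = limits, Lfixed = "subject")
}

exact <- trimNonFlankingSketch("TTT", DNAStringSet(c("ATTTCG", "AATTTC")))
fuzzy <- trimNonFlankingSketch("TTT", DNAStringSet(c("ATATCG", "AATTCG")),
                               max.Lmismatch = 1)
```

Building the full-length limit vector explicitly avoids relying on how trimLRPatterns pads a short max.Lmismatch.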

On Mar 25, 2011, at 11:59 PM, Marcus Davy wrote:

Hi Robert,
just to add to the discussion, it was not initially obvious to me that the
ShortRead QA report can read either from disk or from a ShortReadQ object
within R. This at least provides the flexibility to filter a ShortReadQ
object using trimLRPatterns/vmatchPattern/narrow etc. and then run a QA
report on the filtered object to get more meaningful plots.
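
A rough sketch of that workflow, using a toy in-memory ShortReadQ object. The reads, the qualities and the adaptor "ACGT" are made up for illustration, and the lane label is an arbitrary name; real data would come from readFastq().

```r
library(ShortRead)

# Toy reads (made up): the first two start with the hypothetical adaptor.
fq <- ShortReadQ(
    sread   = DNAStringSet(c("ACGTTTGACCA", "ACGTGGATCCA", "CCGACCAGGAT")),
    quality = FastqQuality(BStringSet(rep(strrep("I", 11), 3))),
    id      = BStringSet(paste0("read", 1:3)))

# Ask for ranges instead of trimmed sequences, so the same coordinates can
# be applied to the bases and the qualities together.
rng <- trimLRPatterns(Lpattern = "ACGT", subject = sread(fq), ranges = TRUE)
trimmed <- narrow(fq, start = start(rng), end = end(rng))

# qa() on an in-memory object needs a label for the lane.
qaTrimmed <- qa(trimmed, lane = "adapter-trimmed")
# report(qaTrimmed, dest = tempfile())   # would write the HTML report
```

Running qa() on both fq and trimmed gives a before/after comparison of the per-cycle plots.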

I agree it would be a nice feature to be able to specify some adapter
sequences to filter in a qa() call itself, or potentially
select parts of the report of interest.
There will be cases that will test this proposed functionality, especially
around partial adapter sequence and the number of mismatches to allow for.
I recently came across a synthetic construct (~20 bases) in an Illumina
experiment which was the first half of an adapter with the addition of a
single random DNA base at the 5' start, so the partial adapter effectively
started at cycle position 2 of the subjects. Using Biostrings
trimLRPatterns may not identify this pattern and dynamically trim or filter
(utilizing ranges coordinates) unless the random base is added to the start
of the pattern and at least one mismatch is allowed, whereas using a
vmatchPattern approach to filter would work.
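
As an illustration of the vmatchPattern route, here is a small sketch with made-up data. The adapter fragment and both reads are hypothetical; the first read mimics the construct described above (one extra base, then the adapter starting at position 2).

```r
library(Biostrings)

# Hypothetical adapter fragment and toy reads.
adapterHalf <- "GATCGGAAGAGC"
reads <- DNAStringSet(c(contaminated = "AGATCGGAAGAGCTTTT",
                        clean        = "ACGTACGTACGTACGTA"))

# vmatchPattern searches the whole read, so the adapter is found even
# though it does not start at the first cycle.
hits <- vmatchPattern(adapterHalf, reads)
adapterStarts <- startIndex(hits)   # list of match start positions per read

# Filter: keep only reads with no occurrence of the adapter fragment.
nHits <- vcountPattern(adapterHalf, reads)
filtered <- reads[nHits == 0]
```

The start positions could instead be passed to narrow() to trim rather than drop the affected reads.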

Marcus
On Sat, Mar 26, 2011 at 5:41 AM, Robert Gentleman <rgent...@gmail.com> wrote:
On Fri, Mar 25, 2011 at 8:59 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
On 03/24/2011 10:56 AM, Michael Lawrence wrote:
Hi Martin,
It would be nice if the ShortRead QA report could somehow filter out the
adapter contamination before generating the rest of its plots, since those
plots are pretty meaningless if there are adapters present.

Not sure how to handle this filtering in general. That is, what if someone
then wants to see plots with only the "high quality" reads after the quality
plots? It gets complicated. ShortRead has a nice filtering mechanism, but
this is more complicated, since some QA plots come from one filter, while
others come from a different stage.

However, under the assumption that no one would ever want to align an
adapter, i.e., those reads will not be carried forward, the adapter removal
could just be treated specially and hard-coded. And then just expect more
customized solutions to leverage the internal ShortRead functions for
generating each slot in the QA object, building it up incrementally, on
different subsets. Of course, to make sense, that would require a different
report template, too.

Hi Michael -- Yes it would be nice to be able to more flexibly control how
different components of the report are generated, or at least to make some
smarter choices along the lines you suggest for adapter contaminants. It's
hard to know how to make this really general, but I have come across other
situations where I'd like to cherry-pick which parts of the QA process I
want to perform. I think I need some standardization on function signatures
for generating each report section, tighter description of results from each
section (i.e., a formal class hierarchy), and then a flexible report
composition. It seems like quite a big task; I wonder if there are good
models out there to follow? arrayQualityMetrics?

I think arrayQualityMetrics is a good starting place. Audrey and Wolfgang
have done a good job of modularizing the components. But there are still
hiccups - which suggests just how hard that is. And as you suggested, it was
a big job.

I think the case Michael is bringing up might be useful to deal with,
without a major rewrite. There should be some sort of file that ShortRead
has access to (or an input parameter) that gives some more details on the
samples and on the processing (e.g. what the sample labels should be, and
what the adapters etc. are). Then this information could be used in the
current paradigm.

Mostly the issue is that if you have adapter contamination then the
subsequent plots (e.g. nucleotide by cycle) are not useful. You cannot see
anything in them and then you have to go back and strip adapters by hand,
then rerun ShortRead. I agree that you may want more general filtering, as
an abundance of any read will affect the plots, but I think there is
agreement that one would never want to include the adapters (you do want
counts as are produced now, but given their effect on the graphics,
filtering would be beneficial).

 best wishes
   Robert
Martin
Michael

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

--
Robert Gentleman
rgent...@gmail.com
