Just regarding your remarks on trimLRPatterns / vmatchPattern ...

I don't know how to approach partial adaptors, but I think non-flanking whole adaptors can be handled essentially by trimLRPatterns. That is, a front-end can alter your adaptor and mismatch limits for you, then call trimLRPatterns.

Here, 3 N's are prepended to the adaptor and Lfixed is set to "subject", so that each N in the pattern matches any base in the subject:

> trimNonFlankingPatterns(Lpattern="TTT", subject=DNAStringSet(c("ATTTCG","AATTTC")))
  A DNAStringSet instance of length 2
    width seq
[1]     2 CG
[2]     1 C

Here, the max.Lmismatch vector must be enlarged to cover the longer, N-padded pattern; the simplest way is to replicate its last element (done here 3 times, once per prepended N):

> trimNonFlankingPatterns(Lpattern="TTT", subject=DNAStringSet(c("ATATCG","AATTCG")), max.Lmismatch=1)
  A DNAStringSet instance of length 2
    width seq
[1]     2 CG
[2]     1 G
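
The body of the front-end is not shown above, so the following is only a minimal sketch of how such a wrapper might be written, not the actual code. The name trimNonFlankingSketch is hypothetical. It assumes integer mismatch limits (not an error rate), assumes the number of N's to prepend is the widest subject minus the adaptor length, and assumes that a limit of -1 tells trimLRPatterns not to trim at that length.

```r
library(Biostrings)

## Hypothetical sketch of a front-end to trimLRPatterns for whole adaptors
## that do not start at the first base of the read.
trimNonFlankingSketch <- function(Lpattern, subject, max.Lmismatch = 0) {
    nLp <- nchar(Lpattern)
    ## Assumption: pad with enough N's to reach the widest subject.
    npad <- max(0L, max(width(subject)) - nLp)
    padded <- paste0(strrep("N", npad), Lpattern)

    ## Build one limit per letter of the original adaptor; lengths shorter
    ## than the supplied limits get -1 (assumed to mean "never trim here").
    mm <- as.integer(max.Lmismatch)
    if (length(mm) < nLp)
        mm <- c(rep(-1L, nLp - length(mm)), mm)
    ## Enlarge the vector by replicating its last element once per N.
    mm <- c(mm, rep(mm[length(mm)], npad))

    ## Lfixed = "subject" lets the N's in the pattern match any base.
    trimLRPatterns(Lpattern = padded, subject = subject,
                   max.Lmismatch = mm, Lfixed = "subject")
}
```

With Lpattern="TTT" and subjects of width 6, this builds the pattern "NNNTTT" and, for max.Lmismatch=1, the limit vector c(-1, -1, 1, 1, 1, 1), which is the adjustment described above.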

On Mar 25, 2011, at 11:59 PM, Marcus Davy wrote:

Hi Robert,
just to add to the discussion, it was not initially obvious to me that the ShortRead QA report can read either from disk or from a ShortReadQ object within R. This at least provides the flexibility to filter a ShortReadQ object using trimLRPatterns/vmatchPattern/narrow etc. and then run a QA report on the filtered object to get more meaningful plots.

I agree it would be a nice feature to be able to specify some adapter sequences to filter in a qa() call itself, or potentially select parts of the report of interest. There will be cases that will test this proposed functionality, especially around partial adapter sequence and the number of mismatches to allow for. I recently came across a synthetic construct (~20 bases) in an Illumina experiment which was the first half of an adapter with the addition of a single random DNA base at the 5' start, so the partial adapter effectively started at cycle position 2 of the subjects. Using Biostrings trimLRPatterns may not identify this pattern and dynamically trim or filter (utilizing ranges coordinates) unless the random base is added to the start of the pattern and at least one mismatch is allowed, whereas using a vmatchPattern approach to filter would work.

Marcus


On Sat, Mar 26, 2011 at 5:41 AM, Robert Gentleman <rgent...@gmail.com> wrote:

On Fri, Mar 25, 2011 at 8:59 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
On 03/24/2011 10:56 AM, Michael Lawrence wrote:

Hi Martin,

It would be nice if the ShortRead QA report could somehow filter out the adapter contamination before generating the rest of its plots, since those plots are pretty meaningless if there are adapters present.

Not sure how to handle this filtering in general. That is, what if someone then wants to see plots with only the "high quality" reads after the quality plots. It gets complicated. ShortRead has a nice filtering mechanism, but this is more complicated, since some QA plots come from one filter, while others come from a different stage.

However, under the assumption that no one would ever want to align an adapter, i.e., those reads will not be carried forward, the adapter removal could just be treated as a special, hard-coded case. More customized solutions could then be expected to leverage the internal ShortRead functions for generating each slot in the QA object, building it up incrementally on different subsets. Of course, to make sense, that would require a different report template, too.

Hi Michael -- Yes it would be nice to be able to more flexibly control how different components of the report are generated, or at least to make some smarter choices along the lines you suggest for adapter contaminants. It's hard to know how to make this really general, but I have come across other situations where I'd like to cherry-pick which parts of the QA process I want to perform. I think I need some standardization on function signatures for generating each report section, tighter description of results from each section (i.e., a formal class hierarchy), and then a flexible report composition. It seems like quite a big task; I wonder if there are good models out there to follow? arrayQualityMetrics?

I think arrayQualityMetrics is a good starting place. Audrey and Wolfgang have done a good job of modularizing the components. But there are still hiccups, which suggests just how hard that is. And as you suggested, it was a big job.

I think the case Michael is bringing up might be useful to deal with, without a major rewrite. There should be some sort of file that ShortRead has access to (or an input parameter) that gives some more details on the samples and on the processing (e.g., what the sample labels should be, and what the adapters etc. are). Then this information could be used in the current paradigm.

Mostly the issue is that if you have adapter contamination then the subsequent plots (e.g., nucleotide by cycle) are not useful. You cannot see anything in them, and then you have to go back and strip adapters by hand, then rerun ShortRead. I agree that you may want more general filtering, as an abundance of any read will affect the plots, but I think there is agreement that one would never want to include the adapters (you do want counts as are produced now, but given their effect on the graphics, filtering would be beneficial).

 best wishes
   Robert

Martin


Michael

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

--
Robert Gentleman
rgent...@gmail.com
