For large inputs, the MarkDuplicates implementation in some cases writes to
disk.  This is to support marking duplicates for read pairs whose ends map
to different chromosomes.  In that case we try to store the minimal amount
of information on disk, which is something like x/y/tile and a read group
identifier (an integer on disk).  The latter has barcode, lane, and flowcell
encoded into the read group identifier (string version).  The SAM spec does
allow the barcode/lane/flowcell identifier as the platform unit in each read
group.  Note that we both mark duplicates and detect optical duplicates (a
subset) in the same tool.
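
To make the role of the read group concrete, here is a minimal Python sketch
of the kind of comparison involved.  It is not the Picard code: the
five-field read name layout (machine:lane:tile:x:y), the names
parse_location and is_optical_duplicate, and the threshold of 100 pixels are
all assumptions made for illustration.  It shows that when the lane is not
taken from the read name, two reads from different lanes are only kept apart
if they carry different read groups.

```python
from typing import NamedTuple


class PhysicalLocation(NamedTuple):
    """Minimal per-read record: read group index plus tile/x/y."""
    read_group: int  # integer index of the read group (assumed encoding)
    tile: int
    x: int
    y: int


def parse_location(read_name: str, read_group: int) -> PhysicalLocation:
    """Parse a hypothetical 'machine:lane:tile:x:y' read name.

    The lane field is deliberately ignored, mirroring the behaviour
    described in this thread; lane must come from the read group.
    """
    fields = read_name.split(":")
    if len(fields) != 5:
        raise ValueError(f"expected 5 colon-separated fields: {read_name!r}")
    tile, x, y = (int(f) for f in fields[2:5])
    return PhysicalLocation(read_group, tile, x, y)


def is_optical_duplicate(a: PhysicalLocation, b: PhysicalLocation,
                         max_pixel_distance: int = 100) -> bool:
    """True if both reads share read group and tile and are close in x and y."""
    return (a.read_group == b.read_group
            and a.tile == b.tile
            and abs(a.x - b.x) <= max_pixel_distance
            and abs(a.y - b.y) <= max_pixel_distance)


# Same tile and nearby coordinates, but the read names say lanes 1 and 2.
a = parse_location("M1:1:1101:1000:2000", read_group=0)
b = parse_location("M1:2:1101:1050:2040", read_group=0)
c = parse_location("M1:2:1101:1050:2040", read_group=1)

print(is_optical_duplicate(a, b))  # True: lane is ignored, read group matches
print(is_optical_duplicate(a, c))  # False: different read groups
```

With a single read group covering several lanes, the first comparison is the
kind of case that would be counted as optical even though the clusters sit on
different lanes.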

I believe it is good practice to specify the platform unit in each read
group, rather than encoding such common information in the read name, but
it is definitely a trade-off we have made in the implementation of the tool.
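
As a rough illustration of what specifying the platform unit looks like, the
Python sketch below assembles an @RG header line.  The tag names (ID, SM, LB,
PL, PU) are from the SAM spec; the helper name read_group_header, the example
values, and the flowcell.lane.barcode layout of the PU field are assumptions
for illustration, not something MarkDuplicates requires.

```python
from typing import Optional


def read_group_header(rg_id: str, sample: str, library: str,
                      flowcell: str, lane: int,
                      barcode: Optional[str] = None,
                      platform: str = "ILLUMINA") -> str:
    """Build a tab-separated @RG line with the platform unit in PU.

    The PU layout 'flowcell.lane[.barcode]' is an assumed convention.
    """
    platform_unit = f"{flowcell}.{lane}"
    if barcode:
        platform_unit += f".{barcode}"
    fields = [
        "@RG",
        f"ID:{rg_id}",
        f"SM:{sample}",
        f"LB:{library}",
        f"PL:{platform}",
        f"PU:{platform_unit}",
    ]
    return "\t".join(fields)


# One read group per flowcell/lane/barcode combination (made-up values).
line = read_group_header("rg1", "sampleA", "libA",
                         flowcell="H0164ALXX", lane=1, barcode="ACGTAC")
print(line)
```

Giving each flowcell/lane/barcode combination its own read group is what lets
a tool tell lanes apart without parsing the read name.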

N

On Mon, Oct 20, 2014 at 2:02 PM, Heng Li <[email protected]> wrote:

>
> On Oct 20, 2014, at 13:59, Heng Li <[email protected]> wrote:
>
> > I have written an optical duplicate remover,
>
> Sorry, I mean “I haven’t written an optical duplicate remover”. BTW, it
> would be good to make the example available to the list if it is not too
> large.
>
> Thank you,
>
> Heng
>
> > but I would like to know your exact rules to identify two reads being
> duplicates. As I briefly skimmed OpticalDuplicateFinder.java, it relies on
> a parameter “this.opticalDuplicatePixelDistance”, which is expected. Are
> you using the same threshold?
> >
> >> we do not extract the lane # from the read name, only tile,
> x-coordinate, and y-coordinate.
> >
> > Nils, why not use lane number?
> >
> > Heng
> >
> > On Oct 20, 2014, at 13:49, Salzberg, Anna <[email protected]> wrote:
> >
> >> Dear Nils,
> >>
> >> I counted BY HAND the number of duplicates that have the same tile in
> the A.debug.L1.sam file I had already sent you (note that there’s only a
> single lane).  The number is 12 (which matches my script).  However, picard
> MarkDuplicates is reporting 25 READ_PAIR_OPTICAL_DUPLICATES, that is 50.
> >>
> >> I really don’t want to be a pest, however we find that the optical
> duplicates functionality is AWESOME, and we’d be extremely happy for it to
> work.
> >>
> >> Thank you again for your help.
> >> Anna
> >>
> >>
> >> From: Nils Homer [mailto:[email protected]]
> >> Sent: Thursday, October 16, 2014 8:41 PM
> >> To: Salzberg, Anna
> >> Cc: [email protected]
> >> Subject: Re: [Samtools-help] Reporting Bug - Optical Duplicates of
> Picard MarkDuplicates
> >>
> >> Thanks, Anna, for the example set.  I have observed a few things
> regarding this issue.
> >>
> >> The first is that we do not extract the lane # from the read name, only
> tile, x-coordinate, and y-coordinate.  You can see this in the code here if
> you are interested:
> https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/markduplicates/util/OpticalDuplicateFinder.java#L84-L104
> >>
> >> Secondly, we also do not retrieve either the barcode information or
> the library identifier from the read name, since neither is embedded
> there.  Both barcode and library identifier are also important to
> condition upon when searching for optical duplicates, or duplicates in
> general.
> >>
> >> This brings us to the question: where *do* we expect to retrieve this
> information?  We use the read group header lines to capture lane, barcode,
> library, flowcell (for Illumina) and other information for specific sets or
> groups of reads.  If this information is given, which I recommend as a best
> practice, MarkDuplicates will behave as you expect.  I believe it is much
> more robust to annotate these metadata in the header than to rely wholly
> on parsing read names, since read name structures do change, albeit
> infrequently.
> >>
> >> I would recommend adding read groups to your SAM header within your
> pipeline.  We use FastqToSam or IlluminaBasecallsToSam to set the read
> group appropriately depending on our inputs.  In Picard, we also have tools
> like AddOrReplaceReadGroups that can help you add read groups prior to
> marking duplicates.
> >>
> >> Nils
> >>
> ------------------------------------------------------------------------------
> >> Comprehensive Server Monitoring with Site24x7.
> >> Monitor 10 servers for $9/Month.
> >> Get alerted through email, SMS, voice calls or mobile push
> notifications.
> >> Take corrective actions from your mobile device.
> >> http://p.sf.net/sfu/Zoho
> >> _______________________________________________
> >> Samtools-help mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/samtools-help
