On Aug 21, 2014, at 17:02 , Thomas Sibley <[email protected]> wrote:
> _Something_ is bogging down in findOpticalDuplicates and going off the
> deep end.  I'd guess the Collections.sort call since I saw mention of
> the sort implementation guts in the crash logs, but again, I'm no
> expert on the JVM.

\o/

I've since figured out what makes the data so pathological: the read
name parsing for tile/x/y was incorrect, so the groups it was looping
over were artificially huge.  For my two input files, the combination
of 400k vs. 900k reads, a 60% vs. 97% duplicate rate, and the
particular "clustering" of the misparsed "tile/x/y" values (really
flowcell #/flowcell lane/tile) was apparently enough to push it over
the edge.
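
To illustrate the kind of misparse described above, here is a minimal
sketch.  It is not the code from the patch.  It assumes seven-field,
colon-separated Illumina read names of the form
instrument:run:flowcell:lane:tile:x:y; the class name, the method names
and the sample read name are all hypothetical.  Reading fields at fixed
positions 2-4 (which would suit an older five-field name) yields
flowcell/lane/tile, while counting back from the end yields tile/x/y.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from Picard or the patch.
class ReadNameLocation {

    /** Returns {tile, x, y} from the last three colon-separated fields. */
    static int[] parseFromEnd(String readName) {
        String[] f = readName.split(":");
        if (f.length < 3) {
            throw new IllegalArgumentException("Too few fields: " + readName);
        }
        int n = f.length;
        return new int[] {
            Integer.parseInt(f[n - 3]),  // tile
            Integer.parseInt(f[n - 2]),  // x
            Integer.parseInt(f[n - 1])   // y
        };
    }

    /**
     * Returns the fields at fixed 0-based positions 2, 3 and 4.  These
     * are tile/x/y only for five-field names; for seven-field names
     * they are flowcell/lane/tile (assumed layout, see above).
     */
    static String[] fieldsAtFixedOffset(String readName) {
        String[] f = readName.split(":");
        return Arrays.copyOfRange(f, 2, 5);
    }

    public static void main(String[] args) {
        // Made-up read name in the assumed seven-field layout.
        String name = "M00123:45:000000000-A1B2C:1:1101:15589:1332";
        // Prints [1101, 15589, 1332]
        System.out.println(Arrays.toString(parseFromEnd(name)));
        // Prints [000000000-A1B2C, 1, 1101]
        System.out.println(Arrays.toString(fieldsAtFixedOffset(name)));
    }
}
```

With the fixed-offset variant, every read from the same flowcell, lane
and tile gets the same "location", which is one way the groups being
compared could end up artificially huge.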

I wrote a patch (with description) which makes the problem vanish and
leaves optical duplicate detection intact:
https://github.com/MullinsLab/picard/commit/3fcb0fc

Two other small patches I wrote before finding the bug are included in
https://github.com/broadinstitute/picard/pull/45 to a) provide an option
to disable optical duplicate detection (Louis Letourneau said he's had
the need to do so in the past) and b) slightly optimize the inner
findOpticalDuplicates() loop.

Thomas

_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help
