On Aug 21, 2014, at 17:02 , Thomas Sibley <[email protected]> wrote:
> _Something_ is bogging down in findOpticalDuplicates and going off the
> deep end. I'd guess the Collection.sort call since I saw mention of
> the sort implementation guts in the crash logs, but again, I'm no
> expert on the JVM.

\o/ I've since figured out what makes the data so pathological: the read name parsing for tile/x/y was incorrect, so the groups it was looping over were artificially huge.

For my two input files, apparently, the combined difference of 400k vs. 900k reads, a 60% vs. 97% duplicate rate, and the particular "clustering" of "tile/x/y" (really flowcell #/flowcell lane/tile) was enough to push it over the edge.

I wrote a patch (with description) which makes the problem vanish and leaves optical duplicate detection intact:

https://github.com/MullinsLab/picard/commit/3fcb0fc

Two other small patches I wrote before finding the bug are included in

https://github.com/broadinstitute/picard/pull/45

to a) provide an option to disable optical duplicate detection (Louis Letourneau said he's had the need to do so in the past) and b) slightly optimize the inner findOpticalDuplicates() loop.

Thomas
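
To illustrate the kind of misparse described above, here is a minimal sketch. It is not the code from the linked commit. It assumes colon-separated, Illumina-style read names in which tile, x and y are the last three fields, and it uses a made-up seven-field read name. The class and method names (ReadNameLocation, fieldsAtFixedOffset, parseTrailing) are hypothetical. Reading fields at a fixed offset from the front of such a name picks up flowcell, lane and tile, whereas reading from the end picks up tile, x and y.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Picard's implementation.
class ReadNameLocation {

    // Takes the 3rd, 4th and 5th ':'-separated fields, as a parser written
    // for five-field names might. Returns null if there are too few fields.
    static String[] fieldsAtFixedOffset(String readName) {
        String[] f = readName.split(":");
        if (f.length < 5) {
            return null;
        }
        return Arrays.copyOfRange(f, 2, 5);
    }

    // Takes the last three ':'-separated fields as {tile, x, y}.
    // Returns null if there are too few fields or they are not integers.
    // Assumption: the name carries no trailing suffix such as "#0/1".
    static int[] parseTrailing(String readName) {
        String[] f = readName.split(":");
        int n = f.length;
        if (n < 3) {
            return null;
        }
        try {
            return new int[] {
                Integer.parseInt(f[n - 3]),
                Integer.parseInt(f[n - 2]),
                Integer.parseInt(f[n - 1])
            };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Made-up name: instrument:run:flowcell:lane:tile:x:y
        String name = "INSTR:42:FLOWCELLID:1:1101:15589:1332";
        System.out.println(Arrays.toString(fieldsAtFixedOffset(name)));
        System.out.println(Arrays.toString(parseTrailing(name)));
    }
}
```

With the fixed-offset reading, every read from the same flowcell and lane shares its first two "coordinates", so reads that should fall into many small per-tile groups collapse into a few very large ones, which is consistent with the behaviour described above.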
