Re: [Samtools-help] Picard MarkDuplicates READ_NAME_REGEX Exceptions

Alec Wysoker Wed, 16 Jul 2014 09:09:33 -0700

Hi Daniel,

The purpose of READ_NAME_REGEX is to specify a regular expression that will capture from the read name the physical location of the cluster in the lane. This is used to distinguish optical duplicates (cases in which the cluster-finding software has incorrectly identified a single cluster as two clusters) from PCR duplicates. If two reads are considered duplicates of one another, and their physical locations are close to one another (as defined by OPTICAL_DUPLICATE_PIXEL_DISTANCE), then they are considered optical duplicates and are excluded from the computation of estimated library size.
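To make the distance rule concrete, here is a minimal Python sketch. It is not Picard's code. It assumes each read's location is a (tile, x, y) triple and that "close" means the same tile with both the x and y offsets within the pixel distance; the thread does not spell out the exact metric. The names Location and looks_optical and the sample coordinates are made up for illustration. The default of 100 is the OPTICAL_DUPLICATE_PIXEL_DISTANCE value that appears in the log quoted below.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Physical position of a cluster (illustrative type, not a Picard class)."""
    tile: int
    x: int
    y: int


def looks_optical(a: Location, b: Location, pixel_distance: int = 100) -> bool:
    """Return True if two reads already known to be duplicates sit close together.

    Assumption: "close" means same tile and both axis offsets within
    pixel_distance. The real tool may measure closeness differently.
    """
    return (
        a.tile == b.tile
        and abs(a.x - b.x) <= pixel_distance
        and abs(a.y - b.y) <= pixel_distance
    )


if __name__ == "__main__":
    first = Location(tile=12, x=1045, y=2210)
    near = Location(tile=12, x=1050, y=2200)
    far = Location(tile=12, x=9000, y=2200)
    print(looks_optical(first, near))  # True
    print(looks_optical(first, far))   # False
```

Without a location for each read there is nothing to feed into a check like this, which is why read names that carry no coordinates cannot contribute to optical duplicate detection.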

The default value for READ_NAME_REGEX is a regular expression that captures physical location from the conventional Illumina read name format. If you read the usage for this option, you'll see that the regex must contain 3 capture groups, which yours does not, thus the exception. It appears that your read names do not include physical location, so you can either leave the default value and ignore the warning message, or you can pass READ_NAME_REGEX=null, which will suppress this functionality entirely. That seems the most appropriate thing to do since your reads don't contain physical location info.
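As a rough illustration of the three-capture-group requirement, the following Python sketch applies the default pattern (copied from the warning message quoted below) to two read names. The Illumina-style name "HWUSI:3:12:1045:2210" is invented for the example, and parse_location is a hypothetical helper, not part of Picard. I understand the usage text to describe the three groups as tile, x and y, in that order; treat that ordering, and the use of a whole-name match, as assumptions.

```python
import re
from typing import Optional, Tuple

# Default pattern as quoted in the warning message in this thread.
DEFAULT_READ_NAME_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def parse_location(
    read_name: str, pattern: str = DEFAULT_READ_NAME_REGEX
) -> Optional[Tuple[int, int, int]]:
    """Return the three captured integers, or None if the name does not match.

    Assumption: the whole read name must match the pattern, so
    re.fullmatch is used rather than re.search.
    """
    match = re.fullmatch(pattern, read_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), int(match.group(3))


if __name__ == "__main__":
    # Invented Illumina-style name: the last three fields are captured.
    print(parse_location("HWUSI:3:12:1045:2210"))    # (12, 1045, 2210)
    # SRA-style name from this thread: no colon-separated fields, no match.
    print(parse_location("SRR998967.sra.17480252"))  # None
```

The second call returning None corresponds to the situation in which the default regex does not match and only a warning is issued.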


-Alec

On 7/16/14, 11:20 AM, Daniel Burkhardt wrote:

Hi all, I'm new to this mailing list, but I could really use some help. I've tried searching through old posts and I haven't seen this issue yet.
I'm having difficulty with the READ_NAME_REGEX in the MarkDuplicates package.
My reads are named as follows:
SRR998967.sra.17480252
SRR999013.sra.1863729
SRR998967.sra.3441562
If I run MarkDuplicates without setting READ_NAME_REGEX, I get the following error:
WARNING 2014-07-16 10:51:09 AbstractDuplicateFindingAlgorithm Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'SRR998967.sra.17480252'. You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates. Note that this message will not be emitted again even if other read names do not match the regex.
This makes sense to me. The default regex isn't formatted for my read names. The code does continue to run, however, and finds some ~9000 duplicates out of ~380,000 mapped reads.
So instead I set READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'

And I get this exception:
$: java ~/software/picard-tools-1.115/MarkDuplicates.jar I=1.RTX7000.bt2.sorted.bam O=1.RTX7000.bt2.dedup.bam M=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'
[Wed Jul 16 11:08:49 EDT 2014] picard.sam.MarkDuplicates INPUT=[1.RTX7000.bt2.sorted.bam] OUTPUT=1.RTX7000.bt2.dedup.bam METRICS_FILE=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX=SRR[0-9]+\.sra\.[0-9]+ PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jul 16 11:08:49 EDT 2014] Executing as [email protected] on Linux 2.6.32-358.23.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) JdkDeflater
INFO 2014-07-16 11:08:49 MarkDuplicates Start of doWork freeMemory: 1004570472; totalMemory: 1010827264; maxMemory: 14997258240
INFO 2014-07-16 11:08:49 MarkDuplicates Reading input file and constructing read end information.
INFO 2014-07-16 11:08:49 MarkDuplicates Will retain up to 59512929 data points before spilling to disk.
[Wed Jul 16 11:08:51 EDT 2014] picard.sam.MarkDuplicates done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1433927680
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
at java.util.regex.Matcher.group(Matcher.java:487)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
This made me think maybe there was an issue with grouping, so I set groups for each type of character. So I set READ_NAME_REGEX='(SRR)([0-9]+)\.(sra)\.([0-9]+)'
And get this exception:
Exception in thread "main" java.lang.NumberFormatException: For input string: "SRR"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
Which seems silly. Why is picard trying to interpret the SRR string as an Integer?
So finally I tried putting only the numbers in parens, leaving everything else outside:

READ_NAME_REGEX='SRR([0-9]+)\.sra\.([0-9]+)'

And again, the group exception:

Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.group(Matcher.java:487)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:99)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
Long story short, I've searched several forums, and tried putting the whole thing in a single (). I've used quotes, no quotes, double and single quotes. I seem to be stuck with getting either IndexOutOfBoundsException: No group N or NumberFormatExceptions for an input string that contains only chars.
Any idea what's going on? Should I just skip the REGEX matching? It seems that picard's finding some duplicates without it.
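The three failures above can be reproduced outside Picard with a small Python analogue. This is a sketch under assumptions, not Picard's implementation: it supposes that groups 1, 2 and 3 are read in order and each is converted to an integer, which is consistent with the stack traces pointing at lines 97 and 99 of the same method. Python raises IndexError and ValueError where Java raises IndexOutOfBoundsException and NumberFormatException. The helper try_regex is hypothetical.

```python
import re

READ_NAME = "SRR998967.sra.17480252"

# The three patterns tried in this thread, in order.
ATTEMPTS = [
    r"SRR[0-9]+\.sra\.[0-9]+",          # no capture groups at all
    r"(SRR)([0-9]+)\.(sra)\.([0-9]+)",  # group 1 captures the text "SRR"
    r"SRR([0-9]+)\.sra\.([0-9]+)",      # only two capture groups
]


def try_regex(pattern: str, read_name: str = READ_NAME) -> str:
    """Read groups 1..3 as integers and describe what happens.

    Assumption: the consumer wants exactly three integer-valued groups.
    """
    match = re.fullmatch(pattern, read_name)
    if match is None:
        return "no match"
    try:
        values = [int(match.group(i)) for i in (1, 2, 3)]
    except IndexError:
        # Analogue of "IndexOutOfBoundsException: No group N".
        return f"missing group (pattern has {match.re.groups})"
    except ValueError as err:
        # Analogue of "NumberFormatException: For input string".
        return f"not an integer: {err}"
    return f"ok: {values}"


if __name__ == "__main__":
    for pattern in ATTEMPTS:
        print(pattern, "->", try_regex(pattern))
```

Under these assumptions the first and third patterns fail because a requested group does not exist, and the second fails because its first group captures letters. A pattern with three numeric groups would avoid the exceptions, but these read names have no location fields to capture, so the captured numbers would not be meaningful coordinates.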
Thanks,
Dan



_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help
