[Samtools-help] Picard MarkDuplicates READ_NAME_REGEX Exceptions

Daniel Burkhardt Wed, 16 Jul 2014 08:54:12 -0700

Hi all, I'm new to this mailing list, but I could really use some help. I've 
searched through old posts and haven't seen this issue yet.


I'm having difficulty with the READ_NAME_REGEX option of Picard's MarkDuplicates tool.

My reads are named as follows: 
SRR998967.sra.17480252
SRR999013.sra.1863729
SRR998967.sra.3441562

If I run MarkDuplicates without setting READ_NAME_REGEX, I get the following warning:

WARNING 2014-07-16 10:51:09     AbstractDuplicateFindingAlgorithm       Default 
READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match 
read name 'SRR998967.sra.17480252'.  You may need to specify a READ_NAME_REGEX 
in order to correctly identify optical duplicates.  Note that this message will 
not be emitted again even if other read names do not match the regex.

This makes sense to me. The default regex isn't written for my read names. 
The code does continue to run, however, and finds about 9,000 duplicates out of 
~380,000 mapped reads.

So instead I set READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+' 

And I get this exception:

$: java -jar ~/software/picard-tools-1.115/MarkDuplicates.jar 
I=1.RTX7000.bt2.sorted.bam O=1.RTX7000.bt2.dedup.bam 
M=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'

[Wed Jul 16 11:08:49 EDT 2014] picard.sam.MarkDuplicates 
INPUT=[1.RTX7000.bt2.sorted.bam] OUTPUT=1.RTX7000.bt2.dedup.bam 
METRICS_FILE=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX=SRR[0-9]+\.sra\.[0-9]+ 
   PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates 
REMOVE_DUPLICATES=false ASSUME_SORTED=false 
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 
OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false 
VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 
CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jul 16 11:08:49 EDT 2014] Executing as 
[email protected] on Linux 2.6.32-358.23.2.el6.x86_64 amd64; 
Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 
1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) JdkDeflater
INFO    2014-07-16 11:08:49     MarkDuplicates  Start of doWork freeMemory: 
1004570472; totalMemory: 1010827264; maxMemory: 14997258240
INFO    2014-07-16 11:08:49     MarkDuplicates  Reading input file and 
constructing read end information.
INFO    2014-07-16 11:08:49     MarkDuplicates  Will retain up to 59512929 data 
points before spilling to disk.
[Wed Jul 16 11:08:51 EDT 2014] picard.sam.MarkDuplicates done. Elapsed time: 
0.03 minutes.
Runtime.totalMemory()=1433927680
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
        at java.util.regex.Matcher.group(Matcher.java:487)
        at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
        at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
        at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
        at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
        at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

This made me think there might be an issue with grouping, so I put each part 
of the read name in its own group:
READ_NAME_REGEX='(SRR)([0-9]+)\.(sra)\.([0-9]+)'

And get this exception:

Exception in thread "main" java.lang.NumberFormatException: For input string: "SRR"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:492)
        at java.lang.Integer.parseInt(Integer.java:527)
        at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
        at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
        at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
        at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
        at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

Which seems silly. Why is Picard trying to interpret the SRR string as an 
integer?

So finally I tried putting only the numbers inside parens:

READ_NAME_REGEX='SRR([0-9]+)\.sra\.([0-9]+)'

And again, the group exception:

Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 3
        at java.util.regex.Matcher.group(Matcher.java:487)
        at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:99)
        at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
        at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
        at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
        at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

Long story short, I've searched several forums and tried putting the whole 
thing in a single (). I've tried no quotes, single quotes, and double quotes. I 
keep getting either IndexOutOfBoundsException: No group N or a 
NumberFormatException for an input string that contains only letters.
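
The three exceptions above can be reproduced outside Picard. What follows is a 
minimal Java sketch, not Picard's actual code. It assumes, based only on the 
stack traces (Matcher.group and Integer.parseInt called from 
AbstractDuplicateFindingAlgorithm lines 97 to 99), that capture groups 1, 2 and 
3 are read in order and each is converted to an integer. The class name 
ReadNameRegexDemo, the method tryParse, the use of matches(), and the 
Illumina-style read name "HWI:1:12:345:678" are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadNameRegexDemo {

    // Hypothetical stand-in for the parsing step: read groups 1..3 in order
    // and convert each to an int, as the stack traces suggest.
    static String tryParse(String regex, String readName) {
        Matcher m = Pattern.compile(regex).matcher(readName);
        if (!m.matches()) {
            return "no match";
        }
        try {
            int tile = Integer.parseInt(m.group(1));
            int x = Integer.parseInt(m.group(2));
            int y = Integer.parseInt(m.group(3));
            return "tile=" + tile + " x=" + x + " y=" + y;
        } catch (IndexOutOfBoundsException | NumberFormatException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String sraName = "SRR998967.sra.17480252";

        // No capture groups at all: group(1) does not exist.
        System.out.println(tryParse("SRR[0-9]+\\.sra\\.[0-9]+", sraName));

        // Four groups, but group 1 is "SRR", which is not a number.
        System.out.println(tryParse("(SRR)([0-9]+)\\.(sra)\\.([0-9]+)", sraName));

        // Two numeric groups: groups 1 and 2 parse, group(3) does not exist.
        System.out.println(tryParse("SRR([0-9]+)\\.sra\\.([0-9]+)", sraName));

        // The default regex from the warning, applied to an invented name
        // that has the colon-separated layout it expects.
        System.out.println(tryParse(
                "[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*",
                "HWI:1:12:345:678"));
    }
}
```

Under these assumptions, the first regex fails because it has no groups, the 
second because its first group captures letters, and the third because it has 
only two groups. Picard's documentation for READ_NAME_REGEX describes three 
capture groups (tile/region, x coordinate, y coordinate), and read names like 
SRR998967.sra.17480252 do not appear to carry those values.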

Any idea what's going on? Should I just skip the regex matching? It seems that 
Picard is finding some duplicates without it.

Thanks,
Dan


