Hi all, I'm new to this mailing list, but I could really use some help. I've
tried searching through old posts and I haven't seen this issue yet.
I'm having difficulty with the READ_NAME_REGEX option in Picard's MarkDuplicates.
My reads are named as follows:
SRR998967.sra.17480252
SRR999013.sra.1863729
SRR998967.sra.3441562
If I run MarkDuplicates without setting READ_NAME_REGEX, I get the following warning:
WARNING 2014-07-16 10:51:09 AbstractDuplicateFindingAlgorithm Default
READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match
read name 'SRR998967.sra.17480252'. You may need to specify a READ_NAME_REGEX
in order to correctly identify optical duplicates. Note that this message will
not be emitted again even if other read names do not match the regex.
This makes sense to me. The default regex isn't formatted for my read names.
The code does continue to run, however, and finds ~9,000 duplicates out of
~380,000 mapped reads.
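To double-check that reading of the warning, the default pattern can be tried outside of Picard with plain java.util.regex. This is only a minimal sketch: the pattern string is copied from the warning above, while the class name, the `describe` helper, and the colon-separated example name `HWUSI:1:23:456:789` are made up for illustration. It also assumes the whole read name has to match, which may not be exactly how Picard applies the regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefaultRegexCheck {
    // Default READ_NAME_REGEX, copied from the warning message.
    static final Pattern DEFAULT =
        Pattern.compile("[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*");

    // Hypothetical helper: returns the three captured fields joined by
    // commas, or "no match" if the read name does not fit the pattern.
    static String describe(String readName) {
        Matcher m = DEFAULT.matcher(readName);
        if (!m.matches()) {
            return "no match";
        }
        return m.group(1) + "," + m.group(2) + "," + m.group(3);
    }

    public static void main(String[] args) {
        // One of my read names: dot-separated, so the pattern cannot match.
        System.out.println(describe("SRR998967.sra.17480252"));
        // Made-up colon-separated name of the shape the default expects.
        System.out.println(describe("HWUSI:1:23:456:789"));
    }
}
```

The first call reports no match and the second returns the three numeric fields, which is consistent with the warning.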
So instead I set READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'
And I get this exception:
$: java ~/software/picard-tools-1.115/MarkDuplicates.jar
I=1.RTX7000.bt2.sorted.bam O=1.RTX7000.bt2.dedup.bam
M=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'
[Wed Jul 16 11:08:49 EDT 2014] picard.sam.MarkDuplicates
INPUT=[1.RTX7000.bt2.sorted.bam] OUTPUT=1.RTX7000.bt2.dedup.bam
METRICS_FILE=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX=SRR[0-9]+\.sra\.[0-9]+
PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates
REMOVE_DUPLICATES=false ASSUME_SORTED=false
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25
OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false
VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jul 16 11:08:49 EDT 2014] Executing as
[email protected] on Linux 2.6.32-358.23.2.el6.x86_64 amd64;
Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version:
1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) JdkDeflater
INFO 2014-07-16 11:08:49 MarkDuplicates Start of doWork freeMemory:
1004570472; totalMemory: 1010827264; maxMemory: 14997258240
INFO 2014-07-16 11:08:49 MarkDuplicates Reading input file and
constructing read end information.
INFO 2014-07-16 11:08:49 MarkDuplicates Will retain up to 59512929 data
points before spilling to disk.
[Wed Jul 16 11:08:51 EDT 2014] picard.sam.MarkDuplicates done. Elapsed time:
0.03 minutes.
Runtime.totalMemory()=1433927680
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
at java.util.regex.Matcher.group(Matcher.java:487)
at
picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at
picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at
picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
This made me think maybe there was an issue with grouping, so I wrapped each
part of the read name in its own capture group:
READ_NAME_REGEX='(SRR)([0-9]+)\.(sra)\.([0-9]+)'
And get this exception:
Exception in thread "main" java.lang.NumberFormatException: For input string:
"SRR"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at
picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at
picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at
picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
Which seems silly. Why is picard trying to interpret the SRR string as an
Integer?
So finally I tried grouping only the numbers, leaving the letters outside of the parens:
READ_NAME_REGEX='SRR([0-9]+)\.sra\.([0-9]+)'
And again, the group exception:
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.group(Matcher.java:487)
at
picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:99)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at
picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at
picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
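The three attempts above can be reproduced outside of Picard with plain java.util.regex. This is only a sketch under an assumption: judging by the stack traces, capture groups 1 through 3 are each fetched with Matcher.group and passed to Integer.parseInt. The `probe` helper and class name are hypothetical and this is not Picard's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadNameRegexProbe {
    // Hypothetical stand-in for the parsing step: ASSUMES groups 1..3 are
    // each read and converted with Integer.parseInt, as the traces suggest.
    static String probe(String regex, String readName) {
        Matcher m = Pattern.compile(regex).matcher(readName);
        if (!m.matches()) {
            return "no match";
        }
        try {
            int a = Integer.parseInt(m.group(1));
            int b = Integer.parseInt(m.group(2));
            int c = Integer.parseInt(m.group(3));
            return "ok: " + a + " " + b + " " + c;
        } catch (IndexOutOfBoundsException | NumberFormatException e) {
            // Report the exception type and message instead of crashing.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String name = "SRR998967.sra.17480252";
        // No capture groups at all: group(1) does not exist.
        System.out.println(probe("SRR[0-9]+\\.sra\\.[0-9]+", name));
        // Four groups, but group 1 captures the letters "SRR".
        System.out.println(probe("(SRR)([0-9]+)\\.(sra)\\.([0-9]+)", name));
        // Two numeric groups only: group(3) does not exist.
        System.out.println(probe("SRR([0-9]+)\\.sra\\.([0-9]+)", name));
    }
}
```

If that assumption holds, the regex would need exactly three capture groups that each match digits only, and my read names only have two numeric fields to offer.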
Long story short, I've searched several forums and tried putting the whole
thing in a single (). I've tried quotes, no quotes, and both double and single
quotes. I seem to be stuck getting either an IndexOutOfBoundsException: No group N
or a NumberFormatException for an input string that contains only letters.
Any idea what's going on? Should I just skip the REGEX matching? It seems that
picard's finding some duplicates without it.
Thanks,
Dan
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help