Hi Daniel,
The purpose of READ_NAME_REGEX is to specify a regular expression that will
capture from the read name the physical location of the cluster in the
lane. This is used to distinguish optical duplicates (cases in which
the cluster-finding software has incorrectly identified a single cluster
as two clusters), from PCR duplicates. If two reads are considered
duplicates of one another, and their physical locations are close to one
another (as defined by OPTICAL_DUPLICATE_PIXEL_DISTANCE), then they are
considered optical duplicates, and excluded from the computation of
estimated library size.
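To make the proximity test concrete, here is a minimal sketch in Python.
It is not Picard's code. It assumes that "close" means the two reads are
on the same tile and that their x and y coordinates each differ by no
more than OPTICAL_DUPLICATE_PIXEL_DISTANCE (100 in the log quoted below).
The names Location and is_optical_duplicate are made up for illustration.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Physical location of a cluster (hypothetical helper type)."""
    tile: int
    x: int
    y: int


def is_optical_duplicate(a: Location, b: Location, pixel_distance: int = 100) -> bool:
    """Assumed rule: same tile, and both coordinates within pixel_distance.

    This would only be applied to reads already judged to be duplicates
    of one another by alignment position.
    """
    return (
        a.tile == b.tile
        and abs(a.x - b.x) <= pixel_distance
        and abs(a.y - b.y) <= pixel_distance
    )
```

Under that assumption, two duplicate reads at (tile 1, x 1000, y 2000)
and (tile 1, x 1050, y 2010) would count as optical duplicates, while the
same coordinates on different tiles would not.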
The default value for READ_NAME_REGEX is a regular expression that
captures physical location from the conventional Illumina read name
format. If you read the usage for this option, you'll see that the
regex must contain 3 capture groups, which yours does not, thus the
exception. It appears that your read names do not include physical
location, so you can either leave the default value and ignore the
warning message, or you can pass READ_NAME_REGEX=null, which will
suppress this functionality entirely. That seems the most appropriate
thing to do since your reads don't contain physical location info.
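As a rough illustration of why each of your three attempts failed, here
is a sketch in Python. It uses Python's re module as a stand-in for
java.util.regex and is not Picard's code. It assumes the regex must
match the whole read name and that capture groups 1 to 3 are each
converted to an integer (tile, x, y), which is what the
Integer.parseInt frames in your stack trace suggest. parse_location is a
made-up name.

```python
import re

# Default regex as printed in the warning message quoted below.
DEFAULT_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def parse_location(read_name, regex):
    """Return (tile, x, y) as ints, or None if the name does not match.

    Raises IndexError if the regex has fewer than three capture groups
    (compare "No group N" in Java) and ValueError if a captured group is
    not numeric (compare NumberFormatException in Java).
    """
    match = re.fullmatch(regex, read_name)
    if match is None:
        return None
    return tuple(int(match.group(i)) for i in (1, 2, 3))
```

With the read name SRR998967.sra.17480252, the default regex simply does
not match (hence the one-time warning). 'SRR[0-9]+\.sra\.[0-9]+' matches
but has no groups, '(SRR)([0-9]+)\.(sra)\.([0-9]+)' puts the non-numeric
"SRR" in group 1, and 'SRR([0-9]+)\.sra\.([0-9]+)' has only two groups.
None of these can yield a tile, x and y, because the name does not
carry them.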
-Alec
On 7/16/14, 11:20 AM, Daniel Burkhardt wrote:
Hi all, I'm new to this mailing list, but I could really use some
help. I've tried searching through old posts and I haven't seen this
issue yet.
I'm having difficulty with the READ_NAME_REGEX in the MarkDuplicates
package.
My reads are named as follows:
SRR998967.sra.17480252
SRR999013.sra.1863729
SRR998967.sra.3441562
If I run MarkDuplicates without setting READ_NAME_REGEX, I get the
following error:
WARNING 2014-07-16 10:51:09 AbstractDuplicateFindingAlgorithm Default
READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did
not match read name 'SRR998967.sra.17480252'. You may need to specify
a READ_NAME_REGEX in order to correctly identify optical duplicates.
Note that this message will not be emitted again even if other read
names do not match the regex.
This makes sense to me. The default regex isn't formatted for my read
names. The code does continue to run, however, and finds some ~9000
duplicates out of ~380,000 mapped reads.
So instead I set READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'
And I get this exception:
$: java ~/software/picard-tools-1.115/MarkDuplicates.jar
I=1.RTX7000.bt2.sorted.bam O=1.RTX7000.bt2.dedup.bam
M=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'
[Wed Jul 16 11:08:49 EDT 2014] picard.sam.MarkDuplicates
INPUT=[1.RTX7000.bt2.sorted.bam] OUTPUT=1.RTX7000.bt2.dedup.bam
METRICS_FILE=1.RTX7000.bt2.dedup.metrics
READ_NAME_REGEX=SRR[0-9]+\.sra\.[0-9]+
PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates
REMOVE_DUPLICATES=false ASSUME_SORTED=false
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000
SORTING_COLLECTION_SIZE_RATIO=0.25
OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false
VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5
MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jul 16 11:08:49 EDT 2014] Executing as
[email protected] on Linux
2.6.32-358.23.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM
1.7.0_45-b18; Picard version:
1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) JdkDeflater
INFO 2014-07-16 11:08:49 MarkDuplicates Start of doWork freeMemory:
1004570472; totalMemory: 1010827264; maxMemory: 14997258240
INFO 2014-07-16 11:08:49 MarkDuplicates Reading input file and
constructing read end information.
INFO 2014-07-16 11:08:49 MarkDuplicates Will retain up to 59512929 data
points before spilling to disk.
[Wed Jul 16 11:08:51 EDT 2014] picard.sam.MarkDuplicates done. Elapsed
time: 0.03 minutes.
Runtime.totalMemory()=1433927680
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
at java.util.regex.Matcher.group(Matcher.java:487)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
This made me think maybe there was an issue with grouping, so I added a
capture group for each part of the read name:
READ_NAME_REGEX='(SRR)([0-9]+)\.(sra)\.([0-9]+)'
And get this exception:
Exception in thread "main" java.lang.NumberFormatException: For input string: "SRR"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
Which seems silly. Why is picard trying to interpret the SRR string as
an Integer?
So finally I tried putting only the numbers in parens:
READ_NAME_REGEX='SRR([0-9]+)\.sra\.([0-9]+)'
And again, the group exception:
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.group(Matcher.java:487)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:99)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
Long story short, I've searched several forums, and tried putting the
whole thing in a single (). I've used quotes, no quotes, double and
single quotes. I seem to be stuck with getting either
IndexOutOfBoundsException: No group N or NumberFormatExceptions for an
input string that contains only chars.
Any idea what's going on? Should I just skip the REGEX matching? It
seems that picard's finding some duplicates without it.
Thanks,
Dan
------------------------------------------------------------------------------
Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck
Code Sight - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
http://p.sf.net/sfu/bds
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help