Hi Daniel,

The purpose of READ_NAME_REGEX is to specify a regular expression that captures, from the read name, the physical location of the cluster in the lane. This is used to distinguish optical duplicates (cases in which the cluster-finding software has incorrectly identified a single cluster as two clusters) from PCR duplicates. If two reads are considered duplicates of one another, and their physical locations are close to one another (as defined by OPTICAL_DUPLICATE_PIXEL_DISTANCE), then they are considered optical duplicates and are excluded from the computation of estimated library size.
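As a rough illustration of the proximity test described above, here is a minimal Python sketch. It is not Picard's code: the `ClusterLocation` type and `is_optical_duplicate` function are hypothetical names, and the rule itself (same tile, with both the x and y offsets within the threshold) is an assumption about how "close to one another" is evaluated. The default of 100 matches the OPTICAL_DUPLICATE_PIXEL_DISTANCE value shown in the log further down.

```python
from typing import NamedTuple


class ClusterLocation(NamedTuple):
    """Physical location of a cluster (hypothetical helper type)."""
    tile: int
    x: int
    y: int


def is_optical_duplicate(a: ClusterLocation,
                         b: ClusterLocation,
                         max_pixel_distance: int = 100) -> bool:
    """Assumed rule: same tile, and x and y each differ by at most the threshold.

    Only meaningful for two reads already known to be duplicates of each other.
    """
    return (
        a.tile == b.tile
        and abs(a.x - b.x) <= max_pixel_distance
        and abs(a.y - b.y) <= max_pixel_distance
    )
```

Reads whose names carry no location information cannot be passed through a test like this at all, which is the situation in the question below.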

The default value for READ_NAME_REGEX is a regular expression that captures physical location from the conventional Illumina read name format. If you read the usage for this option, you'll see that the regex must contain 3 capture groups, which yours does not, hence the "No group N" exceptions. Each captured group is also parsed as an integer, which is why capturing "SRR" produced a NumberFormatException. It appears that your read names do not include physical location, so you can either leave the default value and ignore the warning message, or pass READ_NAME_REGEX=null, which will suppress this functionality entirely. The latter seems the most appropriate thing to do, since your reads don't contain physical location info.
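To make the three-capture-group requirement concrete, here is a small Python sketch that mimics the behaviour suggested by the stack traces below: groups 1 to 3 are read in order and each is converted to an integer. It uses Python's `re` module as a stand-in for `java.util.regex`, so `IndexError` plays the role of `IndexOutOfBoundsException` and `ValueError` the role of `NumberFormatException`. The function names, the use of a full-string match, and the read name `MACHINE1:6:73:941:1973` are assumptions made for illustration; only the regexes and the SRR read name come from this thread.

```python
import re

# Default regex as quoted in the warning message in this thread.
DEFAULT_READ_NAME_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def extract_location(read_name_regex, read_name):
    """Return the three captured integers, or None if the name does not match."""
    match = re.fullmatch(read_name_regex, read_name)
    if match is None:
        return None
    # Groups 1, 2 and 3 are read in order and each is converted to an integer.
    return tuple(int(match.group(i)) for i in (1, 2, 3))


def describe(read_name_regex, read_name):
    """Report which kind of failure a given regex would trigger, if any."""
    try:
        return extract_location(read_name_regex, read_name)
    except IndexError:
        return "missing capture group"
    except ValueError:
        return "captured text is not an integer"


if __name__ == "__main__":
    name = "SRR998967.sra.17480252"
    print(describe(DEFAULT_READ_NAME_REGEX, "MACHINE1:6:73:941:1973"))
    print(describe(DEFAULT_READ_NAME_REGEX, name))
    print(describe(r"SRR[0-9]+\.sra\.[0-9]+", name))
    print(describe(r"(SRR)([0-9]+)\.(sra)\.([0-9]+)", name))
    print(describe(r"SRR([0-9]+)\.sra\.([0-9]+)", name))
```

Under these assumptions, the regex with no groups fails on group 1, the four-group regex fails when "SRR" is converted to an integer, and the two-group regex fails on group 3, which lines up with the three exceptions reported in the question. None of the attempted regexes can succeed in a meaningful way, because the SRR read names contain no location fields to capture.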

-Alec

On 7/16/14, 11:20 AM, Daniel Burkhardt wrote:
Hi all, I'm new to this mailing list, but I could really use some help. I've tried searching through old posts and I haven't seen this issue yet.

I'm having difficulty with the READ_NAME_REGEX in the MarkDuplicates package.

My reads are named as follows:
SRR998967.sra.17480252
SRR999013.sra.1863729
SRR998967.sra.3441562

If I run MarkDuplicates without setting READ_NAME_REGEX, I get the following warning:

WARNING 2014-07-16 10:51:09 AbstractDuplicateFindingAlgorithm Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'SRR998967.sra.17480252'. You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates. Note that this message will not be emitted again even if other read names do not match the regex.

This makes sense to me. The default regex isn't formatted for my read names. The code does continue to run, however, and finds ~9,000 duplicates out of ~380,000 mapped reads.

So instead I set READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'

And I get this exception:

$: java ~/software/picard-tools-1.115/MarkDuplicates.jar I=1.RTX7000.bt2.sorted.bam O=1.RTX7000.bt2.dedup.bam M=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX='SRR[0-9]+\.sra\.[0-9]+'

[Wed Jul 16 11:08:49 EDT 2014] picard.sam.MarkDuplicates INPUT=[1.RTX7000.bt2.sorted.bam] OUTPUT=1.RTX7000.bt2.dedup.bam METRICS_FILE=1.RTX7000.bt2.dedup.metrics READ_NAME_REGEX=SRR[0-9]+\.sra\.[0-9]+ PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jul 16 11:08:49 EDT 2014] Executing as [email protected] on Linux 2.6.32-358.23.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) JdkDeflater
INFO 2014-07-16 11:08:49 MarkDuplicates Start of doWork freeMemory: 1004570472; totalMemory: 1010827264; maxMemory: 14997258240
INFO 2014-07-16 11:08:49 MarkDuplicates Reading input file and constructing read end information.
INFO 2014-07-16 11:08:49 MarkDuplicates Will retain up to 59512929 data points before spilling to disk.
[Wed Jul 16 11:08:51 EDT 2014] picard.sam.MarkDuplicates done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1433927680
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
at java.util.regex.Matcher.group(Matcher.java:487)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

This made me think maybe there was an issue with grouping, so I set groups for each type of character. So I set READ_NAME_REGEX='(SRR)([0-9]+)\.(sra)\.([0-9]+)'

And get this exception:

Exception in thread "main" java.lang.NumberFormatException: For input string: "SRR"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:97)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

Which seems silly. Why is picard trying to interpret the SRR string as an Integer?

So finally I tried putting only the numbers in parens:

READ_NAME_REGEX='SRR([0-9]+)\.sra\.([0-9]+)'

And again, the group exception:

Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.group(Matcher.java:487)
at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:99)
at picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:504)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:429)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

Long story short, I've searched several forums, and tried putting the whole thing in a single (). I've used quotes, no quotes, double and single quotes. I seem to be stuck with getting either IndexOutOfBoundsException: No group N or NumberFormatExceptions for an input string that contains only chars.

Any idea what's going on? Should I just skip the REGEX matching? It seems that picard's finding some duplicates without it.

Thanks,
Dan


------------------------------------------------------------------------------
Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck
Code Sight - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
http://p.sf.net/sfu/bds


_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help
