Hi there,
I’ve been trying to use Picard MarkDuplicates to mark and remove duplicate
entries in some BAM files of mine. (The BAM files are sorted and indexed.)
I have BAM files from 4 fungal strains, and for one of them MarkDuplicates
works just fine. I have run this multiple times, always with the same result:
it always works for that one fungal strain (always the same one), but always
fails for the other 3 strains.
For the 3 fungal strains where it fails, I seem to run out of memory. I’ve
tried to fix the problem by increasing the requested RAM (once I even asked
for a ridiculous amount) and by specifying a temporary directory for Java IO
with -Djava.io.tmpdir, as recommended on your mailing list, but alas the
problem persists.
I always run with the latest version of Picard, and recently updated to your
latest release, 1.119. Although the error message has changed slightly, the
out-of-memory problem remains.
My Java version:
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
Below is the command I ran, followed by the log and error output.
java -Xmx22g -Djava.io.tmpdir=$HOME/tempdir -jar \
$HOME/bio/picard-tools-1.119/MarkDuplicates.jar \
I=/my/path/my_fungal_strain.srt.bam \
O=/my/path/my_fungal_strain.srt.dedup.bam \
METRICS_FILE=/my/path/my_fungal_strain.srt.dedup.duplicationMetrics \
CREATE_INDEX=true \
REMOVE_DUPLICATES=true \
ASSUME_SORTED=true
picard.sam.MarkDuplicates INPUT=[/my/path/my_fungal_strain.srt.bam]
OUTPUT=/my/path/my_fungal_strain.srt.dedup.bam
METRICS_FILE=/my/path/my_fungal_strain.srt.dedup.duplicationMetrics
REMOVE_DUPLICATES=true ASSUME_SORTED=true CREATE_INDEX=true
PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25
READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*
OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false
VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
CREATE_MD5_FILE=false
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
/path/to/my/bin/picard-tools-1.119/libIntelDeflater.so which might have
disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>',
or link it with '-z noexecstack'.
[Fri Sep 05 19:52:55 EST 2014] Executing as [email protected] on Linux
2.6.32-431.11.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM
1.7.0_51-b13; Picard version:
1.119(d44cdb51745f5e8075c826430a39d8a61f1dd832_1408991805) IntelDeflater
INFO 2014-09-05 19:52:55 MarkDuplicates Start of doWork freeMemory:
754742648; totalMemory: 759693312; maxMemory: 20997734400
INFO 2014-09-05 19:52:55 MarkDuplicates Reading input file and
constructing read end information.
INFO 2014-09-05 19:52:55 MarkDuplicates Will retain up to 83324342 data
points before spilling to disk.
INFO 2014-09-05 19:53:20 MarkDuplicates Read 1,000,000 records.
Elapsed time: 00:00:24s. Time for last 1,000,000: 24s. Last read position:
NODE_7_length_125489_cov_34.5327_ID_35439289:42,886
INFO 2014-09-05 19:53:20 MarkDuplicates Tracking 135538 as yet
unmatched pairs. 292 records in RAM.
[Fri Sep 05 19:53:32 EST 2014] picard.sam.MarkDuplicates done. Elapsed time:
0.63 minutes.
Runtime.totalMemory()=2518155264
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: /my/home/tempdir/user/CSPI.2431821386522683006.tmp/5438.tmp not found
        at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:63)
        at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:49)
        at htsjdk.samtools.util.ResourceLimitedMap.get(ResourceLimitedMap.java:76)
        at htsjdk.samtools.CoordinateSortedPairInfoMap.getOutputStreamForSequence(CoordinateSortedPairInfoMap.java:180)
        at htsjdk.samtools.CoordinateSortedPairInfoMap.put(CoordinateSortedPairInfoMap.java:164)
        at picard.sam.DiskReadEndsMap.put(DiskReadEndsMap.java:67)
        at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:449)
        at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
        at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
Caused by: java.io.FileNotFoundException: /my/home/tempdir/user/CSPI.2431821386522683006.tmp/5438.tmp (Too many open files)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:60)
        ... 9 more
One of the problems seems to be that Picard MarkDuplicates can’t write to the
temporary directory at /my/home/tempdir/user/. I’ve set that directory so that
anyone can write to it, but this doesn’t solve the problem: the temporary file
/my/home/tempdir/user/CSPI.2431821386522683006.tmp/5438.tmp isn’t written out.
The underlying exception says "Too many open files", so I wonder whether the
per-process file-handle limit, rather than permissions, is the real issue; the
checks I ran are sketched below.
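In case it’s useful, here is how I checked the directory permissions and the
open-file limits on the compute node. (The ulimit value of 4096 and the
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP value of 1000 below are just illustrative
guesses on my part, not something taken from your documentation.)

# Confirm the temp directory really is writable by me:
ls -ld /my/home/tempdir/user/
touch /my/home/tempdir/user/write_test && rm /my/home/tempdir/user/write_test

# Check the soft and hard per-process open-file limits in the shell that
# launches the Java job:
ulimit -Sn
ulimit -Hn

# Raise the soft limit for this session (only possible up to the hard
# limit without root):
ulimit -n 4096

Would it also make sense to lower MAX_FILE_HANDLES_FOR_READ_ENDS_MAP (echoed
in the log above with its value of 8000) to something below my soft limit,
e.g. MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 on the command line?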
I hope you can help me shed light on this, so that I can get your tool to run
to completion on my data.
(For all 4 strains samtools rmdup runs fine, but for consistency I’d very much
like to get Picard MarkDuplicates working for all 4 strains; the samtools
command I used is sketched below.)
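For reference, the samtools command I ran for each strain was of this form
(the output filename is just illustrative; I used the default paired-end
mode, i.e. no -s flag):

# Remove duplicates from the coordinate-sorted BAM with samtools:
samtools rmdup /my/path/my_fungal_strain.srt.bam \
/my/path/my_fungal_strain.srt.rmdup.bam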
Many thanks in advance,
Åsa