On Sun, May 17, 2015 at 10:12:49AM +0100, Dan Greenfield wrote:
> samtools view -C -o NA12878_S1.cram -T hg38.fa NA12878_S1.bam

The reference this BAM was aligned against is almost certainly build 37
(h37), not hg38. Look at the length of MT to see whether it is the 1000G
mitochondria or the UCSC one (16569 vs 16571 bases, iirc). The BAM headers
really ought to have included more information, but the length allows us
to detect which is which.

> 3. attempt to convert back to BAM:
> samtools view -b -o NA12878_S1.cram.bam -T hg38.fa NA12878_S1.cram
>
> Results from step 3: (no errors/warnings encountered in previous steps)
>
> Slice ends beyond reference end.
> Slice ends beyond reference end.
> ERROR: md5sum reference mismatch for ref 1 pos 248900092..248956422
> CRAM: 1ca5fd5ffe82936260309c85fc9b473b
> Ref : b47b43c987dbf1af96ca6d59061401c8
> Failure to decode slice
> [E::hts_close] Failed to decode sequence.
> samtools: error closing "NA12878_S1.cram": -1

This seems wrong though, even if it has been given the wrong reference.
It ought to always be able to decode what it created!

My guess is that it is failing because the "Slice ends beyond reference
end" case yields a truncated reference. I'll need to verify this, but
there are issues around how you compute a fake reference when a slice
extends beyond the end. I think I just pretend the reference is followed
by additional Ns, but perhaps the encoder and decoder don't use the same
logic when computing the MD5 sums. A nasty corner case, but likely a bug.

> Here are the file sizes (note the cram file is actually bigger than the
> original bam):
> NA12878_S1.bam 121,691,186,161
> NA12878_S1.cram 125,386,599,695

That size alone tells me that CRAM is failing to encode the sequence as
a delta against the reference correctly. I wouldn't expect the CRAM to be
larger than the BAM except on tiny files (we reserve more space for the
header) or where the reference is invalid.

For what it's worth, the Platinum Genomes NA12878_S1 should come out
around 67GB for CRAM v3.0 (vs 121.7GB for BAM), or 65GB if we use a bzip2
codec. This compares pretty well to 65.9GB from Deez and 64.5GB from
Quip - they're all in the same ballpark.
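To illustrate the suspected MD5 corner case, here is a minimal Python sketch. It is not the htslib code, and the function names and the toy reference are made up for illustration. It assumes one side hashes only the bases that actually exist, while the other pads the overrun with Ns before hashing. Under that assumption the two checksums agree for slices inside the reference and disagree as soon as a slice runs past the end.

```python
import hashlib


def slice_md5_truncated(ref: str, start: int, end: int) -> str:
    """MD5 of reference bases in 1-based inclusive [start, end].

    Positions past the end of the reference are silently dropped
    (hypothetical behaviour of one side).
    """
    bases = ref[start - 1:end].upper()
    return hashlib.md5(bases.encode("ascii")).hexdigest()


def slice_md5_padded(ref: str, start: int, end: int) -> str:
    """MD5 of the same span, with positions past the reference end
    filled with N (hypothetical behaviour of the other side)."""
    bases = ref[start - 1:end].upper()
    missing = (end - start + 1) - len(bases)
    bases += "N" * missing
    return hashlib.md5(bases.encode("ascii")).hexdigest()


ref = "ACGTACGTAC"  # toy 10-base reference, illustration only

# Slice fully inside the reference: both strategies give the same digest.
inside_match = slice_md5_truncated(ref, 1, 10) == slice_md5_padded(ref, 1, 10)

# Slice 6..14 overruns the reference by 4 bases: the digests diverge.
overrun_match = slice_md5_truncated(ref, 6, 14) == slice_md5_padded(ref, 6, 14)

print("inside reference, digests match:", inside_match)
print("beyond reference end, digests match:", overrun_match)
```

If the encoder and decoder really do differ in this way, any slice that extends past the end of the supplied reference would fail its MD5 check on decode, even when the same reference file is used in both directions.
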
James

-- 
James Bonfield
A Staden Package developer: https://sf.net/projects/staden/
