On 17/08/2015 15:31, James Bonfield wrote: > On Mon, Aug 17, 2015 at 12:45:50PM +0100, James Bonfield wrote: >> Have you tested that they work? I don't seem able to get this working >> right now, but maybe I'm doing something wrong (or perhaps they have >> corner cases which fail). > Here is an example built on the latest cramtools-3.0.jar: > > # .BAI index > $ java -Xmx2g -jar ~/work/cram/cramtools/cramtools-3.0.jar index > --bam-style-index -I /nfs/users/nfs_j/jkb/scratch/data/9827_2#49.java3.cram > $ java -Xmx2g -jar ~/work/cram/cramtools/cramtools-3.0.jar bam -I > /nfs/users/nfs_j/jkb/scratch/data/9827_2#49.java3.cram -R > /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa > --skip-md5-check 1:100000000-100000000 > [no output] > > # .CRAI index > $ java -Xmx2g -jar ~/work/cram/cramtools/cramtools-3.0.jar index -I > /nfs/users/nfs_j/jkb/scratch/data/9827_2#49.java3.cram > $ java -Xmx2g -jar ~/work/cram/cramtools/cramtools-3.0.jar bam -I > /nfs/users/nfs_j/jkb/scratch/data/9827_2#49.java3.cram -R > /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa > --skip-md5-check 1:100000000-100000000 > HS25_09827:2:2214:6841:99219#49 99 1 99999945 60 100M > = 100000329484 > CTGGGAGCCCCGCAGAACATGGTGTTTTATTCTGACACTATATATCTAGCACTTGCACTGTAAAAATGGAAGTAATTCCCATTAGGACCAGCAAAACCTG > > AAEFFGFGGFHGGJ>HGGGHGG@HHJHIGHHHHELGGGHGGHGHFGHIFFIIHHLIFGFGIJGHHGGGGEGIGJGJFIIHJGIHEGHFFHGGGHEFHHIE > X0:i:1 X1:i:0 BC:Z:NGTCTATC RG:Z:1#49 XG:i:0 AM:i:37 SM:i:37 > XM:i:0 XO:i:0 QT:Z:!1=DFFFB XT:A:U > ... > [4 lines of output] > > > # Samtools on the BAM equivalent > $ samtools view /nfs/users/nfs_j/jkb/scratch/data/9827_2#49.bam > 1:100000000-100000000 > [Also 4 lines of output] > > This is just a random local human test file, but I haven't explored further. > It also fails on whole chromosomes though: > > $ samtools view /nfs/users/nfs_j/jkb/scratch/data/9827_2#49.bam Y:|wc -l > 417134 > > # With .CRAI > $ java -Xmx2g -jar ~/work/cram/cramtools/cramtools-3.0.jar bam -I > /nfjkb/scratch/data/9827_2#49.java3.cram -R > /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa > --skip-md5-check Y:|wc -l > 417134 > > # With .BAI > $ java -Xmx2g -jar ~/work/cram/cramtools/cramtools-3.0.jar bam -I > /nfjkb/scratch/data/9827_2#49.java3.cram -R > /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa > --skip-md5-check Y:|wc -l > 7134 Looks like a bug in the code that skips to the first container in query. It should check all coordinates returned by index instead of just skipping the last one. Do you have the numbers from Picard's ViewSam? Right now cramtools has it's own methods for random access but I'm thinking to switch it to htsjdk completely. > > > The second issue is that I don't understand how .BAI works when we get > multiple references packed into a single slice within cram. My > understanding is that BAI on CRAM is implemented by encoding the slice > coordinates into BAI instead of read coordinates, due to the > read virtual-offset not being workable within CRAM. > > Does it produce multiple BAI fake reads, one for each reference used > within the cram slice? It does not do anything meaningful for multiref at this point. It could produce fake reads as you say, although at a cost of uncompressing records.
Vadim > > James > -- Vadim Zalunin European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD United Kingdom Tel: + 44 (0) 1223 494 614 Fax: + 44 (0) 1223 494 468 ------------------------------------------------------------------------------ _______________________________________________ Samtools-help mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/samtools-help
