That looks like output from bcftools, not samtools mpileup: $ samtools mpileup 2>&1 | grep output-tags -t, --output-tags LIST optional tags to output: DP,DPR,DV,DP4,INFO/DPR,SP []
Petr On Thu, 2014-10-02 at 19:35 -0400, jbr950 wrote: > Hi Petr, > Pardon me, I upgraded to the latest (htslib_1.0) and the -t option > you suggested didn't result in any output. Appears that > > > -t, --targets [^]<region> similar to -r but streams > rather than index-jumps. Exclude regions with "^" prefix > > -r, --regions <region> restrict to comma-separated > list of regions > > > > Any thoughts? > > > Jonathan > > > > > > > > > > > > On Thu, Oct 2, 2014 at 3:49 AM, Petr Danecek <[email protected]> wrote: > Hi Jonathan, > > new versions of mpileup have the -t DPR,INFO/DPR option which > outputs > depth for each observed base. > > Cheers, > petr > > > On Wed, 2014-10-01 at 12:49 -0400, jbr950 wrote: > > Hi Petr, > > Thanks for your time thus far. I'm trying to query some > bam files > > at specified locations to have a look at the allele > frequencies. The > > motivation is to 'validate' variants called in one dataset, > in a novel > > dataset using a more simple approach. Ideally, I'd get the > raw counts > > of the number of reads mapping to A,T,C,G, and/or indel (if > detected). > > > > > > I tried mpileup two ways (with -cg in bcftools view to call > variants, > > and without), and neither seems to provide me the data I > would like > > in the corresponding output: > > > > samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} > $bamFile | > > $bcftools view - > no_cg.vcf > > > > samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} > $bamFile | > > $bcftools view -cg - > cg.vcf > > > > > > > > Then I check the context around 2 known variants: > > > > > > Variant 1 > > grep -C 1 105213319 cg.vcf > > > > 14 105213318 . A . 117 . > > DP=29;AF1=0;AC1=0;DP4=14,15,0,0;MQ=20;FQ=-114 PL 0 > > 14 105213319 . G C 25 . > > > DP=28;VDB=0.0398;AF1=0.5;AC1=1;DP4=10,9,4,5;MQ=20;FQ=28;PV4=1,1,1,1 > > GT:PL:GQ 0/1:55,0,128:58 > > 14 105213320 . T . 114 . > > DP=28;AF1=0;AC1=0;DP4=14,14,0,0;MQ=20;FQ=-111 PL 0 > > > > > > Here, the variant was called and the DP4 field tells me the > number of > > reference and non-reference (presumably "C") alleles. I am > assuming > > that there are 0 T's and A's. > > > > > > > > > > grep -C 1 105213319 no_cg.vcf > > 14 105213318 . A X 0 . > > > DP=29;I16=14,15,0,0,979,34421,0,0,580,11600,0,0,511,11223,0,0 PL > > 0,87,198 > > 14 105213319 . G C,T,X 0 . > > > > DP=28;I16=10,9,4,5,645,23047,338,12742,380,7600,180,3600,339,7223,177,4179;VDB=0.0398 > PL 55,0,128,96,140,195,112,152,198,201 > > 14 105213320 . T X 0 . > > > DP=28;I16=14,14,0,0,924,32326,0,0,560,11200,0,0,520,11560,0,0 PL > > 0,84,195 > > > > > > Here, I see that that the non-ref alleles consist of C's ant > T's, > > although I don't see any way to determine the relative > proportion or > > absolute number of each. > > > > > > > > > > > > Trying Variant 2: > > grep -C 1 154574018 cg.vcf > > 1 154574017 . T . 283 . > > > DP=274;VDB=0.0365;AF1=0;AC1=0;DP4=146,128,0,0;MQ=20;FQ=-282 > PL > > 0 > > 1 154574018 . C . 113 . > > > > DP=273;VDB=0.0365;AF1=0;AC1=0;DP4=110,100,31,32;MQ=20;FQ=-110;PV4=0.67,1,1,1 > PL 0 > > 1 154574019 . T . 283 . > > > DP=278;VDB=0.0365;AF1=0;AC1=0;DP4=142,136,0,0;MQ=20;FQ=-282 > PL > > 0 > > > > > > This variant was not called at all. I may be willing to > "assume" that > > the 63 'non-ref' reads reported in the DP4 field correspond > to my > > variant, but it would be nice to be sure. > > > > > > > > > > > > grep -C 1 154574018 no_cg.vcf > > 1 154574017 . T X 0 . > > > > DP=274;I16=146,128,0,0,10042,376816,0,0,5463,109209,0,0,5159,117099,0,0;VDB=0.0365 > PL 0,255,255 > > 1 154574018 . C G,X 0 . > > > > DP=273;I16=110,100,31,32,7610,283160,2311,86963,4183,83609,1260,25200,3918,88780,1242,28290;VDB=0.0365 > PL 0,83,117,255,255,220 > > 1 154574019 . T X 0 . > > > > DP=278;I16=142,136,0,0,10104,376490,0,0,5543,110809,0,0,5167,116999,0,0;VDB=0.0365 > PL 0,255,255 > > > > > > > > > > Here, the C to G is reported, but I again I only have total > read depth > > via the DP field, no breakdown by nucleotide (or am I > missing it?). > > > > > > It's true I'm using Samtools 0.1.18. If upgrading to the > latest would > > give me a better option, I'm happy to do that, but after > scouring the > > available documentation, I don't see a way to do this. > > > > > > In other words, I'd like to inquire at many positions the > proportion > > of A's, T's, C's, G's and/or detected indels. Mpileup (at > least how > > I'm currently using it) gets me close but not all the way > there. > > > > > > > > > > Did it make sense? I'd appreciate any suggestions. > > > > > > > > Best regards, > > Jonathan > > > > > > > > On Tue, Sep 23, 2014 at 11:47 AM, Petr Danecek > <[email protected]> > > wrote: > > Hi Jonathan, > > > > On Tue, 2014-09-23 at 11:05 -0400, jbr950 wrote: > > > Hi Petr, > > > The workflow documentation looks great - this > effort is > > undoubtedly > > > going to help a lot of researchers. Thanks for > doing it. > > > > the credits for the workflow should not go to me, > the > > htslib.org page > > has been put together by Martin Pollard, Thomas > Keane and > > others. > > :-) > > > > > In this instance, I actually don't need statistics > and the > > raw pileups > > > are fine. Which brings me back to the original > issue. With > > the new > > > version, the vcf outputs from mpileup, using the > same .bed > > as input, > > > are still of varying lengths. Do you know of a > way to > > force mpileup > > > to report data at every position in the .bed file > for every > > run? Or > > > alternatively, do you know why some positions > report a DP=0 > > and other > > > positions are omitted altogether? > > > > These DP=0 sites are emitted at places where there > is an > > aligned read > > with an indel. The indel may not appear in the > output when the > > mpileup > > machinery decides it is not significant enough. > > > > I hope this helps, > > Petr > > > > > > > Thanks again, > > > Jonathan > > > > > > > > > > > > On Tue, Sep 23, 2014 at 9:26 AM, Petr Danecek > > <[email protected]> > > > wrote: > > > There is a slowly growing documentation on > the new > > website, > > > please check > > > the variant calling workflow > > > http://www.htslib.org/workflow/ > > > > > > You need to use `bcftools call`, not > `bcftools > > view`. > > > > > > petr > > > > > > > > > On Tue, 2014-09-23 at 09:25 -0400, jbr950 > wrote: > > > > Good catch - I was trying to use an old > version of > > bcftools > > > with the > > > > new version of samtools.. > > > > > > > > > > > > So I changed that to point to the new > version of > > bcftools, > > > and now: > > > > > > > > > > > > > > samtools-bcftools-htslib-1.0_x64-linux/bin/samtools > mpileup > > > -BQ0 > > > > -d10000000 -l $variantBed -uf ${ref} > $bamFile > > > test.bcf > > > > > > > > > > > > > > > > So far so good.. > > > > > > > > > > > > And then: > > > > cat test.bcf | > > > > samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools > > > > view -cg - > > > > Error: Could not parse --min-ac g > > > > > > > > > > > > > > > > On the other hand, calling: . > > > > cat test.bcf | > > > > samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools > > > > view- > > > > > > > > > > > > > > > > does create a nice output, so I'm > inclined to let > > it go with > > > the > > > > defaults. I don't see equivalent > arguments for the > > -c and -g > > > flags and > > > > I'm not particularly attached to the use > of > > Bayesian > > > inference, > > > > although it would be good to understand > what > > differences I > > > can expect > > > > between the old method (using -c) and > the new > > (sans -c)? > > > > > > > > > > > > Thanks for your time and assistance! > > > > > > > > > > > > Jonathan > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 23, 2014 at 2:40 AM, Petr > Danecek > > > <[email protected]> > > > > wrote: > > > > This does not look like output > from > > samtools 1.0 - > > > it outputs > > > > VCFv4.2 > > > > > > > > petr > > > > > > > > On Mon, 2014-09-22 at 14:53 > -0400, jbr950 > > wrote: > > > > > Hi Petr, > > > > > Thanks for your reply. I > grabbed > > the > > > > > > samtools-bcftools-htslib-1.0_x64-linux > > binary and > > > tried > > > > again. > > > > > > > > > > > > > > > When I'd used Samtools 0.1.18, > the issue > > I emailed > > > about was > > > > that the > > > > > number of lines in output > varied by .bam > > file, and > > > I didn't > > > > understand > > > > > why lines were being ommitted, > and why > > not in a > > > common > > > > manner. > > > > > > > > > > > > > > > Using version 1.0 and the same > command > > (I checked > > > on the > > > > older > > > > > samtools and it is producing > output), I > > get no > > > output at > > > > all. Mpileup > > > > > writes a header and then > stops. I have > > copied and > > > pasted > > > > the output. > > > > > > > > > > > > > > > > > > > > My .bed is 3 columns: chr > start stop > > > > > with no header. > > > > > > > > > > > > > > > Any advice appreciated, > thanks! > > > > > > > > > > > > > > > Jonathan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [mpileup] 1 samples in 1 input > files > > > > > (mpileup) Max depth is above > 1M. > > Potential memory > > > hog! > > > > > [bcf_sync] incorrect number of > fields > > (0 != 5) at > > > 0:0 > > > > > [afs] 0:0.000 > > > > > ##fileformat=VCFv4.1 > > > > > > > > > ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw > > read > > > > depth"> > > > > > > > > > ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# > > > > high-quality > > > > > ref-forward bases, > ref-reverse, > > alt-forward and > > > alt-reverse > > > > bases"> > > > > > > > > > > > > > > > ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square > > > > > mapping quality of covering > reads"> > > > > > > > > > ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred > > > > probability of > > > > > all samples being the same"> > > > > > > > > > > > > > > > ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood > > > > > estimate of the first ALT > allele > > frequency > > > (assuming HWE)"> > > > > > > > > > > > > > > > ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood > > > > > estimate of the first ALT > allele count > > (no HWE > > > assumption)"> > > > > > > > ##INFO=<ID=G3,Number=3,Type=Float,Description="ML > > > estimate > > > > of genotype > > > > > frequencies"> > > > > > > > > > > > ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based > > > > HWE test > > > > > P-value based on G3"> > > > > > > > > > > > ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio > > > > of > > > > > genotype likelihoods with and > without > > the > > > constraint"> > > > > > > > > > ##INFO=<ID=UGT,Number=1,Type=String,Description="The > > most > > > > probable > > > > > unconstrained genotype > configuration in > > the trio"> > > > > > > > > > ##INFO=<ID=CGT,Number=1,Type=String,Description="The > > most > > > > probable > > > > > constrained genotype > configuration in > > the trio"> > > > > > > > > > > > ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for > > > > strand > > > > > bias, baseQ bias, mapQ bias > and tail > > distance > > > bias"> > > > > > > > > > > > ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates > > > > that the > > > > > variant is an INDEL."> > > > > > > > > > > > ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred > > > > probability of > > > > > the nonRef allele frequency in > group1 > > samples > > > being larger > > > > (,smaller) > > > > > than in group2."> > > > > > > > > > > > ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior > > > > weighted > > > > > chi^2 P-value for testing the > > association between > > > group1 and > > > > group2 > > > > > samples."> > > > > > > > > > > > ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred > > > > scaled > > > > > PCHI2."> > > > > > > > ##INFO=<ID=PR,Number=1,Type=Integer,Description="# > > > > permutations > > > > > yielding a smaller PCHI2."> > > > > > > > > > > > ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant > > > > Distance > > > > > Bias"> > > > > > > > > > > > ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> > > > > > > > > > > > ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype > > > > Quality"> > > > > > > > > > > > ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods > > > > for > > > > > RR,RA,AA genotypes > (R=ref,A=alt)"> > > > > > > > > > ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# > > > > high-quality > > > > > bases"> > > > > > > > > > > > > > > > ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled > strand > > > > > bias P-value"> > > > > > > > > > > > ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of > > > > > Phred-scaled genotype > likelihoods"> > > > > > #CHROM POS ID REF > ALT > > QUAL > > > FILTER INFO > > > > FORMAT > > > > > > > > > > > > > > > > > > > > On Thu, Sep 4, 2014 at 5:20 > AM, Petr > > Danecek > > > > <[email protected]> wrote: > > > > > Hi Jonathan, > > > > > > > > > > these are good > questions. Could > > you please > > > try with > > > > the latest > > > > > release, > > > > > I am happy to answer > any > > remaining issues > > > not solved > > > > by the > > > > > upgrade. > > > > > > > > > > Cheers, > > > > > Petr > > > > > > > > > > On Sun, 2014-08-17 at > 01:34 > > -0400, Jo R > > > wrote: > > > > > > Hello, > > > > > > I'm attemping to > get > > nucleotide > > > frequency > > > > information at > > > > > a number > > > > > > of positions across > a number > > of samples, > > > and am > > > > having > > > > > difficulty > > > > > > interpreting some > output. Any > > insights > > > would be > > > > > appreciated. > > > > > > > > > > > > > > > > > > I'm running the > following > > command: > > > > > > > > > > > > > > > > > > samtools mpileup > -BQ0 > > -d10000000 -l > > > > VariantBed.bed -uf > > > > > $refFile $bam > > > > > > | bcftools view -bcg > - | > > bcftools view - > > > > > > > > > > > ${sampleName}_validation.vcf > > > > > > > > > > > > > > > > > > > > > > > > I notice that this > command > > creates an > > > output file > > > > with > > > > > > an unpredictable > number of > > rows. > > > Running the > > > > command using > > > > > the same > > > > > > bed file on a set of > > different .bam > > > files creates > > > > a set of > > > > > output vcf > > > > > > files with a wide > distribution > > in > > > numbers of rows. > > > > > > > > > > > > > > > > > > I presumed that the > difference > > in row > > > numbers > > > > means that > > > > > some > > > > > > positions drop out > on > > some .bam files > > > because > > > > those samples > > > > > lacked > > > > > > coverage where other > samples > > had > > > coverage. > > > > > > > > > > > > > > > > > > If that's the case, > though, I > > don't know > > > what to > > > > make of > > > > > lines like > > > > > > the following one: > > > > > > > > > > > > > > > > > > 1 2160881 . > G > > . > > > 28.2 . > > > > > > > > DP=0;VDB=0.0003;;AC1=2;FQ=-30 PL > > > 0 > > > > > > > > > > > > > > > > > > > > > > > > here, it looks like > DP=0, but > > this > > > position still > > > > got > > > > > reported in the > > > > > > vcf output. I also > don't see > > AC1 in the > > > legend > > > > for INFO > > > > > tags in the > > > > > > samtools > specification page, > > so I don't > > > know what > > > > to make of > > > > > a value > > > > > > of 2. > > > > > > > > > > > > > > > > > > So, I am confused. > Positions > > with a > > > positive > > > > value of DP > > > > > and DP4 make > > > > > > sense to me. But > why are some > > positions > > > > completely ommitted > > > > > from the > > > > > > vcf output, and > other > > positions > > > reporting a DP=0? > > > > > > > > > > > > > > > > > > Thanks for any > advice. > > > > > > > > > > > > > > > > > > Best regards, > > > > > > Jonathan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > > > > > > > > > _______________________________________________ > > > > > > Samtools-help > mailing list > > > > > > > > [email protected] > > > > > > > > > > > > > > > > https://lists.sourceforge.net/lists/listinfo/samtools-help > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > The Wellcome Trust > Sanger > > Institute is > > > operated by > > > > Genome > > > > > Research > > > > > Limited, a charity > registered > > in England > > > with > > > > number 1021457 > > > > > and a > > > > > company registered in > England > > with number > > > 2742969, > > > > whose > > > > > registered > > > > > office is 215 Euston > Road, > > London, NW1 > > > 2BE. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > The Wellcome Trust Sanger > Institute is > > operated by > > > Genome > > > > Research > > > > Limited, a charity registered > in England > > with > > > number 1021457 > > > > and a > > > > company registered in England > with number > > 2742969, > > > whose > > > > registered > > > > office is 215 Euston Road, > London, NW1 > > 2BE. > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > The Wellcome Trust Sanger Institute is > operated by > > Genome > > > Research > > > Limited, a charity registered in England > with > > number 1021457 > > > and a > > > company registered in England with number > 2742969, > > whose > > > registered > > > office is 215 Euston Road, London, NW1 > 2BE. > > > > > > > > > > > > > > > > > > > -- > > The Wellcome Trust Sanger Institute is operated by > Genome > > Research > > Limited, a charity registered in England with > number 1021457 > > and a > > company registered in England with number 2742969, > whose > > registered > > office is 215 Euston Road, London, NW1 2BE. > > > > > > > > > > > -- > The Wellcome Trust Sanger Institute is operated by Genome > Research > Limited, a charity registered in England with number 1021457 > and a > company registered in England with number 2742969, whose > registered > office is 215 Euston Road, London, NW1 2BE. > > > -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ------------------------------------------------------------------------------ Meet PCI DSS 3.0 Compliance Requirements with EventLog Analyzer Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box PCI DSS Reports Are you Audit-Ready for PCI DSS 3.0 Compliance? Download White paper Comply to PCI DSS 3.0 Requirement 10 and 11.5 with EventLog Analyzer http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk _______________________________________________ Samtools-help mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/samtools-help
