Hi Petr,
I upgraded to the latest release (htslib_1.0), but the -t option you
suggested didn't produce any output. It appears that -t means something
else in this version:

  -t, --targets [^]<region>   similar to -r but streams rather than
                              index-jumps. Exclude regions with "^" prefix
  -r, --regions <region>      restrict to comma-separated list of regions
Any thoughts?
Jonathan
On Thu, Oct 2, 2014 at 3:49 AM, Petr Danecek <[email protected]> wrote:
> Hi Jonathan,
>
> new versions of mpileup have the -t DPR,INFO/DPR option which outputs
> depth for each observed base.
>
> Cheers,
> petr
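
For readers following the DPR suggestion above, here is a minimal Python sketch of how per-allele depths might be read back out of a record once the tag is present. It assumes that DPR carries one integer per allele, REF first and then each ALT in order; check the header of your own VCF before relying on that layout. The function name `allele_depths` and the example record are made up for illustration and are not taken from this thread.

```python
def allele_depths(ref, alt, info):
    """Pair each allele with its depth from an INFO/DPR tag.

    Assumption: DPR lists one integer per allele, REF first, then the
    ALT alleles in the order they appear in the ALT column.
    """
    # Keep only key=value items; flags and empty items are skipped.
    tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    depths = [int(value) for value in tags["DPR"].split(",")]
    alleles = [ref] + alt.split(",")
    return dict(zip(alleles, depths))


# Hypothetical record, for illustration only.
print(allele_depths("G", "C,T", "DP=28;DPR=19,8,1"))
```

Dividing each value by the sum of all values would then give the per-allele proportion at that site.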
>
>
> On Wed, 2014-10-01 at 12:49 -0400, jbr950 wrote:
> > Hi Petr,
> > Thanks for your time thus far. I'm trying to query some bam files
> > at specified locations to have a look at the allele frequencies. The
> > motivation is to 'validate' variants called in one dataset in a novel
> > dataset using a simpler approach. Ideally, I'd get the raw counts
> > of the number of reads mapping to A,T,C,G, and/or indel (if detected).
> >
> >
> > I tried mpileup two ways (with -cg in bcftools view to call variants,
> > and without), and neither seems to provide me the data I would like
> > in the corresponding output:
> >
> > samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile |
> > $bcftools view - > no_cg.vcf
> >
> > samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile |
> > $bcftools view -cg - > cg.vcf
> >
> >
> >
> > Then I check the context around 2 known variants:
> >
> >
> > Variant 1
> > grep -C 1 105213319 cg.vcf
> >
> > 14 105213318 . A . 117 . DP=29;AF1=0;AC1=0;DP4=14,15,0,0;MQ=20;FQ=-114 PL 0
> > 14 105213319 . G C 25 . DP=28;VDB=0.0398;AF1=0.5;AC1=1;DP4=10,9,4,5;MQ=20;FQ=28;PV4=1,1,1,1 GT:PL:GQ 0/1:55,0,128:58
> > 14 105213320 . T . 114 . DP=28;AF1=0;AC1=0;DP4=14,14,0,0;MQ=20;FQ=-111 PL 0
> >
> >
> > Here, the variant was called and the DP4 field tells me the number of
> > reference and non-reference (presumably "C") alleles. I am assuming
> > that there are 0 T's and A's.
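
The DP4 reading described above can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_info` and `dp4_counts` are invented here, and the two INFO strings are the ones from the cg.vcf records quoted in this message. Per the header description quoted later in the thread, DP4 holds high-quality ref-forward, ref-reverse, alt-forward and alt-reverse base counts, so it lumps all non-reference alleles together and cannot by itself show that there are no T's or A's.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate empty items such as a stray ';;'
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def dp4_counts(info):
    """Return (ref_reads, alt_reads, alt_fraction) from the DP4 tag."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (
        int(x) for x in parse_info(info)["DP4"].split(",")
    )
    ref = ref_fwd + ref_rev
    alt = alt_fwd + alt_rev
    total = ref + alt
    return ref, alt, (alt / total if total else 0.0)


# INFO strings copied from the two cg.vcf records in this message.
variant1 = "DP=28;VDB=0.0398;AF1=0.5;AC1=1;DP4=10,9,4,5;MQ=20;FQ=28;PV4=1,1,1,1"
variant2 = "DP=273;VDB=0.0365;AF1=0;AC1=0;DP4=110,100,31,32;MQ=20;FQ=-110;PV4=0.67,1,1,1"

for name, info in (("variant 1", variant1), ("variant 2", variant2)):
    ref, alt, fraction = dp4_counts(info)
    print(name, "ref:", ref, "non-ref:", alt, "non-ref fraction:", round(fraction, 3))
```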
> >
> >
> >
> >
> > grep -C 1 105213319 no_cg.vcf
> > 14 105213318 . A X 0 . DP=29;I16=14,15,0,0,979,34421,0,0,580,11600,0,0,511,11223,0,0 PL 0,87,198
> > 14 105213319 . G C,T,X 0 . DP=28;I16=10,9,4,5,645,23047,338,12742,380,7600,180,3600,339,7223,177,4179;VDB=0.0398 PL 55,0,128,96,140,195,112,152,198,201
> > 14 105213320 . T X 0 . DP=28;I16=14,14,0,0,924,32326,0,0,560,11200,0,0,520,11560,0,0 PL 0,84,195
> >
> >
> > Here, I see that the non-ref alleles consist of C's and T's,
> > although I don't see any way to determine the relative proportion or
> > absolute number of each.
> >
> >
> >
> >
> >
> > Trying Variant 2:
> > grep -C 1 154574018 cg.vcf
> > 1 154574017 . T . 283 . DP=274;VDB=0.0365;AF1=0;AC1=0;DP4=146,128,0,0;MQ=20;FQ=-282 PL 0
> > 1 154574018 . C . 113 . DP=273;VDB=0.0365;AF1=0;AC1=0;DP4=110,100,31,32;MQ=20;FQ=-110;PV4=0.67,1,1,1 PL 0
> > 1 154574019 . T . 283 . DP=278;VDB=0.0365;AF1=0;AC1=0;DP4=142,136,0,0;MQ=20;FQ=-282 PL 0
> >
> >
> > This variant was not called at all. I may be willing to "assume" that
> > the 63 'non-ref' reads reported in the DP4 field correspond to my
> > variant, but it would be nice to be sure.
> >
> >
> >
> >
> >
> > grep -C 1 154574018 no_cg.vcf
> > 1 154574017 . T X 0 . DP=274;I16=146,128,0,0,10042,376816,0,0,5463,109209,0,0,5159,117099,0,0;VDB=0.0365 PL 0,255,255
> > 1 154574018 . C G,X 0 . DP=273;I16=110,100,31,32,7610,283160,2311,86963,4183,83609,1260,25200,3918,88780,1242,28290;VDB=0.0365 PL 0,83,117,255,255,220
> > 1 154574019 . T X 0 . DP=278;I16=142,136,0,0,10104,376490,0,0,5543,110809,0,0,5167,116999,0,0;VDB=0.0365 PL 0,255,255
> >
> >
> >
> >
> > Here, the C to G is reported, but again I only have the total read
> > depth via the DP field, with no breakdown by nucleotide (or am I
> > missing it?).
> >
> >
> > It's true I'm using Samtools 0.1.18. If upgrading to the latest would
> > give me a better option, I'm happy to do that, but after scouring the
> > available documentation, I don't see a way to do this.
> >
> >
> > In other words, I'd like to query, at many positions, the proportion
> > of A's, T's, C's, G's and/or detected indels. Mpileup (at least how
> > I'm currently using it) gets me close but not all the way there.
> >
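
One possible way to get the per-nucleotide breakdown asked for above is to tally the read-bases column of the plain-text pileup, which mpileup prints when it is run without -u or -g. The Python below is a sketch under that assumption, not a definitive implementation: it expects the usual single-sample columns (chromosome, position, reference base, depth, read bases, base qualities), and the function names and the example line are hypothetical.

```python
import re
from collections import Counter

INDEL_LENGTH = re.compile(r"\d+")


def count_pileup_bases(ref_base, bases):
    """Tally A/C/G/T/N, insertions and deletions in a pileup read-bases string."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # read start: skip '^' and the mapping-quality character
        elif c in "+-":
            # Indel after this position: sign, length digits, then the sequence.
            length = INDEL_LENGTH.match(bases, i + 1).group()
            counts["ins" if c == "+" else "del"] += 1
            i += 1 + len(length) + int(length)
        elif c in ".,":
            counts[ref_base.upper()] += 1  # match on forward or reverse strand
            i += 1
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1  # mismatch; lower case means reverse strand
            i += 1
        else:
            i += 1  # '$' (read end), '*' (deleted base) and other markers
    return counts


def count_pileup_line(line):
    """Return (chrom, pos, counts) for one line of single-sample text pileup."""
    chrom, pos, ref_base, _depth, bases = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), count_pileup_bases(ref_base, bases)


# Hypothetical pileup line, for illustration only.
example = "14\t105213319\tG\t10\t..,,C.c^F.$+2AT,-1t.\tIIIIIIIIII"
print(count_pileup_line(example))
```

The indel counts here are the number of reads carrying an insertion or deletion immediately after the position; the base those reads show at the position itself is counted separately.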
> >
> >
> >
> > Does that make sense? I'd appreciate any suggestions.
> >
> >
> >
> > Best regards,
> > Jonathan
> >
> >
> >
> > On Tue, Sep 23, 2014 at 11:47 AM, Petr Danecek <[email protected]> wrote:
> > Hi Jonathan,
> >
> > On Tue, 2014-09-23 at 11:05 -0400, jbr950 wrote:
> > > Hi Petr,
> > > The workflow documentation looks great - this effort is undoubtedly
> > > going to help a lot of researchers. Thanks for doing it.
> >
> > the credits for the workflow should not go to me, the htslib.org page
> > has been put together by Martin Pollard, Thomas Keane and others. :-)
> >
> > > In this instance, I actually don't need statistics and the raw pileups
> > > are fine. Which brings me back to the original issue. With the new
> > > version, the vcf outputs from mpileup, using the same .bed as input,
> > > are still of varying lengths. Do you know of a way to force mpileup
> > > to report data at every position in the .bed file for every run? Or
> > > alternatively, do you know why some positions report a DP=0 and other
> > > positions are omitted altogether?
> >
> > These DP=0 sites are emitted at places where there is an aligned read
> > with an indel. The indel may not appear in the output when the mpileup
> > machinery decides it is not significant enough.
> >
> > I hope this helps,
> > Petr
> >
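
Since the output length varies between runs, it can help to list exactly which requested positions are absent from a given VCF. The Python below is a rough sketch of that comparison, assuming a 3-column BED file with 0-based starts and exclusive ends (the standard BED convention) and 1-based VCF positions; the function names are hypothetical. It does not change what mpileup emits, it only reports the gaps.

```python
def bed_positions(bed_lines):
    """Expand BED intervals (0-based start, exclusive end) to 1-based positions."""
    wanted = set()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        for pos in range(int(start) + 1, int(end) + 1):
            wanted.add((chrom, pos))
    return wanted


def vcf_positions(vcf_lines):
    """Collect (chrom, pos) for every data record in a VCF."""
    seen = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split(None, 2)[:2]
        seen.add((chrom, int(pos)))
    return seen


def missing_positions(bed_lines, vcf_lines):
    """Return the sorted BED positions that have no record in the VCF."""
    return sorted(bed_positions(bed_lines) - vcf_positions(vcf_lines))


# Tiny in-memory example; in practice pass open file handles instead.
bed = ["1\t100\t103"]
vcf = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID", "1\t101\t.", "1\t103\t."]
print(missing_positions(bed, vcf))
```

Running this per sample would show whether the omitted rows coincide with positions that lack coverage in that sample.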
> >
> > > Thanks again,
> > > Jonathan
> > >
> > >
> > >
> > > On Tue, Sep 23, 2014 at 9:26 AM, Petr Danecek <[email protected]> wrote:
> > > There is a slowly growing documentation on the new website, please check
> > > the variant calling workflow
> > > http://www.htslib.org/workflow/
> > >
> > > You need to use `bcftools call`, not `bcftools view`.
> > >
> > > petr
> > >
> > >
> > > On Tue, 2014-09-23 at 09:25 -0400, jbr950 wrote:
> > > > Good catch - I was trying to use an old version of bcftools with the
> > > > new version of samtools.
> > > >
> > > > So I changed that to point to the new version of bcftools, and now:
> > > >
> > > > samtools-bcftools-htslib-1.0_x64-linux/bin/samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile > test.bcf
> > > >
> > > > So far so good.
> > > >
> > > > And then:
> > > > cat test.bcf | samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools view -cg -
> > > > Error: Could not parse --min-ac g
> > > >
> > > > On the other hand, calling:
> > > > cat test.bcf | samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools view -
> > > >
> > > > does create a nice output, so I'm inclined to let it go with the
> > > > defaults. I don't see equivalent arguments for the -c and -g flags, and
> > > > I'm not particularly attached to the use of Bayesian inference,
> > > > although it would be good to understand what differences I can expect
> > > > between the old method (using -c) and the new (sans -c).
> > > >
> > > > Thanks for your time and assistance!
> > > >
> > > > Jonathan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Sep 23, 2014 at 2:40 AM, Petr Danecek <[email protected]> wrote:
> > > > This does not look like output from samtools 1.0 - it outputs
> > > > VCFv4.2
> > > >
> > > > petr
> > > >
> > > > On Mon, 2014-09-22 at 14:53 -0400, jbr950 wrote:
> > > > > Hi Petr,
> > > > > Thanks for your reply. I grabbed the
> > > > > samtools-bcftools-htslib-1.0_x64-linux binary and tried again.
> > > > >
> > > > > When I'd used Samtools 0.1.18, the issue I emailed about was that the
> > > > > number of lines in output varied by .bam file, and I didn't understand
> > > > > why lines were being omitted, and why not in a common manner.
> > > > >
> > > > > Using version 1.0 and the same command (I checked on the older
> > > > > samtools and it is producing output), I get no output at all. Mpileup
> > > > > writes a header and then stops. I have copied and pasted the output.
> > > > >
> > > > > My .bed is 3 columns: chr start stop
> > > > > with no header.
> > > > >
> > > > > Any advice appreciated, thanks!
> > > > >
> > > > > Jonathan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > [mpileup] 1 samples in 1 input files
> > > > > (mpileup) Max depth is above 1M. Potential memory hog!
> > > > > [bcf_sync] incorrect number of fields (0 != 5) at 0:0
> > > > > [afs] 0:0.000
> > > > > ##fileformat=VCFv4.1
> > > > > ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
> > > > > ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
> > > > > ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
> > > > > ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
> > > > > ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
> > > > > ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
> > > > > ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
> > > > > ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
> > > > > ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
> > > > > ##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
> > > > > ##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
> > > > > ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
> > > > > ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
> > > > > ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
> > > > > ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
> > > > > ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
> > > > > ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
> > > > > ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
> > > > > ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> > > > > ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
> > > > > ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
> > > > > ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
> > > > > ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
> > > > > ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
> > > > > #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 4, 2014 at 5:20 AM, Petr Danecek <[email protected]> wrote:
> > > > > Hi Jonathan,
> > > > >
> > > > > these are good questions. Could you please try with the latest
> > > > > release, I am happy to answer any remaining issues not solved by the
> > > > > upgrade.
> > > > >
> > > > > Cheers,
> > > > > Petr
> > > > >
> > > > > On Sun, 2014-08-17 at 01:34 -0400, Jo R wrote:
> > > > > > Hello,
> > > > > > I'm attempting to get nucleotide frequency information at a number
> > > > > > of positions across a number of samples, and am having difficulty
> > > > > > interpreting some output. Any insights would be appreciated.
> > > > > >
> > > > > > I'm running the following command:
> > > > > >
> > > > > > samtools mpileup -BQ0 -d10000000 -l VariantBed.bed -uf $refFile $bam | bcftools view -bcg - | bcftools view - > ${sampleName}_validation.vcf
> > > > > >
> > > > > > I notice that this command creates an output file with
> > > > > > an unpredictable number of rows. Running the command using the same
> > > > > > bed file on a set of different .bam files creates a set of output vcf
> > > > > > files with a wide distribution in numbers of rows.
> > > > > >
> > > > > > I presumed that the difference in row numbers means that some
> > > > > > positions drop out on some .bam files because those samples lacked
> > > > > > coverage where other samples had coverage.
> > > > > >
> > > > > > If that's the case, though, I don't know what to make of lines like
> > > > > > the following one:
> > > > > >
> > > > > > 1 2160881 . G . 28.2 . DP=0;VDB=0.0003;;AC1=2;FQ=-30 PL 0
> > > > > >
> > > > > > Here, it looks like DP=0, but this position still got reported in the
> > > > > > vcf output. I also don't see AC1 in the legend for INFO tags in the
> > > > > > samtools specification page, so I don't know what to make of a value
> > > > > > of 2.
> > > > > >
> > > > > > So, I am confused. Positions with a positive value of DP and DP4 make
> > > > > > sense to me. But why are some positions completely omitted from the
> > > > > > vcf output, while other positions report DP=0?
> > > > > >
> > > > > > Thanks for any advice.
> > > > > >
> > > > > > Best regards,
> > > > > > Jonathan
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ------------------------------------------------------------------------------
> > > > > > _______________________________________________
> > > > > > Samtools-help mailing list
> > > > > > [email protected]
> > > > > > https://lists.sourceforge.net/lists/listinfo/samtools-help
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > The Wellcome Trust Sanger Institute is operated by Genome Research
> > > > > Limited, a charity registered in England with number 1021457 and a
> > > > > company registered in England with number 2742969, whose registered
> > > > > office is 215 Euston Road, London, NW1 2BE.
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
> >
> >
>
>
>
>
>
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help