Hi Danielle,
Sorry that message was sent before it was complete. Here is the complete reply:
In a nutshell, the chain scoring scheme is somewhat complicated, but ballpark
estimates of scores can be made from expected size of aligned blocks, percent
identity, gap size and gap frequency.
For most species pairs, we use this substitution matrix to score gapless
aligned blocks:
A C G T
A 91 -90 -25 -100
C -90 100 -100 -25
G -25 -100 100 -90
T -100 -25 -90 91
The matrix is included in the details page for each chain track. It gives more
precision in scoring matches/mismatches than the Smith-Waterman +1/-1. I
believe that was derived a long time ago (2001?) by Webb Miller and colleagues
from observed match and mismatch frequencies in manually checked multi-species
alignments of the HoxD region, which was sequenced long before entire genome
sequences became available. We use more stringent matrices for closely related
mammals e.g. mouse-rat or human-chimp. But as a rule of thumb, expect scores
of aligned blocks from this scheme to be a couple orders of magnitude larger
than Smith-Waterman scores because substitution scores are normalized to
max=100.
Gaps in chains are penalized with a piecewise linear function that penalizes
gap openings the most, with less harsh penalties as gaps are larger (we expect
large gaps in cross-species alignments due to insertions, rearrangements etc.).
For mammal-mammal, we use this matrix (which also appears on chain details
pages):
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 11360,000
8300 218300 318300
tableSize 11 means that there are 11 points in the piecewise function.
smallSize 111 is an implementation detail (boundary for precomputing
penalties). position corresponds to gap size boundaries. Query gap penalties
are in qGap, target gap penalties in tGap, and gaps in both target and query
(meaning there is unalignable sequence in both) are in bothGap. Given a gap
size, if it corresponds to one of the values in position, it gets the
corresponding penalty. If it is between two positions, its score is linearly
interpolated between the two positions' penalties. Then the penalty is
subtracted from the chain score.
The penalties can be divided by 100 to get a (low) estimate of how many
matching bases are required to make up for the gap penalty (beyond the number
of matching bases required to make up for mismatching bases).
Another approach to making sense of chain scores is to look at cahin score
histograms, e.g. for all chains or for chains of a particular length and gap
size extracted using chainFilter.
Jim said that the score thresholds were set to exclude most of (Robert
Baertsch's) pseudogenes. So I suppose there were several trials of trying a
chain score picked from a histogram of chains meeting some desired size and gap
content (netFilter usage: "-minSynScore... def=200,000... Default covers 27,000
bases including 9,000 aligning--a very stringent requirement."), and then
intersecting the resulting filtered nets with the pseudogene track. (Also,
code comment:b*sc
----- "Danielle Lemay" <[email protected]> wrote:
> From: "Danielle Lemay" <[email protected]>
> To: "Angie Hinrichs" <[email protected]>
> Cc: [email protected]
> Sent: Tuesday, February 9, 2010 11:28:26 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [Genome] netFilter to produce syntenic regions
>
> Dear Angie,
>
> Thank you for the thorough explanation. The default settings are much
> too stringent for my purposes. Is it possible for you to explain to
> me how the two scores are calculated, TopScore and SynScore, so that a
> user can make intelligent parameter choices?
>
> Thanks,
> Danielle
>
>
> On Mon, Feb 8, 2010 at 4:25 PM, Angie Hinrichs <[email protected]>
> wrote:
> > Hi Danielle,
> >
> >> It has been suggested in the archives that "netFilter -syn" can be
> >> used to do this and elsewhere in the archives there's an
> indication
> >> that it has been "tuned" to human-mouse. What does that mean?
> Will
> >> it still work for other mammal pairs?
> >
> > Yes, that is the aim of "netFilter -syn". I will describe how it
> > works in as much detail as I can, and hopefully that will be enough
> > information for you to decide whether it will suit your purposes.
> >
> > "netFilter -syn" uses these parameters described in the usage (run
> > just "netFilter" with no arguments to see the full usage):
> >
> > -syn - do filtering based on synteny (tuned for
> human/mouse).
> > -minTopScore=N - Minimum score for top level alignments. default
> 300000
> > -minSynScore=N - Min syntenic block score (def=200,000).
> > Default covers 27,000 bases including 9,000
> > aligning--a very stringent requirement.
> > -minSynSize=N - Min syntenic block size (def=20,000). -
> > -minSynAli=N - Min syntenic alignment size(def=10,000). -
> > -maxFar=N - Max distance to allow synteny (def=200,000).
> >
> > The -syn filter has multiple stages:
> >
> > First, a net passes the -syn filter if its score is at least
> > minSynScore, its size on the reference genome is at least
> minSynSize,
> > and its alignment size (number of bases in gapless aligned blocks
> in
> > chain) is at least minSynAli.
> >
> > If those criteria are not satisfied, but the net has been
> classified
> > as "top" by the netSyntenic program, and its score is at least
> > minTopScore, it passes.
> >
> > If the net has been classified as "nonSyn" by netSyntenic, it
> fails.
> >
> > If the net's distance from the position predicted by its parent net
> > is greater than maxFar, it fails.
> >
> > Finally, if the net has not met any of the above criteria, it
> passes.
> > The source code has this comment:
> > /* For all others, assume syntenic. This keeps tandem dupes, small
> inversion,
> > * and translocations. */
> >
> > The source tree has a description of the .net format that may
> provide
> > insight into some of these parameters ("ali" is compared to
> minSynAli,
> > "qFar" is compared to maxFar):
> kent/src/hg/mouseStuff/netFormat.doc
> >
> > Source code for the chain* and net* programs is also in mouseStuff,
> > and ultimately that is where to look for answers about the detailed
> > workings of the programs. (Or, as you have done, you can ask the
> > genome list for a reading of the entrails of the code :)
> >
> > I don't know the process by which the parameters were tuned --
> perhaps
> > Jim Kent can comment.
> >
> > My hunch is that it will probably do a decent job with other pairs
> of
> > mammalian genomes; I suggest running netFilter on a downloaded .net
> > file, using netToBed to make a custom track, and then visually
> > comparing the custom track to the net track in a handful of regions
> --
> > do the retained and excluded parts of the net seem reasonable? If
> > not, is there a netFilter parameter that could be changed to get
> the
> > sensitivity / specificity that you're looking for? I guess that
> would
> > be a tuning process, by subjectively evaluating the output of a
> very
> > complex quantitative system.
> >
> >
> >> Also, is it possible to specify the permissible gap size? For
> >> example, let's say there are two level 1 nets that are separated by
> a
> >> 100bp or less and I want to join them, is that possible with the
> >> program?
> >
> > netFilter can gloss over gaps within a net, but not between
> top-level
> > nets. But hopefully there wouldn't be cases of adjacent top-level
> > nets that should have been joined -- that should have happened at
> the
> > chaining step. The command line arg that you want is -minGap=100:
> >
> > -minGap=N - restrict to those with gap size (tSize) >= minSize
> >
> > -- that should read "restrict gaps to those with size (tSize) >=
> > minGap".
> >
> > Hope that helps,
> >
> > Angie
> >
> > ----- "Danielle Lemay" <[email protected]> wrote:
> >
> >> From: "Danielle Lemay" <[email protected]>
> >> To: [email protected]
> >> Sent: Saturday, February 6, 2010 3:21:28 PM GMT -08:00 US/Canada
> Pacific
> >> Subject: [Genome] netFilter to produce syntenic regions
> >>
> >> Hi,
> >>
> >> This question is regarding the netFilter program.
> >>
> >> Given a pair of mammalian genomes (e.g. mouse and opossum), I'd
> like
> >> to produce a list of all syntenic regions. This is presumably a
> >> combination of the level 1 net with the lower level nets that are
> >> syntenic.
> >>
> >> It has been suggested in the archives that "netFilter -syn" can be
> >> used to do this and elsewhere in the archives there's an
> indication
> >> that it has been "tuned" to human-mouse. What does that mean?
> Will
> >> it still work for other mammal pairs?
> >>
> >> Also, is it possible to specify the permissible gap size? For
> >> example, let's say there are two level 1 nets that are separated by
> a
> >> 100bp or less and I want to join them, is that possible with the
> >> program?
> >>
> >> Thanks,
> >> Danielle
> >>
> >> ########################################
> >> Danielle G. Lemay, PhD
> >> Postdoctoral Scholar, Bruce German's Lab
> >> Food Science and Technology Department
> >> University of California at Davis
> >> (530) 297 7688
> >> _______________________________________________
> >> Genome maillist - [email protected]
> >> https://lists.soe.ucsc.edu/mailman/listinfo/genome
> >
>
>
>
> --
> ########################################
> Danielle G. Lemay, PhD
> Postdoctoral scholar, German Lab
> Food Science and Technology Department
> University of California at Davis
> (530) 297 7688
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome