[Xplor-nih] PASD for structure determination of proteins from solid state data

John Kuszewski Thu, 30 Nov 2006 20:34:36 -0500

On Nov 23, 2006, at 8:23 AM, jtn at chem.au.dk wrote:

> Dear Xplor-NIH developers,
>
> Some time ago I posted some questions regarding using the PASD/MARVIN
> facility to determine the structure of a protein from solid state data.
> Thanks a lot, your answers helped a lot. I am now ready to start the
> calculations, but I have some more specific questions for you regarding
> data formats and specific (non-existing) options in PASD. I wondered if
> there is some documentation out there on PASD; alternatively, I hope you
> can help me...
>
> I think PASD is a very elegant and robust method to handle ambiguous
> data and false peaks, and I would prefer using this method. However, for
> the solid state data I work with, the ambiguity is very pronounced. I.e.,
> for your case of cyanovirin-N you have an assignment degeneracy of >=5 in
> 5% of the cases and degeneracy >=10 in less than 2% of the cases (as
> judged from the output of "initialMatch3dC.tcl"), whereas for my case with
> solid state data I have degeneracy >=5 in ca. 2/3 of the cases and
> degeneracy >=10 in 1/3 of the cases, and also degeneracies close to 100!
> This is because the data is only 2D and has large line widths. Altogether,
> this means that the fraction of inconsistent long-range restraints is
> higher than 80%.
> I would like to try PASD anyway.


How much higher than 80%?  If it's 95% bad long-range data, it'll be a
very long shot.  One thing I've learned is that the completeness of the
dataset is also important--you need to have a reasonably high number of
good restraints available for marvin to converge.  Missing a lot of peaks
in a NOESY (e.g., from poor peak picking) is more of a problem than
picking a lot of noise.

> The type of restraints that can be derived from the data I use is
> between carbon pairs. How does that agree with the data format of the
> ".shifts" and ".PCK" files? Does the initialMatch3d.tcl script for
> matching C-C chemical shift assignment possibilities with peak positions
> need any modification? I hope to have 3D spectra available later to
> reduce the degeneracy.

You could modify the initial matching script, but most of the work in  
the current
scripts is pre-filtering the peak assignments, using techniques that  
probably
wouldn't apply well to your case.  So your suggestion (below) of  
building your initial
tables directly is quite reasonable.

>
> One thing I would rather do for the moment is to do the initial-matching
> step myself and proceed straight to the SA steps. I have developed a
> purpose-designed method to filter out unlikely assignments in each of the
> restraints, based on an inferential approach using prior probabilities
> derived from the database of structures and information on secondary
> structure. I plan to use this inferential restraint assignment (IRA)
> method to reduce the degeneracies of the restraints (with the risk of
> neglecting some information).

Your IRA sounds like a reasonable approach to your extremely  
degenerate data.
Of course, you could be opening yourself to biasing your structures,  
but it's
hard to say how big an issue this could be without hearing more details.

>
> I am trying to build the required ".noes" and ".shiftAssignments" files.
> Could you explain to me the precise syntax of these files?
>

Sure!  There are two files that are used to express the data in each  
NOESY spectrum:
The first contains information from the shift table, and the second  
contains information from
the peak-location table.  I'll show you the specifics first, and  
explain the logic along the way.

The .shiftAssignments table contains information from your original  
chemical shift assignment table.

The shiftAssignments file contains an entry for every proton (or  
group of equivalent protons, like a methyl
group) that has an entry in the chemical shift assignment table AND  
is expected to be visible in a
particular NOESY experiment.  Note that if you have several NOESYs  
(eg., 3dc, 3dn, 4dcn, etc), there will
be a SEPARATE .shiftAssignments file for EACH of those spectra.  In  
addition, since some types of
protons can only be expected to appear along one dimension of a  
spectrum (eg., in a 3d 15N NOESY,
Halphas can't appear on the nitrogen-bonded dimension), each entry in  
the shiftAssignments file
is flagged as either belonging to the "from" or "to" dimension.  If a  
particular proton can appear on
either dimension (eg., HNs in a 3dN spectrum), TWO SEPARATE  
shiftAssignments are created, one
marked as from, and the other marked as to.

Here's the specific syntax for the .shiftAssignments file:

Lines that begin with an exclamation point are comments, just as in  
the classic xplor.  The standard
analysis routines produce lots and lots of comments for your  
amusement and edification.

Entries in the .shiftAssignments file begin with the word  
"shiftAssignment", followed by a name, which
must be a unique string.

Various data about each shiftAssignment are then specified using  
flags.  They can appear in any order.

-protonSelection (sel)
    Selects the proton(s) this shiftAssignment refers to.  Usually, these
    selections are a single proton or a group of equivalent protons.

-heavyatomSelection (sel)
    Selects the heavy atom (usually directly bonded to the proton
    selection).  Used only if this shiftAssignment is on an axis that
    represents a through-bond correlation (e.g., in the typical setup, the
    "from" dimension in a 3dN spectrum represents the amides, and the
    corresponding shiftAssignments will have both proton and heavyatom
    selections defined).

-protonShift <real>
    The chemical shift, in ppm, of the atoms defined in the
    -protonSelection flag.

-heavyatomShift <real>
    The chemical shift, in ppm, of the atoms defined in the
    -heavyatomSelection flag.

One, but not both, of the following flags must be set:

-to
    Indicates that this shiftAssignment is to be associated with the "to"
    dimension of the spectrum.

-from
    Indicates that this shiftAssignment is to be associated with the
    "from" dimension of the spectrum.

-toFromPartner <string>
    The name of another shiftAssignment in this table that represents the
    same atoms on the other dimension.  ShiftAssignments that can only
    appear on one dimension have no partners, and this flag is not used.

-note <string>
    Just a way to attach arbitrary text to each entry.  You can continue
    beyond one line by adding another -note flag.

Each entry ends with the word "end".  So a typical .shiftAssignments file
might look like:


!
! This is a comment.  Hi  Mom!
!

shiftAssignment abc1
    -protonSelection (name ha and resid 10)
    -protonShift 4.1
    -to
    -note from file foo, entry number 1234
end

shiftAssignment abc2
    -protonSelection (name hn and resid 14)
    -heavyatomSelection (name n and resid 14)
    -protonShift 7.8
    -heavyatomShift 121.2
    -from
    -toFromPartner abc3
end

shiftAssignment abc3
    -protonSelection (name hn and resid 14)
    -protonShift 7.8
    -to
    -toFromPartner abc2
end
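
If you are building this table yourself, entries like the ones above can
be generated with a few lines of script.  Here is a minimal Python sketch
that writes one entry using only the flags described above.  The function
name and its arguments are made up for illustration and are not part of
PASD or Xplor-NIH.

```python
def format_shift_assignment(name, proton_sel, proton_shift, dimension,
                            heavy_sel=None, heavy_shift=None,
                            partner=None, note=None):
    """Return the text of one shiftAssignment entry.

    Hypothetical helper: it only mirrors the flags described in this
    message and is not an Xplor-NIH function.
    """
    if dimension not in ("to", "from"):
        raise ValueError("dimension must be 'to' or 'from'")
    lines = [f"shiftAssignment {name}",
             f"    -protonSelection ({proton_sel})"]
    if heavy_sel is not None:
        lines.append(f"    -heavyatomSelection ({heavy_sel})")
    lines.append(f"    -protonShift {proton_shift}")
    if heavy_shift is not None:
        lines.append(f"    -heavyatomShift {heavy_shift}")
    lines.append(f"    -{dimension}")          # exactly one of -to / -from
    if partner is not None:
        lines.append(f"    -toFromPartner {partner}")
    if note is not None:
        lines.append(f"    -note {note}")
    lines.append("end")
    return "\n".join(lines)


# Reproduces the "abc3" entry from the sample file above.
print(format_shift_assignment("abc3", "name hn and resid 14", 7.8, "to",
                              partner="abc2"))
```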


The peak-location data are contained in a file ending in ".noes" (or
".peaks"--I've changed terminology in my current work).

Entries in the .noes file begin with the word "restraint".  After  
"restraint" comes a name, which must be a unique
string.  I usually generate it from a prefix that indicates the type  
of NOESY, followed by the peak's
ID number that most peak table formats provide.  You can do whatever  
you like, as long as each
name is unique.

After the name, a whole bunch of details about the peak are defined  
in any order you like, using
various flags.  Here they are:

-bounds <real> <real>
    Distance bounds for this peak, in A.  Arbitrary order--the larger
    number is taken to be the upper bound.

-intensity <real>
    Records the NOESY peak's intensity (or volume) in arbitrary units.

-fromProtonShift <real>
-toProtonShift <real>
-fromHeavyatomShift <real>
-toHeavyatomShift <real>
    Flags to record the NOESY peak's position (in ppm) along each spectral
    dimension.  If you're working with a 2D NOESY, only the first two
    (-fromProtonShift and -toProtonShift) would be defined.

-note <string>
    Just a way to attach arbitrary text to each restraint.  You can
    continue beyond one line by adding another -note flag.

If you have a possible assignment for this peak, you define it here with
the following statement and some more flags:

assign <string> <from SA name> <to SA name>
    Defines a new peak assignment for the current peak.  Also needs a
    unique name, which is usually derived from the peak's name.  Also
    requires the names of two entries in the shiftAssignments file,
    defining this possible assignment of this peak to be from a given
    proton to another given proton.

-upBoundCorrection <real>
    Defines a distance (in A) to be added to the peak's upper bound when
    evaluating this peakAssignment.  Intended for cases where you have a
    peak assignment that contains a methyl group.

-lowBoundCorrection <real>
    Similar to the -upBoundCorrection, but only there for the sake of
    completeness--I can't think of a reason to use it.

-likelihood <real>
    Defines this peak assignment's previous likelihood, which is a number
    between 0 and 1.  Usually calculated from the results of the previous
    pass's structure calculations.

-note <string>
    Just a way to attach arbitrary text to each peak assignment.  You can
    continue beyond one line by adding another -note flag.

-good
    This flag is only really used in evaluating performance against a
    protein of known structure.  It defines this peak assignment as being
    consistent with the known structure, and allows the analysis routines
    to say more about marvin's performance.  It's obviously optional.

Each assign ends with the word "end", and each restraint ends with an
"end" too.  So a typical noes file looks like:

!
! Another comment
!

restraint 3dn1234
    -fromProtonShift 7.78
    -fromHeavyatomShift 121.21
    -toProtonShift 4.08
    -intensity 10500000
    -bounds 3.0 1.8
    -note from file bar, peak ID 1234
    assign 3dn1234_1 abc2 abc1
       -note intraresidue
       -good
       -likelihood 0.9
    end
    assign 3dn1234_2 abc2 abc4
       -note long range
       -likelihood 0.02
    end
end

restraint 3dn1235
    -fromProtonShift 7.77
    -fromHeavyatomShift 121.22
    -toProtonShift 4.51
    -intensity 633223
    -bounds 5.0 1.8
    -note unassigned
end
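
The same kind of script works for the .noes table.  The Python sketch
below writes one restraint with any number of assign blocks, using only
the flags listed above.  The function, the dictionary keys, and the names
and numbers in the usage example are all invented for illustration;
nothing here is an Xplor-NIH call.

```python
ALLOWED_SHIFT_FLAGS = ("fromProtonShift", "toProtonShift",
                       "fromHeavyatomShift", "toHeavyatomShift")


def format_restraint(name, shifts, bounds, intensity=None, note=None,
                     assignments=()):
    """Return the text of one restraint entry (hypothetical helper).

    shifts      -- dict mapping a shift flag name to its value in ppm
    bounds      -- two distances in A, in either order
    assignments -- dicts with keys 'name', 'from', 'to' and, optionally,
                   'likelihood' and 'note'
    """
    lines = [f"restraint {name}"]
    for flag, ppm in shifts.items():
        if flag not in ALLOWED_SHIFT_FLAGS:
            raise ValueError(f"unknown shift flag: {flag}")
        lines.append(f"    -{flag} {ppm}")
    if intensity is not None:
        lines.append(f"    -intensity {intensity}")
    lines.append(f"    -bounds {bounds[0]} {bounds[1]}")
    if note:
        lines.append(f"    -note {note}")
    for a in assignments:
        # 'from' and 'to' must name entries in the shiftAssignments file.
        lines.append(f"    assign {a['name']} {a['from']} {a['to']}")
        if "likelihood" in a:
            if not 0.0 <= a["likelihood"] <= 1.0:
                raise ValueError("likelihood must be between 0 and 1")
            lines.append(f"       -likelihood {a['likelihood']}")
        if a.get("note"):
            lines.append(f"       -note {a['note']}")
        lines.append("    end")
    lines.append("end")
    return "\n".join(lines)


# Made-up 2D example: one peak with a single candidate assignment.
print(format_restraint(
    "cc17",
    {"fromProtonShift": 55.2, "toProtonShift": 30.1},
    bounds=(6.0, 1.8),
    assignments=[{"name": "cc17_1", "from": "c14ca_from",
                  "to": "c52cb_to", "likelihood": 0.7}]))
```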


The reason I have a separate shiftAssignments table defined for each  
NOESY spectrum is to allow me
to make calculations that analyze each proton as a whole, looking,  
for example, at its NOE completeness,
or figuring out if I need to correct the value of its chemical shift,  
based on the locations of intraresidue
peaks.

>
> All restraints in my case are between carbons--what do "proton" and
> "heavyAtom" mean in my context?

Yes, my terminology is awkward for your data.  In your case, since  
your distances are observed between
pairs of carbons, you'd create shiftAssignments with carbon atoms  
selected in the -protonSelection flags,
and their chemical shift in the -protonShift flags.

If you had higher-dimensional data, akin to a 3d NOESY, you could  
then define some shiftAssignments
as having -heavyatomSelections as well.
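
For a 2D carbon-carbon spectrum that might look like the Python sketch
below.  It assumes that every carbon can show up on either dimension, so
it follows the to/from rule described earlier and writes two partnered
entries per carbon; whether that holds for your experiment is something
you'd have to decide.  The naming scheme and the function are invented
for illustration.

```python
def carbon_shift_assignments(resid, atom, shift_ppm):
    """Return (from_entry, to_entry) for one carbon (hypothetical helper).

    The carbon goes into the -protonSelection / -protonShift flags, as
    suggested above for carbon-carbon data.
    """
    base = f"c{resid}{atom.lower()}"
    entries = []
    for dim, other in (("from", "to"), ("to", "from")):
        entries.append("\n".join([
            f"shiftAssignment {base}_{dim}",
            f"    -protonSelection (name {atom.lower()} and resid {resid})",
            f"    -protonShift {shift_ppm}",
            f"    -{dim}",
            # Partner is the same carbon on the other dimension.
            f"    -toFromPartner {base}_{other}",
            "end",
        ]))
    return tuple(entries)


# Made-up shift value for the CA of residue 14.
from_entry, to_entry = carbon_shift_assignments(14, "CA", 55.2)
print(from_entry)
print()
print(to_entry)
```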

>
> Another thing that I would like to do is to use the output assignment
> likelihoods from IRA as input prior assignment likelihoods in PASD, i.e.,
> use these derived estimates in place of lambda_p(i,j) in eq. (6) of your
> JACS 2004 paper, and then include all of the possible assignments in the
> restraint (i.e., the very high degeneracy). This means that I would like
> to use a different annealing protocol that starts with w0=1. One way to
> do that could be to start straightaway with the pass2 step. Is it
> possible to use user-provided values for the prior assignment likelihoods
> lambda_p(i,j)? I would very much like to try it!
>

You can set the initial likelihoods for each peak assignment with the  
-likelihood flag, as I showed in
the sample .noes file above.  As long as the values are between 0 and  
1, the code will work.
Note that if all of the peak assignments for a particular peak have  
likelihood 0, that implies that
you (or marvin) think this peak is probably noise or something.  If  
all the peak assignments for a
peak are 1, that implies that you think they're all reasonable.   
Having a peak with one peak assignment
that has a high likelihood and the others low means that you think  
this peak is good (because it has an
assignment with high likelihood), and that the correct assignment is  
the one with the high likelihood.
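
As a sanity check on the values you feed in, the reading above can be
written down as a small Python function.  This is only an illustration of
how to interpret the numbers, not something marvin does internally; the
function name and the 0.5 cutoff for "high" are arbitrary assumptions.

```python
def classify_peak(likelihoods, high=0.5):
    """Describe a peak from its assignments' prior likelihoods.

    Hypothetical helper; 'high' is an arbitrary illustrative cutoff.
    """
    if not likelihoods:
        raise ValueError("peak has no assignments")
    if any(not 0.0 <= p <= 1.0 for p in likelihoods):
        raise ValueError("likelihoods must be between 0 and 1")
    if all(p == 0.0 for p in likelihoods):
        return "probably noise"
    if all(p == 1.0 for p in likelihoods):
        return "all assignments reasonable"
    if sum(p >= high for p in likelihoods) == 1:
        return "one favoured assignment"
    return "mixed"


# The first peak in the sample .noes file above: 0.9 and 0.02.
print(classify_peak([0.9, 0.02]))
```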

> I also have a question regarding the calculation of the overall
> assignment likelihood in PASD (eq. 5 in the JACS 2004 paper). If one
> used Bayes' theorem to calculate the posterior probability that a given
> assignment in a restraint is true, then one would calculate the product
> of the prior probability (prior likelihood in your paper) and the
> likelihood (instantaneous likelihood in your paper). In PASD you use the
> sum of the two probabilities rather than the product; why did you prefer
> the sum? When, say, w0=0.5, the overall assignment likelihood would be
> almost equal for assignments with priors of 0.0001, 0.001 and 0.1. I was
> just curious whether you had any considerations of that kind... whether
> you use the sum so as not to get trapped in structures biased by the
> priors...

Precisely.  I tried to avoid implying Bayesian analysis in the paper.
I wanted activation / inactivation decisions to be based on the previous
likelihoods only at high temperature (where the instantaneous likelihoods
are meaningless, because the coordinates aren't terribly accurate), and
then to change over smoothly into ignoring the previous likelihoods at the
end of the annealing, when the coordinates should be more accurate.
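
To make the difference concrete, here is a rough Python sketch.  It
assumes the overall likelihood is a plain weighted sum,
w0*previous + (1 - w0)*instantaneous, with w0 ramped linearly from 1 to 0
over the annealing.  Both the exact form and the linear ramp are
assumptions for illustration, not a transcription of eq. 5 or of marvin's
actual schedule.

```python
def overall_likelihood(previous, instantaneous, w0):
    """Weighted sum of previous and instantaneous likelihoods (assumed form)."""
    return w0 * previous + (1.0 - w0) * instantaneous


def w0_schedule(step, n_steps):
    """Linear ramp from 1 at the first step to 0 at the last (assumed)."""
    if n_steps < 2:
        raise ValueError("need at least two steps")
    return 1.0 - step / (n_steps - 1)


# An assignment with a tiny prior that turns out to fit the structure well.
previous, instantaneous = 0.0001, 0.9
for step in (0, 5, 10):
    w0 = w0_schedule(step, 11)
    blended = overall_likelihood(previous, instantaneous, w0)
    product = previous * instantaneous   # Bayes-style product, for contrast
    print(f"w0={w0:.1f}  sum={blended:.4f}  product={product:.5f}")
```

With the sum, the assignment can recover once w0 has dropped, whereas the
product stays pinned near the tiny prior no matter how well the
assignment fits the coordinates.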

> By the way, I think it is possible to develop IRA to also
> assign resonances in combination with PASD...

I'm quite interested in hearing more about your approach!  I've been  
doing a lot
of peak assignment filtering work lately, but mostly stuff that's  
similar to the filtering
used by CANDID.  Bringing more statistical analysis to bear might  
work well,
as long as we can be sure that we're not biasing the structures too  
badly.

Good luck, and let me know if you run across any problems in creating  
your tables!

--JK