Re: [ccp4bb] Off topic: 'Difficult' Datasets for Processing Practice

James Holton Fri, 21 Sep 2018 15:52:00 -0700

For teaching purposes I have found that controlled pairs of data sets are most instructive. You are right that an easy one-button-push processing run tells you nothing, but so does a bang-it-crashed-now-what data set. Most useful are two data sets that are identical in every respect but one, and that one thing is the point you are trying to get across. It's hard to collect such perfectly paired data sets, so I ended up just simulating them. I deliberately chose a high-symmetry space group to keep the download size small. You can download them from here:


http://bl831.als.lbl.gov/~jamesh/workshop/

These five datasets represent the four biggest problems I see users have when trying to solve structures: 1) poor anomalous signal, 2) overlaps from a bad crystal orientation, 3) hidden radiation damage to sites, and 4) ice rings. The 5th "goodsignal" dataset is the positive control.

The web page contains everything from images to processed MTZ files, maps and the "right answer" in pdb and mtz format. A slightly more "realistic" version with a bigger download size is here:


http://bl831.als.lbl.gov/~jamesh/workshop2/

This is the one I used for my "weak anomalous challenge" a few years back. The teaching advantage is that you can use the image-mixer script to modulate the severity of problems like ice rings and anomalous signal. If you make a competition of it, people tend to get more interested.

When it comes to beam centers, it is not all that hard to take a data set with a "correct" beam center and just edit the headers. How you do this depends on the file format, but I have some instructions for editing images in general here:


http://bl831.als.lbl.gov/~jamesh/bin_stuff/

In general, you can usually separate the header from the data with the unix command "head" or "dd", edit the header with your favorite text editor, and then put the two parts back together with "cat". As for which beam center is "correct", it is important to tell your students that that depends on which software you are using. I wrote all this down in the last paragraph on page 7 of this doc:


https://submit.biorxiv.org/submission/pdf?msid=BIORXIV/2018/394965
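Going back to the header-editing recipe (split off the header, edit it, glue it back onto the data), here is a minimal Python sketch of the same idea. It assumes an SMV-style image whose text header declares its own size in a HEADER_BYTES field and stores values as KEY=value; lines padded with trailing spaces. Other formats will differ. The function names and the BEAM_CENTER_X keyword in the usage note are illustrative assumptions, not taken from any particular package.

```python
import re


def split_image(raw: bytes):
    """Split an image into (header, data) using its HEADER_BYTES field.

    Assumption: SMV-style text header that states its own length.
    """
    match = re.search(rb"HEADER_BYTES=\s*(\d+);", raw[:1024])
    if match is None:
        raise ValueError("no HEADER_BYTES field found; not an SMV-style header?")
    size = int(match.group(1))
    return raw[:size], raw[size:]


def set_header_value(header: bytes, key: str, value: str) -> bytes:
    """Replace 'KEY=...;' in the header without changing the header length."""
    pattern = re.compile(rb"(?m)^" + re.escape(key.encode()) + rb"=[^;]*;")
    new_field = f"{key}={value};".encode()
    # Use a function as the replacement so the value is inserted literally.
    edited, count = pattern.subn(lambda m: new_field, header, count=1)
    if count == 0:
        raise KeyError(key)
    diff = len(header) - len(edited)
    if diff >= 0:
        # New value is shorter or equal: pad the end with spaces.
        return edited + b" " * diff
    # New value is longer: give back trailing padding, if there is enough.
    if edited[diff:].strip(b" ") != b"":
        raise ValueError("not enough trailing padding to fit the new value")
    return edited[:diff]


def edit_image(raw: bytes, key: str, value: str) -> bytes:
    """Split, edit one header field, and rejoin (the 'cat' step)."""
    header, data = split_image(raw)
    return set_header_value(header, key, value) + data
```

A typical use would be to read the file with open(path, "rb").read(), call edit_image(raw, "BEAM_CENTER_X", "155.2"), and write the result to a new file so the original image is left untouched. Keeping the header length fixed matters because the processing program finds the pixel data by byte offset.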

That biorxiv doc also describes another simulated data set that demonstrates the challenges of combining lots of short wedges together. This may be too advanced a topic for your students, or maybe not. As you can guess I'm experimenting with biorxiv. So far, no comments.


Good luck with your class!

-James Holton
MAD Scientist


On 9/19/2018 5:15 PM, Whitley, Matthew J wrote:

Dear colleagues,

For teaching purposes, I am looking for a small number (< 5) of
macromolecular diffraction datasets (raw images) that might be
considered 'difficult' for a beginning crystallography student to
process.  By 'difficult' I generally mean not able to be processed
automatically by a common processing package (XDS, Mosflm, DIALS, etc)
using default settings, i.e., no black box "click and done" processing.
The datasets I am looking for would have some stumbling block such as
incorrect experimental parameters recorded in the image headers,
multiple lattices that cause indexing to fail, datasets for which
determining the correct space group is tricky, datasets for experiments
in which the crystal slipped or moved in the beam, or anything else you
can think of.  The idea is for these beginning students to examine
several datasets that highlight various phenomena that can lead one
astray during processing.

A good candidate dataset would also ideally comprise a modest number of
images so as to keep integration time to a minimum.  Factors that are
mostly irrelevant for my purpose: resolution (as long as better than
~3.5 Å), source (home vs synchrotron), presence/absence of anomalous
scattering,  presence/absence of ligands, monomeric vs oligomeric
structures, etc.  Also, to be clear, I am not looking for datasets that
have so many pathologies that they would require many long hours of work
for an expert to process correctly.

I have checked public repositories such as proteindiffraction.org and
SBGrid databank, but all of the datasets I acquired from these sources
process satisfactorily with little effort, and in any event I know of no
way to search for 'challenging' datasets.  (I also wonder whether
anybody is in the habit of depositing, shall we say, less-than-pristine
images to public repositories?)

If you know of such a dataset that is already publicly available, or if
you have such a dataset that you are willing to share for solely
educational purposes, I would appreciate hearing from you, either on- or
off-list.

Thank you in advance for your suggestions.

Matthew


