Dear James,
This is technically ingenious stuff.

Perhaps it could be applied to help with the 'full archive challenge', i.e. the
many data sets that will never lead to publication or database deposition?

However, for the latter subset (publication/deposition), you would surely not
'tamper' with the raw measurements?

The 'grey area' between the two clear-cut cases, i.e. where publication/deposition
MAY eventually result, then becomes the challenge: to compress or not? (I would
still prefer no tampering.)

Greetings,
John

Prof John R Helliwell DSc 




On 24 Oct 2011, at 22:56, James Holton <jmhol...@lbl.gov> wrote:

> The Pilatus is fast, but for decades now we have had detectors that can read 
> out in ~1s.  This means that you can collect a typical ~100 image dataset in 
> a few minutes (if flux is not limiting).  Since there are ~150 beamlines 
> currently operating around the world and they are open about 200 days/year, 
> we should be collecting ~20,000,000 datasets each year.
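
A rough back-of-envelope check of that ~20,000,000 figure, taking "a few
minutes" per dataset to mean ~2 minutes (an assumed number; the beamline and
calendar figures are the ones quoted above):

    # Back-of-envelope: theoretical data-collection capacity per year.
    beamlines           = 150      # beamlines operating worldwide (from the message)
    days_per_year       = 200      # operating days per year (from the message)
    minutes_per_day     = 24 * 60
    minutes_per_dataset = 2        # assumed: ~100 images at ~1 s/image plus overhead

    datasets_per_year = beamlines * days_per_year * minutes_per_day / minutes_per_dataset
    print(f"theoretical capacity: ~{datasets_per_year:,.0f} datasets/year")
    # -> theoretical capacity: ~21,600,000 datasets/year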
> 
> We're not.
> 
> The PDB only gets about 8000 depositions per year, which means either we 
> throw away 99.96% of our images, or we don't actually collect images anywhere 
> near the ultimate capacity of the equipment we have.  In my estimation, both 
> of these play about equal roles, with ~50-fold attrition between ultimate 
> data collection capacity and actual collected data, and another ~50 fold 
> attrition between collected data sets and published structures.
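
The attrition arithmetic spelled out, using the same round numbers as above:

    # ~20 M potential datasets vs ~8000 PDB depositions per year.
    capacity    = 20_000_000
    depositions = 8_000

    print(f"overall attrition: ~{capacity / depositions:.0f}-fold")   # ~2500-fold
    print(f"fraction deposited: {depositions / capacity:.2%}")        # 0.04% kept, 99.96% not
    print(f"~50-fold collection shortfall x ~50-fold publication attrition = {50 * 50}")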
> 
> Personally, I think this means that the time it takes to collect the final 
> dataset is not rate-limiting in a "typical" structural biology project/paper. 
>  This does not mean that the dataset is of little value.  Quite the opposite! 
>  About 3000x more time and energy is expended preparing for the final dataset 
> than is spent collecting it, and these efforts require experimental feedback. 
>  The trick is figuring out how best to compress the "data used to solve a 
> structure" for archival storage.  Do the "previous data sets" count?  Or 
> should the compression be "lossy" about such historical details?  Does the 
> stuff between the spots matter?  After all, h,k,l,F,sigF is really just a 
> form of data compression.  In fact, there is no such thing as "raw" data.  
> Even "raw" diffraction images are a simplification of the signals that came 
> out of the detector electronics.  But we round-off and average over a lot of 
> things to remove "noise".  Largely because "noise" is difficult to compress.  
> The question of how much compression is too much compression depends on which 
> information (aka noise) you think could be important in the future.
> 
> When it comes to fine-sliced data, such as that from Pilatus, the main reason 
> it doesn't compress very well is not the spots but the background, which 
> occupies thousands of times more pixels than the spots.  Yes, 
> there is diffuse scattering information in the background pixels, but this 
> kind of data is MUCH smoother than the spot data (by definition), and 
> therefore is optimally stored in larger pixels.  Last year, I messed around a 
> bit with applying different compression protocols to the spots and the 
> background, and found that ~30 fold compression can be easily achieved if you 
> apply h264 to the background and store the "spots" with lossless png 
> compression:
> 
> http://bl831.als.lbl.gov/~jamesh/lossy_compression/
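
The page above uses h264 for the background and lossless png for the spots.
Purely as a toy illustration of the same idea (smooth background stored
coarsely, spot pixels kept exactly), and not that actual protocol, something
like the following numpy sketch, in which every number is made up, already
gives large ratios:

    # Toy spot/background split on a fake image: bin the (smooth) background
    # into 8x8 block means, keep the few bright "spot" pixels losslessly.
    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.poisson(10, size=(2048, 2048)).astype(np.int32)  # fake flat background
    image[::97, ::101] += 5000                                   # fake Bragg "spots"

    spot_mask = image > 50                       # crude spot/background separation
    spot_idx  = np.argwhere(spot_mask)           # (row, col) of each spot pixel
    spot_vals = image[spot_mask]                 # spot intensities, stored losslessly

    block = 8
    bg = np.where(spot_mask, 0, image)           # background with spots zeroed out
    bg_means = bg.reshape(2048 // block, block, 2048 // block, block).mean(axis=(1, 3))

    raw_bytes  = image.nbytes
    kept_bytes = bg_means.astype(np.float32).nbytes + spot_idx.nbytes + spot_vals.nbytes
    print(f"naive compression ratio: ~{raw_bytes / kept_bytes:.0f}x")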
> 
> I think these results "speak" to the relative information content of the 
> spots and the pixels between them.  Perhaps at least the "online version" of 
> archived images could be in some sort of lossy-background format?  With the 
> "real images" in some sort of slower storage (like a room full of tapes that 
> are available upon request)?  Would 30-fold compression make the storage of 
> image data tractable enough for some entity like the PDB to be able to afford 
> it?
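
For scale, a very rough storage estimate for deposited datasets only, with both
the images-per-dataset and bytes-per-image figures purely assumed round numbers:

    # Very rough yearly storage estimate; per-dataset image count and per-image
    # size are assumptions, not measurements.
    depositions_per_year = 8_000
    images_per_dataset   = 100     # assumed
    mb_per_image         = 25      # assumed uncompressed size in MB
    compression          = 30      # the ~30-fold figure discussed above

    raw_tb   = depositions_per_year * images_per_dataset * mb_per_image / 1e6
    lossy_tb = raw_tb / compression
    print(f"raw:   ~{raw_tb:.0f} TB/year")    # ~20 TB/year
    print(f"lossy: ~{lossy_tb:.1f} TB/year")  # under 1 TB/year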
> 
> 
> I go to a lot of methods meetings, and it pains me to see the most brilliant 
> minds in the field starved for "interesting" data sets.  The problem is that 
> it is very easy to get people to send you data that is so bad that it can't 
> be solved by any software imaginable (I've got piles of that!).  As a 
> developer, what you really need is a "right answer" so you can come up with 
> better metrics for how close you are to it.  Ironically, bad, unsolvable data 
> that is connected to a right answer (aka a PDB ID) is very difficult to 
> obtain.  The explanations usually involve protestations about being in the 
> middle of writing up the paper, the student graduated and we don't understand 
> how he/she labeled the tapes, or the RAID crashed and we lost it all, etc. 
> etc.  Then again, just finding someone who has a data set with the kind of 
> problem you are interested in is a lot of work!  So is figuring out which 
> problem affects the most people, and is therefore "interesting".
> 
> Is this not exactly the kind of thing that publicly-accessible centralized 
> scientific databases are created to address?
> 
> -James Holton
> MAD Scientist
> 
> On 10/16/2011 11:38 AM, Frank von Delft wrote:
>> On the deposition of raw data:
>> 
>> I recommend to the committee that before it convenes again, every member 
>> should go collect some data on a beamline with a Pilatus detector [feel free 
>> to join us at Diamond].  Because by the time any recommendations actually 
>> emerge, most beamlines will probably have one of those (or similar), we'll 
>> be generating more data than the LHC, and users will be happy just to have 
>> it integrated, never mind worry about its fate.
>> 
>> That's not an endorsement, btw, just an observation/prediction.
>> 
>> phx.
>> 
>> 
>> 
>> 
>> On 14/10/2011 23:56, Thomas C. Terwilliger wrote:
>>> For those who have strong opinions on what data should be deposited...
>>> 
>>> The IUCr is just starting a serious discussion of this subject. Two
>>> committees, the "Data Deposition Working Group", led by John Helliwell,
>>> and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su)
>>> are working on this.
>>> 
>>> Two key issues are (1) feasibility and importance of deposition of raw
>>> images and (2) deposition of sufficient information to fully reproduce the
>>> crystallographic analysis.
>>> 
>>> I am on both committees and would be happy to hear your ideas (off-list).
>>> I am sure the other members of the committees would welcome your thoughts
>>> as well.
>>> 
>>> -Tom T
>>> 
>>> Tom Terwilliger
>>> terwilli...@lanl.gov
>>> 
>>> 
>>>>> This is a follow-up (or a digression) to James comparing the test set to
>>>>> missing reflections.  I have also heard this issue mentioned before but was
>>>>> always too lazy to actually pursue it.
>>>>> 
>>>>> So.
>>>>> 
>>>>> The role of the test set is to prevent overfitting.  Let's say I have
>>>>> the final model and I monitored the Rfree every step of the way and can
>>>>> conclude that there is no overfitting.  Should I do the final refinement
>>>>> against complete dataset?
>>>>> 
>>>>> IMCO, I absolutely should.  The test set reflections contain
>>>>> information, and the "final" model is actually biased towards the
>>>>> working set.  Refining using all the data can only improve the accuracy
>>>>> of the model, if only slightly.
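
For concreteness, the work/free split under discussion amounts to something like
the sketch below, with fake Fobs/Fcalc arrays, a ~5% free set, and the usual
R = sum|Fo - Fc| / sum Fo:

    # Sketch of the work/test ("free") split: ~5% of reflections are flagged
    # free and excluded from the refinement target; R is computed per set.
    # f_obs/f_calc here are placeholder arrays, not real data.
    import numpy as np

    rng = np.random.default_rng(1)
    f_obs  = rng.gamma(2.0, 100.0, size=20_000)                 # placeholder |Fobs|
    f_calc = f_obs * rng.normal(1.0, 0.15, size=f_obs.size)     # placeholder |Fcalc|

    free = rng.random(f_obs.size) < 0.05                        # ~5% free set

    def r_factor(obs, calc):
        return np.sum(np.abs(obs - calc)) / np.sum(obs)

    print(f"R_work = {r_factor(f_obs[~free], f_calc[~free]):.3f}")
    print(f"R_free = {r_factor(f_obs[free],  f_calc[free]):.3f}")
    # In real refinement only the working set drives the fit, so R_free > R_work;
    # refining against all data afterwards means R_free is no longer independent.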
>>>>> 
>>>>> The second question is practical.  Let's say I want to deposit the
>>>>> results of the refinement against the full dataset as my final model.
>>>>> Should I not report the Rfree and instead insert a remark explaining the
>>>>> situation?  If I report the Rfree prior to the test set removal, it is
>>>>> certain that every validation tool will report a mismatch.  It does not
>>>>> seem that the PDB has a mechanism to deal with this.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Ed.
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
>>>>>                                               Julian, King of Lemurs
>>>>> 
