On 19/05/2007 9:17 PM, James Stroud wrote:
> John Machin wrote:
>> The approach that I've adopted is to test the values in a column for
>> all types, and choose the non-text type that has the highest success
>> rate (provided the rate is greater than some threshold, e.g. 90%;
>> otherwise it's text).
>>
>> For large files, taking a 1/N sample can save a lot of time with
>> little chance of misdiagnosis.
>
> Why stop there? You could lower the minimum 1/N by straightforward
> application of Bayesian statistics, using results from previous tables
> as priors.
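[For readers following along, the heuristic quoted above could be sketched roughly as follows. The function name, the candidate types, the date formats, and the default 90% threshold are illustrative assumptions, not code from the original poster.]

```python
import datetime

def guess_column_type(values, sample_step=1, threshold=0.9):
    """Guess a column's type by trying candidate parsers on a 1/N sample.

    Picks the non-text type with the highest parse success rate,
    falling back to "text" when no rate clears the threshold.
    (Names, candidate types, and date formats are illustrative only.)
    """
    def is_int(s):
        try:
            int(s)
            return True
        except ValueError:
            return False

    def is_float(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    def is_date(s):
        # Assumed formats for illustration; a real tool would be configurable.
        for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
            try:
                datetime.datetime.strptime(s, fmt)
                return True
            except ValueError:
                pass
        return False

    # Take every Nth value to save time on large files.
    sample = values[::sample_step]
    if not sample:
        return "text"
    rates = {
        "int": sum(map(is_int, sample)) / len(sample),
        "float": sum(map(is_float, sample)) / len(sample),
        "date": sum(map(is_date, sample)) / len(sample),
    }
    best = max(rates, key=rates.get)
    return best if rates[best] >= threshold else "text"
```

For example, a column of mostly-numeric strings with a few stray values is classified as text because no candidate type clears the threshold, while a clean column of integer strings is classified as int.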
The example I gave related to one file out of several files prepared at the same time, by the same organisation, from the same application, by the same personnel, using the same query tool, for a yearly process which has been going on for several years. All files for a year should be in the same format, the format should not change from year to year, and the format should match the agreed specifications ... but this doesn't happen.

Against that background, please explain to me how I can use "results from previous tables as priors".

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list