Paul Dupuis wrote:

> There are many published algorithms for doing this and we had a past
> contractor of ours take a "best practice" algorithm and create an LCS
> "guessEncoding" function. This replaced a previous guessEncoding
> function we had from Richard Gaskin, which, while quite good, did
> not cover as many test cases as the newer, more robust one.

The algo I wrote for you a decade ago was an amalgam of best efforts culled from throughout this community at the time. It even included a variant, refined in our testing, of statistical analysis of certain patterns identified by Peter Haworth for files without explicit declaration.
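
For readers who want a concrete picture of a layered guess of this general kind, here is a minimal Python sketch. It is not the guessEncoding code discussed in this thread, which is not reproduced here. The function name guess_encoding, the order of the checks, and the CP1252 fallback are all assumptions made for illustration, and a strict UTF-8 validity check stands in for the statistical pattern analysis mentioned above.

```python
import codecs

# Byte-order marks act as the "explicit declaration" in this sketch.
# UTF-32 is deliberately not handled here (an assumption to keep it short).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best-guess codec name for data (hypothetical helper)."""
    # 1. Explicit declaration: trust a byte-order mark if one is present.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. No declaration: text that decodes as strict UTF-8 is very
    #    unlikely to be a legacy single-byte encoding by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Fallback: assume a legacy single-byte encoding (CP1252 here).
    return "cp1252"
```

A real detector would replace step 3 with a comparison of several candidate legacy encodings, which is where most of the refinement effort tends to go.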

At the time, running the algo through a test collection of ~200 widely varying sample documents, some of which even mixed different encodings, we compared our results with those from Apple's TextEdit and found that our algo correctly identified the encoding at least 15% more often than TextEdit.

Once we bested Apple on that by an appreciable margin, all of us on the team reviewed the results and determined that we were clearly looking at a case of diminishing returns in terms of cost-to-further-refine vs actual percentage of documents in use requiring such refinement.

I would be interested to learn more about the details of the subsequent refinements over the decade since, and also about the ROI proposition for today:

Given that another ten years have passed in which modern encodings have become the norm, and that older encodings like CP1252 (premiered in Windows 1.0 and popularized in Windows 95) are rarely seen in modern usage (as of March 2020, Wikipedia notes that only 0.4% of web pages use that encoding), what percentage of the documents your customers need to work with would benefit from further investment in refining that algo?

--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 ____________________________________________________________________
 ambassa...@fourthworld.com                http://www.FourthWorld.com

