Re: Guessing the encoding of a test file...

Paul Dupuis via use-livecode Sat, 21 Mar 2020 07:13:09 -0700

Nope.

The reason I refer to the routine as "guessEncoding" is that I absolutely know that it is a "guess" based on the presence of nulls and other bytes for UTF files and by statistical sampling for various characters for MacRoman vs CP1252. We also offer an optional way for the user to pick the encoding IF THEY KNOW IT (or I suppose they can keep guessing until they get it right).
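
For readers following the thread, here is a minimal sketch of that kind of guess, written in Python rather than LiveCode purely for illustration. It is not the guessEncoding routine discussed here. The function name `guess_encoding`, the `LIKELY_CHARS` sample set, and the order of the checks are all assumptions made for this example.

```python
import codecs

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Hypothetical sample of characters that are plausible in Western European
# text. A real routine would tune this set against real documents.
LIKELY_CHARS = set("éèêëàâäçîïôöùûüñáíóú“”‘’")


def guess_encoding(data: bytes) -> str:
    """Return a best guess at the encoding of data. It is only a guess."""
    # 1. A byte-order mark is the strongest hint available.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # 2. Without a BOM, null bytes suggest UTF-16. For mostly Latin text the
    #    nulls sit at odd offsets in little-endian and even offsets in big-endian.
    if b"\x00" in data:
        even_nulls = data[0::2].count(0)
        odd_nulls = data[1::2].count(0)
        return "utf-16-le" if odd_nulls > even_nulls else "utf-16-be"

    # 3. Valid UTF-8 (which includes plain ASCII) is accepted as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 4. Otherwise sample the text: decode as each 8-bit candidate and count
    #    how many characters look plausible. Ties go to the first candidate.
    def plausibility(encoding: str) -> int:
        text = data.decode(encoding, errors="replace")
        return sum(1 for ch in text if ch in LIKELY_CHARS)

    return max(("cp1252", "mac_roman"), key=plausibility)
```

Calling `guess_encoding` on the raw bytes of a file returns a Python codec name that can be passed to `bytes.decode`. The result is still only a guess, so a manual override remains useful.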

I'll say it again, I was looking to see if ANY one else had implemented a guessEncoding routine and was willing to share or license it for comparison to my own, in hopes of either concluding mine is the best it can be OR learning something someone else is doing that improves it a little bit.

So far the only person who has read my post and replied with what I was looking for was Peter - and although the routine was written in Rebol rather than LiveCode, he kindly provided a link to information about it.


On 3/21/2020 4:20 AM, Quentin Long via use-livecode wrote:

I strongly suspect that the desired goal, to have a nice, robust algorithm 
which automagically identifies the encoding of *ABSOLUTELY ANY* text document 
with zero need for human involvement, simply isn't possible. Because text 
encoding is intrinsically arbitrary—see also: the many variations on extended 
(8-bit) ASCII, the various mutually-incompatible versions of EBCDIC, etc ad 
nauseam.
Seems to me, therefore, that in the general case, human involvement is an 
*unavoidable necessity* in determining which encoding an arbitrary text 
document uses. So the goal of any encoding-ID algorithm should *not* be the 
impossible task of determining that encoding *without* human involvement. 
Rather, the goal should be to *minimize* that human involvement, make that 
human involvement as *simple and painless* as practically feasible. So, here 
goes with some semi-random rambling…

Pretty sure the best, most nearly 
bulletproof way to ID a document's text-encoding involves applying that 
encoding to the bits of the document, and showing the resulting 
character-sequence to a human. If there's more than one possibility for the 
document's encoding, apply all of the possible encodings, and show a human all 
of the resulting character-sequences. I'm thinking that a good way to do this 
might be to put up N different text fields in a window, with all of the text 
fields controlled by one scrollbar, and the human clicks on all of the fields 
whose content looks good to them. Or maybe the human clicks on all the fields 
whose output looks *bad* to them? Whichever way works; as long as there *is* 
some human judgement in there somewhere.
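
The "decode it every plausible way and let a human look" idea above can be sketched roughly as follows, in Python for illustration only. The function name `candidate_decodings`, the default list of encodings, and the preview length are assumptions made for this example. The window with N linked text fields is left out, and only the previews that would fill those fields are produced.

```python
def candidate_decodings(data, encodings=("utf-8", "cp1252", "mac_roman", "latin-1"),
                        preview_chars=200):
    """Decode data under each candidate encoding and return short previews.

    Encodings that cannot decode the bytes at all are dropped, so the human
    only has to judge the candidates that remain.
    """
    previews = {}
    for encoding in encodings:
        try:
            previews[encoding] = data.decode(encoding)[:preview_chars]
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding, so skip it.
            continue
    return previews


if __name__ == "__main__":
    sample = "café".encode("cp1252")
    for name, text in candidate_decodings(sample).items():
        print(f"{name:10} {text}")
```

Each preview would go into its own field, and the human's clicks then select or reject the matching encoding names.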
Can we assume that once a particular document's text-encoding has been identified, that 
*all* documents which came from the same source as that document use that particular 
encoding? If so, that might simplify the continuing workflow; tell the software 
"This document came from Source X", and the software then uses whichever 
text-encoding it associates with that source. Even if there's more than one such 
text-encoding in play, that's at least easier to work with than having to sort thru an 
arbitrarily large number of text-encodings.
Is it possible to tell the software "hey, no character in $ThisSetOfChars will ever 
appear in this document"? If so, the software should be able to rule out any 
encoding which ends up putting one of the Forbidden Chars into the decoded 
character-sequence.
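
A minimal sketch of that rule-out step, again in Python for illustration. The function name `rule_out` and its parameters are hypothetical, and the forbidden set stands in for $ThisSetOfChars.

```python
def rule_out(data, encodings, forbidden):
    """Return the encodings whose decoded text contains no forbidden character."""
    survivors = []
    for encoding in encodings:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            # Bytes that are invalid in this encoding rule it out as well.
            continue
        if not any(ch in forbidden for ch in text):
            survivors.append(encoding)
    return survivors
```

For example, if the human says "Ž" never appears in these documents, a MacRoman file containing "é" (byte 0x8E) is no longer mistaken for CP1252, where that byte decodes to "Ž".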

Given human error, it may be that the human's input ends up ruling out *any 
possible* text-encoding. Prolly a good idea to use something akin to fuzzy 
logic rather than strict Boolean operations.
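
One way to read "something akin to fuzzy logic" is to score each encoding instead of eliminating it outright. The Python sketch below does that under stated assumptions: the function name `rank_encodings` and the scoring formula (the fraction of characters that are neither forbidden nor undecodable) are invented for this example and are only one of many possible choices.

```python
def rank_encodings(data, encodings, forbidden):
    """Rank encodings by how little of the decoded text looks wrong.

    Returns (encoding, score) pairs, best first. A score of 1.0 means no
    forbidden or undecodable characters were found.
    """
    scores = {}
    for encoding in encodings:
        text = data.decode(encoding, errors="replace")
        if not text:
            scores[encoding] = 1.0
            continue
        # Count forbidden characters plus U+FFFD replacement characters,
        # which mark bytes the encoding could not decode.
        bad = sum(1 for ch in text if ch in forbidden or ch == "\ufffd")
        scores[encoding] = 1.0 - bad / len(text)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this approach a mistaken entry in the forbidden set lowers the score of the correct encoding without eliminating it, so some candidate always comes out on top.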

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode




