Re: Detecting UTF-8 Encoded Files
Hi Ken,

I do not see any problem (and wouldn't if there were ;-) but Mark Waddingham once helped me out with a working function for exactly this: determining how a vCard is encoded!

Here it is, including Mark's (very helpful) comments:

# vCards are stored as a text file, however, the text encoding used varies depending on the program that exported them.
# We use the following heuristic to detect encoding:
#   1) If there is the byte order mark 0xFEFF then we assume UTF-16BE
#   2) If there is the byte order mark 0xFFFE then we assume UTF-16LE
#   3) If the first byte is 0x00 then we assume UTF-16BE (compatibility with Tiger Address Book)
#   4) Otherwise we assume UTF-8
function vcf_convert3format tBinaryVCard
   # First load the vCard as binary data - at this stage we don't know the text encoding of the file and loading
   # as text would cause inappropriate line ending conversion.

   # This variable will hold the vCard encoded in MacRoman (the default text encoding Revolution uses on Mac OS X)
   local tNativeVCard

   # We now do our checks to detect text encoding
   switch
      case charToNum(char 1 of tBinaryVCard) = 0
         put "UTF16BE" into tTextEncoding
         break
      case charToNum(char 1 of tBinaryVCard) = 0xFE and charToNum(char 2 of tBinaryVCard) = 0xFF
         delete char 1 to 2 of tBinaryVCard
         put "UTF16BE" into tTextEncoding
         break
      case charToNum(char 1 of tBinaryVCard) = 0xFF and charToNum(char 2 of tBinaryVCard) = 0xFE
         delete char 1 to 2 of tBinaryVCard
         put "UTF16LE" into tTextEncoding
         break
      default
         put "UTF8" into tTextEncoding
         break
   end switch

   if tTextEncoding begins with "UTF16" then
      # Work out the processor's byte order
      local tHostByteOrder
      if the processor is "x86" then
         put "LE" into tHostByteOrder
      else
         put "BE" into tHostByteOrder
      end if

      # If the byte orders don't match, switch the order of pairs of bytes
      if char -2 to -1 of tTextEncoding <> tHostByteOrder then
         put swapbytes(tBinaryVCard) into tBinaryVCard
      end if

      # Decode the UTF-16 to native
      put uniDecode(tBinaryVCard) into tNativeVCard
   else
      # Use the standard uniDecode/uniEncode pair to decode the UTF-8 encoding
      put uniDecode(uniEncode(tBinaryVCard, "UTF8")) into tNativeVCard
   end if

   # We now need to normalize line endings to make sure all lines terminate in 'return' (numToChar(10)).
   put tNativeVCard into tTextVCard

   # First replace Windows CR-LF style endings
   replace numToChar(13) & numToChar(10) with return in tTextVCard

   # Now replace Mac OS CR style endings
   replace numToChar(13) with return in tTextVCard

   return mac2win(tTextVCard)
end vcf_convert3format

***

Here is my function "mac2win" that we use in our cross-platform project, where we store EVERYTHING in ISO format!

function mac2win was
   if the platform = "MacOS" then
      return mactoiso(was)
   else
      return was
   end if
end mac2win

Hope that helps!

Best
Klaus

--
Klaus Major
http://www.major-k.de
kl...@major.on-rev.com
___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
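
For readers who want to experiment with the byte-order-mark heuristic from the reply above outside Revolution, here is a minimal sketch in Python (not Revolution script). The function names `detect_vcard_encoding` and `vcard_to_text` are made up for this illustration and are not part of the posted code. The sketch assumes the same four rules and the same line-ending normalization, and it leaves out the MacRoman/ISO conversion step.

```python
from typing import Tuple


def detect_vcard_encoding(data: bytes) -> Tuple[str, bytes]:
    """Apply the four-rule heuristic; return (codec name, data without BOM).

    Hypothetical helper for illustration only.
    """
    if data[:2] == b"\xfe\xff":      # BOM 0xFEFF: UTF-16 big endian
        return "utf-16-be", data[2:]
    if data[:2] == b"\xff\xfe":      # BOM 0xFFFE: UTF-16 little endian
        return "utf-16-le", data[2:]
    if data[:1] == b"\x00":          # leading zero byte: BOM-less UTF-16BE
        return "utf-16-be", data
    return "utf-8", data             # otherwise assume UTF-8


def vcard_to_text(data: bytes) -> str:
    """Decode the raw bytes and normalize all line endings to LF."""
    encoding, payload = detect_vcard_encoding(data)
    text = payload.decode(encoding)
    # Replace Windows CR-LF first, then any remaining old Mac OS CR.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

As in the original function, the data has to be read as binary so that no line-ending conversion happens before the encoding check.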
Detecting UTF-8 Encoded Files
I recently needed to detect whether a vCard was UTF-8 encoded so that I could run the proper decoding on it. After a healthy web search, I found an article on Instructables that explains how to walk through the bytes of a file and determine this:

http://www.instructables.com/id/SYGL47RFDYPTCVC/

I wrote a function based on it and so far it's worked for me, but if anyone sees any problems with it, let me know:

function GetFileData
   answer file "Select a file:"
   put it into tFile
   if tFile is not "" then
      -- Load the data up front so both branches below can return it
      put url ("file:" & tFile) into tData
      if isUTF8Encoded(tFile) then
         return unidecode(uniencode(tData,"utf8"))
      else
         return tData
      end if
   end if
end GetFileData

function isUTF8Encoded pPath
   put url ("file:" & pPath) into tData
   -- Look for patterns of:
   --   "110xxxxx, 10yyyyyy" (2 bytes)
   --   "1110xxxx, 10yyyyyy, 10zzzzzz" (3 bytes)
   --   "11110xxx, 10yyyyyy, 10zzzzzz, 10wwwwww" (4 bytes)
   -- tMatchHolder is empty, or two digits: expected sequence length, then continuation bytes seen so far
   put "" into tMatchHolder
   repeat for each char tChar in tData
      put format("%08d",baseConvert(charToNum(tChar),10,2)) into tVal
      if tMatchHolder = "" then
         switch
            case (char 1 to 3 of tVal = "110")
               put "20" into tMatchHolder
               break
            case (char 1 to 4 of tVal = "1110")
               put "30" into tMatchHolder
               break
            case (char 1 to 5 of tVal = "11110")
               put "40" into tMatchHolder
               break
            default
               next repeat
         end switch
      else
         if (char 1 to 2 of tVal = "10") then
            if char 2 of tMatchHolder = (char 1 of tMatchHolder - 2) then
               -- This is the last continuation byte of a complete sequence
               return "true"
            else
               add 1 to char 2 of tMatchHolder
            end if
         else
            put "" into tMatchHolder
            next repeat
         end if
      end if
   end repeat
   return "false"
end isUTF8Encoded

HTH,

Ken Ray
Sons of Thunder Software, Inc.
Email: k...@sonsothunder.com
Web Site: http://www.sonsothunder.com/
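
The function above reports "true" as soon as it finds one complete multi-byte sequence, so a file in another encoding that happens to contain one valid-looking sequence would also pass. As a possible refinement, here is a minimal sketch in Python (not Revolution script) that applies the same lead-byte and continuation-byte patterns to the whole buffer. The name `looks_like_utf8` is hypothetical. Like the original, the sketch treats pure ASCII as "not UTF-8", and it is only a structural check: it does not reject overlong encodings or surrogate code points.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if every byte fits the UTF-8 bit patterns and at least one
    multi-byte sequence is present. Hypothetical helper for illustration."""
    remaining = 0            # continuation bytes still expected
    saw_multibyte = False
    for byte in data:
        if remaining:
            if (byte & 0b11000000) != 0b10000000:
                return False             # expected 10xxxxxx
            remaining -= 1
        elif (byte & 0b10000000) == 0:
            continue                     # 0xxxxxxx: plain ASCII
        elif (byte & 0b11100000) == 0b11000000:
            remaining, saw_multibyte = 1, True   # 110xxxxx: 2-byte sequence
        elif (byte & 0b11110000) == 0b11100000:
            remaining, saw_multibyte = 2, True   # 1110xxxx: 3-byte sequence
        elif (byte & 0b11111000) == 0b11110000:
            remaining, saw_multibyte = 3, True   # 11110xxx: 4-byte sequence
        else:
            return False                 # stray continuation or invalid lead
    # A sequence cut off at the end of the data also counts as invalid.
    return remaining == 0 and saw_multibyte
```

In Python itself, calling `data.decode("utf-8")` and catching `UnicodeDecodeError` is a stricter test; the loop is spelled out here only to mirror the bit patterns listed in the post.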