
Re: Detecting UTF-8 Encoded Files

2009-08-07 Thread Klaus Major


Hi Ken,

I do not see any problem (and wouldn't if there were ;-)
but Mark Waddingham once helped me out with a working function for
determining exactly how a vCard is encoded!

Here it is, including Mark's (very helpful) comments:

# vCards are stored as a text file, however, the text encoding used
# varies depending on the program that exported them.

# We use the following heuristic to detect encoding:
# 1) If there is the byte order mark 0xFEFF then we assume UTF-16BE
# 2) If there is the byte order mark 0xFFFE then we assume UTF-16LE
# 3) If the first byte is 0x00 then we assume UTF-16BE (compatibility
#    with Tiger Address Book)
# 4) Otherwise we assume UTF-8
function vcf_convert3format tBinaryVCard
  # First load the vCard as binary data - at this stage we don't know
  # the text encoding of the file and loading as text would cause
  # inappropriate line ending conversion.

  # This variable will hold the vCard encoded in MacRoman (the default
  # text encoding Revolution uses on Mac OS X)
  local tNativeVCard

  # We now do our checks to detect text encoding
  switch
    case charToNum(char 1 of tBinaryVCard) = 0
      put "UTF16BE" into tTextEncoding
      break
    case charToNum(char 1 of tBinaryVCard) = 0xFE and charToNum(char 2 of tBinaryVCard) = 0xFF
      delete char 1 to 2 of tBinaryVCard
      put "UTF16BE" into tTextEncoding
      break
    case charToNum(char 1 of tBinaryVCard) = 0xFF and charToNum(char 2 of tBinaryVCard) = 0xFE
      delete char 1 to 2 of tBinaryVCard
      put "UTF16LE" into tTextEncoding
      break
    default
      put "UTF8" into tTextEncoding
      break
  end switch

  if tTextEncoding begins with "UTF16" then
    # Work out the processor's byte order
    local tHostByteOrder
    if the processor is "x86" then
      put "LE" into tHostByteOrder
    else
      put "BE" into tHostByteOrder
    end if

    # If the byte orders don't match, switch the order of pairs of bytes
    if char -2 to -1 of tTextEncoding <> tHostByteOrder then
      put swapbytes(tBinaryVCard) into tBinaryVCard
    end if

    # Decode the UTF-16 to native
    put uniDecode(tBinaryVCard) into tNativeVCard
  else
    # Use the standard uniDecode/uniEncode pair to decode the UTF-8 encoding
    put uniDecode(uniEncode(tBinaryVCard, "UTF8")) into tNativeVCard
  end if

  # We now need to normalize line endings to make sure all lines
  # terminate in 'return' (numToChar(10)).
  put tNativeVCard into tTextVCard

  # First replace Windows CR-LF style endings
  replace numToChar(13) & numToChar(10) with return in tTextVCard

  # Now replace Mac OS CR style endings
  replace numToChar(13) with return in tTextVCard
  return mac2win(tTextVCard)
end vcf_convert3format
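
For anyone who wants to compare the heuristic outside of Revolution, here is a rough Python sketch of the same four checks followed by the line ending clean-up. The function names are made up for illustration and are not part of Mark's script; it also leaves out the final MacRoman/ISO step.

```python
from typing import Tuple


def detect_vcard_encoding(data: bytes) -> Tuple[str, bytes]:
    """Return (codec name, payload with any UTF-16 byte order mark removed)."""
    if data[:1] == b"\x00":
        # Leading zero byte: assume UTF-16BE without a byte order mark
        return "utf-16-be", data
    if data[:2] == b"\xfe\xff":
        return "utf-16-be", data[2:]
    if data[:2] == b"\xff\xfe":
        return "utf-16-le", data[2:]
    # Nothing matched: fall back to UTF-8
    return "utf-8", data


def decode_vcard(data: bytes) -> str:
    """Decode the raw bytes and normalize all line endings to LF."""
    codec, payload = detect_vcard_encoding(data)
    # errors="replace" is an assumption of this sketch: bad bytes become U+FFFD
    text = payload.decode(codec, errors="replace")
    # Replace CR-LF first, then any remaining lone CR
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Python picks the byte order from the codec name, so no byte swapping step is needed in this version.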

***
Here is my function "mac2win" that we use in our cross-platform project,
where we store EVERYTHING in ISO format!

function mac2win was
  if the platform = "MacOS" then
    return mactoiso(was)
  else
    return was
  end if
end mac2win
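
As a point of comparison, a MacRoman to ISO-8859-1 conversion can be sketched in Python like this. It is only an approximation of what mactoiso does: the function name is invented here, and replacing unmappable characters with "?" is an assumption of the sketch, not documented Revolution behavior.

```python
def mac_to_iso(data: bytes) -> bytes:
    """Re-encode MacRoman bytes as ISO-8859-1 bytes."""
    # Characters with no ISO-8859-1 equivalent (e.g. the bullet) become "?"
    return data.decode("mac_roman").encode("iso-8859-1", errors="replace")
```

For example, the MacRoman byte 0x8A ("ä") comes out as the ISO byte 0xE4, while plain ASCII passes through unchanged.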

Hope that helps!


Best

Klaus

--
Klaus Major
http://www.major-k.de
kl...@major.on-rev.com

___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution

Detecting UTF-8 Encoded Files

2009-08-07 Thread Ken Ray

I recently had a need to be able to detect whether a vCard was UTF-8 encoded
or not so that I could run the proper decoding on it... after a healthy web
search, I found an article on Instructables for how to walk through the text
of a file and be able to determine this:

http://www.instructables.com/id/SYGL47RFDYPTCVC/

I wrote a function based on it and so far it's worked for me, but if anyone
sees any problems with it, let me know:


function GetFileData
  answer file "Select a file:"
  put it into tFile
  if tFile is not "" then
    -- Read the data first so that both branches can return it
    put url ("file:" & tFile) into tData
    if isUTF8Encoded(tFile) then
      return unidecode(uniencode(tData,"utf8"))
    else
      return tData
    end if
  end if
end GetFileData

function isUTF8Encoded pPath
  put url ("file:" & pPath) into tData

  -- Look for patterns of:
  -- "110xxxxx, 10yyyyyy" (2 bytes)
  -- "1110xxxx, 10yyyyyy, 10zzzzzz" (3 bytes)
  -- "11110xxx, 10yyyyyy, 10zzzzzz, 10wwwwww" (4 bytes)
  -- tMatchHolder holds two digits: the sequence length, then the
  -- number of continuation bytes already matched
  put "" into tMatchHolder
  repeat for each char tChar in tData
    put format("%08d",baseConvert(charToNum(tChar),10,2)) into tVal
    if tMatchHolder = "" then
      switch
        case (char 1 to 3 of tVal = "110")
          put "20" into tMatchHolder
          break
        case (char 1 to 4 of tVal = "1110")
          put "30" into tMatchHolder
          break
        case (char 1 to 5 of tVal = "11110")
          put "40" into tMatchHolder
          break
        default
          next repeat
      end switch
    else
      if (char 1 to 2 of tVal = "10") then
        if char 2 of tMatchHolder = (char 1 of tMatchHolder - 2) then
          return "true"
        else
          add 1 to char 2 of tMatchHolder
        end if
      else
        put "" into tMatchHolder
        next repeat
      end if
    end if
  end repeat
  return "false"
end isUTF8Encoded
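
The same bit-pattern scan can be written more compactly with shifts. Below is a Python sketch of it (function names are hypothetical), next to a stricter alternative that simply tries to decode the whole file. Note the difference: the scan answers "true" at the first well-formed multi-byte sequence and "false" for pure ASCII, whereas the strict check accepts pure ASCII and rejects any file containing a single malformed sequence.

```python
def has_utf8_multibyte(data: bytes) -> bool:
    """True as soon as one lead byte plus all its continuation bytes is seen."""
    expected = 0  # continuation bytes still required for the current sequence
    for byte in data:
        if expected == 0:
            if byte >> 5 == 0b110:        # 110xxxxx: 2-byte sequence
                expected = 1
            elif byte >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
                expected = 2
            elif byte >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
                expected = 3
        elif byte >> 6 == 0b10:           # 10xxxxxx: continuation byte
            expected -= 1
            if expected == 0:
                return True
        else:
            expected = 0                  # sequence broken, start over
    return False


def is_valid_utf8(data: bytes) -> bool:
    """Stricter check: the entire byte string must decode as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Which one fits depends on whether a file that merely contains some UTF-8 should count, or only a file that is UTF-8 from start to finish.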

HTH,

Ken Ray
Sons of Thunder Software, Inc.
Email: k...@sonsothunder.com
Web Site: http://www.sonsothunder.com/


