I ran into a situation today, while processing a batch of files, where
most of the files were UTF-8 encoded but some were Latin-1, with one
8-bit byte per character.

To cope with this, I coded up a mechanism to test whether a file
contains only valid UTF-8 sequences, and used {{ ": 10 u: y }} to
convert the files which failed this test to UTF-8.

In other words:

cclass=: (i.9) (48+i.9)} 256#9
cstates=: 0 10#:10* ".;._2{{)n
  0    7.3  2    3    4    5    6    7.3  7.3  7.1 NB. 0: start char sequence
  0    7.3  2    3    4    5    6    7.3  7.3  7.1 NB. 1: finish char sequence, start next
  7.3  1    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 2: need one more character
  7.3  2    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 3: need two more characters
  7.3  3    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 4: need three more characters
  7.3  4    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 5: need four more characters
  7.3  5    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 6: need five more characters
  7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.2 NB. 7: end
}}

utf8lenb=: <:2#.>1 #each~1+i.8
utf8ok=: {{
  try.
    (1;cstates;cclass) ;: '.',~'012345678_'{~ utf8lenb I. 3 u: y
    1
  catch.
    0
  end.
}}
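
To make the classification step inside utf8ok a bit more concrete,
here is a small sketch of what utf8lenb and I. do to a few sample
bytes. The name "bytes" and the sample values are only for
illustration; utf8lenb is copied from the definition above.

```j
utf8lenb=: <:2#.>1 #each~1+i.8   NB. copied from above
utf8lenb                          NB. 127 191 223 239 247 251 253 254

NB. hypothetical sample: 'A', UTF-8 e-acute (195 169),
NB. Latin-1 e-acute (233), and the never-valid byte 255
bytes=: 65 195 169 233 255 { a.

utf8lenb I. 3 u: bytes                   NB. 0 2 1 3 8
'012345678_' {~ utf8lenb I. 3 u: bytes   NB. 02138
```

So class 0 is plain ascii, class 1 is a continuation byte, classes 2
through 6 are lead bytes of sequences of that length, and classes 7
and 8 are bytes which can never appear in utf-8. The '.' appended in
utf8ok falls into class 9, which marks the end of the input.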

NB. most content is utf-8 -- assume non-utf-8 sequences are ascii+latin-1
latin2utf8=: {{
  if. utf8ok y do. y else. ": 10 u: y end.
}}
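
As a rough illustration of the fallback branch, this sketch applies
the same ": 10 u: y conversion to a short Latin-1 byte string. The
name "latin1" and the sample bytes are made up for the example.

```j
NB. hypothetical sample: 'caf' followed by byte 233 (Latin-1 e-acute)
latin1=: 99 97 102 233 { a.

NB. 10 u: widens each byte to the code point with the same value,
NB. and ": then renders those code points as utf-8 bytes
a. i. ": 10 u: latin1             NB. 99 97 102 195 169
```

With the definitions above loaded, latin2utf8 latin1 should give the
same bytes, since byte 233 followed by the end marker is not a valid
utf-8 sequence, while input that is already valid utf-8 should come
back unchanged.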

I don't know if this approach would be useful to anyone else here,
but... just in case...

FYI,

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm