Today, while processing a batch of files, I ran into a situation where
most of the files were utf-8 encoded but some were latin-1 encoded,
with one 8 bit byte per character.
To cope with this issue, I coded up a mechanism to test whether a
file contains only valid utf-8 sequences, and used {{ ": 10 u: y }}
to convert the files which failed this test.
In other words:
NB. characters '0'..'8' map to classes 0..8, everything else is class 9
cclass=: (i.9) (48+i.9)} 256#9
cstates=: 0 10#:10* ".;._2{{)n
0 7.3 2 3 4 5 6 7.3 7.3 7.1 NB. 0: start char sequence
0 7.3 2 3 4 5 6 7.3 7.3 7.1 NB. 1: finish char sequence, start next
7.3 1 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 2: need one more character
7.3 2 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 3: need two more characters
7.3 3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 4: need three more characters
7.3 4 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 5: need four more characters
7.3 5 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 6: need five more characters
7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.2 NB. 7: end
}}
utf8lenb=: <:2#.>1 #each~1+i.8
utf8ok=: {{
try.
(1;cstates;cclass) ;: '.',~'012345678_'{~ utf8lenb I. 3 u: y
1
catch.
0
end.
}}
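
As a small illustration of the classification step inside utf8ok (the
sample byte values are made up for this example, and utf8lenb is
assumed to be defined as above): utf8lenb holds the upper bound of each
byte range, so I. turns every byte into the digit naming its range
(0 for ascii, 1 for a continuation byte, 2 for the lead byte of a two
byte sequence, and so on). Those digits are what the state machine sees.

```j
   utf8lenb
127 191 223 239 247 251 253 254
   NB. 'A', then the two utf-8 bytes of U+00E9, then a lone latin-1 byte
   utf8lenb I. 65 195 169 233
0 2 1 3
   '012345678_' {~ utf8lenb I. 65 195 169 233
0213
```

Here the trailing 3 asks for two more continuation bytes which never
arrive, which is the kind of input the state table is meant to reject.
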
NB. most content is utf-8 -- assume non-utf-8 sequences are ascii+latin-1
latin2utf8=: {{
if.utf8ok y do. y else. ":10 u: y end.
}}
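
A possible way to try this out, as a sketch only: the two sample
strings below are hypothetical, and the comments describe the intended
behavior rather than recorded output. It assumes all of the definitions
above have been loaded.

```j
NB. hypothetical sample strings, not taken from the original files
utf8=:  'caf', 195 169 { a.    NB. U+00E9 stored as two utf-8 bytes
latin=: 'caf', 233 { a.        NB. U+00E9 stored as one latin-1 byte

utf8ok utf8                    NB. intended to report valid utf-8
utf8ok latin                   NB. intended to report invalid utf-8
3 u: latin2utf8 latin          NB. byte values after conversion
(latin2utf8 utf8) -: utf8      NB. valid input should pass through unchanged
```
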
I don't know if this approach would be useful to anyone else here,
but... just in case...
FYI,
--
Raul