Based on information received, it seems that we're going to have to start setting and acting upon the "character_code" value in PSPP system files if we want them to remain compatible with those from SPSS in an internationalised environment.
Currently, we always set this value to 2 on writing, and ignore it on reading. However, this apparently causes problems when SPSS reads UTF-8 encoded files. Conceivably, it could also mean that PSPP won't properly read certain SPSS-generated files. There is another part of the file pertaining to character encoding (record 7, subtype 20), but from what I can make out, that affects only the encoding of the data records, and not the headers (labels etc.).

The character_code is currently documented as:

    Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII,
    3 indicates 8-bit ASCII, 4 indicates DEC Kanji. Windows code
    page numbers are also valid.

The problem is that we will need a mapping between this integer and the strings which are recognised by iconv. According to Wikipedia, no universally accepted mapping exists; every vendor has their own! Evidence suggests, however, that SPSS uses Microsoft's mapping, even when running on non-Microsoft platforms. So the best source of information seems to be http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx

However, as you will see, this table has only 153 entries, whilst "iconv -l" on my machine generates 1153 encoding names. So the question remains: what do we do with the 1000 character sets not in Microsoft's table? I suspect many of the iconv names are synonyms, and we can make educated guesses as to their meaning. Similarly, a lot of the iconv names are of the form CP%d, which suggests a direct mapping to the code page number. However, there are still gaps.

Moreover, I have seen a lot of SPSS data files which have this "character_code" set to 2, yet contain data which are clearly not 7-bit ASCII.

Has anyone got any sensible suggestions on how to implement the two functions:

    int get_codepage_from_encoding_name (const char *);

and

    const char *get_encoding_from_codepage (int);

???
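To make the question concrete, here is one possible starting point, offered only as a rough sketch and not as a proposed implementation. It assumes a small hand-written table holding a few entries from Microsoft's code page list, and falls back on the CP%d naming convention for anything not in the table. The table contents, the helper equal_ignoring_case, the -1 and NULL failure values, and the decision to return NULL for the legacy values 1 to 4 are all my own assumptions; nothing here checks whether the local iconv actually knows the resulting name.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical table: a handful of entries taken from Microsoft's
   code page identifier list.  A real table would be much longer. */
struct codepage_entry
  {
    int codepage;
    const char *name;
  };

static const struct codepage_entry codepage_table[] =
  {
    {  932, "SHIFT_JIS" },
    { 1250, "windows-1250" },
    { 1251, "windows-1251" },
    { 1252, "windows-1252" },
    { 20127, "US-ASCII" },
    { 28591, "ISO-8859-1" },
    { 65001, "UTF-8" },
  };

#define N_CODEPAGES (sizeof codepage_table / sizeof codepage_table[0])

/* Returns 1 if A and B are equal, ignoring case, and 0 otherwise. */
static int
equal_ignoring_case (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  return *a == *b;
}

/* Returns an encoding name for CODEPAGE, or NULL if none can be guessed.
   The legacy values 1 to 4 are not Windows code pages, so this sketch
   leaves them to the caller (an assumption, not documented behaviour).
   The fallback name lives in a static buffer, so this is not reentrant. */
const char *
get_encoding_from_codepage (int codepage)
{
  static char fallback[16];
  size_t i;

  if (codepage <= 4)
    return NULL;
  for (i = 0; i < N_CODEPAGES; i++)
    if (codepage_table[i].codepage == codepage)
      return codepage_table[i].name;

  /* Guess: many iconv names have the form CP%d. */
  snprintf (fallback, sizeof fallback, "CP%d", codepage);
  return fallback;
}

/* Returns the code page number for NAME, or -1 if it is not in the
   table and does not have the form CP%d. */
int
get_codepage_from_encoding_name (const char *name)
{
  size_t i;

  if (name == NULL)
    return -1;
  for (i = 0; i < N_CODEPAGES; i++)
    if (equal_ignoring_case (name, codepage_table[i].name))
      return codepage_table[i].codepage;

  if ((name[0] == 'C' || name[0] == 'c')
      && (name[1] == 'P' || name[1] == 'p')
      && isdigit ((unsigned char) name[2]))
    {
      char *end;
      long n = strtol (name + 2, &end, 10);
      if (*end == '\0' && n > 4 && n <= 65535)
        return (int) n;
    }
  return -1;
}
```

With this sketch, get_encoding_from_codepage (65001) gives "UTF-8" and get_encoding_from_codepage (850) gives the guessed name "CP850". It does nothing about the iconv synonyms, nor about files that claim character_code 2 but hold non-ASCII data, which are the parts I am really asking about.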
