Re: [R] read.spss and umlaut
Hello

On Thursday, 3 August 2006, at 15:34, Thomas Lumley wrote:
> On Thu, 3 Aug 2006, Thomas Kuster wrote:
> > Hello
> >
> > On Wednesday, 2 August 2006, at 17:11, Thomas Lumley wrote:
> ...
> You haven't shown anything that indicates that the C code stopped
> reading. More likely R just stops displaying when it gets to an
> illegal byte sequence. You could use nchar() to count the bytes in
> the string to find out.

If I change the translatable characters (overwrite the 0 between :#@'=
and ~000 with ÄÖÜäöü), I can read in the file, and every ÄÖÜäöü is a
whitespace:

    daten <- read.spss("projets_umlaut.por")
    levels(daten$PROJETX)
    [1] Bg Stammzellenforschung
    [2] Bb ber eine neue Finanzordnung
    [3] Bb Neugestaltung des Finanzausgleichs
    [4] nderrung Bg EOG Mutterschafturlaub
    [5] EV Postdienste f r alle
    [6] Bb ber B rgerrechtserwerb 3. Generation
    [7] Bb ber erleichterte Einb rung 2. Generation
    [8] Bg Steuerpaket
    . . .

    levels(daten$PROJETX)[208]
    [1] EV Gleiche Rechte f r Mann und Frau Gegenvorschlag

    charToRaw(levels(daten$PROJETX)[208])
     [1] 45 56 20 47 6c 65 69 63 68 65 20 52 65 63 68 74 65 20 66 20 72 20 4d 61 6e
    [26] 6e 20 75 6e 64 20 46 72 61 75 20 47 65 67 65 6e 76 6f 72 73 63 68 6c 61 67

Without changing the table I get:

    daten <- read.spss("projets.por")
    charToRaw(levels(daten$PROJETX)[208])
     [1] 45 56 20 47 6c 65 69 63 68 65 20 52 65 63 68 74 65 20 66

The SPSS file is from:
http://voxit.sidos.ch/update.asp?lang=d
-> Download der kumulierten Dateien Version 2.0 (download of the
   cumulated files, version 2.0)

You must accept this agreement first (The Standardized Post-Vote Surveys):
http://voxit.sidos.ch/agreement.asp?lang=emenu=0

Thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
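One possible reading of the two byte dumps above: with the edited table each umlaut comes back as a space (0x20), while with the original table the label ends right after 66 ("f"). That would be consistent with the umlaut being replaced by a NUL byte and the string then ending at that NUL, as C strings do. The following is only a sketch of that idea in R. The byte vector is typed in by hand from the dump, 0xfc stands in for the Latin-1 u-umlaut assumed to be in the file, and cut_at_nul is a hypothetical helper, not a function from the foreign package.

```r
# Bytes of "EV Gleiche Rechte f", the Latin-1 u-umlaut (0xfc), then "r Mann"
label <- as.raw(c(0x45, 0x56, 0x20, 0x47, 0x6c, 0x65, 0x69, 0x63, 0x68, 0x65,
                  0x20, 0x52, 0x65, 0x63, 0x68, 0x74, 0x65, 0x20, 0x66,
                  0xfc, 0x72, 0x20, 0x4d, 0x61, 0x6e, 0x6e))

# Hypothetical helper: keep only the bytes before the first NUL,
# which is how a C string would be interpreted.
cut_at_nul <- function(bytes) {
  nul <- which(bytes == as.raw(0))
  if (length(nul) == 0) bytes else bytes[seq_len(nul[1] - 1)]
}

# Case 1: the umlaut is treated as untranslatable and replaced by NUL
as_nul <- label
as_nul[as_nul == as.raw(0xfc)] <- as.raw(0x00)
truncated <- rawToChar(cut_at_nul(as_nul))

# Case 2: the umlaut is mapped to a space (what the edited table produced)
as_space <- label
as_space[as_space == as.raw(0xfc)] <- as.raw(0x20)
spaced <- rawToChar(as_space)

print(truncated)
print(spaced)
```

The first result stops at "f", like the 19-byte dump, and the second keeps the rest of the label with a space in place of the umlaut.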
Re: [R] read.spss and umlaut
On Fri, 4 Aug 2006, Thomas Kuster wrote:

> If I change the translatable characters (overwrite the 0 between :#@'=
> and ~000 with ÄÖÜäöü), I can read in the file, and every ÄÖÜäöü is a
> whitespace:

OK, that's the problem then. The file format says that the umlauts are
unreadable, and R is believing the file format.

I will look at adding an option to specify an encoding and ignore the
translation table, but not very urgently.

    -thomas

Thomas Lumley
Assoc. Professor, Biostatistics
[EMAIL PROTECTED]
University of Washington, Seattle
Re: [R] read.spss and umlaut
Hello

On Wednesday, 2 August 2006, at 17:11, Thomas Lumley wrote:
> This sounds like a conflict between encodings -- e.g. if R is assuming
> UTF-8 and the file is encoded in Latin-1, then the sequence
>   U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
>   U+0072 : LATIN SMALL LETTER R
> is coded as FC72 in the file, which is an illegal byte sequence in
> UTF-8.

    Hex:  74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
    Text: t  e     f  ü  r     a  l  l  e  S  E  /  1  6

> The underlying C code (being written in the US quite a long time ago)
> doesn't know about encodings, and I don't know what the rules are in
> SPSS for valid characters. (I suspect that in these old portable file
> formats it probably just reads and writes bytes, leaving it up to the
> OS to interpret them.)

But why does the C code stop reading? Is / not the end mark of the
string? What would the problem be if I changed that in the source?

> You could try running R in a non-UTF-8 locale to see if it helps.

I think my locale is non-UTF-8 (de_CH, isolatin). How can I check that,
and set another one temporarily?

A dirty hack like this:

    sed s/ä/ae/g | sed s/ö/oe/g | sed s/ü/ue/g | sed s/Ä/Ae/g | sed s/Ö/Oe/g | sed s/Ü/Ue/g

didn't work (file 'projets_non_umlaut.por' is not in any supported SPSS
format).

Thomas

> If anyone has definitive information about how SPSS represents strings
> and decides on valid characters that might be useful too.
>
>     -thomas
Re: [R] read.spss and umlaut
On Thu, 3 Aug 2006, Thomas Kuster wrote:

> Hello
>
> On Wednesday, 2 August 2006, at 17:11, Thomas Lumley wrote:
> > This sounds like a conflict between encodings -- e.g. if R is
> > assuming UTF-8 and the file is encoded in Latin-1, then the sequence
> >   U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
> >   U+0072 : LATIN SMALL LETTER R
> > is coded as FC72 in the file, which is an illegal byte sequence in
> > UTF-8.
>
>     Hex:  74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
>     Text: t  e     f  ü  r     a  l  l  e  S  E  /  1  6

OK, so that looks like Latin-1 encoding in the file.

> > The underlying C code (being written in the US quite a long time
> > ago) doesn't know about encodings, and I don't know what the rules
> > are in SPSS for valid characters. (I suspect that in these old
> > portable file formats it probably just reads and writes bytes,
> > leaving it up to the OS to interpret them.)
>
> But why does the C code stop reading? Is / not the end mark of the
> string? What would the problem be if I changed that in the source?

You haven't shown anything that indicates that the C code stopped
reading. More likely R just stops displaying when it gets to an illegal
byte sequence. You could use nchar() to count the bytes in the string to
find out.

> > You could try running R in a non-UTF-8 locale to see if it helps.
>
> I think my locale is non-UTF-8 (de_CH, isolatin). How can I check
> that, and set another one temporarily?

You can use charToRaw() to see what R thinks the byte sequence is for a
word with a u-umlaut. Sys.setlocale() will let you change the locale,
but your locale does look non-UTF-8.

This is all guesswork since we can't see the file.

    -thomas
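The checks suggested in this message (nchar() for the byte count, charToRaw() for the byte sequence, and a look at the locale) could be tried roughly as follows. This is a sketch on a hand-built three-byte string, not on the SPSS file from the thread; the locale name in the commented-out part is an assumption and has to exist on the system.

```r
# A Latin-1 encoded string built from raw bytes: "f", u-umlaut (0xfc), "r"
x <- rawToChar(as.raw(c(0x66, 0xfc, 0x72)))

# Count bytes rather than characters; this also works when the bytes
# are not valid in the encoding of the current locale.
n_bytes <- nchar(x, type = "bytes")

# Show the byte sequence R actually holds for the string
bytes <- charToRaw(x)

print(n_bytes)
print(bytes)

# Inspect the character-type locale and whether R considers it UTF-8
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info()[["UTF-8"]])

# Switching temporarily might look like this. The locale name is an
# assumption; if it is not installed, Sys.setlocale() returns "" with a warning.
# old <- Sys.getlocale("LC_CTYPE")
# Sys.setlocale("LC_CTYPE", "de_CH.ISO-8859-1")
# Sys.setlocale("LC_CTYPE", old)
```

If a label read by read.spss() reports fewer bytes than the text in the file, the string itself was shortened when it was read, rather than only its display.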
Re: [R] read.spss and umlaut
I have gone and looked at the code for reading SPSS portable files, and
the file format appears to specify that you cannot read many legal
characters.

Part of the header information in the file format is a 256-byte
translation table apparently designed for translating between character
representations. It can mark characters as untranslatable, and the code
for reading character strings replaces untranslatable characters with
NULs. In the example file in the foreign package the only translatable
characters are the ASCII alphanumeric characters and

    .(+0[]!$*);^-/|,%_?`:#@'=~{}\

So, it looks as though your SPSS portable file may be marking character
code FC as untranslatable. This is easy to check -- look at the start of
the file and find the sequence ABCDEF..., which is in the middle of the
translation table. See if u-umlaut is in the table. It might even work
to modify the translation table to allow the accented characters.

It looks as though SPSS .sav files don't have this limitation.

    -thomas

Thomas Lumley
Assoc. Professor, Biostatistics
[EMAIL PROTECTED]
University of Washington, Seattle
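The check described above (find ABCDEF near the start of the file, then look for the u-umlaut byte) could be sketched in R as below. find_marker is a hypothetical helper written for this illustration, and the demo vector is a synthetic stand-in, not real header data. The sketch only tests whether byte FC occurs at all in the bytes examined; it does not interpret positions within the translation table.

```r
# Hypothetical helper: position of the first occurrence of an ASCII
# marker in a raw vector, or NA if the marker does not occur.
find_marker <- function(bytes, marker = "ABCDEF") {
  m <- charToRaw(marker)
  n <- length(bytes) - length(m) + 1
  if (n < 1) return(NA_integer_)
  for (i in seq_len(n)) {
    if (all(bytes[i:(i + length(m) - 1)] == m)) return(i)
  }
  NA_integer_
}

# Synthetic stand-in for the start of a portable file
demo <- c(charToRaw("0123456789"), charToRaw("ABCDEFGHIJ"), as.raw(0xfc))

pos <- find_marker(demo)
has_u_umlaut <- any(demo == as.raw(0xfc))

print(pos)
print(has_u_umlaut)

# With a real file (name taken from the thread; the number of bytes to
# read is an assumption), one might inspect the header like this:
# header <- readBin("projets.por", what = "raw", n = 1000)
# find_marker(header)
# any(header == as.raw(0xfc))
```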
[R] read.spss and umlaut
Hello

When I read an SPSS *.por file with read.spss, everything after an umlaut
is missing:

    library(foreign)
    spssdaten <- read.spss("projets.por")
    attr(spssdaten$PROJETX, "value.labels")[1:20]
                  Bg Stammzellenforschung                          Bb
                                      863                         862
    Bb Neugestaltung des Finanzausgleichs
                                      861                         854
                         EV Postdienste f                          Bb
                                      853                         852
                                       Bb              Bg Steuerpaket
                                      851                         843
         Bb Anhebung der Mehrwertsteuer s            11. AHV-Revision
                                      842                         841
    Volkinitiative Lebenslange Verwahrung
                                      833                         832
                  Gegenentwurf zur Avanti   EV Lehrstellen-Initiative
                                      831                         824
                       EV Moratorium Plus          EV Strom ohne Atom
                                      823                         822
                   EV Ja zu fairen Mieten         EV Gleiche Rechte f
                                      821                         815
                 EV Gesundheitsinitiative      EV Sonntags-Initiative
                                      814                         813

The SPSS file is okay:

    system("cat projets.por |grep Postdienste")
    echtserwerb 3. GenerationSD/N/EV Postdienste für alleSE/16/Änderrung Bg EOG Mut

How can I read the SPSS file with the umlauts?

Bye
Thomas Kuster

R: 2.1.0 (2005-04-18)
OS: Debian Sarge (Version 2.6.10-isgee-neptun-1)
Re: [R] read.spss and umlaut
On Wed, 2 Aug 2006, Thomas Kuster wrote:

> Hello
>
> When I read an SPSS *.por file with read.spss, everything after an
> umlaut is missing:

This sounds like a conflict between encodings -- e.g. if R is assuming
UTF-8 and the file is encoded in Latin-1, then the sequence

    U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
    U+0072 : LATIN SMALL LETTER R

is coded as FC72 in the file, which is an illegal byte sequence in UTF-8.

The underlying C code (being written in the US quite a long time ago)
doesn't know about encodings, and I don't know what the rules are in SPSS
for valid characters. (I suspect that in these old portable file formats
it probably just reads and writes bytes, leaving it up to the OS to
interpret them.)

You could try running R in a non-UTF-8 locale to see if it helps.

If anyone has definitive information about how SPSS represents strings
and decides on valid characters that might be useful too.

    -thomas

Thomas Lumley
Assoc. Professor, Biostatistics
[EMAIL PROTECTED]
University of Washington, Seattle
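The encoding conflict described in this message can be illustrated on the three bytes 66 FC 72 ("für" in Latin-1). The sketch below assumes R 3.3.0 or later for validUTF8(); it works on a hand-built string rather than on the SPSS file, so it shows the general mechanism only, not what read.spss() does internally.

```r
# "f", u-umlaut, "r" as Latin-1 bytes (66 fc 72)
latin1_bytes <- as.raw(c(0x66, 0xfc, 0x72))
x <- rawToChar(latin1_bytes)

# The byte FC never occurs in well-formed UTF-8, so this is not valid UTF-8
is_valid <- validUTF8(x)

# Declare the source encoding and convert to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")
utf8_bytes <- charToRaw(y)

print(is_valid)
print(utf8_bytes)
```

After conversion the umlaut occupies two bytes (c3 bc), which is why software expecting UTF-8 cannot interpret the single Latin-1 byte FC.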