Re: [R] read.spss and umlaut

2006-08-04 Thread Thomas Kuster
Hello

On Thursday, 3 August 2006 at 15:34, Thomas Lumley wrote:
> On Thu, 3 Aug 2006, Thomas Kuster wrote:
>> Hello
>>
>> On Wednesday, 2 August 2006 at 17:11, Thomas Lumley wrote:
> ...
> You haven't shown anything that indicates that the C code stopped reading.
> More likely R just stops displaying when it gets to an illegal byte
> sequence.  You could use nchar() to count the bytes in the string to find
> out.

If I change the translatable characters (overwriting the 0 between :#@'= and
~000 with ÄÖÜäöü), I can read the file, and every ÄÖÜäöü becomes a whitespace:
> daten <- read.spss("projets_umlaut.por")
> levels(daten$PROJETX)
  [1] Bg Stammzellenforschung
  [2] Bb   ber eine neue Finanzordnung
  [3] Bb Neugestaltung des Finanzausgleichs
  [4]  nderrung Bg  EOG Mutterschafturlaub
  [5] EV Postdienste f r alle
  [6] Bb  ber B rgerrechtserwerb 3. Generation
  [7] Bb  ber erleichterte Einb rung 2. Generation
  [8] Bg Steuerpaket
   .
   .
   .
> levels(daten$PROJETX)[208]
[1] EV Gleiche Rechte f r Mann und Frau Gegenvorschlag
> charToRaw(levels(daten$PROJETX)[208])
 [1] 45 56 20 47 6c 65 69 63 68 65 20 52 65 63 68 74 65 20 66 20 72 20 4d 61 6e
[26] 6e 20 75 6e 64 20 46 72 61 75 20 47 65 67 65 6e 76 6f 72 73 63 68 6c 61 67

Without changing the table I get:
> daten <- read.spss("projets.por")
> charToRaw(levels(daten$PROJETX)[208])
 [1] 45 56 20 47 6c 65 69 63 68 65 20 52 65 63 68 74 65 20 66
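
The two charToRaw() results above can be compared byte for byte. The following is only a sketch: it rebuilds both strings from the hex dumps shown in this message (no SPSS file is needed) and counts their bytes with nchar(type = "bytes"), as suggested earlier in the thread. The helper name hex_to_string is made up for illustration.

```r
# Hex dumps copied from the two charToRaw() results in this message.
hex_edited <- paste(
  "45 56 20 47 6c 65 69 63 68 65 20 52 65 63 68 74 65 20 66 20 72 20 4d 61 6e",
  "6e 20 75 6e 64 20 46 72 61 75 20 47 65 67 65 6e 76 6f 72 73 63 68 6c 61 67"
)
hex_original <- "45 56 20 47 6c 65 69 63 68 65 20 52 65 63 68 74 65 20 66"

# Hypothetical helper: turn a space-separated hex dump back into a string.
hex_to_string <- function(hex) {
  bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))
  rawToChar(bytes)
}

label_edited   <- hex_to_string(hex_edited)
label_original <- hex_to_string(hex_original)

# Byte counts, independent of how the console displays the strings.
nchar(label_edited, type = "bytes")
nchar(label_original, type = "bytes")
```

If the second count is much smaller than the first, the label stored in R really does end after the "f", which suggests the text is lost while reading the file rather than while printing it.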

The SPSS file is from:
http://voxit.sidos.ch/update.asp?lang=d
- Download der kumulierten Dateien Version 2.0

You must accept this:
The Standardized Post-Vote Surveys:
http://voxit.sidos.ch/agreement.asp?lang=emenu=0

Thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] read.spss and umlaut

2006-08-04 Thread Thomas Lumley

On Fri, 4 Aug 2006, Thomas Kuster wrote:


> If I change the translatable characters (overwriting the 0 between :#@'= and
> ~000 with ÄÖÜäöü), I can read the file, and every ÄÖÜäöü becomes a whitespace:


Ok, that's the problem then.  The file format says that the umlauts are 
unreadable and R is believing the file format.


I will look at adding an option to specify an encoding and ignore the 
translation table, but not very urgently.
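
Until such an option exists, labels whose bytes do survive the read could be re-encoded by hand. The following is a minimal sketch, assuming the labels are Latin-1 (as the hex dump elsewhere in this thread suggests) and that UTF-8 is the target. The helper name reencode_labels is hypothetical, and this does not help while the translation table is still discarding the umlaut bytes.

```r
# Build the Latin-1 bytes for "für alle" explicitly, so the example does not
# depend on the encoding of this script: 0xfc is u-umlaut in Latin-1.
label_latin1 <- rawToChar(as.raw(c(0x66, 0xfc, 0x72, 0x20, 0x61, 0x6c, 0x6c, 0x65)))

# Hypothetical helper: re-encode a character vector from a known encoding.
reencode_labels <- function(x, from = "latin1", to = "UTF-8") {
  iconv(x, from = from, to = to)
}

label_utf8 <- reencode_labels(label_latin1)

# In UTF-8 the u-umlaut becomes the two bytes c3 bc.
charToRaw(label_utf8)
```

For a factor, the same helper could be applied to levels(f) and the result assigned back with levels(f) <- ...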


-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle


Re: [R] read.spss and umlaut

2006-08-03 Thread Thomas Kuster
Hello

On Wednesday, 2 August 2006 at 17:11, Thomas Lumley wrote:
> This sounds like a conflict between encodings -- e.g. if R is assuming UTF-8
> and the file is encoded in Latin-1 then the sequence
> U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
> U+0072 : LATIN SMALL LETTER R
> is coded as FC72 in the file, which is an illegal byte sequence in UTF-8.

Hex:  74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
Text:  t  e     f  ü  r     a  l  l  e  S  E  /  1  6

> The underlying C code (being written in the US quite a long time ago)
> doesn't know about encodings, and I don't know what the rules are in SPSS
> for valid characters (I suspect that in these old portable file formats it
> probably just reads and writes bytes, leaving it up to the OS to interpret
> them).

But why does the C code stop reading? Is / not the end mark of the string? What
would be the problem if I changed that in the source?

> You could try running R in a non-UTF-8 locale to see if it helps.

I think my locale is non-UTF-8 (de_CH, isolatin). How can I check that, and set
another one temporarily?

A dirty hack like this:
sed s/ä/ae/g | sed s/ö/oe/g | sed s/ü/ue/g | sed s/Ä/Ae/g | sed s/Ö/Oe/g | sed s/Ü/Ue/g
didn't work (file 'projets_non_umlaut.por' is not in any supported SPSS
format).

Thomas

> If anyone has definitive information about how SPSS represents strings and
> decides on valid characters that might be useful too.
>
>    -thomas



Re: [R] read.spss and umlaut

2006-08-03 Thread Thomas Lumley

On Thu, 3 Aug 2006, Thomas Kuster wrote:


> Hello
>
> On Wednesday, 2 August 2006 at 17:11, Thomas Lumley wrote:
>
>> This sounds like a conflict between encodings -- e.g. if R is assuming UTF-8
>> and the file is encoded in Latin-1 then the sequence
>> U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
>> U+0072 : LATIN SMALL LETTER R
>> is coded as FC72 in the file, which is an illegal byte sequence in UTF-8.
>
> Hex:  74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
> Text:  t  e     f  ü  r     a  l  l  e  S  E  /  1  6


Ok, so that looks like Latin-1 encoding in the file


>> The underlying C code (being written in the US quite a long time ago)
>> doesn't know about encodings, and I don't know what the rules are in SPSS
>> for valid characters (I suspect that in these old portable file formats it
>> probably just reads and writes bytes, leaving it up to the OS to interpret
>> them).
>
> But why does the C code stop reading? Is / not the end mark of the string? What
> would be the problem if I changed that in the source?


You haven't shown anything that indicates that the C code stopped reading. 
More likely R just stops displaying when it gets to an illegal byte 
sequence.  You could use nchar() to count the bytes in the string to find 
out.



>> You could try running R in a non-UTF-8 locale to see if it helps.
>
> I think my locale is non-UTF-8 (de_CH, isolatin). How can I check that, and set
> another one temporarily?


You can use charToRaw() to see what R thinks the byte sequence is for a 
word with a u-umlaut.


Sys.setlocale() will let you change the locale, but your locale does look 
non-UTF-8.
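
A short sketch of checking and temporarily changing the locale from within R follows. It uses only base functions. The locale name "C" is available everywhere, whereas a name such as "de_CH.ISO-8859-1" is system-dependent and is mentioned here only as an assumption.

```r
# Current character-type locale; the exact string depends on the system.
ctype <- Sys.getlocale("LC_CTYPE")
ctype

# l10n_info() reports, among other things, whether the session is UTF-8.
info <- l10n_info()
info[["UTF-8"]]

# Switch temporarily, then restore the previous setting.
old <- Sys.getlocale("LC_CTYPE")
invisible(Sys.setlocale("LC_CTYPE", "C"))
Sys.getlocale("LC_CTYPE")
invisible(Sys.setlocale("LC_CTYPE", old))
```

Restoring the saved value at the end keeps the rest of the session unaffected by the experiment.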


This is all guesswork since we can't see the file.

-thomas


Re: [R] read.spss and umlaut

2006-08-03 Thread Thomas Lumley

I have gone and looked at the code for reading SPSS portable files, and 
the file format appears to specify that you cannot read many legal 
characters.

Part of the header information in the file format is a 256-byte 
translation table apparently designed for translating between character 
representations.  It can mark characters as untranslatable, and the 
code for reading character strings replaces untranslatable characters 
with NULs.
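
The effect of such a table can be imitated in a few lines of R. This is only a toy model, not the actual code in the foreign package: the table below maps every byte to itself except 0xfc, which is marked untranslatable by mapping it to NUL. It also assumes that the string is then treated as ending at the first NUL, as a C string would be.

```r
# Toy 256-entry lookup table indexed by byte value (0..255).
xlat <- as.raw(0:255)
xlat[0xfc + 1L] <- as.raw(0x00)   # mark u-umlaut (Latin-1) as untranslatable

# Translate a raw vector through the table (R indexing is 1-based).
translate <- function(bytes, xlat) xlat[as.integer(bytes) + 1L]

input  <- as.raw(c(0x66, 0xfc, 0x72, 0x20, 0x61, 0x6c, 0x6c, 0x65))  # "für alle"
output <- translate(input, xlat)
output

# Keep only the bytes in front of the first NUL, as a C string would.
first_nul <- which(output == as.raw(0x00))[1]
visible   <- rawToChar(output[seq_len(first_nul - 1L)])
visible
```

Under these assumptions only the part of the label before the umlaut remains, which would be consistent with labels such as "EV Postdienste f" reported in this thread.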

In the example file in the foreign package the only translatable 
characters are the ASCII alphanumeric characters and 
.(+0[]!$*);^-/|,%_?`:#@'=~{}\

So, it looks as though your SPSS portable file may be marking character 
code FC as untranslatable.  This is easy to check -- look at the start of 
the file and find the sequence ABCDEF..., which is in the middle of the 
translation table. See if u-umlaut is in the table.  It might even work to 
modify the translation table to allow the accented characters.
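
The check described above could be done from R roughly as follows. This is a sketch under assumptions: the helper name check_header_bytes, the number of bytes read (1024) and the window size (256) are illustrative choices, not values taken from the file format, and the demonstration runs on a small synthetic file rather than a real .por file.

```r
# Hypothetical helper: find `pattern` in the first `n` bytes of a file and
# report which of the `wanted` byte values occur in a window starting there.
check_header_bytes <- function(path, wanted, pattern = "ABCDEF",
                               n = 1024L, window = 256L) {
  header <- readBin(path, what = "raw", n = n)
  pos <- grepRaw(pattern, header, fixed = TRUE)
  if (length(pos) == 0L) stop("pattern not found in the first ", n, " bytes")
  region <- header[pos:min(length(header), pos + window - 1L)]
  vapply(wanted, function(b) any(region == b), logical(1))
}

# Synthetic stand-in for a file header; only 0xfc is present after "ABCDEF".
tmp <- tempfile(fileext = ".por")
writeBin(c(charToRaw("0000ABCDEFGHIJ"), as.raw(0xfc), charToRaw("0000")), tmp)

umlauts <- as.raw(c(0xc4, 0xd6, 0xdc, 0xe4, 0xf6, 0xfc))  # ÄÖÜäöü in Latin-1
found <- check_header_bytes(tmp, umlauts)
found
unlink(tmp)
```

On a real file, a FALSE for a given byte would suggest that the character is not listed near the ABCDEF run, but the exact layout of the table should be confirmed against the format documentation.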

It looks as though SPSS .sav files don't have this limitation.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



[R] read.spss and umlaut

2006-08-02 Thread Thomas Kuster
Hello

When I read an SPSS *.por file with read.spss, everything after an umlaut is 
missing:

> library(foreign)
> spssdaten <- read.spss("projets.por")
> attr(spssdaten$PROJETX, "value.labels")[1:20]
  Bg Stammzellenforschung  Bb
  863   862
Bb Neugestaltung des Finanzausgleichs
  861   854
 EV Postdienste f   Bb
  853   852
  Bb Bg Steuerpaket
  851   843
 Bb Anhebung der Mehrwertsteuer s  11. AHV-Revision
  842   841
Volkinitiative Lebenslange Verwahrung
  833   832
  Gegenentwurf zur Avanti EV Lehrstellen-Initiative
  831   824
   EV Moratorium PlusEV Strom ohne Atom
  823   822
   EV Ja zu fairen Mieten   EV Gleiche Rechte f
  821   815
 EV GesundheitsinitiativeEV Sonntags-Initiative
  814   813

The SPSS file is okay:
> system("cat projets.por | grep Postdienste")
echtserwerb 3. GenerationSD/N/EV Postdienste für alleSE/16/Änderrung Bg  EOG 
Mut

How can I read the SPSS file with the umlauts?

Bye
Thomas Kuster

R: 2.1.0 (2005-04-18)
OS: Debian Sarge (Version 2.6.10-isgee-neptun-1)



Re: [R] read.spss and umlaut

2006-08-02 Thread Thomas Lumley

On Wed, 2 Aug 2006, Thomas Kuster wrote:


> Hello
>
> When I read an SPSS *.por file with read.spss, everything after an umlaut is
> missing:


This sounds like a conflict between encodings -- e.g. if R is assuming UTF-8 
and the file is encoded in Latin-1 then the sequence

U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
U+0072 : LATIN SMALL LETTER R
is coded as FC72 in the file, which is an illegal byte sequence in UTF-8.
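
The claim about the byte FC can be checked in R without any SPSS file. The sketch below builds the bytes by hand and uses validUTF8() from base R; it only illustrates the encoding point and says nothing about what read.spss does internally.

```r
# Latin-1 bytes for "für": 0x66 'f', 0xfc u-umlaut, 0x72 'r'.
s <- rawToChar(as.raw(c(0x66, 0xfc, 0x72)))

# The byte 0xfc never occurs in valid UTF-8, so this string is rejected.
validUTF8(s)

# For comparison, the UTF-8 encoding of U+00FC (code point 252) is two bytes.
charToRaw(intToUtf8(252L))
```

The first call returns FALSE and the second shows the bytes c3 bc, which is how the same character would have to be stored for a UTF-8 reader.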

The underlying C code (being written in the US quite a long time ago) 
doesn't know about encodings, and I don't know what the rules are in SPSS 
for valid characters (I suspect that in these old portable file formats it 
probably just reads and writes bytes, leaving it up to the OS to interpret 
them).


You could try running R in a non-UTF-8 locale to see if it helps.

If anyone has definitive information about how SPSS represents strings and 
decides on valid characters that might be useful too.


-thomas


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle