[R] reading and frequency analysis of Spanish text

2009-08-05 Thread Michael Friendly
For an historical paper I'm working on, I have some Spanish plaintext,
presently in the form of a Word .doc file,
http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc
and also some ciphered text from the same original source.  The ultimate
goal is to use some frequency analysis of letters and word lengths in the
plaintext to help decode the ciphered text.

For now, I'm stuck on how to read the Spanish plaintext into R as a text
string, given that it is in a Word .doc file using some form of latin1
encoding.  From Word, I can Save As .. plain text (.txt), but I'm worried
about losing character encoding information and I don't see anything in the
list of Other encodings presented that seems helpful.
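One way to keep the accented characters through a plain-text export is to tell R which encoding the file uses when reading it. The following is only a sketch: it builds a small latin1 file in a temporary location as a stand-in for the Word export (the names `sample_utf8` and `export_path` are made up for illustration), then reads it back and converts it to UTF-8.

```r
# Build a small latin1-encoded file to stand in for the Word "Save As" export
sample_utf8 <- "Matem\u00e1tico y cosm\u00f3grafo"
latin1_bytes <- iconv(sample_utf8, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]
export_path <- tempfile(fileext = ".txt")
writeBin(c(latin1_bytes, as.raw(0x0A)), export_path)

# Read the lines and mark them as latin1, then convert to UTF-8
txt <- readLines(export_path, encoding = "latin1")
txt <- enc2utf8(txt)

Encoding(txt)  # how R has marked the string
nchar(txt)     # counts characters, not bytes
```

If the accented letters print correctly and `nchar()` counts each of them as one character, the export encoding was declared correctly.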


A naive attempt to read the .doc file directly gives:

> langren.sp.file <- "http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc"
> langren.txt <- scan(langren.sp.file, encoding="latin1")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got 'ÐÏࡱá'
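
The odd characters in that error are the first bytes of the file shown as latin1: a legacy Word .doc is a binary OLE2 compound document, not text, so scan() cannot parse it whatever encoding is given. A minimal sketch of checking for that signature follows; `is_ole2_doc()` is a hypothetical helper, and the demonstration uses temporary files rather than the real document.

```r
# OLE2 compound-document signature found at the start of legacy .doc files
ole2_magic <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))

# Hypothetical helper: TRUE if the file begins with the OLE2 signature
is_ole2_doc <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = length(ole2_magic))
  identical(header, ole2_magic)
}

# Demonstration on two temporary files
fake_doc <- tempfile(fileext = ".doc")
writeBin(c(ole2_magic, as.raw(0:15)), fake_doc)
plain <- tempfile(fileext = ".txt")
writeLines("plain text", plain)

is_ole2_doc(fake_doc)
is_ole2_doc(plain)
```

A file that passes this check needs converting to text (by Word, OpenOffice.org or a tool such as antiword) before scan() or readLines() can do anything useful with it.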


Can someone help?

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] reading and frequency analysis of Spanish text

2009-08-05 Thread David Winsemius
When I open that link in OpenOffice.org Writer and then save in Text
Encoded format with Unicode encoding, the diacriticals (is that the
correct font-ish term?) seem to remain intact when re-opened. When I
read that file in, not with scan() but with readLines(), here is what
I get for the second string:


> langren.txt <- readLines("/Users/davidwinsemius/Downloads/Verdadera-spanish-stripped-1.txt", encoding="UTF-8")
> langren.txt[2]

 [2] MIGUEL FLORENCIO VAN LANGREN Matemático y cosmógrafo de su  
Majestad presenta las siguientes consideraciones de la Longitud por  
Mar y Tierra; y dice que su Padre y Abuelo fueron astrónomos y  
geógrafos, y en particular su padre asistió a las observaciones  
celestes realizadas por el famoso astrónomo Ticho Brahe, de quien  
recibió sus primeras observaciones, como consta por las obras del  
dicho Ticho. Así mismo su padre sirvió a su majestad como cosmógrafo  
en Flandes. Y el dicho VAN LANGREN, a imitación de sus antepasados, ha  
ejercitado en esas artes y descubierto cosas que no se sabían sobre la  
verdadera longitud por mar y tierra, apoyándose más en lo esencial que  
en lo especulativo. Y habiéndolo propuesto a la infanta Isabel, muy  
aficionada a dichas artes, ella le recomendó al rey por una carta en  
1629 (página 9 de este documento), para que le encargase corregir la  
geografía. Su majestad lo aprobó por una real cédula, debido a los  
enormes errores que muestran las distancias calculadas por eminentes  
astrónomos y geógrafos entre Toledo y Roma, tal como se muestra en  
esta línea, por la cual se pueden conjeturar los errores entre lugares  
más distantes.
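
With the text in R as a character vector, the letter and word-length counts asked about in the original post could be sketched roughly as follows. This is an illustration, not a worked-out cryptanalysis: a short excerpt of the string above stands in for the full `langren.txt`, and the Unicode-aware patterns (`\\p{L}` with `perl = TRUE`) are one choice among several.

```r
# A short excerpt stands in for the full langren.txt vector
txt <- "Y el dicho VAN LANGREN, a imitaci\u00f3n de sus antepasados"

# Letter frequencies: lower-case, split into single characters, keep letters only
chars <- strsplit(tolower(txt), "")[[1]]
letters_only <- chars[grepl("\\p{L}", chars, perl = TRUE)]
letter_freq <- sort(table(letters_only), decreasing = TRUE)

# Word lengths: split on runs of non-letters, drop empty pieces
words <- strsplit(txt, "[^\\p{L}]+", perl = TRUE)[[1]]
words <- words[nzchar(words)]
word_len_freq <- table(nchar(words))

letter_freq
word_len_freq
```

For the whole document, `paste(langren.txt, collapse = " ")` would give a single string to use in place of `txt`.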


Mind you this was on a Mac so the usual cross-platform caveats apply:

> sessionInfo()
R version 2.9.1 Patched (2009-07-04 r48897)
x86_64-apple-darwin9.7.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lattice_0.17-25 MASS_7.2-46     plotrix_2.6-4   plyr_0.1.9      Design_2.1-2    survival_2.35-4
[7] Hmisc_3.5-2

loaded via a namespace (and not attached):
[1] cluster_1.12.0 grid_2.9.1     tools_2.9.1

--
DW



David Winsemius, MD
Heritage Laboratories
West Hartford, CT



Re: [R] reading and frequency analysis of Spanish text

2009-08-05 Thread Sam Thomas
I used the readDOC function in tm.  

After storing the document locally on a Windows pc...

library(tm)

langren.sp.path <- "C:\\text\\" #store file by itself in this directory

langren.corpus <- (Corpus(DirSource(langren.sp.path),
    readerControl = list(reader = readDOC(AntiwordOptions = "-t"),
                         language = "spa", load = TRUE)))

(langren.sp.file <- langren.corpus[[1]])[1:10]


I think the default encoding for antiword is latin1, but the antiword -m
option can handle other mappings.
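
For anyone calling antiword directly rather than through tm, a rough sketch of passing a mapping with -m is below. The helper `antiword_args()` and the local path are hypothetical, and "UTF-8.txt" is assumed to be one of the mapping files installed with antiword; the call only runs if antiword is on the PATH and the file exists.

```r
# Hypothetical helper: build antiword arguments with an explicit mapping file
antiword_args <- function(doc_path, mapping = "UTF-8.txt") {
  c("-t", "-m", mapping, shQuote(doc_path))
}

doc_path <- "C:/text/Verdadera-spanish-stripped.doc"  # assumed local copy
aw_args <- antiword_args(doc_path)

# Run only when antiword is available and the file is present
if (nzchar(Sys.which("antiword")) && file.exists(doc_path)) {
  langren.txt <- system2("antiword", aw_args, stdout = TRUE)
  Encoding(langren.txt) <- "UTF-8"  # mark the output to match the mapping used
}
```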

Sam Thomas
