I have a txt file (attached) that defines equivalents among characters in latin1 (or iso-8859-1), numeric &#xxx; codes, HTML entities and latex equivalents. A portion of the file is shown inline below, but may not be rendered well in this email.

I'd like to read this into R to use as a character translation table, but am stuck on two things:

- The 5 fields in the file are column-aligned and are separated by 2+ whitespace characters. In Perl this is trivial to read and parse via something like
        @entries = split("\n", $charTable);
        foreach (@entries) {
                ($desc, $char, $code, $html, $tex) = split(/\s\s+/);
        }
AFAIK, the only function for reading such data is utils::read.fwf, but that requires me to specify the field widths. I don't know of any function that accepts even a simple regex like this as a sep= argument.

- The TeX field contains many backslashed codes that need to be escaped in R. Is it necessary to manually edit the file to change '\pounds' --> '\\pounds', '\S' --> '\\S', etc., or is there something like raw-mode input that would do this where necessary?
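
For reference, here is a minimal sketch of how both points might be handled in base R. It reads the lines with readLines() and splits them with strsplit() on runs of two or more whitespace characters. The names `sample_lines`, `path` and `tab`, and the column names, are assumptions made for illustration; the sample rows are copied from the table and written to a temporary file so the sketch is self-contained. Two things to note: the header line is skipped because "Char" and "Code" are separated by only one space there, and backslashes need doubling only in R source code, not in data read from a file.

```r
# Sample rows in the same layout as the attached file. The doubled
# backslashes are needed only because this is an R string literal; the
# temporary file ends up with single backslashes, like the real file.
sample_lines <- c(
  "Description                          Char Code      HTML        TeX",
  "ampersand                            &    &#038;    &amp;       \\&     ",
  "apostrophe                           '    &#039;    &apos;",
  "less than                            <    &#060;    &lt;        $<$",
  "pound sterling                       \u00a3    &#163;    &pound;     \\pounds"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# For the real latin1 file, readLines(path, encoding = "latin1") is
# probably what is wanted (an assumption about the file's encoding).
lines <- readLines(path)
lines <- trimws(lines[-1], which = "right")  # drop header, strip trailing blanks
lines <- lines[nzchar(lines)]                # drop empty lines, if any

# Split on runs of 2+ whitespace characters, mirroring Perl's /\s\s+/
fields <- strsplit(lines, "\\s{2,}", perl = TRUE)

# Rows without a TeX equivalent have only 4 fields; pad them to 5 with NA
fields <- lapply(fields, function(f) { length(f) <- 5; f })

tab <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(tab) <- c("desc", "char", "code", "html", "tex")

# The TeX field holds a single backslash; print() shows it escaped,
# writeLines() shows the actual contents.
pounds <- tab$tex[tab$desc == "pound sterling"]
nchar(pounds)
writeLines(pounds)
```

print(pounds) displays "\\pounds" because print() shows strings in escaped form, but the stored string is the seven characters of \pounds, so no manual editing of the file should be needed.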

Description                          Char Code      HTML        TeX
double quote                         "    &#034; &quot;
ampersand                            &    &#038; &amp;       \&
apostrophe                           '    &#039; &apos;
less than                            <    &#060; &lt;        $<$
greater than                         >    &#062; &gt;        $>$
non-breaking space                   .    &#160; &nbsp;      ~
inverted exclamation                 ¡    &#161; &iexcl;     !'
cent sign                            ¢    &#162; &cent;
pound sterling                       £    &#163; &pound;     \pounds
general currency sign                ¤    &#164; &curren;
yen sign                             ¥    &#165; &yen;
broken vertical bar                  ¦    &#166; &brvbar;
section sign                         §    &#167; &sect;      \S
umlaut (dieresis)                    ¨    &#168; &uml;       \"{}
copyright                            ©    &#169; &copy;      \copyright
feminine ordinal                     ª    &#170; &ordf;      $^a$
left angle quote, guillemotleft      «    &#171; &laquo;     \guillemotleft
not sign                             ¬    &#172; &not;
soft hyphen                          ­    &#173; &shy;
registered trademark                 ®    &#174; &reg;       \textregistered
macron accent                        ¯    &#175; &macr;
degree sign                          °    &#176; &deg;       $^o$
plus or minus                        ±    &#177; &plusmn;    $\pm$
superscript two                      ²    &#178; &sup2;      $^2$
superscript three                    ³    &#179; &sup3;      $^3$
acute accent                         ´    &#180; &acute;     \'{}
micro sign                           µ    &#181; &micro;     $\mu$
paragraph sign                       ¶    &#182; &para;      \P
middle dot                           ·    &#183; &middot;    $\cdot$
cedilla                              ¸    &#184; &cedil;     \c{}
superscript one                      ¹    &#185; &sup1;      $^1$
masculine ordinal                    º    &#186; &ordm;      $^o$
right angle quote, guillemotright    »    &#187; &raquo;     \guillemotright
fraction one-fourth                  ¼    &#188; &frac14;    $\frac14$
fraction one-half                    ½    &#189; &frac12;    $\frac12$
fraction three-fourths               ¾    &#190; &frac34;    $\frac34$
inverted question mark               ¿    &#191; &iquest;    ?'
capital A, grave accent              À    &#192; &Agrave;    \`A
capital A, acute accent              Á    &#193; &Aacute;    \'A
capital A, circumflex accent         Â    &#194; &Acirc;     \^A
capital A, tilde                     Ã    &#195; &Atilde;    \~A
capital A, dieresis or umlaut mark   Ä    &#196; &Auml;      \"A
capital A, ring                      Å    &#197; &Aring;     \AA
capital AE diphthong (ligature)      Æ    &#198; &AElig;     \AE

--
Michael Friendly     Email: frien...@yorku.ca
Professor, Psychology Dept.
York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street    http://datavis.ca
Toronto, ONT  M3J 1P3 CANADA

Description                          Char Code      HTML        TeX
double quote                         "    &#034;    &quot; 
ampersand                            &    &#038;    &amp;       \&
apostrophe                           '    &#039;    &apos;
less than                            <    &#060;    &lt;        $<$
greater than                         >    &#062;    &gt;        $>$  
non-breaking space                   .    &#160;    &nbsp;      ~
inverted exclamation                 ¡    &#161;    &iexcl;     !'
cent sign                            ¢    &#162;    &cent;
pound sterling                       £    &#163;    &pound;     \pounds
general currency sign                ¤    &#164;    &curren;
yen sign                             ¥    &#165;    &yen;
broken vertical bar                  ¦    &#166;    &brvbar;
section sign                         §    &#167;    &sect;      \S
umlaut (dieresis)                    ¨    &#168;    &uml;       \"{}
copyright                            ©    &#169;    &copy;      \copyright
feminine ordinal                     ª    &#170;    &ordf;      $^a$
left angle quote, guillemotleft      «    &#171;    &laquo;     \guillemotleft
not sign                             ¬    &#172;    &not;
soft hyphen                          ­    &#173;    &shy;
registered trademark                 ®    &#174;    &reg;       \textregistered
macron accent                        ¯    &#175;    &macr;
degree sign                          °    &#176;    &deg;       $^o$
plus or minus                        ±    &#177;    &plusmn;    $\pm$
superscript two                      ²    &#178;    &sup2;      $^2$
superscript three                    ³    &#179;    &sup3;      $^3$
acute accent                         ´    &#180;    &acute;     \'{}
micro sign                           µ    &#181;    &micro;     $\mu$
paragraph sign                       ¶    &#182;    &para;      \P
middle dot                           ·    &#183;    &middot;    $\cdot$
cedilla                              ¸    &#184;    &cedil;     \c{}
superscript one                      ¹    &#185;    &sup1;      $^1$
masculine ordinal                    º    &#186;    &ordm;      $^o$
right angle quote, guillemotright    »    &#187;    &raquo;     \guillemotright
fraction one-fourth                  ¼    &#188;    &frac14;    $\frac14$
fraction one-half                    ½    &#189;    &frac12;    $\frac12$
fraction three-fourths               ¾    &#190;    &frac34;    $\frac34$
inverted question mark               ¿    &#191;    &iquest;    ?'
capital A, grave accent              À    &#192;    &Agrave;    \`A
capital A, acute accent              Á    &#193;    &Aacute;    \'A
capital A, circumflex accent         Â    &#194;    &Acirc;     \^A
capital A, tilde                     Ã    &#195;    &Atilde;    \~A
capital A, dieresis or umlaut mark   Ä    &#196;    &Auml;      \"A
capital A, ring                      Å    &#197;    &Aring;     \AA
capital AE diphthong (ligature)      Æ    &#198;    &AElig;     \AE
capital C, cedilla                   Ç    &#199;    &Ccedil;    \c{C}
capital E, grave accent              È    &#200;    &Egrave;    \`E
capital E, acute accent              É    &#201;    &Eacute;    \'E
capital E, circumflex accent         Ê    &#202;    &Ecirc;     \^E
capital E, dieresis or umlaut mark   Ë    &#203;    &Euml;      \"E
capital I, grave accent              Ì    &#204;    &Igrave;    \`I
capital I, acute accent              Í    &#205;    &Iacute;    \'I
capital I, circumflex accent         Î    &#206;    &Icirc;     \^I
capital I, dieresis or umlaut mark   Ï    &#207;    &Iuml;      \"I
capital Eth, Icelandic               Ð    &#208;    &ETH;
capital N, tilde                     Ñ    &#209;    &Ntilde;    \~N
capital O, grave accent              Ò    &#210;    &Ograve;    \`O
capital O, acute accent              Ó    &#211;    &Oacute;    \'O
capital O, circumflex accent         Ô    &#212;    &Ocirc;     \^O
capital O, tilde                     Õ    &#213;    &Otilde;    \~O
capital O, dieresis or umlaut mark   Ö    &#214;    &Ouml;      \"O
multiply sign                        ×    &#215;    &times;     $\times$
capital O, slash                     Ø    &#216;    &Oslash;    {\O}
capital U, grave accent              Ù    &#217;    &Ugrave;    \`U
capital U, acute accent              Ú    &#218;    &Uacute;    \'U
capital U, circumflex accent         Û    &#219;    &Ucirc;     \^U
capital U, dieresis or umlaut mark   Ü    &#220;    &Uuml;      \"U
capital Y, acute accent              Ý    &#221;    &Yacute;    \'Y
capital THORN, Icelandic             Þ    &#222;    &THORN;     \TH
small sharp s, German (sz ligature)  ß    &#223;    &szlig;     \ss
small a, grave accent                à    &#224;    &agrave;    \`a
small a, acute accent                á    &#225;    &aacute;    \'a
small a, circumflex accent           â    &#226;    &acirc;     \^a
small a, tilde                       ã    &#227;    &atilde;    \~a
small a, dieresis or umlaut mark     ä    &#228;    &auml;      \"a
small a, ring                        å    &#229;    &aring;     \aa
small ae diphthong (ligature)        æ    &#230;    &aelig;     \ae
small c, cedilla                     ç    &#231;    &ccedil;    \c{c}
small e, grave accent                è    &#232;    &egrave;    \`e
small e, acute accent                é    &#233;    &eacute;    \'e
small e, circumflex accent           ê    &#234;    &ecirc;     \^e
small e, dieresis or umlaut mark     ë    &#235;    &euml;      \"e
small i, grave accent                ì    &#236;    &igrave;    \`i
small i, acute accent                í    &#237;    &iacute;    \'i
small i, circumflex accent           î    &#238;    &icirc;     \^i
small i, dieresis or umlaut mark     ï    &#239;    &iuml;      \"i
small eth, Icelandic                 ð    &#240;    &eth;
small n, tilde                       ñ    &#241;    &ntilde;    \~n
small o, grave accent                ò    &#242;    &ograve;    \`o
small o, acute accent                ó    &#243;    &oacute;    \'o
small o, circumflex accent           ô    &#244;    &ocirc;     \^o
small o, tilde                       õ    &#245;    &otilde;    \~o
small o, dieresis or umlaut mark     ö    &#246;    &ouml;      \"o
division sign                        ÷    &#247;    &divide;    $\div$
small o, slash                       ø    &#248;    &oslash;    {\o}
small u, grave accent                ù    &#249;    &ugrave;    \`u
small u, acute accent                ú    &#250;    &uacute;    \'u
small u, circumflex accent           û    &#251;    &ucirc;     \^u
small u, dieresis or umlaut mark     ü    &#252;    &uuml;      \"u
small y, acute accent                ý    &#253;    &yacute;    \'y
small thorn, Icelandic               þ    &#254;    &thorn;     \th
small y, dieresis or umlaut mark     ÿ    &#255;    &yuml;      \"y
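
Once such a table is in R, one way it might serve as a translation table is as a named character vector, looked up one character at a time. The sketch below is only an illustration: `tab` here is a hypothetical stand-in built by hand from four rows of the listing above (Char and HTML columns only), and `html_map` and `to_html` are made-up names.

```r
# Hypothetical stand-in for the parsed table: four rows copied by hand
# from the listing above (only the Char and HTML columns).
tab <- data.frame(
  char = c("&", "<", ">", "\u00a3"),
  html = c("&amp;", "&lt;", "&gt;", "&pound;"),
  stringsAsFactors = FALSE
)

# Named vector: names are the source characters, values the replacements
html_map <- setNames(tab$html, tab$char)

# Translate one character at a time, so a replacement such as "&lt;" is
# never re-scanned and its "&" is not escaped a second time.
to_html <- function(x, map) {
  vapply(strsplit(x, ""), function(ch) {
    hit <- ch %in% names(map)
    ch[hit] <- map[ch[hit]]
    paste(ch, collapse = "")
  }, character(1))
}

to_html("a < b & c", html_map)   # "a &lt; b &amp; c"
```

The same function would work with the TeX column in place of the HTML column, provided rows with no TeX equivalent are dropped from the map first.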
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
