UTF-8 is treated specially by readLines(), originally to allow for UTF-8 strings on Windows. See the NEWS for 2.12.0.

That is not the case for encoding = "latin1".

If you have a Latin-1 file in a UTF-8 locale, then

readLines(x, encoding = "latin1")

stores the strings in Latin-1 and marks them, and

readLines(file(x, encoding = "latin1"))

translates the strings to UTF-8 and marks them as such.
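
As a minimal sketch of the difference, assuming a small throwaway Latin-1 file (the temporary file and its contents are illustrative, not from the original question), the two calls can be compared by looking at the encoding marks and the stored bytes:

```r
# Illustrative Latin-1 file: "caf" + byte 0xE9 ("e" with acute accent in Latin-1) + newline.
tmp <- tempfile()
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), tmp)

# (1) 'encoding' argument of readLines(): bytes are kept as-is and marked.
a <- readLines(tmp, encoding = "latin1")

# (2) 'encoding' argument of file(): input is re-encoded while reading.
con <- file(tmp, encoding = "latin1")
b <- readLines(con)
close(con)

Encoding(a)    # mark on the first result
Encoding(b)    # mark on the second result
charToRaw(a)   # still the single Latin-1 byte e9
charToRaw(b)   # the two-byte UTF-8 sequence c3 a9
```

Both results print the same way in a UTF-8 locale; only the internal representation differs.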

There can be advantages to the first, including speed and less storage space. There can also be advantages to the second (e.g. translating once may be better if the strings are to be manipulated by character-level functions).

Prior to 2.12.0 there were differences for UTF-8 files, and even now
readLines(x, encoding="UTF-8") is more convenient (it leaves no connection open, as your first example will).
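
For the UTF-8 case, a rough sketch along the same lines (again using a made-up temporary file) would be the following. If the connection form is used, keeping a handle on the connection allows it to be closed explicitly:

```r
# Illustrative UTF-8 file: "caf" + the two bytes c3 a9 + newline.
utf8_file <- tempfile()
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9, 0x0a)), utf8_file)

# Path plus 'encoding': readLines() opens and closes the file itself.
u1 <- readLines(utf8_file, encoding = "UTF-8")

# Connection with 'encoding': keep the object so it can be closed afterwards.
con <- file(utf8_file, encoding = "UTF-8")
u2 <- readLines(con)
close(con)

Encoding(u1)
Encoding(u2)
identical(u1, u2)
```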

On 05/04/2014 11:54, Milan Bouchet-Valat wrote:
Hi!

I'm wondering what's the use of the 'encoding' argument to readLines(x),
as opposed to readLines(file(x, encoding=)). The same question applies
to read.table()'s 'encoding' vs 'fileEncoding' arguments. AFAIK only the
latter is able to re-encode the read text into the internal
representation used by R (let's say when reading files in encodings
other than latin1 and UTF-8). But then what's the purpose of the former?

?readLines says:
encoding: encoding to be assumed for input strings.  It is used to mark
           character strings as known to be in Latin-1 or UTF-8: it is
           not used to re-encode the input.  To do the latter, specify
           the encoding as part of the connection ‘con’ or via
           ‘options(encoding=)’: see the example under ‘file’.

But if I have a UTF-8 text file to read, couldn't I use
readLines(file(x, encoding="UTF-8"))
instead of
readLines(x, encoding="UTF-8")

In my experience resulting character strings are marked as UTF-8 where
needed as well.

The reason I'm asking this is because I need to decide whether I should
allow users of a tm source plug-in to pass both (à la 'encoding' vs
'fileEncoding') or whether I could safely skip the first one.


Thanks for your help

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
