Re: [R] [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error

2018-05-05 Thread Scott Kostyshak
On Fri, May 04, 2018 at 10:58:26PM +, Ista Zahn wrote:
> On Fri, May 4, 2018 at 4:47 PM, Scott Kostyshak  wrote:
> > I have very little knowledge about file encodings and would like to
> > learn more.
> >
> > I've read the following pages to learn more:
> >
> >   http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
> >   https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
> >   https://developer.r-project.org/Encodings_and_R.html
> >
> > The last one, in particular, has been very helpful. I would be
> > interested in any further references that you suggest.
> >
> > I attach a file that reproduces the issue I would like to learn more
> > about. I do not know if the file encoding will be correctly preserved
> > through email, so I also provide the file (temporarily) on Dropbox here:
> >
> >   https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0
> >
> > The file gives an error when using "source()" with the
> > argument echo = TRUE:
> >
> >   > source("encoding_export_issue.R", echo = TRUE)
> >   Error in nchar(dep, "c") : invalid multibyte string, element 1
> >   In addition: Warning message:
> >   In grepl("^[[:blank:]]*$", dep[1L]) :
> >     input string 1 is invalid in this locale
> >
> > The problem comes from the "á" character in the .R file. The file
> > appears to be encoded as "iso-8859-1":
> >
> >   $ file --mime-encoding encoding_export_issue.R
> >   encoding_export_issue.R: iso-8859-1
> >
> > Note that for me:
> >
> >   > getOption("encoding")
> >   [1] "native.enc"
> >
> > so "native.enc" is used for the "encoding" argument of source().
> >
> > The following two calls succeed:
> >
> >   > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
> >   > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")
> >
> > Is this file a valid "iso-8859-1" encoded file?
> 
> The one you attached is not. The one linked to in dropbox is.
> 
> > Why does source() fail
> > in the case of encoding set to "native.enc"? Is it because of the
> > settings to UTF-8 in my locale (see info on my system at the bottom of
> > this email)?
> 
> Yes.
> 
> >
> > I'm guessing it would be a bad idea to put
> >
> >   options(encoding = "unknown")
> >
> > in my .Rprofile, because it is difficult to always correctly guess the
> > encoding of files?
> 
> My guess is that the issue is less about the difficulty of guessing
> the encoding, and more about the time it takes to do so. That's not
> particularly relevant for the "source" function, but the encoding
> option is used by many of the file IO functions in R and so has
> implications well beyond the behavior of "source".

Ah I did not think about this possibility. Makes sense.

> 
> > Is there a reason why setting it to "unknown" would
> > lead to more problems than leaving it set to "native.enc"?
> 
> It depends on what you are actually doing. If you are on a UTF-8
> locale and working exclusively with UTF-8 files, setting
> options(encoding = "unknown") will just slow down your file IO by
> checking for the encoding every time.

Good to know. Thank you for your response, Ista.

Scott


-- 
Scott Kostyshak
Assistant Professor of Economics
University of Florida
https://people.clas.ufl.edu/skostyshak/

> >
> > I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
> > is my session info and locale info for my system with the 3.4.3 version:
> >
> >> sessionInfo()
> > R version 3.4.3 (2017-11-30)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: Ubuntu 16.04.3 LTS
> >
> > Matrix products: default
> > BLAS: /usr/lib/libblas/libblas.so.3.6.0
> > LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
> >
> > locale:
> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] 

Re: [R] [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error

2018-05-04 Thread Ista Zahn
On Fri, May 4, 2018 at 4:47 PM, Scott Kostyshak  wrote:
> I have very little knowledge about file encodings and would like to
> learn more.
>
> I've read the following pages to learn more:
>
>   http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
>   https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
>   https://developer.r-project.org/Encodings_and_R.html
>
> The last one, in particular, has been very helpful. I would be
> interested in any further references that you suggest.
>
> I attach a file that reproduces the issue I would like to learn more
> about. I do not know if the file encoding will be correctly preserved
> through email, so I also provide the file (temporarily) on Dropbox here:
>
>   https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0
>
> The file gives an error when using "source()" with the
> argument echo = TRUE:
>
>   > source("encoding_export_issue.R", echo = TRUE)
>   Error in nchar(dep, "c") : invalid multibyte string, element 1
>   In addition: Warning message:
>   In grepl("^[[:blank:]]*$", dep[1L]) :
>     input string 1 is invalid in this locale
>
> The problem comes from the "á" character in the .R file. The file
> appears to be encoded as "iso-8859-1":
>
>   $ file --mime-encoding encoding_export_issue.R
>   encoding_export_issue.R: iso-8859-1
>
> Note that for me:
>
>   > getOption("encoding")
>   [1] "native.enc"
>
> so "native.enc" is used for the "encoding" argument of source().
>
> The following two calls succeed:
>
>   > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
>   > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")
>
> Is this file a valid "iso-8859-1" encoded file?

The one you attached is not. The one linked to in dropbox is.

> Why does source() fail
> in the case of encoding set to "native.enc"? Is it because of the
> settings to UTF-8 in my locale (see info on my system at the bottom of
> this email)?

Yes.
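
To make that concrete, here is a minimal sketch, assuming the file stores
"á" as the single ISO-8859-1 byte 0xE1 (which is what the `file` output
above suggests). The variable names are made up for illustration. In UTF-8,
0xE1 opens a three-byte sequence, so a UTF-8 locale sees a malformed string
until the real encoding is declared:

```r
# "Chávez" as ISO-8859-1 (Latin-1) bytes; 0xE1 is "á" in that encoding.
latin1_bytes <- as.raw(c(0x43, 0x68, 0xe1, 0x76, 0x65, 0x7a))
x <- rawToChar(latin1_bytes)

# Read as UTF-8, 0xE1 must be followed by two continuation bytes.
# Here it is followed by 0x76 ("v"), so the string is not valid UTF-8.
validUTF8(x)   # FALSE

# Declaring the source encoding and converting yields valid UTF-8,
# where "á" becomes the two bytes 0xC3 0xA1.
y <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(y)   # TRUE
charToRaw(y)
```

This is presumably what nchar() and grepl() are complaining about in the
error above: with "native.enc" the bytes are taken as-is, and in a UTF-8
locale they do not form valid characters.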

>
> I'm guessing it would be a bad idea to put
>
>   options(encoding = "unknown")
>
> in my .Rprofile, because it is difficult to always correctly guess the
> encoding of files?

My guess is that the issue is less about the difficulty of guessing
the encoding, and more about the time it takes to do so. That's not
particularly relevant for the "source" function, but the encoding
option is used by many of the file IO functions in R and so has
implications well beyond the behavior of "source".

> Is there a reason why setting it to "unknown" would
> lead to more problems than leaving it set to "native.enc"?

It depends on what you are actually doing. If you are on a UTF-8
locale and working exclusively with UTF-8 files, setting
options(encoding = "unknown") will just slow down your file IO by
checking for the encoding every time.
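
If the goal is to leave the global option alone, one alternative is to
convert the script to UTF-8 once, so that the default "native.enc" works in
a UTF-8 locale. The following is only a sketch: it assumes the input really
is ISO-8859-1, and latin1_file_to_utf8() is a hypothetical helper, not
something that ships with R. The demo builds its own temporary file from raw
bytes rather than relying on the attachment:

```r
# Hypothetical helper (not part of base R): re-encode a Latin-1 file as UTF-8.
latin1_file_to_utf8 <- function(from_path, to_path) {
  # Read the file as raw bytes so no locale-dependent conversion happens.
  bytes <- readBin(from_path, what = "raw", n = file.size(from_path))
  # Interpret the bytes as Latin-1 and convert them to UTF-8.
  text <- iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")
  # Write the UTF-8 bytes back out unchanged.
  writeBin(charToRaw(text), to_path)
  invisible(to_path)
}

# Demo input: a two-line script with "á" stored as the Latin-1 byte 0xE1.
src <- tempfile(fileext = ".R")
dst <- tempfile(fileext = ".R")
writeBin(
  c(charToRaw("# Ch"), as.raw(0xe1), charToRaw("vez\nquantile_type <- 4\n")),
  src
)

latin1_file_to_utf8(src, dst)

# The converted file is valid UTF-8 and one byte longer ("á" is now 2 bytes).
converted <- readBin(dst, what = "raw", n = file.size(dst))
validUTF8(rawToChar(converted))
file.size(dst) - file.size(src)
```

For a one-off file, passing encoding = "iso-8859-1" to source() as in the
original post is of course simpler; converting only pays off when the same
file is read repeatedly or by several functions.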
>
> I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
> is my session info and locale info for my system with the 3.4.3 version:
>
>> sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.3 LTS
>
> Matrix products: default
> BLAS: /usr/lib/libblas/libblas.so.3.6.0
> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.3
>
>> Sys.getlocale()
> [1] 
> "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
>
> Thanks for your time,
>
> Scott
>
> P.S. Note that I had posted this question to r-devel, which was the
> incorrect choice. For archival purposes, I reference the thread here:
>
> https://www.mail-archive.com/search?l=mid=20180501185750.445oub53vcdnyyyx%40steph
>
>
> --
> Scott Kostyshak
> Assistant Professor of Economics
> University of Florida
> https://people.clas.ufl.edu/skostyshak/
>
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


[R] [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error

2018-05-04 Thread Scott Kostyshak
I have very little knowledge about file encodings and would like to
learn more.

I've read the following pages to learn more:

  http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
  https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
  https://developer.r-project.org/Encodings_and_R.html

The last one, in particular, has been very helpful. I would be
interested in any further references that you suggest.

I attach a file that reproduces the issue I would like to learn more
about. I do not know if the file encoding will be correctly preserved
through email, so I also provide the file (temporarily) on Dropbox here:

  https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0

The file gives an error when using "source()" with the
argument echo = TRUE:

  > source("encoding_export_issue.R", echo = TRUE)
  Error in nchar(dep, "c") : invalid multibyte string, element 1
  In addition: Warning message:
  In grepl("^[[:blank:]]*$", dep[1L]) :
    input string 1 is invalid in this locale

The problem comes from the "á" character in the .R file. The file
appears to be encoded as "iso-8859-1":

  $ file --mime-encoding encoding_export_issue.R 
  encoding_export_issue.R: iso-8859-1

Note that for me:

  > getOption("encoding")
  [1] "native.enc"

so "native.enc" is used for the "encoding" argument of source().

The following two calls succeed:

  > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
  > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")

Is this file a valid "iso-8859-1" encoded file?  Why does source() fail
in the case of encoding set to "native.enc"? Is it because of the
settings to UTF-8 in my locale (see info on my system at the bottom of
this email)?

I'm guessing it would be a bad idea to put

  options(encoding = "unknown")

in my .Rprofile, because it is difficult to always correctly guess the
encoding of files? Is there a reason why setting it to "unknown" would
lead to more problems than leaving it set to "native.enc"?

I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
is my session info and locale info for my system with the 3.4.3 version:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

loaded via a namespace (and not attached):
[1] compiler_3.4.3

> Sys.getlocale()
[1] 
"LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

Thanks for your time,

Scott

P.S. Note that I had posted this question to r-devel, which was the
incorrect choice. For archival purposes, I reference the thread here:

https://www.mail-archive.com/search?l=mid=20180501185750.445oub53vcdnyyyx%40steph


-- 
Scott Kostyshak
Assistant Professor of Economics
University of Florida
https://people.clas.ufl.edu/skostyshak/

# Ch?vez
quantile_type <- 4
