Re: [Rd] Native characterset is wrong for unicode builds for Windows

2015-02-27 Thread maill...@tlink.de

On 27.02.2015 at 11:49, Duncan Murdoch wrote:

On 27/02/2015 2:31 AM, maill...@tlink.de wrote:

On 27.02.2015 at 03:13, Duncan Murdoch wrote:

On 26/02/2015 6:34 PM, maill...@tlink.de wrote:

On 26/02/2015 3:09 PM, maill...@tlink.de wrote:

When I send some outlandish characters through enc2native (or format) in
R 3.1.2 on Ubuntu trusty it works quite well:

> "®ØΔЊת"
[1] "®ØΔЊת"
> enc2native("®ØΔЊת")
[1] "®ØΔЊת"
> Encoding(enc2native("®ØΔЊת"))
[1] "UTF-8"

In Windows the result is different:

> "®ØΔЊת"
[1] "®ØΔЊת"
> enc2native("®ØΔЊת")
[1] "®Ø<U+0394><U+040A><U+05EA>"
> Encoding(enc2native("®ØΔЊת"))
[1] "latin1"

And this is wrong. The native character set of a unicode application
under Windows is *Unicode*. enc2native should do the same under Windows
as it does on Ubuntu. Also the unknown encoding should be changed to
mean the same as UTF-8 exactly as it is on Linux.

What is a unicode application, and what makes you think R is one?  R
is being told by Windows that your native encoding is latin1.  Perhaps
Windows 8 supports UTF-8 as a native encoding (I've never used it), but
previous versions of Windows didn't.

Duncan Murdoch


A unicode application is a program that uses the unicode API of Windows

R uses those functions, so I guess it is a unicode application.  But
internally it uses an 8 bit encoding (normally the native one for the
platform it is running on, which in your case is apparently latin1).


- the functions with the ending W. For such an application the system
code page (native encoding) is completely irrelevant. The system code
page is just a compatibility feature that enables Windows NT/Vista/7/8
to run applications that were developed for Windows 95 which didn't have
unicode support.

Windows 95 had UCS-2 support, which was pretty close to UTF-16.

But this line of operating systems has been dead for 10 years

now. R obviously is a unicode application because it can print - or read
from the clipboard - characters like ΔЊת that are not in my system
code page which is not possible over the legacy API.

So unicode application is something you just made up.

If you use Windows development tools, they have macros to convert
generic functions to either A or W versions.  R doesn't use those.  It
calls the W functions when it has UTF-16 characters, and A functions
when it has native characters.  I would love it if R was a UTF-8
application, because it would make life so much simpler, but Windows
doesn't support that.  So R needs to do tons of conversions.  If you
don't like that, you probably need to stick with Ubuntu.

Duncan Murdoch


I am not complaining about those conversions. They work just fine
already. I am complaining about
enc2native breaking things in the windows builds. An assignment like

s <- format("®ØΔЊת")

has no interaction with Windows at all, yet s contains garbage like
"®Ø<U+0394><U+040A><U+05EA>"
after that. And if a native encoding of UTF-8 - as defined by enc2native
- works in Ubuntu why shouldn't it work
in Windows?

Because in Ubuntu, UTF-8 is the native encoding, and in your Windows
system, latin1 is the native encoding.

But I do agree that the format() issue is a problem.  I haven't traced
through the code, but I think the string ®ØΔЊת is read using Windows
API functions that return a UTF-16 result, then converted by R to UTF-8.
  So format() should see that it is a UTF-8 string and not convert it to
the native encoding.  There is nothing wrong with enc2native(), it's
doing what you asked for.  The problem is that format() is using it.

Duncan Murdoch


I would expect every function that uses enc2native to be broken in
this respect, because it will invariably scramble most Unicode characters
in the process, and I can't think of a case where that would actually be
wanted.
Functions that really need something other than UTF-8 are probably using
iconv and getOption("encoding") anyway, as this allows the desired
encoding to be specified much more flexibly.
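
As a rough illustration of that last point, here is a minimal sketch of converting with an explicit target encoding via iconv(). The target "latin1" and the substitution character "?" are assumptions chosen for the example, not anything R uses internally, and the exact result for unconvertible characters depends on the platform's iconv.

```r
# The test string from this thread, written with \u escapes so the
# example does not depend on the encoding of the source file.
x <- "\u00AE\u00D8\u0394\u040A\u05EA"

# Default encoding for connections; "native.enc" unless it was changed.
getOption("encoding")

# Hypothetical target encoding, chosen only for illustration.
target <- "latin1"

# sub = "?" asks iconv() to substitute for input it cannot convert
# instead of returning NA for the whole string.
converted <- iconv(x, from = "UTF-8", to = target, sub = "?")
print(converted)
```

Unlike enc2native, the caller decides here both the target encoding and what happens to characters that do not fit into it.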


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Native characterset is wrong for unicode builds for Windows

2015-02-27 Thread Duncan Murdoch
On 27/02/2015 2:31 AM, maill...@tlink.de wrote:
 On 27.02.2015 at 03:13, Duncan Murdoch wrote:
 On 26/02/2015 6:34 PM, maill...@tlink.de wrote:
 On 26/02/2015 3:09 PM, maill...@tlink.de wrote:
 When I send some outlandish characters through enc2native (or format) in
 R 3.1.2 on Ubuntu trusty it works quite well:

 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®ØΔЊת"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "UTF-8"

 In Windows the result is different:

 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®Ø<U+0394><U+040A><U+05EA>"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "latin1"

 And this is wrong. The native character set of a unicode application
 under Windows is *Unicode*. enc2native should do the same under Windows
 as it does on Ubuntu. Also the unknown encoding should be changed to
 mean the same as UTF-8 exactly as it is on Linux.
 What is a unicode application, and what makes you think R is one?  R
 is being told by Windows that your native encoding is latin1.  Perhaps
 Windows 8 supports UTF-8 as a native encoding (I've never used it), but
 previous versions of Windows didn't.

 Duncan Murdoch

 A unicode application is a program that uses the unicode API of Windows
 R uses those functions, so I guess it is a unicode application.  But
 internally it uses an 8 bit encoding (normally the native one for the
 platform it is running on, which in your case is apparently latin1).

 - the functions with the ending W. For such an application the system
 code page (native encoding) is completely irrelevant. The system code
 page is just a compatibility feature that enables Windows NT/Vista/7/8
 to run applications that were developed for Windows 95 which didn't have
 unicode support.
 Windows 95 had UCS-2 support, which was pretty close to UTF-16.

 But this line of operating systems has been dead for 10 years
 now. R obviously is a unicode application because it can print - or read
 from the clipboard - characters like ΔЊת that are not in my system
 code page which is not possible over the legacy API.
 So unicode application is something you just made up.

 If you use Windows development tools, they have macros to convert
 generic functions to either A or W versions.  R doesn't use those.  It
 calls the W functions when it has UTF-16 characters, and A functions
 when it has native characters.  I would love it if R was a UTF-8
 application, because it would make life so much simpler, but Windows
 doesn't support that.  So R needs to do tons of conversions.  If you
 don't like that, you probably need to stick with Ubuntu.

 Duncan Murdoch

 
 I am not complaining about those conversions. They work just fine 
 already. I am complaining about
 enc2native breaking things in the windows builds. An assignment like
 
 s <- format("®ØΔЊת")
 
 has no interaction with Windows at all, yet s contains garbage like
 "®Ø<U+0394><U+040A><U+05EA>"
 after that. And if a native encoding of UTF-8 - as defined by enc2native 
 - works in Ubuntu why shouldn't it work
 in Windows?

Because in Ubuntu, UTF-8 is the native encoding, and in your Windows
system, latin1 is the native encoding.

But I do agree that the format() issue is a problem.  I haven't traced
through the code, but I think the string ®ØΔЊת is read using Windows
API functions that return a UTF-16 result, then converted by R to UTF-8.
 So format() should see that it is a UTF-8 string and not convert it to
the native encoding.  There is nothing wrong with enc2native(), it's
doing what you asked for.  The problem is that format() is using it.
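
To see what R itself reports here, one can inspect the declared encoding of a string and the native encoding of the session. This is only a sketch: the round-trip check at the end is an illustration added for this example, and its result depends on the locale R is running in.

```r
# The test string from this thread, written with \u escapes.
x <- "\u00AE\u00D8\u0394\u040A\u05EA"

# Declared encoding of the string: "UTF-8" for non-ASCII \u escapes.
Encoding(x)

# Is the session's native encoding UTF-8? This is locale dependent.
native_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

# enc2native() converts to the native encoding. Characters that encoding
# cannot represent are replaced, so the round trip back to UTF-8 only
# succeeds when the native encoding covers every character in x.
round_trip_ok <- identical(enc2utf8(enc2native(x)), x)

c(native_is_utf8 = native_is_utf8, round_trip_ok = round_trip_ok)
```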

Duncan Murdoch



Re: [Rd] Native characterset is wrong for unicode builds for Windows

2015-02-26 Thread maill...@tlink.de

On 27.02.2015 at 03:13, Duncan Murdoch wrote:

On 26/02/2015 6:34 PM, maill...@tlink.de wrote:

On 26/02/2015 3:09 PM, maill...@tlink.de wrote:

When I send some outlandish characters through enc2native (or format) in
R 3.1.2 on Ubuntu trusty it works quite well:

> "®ØΔЊת"
[1] "®ØΔЊת"
> enc2native("®ØΔЊת")
[1] "®ØΔЊת"
> Encoding(enc2native("®ØΔЊת"))
[1] "UTF-8"

In Windows the result is different:

> "®ØΔЊת"
[1] "®ØΔЊת"
> enc2native("®ØΔЊת")
[1] "®Ø<U+0394><U+040A><U+05EA>"
> Encoding(enc2native("®ØΔЊת"))
[1] "latin1"

And this is wrong. The native character set of a unicode application
under Windows is *Unicode*. enc2native should do the same under Windows
as it does on Ubuntu. Also the unknown encoding should be changed to
mean the same as UTF-8 exactly as it is on Linux.

What is a unicode application, and what makes you think R is one?  R
is being told by Windows that your native encoding is latin1.  Perhaps
Windows 8 supports UTF-8 as a native encoding (I've never used it), but
previous versions of Windows didn't.

Duncan Murdoch


A unicode application is a program that uses the unicode API of Windows

R uses those functions, so I guess it is a unicode application.  But
internally it uses an 8 bit encoding (normally the native one for the
platform it is running on, which in your case is apparently latin1).


- the functions with the ending W. For such an application the system
code page (native encoding) is completely irrelevant. The system code
page is just a compatibility feature that enables Windows NT/Vista/7/8
to run applications that were developed for Windows 95 which didn't have
unicode support.

Windows 95 had UCS-2 support, which was pretty close to UTF-16.

But this line of operating systems has been dead for 10 years

now. R obviously is a unicode application because it can print - or read
from the clipboard - characters like ΔЊת that are not in my system
code page which is not possible over the legacy API.

So unicode application is something you just made up.

If you use Windows development tools, they have macros to convert
generic functions to either A or W versions.  R doesn't use those.  It
calls the W functions when it has UTF-16 characters, and A functions
when it has native characters.  I would love it if R was a UTF-8
application, because it would make life so much simpler, but Windows
doesn't support that.  So R needs to do tons of conversions.  If you
don't like that, you probably need to stick with Ubuntu.

Duncan Murdoch



I am not complaining about those conversions. They work just fine
already. I am complaining about enc2native breaking things in the
Windows builds. An assignment like

s <- format("®ØΔЊת")

has no interaction with Windows at all, yet s contains garbage like
"®Ø<U+0394><U+040A><U+05EA>"
after that. And if a native encoding of UTF-8 - as defined by enc2native
- works in Ubuntu, why shouldn't it work in Windows?



Re: [Rd] Native characterset is wrong for unicode builds for Windows

2015-02-26 Thread maill...@tlink.de
On 26.02.2015 at 23:44, Winston Chang wrote:
 On Thu, Feb 26, 2015 at 2:09 PM, maill...@tlink.de wrote:


 When I send some outlandish characters through enc2native (or
 format) in R 3.1.2 on Ubuntu trusty it works quite well:

 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®ØΔЊת"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "UTF-8"

 In Windows the result is different:

 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®Ø<U+0394><U+040A><U+05EA>"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "latin1"

 And this is wrong. The native character set of a unicode
 application under Windows is *Unicode*. enc2native should do the
 same under Windows as it does on Ubuntu. Also the unknown
 encoding should be changed to mean the same as UTF-8 exactly as
 it is on Linux.


 I think you're mixing up the term character set with the encoding 
 for a character set. Unicode is a character set. UTF-8 is one of many 
 encodings of Unicode.

 UTF-8 may be the native character encoding in Ubuntu, but it's not the 
 native encoding in Windows. According to this, what counts as the 
 native encoding in Windows depends on the code page:
 http://stackoverflow.com/a/4649507

 So you shouldn't expect enc2native to do the same thing on Linux and 
 Windows. If you really want UTF-8, you can use enc2utf8.

 -Winston

Maybe I'm expecting too much, but I would rather R did not produce garbage
like "®Ø<U+0394><U+040A><U+05EA>". And while I can use enc2utf8 to
convert from UTF-8 to UTF-8, this does not fix the many places - like
format - where enc2native is used and that are broken because of this.





Re: [Rd] Native characterset is wrong for unicode builds for Windows

2015-02-26 Thread Winston Chang
On Thu, Feb 26, 2015 at 2:09 PM, maill...@tlink.de wrote:


 When I send some outlandish characters through enc2native (or format) in R
 3.1.2 on Ubuntu trusty it works quite well:

 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®ØΔЊת"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "UTF-8"

 In Windows the result is different:

 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®Ø<U+0394><U+040A><U+05EA>"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "latin1"

 And this is wrong. The native character set of a unicode application under
 Windows is *Unicode*. enc2native should do the same under Windows as it
 does on Ubuntu. Also the unknown encoding should be changed to mean the
 same as UTF-8 exactly as it is on Linux.


I think you're mixing up the term character set with the encoding for a
character set. Unicode is a character set. UTF-8 is one of many encodings
of Unicode.

UTF-8 may be the native character encoding in Ubuntu, but it's not the
native encoding in Windows. According to this, what counts as the native
encoding in Windows depends on the code page:
  http://stackoverflow.com/a/4649507

So you shouldn't expect enc2native to do the same thing on Linux and
Windows. If you really want UTF-8, you can use enc2utf8.
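
The distinction can be made concrete in R. A minimal sketch, using the test string from this thread: utf8ToInt() gives the Unicode code points (the character set), while charToRaw() shows the bytes of one particular encoding of them, here UTF-8.

```r
# The test string from this thread, written with \u escapes and
# explicitly converted to UTF-8.
x <- enc2utf8("\u00AE\u00D8\u0394\u040A\u05EA")

# Unicode code points: these do not depend on any encoding.
code_points <- utf8ToInt(x)
sprintf("U+%04X", code_points)

# The same five characters as UTF-8 bytes; each of these characters
# happens to take two bytes in UTF-8.
utf8_bytes <- charToRaw(x)
length(utf8_bytes)
```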

-Winston



Re: [Rd] Native characterset is wrong for unicode builds for Windows

2015-02-26 Thread Duncan Murdoch
On 26/02/2015 3:09 PM, maill...@tlink.de wrote:
 
 When I send some outlandish characters through enc2native (or format) in 
 R 3.1.2 on Ubuntu trusty it works quite well:
 
 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®ØΔЊת"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "UTF-8"
 
 In Windows the result is different:
 
 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®Ø<U+0394><U+040A><U+05EA>"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "latin1"
 
 And this is wrong. The native character set of a unicode application 
 under Windows is *Unicode*. enc2native should do the same under Windows 
 as it does on Ubuntu. Also the unknown encoding should be changed to 
 mean the same as UTF-8 exactly as it is on Linux.

What is a unicode application, and what makes you think R is one?  R
is being told by Windows that your native encoding is latin1.  Perhaps
Windows 8 supports UTF-8 as a native encoding (I've never used it), but
previous versions of Windows didn't.

Duncan Murdoch



Re: [Rd] Native characterset is wrong for unicode builds for Windows

2015-02-26 Thread maill...@tlink.de

On 26/02/2015 3:09 PM, maill...@tlink.de wrote:

When I send some outlandish characters through enc2native (or format) in
R 3.1.2 on Ubuntu trusty it works quite well:

> "®ØΔЊת"
[1] "®ØΔЊת"
> enc2native("®ØΔЊת")
[1] "®ØΔЊת"
> Encoding(enc2native("®ØΔЊת"))
[1] "UTF-8"

In Windows the result is different:

> "®ØΔЊת"
[1] "®ØΔЊת"
> enc2native("®ØΔЊת")
[1] "®Ø<U+0394><U+040A><U+05EA>"
> Encoding(enc2native("®ØΔЊת"))
[1] "latin1"

And this is wrong. The native character set of a unicode application
under Windows is *Unicode*. enc2native should do the same under Windows
as it does on Ubuntu. Also the unknown encoding should be changed to
mean the same as UTF-8 exactly as it is on Linux.

What is a unicode application, and what makes you think R is one?  R
is being told by Windows that your native encoding is latin1.  Perhaps
Windows 8 supports UTF-8 as a native encoding (I've never used it), but
previous versions of Windows didn't.

Duncan Murdoch

A unicode application is a program that uses the unicode API of Windows
- the functions with the ending W. For such an application the system
code page (native encoding) is completely irrelevant. The system code
page is just a compatibility feature that enables Windows NT/Vista/7/8
to run applications that were developed for Windows 95, which didn't have
unicode support. But this line of operating systems has been dead for 10
years now. R obviously is a unicode application because it can print - or
read from the clipboard - characters like ΔЊת that are not in my system
code page, which is not possible over the legacy API.


Neither the unicode API nor the legacy API accepts UTF-8. The legacy API
needs strings encoded according to the active code page and the unicode
API wants UTF-16. If you have UTF-8 you need to convert it either to
the active code page, which will lose all characters that are not
covered by it, or to UTF-16 and use the unicode functions. But
this is not the problem; the Windows interface functions of R are
working quite nicely with unicode already.
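
The two conversion paths can be mimicked from within R using iconv(). This is just a sketch of the idea, not of what R's Windows interface code actually does; "UTF-16LE" is assumed to be a name the platform's iconv accepts.

```r
# The test string from this thread, written with \u escapes.
x <- "\u00AE\u00D8\u0394\u040A\u05EA"

# Path 1: convert to UTF-16, as the W functions expect. Nothing is lost,
# so converting back yields the original string.
utf16 <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
back  <- iconv(list(utf16), from = "UTF-16LE", to = "UTF-8")
identical(back, x)

# Path 2: convert to a code page such as latin1, as the A functions
# expect. Characters outside the code page cannot be represented; with
# the default sub = NA, iconv() typically returns NA for such input.
iconv(x, from = "UTF-8", to = "latin1")
```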




Re: [Rd] Native characterset is wrong for unicode builds for Windows

2015-02-26 Thread Duncan Murdoch
On 26/02/2015 6:34 PM, maill...@tlink.de wrote:
 On 26/02/2015 3:09 PM, maill...@tlink.de wrote:
 When I send some outlandish characters through enc2native (or format) in
 R 3.1.2 on Ubuntu trusty it works quite well:

 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®ØΔЊת"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "UTF-8"

 In Windows the result is different:

 > "®ØΔЊת"
 [1] "®ØΔЊת"
 > enc2native("®ØΔЊת")
 [1] "®Ø<U+0394><U+040A><U+05EA>"
 > Encoding(enc2native("®ØΔЊת"))
 [1] "latin1"

 And this is wrong. The native character set of a unicode application
 under Windows is *Unicode*. enc2native should do the same under Windows
 as it does on Ubuntu. Also the unknown encoding should be changed to
 mean the same as UTF-8 exactly as it is on Linux.
 What is a unicode application, and what makes you think R is one?  R
 is being told by Windows that your native encoding is latin1.  Perhaps
 Windows 8 supports UTF-8 as a native encoding (I've never used it), but
 previous versions of Windows didn't.

 Duncan Murdoch

 A unicode application is a program that uses the unicode API of Windows 

R uses those functions, so I guess it is a unicode application.  But
internally it uses an 8 bit encoding (normally the native one for the
platform it is running on, which in your case is apparently latin1).

 - the functions with the ending W. For such an application the system
 code page (native encoding) is completely irrelevant. The system code 
 page is just a compatibility feature that enables Windows NT/Vista/7/8 
 to run applications that were developed for Windows 95 which didn't have 
 unicode support. 

Windows 95 had UCS-2 support, which was pretty close to UTF-16.

But this line of operating systems has been dead for 10 years
 now. R obviously is a unicode application because it can print - or read 
 from the clipboard - characters like ΔЊת that are not in my system 
 code page which is not possible over the legacy API.

So unicode application is something you just made up.

If you use Windows development tools, they have macros to convert
generic functions to either A or W versions.  R doesn't use those.  It
calls the W functions when it has UTF-16 characters, and A functions
when it has native characters.  I would love it if R was a UTF-8
application, because it would make life so much simpler, but Windows
doesn't support that.  So R needs to do tons of conversions.  If you
don't like that, you probably need to stick with Ubuntu.

Duncan Murdoch

 
 Neither the unicode API nor the legacy API accepts UTF-8. The legacy API
 needs strings encoded according to the active code page and the unicode
 API wants UTF-16. If you have UTF-8 you need to convert it either to
 the active code page, which will lose all characters that are not
 covered by it, or to UTF-16 and use the unicode functions. But
 this is not the problem; the Windows interface functions of R are
 working quite nicely with unicode already.


 