Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-22 Thread Lawrence D’Oliveiro
On Thursday, June 23, 2016 at 2:02:18 PM UTC+12, Rustom Mody wrote:

> So I remembered that there is one method -- yes, clunky -- that I use most and
> forgot to mention -- C-x 8 RET,
> i.e. insert-char¹
> 
> which takes the name (or hex) of the Unicode char.

A handy tool for looking up names and codes is the unicode(1) command.
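If the unicode(1) tool isn't installed, the same lookup can be done from Python's standard library -- a small sketch using unicodedata:

```python
import unicodedata

# Name -> character (the exact official name is required)...
ne = unicodedata.lookup("NOT EQUAL TO")
print(ne, hex(ord(ne)))            # ≠ 0x2260

# ...and character -> name.
print(unicodedata.name("≠"))       # NOT EQUAL TO
```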
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-22 Thread Rustom Mody
On Tuesday, June 21, 2016 at 7:27:00 PM UTC+5:30, Rustom Mody wrote:
> >https://wiki.archlinux.org/index.php/Keyboard_configuration_in_Xorg -- no good

You probably want this:
https://wiki.archlinux.org/index.php/X_KeyBoard_extension#Editing_the_layout

> > So Rustom, how do *you* produce, say, Hebrew or Spanish text, or your
> > favorite math symbols?
> 
> I wish I could say I have a good answer -- at the moment I don't
> However, some ½-assed ones:
> 
> 
> Emacs:
> set-input-method (C-x RET C-\) greek
> And then typing
> abcdefghijklmnopqrstuvwxyz
> gives
> αβψδεφγηιξκλμνοπ;ρστθωςχυζ
> [yeah that ; on q is curious]
> 
> Spanish?? No idea
> But there seems to be a Spanish input method that
> has these: éóñá¿
> 
> I've typed Hindi/Marathi/Tamil/Sanskrit/Gujarati and helped others with Bengali
> using the devanagari-itrans/gujarati-itrans/tamil-itrans/bengali-itrans input
> methods. There are also the corresponding -inscript methods for those who
> type these fluently -- I am not one of them.
> 
> I have some 15-20 lines of elisp that make these itrans methods easier (for me)
... etc

A couple of people wrote me off-list thanking me for the Emacs-Unicode know-how.

  

So I remembered that there is one method -- yes, clunky -- that I use most and
forgot to mention -- C-x 8 RET,
i.e. insert-char¹

which takes the name (or hex) of the Unicode char.
The nice thing is that some amount of Tab (and *) completion is available, which makes
it possible to fish around for chars knowing/remembering only part of the name.

So, with ↹ standing for TAB²:
Superscr↹
expands to
SUPERSCRIPT
One more ↹ gives
==
Click on a completion to select it.
In this buffer, type RET to select the completion near point.

Possible completions are:
SUPERSCRIPT CLOSING PARENTHESIS   SUPERSCRIPT DIGIT EIGHT
SUPERSCRIPT DIGIT FIVE            SUPERSCRIPT DIGIT FOUR
SUPERSCRIPT DIGIT NINE            SUPERSCRIPT DIGIT ONE
SUPERSCRIPT DIGIT SEVEN           SUPERSCRIPT DIGIT SIX
SUPERSCRIPT DIGIT THREE           SUPERSCRIPT DIGIT TWO
SUPERSCRIPT DIGIT ZERO            SUPERSCRIPT EIGHT
SUPERSCRIPT EQUALS SIGN           SUPERSCRIPT FIVE
SUPERSCRIPT FOUR                  SUPERSCRIPT HYPHEN-MINUS
SUPERSCRIPT LATIN SMALL LETTER I  SUPERSCRIPT LATIN SMALL LETTER N
SUPERSCRIPT LEFT PARENTHESIS      SUPERSCRIPT MINUS
SUPERSCRIPT NINE                  SUPERSCRIPT ONE
SUPERSCRIPT OPENING PARENTHESIS   SUPERSCRIPT PLUS SIGN
SUPERSCRIPT RIGHT PARENTHESIS     SUPERSCRIPT SEVEN
SUPERSCRIPT SIX                   SUPERSCRIPT THREE
SUPERSCRIPT TWO                   SUPERSCRIPT ZERO

Adding a d narrows to
SUPERSCRIPT DIGIT

One more ↹ narrows to
Possible completions are:
SUPERSCRIPT DIGIT EIGHT SUPERSCRIPT DIGIT FIVE  SUPERSCRIPT DIGIT FOUR
SUPERSCRIPT DIGIT NINE  SUPERSCRIPT DIGIT ONE   SUPERSCRIPT DIGIT SEVEN
SUPERSCRIPT DIGIT SIX   SUPERSCRIPT DIGIT THREE SUPERSCRIPT DIGIT TWO
SUPERSCRIPT DIGIT ZERO

A * can also be used as a glob for the parts of the name one does not remember.
So, since there are zillions of chars that are some kind of ARROW,
one can write Right*arrow↹
Still too many.
Narrow further to Right*Double*Arrow↹
And we get:

Possible completions are:
RIGHT DOUBLE ARROW                  RIGHT DOUBLE ARROW WITH ROUNDED HEAD
RIGHT DOUBLE ARROW WITH STROKE      RIGHTWARDS DOUBLE ARROW
RIGHTWARDS DOUBLE ARROW FROM BAR    RIGHTWARDS DOUBLE ARROW WITH STROKE
RIGHTWARDS DOUBLE ARROW WITH VERTICAL STROKE
RIGHTWARDS DOUBLE ARROW-TAIL        RIGHTWARDS DOUBLE DASH ARROW

etc
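Outside Emacs, a rough equivalent of this name-globbing can be sketched in Python with unicodedata and fnmatch (illustrative only -- it scans the whole code space, so it takes a moment):

```python
import fnmatch
import unicodedata

def find_chars(pattern):
    """Return (char, name) pairs whose Unicode name matches the glob."""
    matches = []
    for cp in range(0x110000):       # scan every code point
        ch = chr(cp)
        name = unicodedata.name(ch, None)
        if name is not None and fnmatch.fnmatch(name, pattern):
            matches.append((ch, name))
    return matches

for ch, name in find_chars("RIGHT*DOUBLE*ARROW"):
    print(ch, name)                  # e.g. ⇒ RIGHTWARDS DOUBLE ARROW
```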
===
¹ Steven will be mighty pleased to note that it used to be called ucs-insert,
for which the help page now gives:
"This function is obsolete since 24.3; use `insert-char' instead."

² Courtesy Xah Lee: http://xahlee.info/comp/unicode_computing_symbols.html


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-22 Thread Tim Chase
On 2016-06-22 00:55, Lawrence D’Oliveiro wrote:
> On Wednesday, June 22, 2016 at 7:50:50 AM UTC+12, Tim Chase wrote:
>> I have a ~/.XCompose file that contains something like
> 
> You may find your custom XCompose is ignored by certain GUI apps.
> This is because the GUI toolkits they are using need to be told to
> pull it in (seems like XCompose is interpreted by the client side X
> toolkits, not the server side). So I put the following lines in
> my .bashrc:
> 
> export GTK_IM_MODULE=xim
> export QT_IM_MODULE=xim

Ah, I knew that I'd had issues at some point with it not working but
couldn't remember what I'd done to get it working.  This was it.
(Grepping for "XCompose" in my config files didn't turn up anything.)

Thanks for adding the missing element.

-tkc





Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-22 Thread Lawrence D’Oliveiro
On Wednesday, June 22, 2016 at 7:50:50 AM UTC+12, Tim Chase wrote:
>
> I have a ~/.XCompose file that contains something like
> 
> include "%L"
>: ""   U1F616 # CONFOUNDED FACE
>: ""   U1F61B # FACE WITH
> STUCK-OUT TONGUE: ""   U1F61B #
> FACE WITH STUCK-OUT TONGUE
> 
> 
> The "include" pulls in the system-wide file, before adding my own
> compose maps.

You may find your custom XCompose is ignored by certain GUI apps. This is 
because the GUI toolkits they are using need to be told to pull it in (seems 
like XCompose is interpreted by the client side X toolkits, not the server 
side). So I put the following lines in my .bashrc:

export GTK_IM_MODULE=xim
export QT_IM_MODULE=xim


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Marko Rauhamaa
Tim Chase :

> I have a ~/.XCompose file that contains something like

My Fedora 23 setup has

=== BEGIN /etc/X11/xinit/xinitrc-common ===
[...]
userxkbmap=$HOME/.Xkbmap
[...]
if [ -r "$userxkbmap" ]; then
    setxkbmap $(cat "$userxkbmap")
    XKB_IN_USE=yes
fi
[...]
=== END /etc/X11/xinit/xinitrc-common ===

A somewhat surprising and scary idiom! I suppose I could specify:

=== BEGIN ~/.Xkbmap ===
-keymap /home/marko/.keys
=== END ~/.Xkbmap ===

Then, I suppose I need to use xkbcomp to create ~/.keys


Marko


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Tim Chase
On 2016-06-21 21:56, Marko Rauhamaa wrote:
> Rustom Mody :
> 
> > Regarding xkb:
> >
> > Some good advice given to me by Yuri Khan on emacs list
> > https://lists.gnu.org/archive/html/help-gnu-emacs/2015-01/msg00332.html
> 
> Well, not quite:
> 
>* Find the XKB data directory. [Normally, this
> is /usr/share/X11/xkb.]
>* In its “keycodes” subdirectory, create a file that is unlikely
> to be overwritten by a future version of XKB (e.g. by prefixing it
> with your initials). [Let’s name it “rusi” for the sake of this
> example.]
>* In this file, paste the following:
>[...]
> 
> You can see this advice requires root access.
> 
> My coworker does assure me it can all be done with regular luser
> rights as well, but no web site seems to say how exactly.

I have a ~/.XCompose file that contains something like

include "%L"
   : ""   U1F616 # CONFOUNDED FACE
   : ""   U1F61B # FACE WITH
STUCK-OUT TONGUE: ""   U1F61B #
FACE WITH STUCK-OUT TONGUE


The "include" pulls in the system-wide file, before adding my own
compose maps.
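The archive has eaten the angle-bracketed key sequences in the snippet above. For reference, a full ~/.XCompose entry normally looks like the following (this trigger sequence is invented for illustration, not the actual one from the file):

```
include "%L"
<Multi_key> <colon> <parenleft> : "😖" U1F616 # CONFOUNDED FACE
```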

-tkc




Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Marko Rauhamaa
Rustom Mody :

> Regarding xkb:
>
> Some good advice given to me by Yuri Khan on emacs list
> https://lists.gnu.org/archive/html/help-gnu-emacs/2015-01/msg00332.html

Well, not quite:

   * Find the XKB data directory. [Normally, this is /usr/share/X11/xkb.]
   * In its “keycodes” subdirectory, create a file that is unlikely to be
   overwritten by a future version of XKB (e.g. by prefixing it with your
   initials). [Let’s name it “rusi” for the sake of this example.]
   * In this file, paste the following:
   [...]

You can see this advice requires root access.

My coworker does assure me it can all be done with regular luser rights
as well, but no web site seems to say how exactly.


Marko


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Tim Chase
On 2016-06-21 11:35, Marko Rauhamaa wrote:
> > These are all pretty easy to remember.
> > German umlauts a" o" u" give ä ö ü  (or use uppercase)
> > Spanish eñe and punctuation:  n~ ?? !!  -->  ñ ¿ ¡
> > French accents:  e' e` e^ c,  -->  é è ê ç
> > Money:  c= l- y- c/  -->  € £ ¥ ¢
> > Math:  =/ -: +- xx <= >=  -->  ≠ ÷ ± × ≤ ≥
> > Superscripts:  ^0 ^1 ^2 ^3  -->  ⁰ ¹ ² ³
> > Simple fractions:  12 13 ... 78  -->  ½ ⅓ ... ⅞
> > Here's a cute one:  CCCP  -->  ☭  (hammer & sickle)
> > And like your first examples:  oo mu ss  -->  ° µ ß  
> 
> Trouble is, nobody's going to guess or memorize any of that stuff.

I've been pleasantly surprised by how guessable most of them are.
Occasionally I have to dig a bit deeper, but diacritics,
superscripts (using "^", with subscripts using "_"),
fractions, and arrows (either a "-" or a "|" followed by a
character that looks like the arrow-head: "<>v^") are all pretty easy
to guess once you understand the patterns.

-tkc





Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Rustom Mody
On Tuesday, June 21, 2016 at 6:38:19 PM UTC+5:30, Marko Rauhamaa wrote:
> A coworker of mine went through the trouble of doing the xmodmap
> equivalent with setxkbmap. Thought of interviewing him about it one day.
> 
> How-to's are really hard to come by:
> 
>https://wiki.archlinux.org/index.php/Keyboard_configuration_in_Xorg -- no good
> 
>https://bbs.archlinux.org/viewtopic.php?id=172316 -- no good
> 
>http://michal.kosmulski.org/computing/articles/custom-keyboard-layouts-xkb.html -- interesting but assumes root access
> 
>https://awesome.naquadah.org/wiki/Change_keyboard_maps -- no good

Regarding xkb:

Some good advice given to me by Yuri Khan on emacs list
https://lists.gnu.org/archive/html/help-gnu-emacs/2015-01/msg00332.html


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Rustom Mody
On Tuesday, June 21, 2016 at 7:27:00 PM UTC+5:30, Rustom Mody wrote:
> Emacs:
:
> Math: So far Ive used tex input method -- Not satisfactory

After "Random832" pointed me to RFC 1345, I checked that Emacs has an
rfc1345 input method. It may be nicer than the tex input method -- I need to check.
However, like everything Unicode, there is no attempt to distinguish the babel
part of Unicode from the universal part:
http://blog.languager.org/2015/03/whimsical-unicode.html


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Rustom Mody
On Tuesday, June 21, 2016 at 6:38:19 PM UTC+5:30, Marko Rauhamaa wrote:
> Rustom Mody :
> 
> > On Tuesday, June 21, 2016 at 2:05:55 PM UTC+5:30, Marko Rauhamaa wrote:
> >> (On the other hand, I have always specified my preferred keyboard
> >> layout with .Xmodmap.)
> >
> > If this is being given as advice
> 
> I never gave it as advice.
> 
> > it's bad advice: xmodmap is obsolete; use xkb
> 
> A coworker of mine went through the trouble of doing the xmodmap
> equivalent with setxkbmap. Thought of interviewing him about it one day.
> 
> How-to's are really hard to come by:
> 
>https://wiki.archlinux.org/index.php/Keyboard_configuration_in_Xorg -- no good
> 
>https://bbs.archlinux.org/viewtopic.php?id=172316 -- no good
> 
>http://michal.kosmulski.org/computing/articles/custom-keyboard-layouts-xkb.html -- interesting but assumes root access
> 
>https://awesome.naquadah.org/wiki/Change_keyboard_maps -- no good
> 
> etc etc
> 
> > This particularly nasty bug:
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/998310 I believe
> > I witnessed when I tried to use xmodmap
> 
> I do run into that when I place my laptop on the dock. I know to
> expect it, wait ten or so seconds, and I'm on my way. I'm guessing
> it has to do with the X server sending the keyboard map to every X
> window on the display.
> 
> So Rustom, how do *you* produce, say, Hebrew or Spanish text, or your
> favorite math symbols?

I wish I could say I have a good answer -- at the moment I don't
However, some ½-assed ones:


Emacs:
set-input-method (C-x RET C-\) greek
And then typing
abcdefghijklmnopqrstuvwxyz
gives
αβψδεφγηιξκλμνοπ;ρστθωςχυζ
[yeah that ; on q is curious]
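Since this is just a letter-for-letter substitution, the quoted table can be sanity-checked with a few lines of Python (the mapping string is copied verbatim from above):

```python
latin = "abcdefghijklmnopqrstuvwxyz"
greek = "αβψδεφγηιξκλμνοπ;ρστθωςχυζ"

# Translation table mirroring what the `greek' input method types.
to_greek = str.maketrans(latin, greek)

print("abg".translate(to_greek))   # αβγ
print("q".translate(to_greek))     # ; -- the curious q -> ; mapping
```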

Spanish?? No idea
But there seems to be a Spanish input method that
has these: éóñá¿

I've typed Hindi/Marathi/Tamil/Sanskrit/Gujarati and helped others with Bengali
using the devanagari-itrans/gujarati-itrans/tamil-itrans/bengali-itrans input
methods. There are also the corresponding -inscript methods for those who
type these fluently -- I am not one of them.

I have some 15-20 lines of elisp that make these itrans methods easier (for me)

Math: So far I've used the tex input method -- not satisfactory
Search-and-cut-paste from Google is better!
My favorite go-to for these is Xah Lee's pages:
Starts here: http://xahlee.info/comp/unicode_index.html

Some neat xah pages: http://xahlee.info/comp/unicode_matching_brackets.html
http://xahlee.info/comp/unicode_arrows.html
http://xahlee.info/comp/unicode_math_operators.html

Some of this is replicable at the setxkbmap level.
[Note: these commands are dangerous, as you can end up with a borked X system --
only temporarily, of course.
One safety catch is to keep
setxkbmap -option
in the bash history,
so (assuming up-arrow still works) goof-ups are correctable.
]

e.g. doing
$ setxkbmap -layout "us,apl(sax)" -option "grp:switch"
gives an APL keyboard on the Shift-RAlt chord.
So abcdefghijklmnopqrstuvwxyz
chorded with RAlt gives
⍺⊥∩⌊∊_∇∆⍳∘⎕|⊤○*?⍴⌈~↓∪⍵⊂⊃↑⊂
and with RAlt-Shift gives
⊖⍎⌊⍷⍫⍒⍋⍸⍤⌻⍞⌶⍕⍥⍟¿⍴⌈⍉↓∪⌽⊃↑⊂

I guess expert APLers may find this neat -- I am not one!

So I use this Emacs mode https://github.com/lokedhs/gnu-apl-mode
when using APL (mostly for teaching)

Then there is compose.
For this I've a compose key set.
[With laptops and ubuntu-unity this can get hard:
1. Unity appropriates too many keys
2. Laptops have a key shortage
 -- I've just changed to CAPSLOCK to try it out]

Then install uim
Then install https://github.com/rrthomas/pointless-xcompose

The whole point of that is to edit it so that the chars
one wants are accessible and others are not... I've not got round to that!


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Marko Rauhamaa
Rustom Mody :

> On Tuesday, June 21, 2016 at 2:05:55 PM UTC+5:30, Marko Rauhamaa wrote:
>> (On the other hand, I have always specified my preferred keyboard
>> layout with .Xmodmap.)
>
> If this is being given as advice

I never gave it as advice.

> it's bad advice: xmodmap is obsolete; use xkb

A coworker of mine went through the trouble of doing the xmodmap
equivalent with setxkbmap. Thought of interviewing him about it one day.

How-to's are really hard to come by:

   https://wiki.archlinux.org/index.php/Keyboard_configuration_in_Xorg -- no good

   https://bbs.archlinux.org/viewtopic.php?id=172316 -- no good

   http://michal.kosmulski.org/computing/articles/custom-keyboard-layouts-xkb.html -- interesting but assumes root access

   https://awesome.naquadah.org/wiki/Change_keyboard_maps -- no good

etc etc

> This particularly nasty bug:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/998310 I believe
> I witnessed when I tried to use xmodmap

I do run into that when I place my laptop on the dock. I know to
expect it, wait ten or so seconds, and I'm on my way. I'm guessing
it has to do with the X server sending the keyboard map to every X
window on the display.

So Rustom, how do *you* produce, say, Hebrew or Spanish text, or your
favorite math symbols?


Marko


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Rustom Mody
On Tuesday, June 21, 2016 at 2:05:55 PM UTC+5:30, Marko Rauhamaa wrote:
> Larry Hudson :
> > It sounds like you are almost, but not quite, describing the Linux
> > Compose key.
> 
> I have used Linux since the 1990's but don't know anything about "the
> Linux Compose key." 

It used to be a real (aka hardware) key:
See pics
https://en.wikipedia.org/wiki/Compose_key#Occurrence_on_keyboards


> (On the other hand, I have always specified my preferred keyboard layout with 
> .Xmodmap.)

If this is being given as advice, it's bad advice:
xmodmap is obsolete; use xkb
https://wiki.archlinux.org/index.php/X_KeyBoard_extension#xmodmap
[Does this make life easier?? I didn't say so :-) ]
This particularly nasty bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/998310
I believe I witnessed it when I tried to use xmodmap



Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Marko Rauhamaa
Larry Hudson :
> It sounds like you are almost, but not quite, describing the Linux
> Compose key.

I have used Linux since the 1990's but don't know anything about "the
Linux Compose key." (On the other hand, I have always specified my
preferred keyboard layout with .Xmodmap.)

> These are all pretty easy to remember.
> German umlauts a" o" u" give ä ö ü  (or use uppercase)
> Spanish eñe and punctuation:  n~ ?? !!  -->  ñ ¿ ¡
> French accents:  e' e` e^ c,  -->  é è ê ç
> Money:  c= l- y- c/  -->  € £ ¥ ¢
> Math:  =/ -: +- xx <= >=  -->  ≠ ÷ ± × ≤ ≥
> Superscripts:  ^0 ^1 ^2 ^3  -->  ⁰ ¹ ² ³
> Simple fractions:  12 13 ... 78  -->  ½ ⅓ ... ⅞
> Here's a cute one:  CCCP  -->  ☭  (hammer & sickle)
> And like your first examples:  oo mu ss  -->  ° µ ß

Trouble is, nobody's going to guess or memorize any of that stuff. The
Chinese face analogous typing issues. They must have come up with
productive solutions since demonstrably they can type quite fast.


Marko


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-21 Thread Larry Hudson via Python-list

On 06/19/2016 08:29 PM, Steven D'Aprano wrote:

On Mon, 20 Jun 2016 12:07 pm, Rustom Mody wrote:


[snip]

In theory most Linux apps support an X mechanism for inserting characters
that don't appear on the keyboard. Unfortunately, this gives no feedback
when you get it wrong, and discoverablity is terrible. It's taken me many
years to discover and learn the following:

WIN o WIN o gives °
WIN m WIN u gives µ
WIN s WIN s gives ß
WIN . . gives ·

(WIN is the Windows key)

Getting back to ≠ I tried:

WIN = WIN /
WIN / WIN =
WIN < WIN >
WIN ! WIN =

etc none of which do anything.

Another example of missing tooling is the lack of a good keyboard
application. Back in the 1980s, Apple Macs had a desk accessory that didn't
just simulate the keyboard, but showed what characters were available. If
you held down the Option key, the on-screen keyboard would display the
characters each key would insert. This increased discoverability and made
it practical for Hypertalk to accept non-ASCII synonyms such as

≤ for <=
≥ for >=
≠ for <>

Without better tooling and more discoverability, non-ASCII characters as
syntax are an anti-feature.



It sounds like you are almost, but not quite, describing the Linux Compose key.  To get many of 
the 'special' characters, you first press the compose key and follow it with (usually) two 
characters.  (That's ONE press of the compose key, not two like your first examples.)  And yes, 
the unequal sign is  =/


Here are some more examples (I'm not going to specify the compose key here; just assume these
examples are prefixed with it):  These are all pretty easy to remember.

German umlauts a" o" u" give ä ö ü  (or use uppercase)
Spanish eñe and punctuation:  n~ ?? !!  -->  ñ ¿ ¡
French accents:  e' e` e^ c,  -->  é è ê ç
Money:  c= l- y- c/  -->  € £ ¥ ¢
Math:  =/ -: +- xx <= >=  -->  ≠ ÷ ± × ≤ ≥
Superscripts:  ^0 ^1 ^2 ^3  -->  ⁰ ¹ ² ³
Simple fractions:  12 13 ... 78  -->  ½ ⅓ ... ⅞
Here's a cute one:  CCCP  -->  ☭  (hammer & sickle)
And like your first examples:  oo mu ss  -->  ° µ ß
Many MANY more obscure codes as well (have to look them up, or make a copy of 
this info.)

Admittedly not much use in programming, but can be useful for other general 
text.
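However the character was produced, Python's unicodedata module can confirm what actually landed in the text -- useful when two glyphs look alike:

```python
import unicodedata

for ch in ["≠", "½", "☭"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+2260 NOT EQUAL TO
# U+00BD VULGAR FRACTION ONE HALF
# U+262D HAMMER AND SICKLE

# Fraction characters even carry their numeric value:
print(unicodedata.numeric("½"))    # 0.5
```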

Now, setting the compose key...  Easy (but obscure) in Mint Linux (and I think Ubuntu is the
same; I don't know about other distros):

From the menu, select Preferences->Keyboard->Layouts->Options->Position of
Compose Key.
This opens a list of checkboxes with about a dozen choices -- select whatever you want (I use
the Menu key).


--
 -=- Larry -=-


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-20 Thread Rustom Mody
On Monday, June 20, 2016 at 8:30:25 PM UTC+5:30, Steven D'Aprano wrote:
> On Tue, 21 Jun 2016 12:23 am, Grant Edwards wrote:
> 
> > On 2016-06-20, Phil Boutros  wrote:
> [...]
> >> Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠
> > 
> > On any non-broken X11 system it's:  = /
> 
> Nope, doesn't work for me. I guess I've got a "broken" X11 system.
> 
> Oh, I did learn one thing, thanks to Lawrence's earlier link: the compose
> key behaves as a dead-key, not a modifier.
> 

You need to say something like
$ setxkbmap -option compose:menu

then the Windows menu key becomes the compose key
Or
$ setxkbmap -option compose:ralt

then it's the right Alt

You can check the current state of xkb with
$ setxkbmap -print

And you can clean up with a bare
$ setxkbmap -option

else these options 'pile up', as a -print would show


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-20 Thread Steven D'Aprano
On Tue, 21 Jun 2016 12:23 am, Grant Edwards wrote:

> On 2016-06-20, Phil Boutros  wrote:
[...]
>> Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠
> 
> On any non-broken X11 system it's:  = /

Nope, doesn't work for me. I guess I've got a "broken" X11 system.

Oh, I did learn one thing, thanks to Lawrence's earlier link: the compose
key behaves as a dead-key, not a modifier.



-- 
Steven



Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-20 Thread Grant Edwards
On 2016-06-20, Phil Boutros  wrote:
> Steven D'Aprano  wrote:
>>
>> Quote:
>>
>> "Why do we have to write x!=y then argue about the status of x<>y when we
>> can simply write x≠y?"
>>
>> "Simply"?
>>
>> This is how I write x≠y from scratch:
>
>
> To wrap this back full circle, here's how it's done on vim:  
>
> Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠

On any non-broken X11 system it's:  = /

I generally configure my system so that the right-hand "windows" key
is the compose key.  If I used it a lot, I'd probably configure the
left-hand one to be the same.

> It's still probably a horrible idea to have it in a programming
> language, though, unless the original behaviour still also works.

Definitely.  And we should allow overbar for logical inversion.  I
never did figure out the X11 compose sequence for the XOR symbol...

-- 
Grant Edwards   grant.b.edwardsYow! Somewhere in DOWNTOWN
  at   BURBANK a prostitute is
  gmail.comOVERCOOKING a LAMB CHOP!!



Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-20 Thread Rustom Mody
On Monday, June 20, 2016 at 11:34:36 AM UTC+5:30, Random832 wrote:
> On Mon, Jun 20, 2016, at 01:03, Rustom Mody wrote:
> > > Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠
> > 
> > Are these 'shortcuts' parameterizable?
> 
> They originate from RFC 1345, with the extension that they can be
> reversed if the reverse doesn't itself exist as a RFC 1345 combination.

Thanks!
Useful reference even though old


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-20 Thread Steven D'Aprano
On Monday 20 June 2016 17:57, Lawrence D’Oliveiro wrote:

> On Monday, June 20, 2016 at 4:31:00 PM UTC+12, Phil Boutros wrote:
>>
>> Steven D'Aprano wrote:
>>>
>>> This is how I write x≠y from scratch:
>> 
>> 
>> To wrap this back full circle, here's how it's done on vim:
>> 
>> Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠
> 
> Standard Linux sequence: compose-slash-equals (or compose-equals-slash).
> Works in every sensible editor, terminal emulator, text-input field in web
> browsers and other GUI apps. In short, everywhere.

Everywhere compose is configured the way you expect.


> 

Nice link, thank you, although it's missing a few things -- like how to query which
key is the compose key, and how to specify a key other than CapsLock. But
there's always Google, I suppose.

According to that link: "By default this function is not assigned to any key."

So... not so much "everywhere" as "by default, nowhere".


-- 
Steve



Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-20 Thread Lawrence D’Oliveiro
On Monday, June 20, 2016 at 4:31:00 PM UTC+12, Phil Boutros wrote:
>
> Steven D'Aprano wrote:
>>
>> This is how I write x≠y from scratch:
> 
> 
> To wrap this back full circle, here's how it's done on vim:  
> 
> Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠

Standard Linux sequence: compose-slash-equals (or compose-equals-slash). Works 
in every sensible editor, terminal emulator, text-input field in web browsers 
and other GUI apps. In short, everywhere.




Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-20 Thread Random832
On Mon, Jun 20, 2016, at 01:03, Rustom Mody wrote:
> > Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠
> 
> Are these 'shortcuts' parameterizable?

They originate from RFC 1345, with the extension that they can be
reversed if the reverse doesn't itself exist as a RFC 1345 combination.
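The scheme is easy to sketch in Python. The table below is a tiny hand-picked excerpt (RFC 1345 defines thousands of mnemonics); the point is the reversal fallback described above:

```python
# A few RFC 1345-style digraphs (tiny excerpt of the real table).
DIGRAPHS = {
    "!=": "≠",
    "=<": "≤",
    ">=": "≥",
    "+-": "±",
}

def digraph(pair):
    """Look up a two-char mnemonic, falling back to its reverse."""
    if pair in DIGRAPHS:
        return DIGRAPHS[pair]
    return DIGRAPHS.get(pair[::-1])

print(digraph("!="))   # ≠
print(digraph("=!"))   # ≠, via the reversal extension
print(digraph("<="))   # ≤, reverse of "=<"
```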


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-19 Thread Rustom Mody
On Monday, June 20, 2016 at 10:01:00 AM UTC+5:30, Phil Boutros wrote:
> Steven D'Aprano  wrote:
> >
> > Quote:
> >
> > "Why do we have to write x!=y then argue about the status of x<>y when we
> > can simply write x≠y?"
> >
> > "Simply"?
> >
> > This is how I write x≠y from scratch:
> 
> 
> To wrap this back full circle, here's how it's done on vim:  
> 
> Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠

Are these 'shortcuts' parameterizable?

> 
> It's still probably a horrible idea to have it in a programming
> language, though, unless the original behaviour still also works.

That goes without saying:
gradual evolutionary changes are more likely to last than violent
revolutionary ones.

Which evolution is already happening:

Fortress: 
https://umbilicus.wordpress.com/2009/10/16/fortress-parallel-by-default/
[unfortunately died along with its patron Sun microsystems]

Agda: http://mazzo.li/posts/AgdaSort.html
which is based on Haskell but cleans up
-> to →
forall to ∀
in addition to allowing arbitrary operators like ≈

Julia: http://iaindunning.com/blog/julia-unicode.html

And of course the original APL: http://aplwiki.com/FinnAplIdiomLibrary


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-19 Thread Rustom Mody
On Monday, June 20, 2016 at 10:06:41 AM UTC+5:30, Rustom Mody wrote:

> I have greater horror stories to describe if you like.
> On my recent Ubuntu upgrade my keyboard broke -- totally, i.e. I couldn't type
> anything.
> Here's a detailed rundown...
> 
> Upgrade complete; reboot -- NO KEYBOARD -- yikes
> However login works in X -- after login ... GONE
> And ttys (Ctrl-Alt-F1 etc.) are fine; no issue.
> 
> Searched around... "Uninstall ibus" seems to be the advice... No go
> Some Unity issue, it looks like?
> Installed xfce (from a tty)
> 
> Again after a few days (some upgrade, I don't remember which) the keyboard broke
> Um, now what to install? GNOME?? OMG!
> 
> Created a new login... Problem gone...
> 
> Well, what's the problem?? Well, whatever!!
> Finally by chance I discovered that the problem was probably uim
> uim is an alternative to ibus
> I had installed it to make this work:
> https://github.com/rrthomas/pointless-xcompose
> 
> which is aimed precisely at removing this pain:

Umm, that comes across as an inversion and misrepresentation.
uim got UNINSTALLED in the upgrade.
[Did I do it and forget? Did it happen automatically?? I don't remember.]
No uim, no ibus, no input method, evidently.


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-19 Thread Rustom Mody
On Monday, June 20, 2016 at 8:59:44 AM UTC+5:30, Steven D'Aprano wrote:

> Without better tooling and more discoverability, non-ASCII characters as
> syntax are an anti-feature.

You need to decide which hat you have on:
- idealist
- pragmatist

From a pragmatic pov, nothing you are saying below is arguable.

The argument starts because you are taking the moral high ground against
a *title* and presumably the mnemonic (the 'a' in :ga) in the vi docs,
ignoring that the implementation and the doc body are aligned with current
practices.



> Getting back to ≠ I tried:
> 

I have greater horror stories to describe if you like.
On my recent Ubuntu upgrade my keyboard broke -- totally, i.e. I couldn't type anything.
Here's a detailed rundown...

Upgrade complete; reboot -- NO KEYBOARD -- yikes
However login works in X -- after login ... GONE
And ttys (Ctrl-Alt-F1 etc.) are fine; no issue.

Searched around... "Uninstall ibus" seems to be the advice... No go
Some Unity issue, it looks like?
Installed xfce (from a tty)

Again after a few days (some upgrade, I don't remember which) the keyboard broke
Um, now what to install? GNOME?? OMG!

Created a new login... Problem gone...

Well, what's the problem?? Well, whatever!!
Finally by chance I discovered that the problem was probably uim
uim is an alternative to ibus
I had installed it to make this work:
https://github.com/rrthomas/pointless-xcompose

which is aimed precisely at removing this pain:


> This is how I write x≠y from scratch:
> 
> - press the 'x' key on my keyboard
> - grab the mouse
> - move mouse to Start menu
> - pause and wait for "Utilities" submenu to appear
> - move mouse over "Utilities" submenu
> - pause and wait for "More Applications" submenu to appear
> - move mouse to "More Applications" submenu
> - move mouse to "Gnome charmap"
> - click
> - wait a second
> - move mouse to "Character Map" window
> - click on "Search" menu
> - click on "Find" menu item
> - release mouse
> - type "unequal" and press ENTER
> - press ENTER to dismiss the "Not Found" dialog
> - type "not equal" and press ENTER
> - press ESC to dismiss the Find dialog
> - grab the mouse
> - click the ≠ glyph
> - pause and swear when nothing happens
> - double-click the ≠ glyph
> - move the mouse to the "Copy" button
> - click "Copy"
> - visually search the task bar for my editor
> - click on the editor
> - invariably I end up accidentally moving the insertion point, 
>   so click after the 'x'
> - release the mouse
> - press Ctrl-V
> - press the 'y' key
> 
> and I am done.
> 

So yeah...
- Remedy worse than the evil? Sure.
- Impractical? Of course.

So also thought programmers in the 70s, when presented with the possibility of
using lowercase while everyone used FORTRAN, COBOL and PL/1 and programming meant
CODING ON CODING SHEETS LIKE THIS

BTW the maverick that offered this completely unnecessarily wasteful luxury was
called Unix.

> In theory most Linux apps support an X mechanism for inserting characters
> that don't appear on the keyboard. Unfortunately, this gives no feedback
> when you get it wrong, and discoverability is terrible. It's taken me many
> years to discover and learn the following:
> 
> WIN o WIN o gives °
> WIN m WIN u gives µ
> WIN s WIN s gives ß
> WIN . . gives ·
> 
> (WIN is the Windows key)
> 

Here's a small sample of what you get with XCompose
[the compose key can be anything; in my case it's set to right-alt]
COMP oo °
COMP mu µ
COMP 12 ½
COMP <> ↔
COMP => ⇒
COMP -v ↓
COMP ^^i ⁱ  Likewise n ⁿ
COMP -^ ↑

Nifty when it works; nicely parameterisable -- just edit ~/.XCompose
But mind your next upgrade :D
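For reference -- my sketch, not from the post -- entries in ~/.XCompose follow the Compose(5) file syntax, so a minimal file covering a few of the bindings above could look like:

```text
# ~/.XCompose -- pull in the system defaults, then add custom bindings
include "%L"

<Multi_key> <o> <o>               : "°"
<Multi_key> <m> <u>               : "µ"
<Multi_key> <1> <2>               : "½"
<Multi_key> <minus> <asciicircum> : "↑"
```

The exact keysym names (`minus`, `asciicircum`, etc.) are as defined by X; check your system's Compose(5) man page.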

> > 
> > http://blog.languager.org/2014/04/unicoded-python.html
> 
> Quote:
> 
> "Why do we have to write x!=y then argue about the status of x<>y when we
> can simply write x≠y?"
> 
> "Simply"?
> 

Early adopters by definition live on the bleeding edge
So "not simple" today ⇏ "not simple" tomorrow
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-19 Thread Phil Boutros
Steven D'Aprano  wrote:
>
> Quote:
>
> "Why do we have to write x!=y then argue about the status of x<>y when we
> can simply write x≠y?"
>
> "Simply"?
>
> This is how I write x≠y from scratch:


To wrap this back full circle, here's how it's done on vim:  

Ctrl-K, =, ! (last two steps interchangeable).  Done.  Result:  ≠

It's still probably a horrible idea to have it in a programming
language, though, unless the original behaviour still also works.
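For the record (my addition, not Phil's): current CPython indeed accepts only the ASCII spelling; the Unicode sign is a syntax error, which is easy to check:

```python
# The ASCII operator works; the Unicode one is rejected at compile time.
print(eval('1 != 2'))    # True
try:
    compile('1 ≠ 2', '<example>', 'eval')
    print('accepted')
except SyntaxError:
    print('≠ is not Python syntax')
```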


Phil
-- 
AH#61  Wolf#14  BS#89  bus#1  CCB#1  SENS  KOTC#4   ph...@philb.ca
http://philb.ca EKIII rides with me:  http://eddiekieger.com
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII or Unicode? (was best text editor for programming Python on a Mac)

2016-06-19 Thread Steven D'Aprano
On Mon, 20 Jun 2016 12:07 pm, Rustom Mody wrote:

> If python were to do more than lip service to  REALLY being a unicode age
> language why are things like this out of bounds even for discussion?
> 
> http://blog.languager.org/2014/04/unicoded-python.html

Quote:

"Why do we have to write x!=y then argue about the status of x<>y when we
can simply write x≠y?"

"Simply"?

This is how I write x≠y from scratch:

- press the 'x' key on my keyboard
- grab the mouse
- move mouse to Start menu
- pause and wait for "Utilities" submenu to appear
- move mouse over "Utilities" submenu
- pause and wait for "More Applications" submenu to appear
- move mouse to "More Applications" submenu
- move mouse to "Gnome charmap"
- click
- wait a second
- move mouse to "Character Map" window
- click on "Search" menu
- click on "Find" menu item
- release mouse
- type "unequal" and press ENTER
- press ENTER to dismiss the "Not Found" dialog
- type "not equal" and press ENTER
- press ESC to dismiss the Find dialog
- grab the mouse
- click the ≠ glyph
- pause and swear when nothing happens
- double-click the ≠ glyph
- move the mouse to the "Copy" button
- click "Copy"
- visually search the task bar for my editor
- click on the editor
- invariably I end up accidentally moving the insertion point, 
  so click after the 'x'
- release the mouse
- press Ctrl-V
- press the 'y' key

and I am done.

Now, I accept that some of those steps could probably be streamlined. Better
tooling would probably make it better, e.g. my editor could offer its own
char map, which hopefully wouldn't suck like Open Office's inbuilt "Insert
Special Character" function. It would be nice if the editor kept a cache
of "Frequently Inserted Characters", because realistically there's only a
set of about twenty or thirty that I use frequently.

A programmer's editor could even offer a per-language palette of non-ASCII
operators. Or there could be a keyboard shortcut which I probably wouldn't
remember. If I could remember a seemingly infinite number of arbitrary
keyboard commands, I'd use Emacs or Vi :-)

In theory most Linux apps support an X mechanism for inserting characters
that don't appear on the keyboard. Unfortunately, this gives no feedback
when you get it wrong, and discoverability is terrible. It's taken me many
years to discover and learn the following:

WIN o WIN o gives °
WIN m WIN u gives µ
WIN s WIN s gives ß
WIN . . gives ·

(WIN is the Windows key)

Getting back to ≠ I tried:

WIN = WIN /
WIN / WIN =
WIN < WIN >
WIN ! WIN =

etc none of which do anything.

Another example of missing tooling is the lack of a good keyboard
application. Back in the 1980s, Apple Macs had a desk accessory that didn't
just simulate the keyboard, but showed what characters were available. If
you held down the Option key, the on-screen keyboard would display the
characters each key would insert. This increased discoverability and made
it practical for Hypertalk to accept non-ASCII synonyms such as

≤ for <=
≥ for >= 
≠ for <>
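As an aside (not part of the original post), the name-based search that charmap performs is also available programmatically through Python's unicodedata module:

```python
import unicodedata

# Find characters by their official Unicode names -- the same data
# that GNOME charmap's "Find" dialog searches.
ne = unicodedata.lookup('NOT EQUAL TO')
print(ne, 'U+%04X' % ord(ne))        # ≠ U+2260
print(unicodedata.name('\u2264'))    # LESS-THAN OR EQUAL TO
```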

Without better tooling and more discoverability, non-ASCII characters as
syntax are an anti-feature.



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode

2013-12-08 Thread rusi
On Saturday, December 7, 2013 9:35:34 PM UTC+5:30, giacomo boffi wrote:
 Steven D'Aprano  writes:

  Ironically, your post was not Unicode.  [...] Your post was sent
  using a legacy encoding, Windows-1252, also known as CP-1252

 i access rusi's post using a NNTP server,
 and in his post i see

 Content-Type: text/plain; charset=UTF-8

 is it possible that what you see is an artifact
 of the gateway?

Thanks for checking that!
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode

2013-12-08 Thread Steven D'Aprano
On Sat, 07 Dec 2013 17:05:34 +0100, giacomo boffi wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 
 Ironically, your post was not Unicode.  [...] Your post was sent using
 a legacy encoding, Windows-1252, also known as CP-1252
 
 i access rusi's post using a NNTP server, and in his post i see
 
 Content-Type: text/plain; charset=UTF-8

But *which post* are you looking at?


I have just looked at three posts from him:

Rusi's original post, where he used the ellipsis characters:

  Subject: Re: Managing Google Groups headaches
  Date: Thu, 5 Dec 2013 23:13:54 -0800 (PST)
  Content-Type: text/plain; charset=windows-1252

Then his reply to me:

  Subject: Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
  Date: Fri, 6 Dec 2013 18:33:39 -0800 (PST)
  Content-Type: text/plain; charset=UTF-8

And finally, his reply to you:

  Subject: Re: ASCII and Unicode
  Date: Sun, 8 Dec 2013 08:41:10 -0800 (PST)
  Content-Type: text/plain; charset=ISO-8859-1

It seems to me that whatever client he is using to post (I believe it is 
Google Groups web interface?) varies the encoding depending on what 
characters are included in his post.


 is it possible that what you see is an artifact of the gateway?

I doubt it. Unfortunately the email mailing list archive doesn't display 
all the email headers, but for the record here is his original post as 
seen by the email mailing list:

https://mail.python.org/pipermail/python-list/2013-December/661782.html

If you view source, you'll see that Mailman (the mailing list software) 
sets the webpage encoding to US-ASCII and encodes the ellipses to #8230, 
which is a perfectly reasonable thing for a web page to do. So we can be 
confident that when Mailman saw Rusi's post, it was able to correctly 
decode the message and see ellipses.

Although I think that (probably) Google Groups is being stupid by varying 
the charset (why not just use UTF-8 always?), at least it is setting the 
charset correctly. 
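A plausible model of that charset-picking behaviour -- an assumption on my part, since Google never documented it -- is to try progressively larger encodings until one can represent the message:

```python
def pick_charset(text):
    """Return the first (smallest) charset that can encode `text`."""
    for charset in ('us-ascii', 'iso-8859-1', 'windows-1252', 'utf-8'):
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue

print(pick_charset('plain text'))   # us-ascii
print(pick_charset('café'))         # iso-8859-1
print(pick_charset('wait…'))        # windows-1252: '…' is 0x85 there
print(pick_charset('x ≠ y'))        # utf-8: '≠' fits nothing smaller
```

Such a scheme would produce exactly the pattern seen in the three posts above.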



-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode

2013-12-08 Thread rusi
On Sunday, December 8, 2013 10:52:34 PM UTC+5:30, Steven D'Aprano wrote:
 On Sat, 07 Dec 2013 17:05:34 +0100, giacomo boffi wrote:

  Steven D'Aprano  writes:
  Ironically, your post was not Unicode.  [...] Your post was sent using
  a legacy encoding, Windows-1252, also known as CP-1252
  i access rusi's post using a NNTP server, and in his post i see
  Content-Type: text/plain; charset=UTF-8

 But *which post* are you looking at?

 I have just looked at three posts from him:

 Rusi's original post, where he used the ellipsis characters:

   Subject: Re: Managing Google Groups headaches
   Date: Thu, 5 Dec 2013 23:13:54 -0800 (PST)
   Content-Type: text/plain; charset=windows-1252

 Then his reply to me:

   Subject: Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
   Date: Fri, 6 Dec 2013 18:33:39 -0800 (PST)
   Content-Type: text/plain; charset=UTF-8

 And finally, his reply to you:

   Subject: Re: ASCII and Unicode
   Date: Sun, 8 Dec 2013 08:41:10 -0800 (PST)
   Content-Type: text/plain; charset=ISO-8859-1

 It seems to me that whatever client he is using to post (I believe it is 
 Google Groups web interface?) varies the encoding depending on what 
 characters are included in his post.

  is it possible that what you see is an artifact of the gateway?

 I doubt it. Unfortunately the email mailing list archive doesn't display 
 all the email headers, but for the record here is his original post as 
 seen by the email mailing list:

 https://mail.python.org/pipermail/python-list/2013-December/661782.html

 If you view source, you'll see that Mailman (the mailing list software) 
 sets the webpage encoding to US-ASCII and encodes the ellipses to #8230, 
 which is a perfectly reasonable thing for a web page to do. So we can be 
 confident that when Mailman saw Rusi's post, it was able to correctly 
 decode the message and see ellipses.

 Although I think that (probably) Google Groups is being stupid by varying 
 the charset (why not just use UTF-8 always?), at least it is setting the 
 charset correctly. 

I think GG is being sweet and affectionate and delectable enough that a
 in the footer will keep it stuck at UTF-8 you think ?? :-)


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode

2013-12-08 Thread giacomo boffi
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

 On Sat, 07 Dec 2013 17:05:34 +0100, giacomo boffi wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 
 Ironically, your post was not Unicode.  [...] Your post was sent using
 a legacy encoding, Windows-1252, also known as CP-1252
 
 i access rusi's post using a NNTP server, and in his post i see
 
 Content-Type: text/plain; charset=UTF-8

 But *which post* are you looking at?

blush the wrong one.../ i.e., the one JUST BEFORE your change of
subject --- if i look at the ellipsis post, i see the same encoding
that you have mentioned

sorry for the confusion
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode

2013-12-08 Thread rusi
On Monday, December 9, 2013 1:41:41 AM UTC+5:30, giacomo boffi wrote:
 blush the wrong one.../ i.e, the one JUST BEFORE your change of
 subject --- if i look at the ellipsis post, i see the same encoding
 that you have mentioned

 sorry for the confusion

And thank you for pointing the way to the culprit, viz. GG trying to be
too clever.

[Since you neglected to close your blush I am included in it :-) ]
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode

2013-12-07 Thread giacomo boffi
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

 Ironically, your post was not Unicode.  [...] Your post was sent
 using a legacy encoding, Windows-1252, also known as CP-1252

i access rusi's post using a NNTP server,
and in his post i see

Content-Type: text/plain; charset=UTF-8

is it possible that what you see is an artifact
of the gateway?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Gene Heskett
On Friday 06 December 2013 14:30:06 Steven D'Aprano did opine:

 On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
  Evidently (and completely inadvertently) this exchange has just
  illustrated one of the inadmissable assumptions:
  
  unicode as a medium is universal in the same way that ASCII used to
  be
 
 Ironically, your post was not Unicode.
 
 Seriously. I am 100% serious.
 
 Your post was sent using a legacy encoding, Windows-1252, also known as
 CP-1252, which is most certainly *not* Unicode. Whatever software you
 used to send the message correctly flagged it with a charset header:
 
 Content-Type: text/plain; charset=windows-1252
 
 Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
 encodings correctly (or at all!), it screws up the encoding then sends a
 reply with no charset line at all. This is one bug that cannot be blamed
 on Google Groups -- or on Unicode.
 
  I wrote a number of ellipsis characters ie codepoint 2026 as in:
 Actually you didn't. You wrote a number of ellipsis characters, hex byte
 \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
 code point U+2026 in Unicode, but the two are as distinct as ASCII and
 EBCDIC.
 
  Somewhere between my sending and your quoting those ellipses became
  the replacement character FFFD
 
 Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
 encodings and character sets. It doesn't just assume things are ASCII,
 but makes a half-hearted attempt to be charset-aware, but badly. I can
 only imagine that it was written back in the Dark Ages where there were
 a lot of different charsets in use but no conventions for specifying
 which charset was in use. Or perhaps the author was smoking crack while
 coding.
 
  Leaving aside whose fault this is (very likely buggy google groups),
  this mojibaking cannot happen if the assumption "All text is ASCII"
  were to uniformly hold.
 
 This is incorrect. People forget that ASCII has evolved since the first
 version of the standard in 1963. There have actually been five versions
 of the ASCII standard, plus one unpublished version. (And that's not
 including the things which are frequently called ASCII but aren't.)
 
 ASCII-1963 didn't even include lowercase letters. It is also missing
 some graphic characters like braces, and included at least two
 characters no longer used, the up-arrow and left-arrow. The control
 characters were also significantly different from today.
 
 ASCII-1965 was unpublished and unused. I don't know the details of what
 it changed.
 
 ASCII-1967 is a lot closer to the ASCII in use today. It made
 considerable changes to the control characters, moving, adding,
 removing, or renaming at least half a dozen control characters. It
 officially added lowercase letters, braces, and some others. It
 replaced the up-arrow character with the caret and the left-arrow with
 the underscore. It was ambiguous, allowing variations and
 substitutions, e.g.:
 
 - character 33 was permitted to be either the exclamation
   mark ! or the logical OR symbol |
 
 - consequently character 124 (vertical bar) was always
   displayed as a broken bar ¦, which explains why even today
   many keyboards show it that way
 
 - character 35 was permitted to be either the number sign # or
   the pound sign £
 
 - character 94 could be either a caret ^ or a logical NOT ¬
 
 Even the humble comma could be pressed into service as a cedilla.
 
 ASCII-1968 didn't change any characters, but allowed the use of LF on
 its own. Previously, you had to use either LF/CR or CR/LF as newline.
 
 ASCII-1977 removed the ambiguities from the 1967 standard.
 
 The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
 Unfortunately I haven't been able to find out what changes were made --
 I presume they were minor, and didn't affect the character set.
 
 So as you can see, even with actual ASCII, you can have mojibake. It's
 just not normally called that. But if you are given an arbitrary ASCII
 file of unknown age, containing code 94, how can you be sure it was
 intended as a caret rather than a logical NOT symbol? You can't.
 
 Then there are at least 30 official variations of ASCII, strictly
 speaking part of ISO-646. These 7-bit codes were commonly called ASCII
 by their users, despite the differences, e.g. replacing the dollar sign
 $ with the international currency sign ¤, or replacing the left brace
 { with the letter s with caron š.
 
 One consequence of this is that the MIME type for ASCII text is called
 US ASCII, despite the redundancy, because many people expect ASCII
 alone to mean whatever national variation they are used to.
 
 But it gets worse: there are proprietary variations on ASCII which are
 commonly called ASCII but aren't, including dozens of 8-bit so-called
 extended ASCII character sets, which is where the problems *really*
 pile up. Invariably back in the 1980s and early 1990s people used to
 call these 

Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Roy Smith
Steven D'Aprano steve+comp.lang.python at pearwood.info writes:

 Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
 encodings and character sets. It doesn't just assume things are ASCII, 
 but makes a half-hearted attempt to be charset-aware, but badly. I can 
 only imagine that it was written back in the Dark Ages

Indeed.  The basic codebase probably goes back 20 years.  I'm posting this
from gmane, just so people don't think I'm a total luddite.

 When transmitting ASCII characters, the networking protocol could include 
 various start and stop bits and parity codes. A single 7-bit ASCII 
 character might be anything up to 12 bits in length on the wire.

Not to mention that some really old hardware used 1.5 stop bits!


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Chris Angelico
On Sat, Dec 7, 2013 at 6:00 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 - character 33 was permitted to be either the exclamation
   mark ! or the logical OR symbol |

 - consequently character 124 (vertical bar) was always
   displayed as a broken bar ¦, which explains why even today
   many keyboards show it that way

 - character 35 was permitted to be either the number sign # or
   the pound sign £

 - character 94 could be either a caret ^ or a logical NOT ¬

Yeah, good fun stuff. I first met several of these ambiguities in the
OS/2 REXX documentation, which detailed the language's operators by
specifying their byte values as well as their characters - for
instance, this quote from the docs (yeah, I still have it all here):


Note:   Depending upon your Personal System keyboard and the code page
you are using, you may not have the solid vertical bar to select. For
this reason, REXX also recognizes the use of the split vertical bar as
a logical OR symbol. Some keyboards may have both characters. If so,
they are not interchangeable; only the character that is equal to the
ASCII value of 124 works as the logical OR. This type of mismatch can
also cause the character on your screen to be different from the
character on your keyboard.

(The front material on the docs says (C) Copyright IBM Corp. 1987,
1994. All Rights Reserved.)

It says "ASCII value" where on this list we would be more likely to
call it "byte value", and I'd prefer to say "represented by" rather
than "equal to", but nonetheless, this is still clearly distinguishing
characters and bytes. The language spec is on characters, but
ultimately the interpreter is going to be looking at bytes, so when
there's a problem, it's byte 124 that's the one defined as logical OR.
Oh, and note the copyright date. The byte/char distinction isn't new.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread rusi
On Saturday, December 7, 2013 12:30:18 AM UTC+5:30, Steven D'Aprano wrote:
 On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

  Evidently (and completely inadvertently) this exchange has just
  illustrated one of the inadmissable assumptions:
  unicode as a medium is universal in the same way that ASCII used to be

 Ironically, your post was not Unicode.

 Seriously. I am 100% serious.

 Your post was sent using a legacy encoding, Windows-1252, also known as 
 CP-1252, which is most certainly *not* Unicode. Whatever software you 
 used to send the message correctly flagged it with a charset header:

 Content-Type: text/plain; charset=windows-1252

 Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle 
 encodings correctly (or at all!), it screws up the encoding then sends a 
 reply with no charset line at all. This is one bug that cannot be blamed 
 on Google Groups -- or on Unicode.

  I wrote a number of ellipsis characters ie codepoint 2026 as in:

 Actually you didn't. You wrote a number of ellipsis characters, hex byte 
 \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to 
 code point U+2026 in Unicode, but the two are as distinct as ASCII and 
 EBCDIC.

  Somewhere between my sending and your quoting those ellipses became the
  replacement character FFFD

 Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
 encodings and character sets. It doesn't just assume things are ASCII, 
 but makes a half-hearted attempt to be charset-aware, but badly. I can 
 only imagine that it was written back in the Dark Ages where there were a 
 lot of different charsets in use but no conventions for specifying which 
 charset was in use. Or perhaps the author was smoking crack while coding.

  Leaving aside whose fault this is (very likely buggy google groups),
  this mojibaking cannot happen if the assumption "All text is ASCII" were
  to uniformly hold.

 This is incorrect. People forget that ASCII has evolved since the first 
 version of the standard in 1963. There have actually been five versions 
 of the ASCII standard, plus one unpublished version. (And that's not 
 including the things which are frequently called ASCII but aren't.)

 ASCII-1963 didn't even include lowercase letters. It is also missing some 
 graphic characters like braces, and included at least two characters no 
 longer used, the up-arrow and left-arrow. The control characters were 
 also significantly different from today.

 ASCII-1965 was unpublished and unused. I don't know the details of what 
 it changed.

 ASCII-1967 is a lot closer to the ASCII in use today. It made 
 considerable changes to the control characters, moving, adding, removing, 
 or renaming at least half a dozen control characters. It officially added 
 lowercase letters, braces, and some others. It replaced the up-arrow 
 character with the caret and the left-arrow with the underscore. It was 
 ambiguous, allowing variations and substitutions, e.g.:

 - character 33 was permitted to be either the exclamation 
   mark ! or the logical OR symbol |

 - consequently character 124 (vertical bar) was always 
   displayed as a broken bar ¦, which explains why even today
   many keyboards show it that way

 - character 35 was permitted to be either the number sign # or 
   the pound sign £

 - character 94 could be either a caret ^ or a logical NOT ¬

 Even the humble comma could be pressed into service as a cedilla.

 ASCII-1968 didn't change any characters, but allowed the use of LF on its 
 own. Previously, you had to use either LF/CR or CR/LF as newline.

 ASCII-1977 removed the ambiguities from the 1967 standard.

 The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). 
 Unfortunately I haven't been able to find out what changes were made -- I 
 presume they were minor, and didn't affect the character set.

 So as you can see, even with actual ASCII, you can have mojibake. It's 
 just not normally called that. But if you are given an arbitrary ASCII 
 file of unknown age, containing code 94, how can you be sure it was 
 intended as a caret rather than a logical NOT symbol? You can't.

 Then there are at least 30 official variations of ASCII, strictly 
 speaking part of ISO-646. These 7-bit codes were commonly called ASCII 
 by their users, despite the differences, e.g. replacing the dollar sign $ 
 with the international currency sign ¤, or replacing the left brace 
 { with the letter s with caron š.

 One consequence of this is that the MIME type for ASCII text is called 
 US ASCII, despite the redundancy, because many people expect ASCII 
 alone to mean whatever national variation they are used to.

 But it gets worse: there are proprietary variations on ASCII which are 
 commonly called ASCII but aren't, including dozens of 8-bit so-called 
 extended ASCII character sets, which is where the problems *really* 
 pile up. Invariably back in the 1980s and early 1990s people used 

Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Chris Angelico
On Sat, Dec 7, 2013 at 1:33 PM, rusi rustompm...@gmail.com wrote:
 That seems to suggest that something is not right with the python
 mailing list config. No??

If in doubt, blame someone else, eh?

I'd first check what your browser's actually sending. Firebug will
help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
That's the first step.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread MRAB

On 07/12/2013 02:41, Chris Angelico wrote:

On Sat, Dec 7, 2013 at 1:33 PM, rusi rustompm...@gmail.com wrote:

That seems to suggest that something is not right with the python
mailing list config. No??


If in doubt, blame someone else, eh?

I'd first check what your browser's actually sending. Firebug will
help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
That's the first step.


Looking back through the thread, it looks like:

Roy posted a reply in us-ascii.

rusi replied in windows-1252, adding the '…'.

Roy replied in us-ascii, but with 'Š' in place of '…'.

rusi replied in utf-8, with '�' in place of '…'
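The first hop of that chain is easy to verify (a quick sketch, not part of MRAB's post):

```python
# '…' (U+2026 HORIZONTAL ELLIPSIS) occupies byte 0x85 in Windows-1252;
# decoding that byte under any other assumption loses the character.
ellipsis = '\u2026'
raw = ellipsis.encode('windows-1252')
print(raw)                          # b'\x85'
print(raw.decode('windows-1252'))   # round-trips to '…'
print(repr(raw.decode('latin-1')))  # '\x85' -- a C1 control, not '…'
```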

--
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread rusi
On Saturday, December 7, 2013 8:11:45 AM UTC+5:30, Chris Angelico wrote:
 On Sat, Dec 7, 2013 at 1:33 PM, rusi  wrote:
  That seems to suggest that something is not right with the python
  mailing list config. No??

 If in doubt, blame someone else, eh?

 I'd first check what your browser's actually sending. Firebug will
 help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
 That's the first step.

If you give me some tip where to look, I'll do that.
But I don't see what this has to do with forms.

Everything in the python archive (not just my posts) shows as Win-1252
[I checked about 6]

Every other page that I checked (most nothing to do with python list,
GG etc) shows UTF-8. [I checked about 5]

None of these checkings had forms to be filled.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Chris Angelico
On Sat, Dec 7, 2013 at 2:16 PM, rusi rustompm...@gmail.com wrote:
 On Saturday, December 7, 2013 8:11:45 AM UTC+5:30, Chris Angelico wrote:
 On Sat, Dec 7, 2013 at 1:33 PM, rusi  wrote:
  That seems to suggest that something is not right with the python
  mailing list config. No??

 If in doubt, blame someone else, eh?

 I'd first check what your browser's actually sending. Firebug will
 help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
 That's the first step.

 If you give me some tip where to look, I'll do that.
 But I dont see what this has to do with forms.


Page encodings specify what comes from the server to your browser.
Your post went the other way. Tracing the data going back to the
server would tell you how it's encoded.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


RE: Ascii to Unicode.

2010-07-30 Thread Lawrence D'Oliveiro
In message mailman.1309.1280426398.1673.python-l...@python.org, Joe 
Goldthwaite wrote:

 Ascii.csv isn't really a latin-1 encoded file.  It's an ascii file with a
 few characters above the 128 range that are causing Postgresql Unicode
 errors.  Those characters work fine in the Windows world but they're not
 the correct byte representation for Unicode.

In other words, the encoding you want to decode from in this case is 
windows-1252.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-30 Thread Lawrence D'Oliveiro
In message 4c51d3b6$0$1638$742ec...@news.sonic.net, John Nagle wrote:

 UTF-8 is a stream format for Unicode.  It's slightly compressed ...

“Variable-length” is not the same as “compressed”.

Particularly if you’re mainly using non-Roman scripts...

-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Ascii to Unicode.

2010-07-30 Thread Lawrence D'Oliveiro
In message mailman.1307.1280425706.1673.python-l...@python.org, Joe 
Goldthwaite wrote:

 Next I tried to write the unicodestring object to a file thusly;
 
 output.write(unicodestring)
 
 I would have expected the write function to request the byte string from
 the unicodestring object and simply write that byte string to a file.

Encoded according to which encoding?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-30 Thread John Machin
On Jul 30, 4:18 am, Carey Tilden carey.til...@gmail.com wrote:
 In this case, you've been able to determine the
 correct encoding (latin-1) for those errant bytes, so the file itself
 is thus known to be in that encoding.

The most probably correct encoding is, as already stated, and agreed
by the OP to be, cp1252.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread Ulrich Eckhardt
Joe Goldthwaite wrote:
   import unicodedata

   input = file('ascii.csv', 'rb')
   output = file('unicode.csv', 'wb')

   for line in input.xreadlines():
       unicodestring = unicode(line, 'latin1')
       output.write(unicodestring.encode('utf-8'))  # This second encode
                                                    # is what I was missing.

Actually, I see two problems here:
1. ascii.csv is not an ASCII file but a Latin-1 encoded file, so there
starts the first confusion.
2. unicode.csv is not a Unicode file, because Unicode is not a file
format. Rather, it is a UTF-8 encoded file, which is one encoding of
Unicode. This is the second confusion.

 A number of you pointed out what I was doing wrong but I couldn't
 understand it until I realized that the write operation didn't work until
 it was using a properly encoded Unicode string.

The write function wants bytes! Encoding a string in your favourite encoding
yields bytes.

 This still seems odd to me.  I would have thought that the unicode
 function would return a properly encoded byte stream that could then
 simply be written to disk.

No, unicode() takes a byte stream and decodes it according to the given
encoding. You then get an internal representation of the string, a unicode
object. This representation typically resembles UCS2 or UCS4, which are
more suitable for internal manipulation than UTF-8. This object is a string
btw, so typical stuff like concatenation etc are supported. However, the
internal representation is a sequence of Unicode codepoints but not a
guaranteed sequence of bytes which is what you want in a file.

 Instead it seems like you have to re-encode the byte stream to some
 kind of escaped Ascii before it can be written back out.

As mentioned above, you have a string. For writing, that string needs to be
transformed to bytes again.


Note: You can also configure a file to read one encoding or write another.
You then get unicode objects from the input which you can feed to the
output. The important difference is that you only specify the encoding in
one place and it will probably even be more performant. I'd have to search
to find you the according library calls though, but starting point is
http://docs.python.org.
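A minimal Python 3 sketch of the "declare the encoding at open time" approach
described above, using the io module (the filenames and the sample line are
made up for illustration):

```python
import io
import os
import tempfile

# Hypothetical files in a temp directory; 0xFC is u-umlaut in Latin-1.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'latin1.csv')
dst = os.path.join(tmpdir, 'utf8.csv')

with open(src, 'wb') as f:
    f.write(b'Mayag\xfcez,PR\n')

# The encoding is named once per file; reading yields unicode text,
# writing accepts unicode text and does the byte work itself.
with io.open(src, encoding='latin-1') as infile, \
     io.open(dst, 'w', encoding='utf-8', newline='') as outfile:
    for line in infile:
        outfile.write(line)

with open(dst, 'rb') as f:
    data = f.read()
print(data)   # 0xFC came back out as the UTF-8 pair 0xC3 0xBC
```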

Good luck!

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Ascii to Unicode.

2010-07-29 Thread Joe Goldthwaite
Hi Steven,

I read through the article you referenced.  I understand Unicode better now.
I wasn't completely ignorant of the subject.  My confusion is more about how
Python is handling Unicode than Unicode itself.  I guess I'm fighting my own
misconceptions. I do that a lot.  It's hard for me to understand how things
work when they don't function the way I *think* they should.

Here's the main source of my confusion.  In my original sample, I had read a
line in from the file and used the unicode function to create a
unicodestring object;

unicodestring = unicode(line, 'latin1')

What I thought this step would do is translate the line to an internal
Unicode representation.  The problem character \xe1 would have been
translated into a correct Unicode representation for the accented a
character. 

Next I tried to write the unicodestring object to a file thusly;

output.write(unicodestring)

I would have expected the write function to request the byte string from the
unicodestring object and simply write that byte string to a file.  I thought
that at this point, I should have had a valid Unicode latin1 encoded file.
Instead I get an error that the character \xe1 is invalid.

The fact that the \xe1 character is still in the unicodestring object tells
me it wasn't translated into whatever python uses for its internal Unicode
representation.  Either that or the unicodestring object returns the
original string when it's asked for a byte stream representation.

Instead of just writing the unicodestring object, I had to do this;

output.write(unicodestring.encode('utf-8'))

This is doing what I thought the other steps were doing.  It's translating
the internal unicodestring byte representation to utf-8 and writing it out.
It still seems strange and I'm still not completely clear as to what is
going on at the byte stream level for each of these steps.



-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Ascii to Unicode.

2010-07-29 Thread Joe Goldthwaite
Hi Ulrich,

Ascii.csv isn't really a latin-1 encoded file.  It's an ascii file with a
few characters above the 128 range that are causing Postgresql Unicode
errors.  Those characters work fine in the Windows world but they're not the
correct byte representation for Unicode. What I'm attempting to do is
translate those upper range characters into the correct Unicode
representations so that they look the same in the Postgresql database as
they did in the CSV file.

I wrote up the source of my confusion to Steven so I won't duplicate it
here.  Your comment on defining the encoding of the file directly, instead
of using functions to encode and decode the data, led me to the codecs
module.  Using it, I can define the encoding at file open time and then just
read and write the lines.  I ended up with this;

import codecs

input = codecs.open('ascii.csv', encoding='cp1252')
output = codecs.open('unicode.csv', mode='wb', encoding='utf-8')

output.writelines(input.readlines())

input.close()
output.close()

This is doing exactly the same thing but it's much clearer to me.  Readlines
translates the input using the cp1252 codec and writelines encodes it to
utf-8 and writes it out.  And as you mentioned, it's probably higher
performance.  I haven't tested that but since both programs do the job in
seconds, performance isn't an issue.

Thanks again to everyone who posted.  I really do appreciate it.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread Ethan Furman

Joe Goldthwaite wrote:

Hi Steven,

I read through the article you referenced.  I understand Unicode better now.
I wasn't completely ignorant of the subject.  My confusion is more about how
Python is handling Unicode than Unicode itself.  I guess I'm fighting my own
misconceptions. I do that a lot.  It's hard for me to understand how things
work when they don't function the way I *think* they should.

Here's the main source of my confusion.  In my original sample, I had read a
line in from the file and used the unicode function to create a
unicodestring object;

unicodestring = unicode(line, 'latin1')

What I thought this step would do is translate the line to an internal
Unicode representation.  The problem character \xe1 would have been
translated into a correct Unicode representation for the accented a
character. 


Correct.  At this point you have a unicode string.


Next I tried to write the unicodestring object to a file thusly;

output.write(unicodestring)

I would have expected the write function to request the byte string from the
unicodestring object and simply write that byte string to a file.  I thought
that at this point, I should have had a valid Unicode latin1 encoded file.
Instead get an error that the character \xe1 is invalid.


Here's the problem -- there is no byte string representing the unicode 
string, they are completely different.  There are dozens of different 
possible encodings to go from unicode to a byte-string (of which UTF-8 
is one such possibility).



The fact that the \xe1 character is still in the unicodestring object tells
me it wasn't translated into whatever python uses for its internal Unicode
representation.  Either that or the unicodestring object returns the
original string when it's asked for a byte stream representation.


Wrong.  It so happens that some of the unicode points are the same as 
some (but not all) of the ascii and upper-ascii values.  When you 
attempt to write a unicode string without specifying which encoding you 
want, python falls back to ascii (not upper-ascii) so any character 
outside the 0-127 range is going to raise an error.
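The failure Ethan describes can be reproduced directly. (Python 3 shown:
there the error is explicit on encode(), where Python 2 hit it implicitly
when write() fell back to the ASCII codec.)

```python
# A code point above 127 has no ASCII encoding, so encoding fails with
# the same "ordinal not in range(128)" message seen in the traceback.
s = '\xe1'                      # U+00E1, LATIN SMALL LETTER A WITH ACUTE
try:
    s.encode('ascii')
    failed = False
    reason = ''
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason
print(failed, reason)           # True ordinal not in range(128)
print(s.encode('latin-1'))      # naming a suitable encoding succeeds
```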



Instead of just writing the unicodestring object, I had to do this;

output.write(unicodestring.encode('utf-8'))

This is doing what I thought the other steps were doing.  It's translating
the internal unicodestring byte representation to utf-8 and writing it out.
It still seems strange and I'm still not completely clear as to what is
going on at the byte stream level for each of these steps.



Don't think of unicode as a byte stream.  It's a bunch of numbers that 
map to a bunch of symbols.  The byte stream only comes into play when 
you want to send unicode somewhere (file, socket, etc) and you then have 
to encode the unicode into bytes.


Hope this helps!

~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread Carey Tilden
On Thu, Jul 29, 2010 at 10:59 AM, Joe Goldthwaite j...@goldthwaites.com wrote:
 Hi Ulrich,

 Ascii.csv isn't really a latin-1 encoded file.  It's an ascii file with a
 few characters above the 128 range that are causing Postgresql Unicode
 errors.  Those characters work fine in the Windows world but they're not the
 correct byte representation for Unicode. What I'm attempting to do is
 translate those upper range characters into the correct Unicode
 representations so that they look the same in the Postgresql database as
 they did in the CSV file.

Having bytes outside of the ASCII range means, by definition, that the
file is not ASCII encoded.  ASCII only defines bytes 0-127.  Bytes
outside of that range mean either the file is corrupt, or it's in a
different encoding.  In this case, you've been able to determine the
correct encoding (latin-1) for those errant bytes, so the file itself
is thus known to be in that encoding.

Carey
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread Ethan Furman

Joe Goldthwaite wrote:

Hi Ulrich,

Ascii.csv isn't really a latin-1 encoded file.  It's an ascii file with a
few characters above the 128 range . . .


It took me a while to get this point too (if you already have gotten 
it, I apologize, but the above comment leads me to believe you haven't).


*Every* file is an encoded file... even your UTF-8 file is encoded using 
the UTF-8 format.  Someone correct me if I'm wrong, but I believe 
lower-ascii (0-127) matches up to the first 128 Unicode code points, so 
while those first 128 code-points translate easily to ascii, ascii is 
still an encoding, and if you have characters higher than 127, you don't 
really have an ascii file -- you have (for example) a cp1252 file (which 
also, not coincidentally, shares the first 128 characters/code points 
with ascii).
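Ethan's point can be checked directly: ASCII, UTF-8 and cp1252 agree
byte-for-byte on the first 128 code points and diverge above them (a small
Python 3 sketch):

```python
# The three encodings are identical over the ASCII range...
for i in range(128):
    ch = chr(i)
    assert ch.encode('ascii') == ch.encode('utf-8') == ch.encode('cp1252')

# ...but above 127 they diverge: cp1252 still yields one byte,
# UTF-8 needs two, and ASCII refuses outright.
a_acute = '\xe1'                     # U+00E1
one_byte = a_acute.encode('cp1252')  # 0xE1
two_bytes = a_acute.encode('utf-8')  # 0xC3 0xA1
print(one_byte, two_bytes)
```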


Hopefully I'm not adding to the confusion.  ;)

~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread John Nagle

On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:

This still seems odd to me.  I would have thought that the unicode function
would return a properly encoded byte stream that could then simply be
written to disk. Instead it seems like you have to re-encode the byte stream
to some kind of escaped Ascii before it can be written back out.


   Here's what's really going on.

   Unicode strings within Python have to be indexable.  So the internal
representation of Unicode has (usually) two bytes for each character,
so they work like arrays.

   UTF-8 is a stream format for Unicode.  It's slightly compressed;
each character occupies 1 to 4 bytes, and the base ASCII characters
(0..127 only, not 128..255) occupy one byte each.  The format is
described in http://en.wikipedia.org/wiki/UTF-8.  A UTF-8 file or
stream has to be parsed from the beginning to keep track of where each
Unicode character begins.  So it's not a suitable format for
data being actively worked on in memory; it can't be easily indexed.

   That's why it's necessary to convert to UTF-8 before writing
to a file or socket.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread MRAB

John Nagle wrote:

On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
This still seems odd to me.  I would have thought that the unicode 
function

would return a properly encoded byte stream that could then simply be
written to disk. Instead it seems like you have to re-encode the byte 
stream

to some kind of escaped Ascii before it can be written back out.


   Here's what's really going on.

   Unicode strings within Python have to be indexable.  So the internal
representation of Unicode has (usually) two bytes for each character,
so they work like arrays.

   UTF-8 is a stream format for Unicode.  It's slightly compressed;
each character occupies 1 to 4 bytes, and the base ASCII characters
(0..127 only, not 128..255) occupy one byte each.  The format is
described in http://en.wikipedia.org/wiki/UTF-8.  A UTF-8 file or
stream has to be parsed from the beginning to keep track of where each
Unicode character begins.  So it's not a suitable format for
data being actively worked on in memory; it can't be easily indexed.


Not entirely correct. The advantage of UTF-8 is that although different
codepoints might be encoded into different numbers of bytes it's easy to
tell whether a particular byte is the first in its sequence, so you
don't have to parse from the start of the file. It is true, however, it
can't be easily indexed.
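MRAB's correction is easy to verify: UTF-8 continuation bytes always carry
the bit pattern 10xxxxxx, so a decoder can classify any byte in isolation
and resynchronize from an arbitrary offset (a small Python 3 sketch):

```python
# Continuation bytes are exactly those matching 10xxxxxx.
def is_continuation(byte):
    return byte & 0xC0 == 0x80

data = 'a\u00e9\u20ac'.encode('utf-8')   # 1-, 2- and 3-byte characters
flags = [is_continuation(b) for b in data]
print(flags)                              # [False, False, True, False, True, True]

# Land in the middle of a character and skip forward to the next start byte.
i = 2                                     # inside the 2-byte sequence for U+00E9
while i < len(data) and is_continuation(data[i]):
    i += 1
print(i)                                  # 3: the first byte of U+20AC
```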


   That's why it's necessary to convert to UTF-8 before writing
to a file or socket.


--
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread Steven D'Aprano
On Thu, 29 Jul 2010 11:14:24 -0700, Ethan Furman wrote:

 Don't think of unicode as a byte stream.  It's a bunch of numbers that
 map to a bunch of symbols.

Not only are Unicode strings a bunch of numbers (code points, in 
Unicode terminology), but the numbers are not necessarily all the same 
width.

The full Unicode system allows for 1,114,112 characters, far more than 
will fit in a two-byte code point. The Basic Multilingual Plane (BMP) 
includes the first 2**16 (65536) of those characters, or code points 
U+0000 through U+FFFF; there are a further 16 supplementary planes of 
2**16 characters each, or code points U+10000 through U+10FFFF.

As I understand it (and I welcome corrections), some implementations of 
Unicode only support the BMP and use a fixed-width implementation of 16-
bit characters for efficiency reasons. Supporting the entire range of 
code points would require either a fixed-width of 21-bits (which would 
then probably be padded to four bytes), or a more complex variable-width 
implementation.

It looks to me like Python uses a 16-bit implementation internally, which 
leads to some rather unintuitive results for code points in the 
supplementary planes... 

 c = chr(2**18)
 c
'\U00040000'
 len(c)
2
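For what it's worth, on a wide build (and on any Python since PEP 393 in
3.3) len(chr(2**18)) is 1, and the two-unit behaviour survives only in the
UTF-16 encoding itself, as a surrogate pair (Python 3 sketch):

```python
# A supplementary-plane character is a single code point...
c = chr(2**18)                 # U+40000, outside the BMP
print(len(c))                  # 1 on a wide build / Python 3.3+

# ...but its UTF-16 encoding is still two 16-bit code units.
u16 = c.encode('utf-16-be')
print(len(u16) // 2)           # 2
print(u16.hex())               # d8c0dc00: high surrogate D8C0, low DC00
```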


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread Mark Tolonen


Joe Goldthwaite j...@goldthwaites.com wrote in message 
news:5a04846ed83745a8a99a944793792...@newmbp...

Hi Steven,

I read through the article you referenced.  I understand Unicode better 
now.
I wasn't completely ignorant of the subject.  My confusion is more about 
how
Python is handling Unicode than Unicode itself.  I guess I'm fighting my 
own
misconceptions. I do that a lot.  It's hard for me to understand how 
things

work when they don't function the way I *think* they should.

Here's the main source of my confusion.  In my original sample, I had read 
a

line in from the file and used the unicode function to create a
unicodestring object;

unicodestring = unicode(line, 'latin1')

What I thought this step would do is translate the line to an internal
Unicode representation.


Correct.


The problem character \xe1 would have been
translated into a correct Unicode representation for the accented a
character.


Which just so happens to be u'\xe1', which probably adds to your confusion 
later :^)  The first 256 Unicode code points map to latin1.




Next I tried to write the unicodestring object to a file thusly;

output.write(unicodestring)

I would have expected the write function to request the byte string from 
the
unicodestring object and simply write that byte string to a file.  I 
thought

that at this point, I should have had a valid Unicode latin1 encoded file.
Instead get an error that the character \xe1 is invalid.


Incorrect.  The unicodestring object doesn't save the original byte string, 
so there is nothing to request.


The fact that the \xe1 character is still in the unicodestring object 
tells

me it wasn't translated into whatever python uses for its internal Unicode
representation.  Either that or the unicodestring object returns the
original string when it's asked for a byte stream representation.


Both incorrect.  As I mentioned earlier, the first Unicode code points map 
to latin1.  It *was* translated to a Unicode code point whose value (but not 
internal representation!) is the same as latin1.



Instead of just writing the unicodestring object, I had to do this;

output.write(unicodestring.encode('utf-8'))


This is exactly what you need to do...explicitly encode the Unicode string 
into a byte string.



This is doing what I thought the other steps were doing.  It's translating
the internal unicodestring byte representation to utf-8 and writing it 
out.

It still seems strange and I'm still not completely clear as to what is
going on at the byte stream level for each of these steps.


I'm surprised that by now no one has mentioned the codecs module.  You 
originally stated you are using Python 2.4.4, which I looked up and does 
support the codecs module.


   import codecs

   infile = codecs.open('ascii.csv', 'r', 'latin1')
   outfile = codecs.open('unicode.csv', 'w', 'utf-8')
   for line in infile:
       outfile.write(line)
   infile.close()
   outfile.close()

As you can see, codecs.open takes a parameter for the encoding of the file. 
Lines read are automatically decoded into Unicode; Unicode lines written are 
automatically encoded into a byte stream.


-Mark


--
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-29 Thread Nobody
On Thu, 29 Jul 2010 23:49:40 +, Steven D'Aprano wrote:

 It looks to me like Python uses a 16-bit implementation internally,

It typically uses the platform's wchar_t, which is 16-bit on Windows and
(typically) 32-bit on Unix.

IIRC, it's possible to build Python with 32-bit Unicode on Windows, but
that will be inefficient (because it has to convert to/from 16-bit
when calling Windows API functions) and will break any C modules which
pass the pointer to the internal buffer directly to API functions.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-28 Thread MRAB

Joe Goldthwaite wrote:

Hi,

I've got an Ascii file with some latin characters. Specifically \xe1 and
\xfc.  I'm trying to import it into a Postgresql database that's running in
Unicode mode. The Unicode converter chokes on those two characters.




I could just manually replace those two characters with something valid but
if any other invalid characters show up in later versions of the file, I'd
like to handle them correctly.


I've been playing with the Unicode stuff and I found out that I could
convert both those characters correctly using the latin1 encoder like this;


import unicodedata

s = '\xe1\xfc'
print unicode(s,'latin1')


The above works.  When I try to convert my file however, I still get an
error;

import unicodedata

input = file('ascii.csv', 'r')
output = file('unicode.csv','w')

for line in input.xreadlines():
output.write(unicode(line,'latin1'))

input.close()
output.close()

Traceback (most recent call last):
  File C:\Users\jgold\CloudmartFiles\UnicodeTest.py, line 10, in __main__
output.write(unicode(line,'latin1'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
295: ordinal not in range(128)

I'm stuck using Python 2.4.4 which may be handling the strings differently
depending on if they're in the program or coming from the file.  I just
haven't been able to figure out how to get the Unicode conversion working
from the file data.

Can anyone explain what is going on?


What you need to remember is that files contain bytes.

When you say ASCII file what you mean is that the file contains bytes
which represent text encoded as ASCII, and such a file by definition
can't contain bytes outside the range 0-127. Therefore your file isn't
an ASCII file. So then you've decided to treat it as a file containing
bytes which represent text encoded as Latin-1.

You're reading bytes from a file, decoding them to Unicode, and then
trying to write them to a file, but the output file expects bytes (did I
say that files contain bytes? :-)), so it's trying to encode back to
bytes using the default encoding, which is ASCII. u'\xe1' can't be 
encoded as ASCII, therefore UnicodeEncodeError is raised.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-28 Thread Thomas Jollans
On 07/28/2010 08:32 PM, Joe Goldthwaite wrote:
 Hi,
 
 I've got an Ascii file with some latin characters. Specifically \xe1 and
 \xfc.  I'm trying to import it into a Postgresql database that's running in
 Unicode mode. The Unicode converter chokes on those two characters.
 
 I could just manually replace those two characters with something valid but
 if any other invalid characters show up in later versions of the file, I'd
 like to handle them correctly.
 
 
 I've been playing with the Unicode stuff and I found out that I could
 convert both those characters correctly using the latin1 encoder like this;
 
 
   import unicodedata
 
   s = '\xe1\xfc'
   print unicode(s,'latin1')
 
 
 The above works.  When I try to convert my file however, I still get an
 error;
 
   import unicodedata
 
   input = file('ascii.csv', 'r')
   output = file('unicode.csv','w')

output is still a binary file - there are no unicode files. You need to
encode the text somehow.

 Traceback (most recent call last):
   File C:\Users\jgold\CloudmartFiles\UnicodeTest.py, line 10, in __main__
 output.write(unicode(line,'latin1'))
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
 295: ordinal not in range(128)

by default, Python tries to encode strings using ASCII. This, obviously,
won't work here.

Do you know which encoding your database expects ? I'd assume it'd
understand UTF-8. Everybody uses UTF-8.



   for line in input.xreadlines():
   output.write(unicode(line,'latin1'))

unicode(line, 'latin1') is unicode, you need it to be a UTF-8 bytestring:

unicode(line, 'latin1').encode('utf-8')

or:

line.decode('latin1').encode('utf-8')
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-28 Thread John Nagle

On 7/28/2010 11:32 AM, Joe Goldthwaite wrote:

Hi,

I've got an Ascii file with some latin characters. Specifically \xe1 and
\xfc.  I'm trying to import it into a Postgresql database that's running in
Unicode mode. The Unicode converter chokes on those two characters.

I could just manually replace those two characters with something valid but
if any other invalid characters show up in later versions of the file, I'd
like to handle them correctly.


I've been playing with the Unicode stuff and I found out that I could
convert both those characters correctly using the latin1 encoder like this;


import unicodedata

s = '\xe1\xfc'
print unicode(s,'latin1')


The above works.  When I try to convert my file however, I still get an
error;

import unicodedata

input = file('ascii.csv', 'r')
output = file('unicode.csv','w')

for line in input.xreadlines():
output.write(unicode(line,'latin1'))

input.close()
output.close()


Try this, which will get you a UTF-8 file, the usual standard for
Unicode in a file.

for rawline in input :
unicodeline = unicode(line,'latin1')# Latin-1 to Unicode
output.write(unicodeline.encode('utf-8')) # Unicode to as UTF-8


John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-28 Thread Thomas Jollans
On 07/28/2010 09:29 PM, John Nagle wrote:
 for rawline in input :
 unicodeline = unicode(line,'latin1')# Latin-1 to Unicode
 output.write(unicodeline.encode('utf-8')) # Unicode to as UTF-8

you got your blocks wrong.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-28 Thread John Machin
On Jul 29, 4:32 am, Joe Goldthwaite j...@goldthwaites.com wrote:
 Hi,

 I've got an Ascii file with some latin characters. Specifically \xe1 and
 \xfc.  I'm trying to import it into a Postgresql database that's running in
 Unicode mode. The Unicode converter chokes on those two characters.

 I could just manually replace those two characters with something valid but
 if any other invalid characters show up in later versions of the file, I'd
 like to handle them correctly.

 I've been playing with the Unicode stuff and I found out that I could
 convert both those characters correctly using the latin1 encoder like this;

         import unicodedata

         s = '\xe1\xfc'
         print unicode(s,'latin1')

 The above works.  When I try to convert my file however, I still get an
 error;

         import unicodedata

         input = file('ascii.csv', 'r')
         output = file('unicode.csv','w')

         for line in input.xreadlines():
                 output.write(unicode(line,'latin1'))

         input.close()
         output.close()

 Traceback (most recent call last):
   File C:\Users\jgold\CloudmartFiles\UnicodeTest.py, line 10, in __main__
     output.write(unicode(line,'latin1'))
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
 295: ordinal not in range(128)

 I'm stuck using Python 2.4.4 which may be handling the strings differently
 depending on if they're in the program or coming from the file.  I just
 haven't been able to figure out how to get the Unicode conversion working
 from the file data.

 Can anyone explain what is going on?

Hello hello ... you are running on Windows; the likelihood that you
actually have data encoded in latin1 is very very small. Follow MRAB's
answer but replace latin1 by cp1252.
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Ascii to Unicode.

2010-07-28 Thread Joe Goldthwaite

 Hello hello ... you are running on Windows; the likelihood that you
 actually have data encoded in latin1 is very very small. Follow MRAB's
 answer but replace latin1 by cp1252.

I think you're right. The database I'm working with is a US zip code
database.  It gets updated monthly.  The problem fields are some city names
in Puerto Rico. I thought I had tried the cp1252 codec and that it didn't
work. I tried it again and it works now so I was doing something else wrong.

I agree that's probably what I should be using.  Both latin1 and cp1252
produce the same output for the two characters I'm having the trouble with
but I changed it to cp1252 anyway.  I think it will avoid problems in the
future.

Thanks John.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ascii to Unicode.

2010-07-28 Thread Steven D'Aprano
On Wed, 28 Jul 2010 15:58:01 -0700, Joe Goldthwaite wrote:

 This still seems odd to me.  I would have thought that the unicode
 function would return a properly encoded byte stream that could then
 simply be written to disk. Instead it seems like you have to re-encode
 the byte stream to some kind of escaped Ascii before it can be written
 back out.

I'm afraid that's not even wrong. The unicode function returns a unicode 
string object, not a byte-stream, just as the list function returns a 
sequence of objects, not a byte-stream.

Perhaps this will help:

http://www.joelonsoftware.com/articles/Unicode.html


Summary:

ASCII is not a synonym for bytes, no matter what some English-speakers 
think. ASCII is an encoding from bytes like \x41 to characters like A.

Unicode strings are a sequence of code points. A code point is a number, 
implemented in some complex fashion that you don't need to care about. 
Each code point maps conceptually to a letter; for example, the English 
letter A is represented by the code point U+0041 and the Arabic letter 
Ain is represented by the code point U+0639.

You shouldn't make any assumptions about the size of each code-point, or 
how they are put together. You shouldn't expect to write code points to a 
disk and have the result make sense, any more than you could expect to 
write a sequence of tuples or sets or dicts to disk in any sensible 
fashion. You have to serialise it to bytes first, and that's what the 
encode method does. Decode does the opposite, taking bytes and creating 
unicode strings from them.

For historical reasons -- backwards compatibility with files already 
created, back in the Bad Old Days before unicode -- there are a whole 
slew of different encodings available. There is no 1:1 mapping between 
bytes and strings. If all you have are the bytes, there is literally no 
way of knowing what string they represent (although sometimes you can 
guess). You need to know what the encoding used was, or take a guess, or 
make repeated decodings until something doesn't fail and hope that's the 
right one.

As a general rule, Python will try encoding/decoding using the ASCII 
encoding unless you tell it differently.

Any time you are writing to disk, you need to serialise the objects, 
regardless of whether they are floats, or dicts, or unicode strings.
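The guessing problem is concrete: the same bytes decode to different strings
under different encodings, and stricter encodings reject them outright (a
small Python 3 sketch):

```python
# Two decodings of the same two bytes disagree...
raw = b'\xe1\xfc'
as_latin1 = raw.decode('latin-1')     # accented a, u-umlaut
as_cp437 = raw.decode('cp437')        # something else (old IBM PC set)
print(as_latin1 == as_cp437)          # False

# ...and UTF-8 refuses them: this byte sequence is not valid UTF-8 at all.
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)                      # False
```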


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ascii to unicode line endings

2007-05-03 Thread fidtz
On 2 May, 17:29, Jean-Paul Calderone [EMAIL PROTECTED] wrote:
 On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] wrote:



 The code:

 import codecs

 udlASCII = file("c:\\temp\\CSVDB.udl", 'r')
 udlUNI = codecs.open("c:\\temp\\CSVDB2.udl", 'w', "utf_16")

 udlUNI.write(udlASCII.read())

 udlUNI.close()
 udlASCII.close()

 This doesn't seem to generate the correct line endings. Instead of
 converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as  0x0D/
 0x0A

 I have tried various 2 byte unicode encoding but it doesn't seem to
 make a difference. I have also tried modifying the code to read and
 convert a line at a time, but that didn't make any difference either.

 I have tried to understand the unicode docs but nothing seems to
 indicate why an seemingly incorrect conversion is being done.
 Obviously I am missing something blindingly obvious here, any help
 much appreciated.

 Consider this simple example:

import codecs
f = codecs.open('test-newlines-file', 'w', 'utf16')
f.write('\r\n')
f.close()
f = file('test-newlines-file')
f.read()
   '\xff\xfe\r\x00\n\x00'
   

 And how it differs from your example.  Are you sure you're examining
 the resulting output properly?

 By the way, \r\0\n\0 isn't a unicode line ending, it's just the UTF-16
 encoding of \r\n.

 Jean-Paul

I am not sure what you are driving at here, since I started with an
ascii file, whereas you just write a unicode file to start with. I
guess the direct question is "is there a simple way to convert my
ascii file to a utf16 file?". I thought either string.encode() or
writing to a utf16 file would do the trick but it probably isn't that
simple!

I used a binary file editor I have used a great deal for all sorts of
things to get the hex values.

Dom

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ascii to unicode line endings

2007-05-03 Thread Jean-Paul Calderone
On 3 May 2007 04:30:37 -0700, [EMAIL PROTECTED] wrote:
On 2 May, 17:29, Jean-Paul Calderone [EMAIL PROTECTED] wrote:
 On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] wrote:



 The code:

 import codecs

 udlASCII = file("c:\\temp\\CSVDB.udl", 'r')
 udlUNI = codecs.open("c:\\temp\\CSVDB2.udl", 'w', "utf_16")

 udlUNI.write(udlASCII.read())

 udlUNI.close()
 udlASCII.close()

 This doesn't seem to generate the correct line endings. Instead of
 converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as  0x0D/
 0x0A

 I have tried various 2 byte unicode encoding but it doesn't seem to
 make a difference. I have also tried modifying the code to read and
 convert a line at a time, but that didn't make any difference either.

 I have tried to understand the unicode docs but nothing seems to
 indicate why an seemingly incorrect conversion is being done.
 Obviously I am missing something blindingly obvious here, any help
 much appreciated.

 Consider this simple example:

import codecs
f = codecs.open('test-newlines-file', 'w', 'utf16')
f.write('\r\n')
f.close()
f = file('test-newlines-file')
f.read()
   '\xff\xfe\r\x00\n\x00'
   

 And how it differs from your example.  Are you sure you're examining
 the resulting output properly?

 By the way, \r\0\n\0 isn't a unicode line ending, it's just the UTF-16
 encoding of \r\n.

 Jean-Paul

I am not sure what you are driving at here, since I started with an
ascii file, whereas you just write a unicode file to start with. I
guess the direct question is "is there a simple way to convert my
ascii file to a utf16 file?". I thought either string.encode() or
writing to a utf16 file would do the trick but it probably isn't that
simple!

There's no such thing as a unicode file.  The only difference between
the code you posted and the code I posted is that mine is self-contained
and demonstrates that the functionality works as you expected it to work,
whereas the code you posted requires external resources which are not
available to run and produces external results which are not available to
be checked regarding their correctness.

So what I'm driving at is that both your example and mine are doing it
correctly (because they are doing the same thing), and mine demonstrates
that it is correct, but we have to take your word on the fact that yours
doesn't work. ;)

Jean-Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ascii to unicode line endings

2007-05-03 Thread Jerry Hill
On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 The code:

 import codecs

 udlASCII = file("c:\\temp\\CSVDB.udl",'r')
 udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w','utf_16')
 udlUNI.write(udlASCII.read())
 udlUNI.close()
 udlASCII.close()

 This doesn't seem to generate the correct line endings. Instead of
 converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as  0x0D/
 0x0A

That code (using my own local files, of course) basically works for me.

If I open my input file with mode 'r', as you did above, my '\r\n'
pairs get transformed to '\n' when I read them in and are written to
my output file as 0x00 0x0A.  If I open the input file in binary mode
'rb' then my output file shows the expected sequence of 0x00 0x0D 0x00
0x0A.
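
[Editor's note: in modern Python 3 the text-mode versus binary-mode distinction Jerry describes can be reproduced with the sketch below. This is not the thread's original Python 2 code; the file name is a throwaway temp file.]

```python
import os
import tempfile

# Write raw Windows-style line endings, then read the file back two ways.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(b"one\r\ntwo")

# Binary mode returns the bytes exactly as stored: the CR LF pair survives.
with open(path, "rb") as f:
    raw = f.read()

# Default text mode applies universal-newline translation: \r\n becomes \n,
# so the \r is already gone before any output encoding ever sees it.
with open(path, "r") as f:
    text = f.read()

print(repr(raw))   # b'one\r\ntwo'
print(repr(text))  # 'one\ntwo'
```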

Perhaps there's a quirk of your version of python or your platform?  I'm running
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32

-- 
Jerry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ascii to unicode line endings

2007-05-03 Thread fidtz
On 3 May, 13:00, Jean-Paul Calderone [EMAIL PROTECTED] wrote:
 On 3 May 2007 04:30:37 -0700, [EMAIL PROTECTED] wrote:



 On 2 May, 17:29, Jean-Paul Calderone [EMAIL PROTECTED] wrote:
  On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] wrote:

  The code:

  import codecs

  udlASCII = file("c:\\temp\\CSVDB.udl",'r')
  udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w','utf_16')

  udlUNI.write(udlASCII.read())

  udlUNI.close()
  udlASCII.close()

  This doesn't seem to generate the correct line endings. Instead of
  converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as  0x0D/
  0x0A

  I have tried various 2 byte unicode encodings but it doesn't seem to
  make a difference. I have also tried modifying the code to read and
  convert a line at a time, but that didn't make any difference either.

  I have tried to understand the unicode docs but nothing seems to
  indicate why a seemingly incorrect conversion is being done.
  Obviously I am missing something blindingly obvious here, any help
  much appreciated.

  Consider this simple example:

 >>> import codecs
 >>> f = codecs.open('test-newlines-file', 'w', 'utf16')
 >>> f.write('\r\n')
 >>> f.close()
 >>> f = file('test-newlines-file')
 >>> f.read()
 '\xff\xfe\r\x00\n\x00'

  And how it differs from your example.  Are you sure you're examining
  the resulting output properly?

  By the way, \r\0\n\0 isn't a unicode line ending, it's just the UTF-16
  encoding of \r\n.

  Jean-Paul

 I am not sure what you are driving at here, since I started with an
 ascii file, whereas you just write a unicode file to start with. I
 guess the direct question is "is there a simple way to convert my
 ascii file to a utf16 file?". I thought either string.encode() or
 writing to a utf16 file would do the trick but it probably isn't that
 simple!

 There's no such thing as a unicode file.  The only difference between
 the code you posted and the code I posted is that mine is self-contained
 and demonstrates that the functionality works as you expected it to work,
 whereas the code you posted requires external resources which are not
 available to run and produces external results which are not available to
 be checked regarding their correctness.

 So what I'm driving at is that both your example and mine are doing it
 correctly (because they are doing the same thing), and mine demonstrates
 that it is correct, but we have to take your word on the fact that yours
 doesn't work. ;)

 Jean-Paul

Thanks for the advice. I cannot prove what is going on. The following
code seems to work fine as far as console output goes, but the actual
bit patterns of the files on disk are not what I am expecting (or
expected as input by the ultimate user of the converted file). Which I
can't prove of course.

>>> import codecs
>>> testASCII = file("c:\\temp\\test1.txt",'w')
>>> testASCII.write("\n")
>>> testASCII.close()
>>> testASCII = file("c:\\temp\\test1.txt",'r')
>>> testASCII.read()
'\n'
Bit pattern on disk: \0x0D\0x0A
>>> testASCII.seek(0)
>>> testUNI = codecs.open("c:\\temp\\test2.txt",'w','utf16')
>>> testUNI.write(testASCII.read())
>>> testUNI.close()
>>> testUNI = file("c:\\temp\\test2.txt",'r')
>>> testUNI.read()
'\xff\xfe\n\x00'
Bit pattern on disk: \0xff\0xfe\0x0a\0x00
Bit pattern I was expecting: \0xff\0xfe\0x0d\0x00\0x0a\0x00
>>> testUNI.close()

Dom

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ascii to unicode line endings

2007-05-03 Thread fidtz
On 3 May, 13:39, Jerry Hill [EMAIL PROTECTED] wrote:
 On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

  The code:

  import codecs

  udlASCII = file("c:\\temp\\CSVDB.udl",'r')
  udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w','utf_16')
  udlUNI.write(udlASCII.read())
  udlUNI.close()
  udlASCII.close()

  This doesn't seem to generate the correct line endings. Instead of
  converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as  0x0D/
  0x0A

 That code (using my own local files, of course) basically works for me.

 If I open my input file with mode 'r', as you did above, my '\r\n'
 pairs get transformed to '\n' when I read them in and are written to
 my output file as 0x00 0x0A.  If I open the input file in binary mode
 'rb' then my output file shows the expected sequence of 0x00 0x0D 0x00
 0x0A.

 Perhaps there's a quirk of your version of python or your platform?  I'm 
 running
 Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
 (Intel)] on win32

 --
 Jerry

Thanks very much! Not sure if you intended to fix my whole problem,
but changing the read mode to 'rb' has done the trick :)

Dom
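
[Editor's note: for the record, here is the same fix sketched in modern Python 3. The sample file contents are illustrative, not Dom's actual data: the point is that opening the source in binary ('rb') lets the CR LF pair reach the UTF-16 writer intact.]

```python
import codecs

# Create a small ASCII sample standing in for the real .udl file.
with open("CSVDB.udl", "wb") as f:
    f.write(b"line1\r\nline2\r\n")

# Read in binary mode so '\r\n' is not collapsed to '\n', then write UTF-16.
with open("CSVDB.udl", "rb") as src:
    data = src.read().decode("ascii")
with codecs.open("CSVDB2.udl", "w", "utf_16") as dst:
    dst.write(data)

with open("CSVDB2.udl", "rb") as f:
    out = f.read()

# BOM first, then every ASCII byte widened to two bytes, CR and LF included.
assert out.startswith(codecs.BOM_UTF16)
assert out.decode("utf_16") == "line1\r\nline2\r\n"
```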

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ascii to unicode line endings

2007-05-03 Thread Marc 'BlackJack' Rintsch
In [EMAIL PROTECTED], fidtz wrote:

 >>> import codecs
 >>> testASCII = file("c:\\temp\\test1.txt",'w')
 >>> testASCII.write("\n")
 >>> testASCII.close()
 >>> testASCII = file("c:\\temp\\test1.txt",'r')
 >>> testASCII.read()
 '\n'
 Bit pattern on disk: \0x0D\0x0A
 >>> testASCII.seek(0)
 >>> testUNI = codecs.open("c:\\temp\\test2.txt",'w','utf16')
 >>> testUNI.write(testASCII.read())
 >>> testUNI.close()
 >>> testUNI = file("c:\\temp\\test2.txt",'r')
 >>> testUNI.read()
 '\xff\xfe\n\x00'
 Bit pattern on disk: \0xff\0xfe\0x0a\0x00
 Bit pattern I was expecting: \0xff\0xfe\0x0d\0x00\0x0a\0x00
 >>> testUNI.close()

Files opened with `codecs.open()` are always opened in binary mode.  So if
you want '\n' to be translated into a platform specific character sequence
you have to do it yourself.
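
[Editor's note: a short Python 3 sketch of Marc's point. Since the codecs layer does no newline translation, spell out the CR LF pairs yourself before writing; the output file name here is illustrative.]

```python
import codecs

text = "one\ntwo\n"

# codecs.open() behaves like binary mode for newlines, so translate manually
# if Windows-style line endings are wanted in the output.
with codecs.open("out.txt", "w", "utf_16") as f:
    f.write(text.replace("\n", "\r\n"))

with open("out.txt", "rb") as f:
    out = f.read()

# Decoding shows the explicit \r\n pairs made it to disk.
assert out.decode("utf_16") == "one\r\ntwo\r\n"
```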

Ciao,
Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ascii to unicode line endings

2007-05-02 Thread Jean-Paul Calderone
On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] wrote:
The code:

import codecs

udlASCII = file("c:\\temp\\CSVDB.udl",'r')
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w','utf_16')

udlUNI.write(udlASCII.read())

udlUNI.close()
udlASCII.close()

This doesn't seem to generate the correct line endings. Instead of
converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as  0x0D/
0x0A

I have tried various 2 byte unicode encodings but it doesn't seem to
make a difference. I have also tried modifying the code to read and
convert a line at a time, but that didn't make any difference either.

I have tried to understand the unicode docs but nothing seems to
indicate why a seemingly incorrect conversion is being done.
Obviously I am missing something blindingly obvious here, any help
much appreciated.

Consider this simple example:

>>> import codecs
>>> f = codecs.open('test-newlines-file', 'w', 'utf16')
>>> f.write('\r\n')
>>> f.close()
>>> f = file('test-newlines-file')
>>> f.read()
'\xff\xfe\r\x00\n\x00'

And how it differs from your example.  Are you sure you're examining
the resulting output properly?

By the way, \r\0\n\0 isn't a unicode line ending, it's just the UTF-16
encoding of \r\n.

Jean-Paul
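
[Editor's note: Jean-Paul's closing point is easy to verify directly in modern Python 3: the byte sequence the original poster expected is simply "\r\n" passed through the UTF-16-LE codec, not a special Unicode line ending.]

```python
# "\r\0\n\0" is just the UTF-16-LE encoding of the two characters \r and \n.
encoded = "\r\n".encode("utf-16-le")
assert encoded == b"\r\x00\n\x00"
```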
-- 
http://mail.python.org/mailman/listinfo/python-list