Re: hex dump w/ or w/out utf-8 chars

2013-07-24 Thread wxjmfauth
I cannot find the thread where a Python core dev spoke
about French, so I'm putting this here.

This stupid Flexible String Representation splits Unicode
in chunks and one of these chunks is latin-1 (iso-8859-1).

If we consider that latin-1 is unusable for 17 (seventeen)
European languages based on the latin alphabet, one cannot
say Python is really well prepared.

Most of the problems come from the extensive use of
diacritics in these languages. Thanks to the FSR, again,
working with normalized forms does not work very well. At
least there is some consistency.

Now, if we consider that most of the new characters will
be part of the BMP ("daily" used chars), it is hard to
present Python as a modern language. It sticks more
to the past and is not really prepared for the future,
the acceptance of new chars like ẞ or the new Turkish lira
sign (U+20BA).

>>> sys.getsizeof('š')
40
>>> sys.getsizeof('0')
26

14 bytes to encode a non-latin-1 char is not so bad.


jmf



Re: Timing of string membership (was Re: hex dump w/ or w/out utf-8 chars)

2013-07-14 Thread Chris Angelico
On Mon, Jul 15, 2013 at 2:18 PM, Terry Reedy  wrote:
> On 7/14/2013 10:56 AM, Chris Angelico wrote:
> An issue about finding strings in strings was opened last September and, as
> reported on this list, fixes were applied around last March. As I remember,
> some but not all of the optimizations were applied to 3.3. Perhaps some were
> applied too late for 3.3.1 (3.3.2 is 3.3.1 with some emergency patches to
> correct regressions).

D'oh. I knew there was something raised and solved regarding that, but
I forgot to go check a 3.4 alpha to see if it exhibited the same.
Whoops. My bad. Sorry!

> Python 3.4.0a2:
> >>> import timeit
> >>> timeit.repeat("a = 'hundred'; 'x' in a")
> [0.17396483610667152, 0.16277956641670813, 0.1627937074749941]
> >>> timeit.repeat("a = 'hundreo'; 'x' in a")
> [0.18441108179403187, 0.16277311071618783, 0.16270517215355085]
>
> The difference is gone, again, as previously reported.

Yep, that looks exactly like I would have hoped it would.

>> 0.1765129367 ASCII in ASCII, as set
>
> Much of this time is overhead; 'pass' would not run too much faster.
>
>> 0.1817367850 SMP in BMP
>> 0.1884555160 SMP in ASCII
>> 0.2132371572 BMP in ASCII
>
> For these, 3.3 does no searching because it knows from the internal char
> kind that the answer is No without looking.

Yeah, I mainly included those results so I could say to jmf "Look, FSR
allows some string membership operations to be, I kid you not, as fast
as set operations!".

>> 0.3137454621 ASCII in ASCII
>> 0.4472624314 BMP in BMP
>> 0.6672795006 SMP in SMP
>> 0.7493052888 ASCII in BMP
>> 0.9261783271 ASCII in SMP
>> 0.9865787412 BMP in SMP
>
>> Otherwise, an actual search must be done. Searching
>> for characters in strings of the same width gets slower as the strings
>> get larger in memory (unsurprising). What I'm seeing of the top-end
>> results, though, is that the search for a narrower string in a wider
>> one is quite significantly slower.
>
> 50% longer is not bad, even

Hard to give an estimate; my first tests were the ASCII in ASCII and
ASCII in BMP, which then looked more like 2:1 time. However, rescaling
the needle to BMP makes it more like the 50% you're quoting, so yes,
it's not as much as I thought.

In any case, the most important thing to note is: 3.4 has already
fixed this, ergo jmf should shut up about it. And here I thought I
could credit him with a second actually-useful report...

ChrisA


Re: Timing of string membership (was Re: hex dump w/ or w/out utf-8 chars)

2013-07-14 Thread Terry Reedy

On 7/14/2013 10:56 AM, Chris Angelico wrote:

> On Sun, Jul 14, 2013 at 11:44 PM,   wrote:
>> >>> timeit.repeat("a = 'hundred'; 'x' in a")
>> [0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
>> >>> timeit.repeat("a = 'hundreœ'; 'x' in a")
>> [0.23955250303158593, 0.2195812612416752, 0.22133896997401692]
>> >>> sys.version
>> '3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'


An issue about finding strings in strings was opened last September and,
as reported on this list, fixes were applied around last March. As I
remember, some but not all of the optimizations were applied to 3.3.
Perhaps some were applied too late for 3.3.1 (3.3.2 is 3.3.1 with some
emergency patches to correct regressions).


Python 3.4.0a2:
>>> import timeit
>>> timeit.repeat("a = 'hundred'; 'x' in a")
[0.17396483610667152, 0.16277956641670813, 0.1627937074749941]
>>> timeit.repeat("a = 'hundreo'; 'x' in a")
[0.18441108179403187, 0.16277311071618783, 0.16270517215355085]

The difference is gone, again, as previously reported.


> jmf has raised an interesting point. Some string membership operations
> do seem oddly slow.


He raised it a year ago and action was taken.



> # Get ourselves a longish ASCII string with no duplicates - escape
> # apostrophe and backslash for code later on
> >>> asciichars=''.join(chr(i) for i in range(32,128)).replace("\\",r"\\").replace("'",r"\'")
> >>> haystack=[
> ("ASCII",asciichars+"\u0001"),
> ("BMP",asciichars+"\u1234"),
> ("SMP",asciichars+"\U00012345"),
> ]
> >>> needle=[
> ("ASCII","\u0002"),
> ("BMP","\u1235"),
> ("SMP","\U00012346"),
> ]
> >>> useset=[
> ("",""),
> (", as set","; a=set(a)"),
> ]
> >>> for time,desc in sorted((min(timeit.repeat("'%s' in a"%n,("a='%s'"%h)+s)),
> "%s in %s%s"%(nd,hd,sd)) for nd,n in needle for hd,h in haystack for sd,s in useset):
> print("%.10f %s"%(time,desc))
>
> 0.1765129367 ASCII in ASCII, as set
> 0.1767096097 BMP in SMP, as set
> 0.1778647845 ASCII in BMP, as set
> 0.1785266004 BMP in BMP, as set
> 0.1789093307 SMP in SMP, as set
> 0.1790431465 SMP in BMP, as set
> 0.1796504863 BMP in ASCII, as set
> 0.1803854959 SMP in ASCII, as set
> 0.1810674262 ASCII in SMP, as set


Much of this time is overhead; 'pass' would not run too much faster.


> 0.1817367850 SMP in BMP
> 0.1884555160 SMP in ASCII
> 0.2132371572 BMP in ASCII


For these, 3.3 does no searching because it knows from the internal char 
kind that the answer is No without looking.



> 0.3137454621 ASCII in ASCII
> 0.4472624314 BMP in BMP
> 0.6672795006 SMP in SMP
> 0.7493052888 ASCII in BMP
> 0.9261783271 ASCII in SMP
> 0.9865787412 BMP in SMP
>
> ...
>
> Set membership is faster than string membership, though marginally on
> something this short. If the needle is wider than the haystack, it
> obviously can't be present, so a false return comes back at the speed
> of a set check.


Jim ignores these cases where 3.3+ uses the information about the max 
codepoint to do the operation much faster than in 3.2.



> Otherwise, an actual search must be done. Searching
> for characters in strings of the same width gets slower as the strings
> get larger in memory (unsurprising). What I'm seeing of the top-end
> results, though, is that the search for a narrower string in a wider
> one is quite significantly slower.


50% longer is not bad, even


> I don't know of an actual proven use-case for this, but it seems
> likely to happen (eg you take user input and want to know if there are
> any HTML-sensitive characters in it, so you check ('<' in string or
> '&' in string), for instance).

In my editing of code, I nearly always search for words or long names.

> The question is, is it worth
> constructing an "expanded string" at the haystack's width prior to
> doing the search?


I would not make any assumptions about what Python does or does not do
without checking the code. All I know is that Python uses a modified
version of one of the pre-process and skip-forward algorithms
(Boyer-Moore?, Knuth-Morris-Pratt?, I forget). These are designed to
work efficiently with needles longer than 1 char, and indeed may work
better with longer needles. Searching for a single char in n chars is
O(n). Searching for a needle of length m is potentially O(m*n), and the
point of the fancy algorithms is to make all searches as close to O(n)
as possible.
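For illustration only, here is a bare-bones sketch of the Horspool
flavor of that skip-forward idea (CPython's actual fastsearch is C
code with a bloom-filter-like mask and other refinements, so treat
this as the principle, not the implementation):

def horspool_find(haystack, needle):
    # On a mismatch, jump ahead by how far the character under the end
    # of the window is from the end of the needle (or by len(needle)
    # if it does not occur in the needle at all).
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    skip = {c: m - i - 1 for i, c in enumerate(needle[:-1])}
    i = m - 1
    while i < n:
        if haystack[i - m + 1:i + 1] == needle:
            return i - m + 1
        i += skip.get(haystack[i], m)
    return -1

print(horspool_find("hundreœ", "x"))  # -1, after a plain O(n) scan

Note that with a 1-char needle the skip table is empty and this
degenerates to a straight scan, which is exactly why longer needles
suit these algorithms better.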


--
Terry Jan Reedy




Timing of string membership (was Re: hex dump w/ or w/out utf-8 chars)

2013-07-14 Thread Chris Angelico
On Sun, Jul 14, 2013 at 11:44 PM,   wrote:
> On Sunday, July 14, 2013 12:44:12 UTC+2, Steven D'Aprano wrote:
>> On Sun, 14 Jul 2013 01:20:33 -0700, wxjmfauth wrote:
>>> For a very simple reason, the latin-1 block: considered and accepted
>>> today as being a Unicode design mistake.
>>
>> Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational
>> Character Set", which goes back to 1983. ISO-8859-1 was first published
>> in 1985, and was in use on Commodore computers the same year.
>>
>> The concept of Unicode wasn't even started until 1987, and the first
>> draft wasn't published until the end of 1990. Unicode wasn't considered
>> ready for production use until 1991, six years after Latin-1 was already
>> in use in people's computers.
>>
>> --
>> Steven
>
> --
>
> "Unicode" (in fact ISO/IEC 10646) was not created in one
> night (deus ex machina).
>
> What counts today is this:
>
> >>> timeit.repeat("a = 'hundred'; 'x' in a")
> [0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
> >>> timeit.repeat("a = 'hundreœ'; 'x' in a")
> [0.23955250303158593, 0.2195812612416752, 0.22133896997401692]
>
> >>> sys.getsizeof('d')
> 26
> >>> sys.getsizeof('œ')
> 40
> >>> sys.version
> '3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'

jmf has raised an interesting point. Some string membership operations
do seem oddly slow.

# Get ourselves a longish ASCII string with no duplicates - escape
# apostrophe and backslash for code later on
>>> import timeit
>>> asciichars=''.join(chr(i) for i in range(32,128)).replace("\\",r"\\").replace("'",r"\'")
>>> haystack=[
("ASCII",asciichars+"\u0001"),
("BMP",asciichars+"\u1234"),
("SMP",asciichars+"\U00012345"),
]
>>> needle=[
("ASCII","\u0002"),
("BMP","\u1235"),
("SMP","\U00012346"),
]
>>> useset=[
("",""),
(", as set","; a=set(a)"),
]
>>> for time,desc in sorted((min(timeit.repeat("'%s' in a"%n,("a='%s'"%h)+s)),
"%s in %s%s"%(nd,hd,sd)) for nd,n in needle for hd,h in haystack for sd,s in useset):
print("%.10f %s"%(time,desc))

0.1765129367 ASCII in ASCII, as set
0.1767096097 BMP in SMP, as set
0.1778647845 ASCII in BMP, as set
0.1785266004 BMP in BMP, as set
0.1789093307 SMP in SMP, as set
0.1790431465 SMP in BMP, as set
0.1796504863 BMP in ASCII, as set
0.1803854959 SMP in ASCII, as set
0.1810674262 ASCII in SMP, as set
0.1817367850 SMP in BMP
0.1884555160 SMP in ASCII
0.2132371572 BMP in ASCII
0.3137454621 ASCII in ASCII
0.4472624314 BMP in BMP
0.6672795006 SMP in SMP
0.7493052888 ASCII in BMP
0.9261783271 ASCII in SMP
0.9865787412 BMP in SMP

(In separate testing I ascertained that it makes little difference
whether the character is absent from the string or is the last
character in it. Presumably the figures would be lower if the
character is at the start of the string, but this is not germane to
this discussion.)

Set membership is faster than string membership, though only marginally
so on something this short. If the needle is wider than the haystack, it
obviously can't be present, so a false return comes back at the speed
of a set check. Otherwise, an actual search must be done. Searching
for characters in strings of the same width gets slower as the strings
get larger in memory (unsurprising). What I'm seeing of the top-end
results, though, is that the search for a narrower string in a wider
one is quite significantly slower.

I don't know of an actual proven use-case for this, but it seems
likely to happen (eg you take user input and want to know if there are
any HTML-sensitive characters in it, so you check ('<' in string or
'&' in string), for instance). The question is, is it worth
constructing an "expanded string" at the haystack's width prior to
doing the search?

ChrisA


Re: hex dump w/ or w/out utf-8 chars

2013-07-14 Thread wxjmfauth
On Sunday, July 14, 2013 12:44:12 UTC+2, Steven D'Aprano wrote:
> On Sun, 14 Jul 2013 01:20:33 -0700, wxjmfauth wrote:
>> For a very simple reason, the latin-1 block: considered and accepted
>> today as being a Unicode design mistake.
>
> Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational
> Character Set", which goes back to 1983. ISO-8859-1 was first published
> in 1985, and was in use on Commodore computers the same year.
>
> The concept of Unicode wasn't even started until 1987, and the first
> draft wasn't published until the end of 1990. Unicode wasn't considered
> ready for production use until 1991, six years after Latin-1 was already
> in use in people's computers.
>
> --
> Steven

--

"Unicode" (in fact iso-14xxx) was not created in one
night (Deus ex machina).

What's count today is this:

>>> timeit.repeat("a = 'hundred'; 'x' in a")
[0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
>>> timeit.repeat("a = 'hundreœ'; 'x' in a")
[0.23955250303158593, 0.2195812612416752, 0.22133896997401692]
>>> 
>>> 
>>> sys.getsizeof('d')
26
>>> sys.getsizeof('œ')
40
>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'

jmf




Re: hex dump w/ or w/out utf-8 chars

2013-07-14 Thread Steven D'Aprano
On Sun, 14 Jul 2013 01:20:33 -0700, wxjmfauth wrote:

> For a very simple reason, the latin-1 block: considered and accepted
> today as being a Unicode design mistake.

Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational 
Character Set", which goes back to 1983. ISO-8859-1 was first published 
in 1985, and was in use on Commodore computers the same year.

The concept of Unicode wasn't even started until 1987, and the first 
draft wasn't published until the end of 1990. Unicode wasn't considered 
ready for production use until 1991, six years after Latin-1 was already 
in use in people's computers.



-- 
Steven


Re: hex dump w/ or w/out utf-8 chars

2013-07-14 Thread wxjmfauth
On Saturday, July 13, 2013 21:02:24 UTC+2, Dave Angel wrote:
> On 07/13/2013 10:37 AM, wxjmfa...@gmail.com wrote:
>
> Fortunately for us, Python (in version 3.3 and later) and Pike did it
> right.  Some day the others may decide to do similarly.

---
Possible, but I doubt it.
For a very simple reason, the latin-1 block: considered
and accepted today as being a Unicode design mistake.

jmf



Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread Neil Hodgson

wxjmfa...@gmail.com:


> The FSR is naive and badly working. I cannot force people
> to understand the coding of the characters [*].


   You could at least *try*.

   If there really was a problem with the FSR and you truly understood 
this problem then surely you would be able to communicate the problem to 
at least one person on the list.


   Neil



Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread Dave Angel

On 07/13/2013 10:37 AM, wxjmfa...@gmail.com wrote:

> The FSR is naive and badly working. I cannot force people
> to understand the coding of the characters [*].

That would be very hard, since you certainly do not.

> I'm the first to recognize that Python and/or Pike are
> free to do what they wish.

Fortunately for us, Python (in version 3.3 and later) and Pike did it
right.  Some day the others may decide to do similarly.

> Luckily, for the crowd, those who do not even know that the
> coding of characters exists, all the serious actors active in
> text processing are working properly.

Here, I'm really glad you don't know English, because if you had a
decent grasp of the language, somebody might assume you knew what you
were talking about.

> jmf
>
> * By nature characters and numbers are different.

By nature Jmf has his own distorted reality.

--
DaveA



Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread wxjmfauth
On Saturday, July 13, 2013 11:49:10 UTC+2, Steven D'Aprano wrote:
> On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:
>> You are confusing the knowledge of a coding scheme and the intrinsic
>> information a "coding scheme" *may* have, in a mandatory way, to work
>> properly. These are conceptually two different things.
>
> *May* have, in a *mandatory* way?
>
> JMF, I know you are not a native English speaker, so you might not be
> aware just how silly your statement is. If it *may* have, it is optional,
> since it *may not* have instead. But if it is optional, it is not
> mandatory.
>
> You are making so much fuss over such a simple, obvious implementation
> for strings. The language Pike has done the same thing for probably a
> decade or so.
>
> Ironically, Python has done the same thing for integers for many versions
> too. They just didn't call it "Flexible Integer Representation", but
> that's what it is. For integers smaller than 2**31, they are stored as C
> longs (plus object overhead). For integers larger than 2**31, they are
> promoted to a BigNum implementation that can handle unlimited digits.
>
> Using Python 2.7, where it is more obvious because the BigNum has an L
> appended to the display, and a different type:
>
> py> for n in (1, 2**20, 2**30, 2**31, 2**65):
> ... print repr(n), type(n), sys.getsizeof(n)
> ...
> 1 <type 'int'> 12
> 1048576 <type 'int'> 12
> 1073741824 <type 'int'> 12
> 2147483648L <type 'long'> 18
> 36893488147419103232L <type 'long'> 22
>
> You have been using Flexible Integer Representation for *years*, and it
> works great, and you've never noticed any problems.
>
> --
> Steven

--

The FSR is naive and badly working. I cannot force people
to understand the coding of the characters [*].

I'm the first to recognize that Python and/or Pike are
free to do what they wish.

Luckily, for the crowd, those who do not even know that the
coding of characters exists, all the serious actors active in
text processing are working properly.

jmf

* By nature characters and numbers are different.


Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread Chris Angelico
On Sat, Jul 13, 2013 at 7:49 PM, Steven D'Aprano wrote:
> Ironically, Python has done the same thing for integers for many versions
> too. They just didn't call it "Flexible Integer Representation", but
> that's what it is. For integers smaller than 2**31, they are stored as C
> longs (plus object overhead). For integers larger than 2**31, they are
> promoted to a BigNum implementation that can handle unlimited digits.

Hmm. That's true of Python 2 (mostly - once an operation yields a
long, it never reverts to int, whereas a string will shrink if you
remove the wider characters from it), but not, I think, of Python 3.
The optimization isn't there any more. At least, I did some tinkering
a while ago (on 3.2, I think), so maybe it's been reinstated since. As
of Python 3 and the unification of types, it's definitely possible to
put that in as a pure optimization, anyhow.
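It's easy enough to poke at, though: in Python 3, getsizeof shows the
int object itself growing as the value needs more internal digits (the
exact byte counts vary by build, so don't read too much into them):

>>> import sys
>>> [sys.getsizeof(n) for n in (1, 2**30, 2**31, 2**65)]

Small values and big values are the same type, but not the same size.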

ChrisA


Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread Steven D'Aprano
On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:

> You are confusing the knowledge of a coding scheme and the intrinsic
> information a "coding scheme" *may* have, in a mandatory way, to work
> properly. These are conceptually two different things.

*May* have, in a *mandatory* way?

JMF, I know you are not a native English speaker, so you might not be 
aware just how silly your statement is. If it *may* have, it is optional, 
since it *may not* have instead. But if it is optional, it is not 
mandatory.

You are making so much fuss over such a simple, obvious implementation 
for strings. The language Pike has done the same thing for probably a 
decade or so.

Ironically, Python has done the same thing for integers for many versions 
too. They just didn't call it "Flexible Integer Representation", but 
that's what it is. For integers smaller than 2**31, they are stored as C 
longs (plus object overhead). For integers larger than 2**31, they are 
promoted to a BigNum implementation that can handle unlimited digits.

Using Python 2.7, where it is more obvious because the BigNum has an L 
appended to the display, and a different type:

py> for n in (1, 2**20, 2**30, 2**31, 2**65):
... print repr(n), type(n), sys.getsizeof(n)
...
1 <type 'int'> 12
1048576 <type 'int'> 12
1073741824 <type 'int'> 12
2147483648L <type 'long'> 18
36893488147419103232L <type 'long'> 22


You have been using Flexible Integer Representation for *years*, and it 
works great, and you've never noticed any problems.



-- 
Steven


Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread Chris Angelico
On Sat, Jul 13, 2013 at 5:56 PM,   wrote:
> Try to write an editor, a text widget, with a coding
> scheme like the Flexible String Representation. You will
> quickly notice, it is impossible (understand correctly).
> (You do not need a computer, just a sheet of paper and a pencil)
> Hint: what is the character at the caret position?

I would use an internal representation that allows insertion and
deletion - in its simplest form, a list of strings. And those strings
would be whatever I can most conveniently work with.

I've never built a text editor widget, because my libraries always
provide them. But there is a rough parallel in the display storage for
Gypsum, which stores a series of lines, each of which is a series of
sections in different colors. (A line might be a single section, ie
one color for its whole length.) I store them in arrays of (color,
string, color, string, color, string...). The strings I use are in the
format wanted by my display subsystem - which in my case is the native
string type of the language, which... oh, what a pity for jmf, is a
flexible object that uses 8, 16, or 32 bits for each character.
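To make the list-of-strings idea concrete, here's a toy sketch (all
names are mine, nothing Gypsum-specific about it):

class Buffer:
    # Toy text buffer: one str per line, caret addressed as (row, col).
    def __init__(self, text=""):
        self.lines = text.split("\n")
    def insert(self, row, col, s):
        line = self.lines[row]
        self.lines[row] = line[:col] + s + line[col:]
    def delete(self, row, col, count=1):
        line = self.lines[row]
        self.lines[row] = line[:col] + line[col + count:]
    def char_at(self, row, col):
        # "What is the character at the caret position?" - just an index.
        return self.lines[row][col]

buf = Buffer("hello\nwörld")
buf.insert(1, 5, "!")
print(buf.char_at(1, 5))  # '!' - same code whether the line is ASCII or SMP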

ChrisA


Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread Steven D'Aprano
On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:

> I am convinced you are not conceptually understanding utf-8 very well. I
> wrote many times, "utf-8 does not produce bytes, but Unicode Encoding
> Units".

Just because you write it many times, doesn't make it correct. You are 
simply wrong. UTF-8 produces bytes. That's what gets written to files and 
transmitted over networks, bytes, not "Unicode Encoding Units", whatever 
they are.


> A similar coding scheme: iso-6937.
>
> Try to write an editor, a text widget, with a coding scheme like
> the Flexible String Representation. You will quickly notice, it is
> impossible (understand correctly). (You do not need a computer, just a
> sheet of paper and a pencil) Hint: what is the character at the caret
> position?

That is a simple index operation into the buffer. If the caret position 
is 10 characters in, you index buffer[10-1] and it will give you the 
character to the left of the caret. buffer[10] will give you the 
character to the right of the caret. It is simple, trivial, and easy. The 
buffer itself knows whether to look ahead 10 bytes, 10*2 bytes or 10*4 
bytes.

Here is an example of such a tiny buffer, implemented in Python 3.3 with 
the hated Flexible String Representation. In each example, imagine the 
caret is five characters from the left:

12345|more characters here...

It works regardless of whether your characters are ASCII:


py> buffer = '12345ABCD...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'A'


Latin 1:

py> buffer = '12345áßçð...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'á'


Other BMP characters:

py> buffer = '12345αдᚪ∞...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'α'


And Supplementary Plane Characters:

py> buffer = ('12345'
... '\N{ALCHEMICAL SYMBOL FOR AIR}'
... '\N{ALCHEMICAL SYMBOL FOR FIRE}'
... '\N{ALCHEMICAL SYMBOL FOR EARTH}'
... '\N{ALCHEMICAL SYMBOL FOR WATER}'
... '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
12
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'🜁'
py> unicodedata.name(buffer[5])
'ALCHEMICAL SYMBOL FOR AIR'


And it all Just Works in Python 3.3. So much for "impossible to tell" 
what the character at the caret is. It is *trivial*.



Ah, but how about Python 3.2? We set up the same buffer:


py> buffer = ('12345'
... '\N{ALCHEMICAL SYMBOL FOR AIR}'
... '\N{ALCHEMICAL SYMBOL FOR FIRE}'
... '\N{ALCHEMICAL SYMBOL FOR EARTH}'
... '\N{ALCHEMICAL SYMBOL FOR WATER}'
... '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
16

Sixteen? Sixteen? Where did the extra four characters come from? They 
came from *surrogate pairs*.


py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'\ud83d'


Funny, that looks different.


py> unicodedata.name(buffer[5])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name


No name?

Because buffer[5] is only *half* of the surrogate pair. It is broken, and 
there is really no way of fixing that breakage in Python 3.2 with a 
narrow build. You can fix it with a wide build, but only at the cost of 
every string, every name, using double the amount of storage, whether it 
needs it or not.


-- 
Steven



Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread Lele Gaifax
wxjmfa...@gmail.com writes:

> Try to write an editor, a text widget, with a coding
> scheme like the Flexible String Representation. You will
> quickly notice, it is impossible (understand correctly).
> (You do not need a computer, just a sheet of paper and a pencil)
> Hint: what is the character at the caret position?

I am convinced you are not conceptually understanding the FSR very well.
Alternatively, you may have a strange notion of “impossible”.
Or both.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: hex dump w/ or w/out utf-8 chars

2013-07-13 Thread wxjmfauth
On Friday, July 12, 2013 04:16:21 UTC+2, Chris Angelico wrote:
> On Fri, Jul 12, 2013 at 4:42 AM,   wrote:
>> BTW, since
>> when does a serious coding scheme need an external marker?
>
> All of them.
>
> Content-type: text/plain; charset=UTF-8
>
> ChrisA

--

None of them.

You are confusing the knowledge of a coding scheme and the intrinsic
information a "coding scheme" *may* have, in a mandatory way, to work
properly. These are conceptually two different things.

I am convinced you are not conceptually understanding utf-8 very well.
I wrote many times, "utf-8 does not produce bytes, but Unicode Encoding
Units".

A similar coding scheme: iso-6937.

Try to write an editor, a text widget, with a coding
scheme like the Flexible String Representation. You will
quickly notice, it is impossible (understand correctly).
(You do not need a computer, just a sheet of paper and a pencil)
Hint: what is the character at the caret position?

jmf




Re: hex dump w/ or w/out utf-8 chars

2013-07-12 Thread Steven D'Aprano
On Fri, 12 Jul 2013 23:01:47 +0100, Joshua Landau wrote:

> Isn't a superscript "c" the symbol for radians?


Only in the sense that a superscript "o" is the symbol for degrees.

Semantically, both degree-sign and radian-sign are different "things" 
than merely an o or c in superscript.

Nevertheless, in mathematics at least, it is normal to leave out the 
radian sign when talking about angles. By default, "1.2" means "1.2 
radians", not "1.2 degrees".


-- 
Steven


Re: hex dump w/ or w/out utf-8 chars

2013-07-12 Thread Tim Roberts
Joshua Landau  wrote:
>
>Isn't a superscript "c" the symbol for radians?

That's very rarely used.  More common is "rad".  The problem with a
superscript "c" is that it looks too much like a degree symbol.
-- 
Tim Roberts, t...@probo.com
Providenza & Boekelheide, Inc.


Re: hex dump w/ or w/out utf-8 chars

2013-07-12 Thread Joshua Landau
On 9 July 2013 10:34,  wrote:

> There is no symbol for radian because mathematically
> radian is a pure number, a unitless number. You can
> however specify a = ... in radians (rad).
>

Isn't a superscript "c" the symbol for radians?


Re: hex dump w/ or w/out utf-8 chars

2013-07-12 Thread wxjmfauth
On Friday, July 12, 2013 05:18:44 UTC+2, Steven D'Aprano wrote:
> On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:
>
> Now all your strings will be just as heavy, every single variable name
> and attribute name will use four times as much memory. Happy now?



>>> 㑖 = 999
>>> class C:
... cœur = 'heart'
...

- Why always this magic number "four"?
- Are you able, for once, to think non-ascii?
- Have you ever felt penalized
because you are using fonts with OpenType technology?
- Have you ever had problems with pdf? I can tell you,
utf32 is peanuts compared to the CID-fonts you
are using.
- Did you ever toy with a Unicode TeX engine?
- Did you take a look at a rendering engine code like HarfBuzz?


jmf


Re: hex dump w/ or w/out utf-8 chars

2013-07-12 Thread Chris Angelico
On Fri, Jul 12, 2013 at 4:42 AM,   wrote:
> BTW, since
> when does a serious coding scheme need an external marker?
>

All of them.

Content-type: text/plain; charset=UTF-8

ChrisA


Re: hex dump w/ or w/out utf-8 chars

2013-07-11 Thread Steven D'Aprano
On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:

> And what to say about this "ucs4" char/string '\U0001d11e' which
> weighs 18 bytes more than an "a".
>
> >>> sys.getsizeof('\U0001d11e')
> 44
>
> A total absurdity.


You should stick to Python 3.1 and 3.2 then:

py> print(sys.version)
3.1.3 (r313:86834, Nov 28 2010, 11:28:10)
[GCC 4.4.5]
py> sys.getsizeof('\U0001d11e')
36
py> sys.getsizeof('a')
36


Now all your strings will be just as heavy, every single variable name 
and attribute name will use four times as much memory. Happy now?


> How does this come about? Very simple: once you split Unicode
> in subsets, not only do you have to handle these subsets, you have to
> create "markers" to differentiate them. Not only do you produce "markers",
> you have to handle the mess generated by these "markers". Hiding these
> markers in the overhead of the class does not mean that they should not
> be counted as part of the coding scheme. BTW, since when does a serious
> coding scheme need an external marker?

Since always.

How do you think that (say) a C compiler can tell the difference between 
the long 1199876496 and the float 67923.125? They both have exactly the 
same four bytes:

py> import struct
py> struct.pack('f', 67923.125)
b'\x90\xa9\x84G'
py> struct.pack('l', 1199876496)
b'\x90\xa9\x84G'


*Everything* in a computer is bytes. The only way to tell them apart is 
by external markers.



-- 
Steven


Re: hex dump w/ or w/out utf-8 chars

2013-07-11 Thread wxjmfauth
On Thursday, July 11, 2013 15:32:00 UTC+2, Chris Angelico wrote:
> On Thu, Jul 11, 2013 at 11:18 PM,   wrote:
>> Just to stick with this funny character ẞ, a ucs-2 char
>> in the Flexible String Representation nomenclature.
>>
>> It seems to me that, when one needs more than ten bytes
>> to encode it,
>>
>> >>> sys.getsizeof('a')
>> 26
>> >>> sys.getsizeof('ẞ')
>> 40
>>
>> this is far away from perfection.
>
> Better comparison is to see how much space is used by one copy of it,
> and how much by two copies:
>
> >>> sys.getsizeof('aa')-sys.getsizeof('a')
> 1
> >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> 2
>
> String objects have overhead. Big deal.
>
>> BTW, for a modern language, is not ucs2 considered
>> as obsolete since many, many years?
>
> Clearly. And similarly, the 16-bit integer has been completely
> obsoleted, as there is no reason anyone should ever bother to use it.
> Same with the float type - everyone uses double or better these days,
> right?
>
> http://www.postgresql.org/docs/current/static/datatype-numeric.html
> http://www.cplusplus.com/doc/tutorial/variables/
>
> Nope, nobody uses small integers any more, they're clearly completely
> obsolete.

Sure there is some overhead because a str is a class.
It still remains that a "ẞ" weighs 14 bytes more than
an "a".

In "aẞ", the ẞ weighs 6 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('aẞ')
42

and in "aẞẞ", the ẞ weighs 2 bytes:

>>> sys.getsizeof('aẞẞ')

And what to say about this "ucs4" char/string '\U0001d11e' which
weighs 18 bytes more than an "a".

>>> sys.getsizeof('\U0001d11e')
44

A total absurdity. How does this come about? Very simple: once you
split Unicode in subsets, not only do you have to handle these
subsets, you have to create "markers" to differentiate them.
Not only do you produce "markers", you have to handle the
mess generated by these "markers". Hiding these markers
in the overhead of the class does not mean that they should
not be counted as part of the coding scheme. BTW, since
when does a serious coding scheme need an external marker?

>>> sys.getsizeof('aa') - sys.getsizeof('a')
1

In short, if my algebra is still correct:

(overhead + marker + 2*'a') - (overhead + marker + 'a')
= (overhead + marker + 2*'a') - overhead - marker - 'a'
= overhead - overhead + marker - marker + 2*'a' - 'a'
= 0 + 0 + 'a'
= 1

The "marker" has magically disappeared.

jmf






Re: hex dump w/ or w/out utf-8 chars

2013-07-11 Thread Chris Angelico
On Thu, Jul 11, 2013 at 11:18 PM,   wrote:
> Just to stick with this funny character ẞ, a ucs-2 char
> in the Flexible String Representation nomenclature.
>
> It seems to me that, when one needs more than ten bytes
> to encode it,
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('ẞ')
> 40
>
> this is far away from perfection.

Better comparison is to see how much space is used by one copy of it,
and how much by two copies:

>>> sys.getsizeof('aa')-sys.getsizeof('a')
1
>>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
2

String objects have overhead. Big deal.

> BTW, for a modern language, is not ucs2 considered
> as obsolete since many, many years?

Clearly. And similarly, the 16-bit integer has been completely
obsoleted, as there is no reason anyone should ever bother to use it.
Same with the float type - everyone uses double or better these days,
right?

http://www.postgresql.org/docs/current/static/datatype-numeric.html
http://www.cplusplus.com/doc/tutorial/variables/

Nope, nobody uses small integers any more, they're clearly completely obsolete.

ChrisA


Re: hex dump w/ or w/out utf-8 chars

2013-07-11 Thread wxjmfauth
On Monday, July 8, 2013 19:52:17 UTC+2, Chris Angelico wrote:
> On Tue, Jul 9, 2013 at 3:31 AM,   wrote:
>> Unfortunately (as probably I told you before) I will never pass to
>> Python 3...  Guido should not always listen only to gurus like him...
>> I don't like Python as before...starting from OOP and ending with codecs
>> like utf-8. Regarding OOP, much appreciated especially by experts, he
>> could use python 2 for hiding the complexities of OOP (improving, as an
>> effect, object's code hiding) moving classes and objects to
>> imported methods, leaving in this way the programming style to the
>> well known old style: sequential programming and functions.
>> About utf-8... the same solution: keep utf-8 but for the non experts, add
>> methods to convert to solutions which use the range 128-255 of only one
>> byte (I do not give a damn about chinese and "similia"!...)
>> I know that is a lost battle (in Italian "una battaglia persa")!
>
> Well, there won't be a Python 2.8, so you really should consider
> moving at some point. Python 3.3 is already way better than 2.7 in
> many ways, 3.4 will improve on 3.3, and the future is pretty clear.
> But nobody's forcing you, and 2.7.x will continue to get
> bugfix/security releases for a while. (Personally, I'd be happy if
> everyone moved off the 2.3/2.4 releases. It's not too hard supporting
> 2.6+ or 2.7+.)
>
> The thing is, you're thinking about UTF-8, but you should be thinking
> about Unicode. I recommend you read these articles:
>
> http://www.joelonsoftware.com/articles/Unicode.html
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
>
> So long as you are thinking about different groups of characters as
> different, and wanting a solution that maps characters down into the
> <256 range, you will never be able to cleanly internationalize. With
> Python 3.3+, you can ignore the differences between ASCII, BMP, and
> SMP characters; they're all just "characters". Everything works
> perfectly with Unicode.

---

Just to stick with this funny character ẞ, a ucs-2 char
in the Flexible String Representation nomenclature.

It seems to me that, when one needs more than ten bytes
to encode it, 

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ẞ')
40

this is far away from perfection.

BTW, for a modern language, hasn't ucs2 been considered
obsolete for many, many years?

jmf





Re: hex dump w/ or w/out utf-8 chars

2013-07-10 Thread wxjmfauth
For those who are interested: the official proposal requesting
the encoding of the Latin capital letter Sharp S in
ISO/IEC 10646 - the proposal from DIN (the German Institute for
Standardization) - is available on the web, as a pdf with the rationale.
I do not remember where I got it, probably from a German
web site.

Fonts:
I have been observing the inclusion of this glyph for years. More
and more fonts are supporting it. Available in many fonts,
it is surprisingly not available in Cambria (at least the version
I'm using). STIX does not include it; it has been requested. Ditto
for Latin Modern, the default bundle of fonts for the Unicode
TeX engines.

Last but not least, Python.
Thanks to the Flexible String Representation, it is not
necessary to mention the disastrous, erratic behaviour of
Python when processing text containing it. It's more than
clear: a serious user wishing to process the content of
'DER GROẞE DUDEN' (a reference German dictionary) will be
better served by using something else.

The irony is that this Flexible String Representation has
been created by a German.

jmf



Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread Steven D'Aprano
On Tue, 09 Jul 2013 12:15:29 +0200, Chris “Kwpolska” Warrick wrote:

> On Tue, Jul 9, 2013 at 11:34 AM,   wrote:
>> Note the difference between SS and ẞ 'FRANZ-JOSEF-STRAUSS-STRAẞE'
> 
> This is a capital Eszett.  Which just happens not to exist in German.
> Germans do not use this character, it is not available on German
> keyboards, and the German spelling rules have you replace ß with SS.
> And, surprise surprise, STRASSE is the example the Council for German
> Orthography used ([0] page 29, §25 E3).
> 
> [0]: http://www.neue-rechtschreibung.de/regelwerk.pdf


Only half-right. Uppercase Eszett has been used in Germany going back at 
least to 1879, and appears to be gaining popularity. In 2010 the use of 
uppercase ß apparently became mandatory for geographical place names when 
written in uppercase in official documentation.

http://opentype.info/blog/2011/01/24/capital-sharp-s/

http://en.wikipedia.org/wiki/Capital_ẞ

Font support is still quite poor, but at least half a dozen Windows 7 
fonts provide it, and at least one Mac font.


-- 
Steven


Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread Chris “Kwpolska” Warrick
On Tue, Jul 9, 2013 at 11:34 AM,   wrote:
> Note the difference between SS and ẞ
> 'FRANZ-JOSEF-STRAUSS-STRAẞE'

This is a capital Eszett.  Which just happens not to exist in German.
Germans do not use this character, it is not available on German
keyboards, and the German spelling rules have you replace ß with SS.
And, surprise surprise, STRASSE is the example the Council for German
Orthography used ([0] page 29, §25 E3).

[0]: http://www.neue-rechtschreibung.de/regelwerk.pdf

--
Kwpolska  | GPG KEY: 5EAAEA16
stop html mail| always bottom-post
http://asciiribbon.org| http://caliburn.nl/topposting.html


Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread Dave Angel

On 07/09/2013 09:00 AM, Neil Cerutti wrote:

> Interestingly similar scheme. I wonder if 5-bit chars were a
> common compression scheme. The Z-machine spec was never
> officially published either. I believe a "task force" reverse
> engineered it sometime in the 90's.

Baudot was 5 bits.  It used shift-codes to get upper case and digits,
if I recall.

And ASCII was 7 bits so there could be one more bit for parity.

--
DaveA



Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread Skip Montanaro
> I wonder if 5-bit chars were a
> common compression scheme.

http://en.wikipedia.org/wiki/List_of_binary_codes

Baudot was pretty common, as I recall, though ASCII and EBCDIC ruled
by the time I started punching cards.

Skip


Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread Neil Cerutti
On 2013-07-09, Dave Angel  wrote:
>> One of the first Python projects I undertook was a program to
>> dump the ZSCII strings from Infocom game files. They are
>> mostly packed one character per 5 bits, with escapes to (I had
>> to recheck the Z-machine spec) latin-1. Oh, those clever
>> implementors: thwarting hexdumping cheaters and cramming their
>> games onto microcomputers with one blow.
>
> In 1973 I played with encoding some data that came over the
> public airwaves (I never learned the specific radio technology,
> probably used sidebands of FM stations). The data was encoded,
> with most characters taking 5 bits, and the decoded stream was
> like a ticker-tape.  With some hardware and the right software,
> you could track Wall Street in real time.  (Or maybe it had the
> usual 15 minute delay).
>
> Obviously, they didn't publish the spec any place. But some
> others had the beginnings of a decoder, and I expanded on that.
> We never did anything with it, it was just an interesting
> challenge.

Interestingly similar scheme. I wonder if 5-bit chars were a
common compression scheme. The Z-machine spec was never
officially published either. I believe a "task force" reverse
engineered it sometime in the 90's.

-- 
Neil Cerutti


Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread Dave Angel

On 07/09/2013 08:22 AM, Neil Cerutti wrote:

> On 2013-07-08, Dave Angel  wrote:
>> I appreciate you've been around a long time, and worked in a
>> lot of languages.  I've programmed professionally in at least
>> 35 languages since 1967.  But we've come a long way from the
>> 6bit characters I used in 1968.  At that time, we packed them
>> 10 characters to each word.
>
> One of the first Python projects I undertook was a program to dump
> the ZSCII strings from Infocom game files. They are mostly packed
> one character per 5 bits, with escapes to (I had to recheck the
> Z-machine spec) latin-1. Oh, those clever implementors: thwarting
> hexdumping cheaters and cramming their games onto microcomputers
> with one blow.



In 1973 I played with encoding some data that came over the public 
airwaves (I never learned the specific radio technology, probably used 
sidebands of FM stations). The data was encoded, with most characters 
taking 5 bits, and the decoded stream was like a ticker-tape.  With some 
hardware and the right software, you could track Wall Street in real 
time.  (Or maybe it had the usual 15 minute delay).


Obviously, they didn't publish the spec any place. But some others had 
the beginnings of a decoder, and I expanded on that.  We never did 
anything with it, it was just an interesting challenge.


--
DaveA



Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread Neil Cerutti
On 2013-07-08, Dave Angel  wrote:
> I appreciate you've been around a long time, and worked in a
> lot of languages.  I've programmed professionally in at least
> 35 languages since 1967.  But we've come a long way from the
> 6bit characters I used in 1968.  At that time, we packed them
> 10 characters to each word.

One of the first Python projects I undertook was a program to dump
the ZSCII strings from Infocom game files. They are mostly packed
one character per 5 bits, with escapes to (I had to recheck the
Z-machine spec) latin-1. Oh, those clever implementors: thwarting
hexdumping cheaters and cramming their games onto microcomputers
with one blow.
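(From the Z-machine spec, the packing is: each 16-bit word holds
three 5-bit Z-characters, and a set bit 15 flags the last word of
the string. Roughly:

def zchars(words):
    # Unpack 5-bit Z-characters from a sequence of 16-bit words;
    # the top bit marks the end of the string.
    for w in words:
        for shift in (10, 5, 0):
            yield (w >> shift) & 0x1F
        if w & 0x8000:
            return

The escapes to the wider character set ride on top of that 5-bit
stream.)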

-- 
Neil Cerutti


Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread wxjmfauth
On Tuesday, July 9, 2013 09:00:02 UTC+2, Steven D'Aprano wrote:
> On Mon, 08 Jul 2013 10:53:18 -0700, ferdy.blatsco wrote:
>> Not using python 3, for me (a programmer which was present at the
>> beginning of computer science, badly interacting with many languages
>> from assembler to Fortran and from c to Pascal and so on) it was a hard
>> job to arrange the abrupt transition from characters only equal to bytes
>
> Characters have *never* been equal to bytes. Not even Perl treats the
> character 'A' as equal to the byte 0x0A:
>
> if (0x0A eq 'A') {print "Equal\n";}
> else {print "Unequal\n";}
>
> will print Unequal, even if you replace "eq" with "==". Nor does Perl
> consider the character 'A' equal to 65.
>
> If you have learned to think of characters being equal to bytes, you have
> learned wrong.
>
>> to some special characters defined with 2, 3 bytes and even more. I
>> should have preferred another solution... but i'm not Guido!
>
> What's a special character?
>
> To an Italian, the characters J, K, W, X and Y are "special characters"
> which do not exist in the ordinary alphabet. To a German, they are not
> special, but S is special because you write SS as ß, but only in
> lowercase.
>
> To a mathematician, σ is just as ordinary as it would be to a Greek; but
> the mathematician probably won't recognise ς unless she actually is
> Greek, even though they are the same letter.
>
> To an American electrician, Ω is an ordinary character, but ω isn't.
>
> To anyone working with angles, or temperatures, the degree symbol ° is an
> ordinary character, but the radian symbol is not. (I can't even find it.)
>
> The English have forgotten that W used to be a ligature for VV, and
> consider it a single ordinary character. But the ligature Æ is considered
> an old-fashioned way of writing AE.
>
> But to Danes and Norwegians, Æ is an ordinary letter, as distinct from AE
> as TH is from Þ. (Which English used to have.) And so on...
>
> I don't know what a special character is, unless it is the ASCII NUL
> character, since that terminates C strings.




The concept of "special characters" does not exist.
However, the definition of a "character" is a problem
per se (character, glyph, grapheme, ...).

You are confusing Unicode, typography and linguistics.

There is no symbol for radian because mathematically
radian is a pure number, a unitless number. You can
however specify a = ... in radians (rad).

Note the difference between SS and ẞ:
'FRANZ-JOSEF-STRAUSS-STRAẞE'

jmf





Re: hex dump w/ or w/out utf-8 chars

2013-07-09 Thread Steven D'Aprano
On Mon, 08 Jul 2013 10:53:18 -0700, ferdy.blatsco wrote:

> Not using python 3, for me (a programmer which was present at the
> beginning of computer science, badly interacting with many languages
> from assembler to Fortran and from c to Pascal and so on) it was a hard 
> job to arrange the abrupt transition from characters only equal to bytes

Characters have *never* been equal to bytes. Not even Perl treats the 
character 'A' as equal to the byte 0x0A:

if (0x0A eq 'A') {print "Equal\n";}
else {print "Unequal\n";}

will print Unequal, even if you replace "eq" with "==". Nor does Perl 
consider the character 'A' equal to 65.

If you have learned to think of characters being equal to bytes, you have 
learned wrong.


> to some special characters defined with 2, 3 bytes and even more. I
> should have preferred another solution... but i'm not Guido!

What's a special character?

To an Italian, the characters J, K, W, X and Y are "special characters" 
which do not exist in the ordinary alphabet. To a German, they are not 
special, but S is special because you write SS as ß, but only in 
lowercase.

To a mathematician, σ is just as ordinary as it would be to a Greek; but 
the mathematician probably won't recognise ς unless she actually is 
Greek, even though they are the same letter.

To an American electrician, Ω is an ordinary character, but ω isn't.

To anyone working with angles, or temperatures, the degree symbol ° is an 
ordinary character, but the radian symbol is not. (I can't even find it.)

The English have forgotten that W used to be a ligature for VV, and 
consider it a single ordinary character. But the ligature Æ is considered 
an old-fashioned way of writing AE.

But to Danes and Norwegians, Æ is an ordinary letter, as distinct from AE 
as TH is from Þ. (Which English used to have.) And so on... 

I don't know what a special character is, unless it is the ASCII NUL 
character, since that terminates C strings.



-- 
Steven


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Steven D'Aprano
On Tue, 09 Jul 2013 07:49:45 +1000, Chris Angelico wrote:

> On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel  wrote:
>> But Unicode has nothing to do with Guido, and it has existed for about
>> 25 years (if I recall correctly).
> 
> Depends how you measure. According to [1], the work kinda began back
> then (25 years ago being 1988), but it wasn't till 1991/92 that the spec
> was published. Also, the full Unicode range with multiple planes came
> about in 1996, with Unicode 2.0, so that could also be considered the
> beginning of Unicode. But that still means it's nearly old enough to
> drink, so programmers ought to be aware of it.

Yes, yes, a thousand times yes. It's really not that hard to get the 
basics of Unicode.

"When I discovered that the popular web development tool PHP has almost 
complete ignorance of character encoding issues, blithely using 8 bits 
for characters, making it darn near impossible to develop good 
international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 
2003 and you don't know the basics of characters, character sets, 
encodings, and Unicode, and I catch you, I'm going to punish you by 
making you peel onions for 6 months in a submarine. I swear I will."

http://www.joelonsoftware.com/articles/Unicode.html

Also: http://nedbatchelder.com/text/unipain.html




To start with, if you're writing code for Python 2.x, and not using u'' 
for strings, then you're making a rod for your own back. Do yourself a 
favour and get into the habit of always using u'' strings in Python 2.
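Or put this at the top of each module, which flips the default for
every literal in the file:

# -*- coding: utf-8 -*-
# Python 2: make all string literals unicode by default.
from __future__ import unicode_literals

s = 'héllo'        # now a unicode object, not a byte string
assert isinstance(s, unicode)
b = b'raw bytes'   # explicit b'' still gives bytes when you need them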


I'll-start-taking-my-own-advice-next-week-I-promise-ly yrs,



-- 
Steven


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Steven D'Aprano
On Tue, 09 Jul 2013 00:32:00 +0100, MRAB wrote:

> On 08/07/2013 23:02, Joshua Landau wrote:
>> On 8 July 2013 22:38, MRAB  wrote:
>>> On 08/07/2013 21:56, Dave Angel wrote:
>>>> Characters do not have a width.
>>>
>>> [snip]
>>>
>>> It depends what you mean by "width"! :-)
>>>
>>> Try this (Python 3):
>>>
>>> >>> print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
>>> AA
>>
>> Serious question: How would one find the width of a character by that
>> definition?
>>
> >>> import unicodedata
> >>> unicodedata.east_asian_width("A")
> 'Na'
> >>> unicodedata.east_asian_width("\N{FULLWIDTH LATIN CAPITAL LETTER A}")
> 'F'
>
> The possible widths are:
>
>  N  = Neutral
>  A  = Ambiguous
>  H  = Halfwidth
>  W  = Wide
>  F  = Fullwidth
>  Na = Narrow
>
> All you then need to do is find out what those actually mean...

In some East-Asian encodings, there are code-points for Latin characters 
in two forms: "half-width" and "full-width". The half-width form took up 
a single fixed-width column; the full-width forms took up two fixed-width 
columns, so they would line up nicely in columns with Asian characters.

See also:

http://www.unicode.org/reports/tr11/

and search Wikipedia for "full-width" and "half-width".
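
If you want something concrete, here is a minimal sketch (Python 3) that 
maps those categories onto terminal columns. display_width is a made-up 
name, and counting only W/F (optionally A) as double-width is a 
simplification that ignores combining characters and control characters:

import unicodedata

def display_width(text, ambiguous_wide=False):
    # W and F characters take two fixed-width columns; A depends on context
    total = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        total += 2 if eaw in ('W', 'F') or (eaw == 'A' and ambiguous_wide) else 1
    return total

print(display_width("A\N{FULLWIDTH LATIN CAPITAL LETTER A}"))   # -> 3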


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread MRAB

On 08/07/2013 23:02, Joshua Landau wrote:

On 8 July 2013 22:38, MRAB  wrote:

On 08/07/2013 21:56, Dave Angel wrote:

Characters do not have a width.


[snip]

It depends what you mean by "width"! :-)

Try this (Python 3):


print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")

AA


Serious question: How would one find the width of a character by that
definition?


>>> import unicodedata
>>> unicodedata.east_asian_width("A")
'Na'
>>> unicodedata.east_asian_width("\N{FULLWIDTH LATIN CAPITAL LETTER A}")
'F'

The possible widths are:

N  = Neutral
A  = Ambiguous
H  = Halfwidth
W  = Wide
F  = Fullwidth
Na = Narrow

All you then need to do is find out what those actually mean...

--
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Chris Angelico
On Tue, Jul 9, 2013 at 8:45 AM, Dave Angel  wrote:
> On 07/08/2013 05:49 PM, Chris Angelico wrote:
>>
>> On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel  wrote:
>>>
>>> But Unicode has nothing to do with Guido, and it has existed for about 25
>>> years (if I recall correctly).
>>
>>
>> Depends how you measure. According to [1], the work kinda began back
>> then (25 years ago being 1988), but it wasn't till 1991/92 that the
>> spec was published. Also, the full Unicode range with multiple planes
>> came about in 1996, with Unicode 2.0, so that could also be considered
>> the beginning of Unicode. But that still means it's nearly old enough
>> to drink, so programmers ought to be aware of it.
>>
>
> Well, then I'm glad I stuck the qualifier on it.  I remember where I was
> working, and that company folded in 1992.  I was working on NT long before
> its official release in 1993, and it used Unicode, even if the spec was
> sliding along.  I'm sure I got unofficial versions of things through
> Microsoft, at the time.

No doubt! Of course, this list is good at dealing with the hard facts
and making sure the archives are accurate, but that doesn't change
your memory.

Anyway, your fundamental point isn't materially affected by whether
Unicode is 17 or 25 years old. It's been around plenty long enough by
now, we should use it. Same with IPv6, too...

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Dave Angel

On 07/08/2013 05:49 PM, Chris Angelico wrote:

On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel  wrote:

But Unicode has nothing to do with Guido, and it has existed for about 25
years (if I recall correctly).


Depends how you measure. According to [1], the work kinda began back
then (25 years ago being 1988), but it wasn't till 1991/92 that the
spec was published. Also, the full Unicode range with multiple planes
came about in 1996, with Unicode 2.0, so that could also be considered
the beginning of Unicode. But that still means it's nearly old enough
to drink, so programmers ought to be aware of it.



Well, then I'm glad I stuck the qualifier on it.  I remember where I was 
working, and that company folded in 1992.  I was working on NT long 
before its official release in 1993, and it used Unicode, even if the 
spec was sliding along.  I'm sure I got unofficial versions of things 
through Microsoft, at the time.




--
DaveA

--
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Joshua Landau
On 8 July 2013 22:38, MRAB  wrote:
> On 08/07/2013 21:56, Dave Angel wrote:
>> Characters do not have a width.
>
> [snip]
>
> It depends what you mean by "width"! :-)
>
> Try this (Python 3):
>
 print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
> AA

Serious question: How would one find the width of a character by that
definition?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Chris Angelico
On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel  wrote:
> But Unicode has nothing to do with Guido, and it has existed for about 25
> years (if I recall correctly).

Depends how you measure. According to [1], the work kinda began back
then (25 years ago being 1988), but it wasn't till 1991/92 that the
spec was published. Also, the full Unicode range with multiple planes
came about in 1996, with Unicode 2.0, so that could also be considered
the beginning of Unicode. But that still means it's nearly old enough
to drink, so programmers ought to be aware of it.

[1] http://en.wikipedia.org/wiki/Unicode#History

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread MRAB

On 08/07/2013 21:56, Dave Angel wrote:

On 07/08/2013 01:53 PM, ferdy.blat...@gmail.com wrote:

Hi Steven,

thank you for your reply... I really needed another python guru which
is also an English teacher! Sorry if English is not my mother tongue...
"uncorrect" instead of "incorrect" (I misapplied the "similarity
principle" like "unpleasant...>...uncorrect").

Apart from these trifles, you said:

All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".

Not using python 3, for me (a programmer which was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from c to Pascal and so on) it was an hard job to arrange the
abrupt transition from characters only equal to bytes to some special
characters defined with 2, 3 bytes and even more.


Characters do not have a width.

[snip]

It depends what you mean by "width"! :-)

Try this (Python 3):

>>> print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
AA

--
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Dave Angel

On 07/08/2013 01:53 PM, ferdy.blat...@gmail.com wrote:

Hi Steven,

thank you for your reply... I really needed another python guru which
is also an English teacher! Sorry if English is not my mother tongue...
"uncorrect" instead of "incorrect" (I misapplied the "similarity
principle" like "unpleasant...>...uncorrect").

Apart from these trifles, you said:

All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".

Not using python 3, for me (a programmer which was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from c to Pascal and so on) it was an hard job to arrange the
abrupt transition from characters only equal to bytes to some special
characters defined with 2, 3 bytes and even more.


Characters do not have a width.  They are Unicode code points, an 
abstraction.  It's only when you encode them in byte strings that a code 
point takes on any specific width.  And some encodings go to one-byte 
strings (and get errors for characters that don't match), some go to 
two-bytes each, some variable, etc.



I should have preferred another solution... but i'm not Guido!


But Unicode has nothing to do with Guido, and it has existed for about 
25 years (if I recall correctly).  It's only that Python 3 is finally 
embracing it, and making it the default type for characters, as it 
should be.  As far as I'm concerned, the only reason it shouldn't have 
been done long ago was that programs were trying to fit on 640k DOS 
machines.  Even before Unicode, there were multi-byte encodings around 
(eg. Microsoft's MBCS), and each was thoroughly incompatible with all 
the others.  And the problem with one-byte encodings is that if you need 
to use a Greek currency symbol in a document that's mostly Norwegian (or 
some such combination of characters), there might not be ANY valid way 
to do it within a single "character set."


Python 2 supports all the same Unicode features as 3;  it's just that it 
defaults to byte strings.  So it's HARDER to get it right.
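
A two-line illustration of the different defaults (Python 2, and this 
assumes a UTF-8 terminal or source file):

>>> len('è')     # byte string by default: two UTF-8 bytes
2
>>> len(u'è')    # unicode string: one character
1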


Except for special purpose programs like a file dumper, it's usually 
unnecessary for a Python 3 programmer to deal with individual bytes from 
a byte string.  Text files are a bunch of bytes, and somebody has to 
interpret them as characters.  If you let open() handle it, and if you 
give it the correct encoding, it just works.  Internally, all strings 
are Unicode, and you don't care where they came from, or what human 
language they may have characters from.  You can combine strings from 
multiple places, without much worry that they might interfere.
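
For instance, something like this (the filenames are made up, and I'm 
assuming the files are UTF-8):

# Python 3: decoding happens at the file boundary, not in your code
with open('notes.txt', encoding='utf-8') as f:
    text = f.read()        # str: Unicode code points from here on

with open('copy.txt', 'w', encoding='utf-8') as f:
    f.write(text)          # encoded back to bytes on the way out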



Windows NT/2000/XP/Vista/7 has used Unicode for its file system (NTFS) 
from the beginning (approx 1992), and has had Unicode versions of each 
of its API's for nearly as long.


I appreciate you've been around a long time, and worked in a lot of 
languages.  I've programmed professionally in at least 35 languages 
since 1967.  But we've come a long way from the 6bit characters I used 
in 1968.  At that time, we packed them 10 characters to each word.


--
DaveA

--
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Chris Angelico
On Tue, Jul 9, 2013 at 3:53 AM,   wrote:
>>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
> Not using python 3, for me (a programmer which was present at the beginning of
> computer science, badly interacting with many languages from assembler to
> Fortran and from c to Pascal and so on) it was an hard job to arrange the
> abrupt transition from characters only equal to bytes to some special
> characters defined with 2, 3 bytes and even more.

Even back then, bytes and characters were different. 'A' is a
character, 0x41 is a byte. And they correspond 1:1 if and only if you
know that your characters are represented in ASCII. Other encodings
(eg EBCDIC) mapped things differently. The only difference now is that
more people are becoming aware that there are more than 256 characters
in the world.
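
Python even ships cp500, an EBCDIC variant, so you can see it for 
yourself in 3.x:

>>> 'A'.encode('ascii')
b'A'
>>> hex(ord('A'))
'0x41'
>>> 'A'.encode('cp500')    # same character, different byte: 0xC1
b'\xc1'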

Like Magic 2014 and its treatment of Slivers, at some point you're
going to have to master the difference between bytes and characters,
or else be eternally hacking around stuff in your code, so now is as
good a time as any.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread ferdy . blatsco
Hi Steven,

thank you for your reply... I really needed another python guru which
is also an English teacher! Sorry if English is not my mother tongue...
"uncorrect" instead of "incorrect" (I misapplied the "similarity
principle" like "unpleasant...>...uncorrect").

Apart from these trifles, you said:
>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
Not using python 3, for me (a programmer which was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from c to Pascal and so on) it was an hard job to arrange the
abrupt transition from characters only equal to bytes to some special
characters defined with 2, 3 bytes and even more.
I should have preferred another solution... but i'm not Guido!

I said:
> in the first version the utf-8 conversion to hex was shown horizontally
And you replied:
>> Oh! We're supposed to read the output *downwards*! 
You are correct, but I was only referring to "special characters"...
My main concern was compactness of output and besides that every group of
bytes used for defining "special characters" is well represented with high
nibble in the range outside ascii 0-127.

Your following observations are connected more or less to the above point
and sorry if the interpretation of output... sucks!
I think that, for the interested user, all the question is of minor
importance.

Only another point is relevant for me:
>> The loop variable just gets reset once it reaches the top of the loop
>> again.
Apart your kind observation (... "hideously ugly to read") referring to
my code snippet incrementing the loop variable... you are correct.
I will never make the same mistake!

Bye, Blatt.



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread Chris Angelico
On Tue, Jul 9, 2013 at 3:31 AM,   wrote:
> Unfortunately (as probably I told you before) I will never pass to
> Python 3...  Guido should not always listen only to gurus like him...
> I don't like Python as before...starting from OOP and ending with codecs
> like utf-8. Regarding OOP, much appreciated expecially by experts, he
> could use python 2 for hiding the complexities of OOP (improving, as an
> effect, object's code hiding) moving classes and objects to
> imported methods, leaving in this way the programming style to the
> well known old style: sequential programming and functions.
> About utf-8... the same solution: keep utf-8 but for the non experts, add
> methods to convert to solutions which use the range 128-255 of only one
> byte (I do not give a damn about chinese and "similia"!...)
> I know that is a lost battle (in italian "una battaglia persa")!

Well, there won't be a Python 2.8, so you really should consider
moving at some point. Python 3.3 is already way better than 2.7 in
many ways, 3.4 will improve on 3.3, and the future is pretty clear.
But nobody's forcing you, and 2.7.x will continue to get
bugfix/security releases for a while. (Personally, I'd be happy if
everyone moved off the 2.3/2.4 releases. It's not too hard supporting
2.6+ or 2.7+.)

The thing is, you're thinking about UTF-8, but you should be thinking
about Unicode. I recommend you read these articles:

http://www.joelonsoftware.com/articles/Unicode.html
http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

So long as you are thinking about different groups of characters as
different, and wanting a solution that maps characters down into the
<256 range, you will never be able to cleanly internationalize. With
Python 3.3+, you can ignore the differences between ASCII, BMP, and
SMP characters; they're all just "characters". Everything works
perfectly with Unicode.
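
For example, on 3.3+ all of these count as exactly one character each, 
whatever plane they live in:

>>> s = 'aé€\U0001F600'   # ASCII, Latin-1 range, BMP, SMP
>>> len(s)
4
>>> s[-1]
'😀'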

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-08 Thread ferdy . blatsco

Hi Chris,
glad to have received your contribution, but I was expecting much more
critics...
Starting from the "little nitpick" about the comment dispositon in my
script... you are correct... It is a bad habit on my part to place
variables subjected to change at the beginning of the script... and then
forget it...
About the mistake due to replace, you gave me a perfect explanation.
Unfortunately (as probably I told you before) I will never pass to
Python 3...  Guido should not always listen only to gurus like him...
I don't like Python as before...starting from OOP and ending with codecs
like utf-8. Regarding OOP, much appreciated expecially by experts, he
could use python 2 for hiding the complexities of OOP (improving, as an
effect, object's code hiding) moving classes and objects to
imported methods, leaving in this way the programming style to the
well known old style: sequential programming and functions.
About utf-8... the same solution: keep utf-8 but for the non experts, add
methods to convert to solutions which use the range 128-255 of only one
byte (I do not give a damn about chinese and "similia"!...)
I know that is a lost battle (in italian "una battaglia persa")!

Bye, Blatt
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hex dump w/ or w/out utf-8 chars

2013-07-07 Thread Steven D'Aprano
On Sun, 07 Jul 2013 17:22:26 -0700, blatt wrote:

> Hi all,
> but a particular hello to Chris Angelino which with their critics and
> suggestions pushed me to make a full revision of my application on hex
> dump in presence of utf-8 chars.

I don't understand what you are trying to say. All characters are UTF-8 
characters. "a" is a UTF-8 character. So is "ă".


> If you are not using python 3, the utf-8 codec can add further
> programming problems, 

On the contrary, I find that so long as you understand what you are doing 
it solves problems, not adds them. However, if you are confused about the 
difference between characters (text strings) and bytes, or if you are 
dealing with arbitrary binary data and trying to treat it as if it were 
UTF-8 encoded text, then you can have errors. Those errors are a good 
thing.


> especially if you are not a guru The script
> seems very long but I commented too much ... sorry. It is very useful
> (at least IMHO...)
> It works under Linux. but there is still a little problem which I didn't
> solve (at least programmatically...).
> 
> 
> # -*- coding: utf-8 -*-
> # px.py vers. 11 (pxb.py)   
> # python 2.6.6 # hex-dump w/ or w/out utf-8 chars
> # Using spaces as separators, this script shows 
> # (better than tabnanny) uncorrect  indentations.

The word you are looking for is "incorrect".


> # to save output > python pxb.py hex.txt > px9_out_hex.txt
> 
> nLenN=3  # n. of digits for lines
> 
> # version almost thoroughly rewritten on the ground of 
> # the critics and modifications suggested by Chris Angelico
> 
> # in the first version the utf-8 conversion to hex was shown
> horizontally:
> 
> # 005 # qwerty: non è unicode bensì ascii 
> # 2 7767773 666 ca 766 6667ca 676660
> # 3 175249a efe 38 5e93f45 25e33c 13399a

Oh! We're supposed to read the output *downwards*! That's not very 
intuitive. It took me a while to work that out. You should at least say 
so.


> # ... but I had to insert additional chars to keep the
> # synchronization between the literal and the hex part
> 
> # 005 # qwerty: non è. unicode bensì. ascii 
> # 2 7767773 666 ca 766 6667ca 676660
> # 3 175249a efe 38 5e93f45 25e33c 13399a

Well that sucks, because now sometimes you have to read downwards 
(character 'q' -> hex 71, reading downwards) and sometimes you read both 
downwards and across (character 'è' -> hex c3a8). Sometimes a dot means a 
dot and sometimes it means filler. How is the user supposed to know when 
to read down and when across?

 
> # in the second version I followed Chris suggestion:
> # "to show the hex utf-8 vertically"

You're already showing UTF-8 characters vertically, if they happen to be 
a one-byte character. Better to be consistent and always show characters 
vertical, regardless of whether they are one, two or four bytes.


> # 005 # qwerty: non è unicode bensì ascii
> # 2 7767773 666 c 766 6667c 676660
> # 3 175249a efe 3 5e93f45 25e33 13399a 
> #   a a
> #   8 c

Much better! Now at least you can trivially read down the column to see 
the bytes used for each character. As an alternative, you can space each 
character to show the bytes horizontally, displaying spaces and other 
invisible characters either as dots, backslash escapes, or Unicode 
control pictures, whichever you prefer. The example below uses dots for 
spaces and backslash escape for newline:

q  w  e  r  t  y  :  .  n  o  n  .  è .  u  n  i  
71 77 65 72 74 79 3a 20 6e 6f 6e 20 c3 a8 20 75 6e 69

c  o  d  e  .  b  e  n  s  ì .  a  s  c  i  i  \n
63 6f 64 65 20 62 65 6e 73 c3 ac 20 61 73 63 69 69 0a


There will always be some ambiguity between (e.g.) dot representing a 
dot, and it representing an invisible control character or space, but the 
reader can always tell them apart by reading the hex value, which you 
*always* read horizontally whether it is one byte, two or four. There's 
never any confusion whether you should read down or across.
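
Here's a rough sketch of that layout, if anyone wants to play with it 
(Python 3.5+ for bytes.hex(); it pads each character to the width of its 
hex so the two rows stay aligned, and it only special-cases space and 
newline):

def dump(line):
    chars, hexes = [], []
    for ch in line:
        hx = ch.encode('utf-8').hex()          # 2 hex digits per byte
        shown = {' ': '.', '\n': '\\n'}.get(ch, ch)
        width = max(len(hx), len(shown)) + 1   # pad both rows alike
        chars.append(shown.ljust(width))
        hexes.append(hx.ljust(width))
    print(''.join(chars))
    print(''.join(hexes))

dump("qwerty: non è unicode bensì ascii\n")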

Unfortunately, most fonts don't support the Unicode control pictures. But 
if you choose to use them, here they are, together with their Unicode 
name. You can use the form

'\N{...}'  # Python 3
u'\N{...}'  # Python 2

to get the characters, replacing ... with the name shown below:


␀ SYMBOL FOR NULL
␁ SYMBOL FOR START OF HEADING
␂ SYMBOL FOR START OF TEXT
␃ SYMBOL FOR END OF TEXT
␄ SYMBOL FOR END OF TRANSMISSION
␅ SYMBOL FOR ENQUIRY
␆ SYMBOL FOR ACKNOWLEDGE
␇ SYMBOL FOR BELL
␈ SYMBOL FOR BACKSPACE
␉ SYMBOL FOR HORIZONTAL TABULATION
␊ SYMBOL FOR LINE FEED
␋ SYMBOL FOR VERTICAL TABULATION
␌ SYMBOL FOR FORM FEED
␍ SYMBOL FOR CARRIAGE RETURN
␎ SYMBOL FOR SHIFT OUT
␏ SYMBOL FOR SHIFT IN
␐ SYMBOL FOR DATA LINK ESCAPE

Re: hex dump w/ or w/out utf-8 chars

2013-07-07 Thread Chris Angelico
On Mon, Jul 8, 2013 at 10:22 AM, blatt  wrote:
> Hi all,
> but a particular hello to Chris Angelino which with their critics and
> suggestions pushed me to make a full revision of my application on
> hex dump in presence of utf-8 chars.

Hiya! Glad to have been of assistance :)

> As I already told to Chris... critics are welcome!

No problem.

> # -*- coding: utf-8 -*-
> # px.py vers. 11 (pxb.py)   # python 2.6.6
> # hex-dump w/ or w/out utf-8 chars
> # Using spaces as separators, this script shows
> # (better than tabnanny)  uncorrect  indentations.
>
> # to save output > python pxb.py hex.txt > px9_out_hex.txt
>
> nLenN=3  # n. of digits for lines
>
> # chomp heaps and heaps of comments

Little nitpick, since you did invite criticism :) When I went to copy
and paste your code, I skipped all the comments and started at the
line of hashes... and then didn't have the nLenN definition. Posting
code to a forum like this is a huge invitation to try the code (it's
the very easiest way to know what it does), so I would recommend
having all your comments at the top, and all the code in a block
underneath. It'd be that bit easier for us to help you. Not a big
deal, though, I did figure out what was going on :)

> sLineHex  =lF[n].encode('hex').replace('20','  ')

Here's the problem. Your hex string ends with "220a", and the
replace() method doesn't concern itself with the divisions between
bytes. It finds the second 2 of 22 and the leading 0 of 0a and
replaces them.

I think the best solution may be to avoid the .encode('hex') part,
since it's not available in Python 3 anyway. Alternatively (if Py3
migration isn't a concern), you could do something like this:

sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)
sLineHex  =sLineHexND # No reason to redo the encoding
twentypos=0
while True:
    twentypos=sLineHex.find("20",twentypos)
    if twentypos==-1: break # We've reached the end of the string
    if not twentypos%2: # It's at an even-numbered position, replace it
        sLineHex=sLineHex[:twentypos]+'  '+sLineHex[twentypos+2:]
    twentypos+=1
# then continue on as before
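
Or, simpler: sidestep find/replace entirely by rebuilding the hex one 
byte at a time, so a "20" can never straddle two bytes (a sketch, Python 
2 to match your script):

# a byte is either a space or it isn't; no cross-byte matching possible
sLineHex = ''.join('  ' if c == ' ' else c.encode('hex') for c in lF[n])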

> sLineHexH =sLineHex[::2]
> sLineHexL =sLineHex[1::2]
> [ code continues ]

Hope that helps!

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


hex dump w/ or w/out utf-8 chars

2013-07-07 Thread blatt
Hi all,
but a particular hello to Chris Angelino which with their critics and
suggestions pushed me to make a full revision of my application on
hex dump in presence of utf-8 chars.
If you are not using python 3, the utf-8 codec can add further programming
problems, especially if you are not a guru
The script seems very long but I commented too much ... sorry.
It is very useful (at least IMHO...)
It works under Linux. but there is still a little problem which I didn't
solve (at least programmatically...).


# -*- coding: utf-8 -*-
# px.py vers. 11 (pxb.py)   # python 2.6.6
# hex-dump w/ or w/out utf-8 chars
# Using spaces as separators, this script shows
# (better than tabnanny)  uncorrect  indentations.

# to save output > python pxb.py hex.txt > px9_out_hex.txt

nLenN=3  # n. of digits for lines

# version almost thoroughly rewritten on the ground of
# the critics and modifications suggested by Chris Angelico

# in the first version the utf-8 conversion to hex was shown horizontally:

# 005 # qwerty: non è unicode bensì ascii
# 2 7767773 666 ca 766 6667ca 676660
# 3 175249a efe 38 5e93f45 25e33c 13399a

# ... but I had to insert additional chars to keep the
# synchronization between the literal and the hex part

# 005 # qwerty: non è. unicode bensì. ascii
# 2 7767773 666 ca 766 6667ca 676660
# 3 175249a efe 38 5e93f45 25e33c 13399a

# in the second version I followed Chris suggestion:
# "to show the hex utf-8 vertically"

# 005 # qwerty: non è unicode bensì ascii
# 2 7767773 666 c 766 6667c 676660
# 3 175249a efe 3 5e93f45 25e33 13399a
#   a a
#   8 c

# between the two solutions, I selected the first one + synchronization,
# which seems more compact and easier to program (... I'm lazy...)

# various run options:
# std  : python px.py file
# bash cat : cat  file | python px.py (alias hex)
# bash echo: echo line | python px.py""

# works on any n. of bytes for utf-8

# For the user: it is helpful to have in a separate file
# all special characters of interest, together with their names.

# error:

# echo '345"789"'|hex> 345"789"  345"789"
#  33323332  instead of  333233320
#  3452789 a""   34527892a

# ... correction: avoiding "\n at end of test-line
# echo "345'789'"|hex   >  345'789'
#  333233320
#  34577897a

# same error in every run option

# If someone can solve this bug...

###


import fileinput
import sys, commands

lF=[]   # input file as list
for line in fileinput.input():  # handles all the details of args-or-stdin
    lF.append(line)
sSpacesXLN = ' ' * (nLenN+1)


for n in xrange(len(lF)):
    sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)
    sLineHex  =lF[n].encode('hex').replace('20','  ')
    sLineHexH =sLineHex[::2]
    sLineHexL =sLineHex[1::2]

    sSynchro=''
    # NB: the k+=n lines below are no-ops (xrange() resets k each pass);
    # the cases handled here still come out right because utf-8
    # continuation bytes (high nibble 8-b) match none of the branches
    for k in xrange(0,len(sLineHexND),2):
        if sLineHexND[k]<'8':
            sSynchro+= sLineHexND[k]+sLineHexND[k+1]
            k+=1
        elif sLineHexND[k]=='c':
            sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
            k+=3
        elif sLineHexND[k]=='e':
            sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
                      sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
            k+=5

    # text output (synchronized)
    print str(n+1).zfill(nLenN)+' '+sSynchro.decode('hex'),
    print sSpacesXLN + sLineHexH
    print sSpacesXLN + sLineHexL+ '\n'


If there are problems of understanding, probably due to fonts, the best
thing is import it in an editor with "mono" fonts...

As I already told to Chris... critics are welcome!

Bye, Blatt.

-- 
http://mail.python.org/mailman/listinfo/python-list