Re: A few questiosn about encoding

2013-06-25 Thread wxjmfauth
On Sunday, 23 June 2013 18:30:40 UTC+2, Steven D'Aprano wrote:
 On Sun, 23 Jun 2013 08:51:41 -0700, wxjmfauth wrote:

  utf-8: how many bytes to hold an a in memory? one byte.

  flexible string representation: how many bytes to hold an a in memory?
  One byte? No, two. (Funny, it consumes more memory to hold an ascii char
  than ascii itself)

 Incorrect. Python strings have overhead because they are objects, so
 let's see the difference adding a single character makes:

 # Python 3.3, with the hated flexible string representation:
 py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
 1

 # Python 3.2:
 py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
 4

 How about a French é character? Of course, ASCII cannot store it *at
 all*, but let's see what Python can do:

 # The hated Python 3.3 again:
 py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
 1

 # And Python 3.2:
 py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
 4

  utf-8: In a series of bytes implementing the encoded code points
  supposed to hold a string, picking a byte and finding to which encoded
  code point it belongs is no problem.

 Incorrect. UTF-8 is unsuitable for random access, since it has
 variable-width characters, anything from 1 to 4 bytes. So you cannot just
 jump directly to character 1000 in a block of text, you have to inspect
 each byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.

  flexible string representation: In a series of bytes implementing the
  encoded code points supposed to hold a string, picking a byte and
  finding to which encoded code point it belongs is ... impossible !

 Incorrect. It is absolutely trivial. Each string is marked as either
 1-byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one
 character. If it is a 2-byte string, then it is just like Python 3.2
 narrow build, and each two bytes is a character. If it is a 4-byte
 string, then it is just like Python 3.2 wide build, and each four bytes
 is a character. Within a single string, the number of bytes per character
 is fixed, and random access is easy and fast.

 -- 
 Steven

:-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-23 Thread wxjmfauth
On Thursday, 20 June 2013 19:17:12 UTC+2, MRAB wrote:
 On 20/06/2013 17:37, Chris Angelico wrote:
  On Fri, Jun 21, 2013 at 2:27 AM,  wxjmfa...@gmail.com wrote:
   And all these coding schemes have something in common,
   they work all with a unique set of code points, more
   precisely a unique set of encoded code points (not
   the set of implemented code points (byte)).

   Just what the flexible string representation is not
   doing, it artificially divides unicode in subsets and tries
   to handle each subset differently.

  UTF-16 divides Unicode into two subsets: BMP characters (encoded using
  one 16-bit unit) and astral characters (encoded using two 16-bit units
  in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
  builds are guilty of exactly the same crime as the hated 3.3.

 UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
 bytes, and those who previously used ASCII still need only 1 byte per
 codepoint!

Sorry, but no, it does not work that way. That confuses
the set of encoded code points with its implementation,
the code units.

utf-8: how many bytes to hold an a in memory?
one byte.

flexible string representation: how many bytes to
hold an a in memory? One byte? No, two.
(Funny, it consumes more memory to hold an ascii char
than ascii itself)


utf-8: In a series of bytes implementing the encoded code
points supposed to hold a string, picking a byte and
finding to which encoded code point it belongs is no problem.

flexible string representation: In a series of bytes
implementing the encoded code points supposed to hold a
string, picking a byte and finding to which encoded code
point it belongs is ... impossible !

This is one of the causes of the poor behaviour of this flexible
string representation.

These are the basics of any coding scheme, unicode included.

jmf


Re: A few questiosn about encoding

2013-06-23 Thread Steven D'Aprano
On Sun, 23 Jun 2013 08:51:41 -0700, wxjmfauth wrote:

 utf-8: how many bytes to hold an a in memory? one byte.
 
 flexible string representation: how many bytes to hold an a in memory?
 One byte? No, two. (Funny, it consumes more memory to hold an ascii char
 than ascii itself)

Incorrect. Python strings have overhead because they are objects, so 
let's see the difference adding a single character makes:

# Python 3.3, with the hated flexible string representation:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
1

# Python 3.2:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
4


How about a French é character? Of course, ASCII cannot store it *at 
all*, but let's see what Python can do:


# The hated Python 3.3 again:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
1


# And Python 3.2:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
4
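
A minimal sketch of the same measurement for several characters, assuming
CPython 3.3 or later (sys.getsizeof reports implementation-specific sizes,
and the helper name bytes_per_extra_char is made up for illustration). It
subtracts the sizes of two strings that differ by one character, so the
fixed object overhead cancels out:

```python
import sys


def bytes_per_extra_char(ch, n=100):
    """Return how many bytes the n-th copy of ch adds to a string of ch."""
    return sys.getsizeof(ch * n) - sys.getsizeof(ch * (n - 1))


# One ASCII, one Latin-1, one other BMP and one astral character.
# On CPython 3.3+ these are expected to cost 1, 1, 2 and 4 bytes each.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_extra_char(ch))
```

The cost depends on the widest character in the string, not on the
character being added, which is why each test string here contains only
one kind of character.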



 utf-8: In a series of bytes implementing the encoded code points
 supposed to hold a string, picking a byte and finding to which encoded
 code point it belongs is no problem.

Incorrect. UTF-8 is unsuitable for random access, since it has variable-
width characters, anything from 1 to 4 bytes. So you cannot just jump 
directly to character 1000 in a block of text, you have to inspect each 
byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.
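
To make the cost concrete, here is a rough sketch of what indexing by
character looks like on raw UTF-8 bytes. The function name
utf8_offset_of_char is hypothetical, and the input is assumed to be valid
UTF-8. Finding character number N means walking over every byte before it:

```python
def utf8_offset_of_char(data, index):
    """Return the byte offset at which character number `index` starts."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new character.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


text = "a\u00e9\u20ac\U0001F600z"   # characters of 1, 2, 3, 4 and 1 bytes
data = text.encode("utf-8")
print([utf8_offset_of_char(data, i) for i in range(len(text))])
```

The loop is linear in the length of the data, which is the point being
made above: there is no way to compute the offset without looking at the
bytes that come first.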


 flexible string representation: In a series of bytes implementing the
 encoded code points supposed to hold a string, picking a byte and
 finding to which encoded code point it belongs is ... impossible !

Incorrect. It is absolutely trivial. Each string is marked as either 1-
byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one 
character. If it is a 2-byte string, then it is just like Python 3.2 
narrow build, and each two bytes is a character. If it is a 4-byte 
string, then it is just like Python 3.2 wide build, and each four bytes 
is a character. Within a single string, the number of bytes per character 
is fixed, and random access is easy and fast.
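
As an illustration only (this is not CPython's actual C code, and the names
char_width and byte_offset are invented here), the rule described above can
be sketched like this: the widest character in the string decides the width
of every character, and after that an index is a plain multiplication.

```python
def char_width(s):
    """Bytes per character a flexible-representation string would use."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1      # everything fits in Latin-1
    if highest < 0x10000:
        return 2      # everything fits in the BMP
    return 4          # at least one astral character


def byte_offset(s, index):
    """Offset of character `index` inside the string's character data."""
    return index * char_width(s)


for s in ("abc", "caf\u00e9", "a\u20ac", "a\U0001F600"):
    print(ascii(s), char_width(s), byte_offset(s, 1))
```

The flexible string representation is specified in PEP 393, which is the
place to look for the real layout.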



-- 
Steven


Re: A few questiosn about encoding

2013-06-20 Thread Steven D'Aprano
On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:

 On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
  
 Gah! That's twice I've screwed that up. Sorry about that!
 
 Yeah, and your difficulty explaining the Unicode implementation reminds
 me of a passage from the Python zen:
 
  If the implementation is hard to explain, it's a bad idea.

The *implementation* is easy to explain. It's the names of the encodings 
which I get tangled up in.


ASCII: Supports exactly 127 code points, each of which takes up exactly 7 
bits. Each code point represents a character.

Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and 
about a gazillion other legacy charsets, all of which are mutually 
incompatible: supports anything from 127 to 65535 different code points, 
usually under 256.

UCS-2: Supports exactly 65535 code points, each of which takes up exactly 
two bytes. That's fewer than required, so it is obsoleted by:

UTF-16: Supports all 1114111 code points in the Unicode charset, using a 
variable-width system where the most popular characters use exactly two-
bytes and the remaining ones use a pair of characters.

UCS-4: Supports exactly 4294967295 code points, each of which takes up 
exactly four bytes. That is more than needed for the Unicode charset, so 
this is obsoleted by:

UTF-32: Supports all 1114111 code points, using exactly four bytes each. 
Code points outside of the range 0 through 1114111 inclusive are an error.

UTF-8: Supports all 1114111 code points, using a variable-width system 
where popular ASCII characters require 1 byte, and others use 2, 3 or 4 
bytes as needed.


Ignoring the legacy charsets, only UTF-16 is a terribly complicated 
implementation, due to the surrogate pairs. But even that is not too bad. 
The real complication comes from the interactions between systems which 
use different encodings, and that's nothing to do with Unicode.
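
For anyone who wants to see the three Unicode encodings side by side, a
small sketch using only str.encode (the helper name encoded_sizes is made
up; the "-le" codec variants are used so that no byte order mark is counted):

```python
def encoded_sizes(ch):
    """Map each encoding name to the number of bytes it needs for ch."""
    names = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(ch.encode(name)) for name in names}


# ASCII, Latin-1, other BMP, astral
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-32 is always 4 bytes, UTF-16 is 2 or 4, and UTF-8 ranges from 1 to 4.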


-- 
Steven


Re: A few questiosn about encoding

2013-06-20 Thread MRAB

On 20/06/2013 07:26, Steven D'Aprano wrote:
 On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:

  On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:

   Gah! That's twice I've screwed that up. Sorry about that!

  Yeah, and your difficulty explaining the Unicode implementation reminds
  me of a passage from the Python zen:

   If the implementation is hard to explain, it's a bad idea.

 The *implementation* is easy to explain. It's the names of the encodings
 which I get tangled up in.

You're off by one below!

 ASCII: Supports exactly 127 code points, each of which takes up exactly 7
 bits. Each code point represents a character.

128 codepoints.

 Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
 about a gazillion other legacy charsets, all of which are mutually
 incompatible: supports anything from 127 to 65535 different code points,
 usually under 256.

128 to 65536 codepoints.

 UCS-2: Supports exactly 65535 code points, each of which takes up exactly
 two bytes. That's fewer than required, so it is obsoleted by:

65536 codepoints.

etc.

 UTF-16: Supports all 1114111 code points in the Unicode charset, using a
 variable-width system where the most popular characters use exactly two-
 bytes and the remaining ones use a pair of characters.

 UCS-4: Supports exactly 4294967295 code points, each of which takes up
 exactly four bytes. That is more than needed for the Unicode charset, so
 this is obsoleted by:

 UTF-32: Supports all 1114111 code points, using exactly four bytes each.
 Code points outside of the range 0 through 1114111 inclusive are an error.

 UTF-8: Supports all 1114111 code points, using a variable-width system
 where popular ASCII characters require 1 byte, and others use 2, 3 or 4
 bytes as needed.

 Ignoring the legacy charsets, only UTF-16 is a terribly complicated
 implementation, due to the surrogate pairs. But even that is not too bad.
 The real complication comes from the interactions between systems which
 use different encodings, and that's nothing to do with Unicode.
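
The corrected counts can be checked with a little arithmetic. A small
sketch (the variable names are made up for illustration, and sys.maxunicode
is 0x10FFFF only on Python 3.3 or later; older narrow builds report 0xFFFF):

```python
import sys

# A code space that starts at 0 and ends at N holds N + 1 code points.
ascii_count = 2 ** 7                  # 7-bit codes: 0 through 127
ucs2_count = 2 ** 16                  # 16-bit codes: 0 through 65535
unicode_count = sys.maxunicode + 1    # 0 through 0x10FFFF

print(ascii_count, ucs2_count, unicode_count)
```

The same off-by-one applies to the Unicode total: the highest code point is
1114111 (0x10FFFF), so there are 1114112 code points in all.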






Re: A few questiosn about encoding

2013-06-20 Thread Rick Johnson
On Thursday, June 20, 2013 1:26:17 AM UTC-5, Steven D'Aprano wrote:
 The *implementation* is easy to explain. It's the names of
 the encodings which I get tangled up in.

Well, ignoring the fact that your last explanation is
still buggy, you have not actually described an
implementation, no, you've merely generalized (and quite
vaguely, I might add) the technical specification of a few
encodings. Let's ask Wikipedia to enlighten us on the
subject of implementation:


# Define: Implementation                                   #

# In computer science, an implementation is a realization  #
# of a technical specification or algorithm as a program,  #
# software component, or other computer system through     #
# computer programming and deployment. Many                #
# implementations may exist for a given specification or   #
# standard. For example, web browsers contain              #
# implementations of World Wide Web Consortium-recommended #
# specifications, and software development tools contain   #
# implementations of programming languages.                #


Do you think someone could reliably implement the alphabet of a new
language in Unicode by using the general outline you
provided? -- again, ignoring your continual fumbling when
explaining that simple generalization :-)

Your generalization is analogous to explaining web browsers
as: software that allows a user to view web pages in the
range www.* Do you think someone could implement a web
browser from such limited specification? (if that was all
they knew?).


 Since we're on the subject of Unicode:

One of the most humorous aspects of Unicode is that it has
encodings for Braille characters. Hmm, this presents a
conundrum of sorts. RIDDLE ME THIS?!

Since Braille is a type of reading for the blind by
utilizing the sense of touch (therefore DEMANDING 3
dimensions) and glyphs derived from Unicode are
restrictively two dimensional, because let's face it people,
Unicode exists in your computer, and computer screens are
two dimensional... but you already knew that -- i think?,
then what is the purpose of a Unicode Braille character set?

That should haunt your nightmares for some time.


Re: A few questiosn about encoding

2013-06-20 Thread Andrew Berg
On 2013.06.20 08:40, Rick Johnson wrote:
 One of the most humorous aspects of Unicode is that it has
 encodings for Braille characters. Hmm, this presents a
 conundrum of sorts. RIDDLE ME THIS?!
 
 Since Braille is a type of reading for the blind by
 utilizing the sense of touch (therefore DEMANDING 3
 dimensions) and glyphs derived from Unicode are
 restrictively two dimensional, because let's face it people,
 Unicode exists in your computer, and computer screens are
 two dimensional... but you already knew that -- i think?,
 then what is the purpose of a Unicode Braille character set?
Two dimensional characters can be made into 3 dimensional shapes.
Building numbers are a good example of this.
We already have one Unicode troll; do we really need you too?

-- 
CPython 3.3.2 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: A few questiosn about encoding

2013-06-20 Thread Rick Johnson
On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
 On 2013.06.20 08:40, Rick Johnson wrote:

  then what is the purpose of a Unicode Braille character set?
 Two dimensional characters can be made into 3 dimensional shapes.

Yes in the real world. But what about on your computer
screen? How do you plan on creating tactile representations of
braille glyphs on my monitor? Hey, if you can already do this, 
please share, as it sure would make internet porn more 
interesting!

 Building numbers are a good example of this.

Either the matrix is reality or you must live inside your
computer as a virtual being. Is your name Tron? Are you a pawn
of Master Control? He's such a tyrant!


Re: A few questiosn about encoding

2013-06-20 Thread Chris Angelico
On Fri, Jun 21, 2013 at 1:12 AM, Rick Johnson
rantingrickjohn...@gmail.com wrote:
 On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
 On 2013.06.20 08:40, Rick Johnson wrote:

  then what is the purpose of a Unicode Braille character set?
 Two dimensional characters can be made into 3 dimensional shapes.

 Yes in the real world. But what about on your computer
 screen? How do you plan on creating tactile representations of
 braille glyphs on my monitor? Hey, if you can already do this,
 please share, as it sure would make internet porn more
 interesting!

I had a device for creating embossed text. It predated Unicode by a
couple of years at least (not sure how many, because I was fairly
young at the time). It was made by a company called Epson, it plugged
into the computer via a 25-pin plug, and when it was properly
functioning, it had a ribbon of ink that it would bash through to
darken the underside of the embossed text. But sometimes that ribbon
slipped out of position, and we had beautifully-hammered ASCII text,
unsullied by ink. And since the device did graphics too, it could be
used for the entire Unicode character set if you wanted.

Not sure that it would improve your porn any, but I've no doubt you
could try if you wanted.

ChrisA


Re: A few questiosn about encoding

2013-06-20 Thread Chris Angelico
On Thu, Jun 20, 2013 at 11:40 PM, Rick Johnson
rantingrickjohn...@gmail.com wrote:
 Your generalization is analogous to explaining web browsers
 as: software that allows a user to view web pages in the
 range www.* Do you think someone could implement a web
 browser from such limited specification? (if that was all
 they knew?).

Wow. That spec isn't limited, it's downright faulty. Or do you really
think that (a) there is such a thing as the range www.*, and that
(b) that range has anything to do with web browsers?

ChrisA


Re: A few questiosn about encoding

2013-06-20 Thread wxjmfauth
On Thursday, 20 June 2013 13:43:28 UTC+2, MRAB wrote:
 On 20/06/2013 07:26, Steven D'Aprano wrote:
  On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:

   On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:

    Gah! That's twice I've screwed that up. Sorry about that!

   Yeah, and your difficulty explaining the Unicode implementation reminds
   me of a passage from the Python zen:

    If the implementation is hard to explain, it's a bad idea.

  The *implementation* is easy to explain. It's the names of the encodings
  which I get tangled up in.

 You're off by one below!

  ASCII: Supports exactly 127 code points, each of which takes up exactly 7
  bits. Each code point represents a character.

 128 codepoints.

  Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
  about a gazillion other legacy charsets, all of which are mutually
  incompatible: supports anything from 127 to 65535 different code points,
  usually under 256.

 128 to 65536 codepoints.

  UCS-2: Supports exactly 65535 code points, each of which takes up exactly
  two bytes. That's fewer than required, so it is obsoleted by:

 65536 codepoints.

 etc.

  UTF-16: Supports all 1114111 code points in the Unicode charset, using a
  variable-width system where the most popular characters use exactly two-
  bytes and the remaining ones use a pair of characters.

  UCS-4: Supports exactly 4294967295 code points, each of which takes up
  exactly four bytes. That is more than needed for the Unicode charset, so
  this is obsoleted by:

  UTF-32: Supports all 1114111 code points, using exactly four bytes each.
  Code points outside of the range 0 through 1114111 inclusive are an error.

  UTF-8: Supports all 1114111 code points, using a variable-width system
  where popular ASCII characters require 1 byte, and others use 2, 3 or 4
  bytes as needed.

  Ignoring the legacy charsets, only UTF-16 is a terribly complicated
  implementation, due to the surrogate pairs. But even that is not too bad.
  The real complication comes from the interactions between systems which
  use different encodings, and that's nothing to do with Unicode.

And all these coding schemes have something in common,
they work all with a unique set of code points, more
precisely a unique set of encoded code points (not
the set of implemented code points (byte)).

Just what the flexible string representation is not
doing, it artificially divides unicode in subsets and tries
to handle each subset differently.

On the other hand, it is precisely because it is impossible
to work properly with multiple sets of encoded code points
that all these coding schemes exist today. There is simply
no other way.

Even exotic schemes like CID-fonts used in pdf
are based on that scheme.

jmf



Re: A few questiosn about encoding

2013-06-20 Thread Chris Angelico
On Fri, Jun 21, 2013 at 2:27 AM,  wxjmfa...@gmail.com wrote:
 And all these coding schemes have something in common,
 they work all with a unique set of code points, more
 precisely a unique set of encoded code points (not
 the set of implemented code points (byte)).

 Just what the flexible string representation is not
 doing, it artificially divides unicode in subsets and tries
 to handle each subset differently.



UTF-16 divides Unicode into two subsets: BMP characters (encoded using
one 16-bit unit) and astral characters (encoded using two 16-bit units
in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
builds are guilty of exactly the same crime as the hated 3.3.
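
For reference, the two-unit encoding of astral characters mentioned above
works roughly like this. It is a sketch of the standard UTF-16 surrogate
arithmetic; the function name to_surrogates is made up for illustration:

```python
def to_surrogates(code_point):
    """Split an astral code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
# Cross-check against Python's own UTF-16 encoder.
print("\U0001F600".encode("utf-16-be").hex())
```

Every surrogate falls in the range D800 through DFFF, which is the block
the "netblock" joke refers to: 2048 values sharing the same top 5 bits.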

ChrisA


Re: A few questiosn about encoding

2013-06-20 Thread Andreas Perstinger
Rick Johnson rantingrickjohn...@gmail.com wrote:

  Since we're on the subject of Unicode:

 One of the most humorous aspects of Unicode is that it has
 encodings for Braille characters. Hmm, this presents a
 conundrum of sorts. RIDDLE ME THIS?!

 Since Braille is a type of reading for the blind by
 utilizing the sense of touch (therefore DEMANDING 3
 dimensions) and glyphs derived from Unicode are
 restrictively two dimensional, because let's face it people,
 Unicode exists in your computer, and computer screens are
 two dimensional... but you already knew that -- i think?,
 then what is the purpose of a Unicode Braille character set?

 That should haunt your nightmares for some time.

From http://www.unicode.org/versions/Unicode6.2.0/ch15.pdf
The intent of encoding the 256 Braille patterns in the Unicode
Standard is to allow input and output devices to be implemented that
can interchange Braille data without having to go through a
context-dependent conversion from semantic values to patterns, or vice
versa. In this manner, final-form documents can be exchanged and
faithfully rendered.

http://files.pef-format.org/specifications/pef-2008-1/pef-specification.html#Unicode
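
To illustrate how those 256 patterns are laid out, here is a minimal
sketch. It relies on the Braille Patterns block starting at U+2800 with one
bit per dot; the helper name braille_pattern is invented for this example:

```python
import unicodedata

BRAILLE_BASE = 0x2800   # BRAILLE PATTERN BLANK


def braille_pattern(dots):
    """Return the Braille pattern character with the given dots (1 to 8)."""
    bits = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError("dot numbers run from 1 to 8")
        bits |= 1 << (dot - 1)      # dot n is bit n - 1
    return chr(BRAILLE_BASE + bits)


for dots in ([], [1], [1, 2], [1, 2, 3, 4, 5, 6, 7, 8]):
    ch = braille_pattern(dots)
    print("U+%04X" % ord(ch), unicodedata.name(ch))
```

Eight dots that are each either raised or flat give 2**8 = 256 patterns,
which matches the count in the quoted passage. What a given pattern means
(a letter, a digit, a contraction) is left to the Braille code in use.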

I wish you a pleasant sleep tonight.

Bye, Andreas


Re: A few questiosn about encoding

2013-06-20 Thread MRAB

On 20/06/2013 17:37, Chris Angelico wrote:
 On Fri, Jun 21, 2013 at 2:27 AM,  wxjmfa...@gmail.com wrote:
  And all these coding schemes have something in common,
  they work all with a unique set of code points, more
  precisely a unique set of encoded code points (not
  the set of implemented code points (byte)).

  Just what the flexible string representation is not
  doing, it artificially divides unicode in subsets and tries
  to handle each subset differently.

 UTF-16 divides Unicode into two subsets: BMP characters (encoded using
 one 16-bit unit) and astral characters (encoded using two 16-bit units
 in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
 builds are guilty of exactly the same crime as the hated 3.3.

UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
bytes, and those who previously used ASCII still need only 1 byte per
codepoint!
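
The four subsets can be written down directly. A short sketch (the function
name utf8_length is made up; the boundaries are those of the UTF-8
definition), cross-checked against Python's own encoder:

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for the given code point."""
    if code_point < 0x80:
        return 1      # ASCII
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3      # rest of the BMP
    return 4          # astral characters


# First and last code point of each subset.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print("U+%04X" % cp, utf8_length(cp), len(chr(cp).encode("utf-8")))
```

The surrogate range D800 through DFFF is left out of the check because
Python refuses to encode lone surrogates as UTF-8.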



Re: A few questiosn about encoding

2013-06-20 Thread Chris Angelico
On Fri, Jun 21, 2013 at 3:17 AM, MRAB pyt...@mrabarnett.plus.com wrote:
 On 20/06/2013 17:37, Chris Angelico wrote:

 On Fri, Jun 21, 2013 at 2:27 AM,  wxjmfa...@gmail.com wrote:

 And all these coding schemes have something in common,
 they work all with a unique set of code points, more
 precisely a unique set of encoded code points (not
 the set of implemented code points (byte)).

 Just what the flexible string representation is not
 doing, it artificially divides unicode in subsets and tries
 to handle each subset differently.



 UTF-16 divides Unicode into two subsets: BMP characters (encoded using
 one 16-bit unit) and astral characters (encoded using two 16-bit units
 in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
 builds are guilty of exactly the same crime as the hated 3.3.

 UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
 bytes, and those who previously used ASCII still need only 1 byte per
 codepoint!

Yes, but there's never (AFAIK) been a Python implementation that
represents strings in UTF-8; UTF-16 was one of two options for Python
2.2 through 3.2, and is the one that jmf always seems to be measuring
against.

ChrisA


Re: A few questiosn about encoding

2013-06-20 Thread Jussi Piitulainen
Rick Johnson writes:
 On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
  On 2013.06.20 08:40, Rick Johnson wrote:
 
 then what is the purpose of a Unicode Braille character set?
  Two dimensional characters can be made into 3 dimensional shapes.
 
 Yes in the real world. But what about on your computer screen? How
 do you plan on creating tactile representations of braille glyphs on
 my monitor? Hey, if you can already do this, please share, as it
 sure would make internet porn more interesting!

Search for braille display on the web. A wikipedia article also led me
to braille e-book. (Or search for braille porn, since you are so
inclined - the concept turns out to be already out there on the web.)


Re: A few questiosn about encoding

2013-06-20 Thread Mark Lawrence

On 20/06/2013 17:27, wxjmfa...@gmail.com wrote:

 On Thursday, 20 June 2013 13:43:28 UTC+2, MRAB wrote:
  On 20/06/2013 07:26, Steven D'Aprano wrote:
   On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:

    On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:

     Gah! That's twice I've screwed that up. Sorry about that!

    Yeah, and your difficulty explaining the Unicode implementation reminds
    me of a passage from the Python zen:

     If the implementation is hard to explain, it's a bad idea.

   The *implementation* is easy to explain. It's the names of the encodings
   which I get tangled up in.

  You're off by one below!

   ASCII: Supports exactly 127 code points, each of which takes up exactly 7
   bits. Each code point represents a character.

  128 codepoints.

   Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
   about a gazillion other legacy charsets, all of which are mutually
   incompatible: supports anything from 127 to 65535 different code points,
   usually under 256.

  128 to 65536 codepoints.

   UCS-2: Supports exactly 65535 code points, each of which takes up exactly
   two bytes. That's fewer than required, so it is obsoleted by:

  65536 codepoints.

  etc.

   UTF-16: Supports all 1114111 code points in the Unicode charset, using a
   variable-width system where the most popular characters use exactly two-
   bytes and the remaining ones use a pair of characters.

   UCS-4: Supports exactly 4294967295 code points, each of which takes up
   exactly four bytes. That is more than needed for the Unicode charset, so
   this is obsoleted by:

   UTF-32: Supports all 1114111 code points, using exactly four bytes each.
   Code points outside of the range 0 through 1114111 inclusive are an error.

   UTF-8: Supports all 1114111 code points, using a variable-width system
   where popular ASCII characters require 1 byte, and others use 2, 3 or 4
   bytes as needed.

   Ignoring the legacy charsets, only UTF-16 is a terribly complicated
   implementation, due to the surrogate pairs. But even that is not too bad.
   The real complication comes from the interactions between systems which
   use different encodings, and that's nothing to do with Unicode.

 And all these coding schemes have something in common,
 they work all with a unique set of code points, more
 precisely a unique set of encoded code points (not
 the set of implemented code points (byte)).

 Just what the flexible string representation is not
 doing, it artificially divides unicode in subsets and tries
 to handle each subset differently.

 On the other hand, it is precisely because it is impossible
 to work properly with multiple sets of encoded code points
 that all these coding schemes exist today. There is simply
 no other way.

 Even exotic schemes like CID-fonts used in pdf
 are based on that scheme.

 jmf



I entirely agree with the viewpoints of jmfauth, Nick the Greek, rr, 
Xah Lee and Ilias Lazaridis on the grounds that disagreeing and stating 
my beliefs ends up with the Python Mailing List police standing on my 
back doorstep.  Give me the NSA or GCHQ any day of the week :(


--
Steve is going for the pink ball - and for those of you who are 
watching in black and white, the pink is next to the green. Snooker 
commentator 'Whispering' Ted Lowe.


Mark Lawrence



Re: A few questiosn about encoding

2013-06-19 Thread Rick Johnson
On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
 
 Gah! That's twice I've screwed that up. 
 Sorry about that!

Yeah, and your difficulty explaining the Unicode implementation reminds me of a 
passage from the Python zen:

 If the implementation is hard to explain, it's a bad idea.


Re: A few questiosn about encoding

2013-06-17 Thread Antoon Pardon
On 15-06-13 02:28, Cameron Simpson wrote:
 On 14Jun2013 15:59, Nikos as SuperHost Support supp...@superhost.gr wrote:
 | So, a numeral = a string representation of a number. Is this correct?

 No, a numeral is an individual digit from the string representation of a 
 number.
 So: 65 requires two numerals: '6' and '5'.
Wrong context. A numeral as an individual digit is when you are talking about
individual characters in a font. In such a context the set of glyphs that
represent a digit are the numerals.

However in a context of programming, numerals in general refer to the set of
strings that represent a number.

-- 
Antoon.



Re: A few questiosn about encoding

2013-06-17 Thread Cameron Simpson
On 17Jun2013 08:49, Antoon Pardon antoon.par...@rece.vub.ac.be wrote:
| On 15-06-13 02:28, Cameron Simpson wrote:
|  On 14Jun2013 15:59, Nikos as SuperHost Support supp...@superhost.gr wrote:
|  | So, a numeral = a string representation of a number. Is this correct?
| 
|  No, a numeral is an individual digit from the string representation of a number.
|  So: 65 requires two numerals: '6' and '5'.
|
| Wrong context. A numeral as an individual digit is when you are talking about
| individual characters in a font. In such a context the set of glyphs that
| represent a digit are the numerals.
| 
| However in a context of programming, numerals in general refer to the set of
| strings that represent a number.

No, those are just numbers or numeric strings (if you're being
overt about them being strings at all). They're numeric strings
because they're composed of numerals. If you think otherwise your
vocabulary needs adjusting. A numeral is a single digit.
-- 
Cameron Simpson c...@zip.com.au

English is a living language, but simple illiteracy is no basis for
linguistic evolution.   - Dwight MacDonald


Re: A few questiosn about encoding

2013-06-17 Thread Antoon Pardon
On 17-06-13 09:08, Cameron Simpson wrote:
 On 17Jun2013 08:49, Antoon Pardon antoon.par...@rece.vub.ac.be wrote:
 | On 15-06-13 02:28, Cameron Simpson wrote:
 |  On 14Jun2013 15:59, Nikos as SuperHost Support supp...@superhost.gr wrote:
 |  | So, a numeral = a string representation of a number. Is this correct?
 | 
 |  No, a numeral is an individual digit from the string representation of a number.
 |  So: 65 requires two numerals: '6' and '5'.
 |
 | Wrong context. A numeral as an individual digit is when you are talking about
 | individual characters in a font. In such a context the set of glyphs that
 | represent a digit are the numerals.
 | 
 | However in a context of programming, numerals in general refer to the set of
 | strings that represent a number.

 No, those are just numbers or numeric strings (if you're being
 overt about them being strings at all). They're numeric strings
 because they're composed of numerals. If you think otherwise your
 vocabulary needs adjusting. A numeral is a single digit.

Just because you are unfamiliar with a context in which numeral means
a representation of a number doesn't imply my vocabulary needs adjusting.

-- 
Antoon Pardon



Re: A few questiosn about encoding

2013-06-16 Thread Chris “Kwpolska” Warrick
On Sat, Jun 15, 2013 at 10:35 PM, Benjamin Schollnick
benja...@schollnick.net wrote:
 Nick,

 The only thing that i didn't understood is this line.
 First please tell me what is a byte value

 \x1b is a sequence you find inside strings (and byte strings, the
 b'...' format).


 \x1b is a character(ESC) represented in hex format

 b'\x1b' is a byte object that represents what?


 chr(27).encode('utf-8')
 b'\x1b'

 b'\x1b'.decode('utf-8')
 '\x1b'

 After decoding it gives the char ESC in hex format
 Shouldn't it result in value 27 which is the ordinal of ESC ?


 I'm sorry are you not listening?

 1b is a HEXADECIMAL Number.  As a so-called programmer, did you seriously
 not consider that?

 Try this:

 1) Open a Web browser
 2) Go to Google.com
 3) Type in Hex 1B
 4) Click on the first link
 5) In the Hexadecimal column find 1B.

 Or open your favorite calculator, and convert Hexadecimal 1B to Decimal
 (Base 10).

 - Benjamin





Better: a programmer should know how to convert hexadecimal to decimal.

0x1B = (0x1 * 16^1) + (0xB * 16^0) = (1 * 16) + (11 * 1) = 16 + 11 = 27

It’s that easy, and a programmer should be able to do that in their
brain, at least with small numbers.  Or at least know that:
http://lmgtfy.com/?q=0x1B+in+decimal

Or at least `hex(27)`; or `'{:X}'.format(27)`; or `'%X' % 27`.  (I
despise hex numbers with lowercase letters, but that’s my personal
opinion.)
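
For anyone who would rather check the arithmetic in the interpreter, here is a minimal sketch. The helper name `hex_to_decimal` is made up for illustration; the built-in `int(text, 16)` does the same job and is what you would normally use.

```python
def hex_to_decimal(text):
    """Convert a hexadecimal string such as '1B' to an int, digit by digit."""
    digits = "0123456789ABCDEF"
    value = 0
    for ch in text.upper():
        # Shift the running total one hex place to the left, then add this digit.
        value = value * 16 + digits.index(ch)
    return value

print(hex_to_decimal("1B"))   # manual positional arithmetic
print(int("1B", 16))          # the built-in equivalent
print(0x1B)                   # a hex literal is just an int
print("{:X}".format(27))      # and back to uppercase hex text
```

The loop is the same sum as above, just accumulated from the left instead of written out with powers of 16.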

--
Kwpolska http://kwpolska.tk | GPG KEY: 5EAAEA16
stop html mail| always bottom-post
http://asciiribbon.org| http://caliburn.nl/topposting.html


Re: A few questiosn about encoding

2013-06-15 Thread Denis McMahon
On Fri, 14 Jun 2013 16:58:20 +0300, Nick the Gr33k wrote:

 On 14/6/2013 1:14 pm, Cameron Simpson wrote:
 Normally a character in a b'...' item represents the byte value
 matching the character's Unicode ordinal value.

 The only thing that i didn't understood is this line.
 First please tell me what is a byte value

Seriously? You don't understand the term byte? And you're the support 
desk for a webhosting company?

-- 
Denis McMahon, denismfmcma...@gmail.com


Re: A few questiosn about encoding

2013-06-15 Thread Grant Edwards
On 2013-06-15, Denis McMahon denismfmcma...@gmail.com wrote:
 On Fri, 14 Jun 2013 16:58:20 +0300, Nick the Gr33k wrote:

 On 14/6/2013 1:14 pm, Cameron Simpson wrote:
 Normally a character in a b'...' item represents the byte value
 matching the character's Unicode ordinal value.

 The only thing that i didn't understood is this line.
 First please tell me what is a byte value

 Seriously? You don't understand the term byte? And you're the support 
 desk for a webhosting company?

Well, we haven't had this thread for a week or so...

There is some ambiguity in the term byte.  It used to mean the
smallest addressable unit of memory (which varied in the past -- at
one point, both 20 and 60 bit bytes were common).  These days the
smallest addressable unit of memory is almost always 8 bits on desktop
and embedded processors (but often not on DSPs).  That's why when IEEE
standards want to refer to an 8-bit chunk of data they use the term
octet.

:)




Re: A few questiosn about encoding

2013-06-15 Thread Nick the Gr33k

On 15/6/2013 5:44 pm, Grant Edwards wrote:


There is some ambiguity in the term byte.  It used to mean the
smallest addressable unit of memory (which varied in the past -- at
one point, both 20 and 60 bit bytes were common).  These days the
smallest addressable unit of memory is almost always 8 bits on desktop
and embedded processors (but often not on DSPs).  That's why when IEEE
standards want to refer to an 8-bit chunk of data they use the term
octet.


What is the difference between a byte and a byte's value?


--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-15 Thread Roy Smith
In article kphul7$74q$1...@reader1.panix.com,
 Grant Edwards invalid@invalid.invalid wrote:

 There is some ambiguity in the term byte.  It used to mean the
 smallest addressable unit of memory (which varied in the past -- at
 one point, both 20 and 60 bit bytes were common).

I would have defined it more like, some arbitrary collection of 
adjacent bits which hold some useful value.  Doesn't need to be 
addressable, nor does it need to be the smallest such thing.

For example, on the pdp-10 (36 bit word), it was common to treat a word 
as either four 9-bit bytes, or five 7-bit bytes (with one bit left 
over), depending on what you were doing.  And, of course, a nybble was 
something smaller than a byte!
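
To make the 36-bit example concrete, here is a rough sketch of carving one word into fixed-width bytes. The function name `split_word` and the sample value are invented for illustration, and it assumes the fields are packed from the high end of the word with any spare bit left at the bottom.

```python
def split_word(word, byte_bits, word_bits=36):
    """Split a word into as many byte_bits-wide fields as fit, high end first."""
    count = word_bits // byte_bits
    leftover = word_bits - count * byte_bits
    mask = (1 << byte_bits) - 1
    fields = []
    for i in range(count):
        # Move field i down to the low bits, then mask off everything else.
        shift = word_bits - (i + 1) * byte_bits
        fields.append((word >> shift) & mask)
    return fields, leftover

word = 0o123456701234  # an arbitrary 36-bit value (12 octal digits)

nine_bit, spare9 = split_word(word, 9)
seven_bit, spare7 = split_word(word, 7)

print([oct(f) for f in nine_bit], "spare bits:", spare9)
print(seven_bit, "spare bits:", spare7)
```

Nine bits is exactly three octal digits, which is why the 9-bit view of the sample word reads straight off the literal.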

And, yes, especially in networking, everybody talks about octets when 
they want to make sure people understand what they mean.


Re: A few questiosn about encoding

2013-06-15 Thread Nick the Gr33k

On 15/6/2013 5:59 pm, Roy Smith wrote:


And, yes, especially in networking, everybody talks about octets when
they want to make sure people understand what they mean.


1 byte = 8 bits

In networking, though, since we do not use variable-length encoding 
schemes like utf-8, how do we tell where a byte value starts and 
where it stops?


Do we need a start bit and a stop bit for that?



--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-15 Thread Steven D'Aprano
On Sat, 15 Jun 2013 17:49:13 +0300, Nick the Gr33k wrote:

 What the difference between a byte and a byte's value?

Nothing.


-- 
Steven


Re: A few questiosn about encoding

2013-06-15 Thread Joel Goldstick
On Sat, Jun 15, 2013 at 11:14 AM, Nick the Gr33k supp...@superhost.gr wrote:

 On 15/6/2013 5:59 pm, Roy Smith wrote:

  And, yes, especially in networking, everybody talks about octets when
 they want to make sure people understand what they mean.


 1 byte = 8 bits

 in networking though since we do not use encoding schemes with variable
 lengths like utf-8 is, how do we separate when a byte value start and when
 it stops?

 do we need a start bit and a stop bit for that?



 And this is specific to python how?





 --
 What is now proved was at first only imagined!




-- 
Joel Goldstick
http://joelgoldstick.com


Re: A few questiosn about encoding

2013-06-15 Thread Nick the Gr33k

On 14/6/2013 4:58 pm, Nick the Gr33k wrote:

On 14/6/2013 1:14 pm, Cameron Simpson wrote:

Normally a character in a b'...' item represents the byte value
matching the character's Unicode ordinal value.


The only thing that i didn't understood is this line.
First please tell me what is a byte value


\x1b is a sequence you find inside strings (and byte strings, the
b'...' format).


\x1b is a character(ESC) represented in hex format

b'\x1b' is a byte object that represents what?


  chr(27).encode('utf-8')
b'\x1b'

  b'\x1b'.decode('utf-8')
'\x1b'

After decoding it gives the char ESC in hex format
Shouldn't it result in value 27 which is the ordinal of ESC ?

  No, I mean conceptually, there is no difference between a code-point
  and its ordinal value. They are the same thing.

Why Unicode charset doesn't just contain characters, but instead it
contains a mapping of (characters -- ordinals) ?

I mean what we do is to encode a character like chr(65).encode('utf-8')

What's the reason of existence of its corresponding ordinal value since
it doesn't get involved into the encoding process?

Thank you very much for taking the time to explain.


Can someone please explain these questions too?

--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-15 Thread Benjamin Schollnick
Nick,

 The only thing that i didn't understood is this line.
 First please tell me what is a byte value
 
 \x1b is a sequence you find inside strings (and byte strings, the
 b'...' format).
 
 \x1b is a character(ESC) represented in hex format
 
 b'\x1b' is a byte object that represents what?
 
 
  chr(27).encode('utf-8')
 b'\x1b'
 
  b'\x1b'.decode('utf-8')
 '\x1b'
 
 After decoding it gives the char ESC in hex format
 Shouldn't it result in value 27 which is the ordinal of ESC ?

I'm sorry are you not listening?

1b is a HEXADECIMAL Number.  As a so-called programmer, did you seriously not 
consider that?

Try this:

1) Open a Web browser
2) Go to Google.com
3) Type in Hex 1B 
4) Click on the first link
5) In the Hexadecimal column find 1B.

Or open your favorite calculator, and convert Hexadecimal 1B to Decimal (Base 
10).

- Benjamin




Re: A few questiosn about encoding

2013-06-14 Thread Zero Piraeus
:

On 14 June 2013 01:34, Nick the Gr33k supp...@superhost.gr wrote:
 Why doesn't it work like this?

 leading 0 = 1 byte flag
 leading 1 = 2 bytes flag
 leading 00 = 3 bytes flag
 leading 01 = 4 bytes flag
 leading 10 = 5 bytes flag
 leading 11 = 6 bytes flag

 Wouldn't it be more logical?

Think about it. Let's say that, as per your scheme, a leading 0
indicates 1 byte (as is indeed the case in UTF8). What things could
follow that leading 0? How does that impact your choice of a leading
00 or 01 for other numbers of bytes?

... okay, you're obviously going to need to be spoon-fed a little more
than that. Here's a byte:

  01010101

Is that a single byte representing a code point in the 0-127 range, or
the first of 4 bytes representing something else, in your proposed
scheme? How can you tell?

Now look at the way UTF8 does it:
http://en.wikipedia.org/wiki/Utf-8#Description

Really, follow the link and study the table carefully. Don't continue
reading this until you believe you understand the choices that the
designers of UTF8 made, and why they made them.

Pay particular attention to the possible values for byte 1. Do you
notice the difference between that scheme, and yours:

  0xxxxxxx
  1xxxxxxx
  00xxxxxx
  01xxxxxx
  10xxxxxx
  11xxxxxx

If you don't see it, keep looking until you do ... this email gives
you more than enough hints to work it out. Don't ask someone here to
explain it to you. If you want to become competent, you must use your
brain.
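
If it helps to poke at real bytes while studying that table, here is a small sketch that labels any single byte according to the UTF-8 bit patterns. The function name `classify_byte` is made up, and it only looks at the leading bits; it does not reject overlong or otherwise invalid sequences, so treat it as a study aid, not a validator.

```python
def classify_byte(b):
    """Say what role a byte value (0-255) plays in a UTF-8 stream."""
    if b & 0b10000000 == 0b00000000:
        return "single byte (ASCII)"
    if b & 0b11000000 == 0b10000000:
        return "continuation byte"
    if b & 0b11100000 == 0b11000000:
        return "lead byte of a 2-byte sequence"
    if b & 0b11110000 == 0b11100000:
        return "lead byte of a 3-byte sequence"
    if b & 0b11111000 == 0b11110000:
        return "lead byte of a 4-byte sequence"
    return "not valid in UTF-8"

# Print each byte of a short sample string with its role.
for b in "aé€".encode("utf-8"):
    print("{:08b}".format(b), classify_byte(b))
```

Every byte lands in exactly one category, whichever byte of the stream you happen to pick up first.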

 -[]z.


Re: A few questiosn about encoding

2013-06-14 Thread Nick the Gr33k

On 14/6/2013 4:00 am, Cameron Simpson wrote:

On 13Jun2013 17:19, Nikos as SuperHost Support supp...@superhost.gr wrote:
| A code-point and the code-point's ordinal value are associated into
| a Unicode charset. They have the so called 1:1 mapping.
|
| So, i was under the impression that by encoding the code-point into
| utf-8 was the same as encoding the code-point's ordinal value into
| utf-8.
|
| So, now i believe they are two different things.
| The code-point *is what actually* needs to be encoded and *not* its
| ordinal value.

Because there is a 1:1 mapping, these are the same thing: a code
point is directly _represented_ by the ordinal value, and the ordinal
value is encoded for storage as bytes.


So, you are saying that:

chr(16474).encode('utf-8')   #being the code-point encoded

ord(chr(16474)).encode('utf-8') #being the code-point's ordinal 
encoded which gives an error.


that shows us that a character is what is being encoded to utf-8 but 
the character's ordinal cannot.


So, why do you say "and the ordinal value is encoded for storage as 
bytes"?




|  The leading 0b is just syntax to tell you this is base 2, not base 8
|  (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.
|
| But byte objects are represented as '\x' instead of the
| aforementioned '0x'. Why is that?

You're confusing a string representation of a single number in
some base (eg 2 or 16) with the string-ish representation of a
bytes object.


>>> bin(16474)
'0b100000001011010'
that is a binary format string representation of number 16474, yes?

>>> hex(16474)
'0x405a'
that is a hexadecimal format string representation of number 16474, yes?

WHILE:

b'abc\x1b\n' = a string representation of a byte, which in turn is a 
series of integers, so that makes this a string representation of 
integers, is this correct?


\x1b = ESC character

\ = for separating bytes
x = to flag that the following bytes are going to be represented as hex 
values? what exactly does 'x' mean here? character perhaps?


Still it's not clear in my head what the difference between '0x1b' and 
'\x1b' is:


i think:
0x1b = an integer represented in hex format

\x1b = a character represented in hex format

is this true?





| How can i view this byte's object representation as hex() or as bin()?

See above. A bytes is a _sequence_ of values. hex() and bin() print
individual values in hexadecimal or binary respectively.


>>> for value in b'\x97\x98\x99\x27\x10':
...     print(value, hex(value), bin(value))
...
151 0x97 0b10010111
152 0x98 0b10011000
153 0x99 0b10011001
39 0x27 0b100111
16 0x10 0b10000


>>> for value in b'abc\x1b\n':
...     print(value, hex(value), bin(value))
...
97 0x61 0b1100001
98 0x62 0b1100010
99 0x63 0b1100011
27 0x1b 0b11011
10 0xa 0b1010


Why do these two give different values when printed?
--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-14 Thread Nick the Gr33k

On 14/6/2013 9:00 am, Zero Piraeus wrote:

:

On 14 June 2013 01:34, Nick the Gr33k supp...@superhost.gr wrote:

Why doesn't it work like this?

leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag

Wouldn't it be more logical?


Think about it. Let's say that, as per your scheme, a leading 0
indicates 1 byte (as is indeed the case in UTF8). What things could
follow that leading 0? How does that impact your choice of a leading
00 or 01 for other numbers of bytes?

... okay, you're obviously going to need to be spoon-fed a little more
than that. Here's a byte:

   01010101

Is that a single byte representing a code point in the 0-127 range, or
the first of 4 bytes representing something else, in your proposed
scheme? How can you tell?


Indeed.

You cannot tell if it stands for 1 byte or a 4 byte sequence:

0 + 1010101 = leading 0 stands for 1byte representation of a code-point

01 + 010101 = leading 01 stands for 4byte representation of a code-point

the problem here in my scheme of how utf8 encoding works is that you 
cannot tell whether the flag is '0' or '01'


The same happens with leading '1' and '11'. You cannot tell what the flag 
is, so you cannot know if the Unicode code-point is being represented as 
a 2-byte sequence or a 6-byte sequence.


Understood



Now look at the way UTF8 does it:
http://en.wikipedia.org/wiki/Utf-8#Description

Really, follow the link and study the table carefully. Don't continue
reading this until you believe you understand the choices that the
designers of UTF8 made, and why they made them.

Pay particular attention to the possible values for byte 1. Do you
notice the difference between that scheme, and yours:

   0xxxxxxx
   1xxxxxxx
   00xxxxxx
   01xxxxxx
   10xxxxxx
   11xxxxxx

If you don't see it, keep looking until you do ... this email gives
you more than enough hints to work it out. Don't ask someone here to
explain it to you. If you want to become competent, you must use your
brain.


0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

I did read the link but i still cannot see:

1. why '110' is the flag for a 2-byte code-point
2. why in the 2nd byte and every subsequent byte the leading flag has to 
be '10'


--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-14 Thread Antoon Pardon
On 13-06-13 10:08, Νικόλαος Κούρας wrote:
 On 13/6/2013 10:58 am, Chris Angelico wrote:
 On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας
 supp...@superhost.gr wrote:
 On 13/6/2013 10:11 am, Steven D'Aprano wrote:
 No! That creates a string from 16474 in base two:
 '0b100000001011010'

 I disagree here.
 16474 is a number in base 10. Doing bin(16474) we get the binary
 representation of number 16474 and not a string.
 Why you say we receive a string while python presents a binary number?

 You can disagree all you like. Steven cited a simple point of fact,
 one which can be verified in any Python interpreter. Nikos, you are
 flat wrong here; bin(16474) creates a string.

 Indeed python embraced it in single quoting '0b100000001011010' and
 not as 0b100000001011010 which in fact makes it a string.

 But since bin(16474) seems to create a string rather than an expected
 number(at least into my mind) then how do we get the binary
 representation of the number 16474 as a number?

You don't. You should remember that python (or any programming language)
doesn't print numbers. It always prints string representations of
numbers. It is just that we are so used to the decimal representation
that we think of that representation as being the number.

Normally that is not a problem but it can cause confusion when you are
working with multiple representations.
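
This is easy to confirm in any Python 3 interpreter. A minimal sketch (the variable names are arbitrary):

```python
n = 16474
text = bin(n)

print(type(n).__name__)     # the number itself is an int
print(type(text).__name__)  # bin() hands back a str
print(text)                 # the binary notation, as text

# Parsing the text again recovers the same number.
print(int(text, 2) == n)
```

The round trip through `int(text, 2)` shows that the string is only one notation for the number, not a different kind of number.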

-- 
Antoon Pardon



Re: A few questiosn about encoding

2013-06-14 Thread Nick the Gr33k

On 14/6/2013 10:36 am, Antoon Pardon wrote:

On 13-06-13 10:08, Νικόλαος Κούρας wrote:

On 13/6/2013 10:58 am, Chris Angelico wrote:

On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας
supp...@superhost.gr wrote:

On 13/6/2013 10:11 am, Steven D'Aprano wrote:

No! That creates a string from 16474 in base two:
'0b100000001011010'


I disagree here.
16474 is a number in base 10. Doing bin(16474) we get the binary
representation of number 16474 and not a string.
Why you say we receive a string while python presents a binary number?


You can disagree all you like. Steven cited a simple point of fact,
one which can be verified in any Python interpreter. Nikos, you are
flat wrong here; bin(16474) creates a string.


Indeed python embraced it in single quoting '0b100000001011010' and
not as 0b100000001011010 which in fact makes it a string.

But since bin(16474) seems to create a string rather than an expected
number(at least into my mind) then how do we get the binary
representation of the number 16474 as a number?


You don't. You should remember that python (or any programming language)
doesn't print numbers. It always prints string representations of
numbers. It is just that we are so used to the decimal representation
that we think of that representation as being the number.

Normally that is not a problem but it can cause confusion when you are
working with multiple representations.

Hold on!
You are basically saying here that:


>>> 16474
16474

is not a number as we think but instead is a string representation of a 
number?


I don't think so, if it were a string representation of a number that 
would print the following:


>>> 16474
'16474'

Python prints numbers:

>>> 16474
16474
>>> 0b100000001011010
16474
>>> 0x405a
16474

it prints them all in decimal format though.
But when we need a decimal integer to be turned into binary or hex we 
can use bin(number) or hex(number) and just remove the pair of single quotes.


--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-14 Thread Antoon Pardon
On 14-06-13 09:49, Nick the Gr33k wrote:
 On 14/6/2013 10:36 am, Antoon Pardon wrote:
 On 13-06-13 10:08, Νικόλαος Κούρας wrote:

 Indeed python embraced it in single quoting '0b100000001011010' and
 not as 0b100000001011010 which in fact makes it a string.

 But since bin(16474) seems to create a string rather than an expected
 number(at least into my mind) then how do we get the binary
 representation of the number 16474 as a number?

 You don't. You should remember that python (or any programming language)
 doesn't print numbers. It always prints string representations of
 numbers. It is just that we are so used to the decimal representation
 that we think of that representation as being the number.

 Normally that is not a problem but it can cause confusion when you are
 working with multiple representations.
 Hold on!
 You are basically saying here that:


  16474
 16474

 is not a number as we think but instead is a string representation of a
 number?

Yes, or if you prefer, what python prints is the decimal notation of the number.
 


 I dont think so, if it were a string representation of a number that
 would print the following:

  16474
 '16474'

No it wouldn't. You are confusing representation in the everyday meaning
with representation as python jargon.


 Python prints numbers:
No it doesn't. Numbers are abstract concepts that can be represented in
various notations; these notations are strings. Those notational strings
end up being printed. As I said before, we are so used to using the
decimal notation that we often use the notation and the number interchangeably
without a problem. But when we are working with multiple notations that
can become confusing and we should be careful to separate numbers from their
representations/notations.


 but when we need a decimal integer

There are no decimal integers. There is only a decimal notation of the number.
Decimal, octal etc are not characteristics of the numbers themselves.
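
A short sketch of the same point in Python 3: four different literals in the source, one number in memory, and four notations back out again via `format()`. The list name `values` is arbitrary.

```python
# The same number written in decimal, binary, octal and hexadecimal.
values = [16474, 0b100000001011010, 0o40132, 0x405a]

print(values)            # every entry prints the same way
print(len(set(values)))  # a set keeps only distinct numbers

# One number, rendered in four notations.
for spec in ("d", "b", "o", "x"):
    print(spec, format(16474, spec))
```

Nothing about the stored integer remembers which notation it was typed in.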

-- 

Antoon Pardon



Re: A few questiosn about encoding

2013-06-14 Thread Nick the Gr33k

On 14/6/2013 11:22 am, Antoon Pardon wrote:


Python prints numbers:

No it doesn't, numbers are abstract concepts that can be represented in
various notations, these notations are strings. Those notational strings
end up being printed. As I said before we are so used to using the
decimal notation that we often use the notation and the number interchangeably
without a problem. But when we are working with multiple notations that
can become confusing and we should be careful to separate numbers from their
representations/notations.


How do we separate a number then from its representation/notation?

What is a notation anyway? Is it a way of displaying? But that would be 
a representation then.


Please explain this line as it uses both terms.

No it doesn't, numbers are abstract concepts that can be represented in
various notations


but when we need a decimal integer


There are no decimal integers. There is only a decimal notation of the number.
Decimal, octal etc are not characteristics of the numbers themselves.


So everything we see like:

16474
nikos
abc123

everything is a string and nothing is a number? not even number 1?

--
What is now proved was at first only imagined!


Don't feed the troll... (was: Re: A few questiosn about encoding)

2013-06-14 Thread Heiko Wundram

On 14.06.2013 10:37, Nick the Gr33k wrote:

So everything we see like:

16474
nikos
abc123

everything is a string and nothing is a number? not even number 1?


Come on now, this is _so_ obviously trolling, it's not even remotely 
funny anymore. Why doesn't killfiling work with the mailing list version 
of the python list? :-(


--
--- Heiko.


Re: A few questiosn about encoding

2013-06-14 Thread Cameron Simpson
On 14Jun2013 11:37, Nikos as SuperHost Support supp...@superhost.gr wrote:
| On 14/6/2013 11:22 am, Antoon Pardon wrote:
| 
| Python prints numbers:
| No it doesn't, numbers are abstract concepts that can be represented in
| various notations, these notations are strings. Those notational strings
| end up being printed. As I said before we are so used to using the
| decimal notation that we often use the notation and the number interchangeably
| without a problem. But when we are working with multiple notations that
| can become confusing and we should be careful to separate numbers from their
| representations/notations.
| 
| How do we separate a number then from its representation/notation?

Shrug. When you print a number, Python transcribes a string
representation of it to your terminal.

| What is a notation anyway? Is it a way of displaying? But that
| would be a representation then.

Yep. Same thing. A notation is a particular formal method of
representation.

| No it doesn't, numbers are abstract concepts that can be represented in
| various notations
| 
| but when we need a decimal integer
| 
| There are no decimal integers. There is only a decimal notation of the number.
| Decimal, octal etc are not characteristics of the numbers themselves.
| 
| So everything we see like:
| 
| 16474
| nikos
| abc123
| 
| everything is a string and nothing is a number? not even number 1?

Everything you see like that is textual information. Internally to
Python, various types are used: strings, bytes, integers etc. But
when you print something, text is output.
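
One way to see this for yourself is to capture what print() writes instead of letting it reach the terminal. A minimal sketch using `io.StringIO` from the standard library as a stand-in for the terminal:

```python
import io

n = 16474  # internally an int object

# Capture what print() would otherwise send to the terminal.
buf = io.StringIO()
print(n, file=buf)
captured = buf.getvalue()

print(type(n).__name__)         # the value's type inside Python
print(type(captured).__name__)  # the type of what print() emitted
print(repr(captured))           # the exact characters that were written
```

The int never leaves Python; only its text form does.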

Cheers,
-- 
Cameron Simpson c...@zip.com.au

A long-forgotten loved one will appear soon. Buy the negatives at any price.


Re: Don't feed the troll... (was: Re: A few questiosn about encoding)

2013-06-14 Thread Fábio Santos
On 14 Jun 2013 10:20, Heiko Wundram modeln...@modelnine.org wrote:

 On 14.06.2013 10:37, Nick the Gr33k wrote:

 So everything we see like:

 16474
 nikos
 abc123

 everything is a string and nothing is a number? not even number 1?


 Come on now, this is _so_ obviously trolling, it's not even remotely
funny anymore. Why doesn't killfiling work with the mailing list version of
the python list? :-(

I have skimmed the archives for this month, and I estimate that a third of
this month's activity on this list was helping this person. About 80% of
that is wasted in explaining basic concepts he refuses to read in links
given to him. A depressingly large number of replies to his posts are
seemingly ignored.

Since this is a lot of spam, I feel like leaving the list, but I also
honestly want to help people use python and the replies to questions of
others often give me much insight on several matters.


Re: A few questiosn about encoding

2013-06-14 Thread Cameron Simpson
On 14Jun2013 09:59, Nikos as SuperHost Support supp...@superhost.gr wrote:
| On 14/6/2013 4:00 am, Cameron Simpson wrote:
| On 13Jun2013 17:19, Nikos as SuperHost Support supp...@superhost.gr wrote:
| | A code-point and the code-point's ordinal value are associated into
| | a Unicode charset. They have the so called 1:1 mapping.
| |
| | So, i was under the impression that by encoding the code-point into
| | utf-8 was the same as encoding the code-point's ordinal value into
| | utf-8.
| |
| | So, now i believe they are two different things.
| | The code-point *is what actually* needs to be encoded and *not* its
| | ordinal value.
| 
| Because there is a 1:1 mapping, these are the same thing: a code
| point is directly _represented_ by the ordinal value, and the ordinal
| value is encoded for storage as bytes.
| 
| So, you are saying that:
| 
| chr(16474).encode('utf-8')   #being the code-point encoded
| 
| ord(chr(16474)).encode('utf-8') #being the code-point's ordinal
| encoded which gives an error.
| 
| that shows us that a character is what is being encoded to utf-8
| but the character's ordinal cannot.
| 
| So, why do you say "and the ordinal value is encoded for storage
| as bytes"?

No, I mean conceptually, there is no difference between a codepoint
and its ordinal value. They are the same thing.

Inside Python itself, a character is a string of length 1 (there is
no separate character type), and strings are a distinct type from
integers. Internally, the characters in a string are stored numerically,
as Unicode codepoints, that is, as their ordinal values.

It is a meaningful idea to store a Python string encoded into bytes
using some text encoding scheme (utf-8, iso-8859-7, what have you).

It is not a meaningful thing to store a number encoded without
some more context. The .encode() method that accepts an encoding
name like utf-8 is specifically an encoding procedure FOR TEXT.

So strings have such a method, and integers do not.

When you write:

  chr(16474)

you receive a _string_, containing the single character whose ordinal
is 16474. It is meaningful to transcribe this string to bytes using
a text encoding procedure like 'utf-8'.

When you write:

  ord(chr(16474))

you get an integer. Because ord() is the reverse of chr(), you get
the integer 16474.

Integers do not have .encode() methods that accept a _text_ encoding
name like 'utf-8' because integers are not text.
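
Here is the same thing as a Python 3 session you can run. The last line uses `int.to_bytes`, which is my own choice of illustration for "storing a number with more context" (you must say how many bytes and which byte order); it is not something the text encodings do.

```python
ch = chr(16474)              # a str of length 1
encoded = ch.encode("utf-8") # text -> bytes via a text encoding

print(len(ch), ord(ch))      # one character, ordinal 16474
print(encoded)               # the UTF-8 bytes for that character
print(encoded.decode("utf-8") == ch)

# Integers are not text, so they have no .encode() method at all.
print(hasattr(16474, "encode"))

# Storing the bare number needs extra context: a width and a byte order.
print((16474).to_bytes(2, "big"))
```

Note that the UTF-8 bytes and the `to_bytes` result differ: one is an encoding of a character, the other a raw layout of an integer.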

| |  The leading 0b is just syntax to tell you this is base 2, not base 8
| |  (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.
| |
| | But byte objects are represented as '\x' instead of the
| | aforementioned '0x'. Why is that?
| 
| You're confusing a string representation of a single number in
| some base (eg 2 or 16) with the string-ish representation of a
| bytes object.
| 
| >>> bin(16474)
| '0b100000001011010'
| that is a binary format string representation of number 16474, yes?

Yes.

| >>> hex(16474)
| '0x405a'
| that is a hexadecimal format string representation of number 16474, yes?

Yes.

| WHILE:
| b'abc\x1b\n' = a string representation of a byte, which in turn is a
| series of integers, so that makes this a string representation of
| integers, is this correct?

A bytes Python object. So not a byte, 5 bytes.
It is a string representation of the series of byte values,
ON THE PREMISE that the bytes may well represent text.
On that basis, b'abc\x1b\n' is a reasonable way to display them.

In other contexts this might not be a sensible way to display these
bytes, and then another format would be chosen, possibly hand
constructed by the programmer, or equally reasonable, the hexlify()
function from the binascii module.
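
For comparison, a quick sketch of three ways to look at the same five bytes, including `binascii.hexlify()`:

```python
import binascii

data = b'abc\x1b\n'

print(data)                    # the text-ish display Python chooses
print(binascii.hexlify(data))  # two hex digits per byte, as a bytes object
print(list(data))              # the underlying byte values as integers
```

All three are displays of one and the same bytes object; only the notation changes.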

| \x1b = ESC character

Considering the bytes to be representing characters, then yes.

| \ = for separating bytes

No, \ to introduce a sequence of characters with special meaning.

Normally a character in a b'...' item represents the byte value
matching the character's Unicode ordinal value. But several characters
are hard or confusing to place literally in a b'...' string. For
example a newline character or an escape character.

'a' means 97.
'\n' means 10 (newline, hence the 'n').
'\x1b' means 27 (escape, value 0x1b in hexadecimal).
And, of course, '\\' means a literal slosh, value 92.
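A quick way to check these values yourself, assuming Python 3, where iterating over a bytes object yields integers:

```python
# Four bytes: a plain letter, a newline, ESC, and one literal backslash.
data = b'a\n\x1b\\'

print(len(data))    # 4
print(list(data))   # [97, 10, 27, 92]
```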

| x = to flag that the following bytes are going to be represented as
| hex values? whats exactly 'x' means here? character perhaps?

A slosh followed by an 'x' means there will be 2 hexadecimal digits
to follow, and those two digits represent the byte value.

So, yes.

| Still its not clear into my head what the difference of '0x1b' and
| '\x1b' is:

They're the same thing in two similar but slightly different formats.

0x1b is a legitimate bare integer value in Python.

\x1b is a sequence you find inside strings (and byte strings, the
b'...' format).

| i think:
| 0x1b = an integer represented in hex format

Yes.

| \x1b = a character represented in hex format

Yes.
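A short sketch of the two forms side by side (Python 3 assumed; n and c are just illustrative names):

```python
n = 0x1b     # an int literal written in hexadecimal
c = '\x1b'   # a str containing the single character ESC

print(n)                                    # 27
print(type(n).__name__, type(c).__name__)   # int str
print(ord(c) == n)                          # True
print(chr(n) == c)                          # True
print(hex(n))                               # 0x1b
```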

| | How can i view this byte's object representation as 

Re: A few questiosn about encoding

2013-06-14 Thread Antoon Pardon
Op 14-06-13 10:37, Nick the Gr33k schreef:
 On 14/6/2013 11:22 πμ, Antoon Pardon wrote:

 Python prints numbers:
 No it doesn't, numbers are abstract concepts that can be represented in
 various notations, these notations are strings. Those notational strings
 end up being printed. As I said before we are so used to using the
 decimal notation that we often use the notation and the number
 interchangeably
 without a problem. But when we are working with multiple notations that
 can become confusing and we should be careful to separate numbers
 from their
 representations/notations.

 How do we separate a number then from its representation/notation?
What do you mean? Internally there is no representation linked
to the number, so there is nothing to be separated. Only when
a number needs to be printed, is a representation for that
number built and displayed.


 What is a notation anyway? is it a way of displaying? but that would
 be a representation then
Yes, a notation is a representation. However, representation is also
a bit of python jargon that has a specific meaning. So in order not
to confuse the multiple possible meanings of representation, I
chose to use notation.


 There are no decimal integers. There is only a decimal notation of
 the number.
 Decimal, octal etc are not characteristics of the numbers themselves.


 So everything we see like:

 16474
 nikos
 abc123

 everything is a string and nothing is a number? not even number 1?
There is a difference between "everything we see", as you
wrote earlier, and just plain "everything", as you wrote
later. Python works with numbers, but at the moment
it has to display such a number it has to produce something
that is printable. So it will build a string that can be
used as a notation for that number, a numeral. And that
is what will be displayed.
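To illustrate (Python 3 assumed): the number below is one and the same integer; only the notation used to write or display it changes.

```python
n = 16474

# Four notations in source code, one number.
print(n == 0x405a == 0o40132 == 0b100000001011010)   # True

# Four numerals (strings) built from that one number.
print(str(n))   # 16474
print(hex(n))   # 0x405a
print(oct(n))   # 0o40132
print(bin(n))   # 0b100000001011010

# And back from a numeral to the number.
print(int('405a', 16) == n)   # True
```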

-- 
Antoon.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Don't feed the troll... (was: Re: A few questiosn about encoding)

2013-06-14 Thread rusi
On Jun 14, 3:20 pm, Fábio Santos fabiosantos...@gmail.com wrote:
  Come on now, this is _so_ obviously trolling, it's not even remotely

 funny anymore. Why doesn't killfiling work with the mailing list version of
 the python list? :-(

 I have skimmed the archives for this month, and I estimate that a third of
 this month's activity on this list was helping this person. About 80% of
 that is wasted in explaining basic concepts he refuses to read in links
 given to him. A depressingly large number of replies to his posts are
 seemingly ignored.

 Since this is a lot of spam, I feel like leaving the list, but I also
 honestly want to help people use python and the replies to questions of
 others often give me much insight on several matters.

Adding my +1 to this sentiment.

In older saner and more politically incorrect times, when there was a
student who was as idiotic as Nikos, he would be made to:
-- run five rounds of the field
-- stay after school
-- write pages of I shall not talk in class

In the age of cut-n-paste the last has lost its sting. Likewise the
first two are hard to administer across the internet.
Still if we are genuinely interested in solving this problem, ways may
be found, for example:

Any question from Nikos that has any English error, should be returned
with:
Correct your English before we look at your python.

If he is brazen enough to correct one error and leave the other 35,
then we put in a 24-hour delay for each reply.

I am sure others can come up with better solutions if we wish.

The alternative is that this disease has an unfavorable prognosis:
[Yes Nikos is an infectious disease: I believe I can pull out mails
from Steven and Grant Edwards whic hare begng tolook sspcicious ly
like Nikos [Sorry Im not much good at imitation!] ]

And that unfavorable prognosis is what Fabio is suggesting -- people
will start leaving the list/group.

Nikos:
This is not against you personally.  Just your current mode of conduct
towards this list.
And that mode quite simply is this: You have no interest in python,
you are only interested in the immediate questions of your web-hosting.



Re: A few questiosn about encoding

2013-06-14 Thread Nick the Gr33k

On 14/6/2013 1:19 μμ, Cameron Simpson wrote:

On 14Jun2013 11:37, Nikos as SuperHost Support supp...@superhost.gr wrote:
| On 14/6/2013 11:22 πμ, Antoon Pardon wrote:
|
| Python prints numbers:
| No it doesn't, numbers are abstract concepts that can be represented in
| various notations, these notations are strings. Those notational strings
| end up being printed. As I said before we are so used to using the
| decimal notation that we often use the notation and the number interchangeably
| without a problem. But when we are working with multiple notations that
| can become confusing and we should be careful to separate numbers from their
| representations/notations.
|
| How do we separate a number then from its representation/notation?

Shrug. When you print a number, Python transcribes a string
representation of it to your terminal.


 16
16

So the output 16 is in fact a string representation of the number 16 ?

Then how do 16 and '16' differ?



| What is a notation anyway? is it a way of displaying? but that
| would be a representation then

Yep. Same thing. A notation is a particular formal method of
representation.



Can you elaborate please?

| No it doesn't, numbers are abstract concepts that can be represented in
| various notations
|
| but when we need a decimal integer
| 
| There are no decimal integers. There is only a decimal notation of the
| number.
| Decimal, octal etc are not characteristics of the numbers themselves.
|
| So everything we see like:
|
| 16474
| nikos
| abc123
|
| everything is a string and nothing is a number? not even number 1?

Everything you see like that is textual information. Internally to
Python, various types are used: strings, bytes, integers etc. But
when you print something, text is output.

Cheers,


Thanks!

--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-14 Thread Nick the Gr33k

On 14/6/2013 1:50 μμ, Antoon Pardon wrote:


Python works with numbers, but at the moment
it has to display such a number it has to produce something
that is printable. So it will build a string that can be
used as a notation for that number, a numeral. And that
is what will be displayed.


so a number is just a number but when this number needs to be displayed 
into a monitor, then the printed form of that number we choose to call 
it a numeral?


So, a numeral = a string representation of a number. Is this correct?

--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-14 Thread Antoon Pardon
Op 14-06-13 14:59, Nick the Gr33k schreef:

 On 14/6/2013 1:50 μμ, Antoon Pardon wrote:
 Python works with numbers, but at the moment
 it has to display such a number it has to produce something
 that is printable. So it will build a string that can be
 used as a notation for that number, a numeral. And that
 is what will be displayed.
 so a number is just a number but when this number needs to be displayed 
 into a monitor, then the printed form of that number we choose to call 
 it a numeral?
 So, a numeral = a string representation of a number. Is this correct?
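Yes, when you print an integer, what actually happens is something along
the lines of the following algorithm (python 2 code):


def write_int(out, nr):
    # Build the decimal numeral for nr and write it to the stream out.
    ord0 = ord('0')
    lst = []
    negative = False
    if nr < 0:
        negative = True
        nr = -nr
    while nr:
        # Peel off the last decimal digit and turn it into a character.
        digit = nr % 10
        lst.append(chr(digit + ord0))
        nr //= 10  # integer division
    if negative:
        lst.append('-')
    # Digits were collected least significant first.
    lst.reverse()
    if not lst:
        # nr was 0, so the loop produced no digits.
        lst.append('0')
    numeral = ''.join(lst)
    out.write(numeral)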
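The same idea can be generalised to other bases. This is only a sketch in Python 3, not Antoon's code; to_numeral and DIGITS are hypothetical names chosen for the example.

```python
DIGITS = '0123456789abcdefghijklmnopqrstuvwxyz'

def to_numeral(nr, base=10):
    """Build the string notation of an int in the given base (2..36)."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError('base must be between 2 and 36')
    if nr == 0:
        return '0'
    negative = nr < 0
    nr = abs(nr)
    digits = []
    while nr:
        # divmod gives the remaining number and its last digit in one step.
        nr, digit = divmod(nr, base)
        digits.append(DIGITS[digit])
    if negative:
        digits.append('-')
    # Digits were collected least significant first.
    return ''.join(reversed(digits))

print(to_numeral(16474))       # 16474
print(to_numeral(16474, 16))   # 405a
print(to_numeral(16474, 2))    # 100000001011010
print(to_numeral(-65))         # -65
```

The number never changes; only the numeral produced for it does.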

-- 
Antoon Pardon







Re: A few questiosn about encoding

2013-06-14 Thread Nick the Gr33k

On 14/6/2013 1:14 μμ, Cameron Simpson wrote:

Normally a character in a b'...' item represents the byte value
matching the character's Unicode ordinal value.


The only thing that i didn't understood is this line.
First please tell me what is a byte value


\x1b is a sequence you find inside strings (and byte strings, the
b'...' format).


\x1b is a character(ESC) represented in hex format

b'\x1b' is a byte object that represents what?


 chr(27).encode('utf-8')
b'\x1b'

 b'\x1b'.decode('utf-8')
'\x1b'

After decoding it gives the char ESC in hex format
Shouldn't it result in value 27 which is the ordinal of ESC ?

 No, I mean conceptually, there is no difference between a code-point
 and its ordinal value. They are the same thing.

Why Unicode charset doesn't just contain characters, but instead it 
contains a mapping of (characters -- ordinals) ?


I mean what we do is to encode a character like chr(65).encode('utf-8')

What's the reason of existence of its corresponding ordinal value since 
it doesn't get involved into the encoding process?


Thank you very much for taking the time to explain.
--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-14 Thread Joel Goldstick
let's cut to the chase and start with telling us what you DO know Nick.
That would take less typing


On Fri, Jun 14, 2013 at 9:58 AM, Nick the Gr33k supp...@superhost.gr wrote:

 On 14/6/2013 1:14 μμ, Cameron Simpson wrote:

 Normally a character in a b'...' item represents the byte value
 matching the character's Unicode ordinal value.


 The only thing that i didn't understood is this line.
 First please tell me what is a byte value


  \x1b is a sequence you find inside strings (and byte strings, the
 b'...' format).


 \x1b is a character(ESC) represented in hex format

 b'\x1b' is a byte object that represents what?


  chr(27).encode('utf-8')
 b'\x1b'

  b'\x1b'.decode('utf-8')
 '\x1b'

 After decoding it gives the char ESC in hex format
 Shouldn't it result in value 27 which is the ordinal of ESC ?

  No, I mean conceptually, there is no difference between a code-point

  and its ordinal value. They are the same thing.

 Why Unicode charset doesn't just contain characters, but instead it
 contains a mapping of (characters -- ordinals) ?

 I mean what we do is to encode a character like chr(65).encode('utf-8')

 What's the reason of existence of its corresponding ordinal value since it
 doesn't get involved into the encoding process?

 Thank you very much for taking the time to explain.

 --
 What is now proved was at first only imagined!
 --
 http://mail.python.org/mailman/listinfo/python-list




-- 
Joel Goldstick
http://joelgoldstick.com


Re: A few questiosn about encoding

2013-06-14 Thread Nick the Gr33k

On 14/6/2013 6:21 μμ, Joel Goldstick wrote:

let's cut to the chase and start with telling us what you DO know Nick.
That would take less typing
Well, my biggest successes up until now where to build 3 websites 
utilizing database saves and retrievals


in PHP
in Perl
and later in Python

with absolute ignorance of

Apache Configuration:
CGI:
Linux:

with just basic knowledge of linux.
I'am very proud of it.



--
What is now proved was at first only imagined!


Re: A few questiosn about encoding

2013-06-14 Thread Chris Angelico
On Sat, Jun 15, 2013 at 1:26 AM, Nick the Gr33k supp...@superhost.gr wrote:
 Well, my biggest successes up until now where to build 3 websites utilizing
 database saves and retrievals

 in PHP
 in Perl
 and later in Python

 with absolute ignorance of

 Apache Configuration:
 CGI:
 Linux:

 with just basic knowledge of linux.
 I'am very proud of it.

Translation:

I just built a car. I don't know anything about internal combustion
engines or road rules or metalwork, and I'm very proud of the
monstrosity that I'm now selling to my friends.

Would you buy a car built by someone who proudly announces that he has
no clue how to build one? Why do you sell web hosting services when
you have no clue how to provide them?

ChrisA


Re: Don't feed the troll... (was: Re: A few questiosn about encoding)

2013-06-14 Thread D'Arcy J.M. Cain
On Fri, 14 Jun 2013 11:06:55 +0200
Heiko Wundram modeln...@modelnine.org wrote:
 Come on now, this is _so_ obviously trolling, it's not even remotely 
 funny anymore. Why doesn't killfiling work with the mailing list
 version of the python list? :-(

A big problem, other than Mr. Support's shenanigans with his email
address, is that even those of us who seem to have successfully
*plonked* him get the responses to him.  The biggest issue with a troll
isn't so much the annoying emails from him but the amplified slew of
responses.  That's the point of a troll after all.

The answer is to always make sure that you include the previous poster
in the reply as a Cc or To.  I filter out any email that has the string
supp...@superhost.gr in a header so I would also filter out the
replies if people would follow that simple rule.

I have suggested this before but the push back I get is that then
people would get two copies of the email, one to them and one to the
list.  My answer is simple.  Get a proper email system that filters out
duplicates.  Is there an email client out there that does not have this
facility?

-- 
D'Arcy J.M. Cain da...@druid.net |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 788 2246 (DoD#0082)(eNTP)   |  what's for dinner.
IM: da...@vex.net, VOIP: sip:da...@vex.net


Re: Don't feed the troll... (was: Re: A few questiosn about encoding)

2013-06-14 Thread Chris Angelico
On Sat, Jun 15, 2013 at 3:13 AM, D'Arcy J.M. Cain da...@druid.net wrote:
 The answer is to always make sure that you include the previous poster
 in the reply as a Cc or To.  I filter out any email that has the string
 supp...@superhost.gr in a header so I would also filter out the
 replies if people would follow that simple rule.

 I have suggested this before but the push back I get is that then
 people would get two copies of the email, one to them and one to the
 list.  My answer is simple.  Get a proper email system that filters out
 duplicates.  Is there an email client out there that does not have this
 facility?

The main downside to that is not the first response, to
somebody@somewhere and python-list, but the subsequent ones. Do you
include everyone's addresses? And if so, how do they then get off the
list? (This is a serious consideration. I had some very angry people
asking me to unsubscribe them from a (private) mailman list I run, but
they weren't subscribed at all - they were being cc'd.)

I prefer to simply mail the list. You should be able to mute entire
threads, and he doesn't start more than a couple a day usually.

ChrisA


Re: Don't feed the troll... (was: Re: A few questiosn about encoding)

2013-06-14 Thread Grant Edwards
On 2013-06-14, Chris Angelico ros...@gmail.com wrote:
 On Sat, Jun 15, 2013 at 3:13 AM, D'Arcy J.M. Cain da...@druid.net wrote:
 The answer is to always make sure that you include the previous poster
 in the reply as a Cc or To.  I filter out any email that has the string
 supp...@superhost.gr in a header so I would also filter out the
 replies if people would follow that simple rule.

 I have suggested this before but the push back I get is that then
 people would get two copies of the email, one to them and one to the
 list.  My answer is simple.  Get a proper email system that filters out
 duplicates.  Is there an email client out there that does not have this
 facility?

 The main downside to that is not the first response, to
 somebody@somewhere and python-list, but the subsequent ones. Do you
 include everyone's addresses? And if so, how do they then get off the
 list? (This is a serious consideration. I had some very angry people
 asking me to unsubscribe them from a (private) mailman list I run, but
 they weren't subscribed at all - they were being cc'd.)

I think the answer is to automatically kill all threads started by
him.

Unfortunately, I don't know if that's possible in most newsreaders.

-- 
Grant Edwards               grant.b.edwards        Yow! A dwarf is passing out
                                  at               somewhere in Detroit!
                              gmail.com


Re: A few questiosn about encoding

2013-06-14 Thread Walter Hurry
On Sat, 15 Jun 2013 03:03:02 +1000, Chris Angelico wrote:

 Why do you sell web hosting services when you
 have no clue how to provide them?
 
And why do you continue responding to this timewaster? Please, please 
just killfile him and let's all move on.


Re: A few questiosn about encoding

2013-06-14 Thread Cameron Simpson
On 14Jun2013 15:59, Nikos as SuperHost Support supp...@superhost.gr wrote:
| So, a numeral = a string representation of a number. Is this correct?

No, a numeral is an individual digit from the string representation of a number.
So: 65 requires two numerals: '6' and '5'.
-- 
Cameron Simpson c...@zip.com.au

In life, you should always try to know your strong points, but this is
far less important than knowing your weak points.
Martin Fitzpatrick mfitzpatr...@scot.bbc.co.uk


Re: A few questiosn about encoding

2013-06-14 Thread Cameron Simpson
On 14Jun2013 16:58, Nikos as SuperHost Support supp...@superhost.gr wrote:
| On 14/6/2013 1:14 μμ, Cameron Simpson wrote:
| Normally a character in a b'...' item represents the byte value
| matching the character's Unicode ordinal value.
| 
| The only thing that i didn't understood is this line.
| First please tell me what is a byte value

The numeric value stored in a byte. Bytes are just small integers
in the range 0..255; the values available with 8 bits of storage.

| \x1b is a sequence you find inside strings (and byte strings, the
| b'...' format).
| 
| \x1b is a character(ESC) represented in hex format

Yes.

| b'\x1b' is a byte object that represents what?

An array of 1 byte, whose value is 0x1b or 27.

|  chr(27).encode('utf-8')
| b'\x1b'

Transcribing the ESC Unicode character to byte storage.

|  b'\x1b'.decode('utf-8')
| '\x1b'

Reading a single byte array containing a 27 and decoding it assuming 'utf-8'.
This obtains a single character string containing the ESC character.

| After decoding it gives the char ESC in hex format
| Shouldn't it result in value 27 which is the ordinal of ESC ?

When printing strings, the non-printable characters in the string are
_represented_ in hex format, so \x1b was printed.
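A small check of this, assuming Python 3. The decoded result is a one-character string; the \x1b form is only how repr() displays it, and ord() gives the integer 27.

```python
s = b'\x1b'.decode('utf-8')

print(type(s).__name__)   # str
print(len(s))             # 1
print(repr(s))            # '\x1b'
print(ord(s))             # 27
```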

|  No, I mean conceptually, there is no difference between a code-point
|  and its ordinal value. They are the same thing.
| 
| Why Unicode charset doesn't just contain characters, but instead it
| contains a mapping of (characters -- ordinals) ?

Look, as far as a computer is concerned a character and an ordinal
are the same thing because you just store character ordinals in
memory when you store a string.

When characters are _displayed_, your Terminal (or web browser or
whatever) takes character ordinals and looks them up in a _font_,
which is a mapping of character ordinals to glyphs (character
images), and renders the character image onto your screen.

| I mean what we do is to encode a character like chr(65).encode('utf-8')
| What's the reason of existence of its corresponding ordinal value
| since it doesn't get involved into the encoding process?

Stop thinking of Unicode code points and ordinal values as separate
things. They are effectively two terms for the same thing. So there
is no corresponding ordinal value. 65 _is_ the ordinal value.

When you run:

  chr(65).encode('utf-8')

you're going:

  chr(65) == 'A'
Producing a string with just one character in it.
Internally, Python stores an array of character ordinals, thus: [65]

  'A'.encode('utf-8')
Walk along all the ordinals in the string and transcribe them as bytes.
For 65, the byte encoding in 'utf-8' is a single byte of value 65.
So you get an array of bytes (a bytes object in Python), thus: [65]
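The two steps can be tried directly (Python 3 assumed). The second example shows a character whose ordinal does not fit in one byte, so UTF-8 transcribes it as three.

```python
s = chr(65)
print(s)                         # A
print([ord(ch) for ch in s])     # [65]
print(list(s.encode('utf-8')))   # [65]

t = chr(16474)
print([ord(ch) for ch in t])     # [16474]
print(list(t.encode('utf-8')))   # [228, 129, 154]
```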

-- 
Cameron Simpson c...@zip.com.au

The double cam chain setup on the 1980's DOHC CB750 was another one of
Honda's pointless engineering breakthroughs. You know the cycle (if you'll
pardon the pun :-), Wonderful New Feature is introduced with much fanfare,
WNF is fawned over by the press, WNF is copied by the other three Japanese
makers (this step is sometimes optional), and finally, WNF is quietly dropped
by Honda.
- Blaine Gardner, blgar...@sim.es.com


Re: A few questiosn about encoding

2013-06-13 Thread Νικόλαος Κούρας

On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:

On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:


So, how many bytes does UTF-8 store for codepoints > 127 ?


Two, three or four, depending on the codepoint.


The amount of bytes needed by UTF-8 to store a code-point(character), 
depends on the ordinal value of the code-point in the Unicode charset, 
correct?


If this is correct then the higher the ordinal value(which is an decimal 
integer) in the Unicode charset the more bytes needed for storage.


Its like the bigger a decimal integer is the bigger binary number it 
produces.


Is this correct?



example for codepoint 256, 1345, 16474 ?


You can do this yourself. I have already given you enough information in
previous emails to answer this question on your own, but here it is again:

Open an interactive Python session, and run this code:

c = ord(16474)
len(c.encode('utf-8'))


That will tell you how many bytes are used for that example.

This is actually wrong.

ord()'s arguments must be a character for which we expect its ordinal value.

 chr(16474)
'䁚'

Some Chinese symbol.
So code-point '䁚' has a Unicode ordinal value of 16474, correct?

where in after encoding this glyph's ordinal value to binary gives us 
the following bytes:


 bin(16474).encode('utf-8')
b'0b100000001011010'

Now, we take two symbols out:

'b' symbolism which is there to tell us that we are looking a bytes 
object as well as the
'0b' symbolism which is there to tell us that we are looking a binary 
representation of a bytes object


Thus, there we count 15 bits left.
So it says 15 bits, which is 1 bit less than 2 bytes.
Is the above statements correct please?


but thinking this through more and more:

 chr(16474).encode('utf-8')
b'\xe4\x81\x9a'
 len(b'\xe4\x81\x9a')
3

it seems that the bytestring the encode process produces is of length 3.

So i take it is 3 bytes?

but there is a mismatch of what  bin(16474).encode('utf-8') and  
chr(16474).encode('utf-8') is telling us here.


Care to explain that too please ?








Re: A few questiosn about encoding

2013-06-13 Thread Νικόλαος Κούρας

On 12/6/2013 11:30 μμ, Nobody wrote:

On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:


So, how many bytes does UTF-8 store for codepoints > 127 ?


U+0000..U+007F    1 byte
U+0080..U+07FF    2 bytes
U+0800..U+FFFF    3 bytes
>=U+10000         4 bytes
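The ranges can be checked against Python itself. This is a sketch assuming Python 3; utf8_length is a hypothetical helper written for the example, using the standard UTF-8 boundaries (0x7F, 0x7FF, 0xFFFF).

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Compare the range rule with what the real encoder produces.
for cp in (65, 256, 1345, 16474, 0x10000):
    encoded = chr(cp).encode('utf-8')
    print(cp, utf8_length(cp), len(encoded))
```

Each line prints the code point followed by the same byte count twice, once from the range rule and once from the actual encoding.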


'U' stands for Unicode code-point which means a character right?

How can you be able to tell up to what character utf-8 needs 1 byte or 2 
bytes or 3?



And some of the bytes' bits are used to tell where a code-points 
representations stops, right?  i mean if we have a code-point that needs 
2 bytes to be stored that the high bit must be set to 1 to signify that 
this character's encoding stops at 2 bytes.


I just know that 2^8 = 256, that's by first look 256 places, which mean 
256 positions to hold a code-point which in turn means a character.


We take the high bit out and then we have 2^7 which is enough positions 
for 0-127 standard ASCII. High bit is set to '0' to signify that char is 
encoded in 1 byte.


Please tell me that i understood correct so far.

But how about for 2 or 3 or 4 bytes?

Am i saying it correct ?
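For what it is worth, the marker bits can be inspected directly. In UTF-8 the leading bits of the first byte give the sequence length (0xxxxxxx for 1 byte, 110xxxxx for 2, 1110xxxx for 3, 11110xxx for 4) and every following byte starts with 10. A sketch for the 3-byte case, assuming Python 3:

```python
encoded = chr(16474).encode('utf-8')
for byte in encoded:
    print(format(byte, '08b'))
# 11100100   leading 1110 -> first byte of a 3-byte sequence
# 10000001   leading 10   -> continuation byte
# 10011010   leading 10   -> continuation byte

# Strip the marker bits and glue the payload bits back together.
b0, b1, b2 = encoded
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
print(code_point)   # 16474

# A 1-byte (ASCII) character has the high bit clear.
print(format(ord('A'), '08b'))   # 01000001
```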





Re: A few questiosn about encoding

2013-06-13 Thread jmfauth

--

UTF-8, Unicode (consortium): 1 to 4 *Unicode Transformation Unit*

UTF-8, ISO 10646: 1 to 6 *Unicode Transformation Unit*

(still actual, unless tealy freshly modified)

jmf



Re: A few questiosn about encoding

2013-06-13 Thread Chris Angelico
On Thu, Jun 13, 2013 at 4:21 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
 How can you be able to tell up to what character utf-8 needs 1 byte or 2
 bytes or 3?

You look up Wikipedia, using the handy links that have been put to you
MULTIPLE TIMES.

ChrisA


Re: A few questiosn about encoding

2013-06-13 Thread Steven D'Aprano
On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:

 On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:

 Open an interactive Python session, and run this code:

 c = ord(16474)
 len(c.encode('utf-8'))


 That will tell you how many bytes are used for that example.
 This is actually wrong.
 
 ord()'s arguments must be a character for which we expect its ordinal
 value.

Gah! 

That's twice I've screwed that up. Sorry about that!


   chr(16474)
 '䁚'
 
 Some Chinese symbol.
 So code-point '䁚' has a Unicode ordinal value of 16474, correct?

Correct.

 
 where in after encoding this glyph's ordinal value to binary gives us
 the following bytes:
 
   bin(16474).encode('utf-8')
 b'0b100000001011010'

No! That creates a string from 16474 in base two:

'0b100000001011010'

The leading 0b is just syntax to tell you this is base 2, not base 8 
(0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.

Then you encode the string '0b100000001011010' into UTF-8. There are 17 
characters in this string, and they are all ASCII characters so they take 
up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form). In 
hex form, they are:

b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'

which takes up a lot more room, which is why Python prefers to show ASCII 
characters as characters rather than as hex.

What you want is:

chr(16474).encode('utf-8')


[...]
 Thus, there we count 15 bits left.
 So it says 15 bits, which is 1 bit less than 2 bytes. Is the above
 statements correct please?

No. There are 17 BYTES there. The character "0" doesn't get turned into a 
single bit. It still takes up a full byte, 0x30, which is 8 bits.


 but thinking this through more and more:
 
   chr(16474).encode('utf-8')
 b'\xe4\x81\x9a'
   len(b'\xe4\x81\x9a')
 3
 
 it seems that the bytestring the encode process produces is of length 3.

Correct! Now you have got the right idea.
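The two results can be compared side by side (Python 3 assumed; bytes.hex() needs Python 3.5 or later):

```python
text = bin(16474)

print(len(text))                         # 17 characters in the string
print(len(text.encode('utf-8')))         # 17 bytes, one per ASCII character
print(len(chr(16474).encode('utf-8')))   # 3 bytes for the character itself

# The 17 bytes written out as hex digits.
print(text.encode('utf-8').hex())
```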




-- 
Steven


Re: A few questiosn about encoding

2013-06-13 Thread Νικόλαος Κούρας

On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:


   chr(16474)
'䁚'

Some Chinese symbol.
So code-point '䁚' has a Unicode ordinal value of 16474, correct?


Correct.



where in after encoding this glyph's ordinal value to binary gives us
the following bytes:

   bin(16474).encode('utf-8')
b'0b100000001011010'


An observations here that you please confirm as valid.

1. A code-point and the code-point's ordinal value are associated into a 
Unicode charset. They have the so called 1:1 mapping.


So, i was under the impression that by encoding the code-point into 
utf-8 was the same as encoding the code-point's ordinal value into utf-8.


That is why i tried to:
bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')

So, now i believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its 
ordinal value.




The leading 0b is just syntax to tell you this is base 2, not base 8
(0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.


But byte objects are represented as '\x' instead of the aforementioned 
'0x'. Why is that?



 No! That creates a string from 16474 in base two:
 '0b100000001011010'

I disagree here.
16474 is a number in base 10. Doing bin(16474) we get the binary 
representation of number 16474 and not a string.

Why you say we receive a string while python presents a binary number?



Then you encode the string '0b100000001011010' into UTF-8. There are 17
characters in this string, and they are all ASCII characters so they take
up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).


0b100000001011010 stands for a number in base 2 for me not as a string.
Have i understood something wrong?




Re: A few questiosn about encoding

2013-06-13 Thread Chris Angelico
On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
 On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
 No! That creates a string from 16474 in base two:
 '0b100000001011010'

 I disagree here.
 16474 is a number in base 10. Doing bin(16474) we get the binary
 representation of number 16474 and not a string.
 Why you say we receive a string while python presents a binary number?

You can disagree all you like. Steven cited a simple point of fact,
one which can be verified in any Python interpreter. Nikos, you are
flat wrong here; bin(16474) creates a string.
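This is easy to verify in an interpreter (Python 3 assumed). The unquoted 0b... form is an int literal, while bin() hands back text:

```python
b = bin(16474)

print(type(b).__name__)             # str
print(int(b, 2))                    # 16474 (int() accepts the 0b prefix when base is 2)
print(0b100000001011010)            # 16474
print(0b100000001011010 == 16474)   # True
print(format(16474, 'b'))           # 100000001011010 (same digits, no prefix)
```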

ChrisA


Re: A few questiosn about encoding

2013-06-13 Thread Νικόλαος Κούρας

On 13/6/2013 10:58 πμ, Chris Angelico wrote:

On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:

On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:

No! That creates a string from 16474 in base two:
'0b100000001011010'


I disagree here.
16474 is a number in base 10. Doing bin(16474) we get the binary
representation of number 16474 and not a string.
Why you say we receive a string while python presents a binary number?


You can disagree all you like. Steven cited a simple point of fact,
one which can be verified in any Python interpreter. Nikos, you are
flat wrong here; bin(16474) creates a string.


Indeed python embraced it in single quoting '0b100000001011010' and not 
as 0b100000001011010 which in fact makes it a string.


But since bin(16474) seems to create a string rather than an expected 
number(at least into my mind) then how do we get the binary 
representation of the number 16474 as a number?



Re: A few questiosn about encoding

2013-06-13 Thread Chris Angelico
On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
 On 13/6/2013 10:58 πμ, Chris Angelico wrote:

 On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας supp...@superhost.gr
 wrote:

 On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:

 No! That creates a string from 16474 in base two:
 '0b100000001011010'


 I disagree here.
 16474 is a number in base 10. Doing bin(16474) we get the binary
 representation of number 16474 and not a string.
 Why you say we receive a string while python presents a binary number?


 You can disagree all you like. Steven cited a simple point of fact,
 one which can be verified in any Python interpreter. Nikos, you are
 flat wrong here; bin(16474) creates a string.


 Indeed, Python wrapped it in single quotes as '0b100000001011010' and not as
 0b100000001011010, which in fact makes it a string.

 But since bin(16474) seems to create a string rather than the number I
 expected (at least in my mind), how do we get the binary representation of
 the number 16474 as a number?

In Python 2:
 16474

In Python 3, you have to fiddle around with ctypes, but broadly
speaking, the same thing.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-13 Thread Νικόλαος Κούρας

On 13/6/2013 11:20 πμ, Chris Angelico wrote:

On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:

On 13/6/2013 10:58 πμ, Chris Angelico wrote:


On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας supp...@superhost.gr
wrote:


On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:


No! That creates a string from 16474 in base two:
'0b100000001011010'



I disagree here.
16474 is a number in base 10. Doing bin(16474) we get the binary
representation of number 16474 and not a string.
Why you say we receive a string while python presents a binary number?



You can disagree all you like. Steven cited a simple point of fact,
one which can be verified in any Python interpreter. Nikos, you are
flat wrong here; bin(16474) creates a string.



Indeed, Python wrapped it in single quotes as '0b100000001011010' and not as
0b100000001011010, which in fact makes it a string.

But since bin(16474) seems to create a string rather than the number I
expected (at least in my mind), how do we get the binary representation of
the number 16474 as a number?


In Python 2:

16474
Typing 16474 in an interactive session in both Python 2 and 3 gives back 
the number 16474,


while we want the binary representation of the number 16474.


--
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-13 Thread Nobody
On Thu, 13 Jun 2013 12:01:55 +1000, Chris Angelico wrote:

 On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
 that's not UTF-8, that's UTF-8-plus-extra-codepoints.
 
 And a proper UTF-8 decoder will reject \xC0\x80 and \xed\xa0\x80, even
 though mathematically they would translate into U+0000 and U+D800
 respectively. The UTF-16 *mechanism* is limited to no more than Unicode
 has currently used, but I'm left wondering if that's actually the other
 way around - that Unicode planes were deemed to stop at the point where
 UTF-16 can't encode any more.

Indeed. 5-byte and 6-byte sequences were originally part of the UTF-8
specification, allowing for 31 bits. Later revisions of the standard
imposed the UTF-16 limit on Unicode as a whole.
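
The 31-bit figure can be checked with a little arithmetic. This is only a sketch of the bit layout of the original scheme as described above; the function name payload_bits is made up for illustration and is not part of any library.

```python
def payload_bits(n_bytes):
    """Bits left for the code point in an n-byte sequence of the original scheme."""
    if n_bytes == 1:
        return 7                      # 0xxxxxxx
    # The lead byte spends n one-bits plus a zero-bit on the length flag.
    lead = 8 - (n_bytes + 1)
    # Every continuation byte is 10xxxxxx, so it carries 6 bits.
    return lead + 6 * (n_bytes - 1)

table = {n: payload_bits(n) for n in range(1, 7)}
print(table)
```

Four bytes give 21 bits, which is enough for everything up to U+10FFFF; six bytes give the 31 bits mentioned above.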

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-13 Thread Steven D'Aprano
On Thu, 13 Jun 2013 12:41:41 +0300, Νικόλαος Κούρας wrote:

 In Python 2:
 16474
 Typing 16474 in an interactive session in both Python 2 and 3 gives back
 the number 16474,
 
 while we want the binary representation of the number 16474.

Python does not work that way. Ints *always* display in decimal, 
regardless of whether you enter the int in binary:

py 0b100000001011010
16474


octal:

py 0o40132
16474


or hexadecimal:

py 0x405A
16474


ints always display in decimal. The only way to display in another base 
is to build a string showing what the int would look like in a different 
base:

py hex(16474)
'0x405a'

Notice that the return values of bin, oct and hex are all strings. If they 
were ints, then they would display in decimal, defeating the purpose!
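
A short sketch of the same point, assuming Python 3 (the variable names are illustrative only): the three literal spellings are one and the same int, while bin(), oct() and hex() hand back str objects.

```python
# One int, written with three different literal syntaxes.
n = 16474
assert n == 0b100000001011010 == 0o40132 == 0x405A

# bin(), oct() and hex() build strings; they do not create a new kind of number.
as_binary = bin(n)
print(type(n).__name__, type(as_binary).__name__)
print(as_binary, oct(n), hex(n))

# format() produces the same digits without the 0b/0o/0x prefix.
digits = format(n, 'b')
print(digits, format(n, 'o'), format(n, 'x'))
```

Converting back with int(digits, 2) recovers the original number, which is a quick way to convince yourself that only the spelling changed.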


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-13 Thread Νικόλαος Κούρας

On 13/6/2013 2:49 μμ, Steven D'Aprano wrote:

Please confirm whether these statements are true:

A code-point and the code-point's ordinal value are associated in the 
Unicode charset. They have the so-called 1:1 mapping.


So, I was under the impression that encoding the code-point into 
utf-8 was the same as encoding the code-point's ordinal value into utf-8.


So, now I believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its 
ordinal value.



 The leading 0b is just syntax to tell you this is base 2, not base 8
 (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.

But byte objects are represented as '\x' instead of the aforementioned 
'0x'. Why is that?



ints always display in decimal. The only way to display in another base
is to build a string showing what the int would look like in a different
base:

py hex(16474)
'0x405a'

Notice that the return value of bin, oct and hex are all strings. If they
were ints, then they would display in decimal, defeating the purpose!


Thank you, I didn't know that! Indeed, it works like this.

To encode a number we have to turn it into a string first.

'16474'.encode('utf-8')
b'16474'

That 'b' stands for bytes.
How can I view this bytes object's representation as hex() or as bin()?


Also:
 len('0b100000001011010')
17

You said this string consists of 17 chars.
Why does the leading '0b' syntax count as bits as well? Shouldn't it be 15 
bits instead of 17?




--
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-13 Thread Cameron Simpson
On 13Jun2013 17:19, Nikos as SuperHost Support supp...@superhost.gr wrote:
| A code-point and the code-point's ordinal value are associated into
| a Unicode charset. They have the so called 1:1 mapping.
| 
| So, i was under the impression that by encoding the code-point into
| utf-8 was the same as encoding the code-point's ordinal value into
| utf-8.
| 
| So, now i believe they are two different things.
| The code-point *is what actually* needs to be encoded and *not* its
| ordinal value.

Because there is a 1:1 mapping, these are the same thing: a code
point is directly _represented_ by the ordinal value, and the ordinal
value is encoded for storage as bytes.
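
A minimal illustration of that 1:1 mapping, assuming Python 3; the character chosen here is just an example.

```python
# A code point and its ordinal are the same number seen from two sides.
ch = '\u03b1'            # GREEK SMALL LETTER ALPHA
cp = ord(ch)             # character -> ordinal (an int)
assert cp == 0x03B1 == 945
assert chr(cp) == ch     # ordinal -> character

# Encoding turns that one number into bytes for storage.
encoded = ch.encode('utf-8')
print(cp, encoded)
assert encoded.decode('utf-8') == ch
```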

|  The leading 0b is just syntax to tell you this is base 2, not base 8
|  (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.
| 
| But byte objects are represented as '\x' instead of the
| aforementioned '0x'. Why is that?

You're confusing a string representation of a single number in
some base (eg 2 or 16) with the string-ish representation of a
bytes object.

The former is just notation for writing a number in different bases, eg:

  27      base 10
  1b      base 16
  33      base 8
  11011   base 2

A common convention, and the one used by hex(), oct() and bin() in
Python, is to prefix the non-base-10 representations with 0x for
base 16, 0o for base 8 (octal) and 0b for base 2 (binary):

  27
  0x1b
  0o33
  0b11011

This allows the human reader or a machine lexer to decide what base
the number is written in, and therefore to figure out what the
underlying numeric value is.
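
As a sketch of how a machine reader uses those prefixes (assuming Python 3): passing base 0 to int() asks it to pick the base from the prefix, much as the Python lexer does for literals.

```python
# Four spellings of the same value.
spellings = ['27', '0x1b', '0o33', '0b11011']

# Base 0 means "work the base out from the prefix".
values = [int(text, 0) for text in spellings]
print(values)

# Without a prefix the base has to be supplied explicitly.
assert int('1b', 16) == int('33', 8) == int('11011', 2) == 27
```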

Conversely, consider the bytes object consisting of the values [97,
98, 99, 27, 10]. In ASCII (and UTF-8 and the iso-8859-x encodings)
these may all represent the characters ['a', 'b', 'c', ESC, NL].
So when printing a bytes object, which is a sequence of small integers
representing values stored in bytes, it is compact to print:

  b'abc\x1b\n'

which is ['a', 'b', 'c', chr(27), newline].

The slosh (\) is the common convention in C-like languages and many
others for representing special characters not directly represented
by themselves. So \\ for a slosh, \n for a newline and \x1b
for character 27 (ESC).

The bytes object is still just a sequence of integers, but because
it is very common to have those integers represent text, and very
common to have some text one wants represented as bytes in a direct
1:1 mapping, this compact text form is useful and readable. It is
also legal Python syntax for making a small bytes object.

To demonstrate that this is just a _representation_, run this:

   [ i for i in b'abc\x1b\n' ]
  [97, 98, 99, 27, 10]

at an interactive Python 3 prompt. See? Just numbers.

| To encode a number we have to turn it into a string first.
| 
| '16474'.encode('utf-8')
| b'16474'
| 
| That 'b' stand for bytes.

Syntactic details. Read this:
  
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

| How can i view this byte's object representation as hex() or as bin()?

See above. A bytes is a _sequence_ of values. hex() and bin() print
individual values in hexadecimal or binary respectively. You could
do this:

  for value in b'16474':
      print(value, hex(value), bin(value))
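
A slightly fuller sketch of the same loop, assuming Python 3.5 or later for bytes.hex(); the names data and rows are only for illustration.

```python
data = '16474'.encode('utf-8')   # five ASCII digits, one byte each

# Each element of a bytes object is an int in range(256).
rows = [(value, hex(value), format(value, '08b')) for value in data]
for row in rows:
    print(*row)

# bytes.hex() shows the whole object as one hex string.
print(data.hex())
```

Note that these are the byte values of the characters '1', '6', '4', '7', '4', not the binary form of the number 16474.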

Cheers,
-- 
Cameron Simpson c...@zip.com.au

Uhlmann's Razor: When stupidity is a sufficient explanation, there is no need
 to have recourse to any other.
- Michael M. Uhlmann, assistant attorney general
  for legislation in the Ford Administration
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-13 Thread Nick the Gr33k

On 14/6/2013 1:46 πμ, Dennis Lee Bieber wrote:

On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), Νικόλαος Κούρας
supp...@superhost.gr declaimed the following:


(*) infact UTF8 also indicates the end of each character



Up to a point.  The initial byte encodes the length and the top few
bits, but the subsequent octets aren’t distinguishable as final in
isolation.  0x80-0xBF can all be either medial or final.



So, the first high-bits are a directive that UTF-8 uses to know how many
bytes each character is being represented as.

0-127 codepoints(characters) use 1 bit to signify they need 1 bit for
storage and the rest 7 bits to actually store the character ?


Not quite... The leading bit is a 0 - which means 0..127 are sent
as-is, no manipulation.


So, in utf-8, the leading bit which is a zero 0, its actually a flag to 
tell that the code-point needs 1 byte to be stored and the rest 7 bits 
is for the actual value of 0-127 code-points ?



128-256 codepoints(characters) use 2 bit to signify they need 2 bits for
storage and the rest 14 bits to actually store the character ?


128..255 -- in what encoding? These all have the leading bit with a
value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
inherent in the specified encoding and they are sent as-is.


So, latin-iso or greek-iso, the leading 0 is not a flag like it is in 
utf-8 encoding because latin-iso and greek-iso and all *-iso use all 8 
bits for storage?


But, in utf-8, the leading bit, which is 1, is to tell that the 
code-point needs 2 byte to be stored and the rest 7 bits is for the 
actual value of 128-255 code-points ?


But why 2 bytes? leading 1 is a flag and the rest 7 bits can hold the 
encoded value.


But that is not the case, since we know that utf-8 needs 2 bytes to store 
code-points 128-255




1110 starts a three byte sequence, 11110 starts a four byte sequence...
Basically, count the number of leading 1-bits before a 0 bit, and that
tells you how many bytes are in the multi-byte sequence -- and all bytes
that start with 10 are supposed to be the continuations of a multibyte set
(and not a signal that this is a 1-byte entry -- those only have a leading
0)


Why doesn't it work like this?

leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag

Wouldn't it be more logical?



Original UTF-8 allowed for 31-bits to specify a character in the Unicode
set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
continuation.


utf8 6 bytes = 48 bits - 7 bits (from the first byte) - 2 bits (for each 
continuation) * 5 = 48 - 7 - 10 = 31 bits indeed to store the actual 
code-point. But 2^31 is still a huge number to store any kind of 
character, isn't it?






--
What is now proved was at first only imagined!
--
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Νικόλαος Κούρας
 (*) infact UTF8 also indicates the end of each character

 Up to a point.  The initial byte encodes the length and the top few
 bits, but the subsequent octets aren’t distinguishable as final in
 isolation.  0x80-0xBF can all be either medial or final.


So, the first high-bits are a directive that UTF-8 uses to know how many 
bytes each character is being represented as.

0-127 codepoints(characters) use 1 bit to signify they need 1 bit for 
storage and the rest 7 bits to actually store the character ?

while

128-256 codepoints(characters) use 2 bit to signify they need 2 bits for 
storage and the rest 14 bits to actually store the character ?

Isn't 14 bits way too many to store a character? 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Steven D'Aprano
On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:

 Isn't 14 bits way to many to store a character ?

No.

There are 1114111 possible characters in Unicode. (And in Japan, they 
sometimes use TRON instead of Unicode, which has even more.)

If you list out all the combinations of 14 bits:

   0000 0000 0000 00
   0000 0000 0000 01
   0000 0000 0000 10
   0000 0000 0000 11
[...]
   1111 1111 1111 10
   1111 1111 1111 11

you will see that there are only 32767 (2**15-1) such values. You can't 
fit 1114111 characters with just 32767 values.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Νικόλαος Κούρας

On 12/6/2013 12:24 μμ, Steven D'Aprano wrote:

On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:


Isn't 14 bits way to many to store a character ?


No.

There are 1114111 possible characters in Unicode. (And in Japan, they
sometimes use TRON instead of Unicode, which has even more.)

If you list out all the combinations of 14 bits:

   0000 0000 0000 00
   0000 0000 0000 01
   0000 0000 0000 10
   0000 0000 0000 11
[...]
   1111 1111 1111 10
   1111 1111 1111 11

you will see that there are only 32767 (2**15-1) such values. You can't
fit 1114111 characters with just 32767 values.




Thanks Steven,
So, how many bytes does UTF-8 store for codepoints > 127?

For example, for codepoints 256, 1345, 16474?
--
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Dave Angel

On 06/12/2013 05:24 AM, Steven D'Aprano wrote:

On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:


Isn't 14 bits way to many to store a character ?


No.

There are 1114111 possible characters in Unicode. (And in Japan, they
sometimes use TRON instead of Unicode, which has even more.)

If you list out all the combinations of 14 bits:

   0000 0000 0000 00
   0000 0000 0000 01
   0000 0000 0000 10
   0000 0000 0000 11
[...]
   1111 1111 1111 10
   1111 1111 1111 11

you will see that there are only 32767 (2**15-1) such values. You can't
fit 1114111 characters with just 32767 values.




Actually, it's worse.  There are 16384 such values (2**14), assuming you 
include null, which you did in your list.
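
The counts are easy to check at the interpreter. A small sketch, assuming Python 3 and taking U+10FFFF as the highest code point:

```python
values_in_14_bits = 2 ** 14            # every pattern from all-zeros to all-ones
code_points = 0x10FFFF + 1             # U+0000 through U+10FFFF inclusive
bits_needed = (0x10FFFF).bit_length()  # bits required for the largest code point

print(values_in_14_bits, code_points, bits_needed)
assert values_in_14_bits < code_points
```

So 14 bits fall well short; it takes 21 bits to number every code point.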


--
DaveA
--
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Ulrich Eckhardt

Am 12.06.2013 13:23, schrieb Νικόλαος Κούρας:

So, how many bytes does UTF-8 store for codepoints > 127?


What has your research turned up? I personally consider it lazy and 
disrespectful to get lots of pointers that you could use for further 
research and ask for more info before you have even followed these links.




example for codepoint 256, 1345, 16474 ?


Yes, examples exist. Gee, if there only was an information network that 
you could access and where you could locate information on various 
programming-related topics somehow. Seriously, someone should invent 
this thing! But still, even without it, you have all the tools (i.e. 
Python) in your hand to generate these examples yourself! Check out ord, 
bin, encode, decode for a start.



Uli

--
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Nobody
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

 So, how many bytes does UTF-8 store for codepoints > 127?

U+0000..U+007F  1 byte
U+0080..U+07FF  2 bytes
U+0800..U+FFFF  3 bytes
>=U+10000       4 bytes

So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic,
Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead
languages and mathematical symbols.

The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits).
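
The table above can be written as a few comparisons. This is just a sketch (utf8_length is a made-up helper, not a library function), checked against what Python 3's own encoder produces for a handful of sample code points.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a code point (surrogates not considered)."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Latin 'A', Greek alpha, a common CJK ideograph, a mathematical bold letter.
for cp in (0x41, 0x3B1, 0x4E2D, 0x1D401):
    assert utf8_length(cp) == len(chr(cp).encode('utf-8'))
    print(hex(cp), utf8_length(cp))
```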

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Steven D'Aprano
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

 So, how many bytes does UTF-8 store for codepoints > 127?

Two, three or four, depending on the codepoint.


 example for codepoint 256, 1345, 16474 ?

You can do this yourself. I have already given you enough information in 
previous emails to answer this question on your own, but here it is again:

Open an interactive Python session, and run this code:

c = chr(16474)
len(c.encode('utf-8'))


That will tell you how many bytes are used for that example.
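
For all three code points asked about, a sketch along the same lines (assuming Python 3, where chr() turns a number into the one-character string for that code point):

```python
sizes = {}
for cp in (256, 1345, 16474):
    encoded = chr(cp).encode('utf-8')
    sizes[cp] = len(encoded)
    print(cp, hex(cp), len(encoded), encoded)
```

The first two fall in the 2-byte range (up to 0x7FF) and the third needs 3 bytes.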



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Steven D'Aprano
On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:

 The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
 total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
 20 bits).

Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because 
that is what Unicode is limited to.

The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but 
that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the 
mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you 
don't have Unicode chars any more, and hence your byte-string is not 
valid UTF-32:

py b = b'\xFF'*8
py b.decode('UTF-32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: 
codepoint not in range(0x110000)


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-12 Thread Chris Angelico
On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
 that's not UTF-8, that's UTF-8-plus-extra-codepoints.

And a proper UTF-8 decoder will reject \xC0\x80 and \xed\xa0\x80,
even though mathematically they would translate into U+0000 and U+D800
respectively. The UTF-16 *mechanism* is limited to no more than
Unicode has currently used, but I'm left wondering if that's actually
the other way around - that Unicode planes were deemed to stop at the
point where UTF-16 can't encode any more. Not that it matters; with
most of the current planes completely unallocated, it seems unlikely
we'll be needing more.
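
Python 3's own decoder can be used to see this. A small sketch; is_valid_utf8 is a hypothetical helper written for this example, relying on the default strict error handling of bytes.decode().

```python
def is_valid_utf8(raw):
    """True if the bytes decode cleanly under strict UTF-8 rules."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

samples = (b'\x00', b'\xc0\x80', b'\xed\xa0\x80', b'\xce\xb1')
results = {raw: is_valid_utf8(raw) for raw in samples}
for raw, ok in results.items():
    print(raw, ok)
```

The overlong form of U+0000 and the encoded surrogate U+D800 are both refused, while the plain NUL byte and the two-byte alpha are accepted.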

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-09 Thread Fábio Santos
On 9 Jun 2013 11:49, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:

 A few questiosn about encoding please:

  Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
  values up to 256?

 Because then how do you tell when you need one byte, and when you need
 two? If you read two bytes, and see 0x4C 0xFA, does that mean two
 characters, with ordinal values 0x4C and 0xFA, or one character with
 ordinal value 0x4CFA?

 I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
up to 256, not above 256.


  UTF-8 and UTF-16 and UTF-32
  I though the number beside of UTF- was to declare how many bits the
  character set was using to store a character into the hdd, no?

 Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
 UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
 values to make a surrogate pair.

 A surrogate pair is like hitting for example Ctrl-A, which means it is a
combination character that consists of 2 different characters?
 Is this what a surrogate is? A pair of 2 chars?


 UTF-8 uses 8-bit values, but sometimes
 it combines two, three or four of them to represent a single code-point.

 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 97)
 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ?
(since ordinal > 65000 )

 The amount of bytes needed to store a character solely depends on the
character's ordinal value in the Unicode table?
 --
 http://mail.python.org/mailman/listinfo/python-list

In short, a utf-8 character takes 1 to 4 bytes. A utf-16 character takes 2
or 4 bytes. A utf-32 character always takes 4 bytes.

The process of converting characters to bytes is called encoding. The
opposite is decoding. This is all made transparent in python with the
encode() and decode() methods. You normally don't care about this kind of
thing.
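
A quick round trip, as a sketch assuming Python 3; the sample text and variable names are arbitrary.

```python
text = 'na\u00efve \u03b1'       # 7 characters, two of them non-ASCII
raw = text.encode('utf-8')       # str -> bytes (encoding)
back = raw.decode('utf-8')       # bytes -> str (decoding)
print(len(text), len(raw))
assert back == text

# Bytes used for 'a' and for a character outside the BMP in each encoding.
sizes = {
    name: (len('a'.encode(name)), len('\U0001D401'.encode(name)))
    for name in ('utf-8', 'utf-16-le', 'utf-32-le')
}
print(sizes)
```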
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-09 Thread Nobody
On Sun, 09 Jun 2013 03:44:57 -0700, Νικόλαος Κούρας wrote:

 Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
 values up to 256? 
 
Because then how do you tell when you need one byte, and when you need 
two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
characters, with ordinal values 0x4C and 0xFA, or one character with 
ordinal value 0x4CFA? 
 
 I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
 meant up to 256, not above 256.

But then you've used up all 256 possible bytes for storing the first 256
characters, and there aren't any left for use in multi-byte sequences.

You need some means to distinguish between a single-byte character and an
individual byte within a multi-byte sequence.

UTF-8 does that by allocating specific ranges to specific purposes.
0x00-0x7F are single-byte characters, 0x80-0xBF are continuation bytes of
multi-byte sequences, 0xC0-0xFF are leading bytes of multi-byte sequences.
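
Those three ranges translate directly into a classifier. A rough sketch, assuming Python 3; classify is a made-up name, and it follows the ranges exactly as stated above, ignoring the fact that a few values in 0xC0-0xFF never occur in valid UTF-8.

```python
def classify(byte):
    """Say what role a single byte value plays in a UTF-8 stream."""
    if byte <= 0x7F:
        return 'single'
    if byte <= 0xBF:
        return 'continuation'
    return 'lead'

# 'a' (1 byte), Greek alpha (2 bytes), a CJK ideograph (3 bytes).
kinds = [classify(b) for b in 'a\u03b1\u4e2d'.encode('utf-8')]
print(kinds)
```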

This scheme has the advantage of making UTF-8 non-modal, i.e. if a byte is
corrupted, added or removed, it will only affect the character containing
that particular byte; the decoder can re-synchronise at the beginning of
the following character.

OTOH, with encodings such as UTF-16, UTF-32 or ISO-2022, adding or
removing a byte will result in desynchronisation, with all subsequent
characters being corrupted.
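
The difference shows up if one byte is dropped from the front of some encoded text. A sketch assuming Python 3, using errors='replace' so the decoder substitutes U+FFFD instead of raising:

```python
text = '\u03b1\u03b2\u03b3\u03b4'               # four Greek letters

utf8_damaged = text.encode('utf-8')[1:]         # lose the first byte
utf16_damaged = text.encode('utf-16-be')[1:]

recovered_utf8 = utf8_damaged.decode('utf-8', errors='replace')
recovered_utf16 = utf16_damaged.decode('utf-16-be', errors='replace')

# UTF-8 loses one character and then picks up again.
print(ascii(recovered_utf8))
# UTF-16 stays misaligned, so none of the original letters survive.
print(ascii(recovered_utf16))
```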

 A surrogate pair is like hitting for example Ctrl-A, which means it is a
 combination character that consists of 2 different characters? Is this
 what a surrogate is? A pair of 2 chars?

A surrogate pair is a pair of 16-bit codes used to represent a single
Unicode character whose code is greater than 0xFFFF.

The 2048 codepoints from 0xD800 to 0xDFFF inclusive aren't used to
represent characters, but surrogates. Unicode characters with codes
in the range 0x10000-0x10FFFF are represented in UTF-16 as a pair of
surrogates. First, 0x10000 is subtracted from the code, giving a value in
the range 0-0xFFFFF (20 bits). The top ten bits are added to 0xD800 to
give a value in the range 0xD800-0xDBFF, while the bottom ten bits are
added to 0xDC00 to give a value in the range 0xDC00-0xDFFF.

Because the codes used for surrogates aren't valid as individual
characters, scanning a string for a particular character won't
accidentally match part of a multi-word character.
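
The surrogate arithmetic can be sketched in a few lines of Python 3. The helper name to_surrogates is invented for this example; the result is compared with what the built-in utf-16-be codec produces.

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top ten bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom ten bits
    return high, low

high, low = to_surrogates(0x1D401)         # MATHEMATICAL BOLD CAPITAL B
print(hex(high), hex(low))

# The codec agrees: two 16-bit units, high surrogate first.
expected = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
assert chr(0x1D401).encode('utf-16-be') == expected
```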

 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 97)
 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is
 > 127 ) 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be
 stored ? (since ordinal > 65000 )

Most Chinese, Japanese and Korean (CJK) characters have codepoints within
the BMP (i.e. <= 0xFFFF), so they only require 3 bytes in UTF-8. The
codepoints above the BMP are mostly for archaic ideographs (those no
longer in normal use), mathematical symbols, dead languages, etc.

 The amount of bytes needed to store a character solely depends on the
 character's ordinal value in the Unicode table?

Yes. UTF-8 is essentially a mechanism for representing 31-bit unsigned
integers such that smaller integers require fewer bytes than larger
integers (subsequent revisions of Unicode cap the range of possible
codepoints to 0x10FFFF, as that's all that UTF-16 can handle).

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A few questiosn about encoding

2013-06-09 Thread Chris “Kwpolska” Warrick
On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
 A few questiosn about encoding please:

 Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
 values up to 256?

Because then how do you tell when you need one byte, and when you need
two? If you read two bytes, and see 0x4C 0xFA, does that mean two
characters, with ordinal values 0x4C and 0xFA, or one character with
ordinal value 0x4CFA?

 I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up 
 to 256, not above 256.

It is required so the computer can know where characters begin.
0x0080 (first non-ASCII character) becomes 0xC280 in UTF-8.  Further
details here: http://en.wikipedia.org/wiki/UTF-8#Description

 UTF-8 and UTF-16 and UTF-32
 I though the number beside of UTF- was to declare how many bits the
 character set was using to store a character into the hdd, no?

Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
values to make a surrogate pair.

 A surrogate pair is like hitting for example Ctrl-A, which means it is a 
 combination character that consists of 2 different characters?
 Is this what a surrogate is? A pair of 2 chars?

http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF

Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit
numbers → 0xD800 + first_half 0xDC00 + second_half.  Rephrasing:

We take MATHEMATICAL BOLD CAPITAL B (U+1D401).  If you have UTF-8: 𝐁

It is over 0xFFFF, and we need to use surrogate pairs.  Subtracting
0x10000, we end up with 0xD401, or 0b1101010000000001.  Both
representations are worthless, as we have a 16-bit number, not a 20-bit
one.  We throw in some leading zeroes and end up with
0b00001101010000000001.  Split it in half and we get 0b0000110101 and
0b0000000001, which we can now shorten to 0b110101 and 0b1, or translate
to hex as 0x0035 and 0x0001.  0xD800 + 0x0035 and 0xDC00 + 0x0001 →
0xD835 0xDC01.  Type it into python and:

 b'\xD8\x35\xDC\x01'.decode('utf-16be')
'𝐁'

And before you ask: that “BE” stands for Big-Endian.  Little-Endian
would mean reversing the bytes in each 16-bit code unit, which would make
it '\x35\xD8\x01\xDC' (the name is based on which end of the number comes
first: 'a', U+0061, is stored as 0x6100 in a little-endian encoding).
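
The two byte orders are easy to compare directly. A small sketch, assuming Python 3 and its standard 'utf-16-be' and 'utf-16-le' codec names:

```python
ch = 'a'                                   # U+0061

big = ch.encode('utf-16-be')               # most significant byte first
little = ch.encode('utf-16-le')            # least significant byte first
print(big, little)

# The same swap applies to each half of a surrogate pair.
bold_b = '\U0001D401'
print(bold_b.encode('utf-16-be').hex(), bold_b.encode('utf-16-le').hex())
```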

Another question you may ask: 0xD800…0xDFFF are reserved in Unicode
for the purposes of UTF-16, so there are no conflicts.

UTF-8 uses 8-bit values, but sometimes
it combines two, three or four of them to represent a single code-point.

 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 97)
 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )

yup.  α is at 0x03B1, or 945 decimal.

 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since 
 ordinal > 65000 )

Not necessarily, as CJK characters start at U+2E80, which is in the
3-byte range (0x0800 through 0xFFFF); the table is here:
http://en.wikipedia.org/wiki/UTF-8#Description

--
Kwpolska http://kwpolska.tk | GPG KEY: 5EAAEA16
stop html mail| always bottom-post
http://asciiribbon.org| http://caliburn.nl/topposting.html
-- 
http://mail.python.org/mailman/listinfo/python-list